Deep learning for speech enhancement applied to radio communications using non-conventional sound capture devices

Capturing a clean voice signal in extremely noisy environments is a challenging task. When traditional microphones fail to achieve a sufficient speech-to-noise ratio, body-conducted microphones offer an alternative that significantly attenuates external noise, at the cost of reduced audio bandwidth. This family of non-conventional sound capture devices, which includes bone vibration pickups, skin accelerometers, and in-ear and throat microphones, captures the body's internal vibrations, which makes these devices highly noise-resilient. However, because the body acts as a low-pass filter, the high-frequency components of speech do not reach the sensors. Physiological noises such as heartbeat, breathing, and swallowing are also recorded and can degrade speech clarity. Together, these effects create the need for a post-processing speech enhancement algorithm.
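To make the low-pass effect described above concrete, the following minimal sketch passes two pure tones through a one-pole low-pass filter and compares how much of each survives. It is only an illustration: the one-pole filter, the 1 kHz cutoff, the 16 kHz sample rate, and the function names are assumptions chosen for the example, not measured properties of any body-conducted sensor.

```python
import math


def one_pole_lowpass(samples, sample_rate, cutoff_hz):
    """Apply a one-pole IIR low-pass filter (a crude stand-in for body conduction)."""
    # Smoothing coefficient derived from the assumed cutoff frequency.
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    filtered, state = [], 0.0
    for x in samples:
        state += alpha * (x - state)
        filtered.append(state)
    return filtered


def rms(samples):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def tone(freq_hz, sample_rate, duration_s=0.5):
    """Generate a unit-amplitude sine tone."""
    n = int(sample_rate * duration_s)
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


SAMPLE_RATE = 16000   # Hz, assumed for illustration
CUTOFF_HZ = 1000.0    # Hz, assumed; real sensors differ


def gain(freq_hz):
    """Ratio of output to input RMS level for a tone at freq_hz."""
    clean = tone(freq_hz, SAMPLE_RATE)
    return rms(one_pole_lowpass(clean, SAMPLE_RATE, CUTOFF_HZ)) / rms(clean)


low_gain = gain(200.0)    # tone well below the cutoff
high_gain = gain(4000.0)  # tone well above the cutoff
print(f"gain at 200 Hz: {low_gain:.2f}, gain at 4000 Hz: {high_gain:.2f}")
```

The low tone passes almost unchanged while the high tone is strongly attenuated, which is the kind of bandwidth loss an enhancement algorithm would need to compensate for.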

This thesis, grounded in deep learning, aims to develop such an algorithm for real-time radio communication systems, with a focus on deployment in resource-constrained environments. Because body-conducted microphones are non-conventional and deep learning relies heavily on data, we first construct a dedicated dataset that supports effective training and testing relevant to the target application, and we make it available to the community. We then design a speech enhancement algorithm tailored to the unique characteristics of these sensors. Our approach processes the signal directly in the relevant frequency range while remaining in the waveform domain, and performs joint bandwidth extension and denoising through adversarial training. Enhancement performance is evaluated from multiple perspectives, including speech quality, intelligibility, and speaker identity preservation. Computational constraints such as latency and memory consumption are also considered to ensure suitability for edge deployment.

Finally, recent advances in neural audio codecs are explored as potential foundation models for the body-conducted speech enhancement task. Their architecture closely aligns with the proposed extreme bandwidth extension network, and they offer highly compressed representations suitable for radio communications.

Keywords: Deep learning, Speech enhancement, Bandwidth extension, Body-conduction microphones, Signal processing, Robust communication