LeCun shares new results from a Nature sub-journal: natural speech synthesized from brain signals, with open-source code available

The latest progress in brain-computer interfaces was published in a Nature sub-journal, and Yann LeCun, one of the three giants of deep learning, shared it.

This time it is speech synthesis from neural signals, helping people with aphasia caused by neurological deficits regain the ability to communicate.


Specifically, a research team from New York University developed a new differentiable speech synthesizer: a lightweight convolutional neural network encodes speech into a series of interpretable speech parameters (such as pitch, loudness, and formant frequencies), and the speech is then resynthesized by the differentiable synthesizer.

By mapping neural signals to these speech parameters, the researchers built a neural speech decoding system that is highly interpretable, works in small-data settings, and produces natural-sounding speech.
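To make this concrete, here is a minimal sketch of the speech-encoding idea, assuming a spectrogram input and a handful of per-frame parameters; the layer sizes, parameter count, and names are illustrative assumptions rather than the authors' actual implementation (the real code is in the linked GitHub repository).

```python
# Minimal sketch of the speech-encoding idea described above
# (illustrative only; shapes and layer sizes are assumptions,
# not the authors' implementation -- see flinkerlab/neural_speech_decoding).
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Lightweight CNN mapping a spectrogram to a few interpretable,
    per-frame speech parameters (pitch, loudness, formants, ...)."""
    def __init__(self, n_freq_bins=128, n_params=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_freq_bins, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(32, n_params, kernel_size=1),
            nn.Sigmoid(),  # keep parameters in a bounded, interpretable range
        )

    def forward(self, spec):      # spec: (batch, n_freq_bins, time)
        return self.net(spec)     # (batch, n_params, time)

# The differentiable synthesizer (sketched later in the article) maps these
# parameters back to a spectrogram, so encoder + synthesizer can be trained
# end-to-end on speech alone with a simple reconstruction loss.
```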

The researchers collected data from a total of 48 subjects and attempted speech decoding, providing verification for future high-accuracy brain-computer interface applications.

The results show that the framework can handle both high and low spatial sampling densities and can process ECoG signals from either the left or the right hemisphere, demonstrating strong potential for speech decoding.


Speech decoding from neural signals is difficult!

Previously, Musk's Neuralink successfully implanted electrodes in a subject's brain, enabling simple cursor control for functions such as typing.

However, neural speech decoding is generally considered to be more complex.

Most attempts to develop neural speech decoders and other high-precision brain-computer interface models rely on a special kind of data: electrocorticography (ECoG) recordings, usually collected from epilepsy patients during treatment.

Electrodes implanted in epilepsy patients record cortical activity during speech production; these data offer high spatial and temporal resolution and have helped researchers achieve a series of remarkable results in speech decoding.

However, speech decoding of neural signals still faces two major challenges:

  • Data for training personalized neural-to-speech decoding models is very limited in duration, usually only about ten minutes per subject, whereas deep learning models typically need large amounts of training data.

  • Human speech is highly variable: even when the same person repeats the same word, the speaking rate, intonation, and pitch change, which adds complexity to the representation space the model must build.

Early attempts to decode neural signals into speech mainly relied on linear models, which generally did not require large training data sets and were highly interpretable, but had low accuracy.

Recently, deep neural networks, especially convolutional and recurrent architectures, have driven many attempts along two key dimensions: the intermediate latent representation used to model speech and the quality of the synthesized speech. For example, some studies decode cortical activity into an articulatory (mouth-movement) space and then convert it into speech; although decoding performance is strong, the reconstructed voice sounds unnatural.

On the other hand, some methods reconstruct natural-sounding speech using WaveNet vocoders, generative adversarial networks (GANs), and similar techniques, but their accuracy is limited.

A recent study published in Nature achieved both accurate and natural-sounding speech in a patient with an implanted device by using quantized HuBERT features as an intermediate representation and a pre-trained speech synthesizer to convert those features into speech waveforms.

However, HuBERT features cannot represent speaker-specific acoustic information and can only generate a fixed, uniform voice, so additional models are needed to convert this generic voice into the patient's own. Furthermore, that study, like most previous attempts, adopted a non-causal architecture, which may limit its use in practical brain-computer interface applications that require temporally causal operation.

Building a Differentiable Speech Synthesizer

The research team from the NYU Video Lab and Flinker Lab introduced a new framework for decoding speech from electrocorticography (ECoG) signals, in which a speech encoding-decoding model trained on speech signals alone constructs a low-dimensional latent representation.

Neural Speech Decoding Framework

Specifically, the framework consists of two parts:

  • One part is the ECoG decoder, which converts ECoG signals into interpretable acoustic speech parameters (such as pitch, voicing, loudness, and formant frequencies);

  • the other part is the speech synthesizer, which converts these speech parameters into spectrograms.

The researchers built a differentiable speech synthesizer, which allows the synthesizer to take part in training: while the ECoG decoder is trained, the two are jointly optimized to reduce the spectrogram reconstruction error.

This low-dimensional latent space is highly interpretable. Combined with a lightweight pre-trained speech encoder that generates reference speech parameters, it lets the researchers build an efficient neural speech decoding framework and overcome the problem of extremely scarce data in this field.
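As a rough illustration of this joint optimization, the sketch below assumes an L1 spectrogram reconstruction loss plus a guidance term that pulls the decoded parameters toward the reference parameters produced by the pre-trained speech encoder; the loss weights, shapes, and module interfaces are assumptions, not the paper's exact training recipe.

```python
# Sketch of the joint optimization described above (hypothetical shapes and
# names; the real training code lives in flinkerlab/neural_speech_decoding).
import torch
import torch.nn as nn

def train_step(ecog_decoder, synthesizer, speech_encoder,
               ecog, target_spec, optimizer, alpha=1.0):
    """One training step: ECoG -> speech parameters -> spectrogram.

    ecog:        (batch, channels, time) neural recording
    target_spec: (batch, freq, time) spectrogram of the simultaneously
                 recorded speech
    """
    pred_params = ecog_decoder(ecog)              # decoded speech parameters
    pred_spec   = synthesizer(pred_params)        # differentiable re-synthesis

    with torch.no_grad():                         # pre-trained, kept fixed
        ref_params = speech_encoder(target_spec)  # reference ("guidance") parameters

    # Spectrogram reconstruction error, plus a guidance term keeping the
    # decoded parameters close to those produced by the speech encoder.
    loss = nn.functional.l1_loss(pred_spec, target_spec) \
         + alpha * nn.functional.l1_loss(pred_params, ref_params)

    optimizer.zero_grad()
    loss.backward()    # gradients flow through the differentiable synthesizer
    optimizer.step()
    return loss.item()
```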

The framework can produce natural speech that is very close to the speaker's own voice, the ECoG decoder can be swapped among different deep learning architectures, and it supports causal operation.

The researchers collected and processed ECoG data from 48 neurosurgical patients, using multiple deep learning architectures (convolutional, recurrent, and Transformer networks) as ECoG decoders.

The framework achieves high accuracy across these models, with the best performance coming from the convolutional (ResNet) architecture. It reaches high accuracy using only causal operations and a relatively low sampling density (low-density grids with 10 mm spacing).

They also demonstrated efficient speech decoding from both the left and right hemispheres of the brain, extending the application of neural speech decoding to the right hemisphere.

Differentiable speech synthesizer architecture

The differentiable speech synthesizer makes the speech re-synthesis task very efficient: from a very compact set of speech parameters it can synthesize high-fidelity audio that matches the original voice.

Its design draws on the human speech production system and divides speech into two components: a voiced part (used to model vowels) and an unvoiced part (used to model consonants).

For the voiced part, harmonics are first generated from the fundamental frequency signal and then passed through a filter built from the F1-F6 formant peaks to obtain the spectral characteristics of the vowel component.

For the unvoiced part, white noise is filtered by a corresponding filter to obtain its spectrum. A learnable parameter controls the mixing ratio of the two components at each time step; the mixture is then scaled by the loudness signal, and background noise is added to obtain the final speech spectrogram.
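The toy, single-frame sketch below illustrates this voiced/unvoiced source-filter idea: a harmonic comb at multiples of f0 and a flat noise spectrum are both shaped by formant resonances, mixed by a single ratio, scaled by loudness, and offset by a noise floor. All numbers and the Gaussian filter shape are illustrative assumptions, not the paper's synthesizer.

```python
# Toy, per-frame illustration of the voiced/unvoiced source-filter idea above.
# Numbers and Gaussian formant filters are illustrative assumptions.
import numpy as np

def frame_spectrum(f0, formants, bandwidths, mix, loudness,
                   noise_floor=1e-4, sr=16000, n_bins=256):
    """Build one spectral frame from interpretable parameters.

    f0        : fundamental frequency in Hz (voiced source)
    formants  : formant center frequencies F1..F6 in Hz
    bandwidths: matching formant bandwidths in Hz
    mix       : voiced/unvoiced mixing ratio in [0, 1]
    loudness  : overall gain
    """
    freqs = np.linspace(0, sr / 2, n_bins)

    # Voiced source: harmonic comb at multiples of f0.
    voiced_src = np.zeros(n_bins)
    for k in range(1, int((sr / 2) // f0) + 1):
        voiced_src += np.exp(-0.5 * ((freqs - k * f0) / 20.0) ** 2)

    # Unvoiced source: flat (white-noise-like) spectrum.
    unvoiced_src = np.ones(n_bins)

    # Formant filter: sum of Gaussian resonances at F1..F6.
    filt = sum(np.exp(-0.5 * ((freqs - fc) / bw) ** 2)
               for fc, bw in zip(formants, bandwidths))

    voiced   = voiced_src * filt
    unvoiced = unvoiced_src * filt
    return loudness * (mix * voiced + (1 - mix) * unvoiced) + noise_floor

spec = frame_spectrum(f0=120, formants=[700, 1200, 2500, 3500, 4500, 5500],
                      bandwidths=[80, 100, 150, 200, 250, 300],
                      mix=0.9, loudness=1.0)
```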

Speech encoder and ECoG decoder

Research results

1. Speech decoding results with temporal causality

First, the researchers directly compared the speech decoding performance of different model architectures: convolutional (ResNet), recurrent (LSTM), and Transformer (3D Swin).

Notably, each of these models can operate either causally or non-causally in time.

The causality of a decoding model has significant implications for brain-computer interface (BCI) applications: a causal model uses only past and current neural signals to generate speech, whereas a non-causal model also uses future neural signals, which is not feasible in real-time applications.
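The difference is easy to see with 1-D convolutions, as in the sketch below (used here purely for illustration; the paper's decoders are full ResNet, LSTM, and 3D Swin architectures): a non-causal convolution pads symmetrically and therefore looks into the future, while a causal one pads only on the left so that frame t depends only on inputs up to t.

```python
# Minimal illustration of causal vs. non-causal temporal processing.
import torch
import torch.nn as nn

x = torch.randn(1, 16, 100)          # (batch, ecog_channels, time)

# Non-causal: symmetric padding, so each output frame sees past AND future input.
noncausal = nn.Conv1d(16, 32, kernel_size=5, padding=2)
y_noncausal = noncausal(x)           # (1, 32, 100)

# Causal: pad only on the left, so output frame t depends only on inputs <= t.
class CausalConv1d(nn.Module):
    def __init__(self, c_in, c_out, k):
        super().__init__()
        self.pad = k - 1
        self.conv = nn.Conv1d(c_in, c_out, kernel_size=k)

    def forward(self, x):
        return self.conv(nn.functional.pad(x, (self.pad, 0)))

y_causal = CausalConv1d(16, 32, 5)(x)  # (1, 32, 100), suitable for real-time use
```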

Therefore, they focused on comparing the performance of the same model when performing non-causal and causal operations.

They found that the causal version of the ResNet model is comparable to the non-causal version, with no significant difference between the two. The causal and non-causal versions of the Swin model likewise perform similarly, but the causal version of the LSTM model performs significantly worse than its non-causal counterpart.

The researchers reported the average decoding accuracy (N=48) for several key speech parameters, including voice weight (used to distinguish vowels from consonants), loudness, pitch (f0), the first formant (f1), and the second formant (f2). Accurate reconstruction of these parameters, especially pitch, voice weight, and the first two formants, is critical to achieving speech decoding and reconstruction that naturally mimics the participant's voice.
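The article does not specify the exact accuracy metric, so the sketch below simply assumes it is a Pearson correlation between decoded and reference parameter trajectories, computed per parameter and averaged over subjects; the parameter names and shapes are assumptions for illustration.

```python
# Hypothetical per-parameter evaluation sketch (metric assumed, not stated
# in the article): Pearson correlation between decoded and reference curves.
import numpy as np

PARAM_NAMES = ["voice_weight", "loudness", "f0", "f1", "f2"]

def per_parameter_accuracy(decoded, reference):
    """decoded, reference: (n_params, time) arrays for one subject."""
    return {name: float(np.corrcoef(decoded[i], reference[i])[0, 1])
            for i, name in enumerate(PARAM_NAMES)}

def average_accuracy(per_subject_scores):
    """Average the per-parameter scores across subjects (N = 48 in the study)."""
    return {name: float(np.mean([s[name] for s in per_subject_scores]))
            for name in PARAM_NAMES}
```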

The results show that both non-causal and causal models can obtain reasonable decoding results, which provides positive guidance for future research and applications.

2. Speech decoding from left- and right-hemisphere neural signals, and spatial sampling density

The researchers further compared the speech decoding results of the left and right brain hemispheres. Most studies have focused on the left hemisphere, which dominates speech and language functions, while less attention has been paid to decoding language information from the right hemisphere.

In response to this, they compared the decoding performance of the participants' left and right brain hemispheres to verify the possibility of using the right hemisphere for speech recovery.

Among the 48 subjects collected in the study, 16 subjects had ECoG signals collected from the right brain.

Comparing the performance of the ResNet and Swin decoders, they found that the right hemisphere can also decode speech stably, with only a small gap relative to the left hemisphere.

This means that for patients with damage to the left hemisphere and loss of language ability, using neural signals from the right hemisphere to restore language may be a feasible solution.

Next, they also explored the impact of electrode sampling density on speech decoding effects.

Previous studies mostly used higher-density electrode grids (0.4 mm), whereas the electrode grids commonly used in clinical practice are lower density (LD, 1 cm spacing). Five participants used hybrid-density (HB) electrode grids, which are mainly low-density but include additional electrodes; the remaining 43 participants were sampled at low density. The decoding performance of these hybrid (HB) grids was similar to that of the traditional low-density (LD) grids.

This shows that the model can learn speech information from the cerebral cortex with different spatial sampling densities, which also implies that the sampling density commonly used in clinical practice may be sufficient for future brain-computer interface applications.

3. Contribution of different brain regions in the left and right hemispheres to speech decoding

The researchers also examined the contribution of speech-related areas of the brain in the speech decoding process, which provides an important reference for future implantation of speech restoration devices in the left and right brain hemispheres.

Occlusion analysis was used to assess the contribution of different brain regions to speech decoding.
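In an occlusion analysis of this kind, a region's contribution is estimated by zeroing out (occluding) its electrodes and measuring how much the decoding score drops. The sketch below shows the general idea; the region map, channel indices, and scoring function are hypothetical.

```python
# Sketch of an occlusion analysis (illustrative; region definitions, the
# decoding-score function, and data shapes are assumptions).
import numpy as np

def occlusion_contribution(decode_score, ecog, regions):
    """Estimate each region's contribution by zeroing its electrodes.

    decode_score : callable mapping an ECoG array (channels, time) to a
                   scalar decoding score (e.g. spectrogram correlation)
    ecog         : (channels, time) neural recording
    regions      : dict mapping region name -> list of channel indices
    """
    baseline = decode_score(ecog)
    contributions = {}
    for name, channels in regions.items():
        occluded = ecog.copy()
        occluded[channels, :] = 0.0     # "occlude" this region's electrodes
        contributions[name] = baseline - decode_score(occluded)
    return contributions

# Example region map (hypothetical channel indices):
regions = {"auditory_cortex": [0, 1, 2], "ventral_sensorimotor": [3, 4, 5]}
```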

Comparing the causal and non-causal versions of the ResNet and Swin decoders shows that the auditory cortex contributes more in the non-causal models. This supports the use of causal models in real-time speech decoding applications, where auditory neural feedback signals cannot be exploited.

In addition, the contribution of the sensorimotor cortex, especially its ventral portion, is similar in the right and left hemispheres, suggesting that implanting neural prostheses in the right hemisphere may be a feasible option.

In summary, this research marks a series of advances in brain-computer interfaces, but the researchers also note some limitations of the current model. For example, the decoding process requires speech training data paired with ECoG recordings, which may not be obtainable from aphasic patients.

In the future, they hope to develop model architectures that can handle non-grid data and better utilize multi-patient, multi-modal EEG data.

For the field of brain-computer interfaces, current research is still at a very early stage. As hardware iterates and deep learning advances rapidly, the brain-computer interface scenarios seen in science-fiction movies will come ever closer to reality.

References:

  • Paper link: articles/s42256-024-00824-8

  • GitHub link: flinkerlab/neural_speech_decoding

  • More examples of generated speech: nsd/
