As speech synthesis becomes more realistic, what problems will we face?

Voice is a unique identifier of a person, and with the rapid development of artificial intelligence there are more and more ways to reproduce that identifier. Speech synthesis, an important branch of artificial intelligence, takes text as input and uses AI algorithms to synthesize audio that sounds as natural as human speech.

Today, machines can imitate human speech easily and accurately and are widely used in audio and video production. They can even clone the voice of a specific person: feed an algorithm a snippet of someone's speech and it will learn that person's way of speaking, then combine that manner of speaking with other utterances. But problems may arise.

Speech synthesis will exceed expectations

Speech synthesis involves building acoustic models that not only turn words into sounds, but also give those sounds intonation and rhythm approximating a human speaker's. Speech synthesis is not a recent technology; it is already widely used across many industries and in everyday life. Even so, the future it could bring still lies beyond most people's imagination.

The most familiar speech synthesis applications are artificial intelligence-based voice telephony, voice navigation, voice assistants, and dubbing.

Take dubbing as an example. Over the past few decades, many classic TVB films and television productions have depended on dubbing. Moreover, the most realistic synthesized voices in animation and other productions have mostly been created by recording voice actors, cutting their recordings into segments, and "splicing" those segments together like a puzzle to form complete lines. Speech synthesis is expected to replace this tedious, repetitive dubbing work: dubbing will no longer be the preserve of professionals, and anyone will be able to clone their own voice simply and independently, with results that are almost lifelike.
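
To make the "splicing" idea concrete, here is a minimal sketch of concatenative synthesis, assuming the pre-cut voice-actor recordings already exist as mono WAV files (the file names and the 30 ms crossfade are illustrative assumptions, not details from any particular product):

```python
# A minimal sketch of concatenative ("splicing") synthesis: short recorded
# units are joined with a brief crossfade so the seams are less audible.
import numpy as np
import soundfile as sf

def crossfade_concat(clips, sr, fade_ms=30):
    """Join mono clips, blending each boundary with a short linear crossfade."""
    fade = int(sr * fade_ms / 1000)
    out = clips[0]
    for clip in clips[1:]:
        ramp = np.linspace(0.0, 1.0, fade)
        # Fade out the tail of the running output while fading in the next clip.
        blended = out[-fade:] * (1.0 - ramp) + clip[:fade] * ramp
        out = np.concatenate([out[:-fade], blended, clip[fade:]])
    return out

unit_files = ["unit_hello.wav", "unit_world.wav"]   # hypothetical pre-cut units
clips, sr = [], None
for path in unit_files:
    data, sr = sf.read(path)
    if data.ndim > 1:                               # down-mix stereo to mono
        data = data.mean(axis=1)
    clips.append(data)

sf.write("spliced_line.wav", crossfade_concat(clips, sr), sr)
```

Real unit-selection systems add prosody matching and far more careful joins; the point here is only that the "puzzle" metaphor corresponds to literal waveform concatenation.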

Until recently, voice cloning, or "voice banking" as it used to be called, was a bespoke service for people at risk of losing the ability to speak because of cancer or surgery. Imitating and synthesizing a voice used to be time-consuming and expensive: the process involved recording many short sentences, each with a different emotional emphasis and repeated several times for different contexts (statements, questions, commands, and so on), in order to cover all possible pronunciations. The Belgian voice banking company Acapela Group charges 3,000 euros ($3,200) for an eight-hour recording process; other companies charge more and require customers to spend days in the studio.

Now a neural network can be trained on unsorted recordings of the target voice and produce a complete piece of audio simply and quickly. When the cloned audio is exported from the device, its timbre and sound quality remain essentially intact.
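
As a rough illustration of this kind of neural cloning, the sketch below uses the open-source Coqui TTS package and its zero-shot YourTTS model (chosen here only as an example; the article does not name a specific tool, and the file paths are hypothetical). A short, unsorted clip of the target voice serves as the reference:

```python
# A minimal voice-cloning sketch, assuming the open-source Coqui TTS
# package (pip install TTS) and its pretrained multilingual YourTTS model.
from TTS.api import TTS

# Load a model that supports zero-shot cloning from a reference recording.
tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts")

# Any text typed here is rendered in the timbre of the reference speaker.
tts.tts_to_file(
    text="This sentence was never spoken by the original speaker.",
    speaker_wav="target_voice_sample.wav",  # hypothetical clip of the target voice
    language="en",
    file_path="cloned_output.wav",
)
```

No training loop is needed in this zero-shot setup; models fine-tuned on more of the target speaker's audio generally sound closer still.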

However, the speech synthesis people currently use or expect is only one part of the picture. Looking at its broader future, speech synthesis may become an entirely new communication device for human beings.

Many engineers are currently developing sophisticated systems that connect the human brain to computers, and this work keeps improving. Current systems rely mainly on gaze and visual attention, which many patients find difficult, but systems that decode auditory attention and motor imagery are also under development.

In the future, a quadriplegic patient could use such a device to control a robotic arm with their mind. If a device of this kind were implanted in the brain's language areas, a speech synthesizer might one day convey what a patient wants to say. Going a step further, stroke patients who are completely paralyzed might be able to "speak" through a speech synthesizer that recognizes the brain patterns underlying an individual's speech.

In April 2019, Edward Chang, a Chinese-American professor at the University of California, San Francisco, and his colleagues developed a decoder that converts brain activity into speech, using brain signals related to lip and tongue movements to synthesize the sentences the subjects wanted to express.

It could be said that speech synthesis is now reaching, step by step, everywhere the human voice has reached before. Its applications are becoming ever more deeply integrated into work and daily life, quietly changing the way people live.

The contest over what is real

Speech synthesis is now close to mature and can mimic human speech easily and accurately, but problems may arise.

In 2014, Val Kilmer, the well-known actor from "Batman Forever", was forced to undergo a tracheotomy because of throat cancer, leaving his voice impaired. Since then he has missed out on many excellent films, and his acting career nearly hit bottom. Speech synthesis technology, however, can recreate Val Kilmer's "original voice": in August 2021, a startup called Sonantic claimed it had done exactly that using AI voice cloning technology.

This kind of voice cloning technology is not complicated to use. To complete the first step of the cloning process, a person only needs to read a prepared script carefully into a microphone for about 30 minutes. If they misread a word or a pronunciation is unclear, they simply stop and re-record that part.

Once all the recordings are done, the audio files are exported and processed, and the cloned voice is ready within a few hours. From then on, the user can type whatever they want to say into an interactive interface, and the cloning system will generate their own "realistic voice" in a relatively short time.

CandyVoice, a young Paris-based company, has developed a mobile app into which users speak about 160 French or English phrases. The program recombines fragments of those recordings so that whatever words are typed later are spoken in a voice that sounds very much like the user's own; in effect, the app clones the user's voice. The stitched-together voice still sounds somewhat synthetic, but CandyVoice's boss, Jean-Luc Crébouw, believes improvements to the company's algorithms will make it more natural.

Similar software exists elsewhere: Festvox, developed at Carnegie Mellon University's Language Technologies Institute, covers English and four widely spoken Indian languages. Baidu, for its part, has said that its software can simulate a person's voice from just 50 sentences.

However, as speech synthesis grows ever more realistic, concerns and doubts multiply: the more real the fake, the greater the cost of telling it apart from the genuine article. From synthesized speech to synthesized video, one serious consequence is a profound challenge to the authenticity of information.

Since the advent of photography, video, and X-ray scanning, the objectivity of visual records has gradually been established in law, journalism, and other social fields, where such records stand in for truth itself, or at least serve as the most powerful evidence for constructing it. "Seeing is believing" became the most popular expression of this epistemological authority. In this sense, visual objectivity derives from a specific system of professional authority, and the voice likewise exists as a person's unique identifier.

However, the technical power of synthesis and the ease with which it roams across media confront this system of professional authority with unprecedented challenges. Deepfakes take the visual records produced within that system and substitute different or even opposite content and meaning, so that the record subverts itself and the whole production system of objectivity and truth is undermined. After the invention of Photoshop, a picture was no longer proof of anything; with the emergence of deep forgery technology, video too has become a manipulable mirror image and the voice is no longer credible. For an internet already awash in fake news, this will undoubtedly lead to a further breakdown of trust.

For example, in 2021 a bank manager received a phone call from a company director: the company was arranging an acquisition, a large sum of money needed to be transferred out of its account, and the director wanted the manager to approve the transaction. An email from the lawyer involved followed, confirming the amount and the destination account.

The transaction appeared legal and compliant, and there was nothing unusual about the process; besides, the call came from the boss himself, so the manager transferred the US$35 million as instructed. Not until the transfer was complete did the Dubai-based manager learn that the familiar voice of the boss on the other end of the line had in fact been synthesized with voice cloning technology. Forbes reported the scam but did not disclose the victims' names or further details. At least 17 people are estimated to have been involved in the elaborate scheme, and since the beginning of the previous year the swindled funds had been sent to bank accounts around the world.

In general, the possibilities of speech synthesis are real and plain for people to see, but its risks also demand attention. After all, a world that has lost its grip on what is real would be more terrifying than a world without speech synthesis.
