Communications of the ACM, May 2022, Vol. 65 No. 5, Pages 30-31
By Gregory Mone
“If you just chain together automatic transcription, translation, and speech synthesis, you end up accumulating too many errors.”
In a critical episode of The Mandalorian, a TV series set in the Star Wars universe, a mysterious Jedi fights his way through a horde of evil robots. As the heroes of the show wait anxiously to learn the identity of their cloaked savior, he lowers his hood, and—spoiler alert—they meet a young Luke Skywalker.
Actually, what we see is an animated, de-aged version of the Jedi. Then Luke speaks, in a voice that sounds very much like the 1980s-era rendition of the character, thanks to the use of an advanced machine learning model developed by the voice technology startup Respeecher. “No one noticed that it was generated by a machine,” says Dmytro Bielievtsov, chief technology officer at Respeecher. “That’s the good part.”
Respeecher is one of several companies developing systems that use neural networks to model the voice of a particular speaker, then apply that model to generate speech that sounds like that individual, even for words the person has never actually uttered. The potential for deepfake-type uses is unsettling, so Respeecher is careful to secure approval from individuals before applying the technology to their voices. The company and others like it are also working on digital watermarking and other techniques to indicate that a sample is synthesized.
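The article does not say how these watermarks work, and the companies' schemes are proprietary. As a loose illustration only, one classic technique (spread-spectrum watermarking) mixes a key-seeded, inaudibly small noise pattern into the audio; anyone holding the key can later detect it by correlating the audio against the same pattern. Every name, amplitude, and threshold below is a hypothetical choice for this sketch, not anything the companies have disclosed.

```python
import random

def add_watermark(samples, key, strength=0.01):
    """Mix a key-seeded pseudorandom pattern (values in [-strength, strength])
    into the audio samples; 'strength' is far below audible amplitude."""
    rng = random.Random(key)
    return [s + strength * (2 * rng.random() - 1) for s in samples]

def has_watermark(samples, key, strength=0.01):
    """Correlate the audio with the same key-seeded pattern.
    If the watermark is present, the average product centers near
    strength * E[pattern^2] = strength / 3; otherwise near zero."""
    rng = random.Random(key)
    pattern = [2 * rng.random() - 1 for _ in samples]
    corr = sum(s * p for s, p in zip(samples, pattern)) / len(samples)
    return corr > strength / 6  # threshold halfway between the two cases

# Hypothetical "audio": 20,000 random samples standing in for speech.
src = random.Random(42)
audio = [0.1 * (2 * src.random() - 1) for _ in range(20000)]

marked = add_watermark(audio, key="secret-key")
print(has_watermark(marked, key="secret-key"))  # True
print(has_watermark(audio, key="secret-key"))   # False
```

Real systems must also survive compression, resampling, and deliberate removal attempts, which is where the hard engineering lies; this sketch shows only the basic embed-and-detect idea.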
There are many positive applications for such voice cloning systems. “If you know that you might lose your voice because of surgery or a medical condition, then you could record it in advance, create a model of your voice, and have the synthesized speech sound like you,” observes Simon King, a professor of speech processing at the U.K.’s University of Edinburgh.
Some companies are pushing the technology even further, developing systems that automatically dub dialogue into other languages while retaining the voice characteristics of the original speaker. Although many challenges remain, advances in speech recognition, translation, and synthesis have accelerated progress in the area, suggesting we might be hearing more subtly synthesized voices in the years to come.
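A back-of-envelope way to see why a chained pipeline "accumulates too many errors," as the quote above puts it: if each stage passes along only a fraction of its input correctly, those fractions multiply. The 90% per-stage figures below are illustrative assumptions, not measurements of any real system.

```python
# Toy model of a cascaded speech-to-speech pipeline: each stage's
# errors become the next stage's input, so per-stage accuracies multiply.

def cascade_accuracy(stage_accuracies):
    """End-to-end accuracy when stages are chained independently."""
    total = 1.0
    for accuracy in stage_accuracies:
        total *= accuracy
    return total

# Hypothetical per-stage accuracies for the three chained steps.
stages = {"transcription": 0.90, "translation": 0.90, "synthesis": 0.90}
overall = cascade_accuracy(stages.values())
print(f"End-to-end accuracy: {overall:.1%}")  # three 90% stages -> 72.9%
```

This multiplicative loss is one motivation for training systems end to end rather than bolting independently built components together.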
From Fiction to Fact
Researchers have been working to develop automatic speech-to-speech translation for at least three decades, according to computer scientist Alan Black of the Language Technologies Institute at Carnegie Mellon University. In the early 2000s, the U.S. Defense Advanced Research Projects Agency (DARPA) funded a project with the goal of developing a universal translator. Black says the teams involved made significant progress translating between English and Iraqi Arabic dialects, but there were limitations, and the effort never achieved the sleek functionality of the universal translator popularized in Star Trek.
About the Author:
Gregory Mone is a science writer and the author, most recently, of the novel Atlantis: The Accidental Invasion.
Further Reading

- Liu, Z. and Mak, B. Cross-lingual Multi-speaker Text-to-Speech Synthesis for Voice Cloning without Using Parallel Corpus for Unseen Speakers, ICASSP 2020; https://arxiv.org/abs/1911.11601
- King, S. Measuring a decade of progress in Text-to-Speech, Loquens, January 2014; https://doi.org/10.3989/loquens.2014.006
- van den Oord, A. and Dieleman, S. WaveNet: A generative model for raw audio, DeepMind Blog, Sept. 8, 2016; https://bit.ly/3pXZNzm
- Wang, Y. et al. Tacotron: Towards end-to-end speech synthesis, Interspeech 2017; https://arxiv.org/abs/1703.10135