Computer voices: hardly distinguishable from the human original!

Computer applications that read texts aloud are already a great help in everyday life, especially for blind or visually impaired people. Even when driving, people have long since become accustomed to the friendly voices from the navigation system, which save drivers from dangerous distractions. But of course, the new technology also harbors dangers. The Institute for Information Systems at Hof University of Applied Sciences is conducting a study on the acceptance of artificially generated voices and developing its own models for the German market.

*Prof. Dr. Rene Peinl, head of the Institute for Information Systems; Image: Hof University of Applied Sciences*

The quality of speech synthesis has improved considerably in recent years. For a long time, synthetic voices sounded rather tinny or choppy; that sound is gradually giving way to greater naturalness and unobtrusive speech dynamics. This also makes listening to longer texts more pleasant.

Rapid improvement in speech quality

“This has been achieved in international research through the use of deep neural networks. In the English-speaking world in particular, it is already almost impossible to distinguish between a real person and a program,” says Prof. Dr. Rene Peinl, head of the Institute for Information Systems at Hof University of Applied Sciences. Accordingly, there are now a number of freely available models that speak English very naturally, provided they have been trained on sufficient data. Speech generation usually takes place in two stages. First, a so-called mel spectrogram is generated, a representation of how the frequency content of the speech changes over time. A vocoder then generates the actual audio signal from this spectrogram. Both stages are neural networks that must be trained separately.
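The two-stage structure described above can be sketched in a few lines of Python. This is only an illustration of the data flow, not the institute's software: the class and function names (`DummyAcousticModel`, `DummyVocoder`, `synthesize`) are hypothetical, the two stages are placeholders that return silence instead of trained neural networks, and the values of 80 mel bands and 256 audio samples per frame are assumed for the example.

```python
from dataclasses import dataclass
from typing import List, Protocol

# A mel spectrogram as a list of frames, each frame holding one value per mel band.
MelSpectrogram = List[List[float]]


class AcousticModel(Protocol):
    """Stage 1: text in, mel spectrogram out."""
    def text_to_mel(self, text: str) -> MelSpectrogram: ...


class Vocoder(Protocol):
    """Stage 2: mel spectrogram in, audio samples out."""
    def mel_to_audio(self, mel: MelSpectrogram) -> List[float]: ...


@dataclass
class DummyAcousticModel:
    """Placeholder for a trained network (hypothetical, returns empty frames)."""
    n_mels: int = 80           # assumed number of mel bands
    frames_per_char: int = 5   # assumed fixed duration per character

    def text_to_mel(self, text: str) -> MelSpectrogram:
        n_frames = len(text) * self.frames_per_char
        return [[0.0] * self.n_mels for _ in range(n_frames)]


@dataclass
class DummyVocoder:
    """Placeholder for a trained vocoder (hypothetical, returns silence)."""
    hop_length: int = 256      # assumed audio samples generated per frame

    def mel_to_audio(self, mel: MelSpectrogram) -> List[float]:
        return [0.0] * (len(mel) * self.hop_length)


def synthesize(text: str, acoustic_model: AcousticModel, vocoder: Vocoder) -> List[float]:
    """Run both stages in sequence: text -> mel spectrogram -> audio samples."""
    mel = acoustic_model.text_to_mel(text)
    return vocoder.mel_to_audio(mel)


audio = synthesize("Hallo", DummyAcousticModel(), DummyVocoder())
print(len(audio))  # number of audio samples produced
```

Because the two stages only meet at the spectrogram, each one can be trained and replaced independently, which matches the separate training mentioned above.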

Acceptance on the test bench

The DAMMIT program at Hof University of Applied Sciences, which focuses on technology transfer between universities and small and medium-sized enterprises for digital transformation, is analyzing how well users accept computer-generated voices. Test subjects listen to texts of medium length, for example messages half a screen page long. The steady improvement in speech synthesis quality in recent years makes the technology more convenient and opens up new uses. It also harbors dangers, since machine voices that sound human can be used for fraud or other criminal acts.

Many possible applications

The automated reading aloud of texts is finding its way into more and more areas of application. Being able to take in information while the eyes have to focus on something else is an invaluable advantage: “Speech synthesis is of course an essential part of accessibility for people with visual impairments. In very practical terms, however, orders can be read out to forklift drivers, among others, which can be very helpful and time-saving in their workflow. Or you can have the daily news read out to you in your personal favorite voice. In general, speech synthesis is also an important part of voice-controlled applications such as smart speakers, e.g. Amazon’s Alexa,” says Prof. Dr. Peinl, describing some possible applications.

*Prof. Dr. Rene Peinl; Image: Hof University of Applied Sciences*

Market demand is growing

The demand for automatically generated but human-sounding voices is probably only just beginning. One example can be found in the Einstein 1 business incubator on the campus of Hof University of Applied Sciences: the start-up ahearo offers a service that lets people listen, as an audio podcast, to content that is otherwise only available as text. Until now, these texts have been recorded by human speakers. “Such a production is of course cost-intensive and also reaches its limits due to the limited availability of professional speakers. The collaboration with Hof University of Applied Sciences therefore opens up completely new possibilities for us,” says Johannes Garbarek, founder and CEO of ahearo.

High speed and low cost

“For ahearo and other companies looking for a cost-effective and fast way to incorporate high-quality speech synthesis into their products, we are developing a solution for generating German speech from text,” says Prof. Dr. Peinl. Freely available, self-generated audio data provided by ahearo is used to train the speech synthesis models as well as possible. The evaluation is based on objectively measurable values as well as on subjective assessments by test subjects.

Encouraging interim results

The results obtained so far are encouraging and give reason to hope that the software will soon be used in practice: “Short sentences are already read out very well by our model. The challenges are still pauses and stresses in more complex sentences, as well as abbreviations, compound words and proper names,” explains Peinl. A small anecdote shows that the computer program sometimes has the same problems as humans: “For example, we have the word ‘Frühsommer-Meningoenzephalitis (FSME)’ [early summer meningoencephalitis] in our test texts. It’s no wonder that not only we, but also the computer, have difficulties with such word monstrosities,” says Prof. Dr. Peinl.

Funding

The results of the study, as well as the software developed in the course of the research, will be published and made accessible. The project is funded by the European Union through the European Regional Development Fund (ERDF program Bavaria 2014-2020) and by the Bavarian State Ministry of Science and the Arts. Another project partner is smartlytic GmbH, a software development and data analysis company based on the Hof University campus.
