The System Integration research group at the Institute for Information Systems (iisys) at Hof University of Applied Sciences researches ways of connecting information systems at the technical level and relating the content of the various systems to one another. As part of the project “Digital Transformation of Medium-Sized Businesses with Artificial Intelligence (DAMMIT)”, the team led by Prof. Dr. René Peinl, head of the research group and scientific director of the institute, also conducts research on speech recognition and speech synthesis. The speech synthesis dataset developed by the research group is now being used at Nvidia, one of the five most important players worldwide in the field of artificial intelligence (AI).
We congratulate Prof. Peinl and his colleagues on this success and took this opportunity to talk to him about the research projects around speech recognition and synthesis.
What does it mean for research at Hof University of Applied Sciences that Nvidia is using the “Hof” dataset?
It is a great recognition to see that one of the five most important AI companies worldwide is working with our research results.
Can you understand how this came about?
Not in detail, but our research has been on Nvidia’s radar for quite some time. One of my colleagues has already received a job offer from Nvidia. Presumably they read our scientific publication.
Is anyone allowed to use this dataset without being asked?
Yes, our dataset is based on public data from LibriVox, and we have in turn made our “refined” data available under a free license for reuse on the Internet. Open source software is becoming more and more popular well beyond AI, be it Linux, Android, VLC Media Player, LibreOffice or Blender. For AI, we also need open data and published pre-trained AI models, and there is very good progress here, too. In speech recognition, for example, we use a model where Google contributes the software, Facebook prepares the speech data, and Nvidia trains the model, which we in turn fine-tune further. “Standing on the shoulders of giants” is a common saying that describes this situation quite well.
How specifically does Nvidia use the data and what is it supposed to achieve?
Unlike us, Nvidia has created a so-called multi-speaker model for speech synthesis. For this, speech data from several (but not too many) speakers is used, with fewer hours of recordings per person; in total, at least 100 hours are needed. In addition to the speech data, the model receives an identification number for each speaker. It can then speak with different voices by being given a speaker ID along with the text to be “read out”. However, multi-speaker models are often inferior in quality to models trained on many hours (>25 h) of a single speaker. Unfortunately, this is also the case here, so Nvidia’s results cannot keep up with our voices Bernd and Hocus Pocus.
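To illustrate the conditioning described above, here is a minimal toy sketch of how a multi-speaker model takes a speaker ID in addition to the text: the ID selects a learned speaker embedding that is combined with the encoded text before synthesis. All names, dimensions, and values are hypothetical placeholders, not Nvidia’s actual model.

```python
# Toy illustration of speaker-ID conditioning in multi-speaker TTS.
# A real model would feed the combined vector into a decoder/vocoder
# to produce a waveform; here everything is a placeholder.

SPEAKER_EMBEDDINGS = {  # hypothetical 4-dim embeddings learned in training
    0: [0.1, -0.3, 0.7, 0.2],
    1: [-0.5, 0.4, 0.0, 0.9],
}

def encode_text(text: str) -> list[float]:
    """Toy stand-in for a text encoder: one value per character."""
    return [ord(c) / 128.0 for c in text]

def synthesize(text: str, speaker_id: int) -> list[float]:
    """Return dummy 'audio' features conditioned on the chosen speaker."""
    emb = SPEAKER_EMBEDDINGS[speaker_id]
    # Concatenate the speaker embedding with the text encoding, so the
    # same text yields different features for different speaker IDs.
    return emb + encode_text(text)

features_a = synthesize("Hallo", 0)
features_b = synthesize("Hallo", 1)
# Same text, different speaker ID -> different conditioning, i.e. a
# different voice at the output of a real model.
```

The key point is only that the speaker ID is an extra input alongside the text; everything else about the architecture is abstracted away.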
Since when has research been carried out in Hof in the field of speech synthesis?
We have been working on speech synthesis since 2019. The impetus came from the startup ahearo settling in at the digital incubator Einstein1.
How can the university benefit from this research success?
It helps the university’s research gain more international visibility.
What is the state of research at iisys? What is planned? Is there a major goal you are pursuing with your research in this area?
The big goal is to develop a digital voice assistant that works without data leakage to servers in the cloud, especially those of global corporations, and is suitable for enterprise use. This requires speech recognition, text understanding, and speech synthesis (text-to-speech), and we are conducting intensive research on all three parts.
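The three stages mentioned above can be sketched as a simple local pipeline: speech recognition feeds text understanding, which feeds speech synthesis, with nothing leaving the device. The function names and the keyword-based intent matching are hypothetical placeholders for illustration only, not the group’s actual components.

```python
# Hedged sketch of a fully local voice assistant pipeline:
# speech recognition -> text understanding -> speech synthesis.
# All three stages are toy placeholders.

def recognize_speech(audio: bytes) -> str:
    """Stage 1: speech-to-text (placeholder transcription)."""
    return "turn on the light"

def understand(text: str) -> dict:
    """Stage 2: map recognized text to an intent (toy keyword matching)."""
    if "light" in text:
        return {"intent": "lights_on"}
    return {"intent": "unknown"}

def synthesize_reply(intent: dict) -> str:
    """Stage 3: choose the reply text a TTS voice would speak."""
    replies = {"lights_on": "Okay, turning the light on."}
    return replies.get(intent["intent"], "Sorry, I did not understand.")

def assistant(audio: bytes) -> str:
    # The whole chain runs on-premises; no audio or text is sent to
    # cloud servers.
    return synthesize_reply(understand(recognize_speech(audio)))

print(assistant(b"\x00\x01"))  # -> "Okay, turning the light on."
```

In a real system each placeholder would be replaced by a trained model, but the enterprise-relevant property is the shape of the pipeline: all three stages run locally.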
We thank Prof. Peinl for the interview!