Live speech to text during lectures
May 20, 2010 at 4:49 pm | Posted in event, lecture recording
Tags: conference, synote, voice-to-text
Last week I attended the final meeting of the EU Net4Voice project in Bologna, which presented its outcomes and debated future directions. My involvement has been as a member of the steering group for the Synote system, developed by Mike Wald and ECS.
The project explored the use of commercial PC-based speech-to-text systems such as Dragon Dictate to create a transcript of a live lecture and display it. The primary aim was to support deaf students and hopefully avoid their need for human assistants. There are quite a few challenges to overcome:
- the software needs to be trained to recognise the tutor’s voice to gain high accuracy (around 95%);
- the software is designed for dictation rather than natural speech, so accuracy drops sharply during lectures to around 80%;
- that 20% failure rate can include many of the key words needed to make sense of the text;
- the software waits until you pause before displaying recognised text, which is OK for dictation, but causes an irritating lag with continuous speech.
However, provided the tutor adjusts their lecturing style, reasonable results can be achieved, especially if the system is trained to recognise unusual words, names etc. that will be used in the lecture. Some deaf students found the words a distraction, as they were good lip-readers, while others found them really useful. Interestingly, regular students also made use of the live transcript and used it to check their understanding of what had just been said.
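To make those accuracy percentages concrete, here is a minimal Python sketch of one common way to score a transcript: word accuracy, taken as 1 minus the word error rate (word-level edit distance divided by the number of words actually spoken). This is an assumption for illustration only; the post does not say how the project measured accuracy, and the function name `word_accuracy` and the sample sentences are made up.

```python
def word_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - WER, where WER = word-level edit distance / reference length.

    `reference` is what was actually said; `hypothesis` is what the
    recogniser produced. Illustrative metric, not the project's own.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # word dropped by recogniser
                            curr[j - 1] + 1,      # extra word inserted
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return 1.0 - prev[-1] / len(ref)


spoken = "mitochondria are the powerhouse of the cell in most organisms"
recognised = "mighty are the powerhouse of the cell in most organs"
print(f"{word_accuracy(spoken, recognised):.0%}")
```

In this made-up example only two of the ten words are wrong, but they are the two that carry the subject matter, which is the problem with an 80% figure described in the list above.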
The project also looked at the use of hosted speech-to-text systems based on IBM ViaVoice to produce a transcript that can be viewed after the lecture – and this is where the Synote system excels. It not only produces the transcript, but also synchronises it with the recording of the lecture, the slides etc. We’ll be looking to see how we can integrate Synote with our lecture capture pilot. Again, the main challenge is accuracy – ViaVoice achieves around 80% with natural speech but requires no training. The transcript can be manually edited while retaining synchronisation – but the question is who does the editing?
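As a rough illustration of how a transcript can be edited "while retaining synchronisation", the Python sketch below keeps a start and end time with every word and only ever replaces the text when a correction is made. It is not Synote's actual data model, which I have not seen; `TimedWord`, `correct_word` and `word_at` are hypothetical names, and the timings are invented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TimedWord:
    text: str
    start: float  # seconds from the start of the recording
    end: float


def correct_word(transcript: List[TimedWord], index: int, new_text: str) -> None:
    """Fix a misrecognised word but keep its original timing."""
    old = transcript[index]
    transcript[index] = TimedWord(new_text, old.start, old.end)


def word_at(transcript: List[TimedWord], seconds: float) -> Optional[TimedWord]:
    """Return the word being spoken at a playback position, or None."""
    for word in transcript:
        if word.start <= seconds < word.end:
            return word
    return None


transcript = [
    TimedWord("the", 12.0, 12.2),
    TimedWord("sell", 12.2, 12.6),      # recogniser misheard "cell"
    TimedWord("membrane", 12.6, 13.3),
]
correct_word(transcript, 1, "cell")
print(word_at(transcript, 12.4))
```

Because the editor never touches the timings, jumping to 12.4 seconds in the recording still lands on the corrected word.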
During the conference a live transcript was displayed above the speakers, and it was excellent. Not only was it very accurate, but it left out the ers and ums and even corrected the speakers' grammar (many speakers were Italians speaking rather good English). How did they do it? It turned out that two professional translators were taking it in turns to 'revoice' the speakers into Dragon Dictate.
Unfortunately, getting computers to recognise and transcribe natural speech with 100% accuracy is currently science fiction. It would require the computer to recognise the context and meaning of the audio, to filter out all the redundancy and errors, to actually UNDERSTAND the speech. And that is true artificial intelligence.
I suspect that a cheaper technological solution will be to get underpaid but educated workers to touch-type transcripts from audio files sent to them across the Internet. That, of course, is exactly the charge levelled at the voicemail-to-text company SpinVox, although they claim that the majority of calls are transcribed automatically. That seems reasonable to me, actually: most voicemails are very predictable, with only a few requiring comparatively expensive human attention.
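The kind of split I have in mind could look something like the Python sketch below: keep the machine transcript when the recogniser is confident, and queue the rest for a human typist. This is purely hypothetical. I have no knowledge of how SpinVox actually works, and the confidence scores, the 0.9 threshold and the function name `route_messages` are all assumptions.

```python
from typing import List, Tuple


def route_messages(
    messages: List[Tuple[str, float]], threshold: float = 0.9
) -> Tuple[List[str], List[str]]:
    """Split (machine_transcript, confidence) pairs into two queues.

    Transcripts at or above the threshold are accepted automatically;
    the rest are set aside for a human to retype from the audio.
    """
    automatic: List[str] = []
    needs_human: List[str] = []
    for text, confidence in messages:
        if confidence >= threshold:
            automatic.append(text)
        else:
            needs_human.append(text)
    return automatic, needs_human


voicemails = [
    ("Hi it's Mum, call me back when you can", 0.97),
    ("Your appointment is confirmed for Tuesday", 0.94),
    ("Meet me at the sinner tote demo at for", 0.58),
]
automatic, needs_human = route_messages(voicemails)
print(len(automatic), "automatic,", len(needs_human), "for a human")
```

The economics then depend entirely on where the threshold sits: the more predictable the messages, the fewer end up in the expensive queue.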