A computer program can lip-read four times more accurately than a human expert, according to Oxford University researchers.
They worked with Google’s DeepMind project to develop an AI system named Watch, Attend and Spell. The project used a neural network with image recognition tools to analyse 5,000 hours of TV news footage, comprising 118,000 sentences and 17,500 different words, and compared the mouth movements with the text from subtitles.
The accuracy improved over time, largely because the system learned more about the context of individual words. As the BBC notes, one example was that in news footage “Prime” often immediately preceded “Minister”.
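To illustrate the idea of word context, here is a minimal Python sketch of a bigram counter that prefers a candidate word that has often followed the previous word. It is not the researchers' method (their system is a neural network trained on video); the function names, the tiny corpus and the candidate words are all hypothetical, chosen only to mirror the “Prime Minister” example.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count how often each word follows each other word (toy model)."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def pick_word(counts, previous_word, candidates):
    """Return the candidate seen most often after previous_word."""
    following = counts[previous_word.lower()]
    return max(candidates, key=lambda w: following[w.lower()])


# Hypothetical training text standing in for news subtitles.
corpus = [
    "the prime minister spoke today",
    "the prime minister visited oxford",
    "the minister answered questions",
]
model = train_bigrams(corpus)

# Two words that might look alike on the lips (an assumption for illustration).
best = pick_word(model, "prime", ["monster", "minister"])
print(best)
```

In this toy setup, “minister” wins because it followed “prime” in the sample text and “monster” never did. A real system would learn such patterns jointly with the visual features rather than from a separate lookup table.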
Once development was complete, the researchers ran tests on new silent footage. They found the software recognised 50 percent of the words correctly, compared with just a 12 percent success rate by a lip-reading expert. However, the accuracy is likely to be that high only with specific types of speech, namely the language and style used by newsreaders, rather than general conversation. The system would also need to be sped up to cope with real-time “translation” rather than working only on recorded footage.
While there’s plenty more work to be done, long-term uses could include more accurate transcription of video where multiple people are speaking over one another; dubbing speech in silent archive film; and improving speech recognition on smartphones in noisy environments.