Meta Says Its AI Can Lip Read to Boost Speech Recognition / Digital Information World

mTags: Meta, Artificial intelligence

Addition date: Jan 11, 2022 8:24 AM

Column: Raphael Thys

Link: https://www.digitalinformationworld.com/2022/01/meta-says-its-ai-can-lip-read-to-boost.html

Speech is the main channel of face-to-face communication, but understanding it involves more than just hearing the words that people say. Reading someone’s lips helps you parse their meaning when you can’t hear them clearly, and that is something Meta seems to be taking into account in its AI.

Many studies have shown that it is harder to understand what someone is saying if you can’t see how their mouth is moving. Meta has developed a new framework called AV-HuBERT that takes both sound and lip movement into account, which could substantially improve speech recognition, although this is only a test at this point.

Meta is essentially testing whether anything can be gained by letting AI read lips as well as listen to audio. Speech recognition software has traditionally worked from audio alone. Tracking lip movement adds a second input that may improve the AI’s ability to understand people and the context of their words, letting it perform tasks more efficiently once it has been fully trained.

The results for AV-HuBERT so far look positive. Meta claims that its framework is 75% more accurate than the best audiovisual speech recognition frameworks currently in use, and that it needed only 10% of the data to achieve these results.

Many of the situations in which you might want to talk to an AI are noisy, such as out on the street or at a party where everyone is talking and loud music is playing. According to Meta, the framework can still understand you in these situations, which would put it well ahead of existing systems. The fact that it needs far less data could also make it useful for languages that don’t have a large body of recordings to train on.
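To make the idea concrete, here is a toy sketch of why a second input helps in noise. It is not how AV-HuBERT works (the article does not describe its internals); it only illustrates combining two sources of evidence. The function name `fuse`, the weighting scheme, and all the probabilities are made up for illustration.

```python
import math


def fuse(audio_probs, visual_probs, audio_weight):
    """Combine two word-probability guesses with a weighted log-linear mix.

    Hypothetical helper for illustration only. Both inputs map the same
    candidate words to probabilities greater than zero. audio_weight is
    between 0 and 1; lower values trust the lips more.
    """
    scores = {
        word: audio_weight * math.log(audio_probs[word])
        + (1.0 - audio_weight) * math.log(visual_probs[word])
        for word in audio_probs
    }
    # Turn the combined scores back into probabilities that sum to 1.
    top = max(scores.values())
    exps = {word: math.exp(score - top) for word, score in scores.items()}
    total = sum(exps.values())
    return {word: value / total for word, value in exps.items()}


# Made-up numbers: in loud music, "might" and "night" sound alike,
# so the audio guess is close to a coin flip and leans the wrong way.
audio = {"might": 0.45, "night": 0.55}
# The lips close for "m" but not for "n", so the visual guess is clear.
visual = {"might": 0.90, "night": 0.10}

audio_only = fuse(audio, visual, audio_weight=1.0)
combined = fuse(audio, visual, audio_weight=0.5)

print(max(audio_only, key=audio_only.get))
print(max(combined, key=combined.get))
```

With audio alone the toy picks the wrong word; once the lip evidence is mixed in, the better-supported word wins. A real system learns how to combine the two signals rather than using a fixed weight like this.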

There has already been a lot of innovation in this area. For example, DeepMind, which is owned by Alphabet, trained a system on thousands of hours of TV footage, and it was able to transcribe words with 50% accuracy using nothing but lip reading. Oxford University has also done a fair amount of work in this area, and Meta’s contributions are likely to push this type of tech further. It will be exciting to see where things go from here.