In the quest for greater computer lip-reading performance there are a number of tacit assumptions that are either built into the datasets (high resolution, for example) or into the methods (recognition via visual speech units called "visemes", for example). Here we review these and other assumptions and show the surprising result that computer lip-reading is not heavily constrained by video resolution, pose, lighting, and other practical factors. However, the working assumption that visemes, the visual equivalent of phonemes, are the best unit for recognition does need further examination. We conclude that visemes, which were defined over a century ago, are unlikely to be optimal for a modern computer lip-reading system.
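To make the viseme concept concrete, the sketch below shows a many-to-one phoneme-to-viseme lookup and why it loses information. It is a minimal illustration assuming one classic consonant grouping; the labels V1, V2, ... and the exact grouping are our own, not a mapping taken from this work.

```python
# Minimal sketch of a many-to-one phoneme-to-viseme mapping.
# The grouping is one classic choice; real mappings vary across the literature.
PHONEME_TO_VISEME = {
    "p": "V1", "b": "V1", "m": "V1",   # bilabials look alike on the lips
    "f": "V2", "v": "V2",              # labiodentals
    "th": "V3", "dh": "V3",            # dentals
    "t": "V4", "d": "V4", "s": "V4", "z": "V4", "n": "V4",  # alveolars
    "k": "V5", "g": "V5",              # velars, barely visible
}

def visemes(phonemes):
    """Collapse a phoneme sequence to its viseme sequence."""
    return [PHONEME_TO_VISEME.get(p, "V?") for p in phonemes]

# "bat" and "mat" collapse to the same viseme sequence, so a viseme-based
# recogniser cannot distinguish them: the mapping discards information.
print(visemes(["b", "a", "t"]))  # ['V1', 'V?', 'V4']
print(visemes(["m", "a", "t"]))  # ['V1', 'V?', 'V4']
```

This many-to-one collapse is precisely why a unit defined for human perception may be suboptimal for a machine recogniser.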
Human lip-readers are increasingly being presented as useful in the gathering of forensic evidence but, like all humans, suffer from unreliability. Here we report the results of a long-term study in automatic lip-reading with the objective of converting video to text (V2T). The V2T problem is surprising in that some aspects that look tricky, such as real-time tracking of the lips in poor-quality interlaced video from hand-held cameras, prove to be relatively tractable, whereas speaker-independent lip-reading is very demanding owing to unpredictable variation between people. Here we review the problem of automatic lip-reading for crime fighting and identify the critical parts of the problem.
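As an illustration of why lip tracking in poor-quality footage is more tractable than it looks, the sketch below crudely deinterlaces each frame and crops a mouth region using OpenCV's stock Haar face detector. This is not the tracker used in the study; the input file name and the lower-third mouth heuristic are assumptions made for the example.

```python
import cv2

# Stock frontal-face Haar cascade shipped with OpenCV.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def mouth_region(frame):
    """Return a crop of the lower face (mouth area), or None if no face."""
    # Crude deinterlace: keep even scan lines, then restore the frame height.
    deinterlaced = cv2.resize(frame[::2], (frame.shape[1], frame.shape[0]))
    gray = cv2.cvtColor(deinterlaced, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # largest detection
    # Heuristic: the mouth occupies roughly the lower third of the face box.
    return deinterlaced[y + 2 * h // 3 : y + h, x : x + w]

cap = cv2.VideoCapture("handheld_clip.avi")  # hypothetical input clip
while True:
    ok, frame = cap.read()
    if not ok:
        break
    roi = mouth_region(frame)
    if roi is not None:
        cv2.imshow("mouth", roi)
        cv2.waitKey(1)
cap.release()
```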
A recent trend in law enforcement has been the use of forensic lip-readers. Criminal activities are often recorded on CCTV or other video-gathering systems. Knowledge of what suspects are saying enriches the evidence gathered, but lip-readers, by their own admission, are fallible, so, based on long-term studies of automated lip-reading, we are investigating the possibilities and limitations of applying this technique under realistic conditions. We have adopted a step-by-step approach and are developing a capability for cases where prior video of the suspect of interest is available. We use the terminology video-to-text (V2T) for this technique by analogy with speech-to-text (S2T), which also has applications in security and law enforcement.
KEYWORDS: Video, Video processing, Visualization, Mouth, RGB color model, Information visualization, Signal-to-noise ratio, Optical tracking, Modeling
Accurate lip-reading techniques would be of enormous benefit to agencies involved in counter-terrorism and other areas of law enforcement. Unfortunately, there are very few skilled lip-readers, and the skill is apparently difficult to teach, so the area is under-resourced. In this paper we investigate the possibility of making the lip-reading task more amenable to a wider range of operators by enhancing lip movements in video sequences using active appearance models. These are generative, parametric models commonly used to track faces in images and video sequences. The parametric nature of the model allows a face in an image to be encoded in terms of a few tens of parameters, while the generative nature allows faces to be re-synthesised from those parameters. The aim of this study is to determine whether exaggerating lip motions in video sequences, by amplifying the parameters of the model, improves lip-reading ability. We also present results of lip-reading tests undertaken by experienced (but non-expert) adult subjects who claim to use lip-reading in their speech recognition process. The results, which compare word error rates on unprocessed and processed video, are mixed. We find that the method has the potential to improve the word error rate, but for it to improve intelligibility, more sophisticated tracking and visual modelling are needed. Our technique can also act as an amplifier of expressions or visual gestures, and so has applications to animation and the presentation of information via avatars or synthetic humans.
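The amplification step itself is simple once the model is fitted: encode each frame as model parameters, scale each frame's deviation from the sequence mean, and re-synthesise. In the sketch below the AAM class and its fit/synthesise methods are hypothetical stand-ins for whatever active-appearance-model implementation is used, not an API from this paper.

```python
import numpy as np

# Hypothetical AAM interface: `fit` projects a frame onto the model's
# parameter space; `synthesise` renders a face image from parameters.
class AAM:
    def fit(self, frame):          # -> 1-D np.ndarray of model parameters
        raise NotImplementedError  # stand-in for a real AAM fitter
    def synthesise(self, params):  # -> rendered frame
        raise NotImplementedError  # stand-in for a real AAM renderer

def exaggerate(model, frames, gain=1.5):
    """Amplify lip/expression motion by scaling parameter deviations.

    Each frame is encoded as a few tens of parameters; scaling their
    deviation from the sequence mean exaggerates the motion, and the
    generative model turns the scaled parameters back into video frames.
    """
    params = np.stack([model.fit(f) for f in frames])
    mean = params.mean(axis=0)                 # the "neutral" configuration
    amplified = mean + gain * (params - mean)  # gain 1.0 reproduces the input
    return [model.synthesise(p) for p in amplified]
```

A gain above 1.0 exaggerates motion, at the risk of synthesis artifacts when the scaled parameters leave the region of parameter space covered by the training data, which is one reason more sophisticated tracking and modelling are needed.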