
VISUAL SPEECH RECOGNITION USING DYNAMIC FEATURES AND SUPPORT VECTOR MACHINES

WAI CHEE YAU

School of Electrical and Computer Engineering, RMIT University, GPO Box 2476V, Melbourne, Victoria 3001, Australia


DINESH KANT KUMAR

School of Electrical and Computer Engineering, RMIT University, GPO Box 2476V, Melbourne, Victoria 3001, Australia



SRIDHAR POOSAPADI ARJUNAN

School of Electrical and Computer Engineering, RMIT University, GPO Box 2476V, Melbourne, Victoria 3001, Australia


https://doi.org/10.1142/S0219467808003167

Cited by: 8 (Source: Crossref)

Abstract

This paper presents a vision-based technique for identifying unspoken phones using a small camera mounted on the speaker's headset. The system temporally integrates the video data to generate a motion history image (MHI). The paper proposes using global features to classify the MHI and compares image moments with the Discrete Cosine Transform (DCT). A comparison of Zernike moments (ZM) with DCT indicates that the two techniques achieve comparable classification accuracy (96% for ZM and 94% for DCT) when there is no relative motion between the camera and the mouth. However, ZM is resilient to rotation of the camera and continues to give good results, whereas DCT is sensitive to rotation.

Based on the accuracy of the system and its resilience to movement artefacts such as rotation, the authors propose using such a system for human-computer interfaces. It could be invaluable when it is important to communicate without making a sound, for example when giving a password in an open office or in a public space.
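The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the difference threshold, and the "most recent motion wins" form of the MHI are all assumptions made for illustration, the DCT is an unnormalised DCT-II written out by hand, and the SVM classification stage is omitted. The sketch collapses a short clip into an MHI, then computes a Zernike moment magnitude and low-frequency DCT coefficients so that their behaviour under a 90-degree rotation (chosen because it is exact on a pixel grid) can be compared.

```python
from math import factorial

import numpy as np


def motion_history_image(frames, threshold=0.1):
    """Collapse a (T, H, W) grayscale clip into one image.

    Assumed formulation: a pixel is marked as moving when consecutive
    frames differ by more than `threshold`, and later motion overwrites
    earlier motion with a larger value in (0, 1].
    """
    frames = np.asarray(frames, dtype=float)
    mhi = np.zeros(frames.shape[1:], dtype=float)
    n_diffs = frames.shape[0] - 1
    for t in range(1, frames.shape[0]):
        moved = np.abs(frames[t] - frames[t - 1]) > threshold
        mhi[moved] = t / n_diffs
    return mhi


def zernike_magnitude(image, n, m):
    """Return |Z_nm| of a square image mapped onto the unit disk.

    Requires n >= |m| and n - |m| even. Pixels outside the disk are ignored.
    """
    size = image.shape[0]
    coords = np.linspace(-1.0, 1.0, size)
    x, y = np.meshgrid(coords, coords)
    rho = np.hypot(x, y)
    theta = np.arctan2(y, x)
    inside = rho <= 1.0
    # Radial polynomial R_nm(rho), built term by term.
    radial = np.zeros_like(rho)
    for k in range((n - abs(m)) // 2 + 1):
        coeff = ((-1) ** k * factorial(n - k)
                 / (factorial(k)
                    * factorial((n + abs(m)) // 2 - k)
                    * factorial((n - abs(m)) // 2 - k)))
        radial += coeff * rho ** (n - 2 * k)
    basis_conj = radial * np.exp(-1j * m * theta)
    moment = (n + 1) / np.pi * np.sum(image[inside] * basis_conj[inside])
    return abs(moment)


def dct_features(image, keep=4):
    """Return the top-left keep x keep block of an unnormalised 2-D DCT-II."""
    size = image.shape[0]
    k = np.arange(size)
    basis = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * size))
    return (basis @ image @ basis.T)[:keep, :keep]


if __name__ == "__main__":
    # Synthetic clip: a horizontal bar that grows downwards frame by frame.
    clip = np.zeros((4, 33, 33))
    for t in range(1, 4):
        clip[t, 10:10 + 2 * t, 8:26] = 1.0
    mhi = motion_history_image(clip)
    rotated = np.rot90(mhi)
    print("ZM magnitude, original vs rotated:",
          zernike_magnitude(mhi, 2, 2), zernike_magnitude(rotated, 2, 2))
    print("DCT features unchanged by rotation:",
          np.allclose(dct_features(mhi), dct_features(rotated)))
```

In a full system the feature vector (several Zernike magnitudes, or a block of DCT coefficients) would be passed to a classifier such as a support vector machine. The moment magnitude is used here because rotating the image only changes the phase of a Zernike moment, whereas DCT coefficients are tied to the horizontal and vertical axes of the image.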

