PAPERS: Speech Understanding

SVM-BASED PHONEME CLASSIFICATION AND LIP SHAPE REFINEMENT IN REAL-TIME LIP-SYNCH SYSTEM

Department of Electronics and Computer Engineering, Korea University, 1-5ka Anam-dong, Sungbuk-ku, Seoul, 136-701, Korea

and

DAVID K. HAN

Department of Mechanical Engineering, University of Maryland, College Park, MD, USA

https://doi.org/10.1142/S0218001406005113

Abstract

In this paper, we present a real-time lip-synch system that drives a 2-D avatar's lip motion in synch with an incoming speech utterance. To achieve real-time operation, processing time was minimized by "merge and split" procedures that yield a coarse-to-fine phoneme classification. At each stage of phoneme classification, the support vector machine (SVM) method was applied to reduce the computational load while maintaining the desired accuracy. The coarse-to-fine phoneme classification is accomplished via two stages of feature extraction: in the first stage, each speech frame is acoustically analyzed for three classes of lip opening using Mel Frequency Cepstral Coefficients (MFCC) as a feature; in the second stage, each frame is further refined for detailed lip shape using formant information. The method was implemented in 2-D lip animation, and the system was demonstrated to be effective in accomplishing real-time lip-synch. The approach was tested on a PC with an Intel Pentium IV 1.4 GHz CPU and 384 MB RAM, using Microsoft Visual Studio. Phoneme merging combined with SVM was observed to run about twice as fast in recognition as a method employing the Hidden Markov Model (HMM): a typical latency per frame with the proposed method was on the order of 18.22 milliseconds, while an HMM method under identical conditions took about 30.67 milliseconds.
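The two-stage, coarse-to-fine decision described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class labels, feature dimensions, and hand-set linear weights are all hypothetical, and simple linear decision functions stand in for trained SVMs. It shows only the control flow, in which a first classifier picks one of three lip-opening classes from MFCC-like features and a second classifier, chosen by that result, refines the lip shape from formant features.

```python
from typing import Dict, Sequence, Tuple

Vector = Sequence[float]
LinearModel = Tuple[Vector, float]  # (weights, bias), a stand-in for a trained SVM


def decision_value(model: LinearModel, x: Vector) -> float:
    """Linear decision function w.x + b, as used by a linear SVM."""
    weights, bias = model
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def classify(models: Dict[str, LinearModel], x: Vector) -> str:
    """One-vs-rest: return the label whose decision value is largest."""
    return max(models, key=lambda label: decision_value(models[label], x))


# Stage 1 (hypothetical weights): three lip-opening classes from a
# 3-dimensional MFCC-like feature vector.
COARSE_MODELS: Dict[str, LinearModel] = {
    "closed": ([-1.0, 0.0, 0.0], 0.0),
    "half_open": ([0.0, 1.0, 0.0], 0.0),
    "open": ([1.0, 0.0, 0.0], 0.0),
}

# Stage 2 (hypothetical weights): detailed lip shapes within each coarse
# class, scored on formant features (F1, F2) in kHz.
FINE_MODELS: Dict[str, Dict[str, LinearModel]] = {
    "closed": {"m_b_p": ([0.0, 0.0], 0.0)},
    "half_open": {"i": ([-1.0, 1.0], 0.0), "u": ([-1.0, -1.0], 1.5)},
    "open": {"a": ([1.0, 0.0], 0.0), "o": ([0.0, -1.0], 1.2)},
}


def lip_shape(mfcc: Vector, formants: Vector) -> Tuple[str, str]:
    """Classify one frame coarse-to-fine: lip opening first, then lip shape."""
    coarse = classify(COARSE_MODELS, mfcc)
    fine = classify(FINE_MODELS[coarse], formants)
    return coarse, fine


if __name__ == "__main__":
    print(lip_shape([2.0, 0.5, 0.0], [0.8, 1.2]))
```

Because the second stage only evaluates the models belonging to the coarse class already selected, each frame is scored against a small subset of classes rather than the full set, which is one plausible reading of how the merge-and-split scheme reduces per-frame cost.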

Keywords:
