FUSION OF MANUAL AND NON-MANUAL INFORMATION IN AMERICAN SIGN LANGUAGE RECOGNITION
We present a bottom-up approach to continuous American Sign Language (ASL) recognition that does not rely on wearable aids: simple low-level processes operate on images and build representations that are fed into intermediate-level processes to form sign hypotheses. At the intermediate level, we construct representations for both manual and non-manual aspects, such as hand movements, facial expressions, and head nods. The manual aspects are represented using Relational Distributions that capture the statistical distribution of the relationships among the low-level primitives extracted from the body parts. These relational distributions, which can be constructed without the need for part-level tracking, are efficiently represented as points in the Space of Probability Functions (SoPF). Manual dynamics are thus represented as tracks in this space. The dynamics of the facial expressions accompanying a sign are also represented as tracks, but in an expression subspace constructed using principal component analysis (PCA). Head motions are represented as 2D image tracks. The integration of manual and non-manual information is sequential, with non-manual information refining the set of hypotheses generated from manual information. We show that with just image-based manual information, the correct detection rate is around 88%. With the addition of facial information, accuracy increases to 92%, so the face contributes valuable information towards ASL recognition. 'Negation' in sentences is correctly detected in 90% of the cases using just 2D head motion information.
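As a rough illustration of the manual representation described above, the sketch below builds one relational distribution per frame from pairwise distance/angle relationships among low-level primitives (e.g., edge pixels) and projects the resulting probability functions onto a PCA subspace, so that a sign becomes a track of low-dimensional points. The function names, the distance/angle histogram features, and the use of PCA as a stand-in for the paper's SoPF construction are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np
from sklearn.decomposition import PCA


def relational_distribution(points, bins=16, max_dist=None):
    """Normalized 2D histogram of pairwise distances and angles between
    low-level primitives (e.g., edge pixels) in one frame. This is an
    illustrative stand-in for the paper's relational distributions; the
    actual relational attributes used may differ."""
    points = np.asarray(points, dtype=float)          # shape (N, 2)
    diffs = points[:, None, :] - points[None, :, :]   # all pairwise vectors
    dists = np.linalg.norm(diffs, axis=-1)
    angles = np.arctan2(diffs[..., 1], diffs[..., 0])
    iu = np.triu_indices(len(points), k=1)            # unordered pairs only
    if max_dist is None:
        max_dist = dists[iu].max() + 1e-9
    hist, _, _ = np.histogram2d(
        dists[iu], angles[iu],
        bins=bins,
        range=[[0.0, max_dist], [-np.pi, np.pi]],
    )
    return (hist / hist.sum()).ravel()                # probability function as a vector


def manual_track(frame_point_sets, n_components=8):
    """Project per-frame relational distributions onto a PCA subspace
    (a simple approximation of the Space of Probability Functions);
    the sign's manual dynamics become a track of low-dimensional points."""
    rds = np.stack([relational_distribution(p) for p in frame_point_sets])
    n_components = min(n_components, len(rds))        # guard for short sequences
    return PCA(n_components=n_components).fit_transform(rds)
```

Under the same assumptions, the non-manual facial track would be obtained analogously by projecting cropped face images onto a PCA expression subspace, and the resulting tracks could then be used to prune or re-rank the manual sign hypotheses.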