Special Issue: Best Papers from 2004 Conference on Chinese Language Computing; Guest Editor: Shi-Kuo Chang

Chinese Unknown Word Identification Based on Local Bigram Model

Information Retrieval Laboratory, School of Computer Science and Technology, Harbin Institute of Technology, P. O. Box 321, HIT, Harbin 150001, P. R. China


and

TING LIU

Information Retrieval Laboratory, School of Computer Science and Technology, Harbin Institute of Technology, P. O. Box 321, HIT, Harbin 150001, P. R. China


https://doi.org/10.1142/S0219427905001286

Abstract

This paper presents a Chinese unknown word identification system based on a local bigram model. Our word segmentation system generally employs a statistical unigram model. To identify unknown words, however, we exploit their contextual information and apply a bigram model locally. We combine these two models of different dimensions by adjusting an interpolation value derived from a smoothing method. As a simplification of the full bigram model, the method is both simple and feasible: its algorithmic complexity is low, and it does not require a large training corpus. Our experimental results show that the solution is effective.
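To make the idea of combining a unigram and a bigram model through an interpolation value concrete, the following is a minimal sketch of plain linear interpolation between maximum-likelihood unigram and bigram estimates. It is an illustration under stated assumptions, not the system described in the paper: the abstract does not give the exact formula, the smoothing method, or the rule for deciding where the bigram is applied locally. The function names `train_counts` and `interpolated_prob`, the weight `lam`, and the toy corpus are all hypothetical.

```python
from collections import Counter


def train_counts(sentences):
    """Count unigrams and adjacent-word bigrams in pre-segmented sentences."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def interpolated_prob(prev, word, unigrams, bigrams, lam=0.7):
    """Return lam * P(word | prev) + (1 - lam) * P(word).

    Both terms are simple maximum-likelihood estimates. The weight lam is an
    assumed free parameter; the paper derives its value from a smoothing
    method that the abstract does not specify.
    """
    total = sum(unigrams.values())
    p_uni = unigrams[word] / total if total else 0.0
    p_bi = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    return lam * p_bi + (1.0 - lam) * p_uni


# Toy segmented corpus with placeholder tokens (hypothetical data).
corpus = [["a", "b", "c"], ["a", "b", "d"], ["b", "c"]]
unigrams, bigrams = train_counts(corpus)
p = interpolated_prob("a", "b", unigrams, bigrams, lam=0.5)
```

In a segmenter of the kind the abstract describes, one would presumably score most word candidates with the unigram estimate alone and call something like `interpolated_prob` only around candidate unknown words, which keeps the amount of bigram statistics needed small.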

