Improving Domain Dictionary-Based Text Categorization Using Self-Partition Model

WENLIANG CHEN

Natural Language Processing Lab, Northeastern University, Shenyang, 110004, P. R. China

JINGBO ZHU

Natural Language Processing Lab, Northeastern University, Shenyang, 110004, P. R. China

MUHUA ZHU

Natural Language Processing Lab, Northeastern University, Shenyang, 110004, P. R. China

LI ZHANG

Natural Language Processing Lab, Northeastern University, Shenyang, 110004, P. R. China

TIANSHUN YAO

Natural Language Processing Lab, Northeastern University, Shenyang, 110004, P. R. China

https://doi.org/10.1142/S0219427905001304

Abstract

In this paper, we present a novel model for improving the performance of Domain Dictionary-based text categorization. The proposed model, named the Self-Partition Model (SPM), groups candidate words into predefined clusters, which are generated according to the structure of the Domain Dictionary. Using these learned clusters as features, we propose a novel text representation. Experimental results show that a text categorization system based on the proposed representation performs better than the Domain Dictionary-based system. It also performs better than a Bag-of-Words-based system when the number of features is small and the training corpus is small.
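The idea of using word clusters as features can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how SPM assigns candidate words to clusters, so the sketch assumes that assignment is already available as a lookup table. The names `WORD_TO_CLUSTER`, `CLUSTERS`, and `cluster_features`, and the example words and domains, are all hypothetical.

```python
from collections import Counter

# Hypothetical word-to-cluster table. In the setting described above, clusters
# would come from the Domain Dictionary structure and candidate words would be
# assigned by SPM; here the assignments are simply assumed as given.
WORD_TO_CLUSTER = {
    "goalkeeper": "sports",
    "striker": "sports",
    "penalty": "sports",
    "dividend": "finance",
    "equity": "finance",
    "portfolio": "finance",
}

# Fixed feature order: one dimension per cluster, sorted by name.
CLUSTERS = sorted(set(WORD_TO_CLUSTER.values()))


def cluster_features(tokens):
    """Return per-cluster token counts; tokens without a cluster are ignored."""
    counts = Counter(WORD_TO_CLUSTER[t] for t in tokens if t in WORD_TO_CLUSTER)
    return [counts[c] for c in CLUSTERS]
```

With this table, `cluster_features("the striker missed a penalty".split())` yields one count per cluster in the order `["finance", "sports"]`. The resulting vector has as many dimensions as there are clusters, which is far fewer than a Bag-of-Words vocabulary would need.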

This research was supported in part by the National Natural Science Foundation of China and Microsoft Asia Research (No. 60203019), the National Natural Science Foundation of China (No. 60473140) and the Key Project of the Chinese Ministry of Education (No. 104065).
