SPECIAL ISSUE: Personalization Techniques and Recommender Systems; Edited by G. Uchyigit and M. Y. Ma

A NEW FEATURE SELECTION METHOD FOR TEXT CLASSIFICATION

GULDEN UCHYIGIT

Department of Computing, Imperial College, 180 Queen's Gate, South Kensington, London SW7 2AZ, UK

and

KEITH CLARK

Department of Computing, Imperial College, 180 Queen's Gate, South Kensington, London SW7 2AZ, UK

https://doi.org/10.1142/S0218001407005466

Abstract

Text classification is the problem of assigning documents to a pre-defined set of classes. A major difficulty in text classification is the high dimensionality of the feature space, in which each distinct word is typically a feature. Only a small subset of these words are feature words that can be used to determine a document's class; the rest add noise, can make the results unreliable, and significantly increase computation time. A common approach to this problem is feature selection, in which the number of words in the feature space is significantly reduced.
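As an illustration of what feature selection involves, the following minimal Python sketch scores each word against one class and keeps the top k words. It uses the Chi-Squared statistic in its usual 2x2 contingency-table form as the example scoring function; the function names, the toy documents, and the tie-breaking rule are assumptions made for this sketch and are not taken from the paper.

```python
from collections import Counter


def chi_squared(a, b, c, d):
    """Chi-squared statistic for a 2x2 term/class contingency table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A term present in every document (or in none) carries no signal.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def select_features(docs, labels, target, k):
    """Return the k terms with the highest chi-squared score for `target`.

    docs is a list of token lists and labels the matching class labels.
    Ties are broken alphabetically (an arbitrary choice for this sketch).
    """
    n_pos = sum(1 for y in labels if y == target)
    n_neg = len(labels) - n_pos
    pos_df, neg_df = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        # Count document frequency, so each term counts once per document.
        (pos_df if y == target else neg_df).update(set(tokens))
    scores = {}
    for term in set(pos_df) | set(neg_df):
        a, b = pos_df[term], neg_df[term]
        scores[term] = chi_squared(a, b, n_pos - a, n_neg - b)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy corpus (hypothetical data, for illustration only).
docs = [
    ["goal", "match", "the"],
    ["match", "team", "the"],
    ["cpu", "disk", "the"],
    ["cpu", "memory", "the"],
]
labels = ["sport", "sport", "tech", "tech"]
top = select_features(docs, labels, "sport", 2)
print(top)
```

On this toy corpus the word "the" appears in every document and scores zero, so it is discarded. The statistic is symmetric, so a word that reliably signals the absence of the class ("cpu") ranks as highly as one that signals its presence ("match").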

In this paper we present a comparative experimental study of feature selection methods for text classification. Ten feature selection methods were evaluated, including a new method called the GU metric. The other nine methods are: the Chi-Squared (χ²) statistic, NGL coefficient, GSS coefficient, Mutual Information, Information Gain, Odds Ratio, Term Frequency, Fisher Criterion, and BSS/WSS coefficient. The experimental evaluations show that the GU metric obtained the best F₁ and F₂ scores. The experiments were performed on the 20 Newsgroups data set with a Naive Bayesian probabilistic classifier.
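The abstract does not define the F₁ and F₂ scores. Assuming they denote the standard F-measure with β = 1 and β = 2, they can be computed from true positive, false positive, and false negative counts as in the sketch below. The function name and the example counts are hypothetical, not values from the paper.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """Standard F-beta measure from classification counts.

    tp, fp, fn: true positives, false positives, false negatives.
    beta > 1 weights recall more heavily than precision.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: precision = 0.8, recall = 0.5.
f1 = f_beta(tp=8, fp=2, fn=8, beta=1.0)
f2 = f_beta(tp=8, fp=2, fn=8, beta=2.0)
print(round(f1, 4), round(f2, 4))
```

With these counts F₂ is lower than F₁ because recall is the weaker of the two quantities and β = 2 gives it more weight.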
