A HYBRID FEATURE SELECTION METHOD FOR TEXT CATEGORIZATION

https://doi.org/10.1142/S0218488507004492
Cited by: 4 (Source: Crossref)

Feature Selection is an important task within Text Categorization, where irrelevant or noisy features are usually present, causing a loss in classifier performance. Feature Selection in Text Categorization has usually been performed with a filtering approach, selecting the features with the highest scores according to certain measures. Measures of this kind come from the Information Retrieval, Information Theory and Machine Learning fields. Wrapper approaches are known to perform better in Feature Selection than filtering approaches, but they are time-consuming and sometimes infeasible, especially in text domains. Nevertheless, a wrapper that explores a reduced number of feature subsets and uses a fast method as its evaluation function can overcome these difficulties. The wrapper presented in this paper satisfies these properties. Since exploring a reduced number of subsets could yield less promising subsets, a hybrid approach that combines the wrapper method with some scoring measures makes it possible to explore more promising feature subsets. A comparison among the scoring measures, the wrapper method and the hybrid approach is performed. The results reveal that the hybrid approach outperforms both the wrapper approach and the scoring measures, particularly for corpora whose features are less scattered over the categories.
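
The following is a minimal sketch of the general filter-plus-wrapper idea the abstract describes, assuming scikit-learn; the scoring measure (chi-squared), the fast evaluation classifier (multinomial Naive Bayes), the corpus, and the candidate subset sizes are illustrative assumptions, not the exact configuration used in the paper.

```python
# Hypothetical hybrid feature selection sketch: a filter step ranks features
# with a scoring measure, then a wrapper step evaluates only a few subsets
# taken from the top of that ranking with a fast classifier.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

# Build a term-document matrix from a small two-category text corpus.
data = fetch_20newsgroups(subset="train", categories=["sci.med", "sci.space"])
X = TfidfVectorizer().fit_transform(data.data)
y = data.target

# Filter step: rank all features by a scoring measure (chi-squared here).
scores, _ = chi2(X, y)
ranking = np.argsort(scores)[::-1]

# Wrapper step: explore a reduced number of nested candidate subsets,
# scoring each with a fast classifier via cross-validation.
best_k, best_acc = None, -np.inf
for k in (100, 500, 1000, 2000):          # only a few subsets are examined
    subset = ranking[:k]
    acc = cross_val_score(MultinomialNB(), X[:, subset], y, cv=3).mean()
    if acc > best_acc:
        best_k, best_acc = k, acc

print(f"Selected top {best_k} features (CV accuracy {best_acc:.3f})")
```

Restricting the wrapper search to subsets drawn from the filter ranking is what keeps the number of evaluated subsets small, which is the practical point the abstract makes about combining the two approaches.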