
AUTHORSHIP ATTRIBUTION BASED ON FEATURE SET SUBSPACING ENSEMBLES

EFSTATHIOS STAMATATOS

Department of Information and Communication Systems Eng., University of the Aegean, Karlovassi, Samos – 83200, Greece

https://doi.org/10.1142/S0218213006002965

Abstract

Authorship attribution can assist criminal investigation as well as cybercrime analysis. The task can be viewed as a single-label, multi-class text categorization problem. Given that the style of a text can be represented simply by the frequencies of words selected by a language-independent method, machine learning techniques that can handle high-dimensional feature spaces and sparse data can be applied directly to this problem. This paper focuses on classifier ensembles based on feature set subspacing. It is shown that an effective ensemble can be constructed using exhaustive disjoint subspacing, a simple method that produces many poor but diverse base classifiers. This simple model can be enhanced by a variation of the cross-validated committees technique applied to the feature set. Experiments on two benchmark text corpora demonstrate the effectiveness of the presented method, which improves on previously reported results, and compare it to support vector machines, an alternative machine learning approach well suited to authorship attribution.
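As an illustration only, the idea of exhaustive disjoint subspacing can be sketched as follows: the feature set is split into disjoint subsets that together cover every feature, one weak base classifier is trained per subset, and their decisions are combined. The abstract does not specify the base learner, the partitioning scheme, or the combination rule, so the random partition, the nearest-centroid base classifier, and the majority vote below are all assumptions, and every function and class name is hypothetical.

```python
import random
from collections import Counter


def disjoint_subspaces(n_features, subset_size, seed=0):
    """Shuffle feature indices and cut them into disjoint subsets covering all features.

    Assumption: a random partition; the abstract does not say how subsets are formed.
    """
    indices = list(range(n_features))
    random.Random(seed).shuffle(indices)
    return [indices[i:i + subset_size] for i in range(0, n_features, subset_size)]


class CentroidClassifier:
    """Hypothetical stand-in base learner: nearest class centroid on one feature subset."""

    def __init__(self, features):
        self.features = features
        self.centroids = {}

    def fit(self, X, y):
        counts = Counter(y)
        sums = {}
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(self.features))
            for j, f in enumerate(self.features):
                acc[j] += row[f]
        # Mean word-frequency vector per author, restricted to this subset.
        self.centroids = {c: [v / counts[c] for v in acc] for c, acc in sums.items()}
        return self

    def predict_one(self, row):
        def squared_distance(label):
            centroid = self.centroids[label]
            return sum((row[f] - m) ** 2 for f, m in zip(self.features, centroid))
        return min(sorted(self.centroids), key=squared_distance)


def fit_ensemble(X, y, subset_size, seed=0):
    """Train one base classifier per disjoint feature subset."""
    subsets = disjoint_subspaces(len(X[0]), subset_size, seed)
    return [CentroidClassifier(s).fit(X, y) for s in subsets]


def predict_ensemble(ensemble, row):
    """Combine base decisions by majority vote (an assumed combination rule)."""
    votes = Counter(clf.predict_one(row) for clf in ensemble)
    return votes.most_common(1)[0][0]


# Toy word-frequency vectors for two hypothetical authors, "A" and "B".
X_train = [[5, 1, 4, 0], [6, 0, 5, 1], [1, 5, 0, 4], [0, 6, 1, 5]]
y_train = ["A", "A", "B", "B"]
ensemble = fit_ensemble(X_train, y_train, subset_size=2)
print(predict_ensemble(ensemble, [5, 0, 5, 0]))
```

Each base classifier sees only a small slice of the feature space, so individually it may be weak, but because the slices do not overlap the classifiers tend to make different errors, which is the property an ensemble relies on.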
