High-Dimensional Text Datasets Clustering Algorithm Based on Cuckoo Search and Latent Semantic Indexing

Saida Ishak Boushaki

LRIA, University of Science and Technology Houari Boumediene, Bab Ezzouar 16123, Algeria

Department of Informatics, University of M’Hamed Bougara Boumerdes, Boumerdes 35000, Algeria

E-mail Address: saida_ib@univ-boumerdes.dz

Nadjet Kamel

LRIA, University of Science and Technology Houari Boumediene, Bab Ezzouar 16123, Algeria

Université Ferhat Abbas Setif 1, Sétif 19000, Algeria

E-mail Address: nkamel@univ-setif.dz

Omar Bendjeghaba

LREEI, University M’Hamed Bougara, Boumerdes, Boumerdes 35000, Algeria

E-mail Address: bendjeghaba@univ-boumerdes.dz

https://doi.org/10.1142/S0219649218500338

Abstract

Clustering is an important data analysis technique. However, clustering high-dimensional data such as documents requires additional effort to extract the rich, relevant information hidden in the multidimensional space. Recently, document clustering algorithms based on metaheuristics have proven efficient at exploring the search space and reaching the global best solution rather than a local one. However, most of these algorithms are impractical and suffer from several limitations: they require the number of clusters to be known in advance, they are neither incremental nor extensible, and they index documents with a high-dimensional, sparse matrix. To overcome these limitations, we propose in this paper a new dynamic and incremental approach (CS_LSI) for document clustering based on the recent cuckoo search (CS) optimization and latent semantic indexing (LSI). Experiments conducted on four well-known high-dimensional text datasets show the efficiency of the LSI model in reducing the dimensionality of the space with more precision and less computational time. In addition, the proposed CS_LSI determines the number of clusters automatically by employing a newly proposed index based on a significant distance measure. The latter is also used in the incremental mode and to detect outlier documents, thereby maintaining more coherent clusters. Furthermore, comparison with conventional document clustering algorithms shows the superiority of CS_LSI in achieving high-quality clustering.
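As an illustration of the dimensionality-reduction step only, the following is a minimal sketch of generic LSI using a truncated singular value decomposition. It is not the authors' CS_LSI implementation: the toy term-document matrix, the choice `k=2`, and the helper names `lsi_reduce` and `cosine_distance` are assumptions made for this example, and the cuckoo search stage and the proposed cluster-number index are not shown.

```python
import numpy as np


def lsi_reduce(term_doc, k):
    """Project documents into a k-dimensional latent space (generic LSI).

    term_doc: matrix with one row per term and one column per document.
    Returns an array with one row per document and k columns.
    """
    # Thin SVD: term_doc = U @ diag(s) @ Vt, singular values in descending order.
    _, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    # Keep the k largest singular values; document j has coordinates s[:k] * vt[:k, j].
    return (s[:k, None] * vt[:k, :]).T


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 1.0  # convention for this sketch: a zero vector is maximally distant
    return 1.0 - float(a @ b) / denom


# Hypothetical toy term-document matrix (rows: terms, columns: documents).
# Documents 0 and 1 share the first two terms; documents 2 and 3 share the last two.
A = np.array(
    [
        [3.0, 2.0, 0.0, 0.0],
        [2.0, 3.0, 0.0, 0.0],
        [0.0, 0.0, 2.0, 1.0],
        [0.0, 0.0, 1.0, 2.0],
    ]
)

docs = lsi_reduce(A, k=2)  # 4 documents, each described by 2 latent dimensions
same_topic = cosine_distance(docs[0], docs[1])
different_topic = cosine_distance(docs[0], docs[2])
print(docs.shape, same_topic, different_topic)
```

In this toy case the two documents that share vocabulary end up close together in the reduced space, while documents with disjoint vocabulary stay far apart. A clustering method such as the one the abstract describes would then operate on these low-dimensional vectors rather than on the sparse term-document matrix.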
