This paper presents a text categorization system capable of analyzing HTML/text documents collected from the Web. The system is a component of a more extensive intelligent agent for adaptive information filtering on the Web and is based on a hybrid case-based architecture, in which two multilayer perceptrons are integrated into a case-based reasoner. An empirical evaluation of the system was performed by means of a confidence interval technique. The experimental results obtained are encouraging and support the choice of a hybrid case-based approach to text categorization.
Feature selection has been extensively applied in statistical pattern recognition as a mechanism for cleaning up the set of features used to represent data and as a way of improving classifier performance. Four schemes commonly used for feature selection are Exponential Searches, Stochastic Searches, Sequential Searches, and Best Individual Features. The scheme most popular in text categorization is Best Individual Features, as the extremely high dimensionality of text feature spaces renders the other three feature selection schemes time-prohibitive.
This paper proposes five new metrics for selecting Best Individual Features for use in text categorization. Their effectiveness has been empirically tested on two well-known data collections, Reuters-21578 and 20 Newsgroups. Experimental results show that the performance of two of the five new metrics, Bayesian Rule and F-one Value, is not significantly below that of a good traditional text categorization selection metric, Document Frequency. The performance of another two of these five new metrics, Low Loss Dimensionality Reduction and Relative Frequency Difference, is equal to or better than that of conventional good feature selection metrics such as Mutual Information and the Chi-square Statistic.
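As an illustration of the Best Individual Features scheme discussed in the two abstracts above, the sketch below scores every term independently and keeps the top-k; document frequency and the chi-square statistic are used as example metrics, and the document-term matrix and label array are assumed inputs.

```python
import numpy as np
from sklearn.feature_selection import chi2

def best_individual_features(X, y, k, metric="df"):
    """Score each term independently and keep the k highest-scoring terms.

    X : (n_docs, n_terms) term-occurrence matrix (dense or sparse)
    y : (n_docs,) integer category labels
    """
    if metric == "df":
        # Document frequency: number of documents containing the term.
        scores = np.asarray((X > 0).sum(axis=0)).ravel()
    elif metric == "chi2":
        # Chi-square statistic between each term and the category labels.
        scores, _ = chi2(X, y)
    else:
        raise ValueError("unknown metric")
    return np.argsort(scores)[::-1][:k]  # indices of the selected terms

# Example: keep the 500 terms with the highest document frequency.
# selected = best_individual_features(X_train, y_train, k=500, metric="df")
```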
Many techniques and algorithms for automatic text categorization have been devised and proposed in the literature. However, there is still much room for researchers in this area to improve existing algorithms or devise new techniques for text categorization (TC). Polynomial Networks (PNs) have never before been used in TC. This can be attributed to the huge datasets used in TC, as well as to the high computational demands of the technique itself. In this paper, we investigate and propose using PNs in TC. The proposed PN classifier achieved competitive classification performance in our experiments. More importantly, this high performance is achieved with one-shot (non-iterative) training and using just 0.25%–0.5% of the corpora features. Experiments are conducted on the two benchmark datasets in TC: Reuters-21578 and 20 Newsgroups. Five well-known classifiers are evaluated on the same data and feature subsets: the state-of-the-art Support Vector Machine (SVM), Logistic Regression (LR), the k-nearest-neighbor (kNN), Naive Bayes (NB), and Radial Basis Function (RBF) networks.
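The sketch below is a simplified stand-in for a polynomial classifier trained non-iteratively, given as an assumption rather than the paper's exact Polynomial Network formulation: the (already reduced) feature vectors are expanded polynomially and the output weights are obtained in one shot by a regularized least-squares solve over one-hot category targets.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def train_polynomial_classifier(X, y, degree=2, ridge=1e-3):
    """One-shot training: polynomial expansion followed by a single
    regularized least-squares solve (no iterative optimization).

    X : (n_docs, n_selected_terms) dense matrix of a small feature subset
    y : (n_docs,) integer labels in 0..C-1
    """
    poly = PolynomialFeatures(degree=degree, include_bias=True)
    Z = poly.fit_transform(X)                 # polynomial feature expansion
    T = np.eye(y.max() + 1)[y]                # one-hot category targets
    A = Z.T @ Z + ridge * np.eye(Z.shape[1])  # ridge-regularized normal equations
    W = np.linalg.solve(A, Z.T @ T)
    return poly, W

def predict(poly, W, X):
    return np.argmax(poly.transform(X) @ W, axis=1)
```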
At present, Internet data constitutes the world's largest data resource. In order to realize fast and automatic intelligent classification, it is of great significance to develop automatic classification systems for public security intelligence data. This paper studies the actual needs of public security information text classification, analyzes support vector machine (SVM) theory as the underlying automatic text classification technology, and designs and implements an SVM-based automatic classification system for public security information. The automatic classification provides support for subsequent text mining systems and text searches. After optimization and testing, the system was found to deliver good results in practical application.
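A minimal example of the kind of SVM-based text classification pipeline described above, using scikit-learn with TF-IDF features; the documents and category labels are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder training data: raw document strings and their categories.
docs = ["burglary reported in the downtown district",
        "traffic accident on the highway at night"]
labels = ["crime", "traffic"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(docs, labels)
print(clf.predict(["vehicle collision near the bridge"]))
```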
Feature engineering is one aspect of knowledge engineering. Besides feature selection, the appropriate assignment of feature values is also crucial to the performance of many software applications, such as text categorization (TC) and speech recognition. In this work, we develop a general method to enhance TC performance through the use of context-dependent feature values (a.k.a. term weights), which are obtained by a novel adaptation of a context-dependent adjustment procedure previously shown to be effective in information retrieval. The motivation for our approach is that the general method can be used with different text representations and in combination with other TC techniques. Experiments on several test collections show that our context-dependent feature values can improve TC over traditional context-independent unigram feature values, using a strong classifier such as the Support Vector Machine (SVM), which past work has found hard to improve upon. We also show that the relative performance improvement of our method over the context-independent baseline is comparable to the levels attained by recent word embedding methods in the literature, while an advantage of our approach is that it does not require the substantial training needed to learn word embedding representations.
Authorship attribution can assist the criminal investigation procedure as well as cybercrime analysis. This task can be viewed as a single-label multi-class text categorization problem. Given that the style of a text can be represented as mere word frequencies selected in a language-independent manner, suitable machine learning techniques able to deal with high-dimensional feature spaces and sparse data can be directly applied to solve this problem. This paper focuses on classifier ensembles based on feature set subspacing. It is shown that an effective ensemble can be constructed using exhaustive disjoint subspacing, a simple method that produces many poor but diverse base classifiers. This simple model can be enhanced by a variation of the technique of cross-validated committees applied to the feature set. Experiments on two benchmark text corpora demonstrate the effectiveness of the presented method, improving previously reported results, and compare it to support vector machines, an alternative machine learning approach suitable for authorship attribution.
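A sketch of feature-set subspacing along the lines described above, under the assumption that exhaustive disjoint subspacing partitions the feature set into disjoint blocks, trains one (possibly weak) base classifier per block, and combines the predictions by majority vote; Naive Bayes is used here only as an example base learner, and integer category labels are assumed.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def disjoint_subspace_ensemble(X, y, n_subspaces, seed=0):
    """Partition the features into disjoint subsets and train one base
    classifier per subset (X: term-count matrix, y: integer labels)."""
    rng = np.random.default_rng(seed)
    blocks = np.array_split(rng.permutation(X.shape[1]), n_subspaces)
    return [(idx, MultinomialNB().fit(X[:, idx], y)) for idx in blocks]

def predict_majority(models, X):
    votes = np.stack([m.predict(X[:, idx]) for idx, m in models])
    # Majority vote over the base classifiers, one column per document.
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```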
Kernels are widely used in Natural Language Processing as similarity measures within inner-product-based learning methods such as the Support Vector Machine. The Vector Space Model (VSM) is extensively used for the spatial representation of documents. However, it is a purely statistical representation. In this paper, we present a Concept Vector Space Model (CVSM) representation which uses prior linguistic knowledge to capture the meanings of documents. We also propose a linear kernel and a latent kernel for this space. The linear kernel takes advantage of the linguistic concepts, whereas the latent kernel combines statistical and linguistic concepts: it uses latent concepts extracted by Latent Semantic Analysis (LSA) in the CVSM. The kernels were evaluated on a text categorization task in the biomedical domain using the Ohsumed corpus, which is well known for being difficult to categorize. The results show that the CVSM improves performance compared to the VSM.
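An illustrative computation of the two kernels, assuming the documents are already represented as rows of a concept-document matrix X in the CVSM; the latent kernel projects the CVSM vectors onto LSA concepts with a truncated SVD before taking inner products.

```python
from sklearn.decomposition import TruncatedSVD

def linear_kernel(X):
    # Inner products taken directly in the concept vector space.
    return X @ X.T

def latent_kernel(X, n_latent_concepts=100):
    # Extract latent concepts from the CVSM with LSA (truncated SVD),
    # then take inner products in the reduced space.
    # n_latent_concepts must be smaller than the number of concepts.
    svd = TruncatedSVD(n_components=n_latent_concepts)
    Z = svd.fit_transform(X)
    return Z @ Z.T
```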
Effective feature selection methods are important for improving the efficiency and accuracy of text categorization algorithms by removing redundant and irrelevant terms from the corpus. Extensive research has been done to improve the performance of individual feature selection methods. However, it is always a challenge to come up with an individual feature selection method which would outperform other methods in most cases. In this paper, we explore the possibility of improving the overall performance by combining multiple individual feature selection methods. In particular, we propose a method of combining multiple feature selection methods by using an information fusion paradigm, called Combinatorial Fusion Analysis (CFA). A rank-score function and its associated graph, called rank-score graph, are adopted to measure the diversity of different feature selection methods. Our experimental results demonstrated that a combination of multiple feature selection methods can outperform a single method only if each individual feature selection method has unique scoring behavior and relatively high performance. Moreover, it is shown that the rank-score function and rank-score graph are useful for the selection of a combination of feature selection methods.
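A small sketch of a rank-score function and a diversity measure in the spirit of Combinatorial Fusion Analysis, given as an assumption of the general formulation rather than the paper's exact definitions: each method's term scores are sorted and normalized to [0, 1], and the distance between two such curves is used as a diversity indicator.

```python
import numpy as np

def rank_score_function(scores):
    """Map rank i (1 = highest) to the normalized score of the term at
    that rank, producing a curve over ranks with values in [0, 1]."""
    s = np.sort(np.asarray(scores, dtype=float))[::-1]   # descending by score
    return (s - s.min()) / (s.max() - s.min() + 1e-12)

def diversity(scores_a, scores_b):
    """Distance between the rank-score curves of two feature selection
    methods scoring the same term set (larger = more diverse)."""
    fa, fb = rank_score_function(scores_a), rank_score_function(scores_b)
    return np.linalg.norm(fa - fb) / np.sqrt(len(fa))
```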
The Category Discrimination Method (CDM) is a new machine learning algorithm designed specifically for text categorization. The motivation is that there are statistical problems associated with natural language text when it is used as input to existing machine learning algorithms: too much noise, too many features, and a skewed distribution.
The bases of the CDM are research results about the way humans learn categories and concepts vis-à-vis contrasting concepts. The essential formula is cue validity, borrowed from cognitive psychology and used to select, from all possible single-word features, the best predictors of a given category.
The hypothesis that the CDM's performance will exceed that of two non-domain-specific algorithms, Bayesian classification and decision tree learners, is empirically tested.
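A minimal sketch of a cue validity score, assuming the common cognitive-psychology definition cue_validity(w, c) = P(c | w) estimated from document counts; the CDM's exact formula may differ from this simplification.

```python
import numpy as np

def cue_validity(X, y, category):
    """Estimate P(category | word) for every word from document counts.

    X : (n_docs, n_terms) dense 0/1 document-term matrix
    y : (n_docs,) category labels
    """
    X = (X > 0).astype(float)                              # binary occurrences
    docs_with_word = X.sum(axis=0)                         # count(w)
    docs_with_word_in_cat = X[y == category].sum(axis=0)   # count(w, c)
    return docs_with_word_in_cat / np.maximum(docs_with_word, 1.0)
```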
This paper discusses the notion of Uncertainty, which has a prominent place in the theory and experimental practice of modern Physics. It argues that the awareness of Uncertainty may also be of tremendous importance to the field of Information Retrieval, and in particular Text Categorization.
As an application of Uncertainty in Text Categorization, a new criterion for Term Selection is described, based on the Uncertainty in Term Frequency across categories. This criterion makes it possible to distinguish between low-quality ("noisy") and high-quality ("stiff") terms.
We describe an experiment investigating the effect of eliminating noisy and stiff terms in the context of text classification. In the experiment we applied the Rocchio and Winnow classification algorithms to a collection of newspaper items, a mono-classified subset of the well-known Reuters-21578 corpus.
This investigation shows that both the local elimination of noisy terms and the global elimination of stiff terms can be used for Term Selection in Text Categorization.
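The sketch below is one plausible reading of the criterion, stated as an assumption rather than the paper's definition: a term's "uncertainty" is taken to be the spread of its relative frequency across categories, and thresholds on this score could then separate candidate noisy terms from candidate stiff terms.

```python
import numpy as np

def term_frequency_uncertainty(X, y):
    """For each term, compute its relative frequency within every category
    and return the standard deviation of these frequencies across
    categories, used here as an illustrative uncertainty-style score.

    X : (n_docs, n_terms) dense term-count matrix, y : (n_docs,) labels
    """
    X = np.asarray(X, dtype=float)
    per_category_freq = []
    for c in np.unique(y):
        Xc = X[y == c]
        # Relative frequency of each term within category c.
        per_category_freq.append(Xc.sum(axis=0) / max(Xc.sum(), 1.0))
    return np.stack(per_category_freq).std(axis=0)
```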
Feature Selection is an important task within Text Categorization, where irrelevant or noisy features are usually present, causing a loss in the performance of the classifiers. Feature Selection in Text Categorization has usually been performed using a filtering approach based on selecting the features with the highest scores according to certain measures. Measures of this kind come from the Information Retrieval, Information Theory and Machine Learning fields. However, wrapper approaches are known to perform better in Feature Selection than filtering approaches, although they are time-consuming and sometimes infeasible, especially in text domains. A wrapper that explores a reduced number of feature subsets and uses a fast method as its evaluation function could overcome these difficulties. The wrapper presented in this paper satisfies these properties. Since exploring a reduced number of subsets could yield less promising subsets, a hybrid approach that combines the wrapper method with some scoring measures makes it possible to explore more promising feature subsets. A comparison among some scoring measures, the wrapper method and the hybrid approach is performed. The results reveal that the hybrid approach outperforms both the wrapper approach and the scoring measures, particularly for corpora whose features are less scattered over the categories.
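A sketch of a reduced-search wrapper of the kind described above, given as an assumption rather than the paper's exact procedure: the candidate subsets are the top-k terms under some scoring measure (the hybrid ingredient), and each candidate is evaluated with a fast classifier via cross-validation.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

def hybrid_wrapper(X, y, scores, candidate_sizes=(100, 500, 1000, 2000)):
    """Explore only a few feature subsets (top-k terms by a scoring
    measure) and keep the one with the best cross-validated accuracy
    of a fast evaluation classifier."""
    order = np.argsort(scores)[::-1]          # terms ranked by the measure
    best_idx, best_acc = order[:candidate_sizes[0]], -1.0
    for k in candidate_sizes:
        idx = order[:k]
        acc = cross_val_score(MultinomialNB(), X[:, idx], y, cv=3).mean()
        if acc > best_acc:
            best_idx, best_acc = idx, acc
    return best_idx, best_acc
```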
The bag-of-words technique is often used to represent a document in text categorization. However, for a large set of documents, where the dimension of the bag-of-words vector is very high, text categorization becomes a serious challenge as a result of sparse data, over-fitting, and irrelevant features. A filter feature selection method reduces the number of features by eliminating irrelevant features from the bag-of-words vector. In this paper, we analyze the weak points and strong points of two filter feature selection approaches, the frequency-based approach and the cluster-based approach. Based on this analysis, we propose hybrid filter feature selection methods, named the Frequency-Cluster Feature Selection (FCFS) and the Detailed Frequency-Cluster Feature Selection (DtFCFS), to further improve the performance of the filter feature selection process in text categorization. The FCFS is a combination of the frequency-based approach and the cluster-based approach, while the DtFCFS, a detailed version of the FCFS, is a comprehensively hybrid cluster-based method. We run experiments with four benchmark datasets (the Reuters-21578 and Newsgroups datasets for news classification, the Ohsumed dataset for medical document classification, and the LingSpam dataset for email classification) to compare the proposed methods with six related well-known methods: the Comprehensive Measurement Feature Selection (CMFS), the Optimal Orthogonal Centroid Feature Selection (OCFS), the Crossed Centroid Feature Selection (CIIC), Information Gain (IG), Chi-square (CHI), and the Deviation from Poisson Feature Selection (DFPFS). In terms of the Micro-F1, the Macro-F1, and the dimension reduction rate, the DtFCFS is superior to the other methods, while the FCFS shows competitive and even superior performance to the good methods, especially for the Macro-F1.
In recent times, bacterial Antimicrobial Resistance (AMR) analysis has become a hot study topic. AMR data comprise information related to the antibiotic product name, class name, subclass name, type, subtype, gene type, etc., which can be used to fight illness. However, the tagging language used to describe the data is free-form. These contexts often contain ambiguous data, which makes retrieving, organizing, merging, and finding the relevant data hugely challenging, and manually reading and labelling this text is time-consuming. Topic modeling overcomes these challenges and provides efficient results in categorizing topics and in characterizing the data. In this view, this research work designs an ensemble of artificial intelligence techniques for categorizing AMR gene data and determining the relationships between antibiotics. The proposed model is a weighted-voting-based ensemble incorporating Latent Dirichlet Allocation (LDA) and Hierarchical Recurrent Neural Networks (HRNN), which constitutes the novelty of the work. It is used to determine the number of "topics" that cluster, utilizing a multidimensional scaling approach. In addition, the proposed model involves a data pre-processing stage to remove stop words, punctuation, casing differences, etc. Moreover, an exploratory data analysis uses word clouds to verify proper functionality before proceeding with model training. Three approaches, namely perplexity, the harmonic mean, and random initialization of K, are employed to determine the number of topics. For experimental validation, an openly accessible Bacterial AMR reference gene database is employed. The experimental results report that perplexity provided the optimal number of topics for the AMR gene data of more than 6500 samples. Therefore, the proposed model helps to find the appropriate antibiotic for bacterial and viral spread and to discover how to increase the proper antibiotic in human bodies.
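A minimal illustration of using perplexity to choose the number of LDA topics, as mentioned above, with scikit-learn; the corpus lines are placeholders, and the HRNN component and the weighted-voting ensemble of the proposed model are not reproduced here.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder free-text AMR gene annotations.
docs = [
    "beta-lactam resistance gene detected in isolate",
    "tetracycline efflux pump protein",
    "aminoglycoside modifying enzyme",
    "quinolone resistance determinant",
]
X = CountVectorizer(stop_words="english").fit_transform(docs)

# Fit LDA for several candidate topic counts and keep the count with the
# lowest perplexity (evaluated on the training data here for brevity).
best_k, best_perplexity = None, float("inf")
for k in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
    p = lda.perplexity(X)
    if p < best_perplexity:
        best_k, best_perplexity = k, p
print("chosen number of topics:", best_k)
```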
Automatic categorization is a viable method to deal with the scaling problem on the World Wide Web. In this paper, we consider how to apply general text categorization techniques to Web site classification tasks. Two relevant issues arise. First, the object being classified is not a single document, such as a home page, but a Web site, which is a collection of Web pages. Second, real-world Web directories have a complex hierarchical structure, in which both leaf and non-leaf categories are directly assigned to Web sites, unlike the hierarchical structure treated in most previous research. On the first issue, this paper proposes using the Web pages linked from a home page in addition to the home page itself. To accomplish this, we propose a Web site classification method based on connectivity analysis as well as content analysis of Web sites. On the second issue, the hierarchical structure of classifiers is transformed into a flattened structure, but the classifier for each category uses features of its next-of-kin categories to take advantage of the hierarchical relationship. In experiments on a Korean commercial Web directory, the proposed classification method improved the micro-averaged breakeven point by 36.6% compared with an ordinary classifier.
In this paper, we present a novel model for improving the performance of Domain Dictionary-based text categorization. The proposed model is named the Self-Partition Model (SPM). SPM groups candidate words into predefined clusters, which are generated according to the structure of the Domain Dictionary. Using these learned clusters as features, we propose a novel text representation. The experimental results show that the text categorization system based on the proposed representation performs better than the Domain Dictionary-based text categorization system. It also performs better than a Bag-of-Words-based system when the number of features is small and the training corpus is small.
Nowadays, documents are increasingly associated with multi-level category hierarchies rather than a flat category scheme. As the volume and diversity of documents grow, so do the size and complexity of the corresponding category hierarchies. To be able to access such hierarchically classified documents in real time, we need fast automatic methods to navigate these hierarchies. Today's data domains are also very different from each other, such as medicine and politics, and these distinct domains can be handled by different classifiers. A document representation system which incorporates the inherent category structure of the data should also add useful semantic content to the data vectors and thus lead to better separability of classes. In this paper, we present a scalable meta-classifier to tackle today's problem of multi-level data classification in the presence of large datasets. To speed up the classification process, we use a search-based method to detect the level-1 category of a test document. For this purpose, we use a category-hierarchy-based vector representation. We evaluate the meta-classifier by scaling to both longer documents and a larger category set, and show it to be robust in both cases. We test the architecture of our meta-classifier using six different base classifiers (Random forest, C4.5, multilayer perceptron, naïve Bayes, BayesNet (BN) and PART). We observe that even though there is very little variation in the performance of the different architectures, all of them perform much better than the corresponding single baseline classifiers. We conclude that the substantial potential lies in the meta-classifier architecture, rather than in the classifiers themselves, and that it successfully improves classification performance.
Text Categorization (TC) has become one of the major techniques for organizing and managing online information. Several studies have proposed so-called associative classification for databases, and a few of these studies classify text documents into predefined categories based on their contents. In this paper, a new approach is proposed for Arabic text categorization. The approach facilitates the discovery of association rules for building a classification model for Arabic text categorization. An Apriori-based algorithm is employed for association rule mining. To validate the proposed approach, several experiments were applied to a collection of Arabic documents. Three classification methods using association rules were compared in terms of their classification accuracy: ordered decision list, weighted rules, and majority voting. The results showed that the majority voting method was the best in most of the experiments, achieving an accuracy of up to 87%. On the other hand, the weighted rules method was the worst in all experiments. Generally, the results of the experiments showed that association rule mining is a suitable method for building good classification models to categorize Arabic text.
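A sketch of the majority-voting prediction step of an associative classifier, assuming the rules have already been mined (for example with an Apriori-based algorithm) and are supplied as (antecedent term set, predicted category, confidence) triples; the example rules below are hypothetical and rule mining itself is not shown.

```python
from collections import Counter

def predict_majority_voting(doc_terms, rules, min_confidence=0.6):
    """Fire every mined rule whose antecedent terms all occur in the
    document, then return the category predicted by most firing rules."""
    doc_terms = set(doc_terms)
    votes = Counter(
        category
        for antecedent, category, confidence in rules
        if confidence >= min_confidence and set(antecedent) <= doc_terms
    )
    return votes.most_common(1)[0][0] if votes else None

# Hypothetical mined rules over transliterated Arabic terms.
rules = [({"riyada", "kura"}, "sports", 0.9), ({"hukuma"}, "politics", 0.8)]
print(predict_majority_voting({"riyada", "kura", "mubarah"}, rules))
```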
Extracting information from paper documents opens a variety of innovative applications by supporting people in their daily processing of documents. In this chapter, a system that interprets text on paper documents within the restricted domain of a certain application is presented. The system consists of four components. The Document Image Analysis component transforms the text of the scanned document image into an electronic format represented by a sequence of word hypotheses. Based on this sequence, three components extract the information necessary for automatic processing of documents. First, the information enclosed in structured text is extracted, such as the sender and recipient of business letters, or the title and author of scientific papers. Second, the text body of a message is mapped to a certain pre-defined category. In the final step, this text is analyzed and the information relevant for the current application is extracted. It is shown that for a real-world application the paper documents can be completely interpreted, resulting in an automatically generated answering letter. The system is fast, fault tolerant with respect to misspelling or recognition errors, and readily adaptable to new applications.