
  Bestsellers

  • Article (No Access)

    TOPICVIEW: VISUAL ANALYSIS OF TOPIC MODELS AND THEIR IMPACT ON DOCUMENT CLUSTERING

    We present a new approach for analyzing topic models using visual analytics. We have developed TopicView, an application for visually comparing and exploring multiple models of text corpora, as a prototype for this type of analysis tool. TopicView uses multiple linked views to visually analyze conceptual and topical content, document relationships identified by models, and the impact of models on the results of document clustering. As case studies, we examine models created using two standard approaches: Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). Conceptual content is compared through the combination of (i) a bipartite graph matching LSA concepts with LDA topics based on the cosine similarities of model factors and (ii) a table containing the terms for each LSA concept and LDA topic listed in decreasing order of importance. Document relationships are examined through the combination of (i) side-by-side document similarity graphs, (ii) a table listing the weights for each document's contribution to each concept/topic, and (iii) a full text reader for documents selected in either of the graphs or the table. The impact of LSA and LDA models on document clustering applications is explored through similar means, using proximities between documents and cluster exemplars for graph layout edge weighting and table entries. We demonstrate the utility of TopicView's visual approach to model assessment by comparing LSA and LDA models of several example corpora.
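    As a hedged illustration of the matching step described above, the sketch below pairs LSA concepts with LDA topics by the cosine similarity of their term-weight vectors. It uses scikit-learn as a stand-in toolkit; the corpus, factor counts, and variable names are invented for demonstration and are not TopicView's actual code.

```python
# Illustrative sketch (not TopicView's code): match LSA concepts to LDA
# topics via cosine similarity of their term-weight vectors, giving the
# edge weights of the bipartite comparison graph.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "topic models summarize document collections",
    "clustering groups similar documents together",
    "latent semantic analysis factors a term document matrix",
    "latent dirichlet allocation infers topic distributions",
]

# Shared term-document counts so both models see the same vocabulary.
counts = CountVectorizer().fit_transform(corpus)

n_factors = 2
lsa = TruncatedSVD(n_components=n_factors, random_state=0).fit(counts)
lda = LatentDirichletAllocation(n_components=n_factors, random_state=0).fit(counts)

# similarity[i, j] weights the edge between LSA concept i and LDA topic j.
similarity = cosine_similarity(lsa.components_, lda.components_)
print(similarity)
```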

  • Article (No Access)

    Determination of Context Window Size

    Context windows are important for a variety of natural language analysis and processing tasks. A trade-off exists between task performance and the size of the context. Lucassen and Mercer used mutual information to determine the size of the context for English text. We apply the same technique to determine the context window size for Chinese text. In addition, we use the association score proposed by Church. The association score is directly related to the prediction ability of units in the context. To reduce the effects of spurious associations, the association score value at the N% quantile is used instead of the maximum, and association scores derived from low-frequency occurrences (i.e., fewer than 5) are discarded. A window size of 9 characters was found to be large enough for most associations between characters themselves, and between words themselves. An alternative approach using the (nonparametric) lambda statistic LB is examined, which overcomes the spurious-association problem and the averaging effect of mutual information. We conclude that the lambda statistic is more suitable for exhaustive contextual models (e.g. variable N-gram models), whereas the association score is more suitable for non-exhaustive contextual models (e.g. identification of collocations).
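    The distance-profile idea can be sketched as follows: measure how strongly a unit predicts another unit d positions away and watch the association decay with distance. This toy Python version uses whitespace-tokenized English in place of Chinese characters; the corpus and the low-frequency cutoff are placeholders, not the authors' data.

```python
import math
from collections import Counter

text = "the cat sat on the mat the cat ate the rat " * 50
units = text.split()  # for Chinese text these would be characters

uni = Counter(units)
total = len(units)

def avg_pmi(distance):
    """Average pointwise mutual information of unit pairs `distance` apart."""
    pairs = Counter(zip(units, units[distance:]))
    n_pairs = sum(pairs.values())
    pmi_sum = 0.0
    for (a, b), count in pairs.items():
        if count < 5:  # discard low-frequency pairs, as in the paper
            continue
        p_ab = count / n_pairs
        pmi_sum += count * math.log2(p_ab / ((uni[a] / total) * (uni[b] / total)))
    return pmi_sum / n_pairs

# Association typically decays with distance; the window size is chosen
# where it reaches the background level.
for d in range(1, 10):
    print(d, round(avg_pmi(d), 3))
```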

  • Article (No Access)

    DERIVING PATHWAY MAPS FROM AUTOMATED TEXT ANALYSIS USING A GRAMMAR-BASED APPROACH

    We demonstrate how automated text analysis can be used to support the large-scale analysis of metabolic and regulatory pathways by deriving pathway maps from textual descriptions found in the scientific literature. The main assumption is that correct syntactic analysis combined with domain-specific heuristics provides a good basis for relation extraction. Our method uses an algorithm that searches through the syntactic trees produced by a parser based on a Referent Grammar formalism, identifies relations mentioned in the sentence, and classifies them with respect to their semantic class and epistemic status (facts, counterfactuals, hypotheses). The semantic categories used in the classification are based on the relation set used in KEGG (Kyoto Encyclopedia of Genes and Genomes), so that pathway maps using KEGG notation can be automatically generated. We present the current version of the relation extraction algorithm and an evaluation based on a corpus of abstracts obtained from PubMed. The results indicate that the method is able to combine reasonable coverage with high accuracy. We found that 61% of all sentences were parsed, and 97% of the parse trees were judged to be correct. The extraction algorithm was tested on a sample of 300 parse trees and was found to produce correct extractions in 90.5% of the cases.
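    The tree-walking extraction can be caricatured in a few lines. The sketch below searches a toy parse tree for a relation-bearing verb, emits a typed triple, and tags its epistemic status from hedge words; the tree encoding, verb lexicon, and relation class are illustrative stand-ins for the Referent Grammar machinery and the KEGG relation set.

```python
# Schematic sketch of the extraction step: walk a syntactic tree, find a
# relation-bearing verb, and emit a typed, epistemically qualified triple.
ACTIVATION_VERBS = {"activates", "induces", "phosphorylates"}
HEDGES = {"may", "might", "could", "suggest", "suggests"}

def extract(tree, words):
    """tree: (label, children-or-leaf-index) nested tuples over `words`."""
    relations = []

    def leaves(node):
        label, body = node
        if isinstance(body, int):
            return [words[body]]
        return [w for child in body for w in leaves(child)]

    def walk(node):
        label, body = node
        if label == "S" and not isinstance(body, int):
            subj = next((c for c in body if c[0] == "NP"), None)
            vp = next((c for c in body if c[0] == "VP"), None)
            if subj and vp:
                vp_words = leaves(vp)
                verb = next((w for w in vp_words if w in ACTIVATION_VERBS), None)
                if verb:
                    # hedge words anywhere in the clause downgrade the status
                    status = "hypothesis" if set(leaves(node)) & HEDGES else "fact"
                    obj = [w for w in vp_words if w != verb]
                    relations.append((leaves(subj)[0], "activation", obj[-1], status))
        if not isinstance(body, int):
            for child in body:
                walk(child)

    walk(tree)
    return relations

words = ["RAF1", "phosphorylates", "MEK1"]
tree = ("S", [("NP", 0), ("VP", [("V", 1), ("NP", 2)])])
print(extract(tree, words))  # [('RAF1', 'activation', 'MEK1', 'fact')]
```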

  • Article (No Access)

    DISEASE-RELATED CONCEPT MINING BY KNOWLEDGE-BASED TWO-DIMENSIONAL GENE MAPPING

    There is a strong need to systematically organize and comprehend the rapidly expanding stores of biomedical knowledge in order to formulate hypotheses on disease mechanisms. However, no method is available that automatically structures fragmentary knowledge, along with domain-specific expressions, for large-scale integration. The method presented here, cross-subspace analysis (CSA), produces a holistic view of over 3,000 human genes in a two-dimensional (2D) arrangement. The genes are plotted in relation to functions determined by machine learning from the occurrence patterns of various biomedical terms in MEDLINE abstracts. By focusing on the 2D distributions of gene plots that share the same biomedical concepts, as defined by databases such as Gene Ontology, relevant biomedical concepts can be computationally extracted. In an analysis taking myocardial infarction and ischemic stroke as examples, we found valid relations with lifestyle, diet-related metabolism, and host immune responses, all of which are known risk factors for these diseases. These results demonstrate that systematizing accumulated gene knowledge can lead to hypothesis generation and knowledge discovery, regardless of the area of inquiry or discipline.
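    A loose sketch of the two-dimensional mapping idea (not the CSA algorithm itself): represent each gene by the occurrence pattern of biomedical terms in its literature, project to two dimensions, and inspect where genes sharing an annotation fall. The genes, terms, and counts below are invented toy data.

```python
import numpy as np
from sklearn.decomposition import PCA

genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
terms = ["lipid", "immune", "metabolism", "inflammation"]

# gene-by-term occurrence counts mined from abstracts (toy values)
occurrence = np.array([
    [12, 0, 9, 1],
    [10, 1, 7, 0],
    [0, 8, 1, 11],
    [1, 9, 0, 10],
])

coords = PCA(n_components=2).fit_transform(occurrence)
for gene, (x, y) in zip(genes, coords):
    print(f"{gene}: ({x:.2f}, {y:.2f})")
# Genes sharing a concept (e.g. a GO term) should plot near one another;
# concepts whose genes form tight 2D regions are candidates for extraction.
```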

  • Article (No Access)

    AUTOMATIC ANALYSIS OF CORPORATE SUSTAINABILITY REPORTS AND INTELLIGENT SCORING

    As more and more corporations and business entities publish corporate sustainability reports, the current manual process of analyzing the reports has become tedious and impractical. Development of an intelligent software tool to perform the report-analysis task would be an ideal solution to this long-standing problem. In this paper we argue that, given sufficient training on a custom corpus, corporate sustainability reports can be analyzed in large numbers using supervised-learning-based text-mining software. We also discuss our methodologies for improving the accuracy of our classifier, as well as the feature selector, in order to gain better performance and more stability. Additionally, the results of executing the developed software on one hundred reports are discussed to support our claims.
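    The kind of pipeline the paper argues for might look like the following hedged sketch: a feature selector feeding a supervised text classifier trained on a labeled custom corpus. The report sentences, labels, and parameter choices are placeholders, not the authors' system.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# toy labeled corpus of report sentences
train_sentences = [
    "we reduced scope 2 emissions by ten percent",
    "our board approved new governance policies",
    "water usage per unit fell across all plants",
    "employee training hours increased this year",
]
train_labels = ["environmental", "governance", "environmental", "social"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("select", SelectKBest(chi2, k=10)),   # feature-selection step
    ("clf", MultinomialNB()),              # any supervised classifier fits here
])
pipeline.fit(train_sentences, train_labels)
print(pipeline.predict(["emissions dropped at our facilities"]))
```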

  • Article (No Access)

    IDENTIFICATION OF NAMES AND ACTIONS OF PRINCIPAL OBJECTS IN TV PROGRAM SEGMENTS USING CLOSED CAPTIONS

    This paper proposes a method for automatically extracting the principal video objects that appear in TV program segments, together with their actions, using linguistic analysis of closed captions. We focus on features based on the text style of the closed captions, using Quinlan's C4.5 decision-tree learning algorithm. We extract a noun describing a video object and a verb describing an action for each video shot. To show the effectiveness of the method, we conducted experiments on the extraction of video segments in which animals appear and perform actions, using twenty episodes of a nature program. We obtained F-values of 0.609 for extracting video segments in which animals appear and 0.699 for extracting the action of "eating." We applied our method to a further 20 episodes and generated a multimedia encyclopedia of animals, comprising a total of 387 video clips of 105 kinds of animals and 261 video clips of 56 kinds of actions.
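    Scikit-learn offers no C4.5 implementation, so the sketch below uses its CART decision tree with the entropy criterion as an approximate stand-in. The caption-style features and labels are invented to show the shape of the classification, not the paper's actual feature set.

```python
from sklearn.tree import DecisionTreeClassifier

# Per-shot caption-style features: [has_animal_noun, has_action_verb,
# caption_length, is_narration (vs. dialogue)] -- all illustrative.
X = [
    [1, 1, 34, 1],
    [1, 0, 12, 0],
    [0, 1, 28, 1],
    [0, 0, 9, 0],
    [1, 1, 40, 1],
    [0, 0, 15, 1],
]
y = [1, 0, 0, 0, 1, 0]  # 1 = shot shows a principal animal performing an action

clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(clf.predict([[1, 1, 30, 1]]))  # likely an animal-action segment
```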

  • Article (Free Access)

    Underreaction to News in the US Stock Market

    Using a score that quantifies the tone of news articles, I construct a weekly measure of qualitative information that predicts returns over the next 13 weeks. A portfolio that is long stocks with past positive tone and short stocks with past negative tone has an average return of 16.54 basis points per week (8.60% per year). The findings suggest the market underreacts to the content of news articles. The underreaction is not confined to small stocks, low-analyst-coverage stocks, low-institutional-ownership stocks, or loser stocks. The findings also suggest the tone of news articles is different from sentiment, which is assumed to have no permanent impact on stock prices.
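    The long-short construction can be sketched briefly, assuming per-stock tone scores and next-week returns are already computed. All figures below are invented; the tercile split is one plausible reading of the sorting, not the paper's exact procedure.

```python
import pandas as pd

df = pd.DataFrame({
    "stock": ["A", "B", "C", "D", "E", "F"],
    "tone": [0.8, 0.5, 0.1, -0.2, -0.6, -0.9],                  # past news tone
    "ret_fwd": [0.004, 0.002, 0.001, -0.001, -0.002, -0.003],   # next-week return
})

# Long the most positive-tone tercile, short the most negative-tone tercile.
df["bucket"] = pd.qcut(df["tone"], 3, labels=["short", "mid", "long"])
long_ret = df.loc[df["bucket"] == "long", "ret_fwd"].mean()
short_ret = df.loc[df["bucket"] == "short", "ret_fwd"].mean()
print(f"long-short weekly return: {(long_ret - short_ret) * 1e4:.1f} bps")
```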

  • Article (No Access)

    THE SPEECHES OF THE EUROPEAN CENTRAL BANK’S PRESIDENTS: AN NLP STUDY

    This paper introduces natural language processing into the study of central banking. It studies the evolution of the ECB’s communication over time, considering its three successive presidents (W. Duisenberg, J. C. Trichet and M. Draghi) and the pre- and post-2008 financial crisis eras, and thus helps illuminate the history of the ECB since its inception. From a methodological standpoint, we study the evolution of the ECB’s speeches using text classification and sentiment/polarity analyses. For that purpose, we have built a unique dataset of the ECB’s speeches and implemented algorithms to run the text analysis over time. These analyses help us capture the evolution in the ECB’s understanding of the prevailing economic situation and also measure, for instance, the stress level at the ECB via polarity analysis over time.
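    A minimal sketch of the polarity-through-time step follows. The word lists, speech snippets, and dates are placeholders rather than the authors' dataset or lexicon; the point is only the shape of the computation.

```python
import pandas as pd

POSITIVE = {"growth", "stability", "confidence", "recovery"}
NEGATIVE = {"crisis", "stress", "risk", "uncertainty"}

def polarity(text):
    """Net lexicon hits per word: a crude polarity score."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / max(len(words), 1)

speeches = pd.DataFrame({
    "date": pd.to_datetime(["2001-05-03", "2007-09-06", "2009-03-05", "2012-07-26"]),
    "president": ["Duisenberg", "Trichet", "Trichet", "Draghi"],
    "text": [
        "price stability supports growth and confidence",
        "risk and uncertainty warrant vigilance",
        "the crisis puts stress on funding and raises risk",
        "confidence in the recovery of the euro area",
    ],
})
speeches["polarity"] = speeches["text"].apply(polarity)
print(speeches.groupby("president")["polarity"].mean())  # stress proxy per era
```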

  • Article (No Access)

    WEATHER AND CLIMATE AS EVENTS: CONTRIBUTIONS TO THE PUBLIC IDEA OF CLIMATE CHANGE

    One aspect of the psychology of weather and climate concerns the multiple meanings that may be associated with weather and climate as events. Atmospheric scientists and journalists have increasingly described both weather and climate as events. In this paper, the authors documented the increasing use of weather and climate as events in the scholarly literature of the American Meteorological Society and in newspaper articles over time. The authors also conducted pathfinder network scaling analyses with event-related terms to assess the meanings of events in academic and journalistic writing. The analyses suggested four contexts of event meanings: (1) the study of ordinary weather or climate occurrences, (2) the study and attribution of severe and extreme weather, (3) the societal impacts of weather, and (4) the public lexicon. Communicating about weather and climate as events contributes to the development and evolution of the public idea of climate change. The burgeoning of events in discourse contributes to the public idea of climate change in at least three ways: (1) events contribute specificity to the more general idea of climate change; (2) events contribute experientiality to climate change; and (3) events contribute exemplification to the public idea of climate change, to the extent that weather events can be attributed to climate change.
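    Pathfinder network scaling itself is involved, so the sketch below shows only the co-occurrence proximities such an analysis starts from, using invented sentences in place of the AMS and newspaper corpora.

```python
from itertools import combinations
from collections import Counter

TERMS = ["event", "extreme", "impact", "climate"]
sentences = [
    "an extreme event with major climate impact",
    "the climate event drew public attention",
    "impact studies of each extreme event",
]

# Count how often each pair of terms appears in the same sentence.
cooc = Counter()
for s in sentences:
    present = [t for t in TERMS if t in s.split()]
    for a, b in combinations(sorted(present), 2):
        cooc[(a, b)] += 1

for pair, count in cooc.most_common():
    print(pair, count)  # higher counts = closer terms in the scaled network
```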

  • Chapter (No Access)

    Chapter 17: Advanced Methods: Operationalizing Social Network Services Data—Deep Content Analysis to Comprehend Brand Presence

    The increasing popularity of Social Network Services (SNSs) within the past decade has resulted in dramatic changes in social and economic systems. The technology that facilitates communication in SNSs is creating not only new opportunities for innovation but also new challenges for many incumbent enterprises. As an initial step towards understanding SNSs in business ecosystems, this research attempts to operationalize SNS data by applying advances in text analysis to derive meaningful insights. The methods for operationalizing SNS data are put into practice in a case study investigating user perceptions within the competitive market of smartphone brands. The study indicates how users react to content from different profile types (Personal, Professional, Corporate, News and Viral). The results shed light on the variety of user profile layers participating in SNSs. Further, the study examines the sentiment of the SNS content generated by users.
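    One way to operationalize such data, sketched under the assumption that posts are already labeled by profile type, is to score each post's sentiment and aggregate by type. NLTK's VADER serves here as a generic scorer, not the study's actual tooling, and the posts are invented.

```python
import nltk
nltk.download("vader_lexicon", quiet=True)
from nltk.sentiment import SentimentIntensityAnalyzer

posts = [
    ("Corporate", "Proud to launch our best smartphone yet!"),
    ("Personal", "battery life on this phone is terrible"),
    ("News", "vendor announces new flagship model"),
    ("Viral", "this camera is absolutely amazing"),
]

sia = SentimentIntensityAnalyzer()
by_type = {}
for profile_type, text in posts:
    # compound score in [-1, 1]: negative to positive sentiment
    by_type.setdefault(profile_type, []).append(sia.polarity_scores(text)["compound"])

for profile_type, scores in by_type.items():
    print(profile_type, round(sum(scores) / len(scores), 3))
```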

  • Chapter (No Access)

    The Study of Sentiment Analysis on E-Learning Course Forum Using Web Crawler — A Case of RFID Technology and Certification

    This study was divided into three parts. The first part was to design and create a custom Chinese word-segmentation dictionary specifically for the RFID Technology and Certification course, to improve the accuracy of word segmentation. The second part was to apply text sentiment analysis to identify the positive and negative emotions of students in the course discussions, so that teachers could quickly browse the students’ discussion and identify any situations requiring further attention. The third part found that the relevance of the discussion to the course content was positively correlated with the emotions shown in the discussion.
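    The custom-dictionary segmentation step might look like the following sketch using the jieba library; the dictionary entries and forum post are illustrative, not the course's actual dictionary.

```python
import jieba

# A custom dictionary file would list course terms one per line
# (term, optional frequency, optional POS tag), loaded with
# jieba.load_userdict("userdict.txt"); add_word has the same effect inline.
jieba.add_word("無線射頻辨識")   # "radio-frequency identification"
jieba.add_word("讀取器")         # "reader"

post = "無線射頻辨識的讀取器實驗很有趣"  # "the RFID reader lab was interesting"
print("/".join(jieba.cut(post)))
# Without the custom entries, domain terms would be split into fragments,
# degrading any downstream sentiment analysis of the forum posts.
```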

  • Chapter (No Access)

    The Study of Text Analysis for Basic Arithmetic Based on Machine Learning

    First, we compared nine machine learning methods and identified those applicable to the text analysis of arithmetic word problems involving the four basic operations. These methods were then used to design and test the system. According to the test results of this study, logistic regression is the most suitable machine learning method for this text-analysis task. The resulting problem-solving system achieved a problem-solving rate of 76.5%.
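    A minimal sketch of the winning setup, as we read it: a bag-of-words logistic regression that predicts which operation a word problem requires. The problems (rendered in English) and labels are invented examples.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

problems = [
    "tom has 3 apples and buys 4 more how many now",
    "a class of 20 loses 5 students how many remain",
    "each of 6 boxes holds 8 pens how many pens",
    "split 12 candies among 4 children how many each",
]
operations = ["add", "subtract", "multiply", "divide"]

model = Pipeline([
    ("bow", CountVectorizer()),              # bag-of-words features
    ("lr", LogisticRegression(max_iter=1000)),
])
model.fit(problems, operations)
print(model.predict(["she buys 2 more stickers how many does she have"]))
```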

  • Chapter (No Access)

    Supporting PDF Accessibility Evaluation: Early Results from the FixRep Project

    The aim of this paper is to present results from a pilot study exploring automated formal-metadata extraction in accessibility evaluation. Information about some aspects of accessibility may form part of the formal metadata for a document. As the importance of document accessibility has become more widely accepted, and relevant legislation has been identified and characterised, the possibility of storing information about document accessibility as part of the formal metadata held by the system has become more attractive. Such metadata provide a useful starting point for an accessibility assessment. This study reviews accessibility issues linked to the PDF format. We demonstrate a prototype created during the FixRep project that aims to support the capture, storage and reuse of accessibility information where available, and to approach the problem of reconstructing required data from available sources. Finally, we discuss practical use cases for a service based around this prototype.
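    Pulling formal metadata relevant to accessibility from a PDF might be sketched with the pypdf library as below; the file name is a placeholder, and real accessibility checks (tagging, language, alternative text) go well beyond what is shown.

```python
from pypdf import PdfReader

reader = PdfReader("report.pdf")  # hypothetical input file
info = reader.metadata or {}

# Formal document-information metadata.
print("Title:", info.get("/Title"))
print("Author:", info.get("/Author"))
print("Producer:", info.get("/Producer"))

# A document language and a structure tree ("tagged PDF") are basic
# accessibility signals recorded in the document catalog.
catalog = reader.trailer["/Root"]
print("Language:", catalog.get("/Lang"))
print("Tagged:", "/StructTreeRoot" in catalog)
```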