We present a new approach for analyzing topic models using visual analytics. We have developed TopicView, an application for visually comparing and exploring multiple models of text corpora, as a prototype for this type of analysis tool. TopicView uses multiple linked views to visually analyze conceptual and topical content, document relationships identified by models, and the impact of models on the results of document clustering. As case studies, we examine models created using two standard approaches: Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). Conceptual content is compared through the combination of (i) a bipartite graph matching LSA concepts with LDA topics based on the cosine similarities of model factors and (ii) a table containing the terms for each LSA concept and LDA topic listed in decreasing order of importance. Document relationships are examined through the combination of (i) side-by-side document similarity graphs, (ii) a table listing the weights for each document's contribution to each concept/topic, and (iii) a full text reader for documents selected in either of the graphs or the table. The impact of LSA and LDA models on document clustering applications is explored through similar means, using proximities between documents and cluster exemplars for graph layout edge weighting and table entries. We demonstrate the utility of TopicView's visual approach to model assessment by comparing LSA and LDA models of several example corpora.
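As a rough illustration of the concept-topic matching step described above, the following sketch computes the cosine-similarity matrix between LSA factor loadings and LDA topic-word distributions over a shared vocabulary and keeps high-similarity pairs as bipartite edges. The random matrices, the absolute-value handling of signed LSA loadings, and the 0.5 threshold are illustrative assumptions, not TopicView's actual parameters.

```python
# Minimal sketch: bipartite matching of LSA concepts to LDA topics by cosine
# similarity of their term-weight vectors (rows over a shared vocabulary).
import numpy as np

def cosine_similarities(lsa_factors, lda_topics):
    """lsa_factors: (k1, V); lda_topics: (k2, V). Returns a (k1, k2) matrix."""
    a = np.abs(lsa_factors)  # assumption: fold signed LSA loadings to magnitudes
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = lda_topics / np.linalg.norm(lda_topics, axis=1, keepdims=True)
    return a @ b.T  # entry (i, j) = cosine(concept i, topic j)

rng = np.random.default_rng(0)
sim = cosine_similarities(rng.normal(size=(10, 500)), rng.random((10, 500)))
# Edges of the bipartite graph: concept-topic pairs above a chosen threshold.
edges = [(i, j, sim[i, j]) for i in range(10) for j in range(10)
         if sim[i, j] > 0.5]
```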
In recent years, the analysis of bacterial antimicrobial resistance (AMR) has become a highly active research topic. AMR data comprise information related to the antibiotic that can fight an illness: product name, class name, subclass name, type, subtype, gene type, and so on. However, these records are tagged in free-form text, which often contains ambiguous data and makes retrieving, organizing, merging, and finding the relevant data hugely challenging; reading and labelling the text manually is prohibitively time-consuming. Topic modeling overcomes these challenges and provides efficient results in categorizing topics and in making sense of the data. Accordingly, this research work designs an ensemble of artificial-intelligence techniques for categorizing AMR gene data and determining the relationships between antibiotics. The novelty of the work lies in a weighted-voting-based ensemble model that incorporates Latent Dirichlet Allocation (LDA) and Hierarchical Recurrent Neural Networks (HRNN); it is used to determine the number of topics, whose clusters are visualized with a multidimensional scaling approach. The proposed model also involves a data pre-processing stage (removal of stop words and punctuation, lower-casing, etc.), and an exploratory data analysis using word clouds confirms that the corpus is in proper shape before the model training process proceeds. Three approaches, namely perplexity, the harmonic mean, and random initialization of K, are employed to determine the number of topics. For experimental validation, an openly accessible bacterial AMR reference gene database is employed. The experimental results, on more than 6500 samples of AMR gene data, show that perplexity provided the optimal number of topics. The proposed model can therefore help to find the appropriate antibiotic for bacterial and viral outbreaks and to determine how the proper antibiotic can be used in the human body.
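The perplexity-based search for the number of topics could look roughly like the following gensim sketch. The two-document toy corpus and the candidate values of K are placeholders; a realistic run would train on the full AMR gene data and score perplexity on a held-out set.

```python
# Hedged sketch: choose the number of LDA topics by (held-in) perplexity.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["blaZ", "beta-lactam", "resistance"],        # placeholder documents
        ["tetA", "tetracycline", "efflux", "resistance"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

best_k, best_perplexity = None, float("inf")
for k in (2, 3, 4):                                   # candidate K values
    lda = LdaModel(corpus, num_topics=k, id2word=dictionary, random_state=0)
    # gensim's log_perplexity returns a per-word bound; perplexity = 2^(-bound)
    perplexity = 2 ** (-lda.log_perplexity(corpus))   # lower is better
    if perplexity < best_perplexity:
        best_k, best_perplexity = k, perplexity
print(best_k, best_perplexity)
```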
Sentiment analysis has the potential to significantly impact several fields, such as trade, politics, and opinion extraction. Topic modeling is an intriguing concept used in emotion detection, and Latent Dirichlet Allocation (LDA) is an important algorithm in this area: it investigates the semantic associations between terms in a text document and takes into account the influence of a subject on a word. The Joint Sentiment-Topic (JST) model is a framework based on LDA that investigates the influence of subjects and emotions on words. The emotion parameter alone is insufficient, however, and additional factors may be valuable for performance enhancement. This study presents two novel topic models that extend and improve the JST model through a new parameter: the author's view. The author's view means that the author forms an opinion about a product in his or her mind before selecting the words to express it; the proposed models consider the effect of this inherent characteristic, arguably the most important factor in writing a comment, on the words in a text document. According to the evaluation results, the new parameter has a marked effect on model accuracy. The first proposed method is the Author's View-based Joint Sentiment-Topic model for Multi-domain, which reaches an accuracy of 85% and a lower perplexity than competing methods. The second is the Author's View-based Joint Sentiment-Topic model for Single-domain, a variant of the first, which achieves the highest accuracy at 95%. Both methods outperform the baselines across different topic-number settings; the first is suitable for multi-domain datasets and the second for single-domain datasets. These results demonstrate that the author's-view parameter improves sentiment classification at the document level. Although they require no labeled data, the proposed methods are more accurate than discriminative models such as Support Vector Machines (SVM) and logistic regression, and they remain simple, with a small number of parameters, while providing a broad view of the connections between words in documents of a single collection (single-domain) or multiple collections (multi-domain). Eight datasets of different sizes were used in the implementations. For evaluation, this study uses document-level sentiment analysis, perplexity, and topic coherence; in addition, the Friedman test is employed to determine whether the outcomes of the proposed models differ statistically from those of the other algorithms.
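The Friedman test mentioned at the end compares several methods' results across the eight datasets. A minimal SciPy sketch follows; the accuracy numbers are invented placeholders, not the paper's results.

```python
# Hedged sketch: Friedman test over per-dataset accuracies of three methods.
from scipy.stats import friedmanchisquare

# One list per method; each entry is accuracy on one of the eight datasets.
proposed = [0.82, 0.85, 0.80, 0.79, 0.83, 0.81, 0.84, 0.80]
svm      = [0.78, 0.80, 0.77, 0.75, 0.79, 0.76, 0.80, 0.77]
logreg   = [0.76, 0.79, 0.75, 0.74, 0.78, 0.75, 0.78, 0.76]

stat, p = friedmanchisquare(proposed, svm, logreg)
print(f"Friedman chi-square={stat:.3f}, p={p:.4f}")  # p < 0.05: methods differ
```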
This paper proposes a novel concept we call musical commonness: the similarity of a song to a set of songs, in other words, its typicality. This commonness can be used to retrieve representative songs from a set of songs (e.g. songs released in the 80s or 90s). Previous research on musical similarity has compared pairs of songs but has not evaluated the similarity of a song to a set of songs. The methods presented here for estimating the similarity and commonness of polyphonic musical audio signals are based on a unified framework of probabilistic generative modeling of four musical elements (vocal timbre, musical timbre, rhythm, and chord progression). To estimate commonness, we use a generative model trained on a song set, instead of estimating the musical similarities of all possible song pairs with a model trained on each song. In an experimental evaluation, we used two song sets: 3278 Japanese popular music songs and 415 English songs. Twenty estimated song-pair similarities for each element and each song set were compared with ratings by a musician; the comparison with these expert ratings suggests that the proposed methods estimate musical similarity appropriately. Estimated musical commonness is evaluated on the basis of the Pearson product-moment correlation coefficient between the estimated commonness of each song and the number of songs highly similar to it. The results of the commonness evaluation show that songs with higher commonness are indeed similar to more songs in the set.
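The set-level modeling idea, one model per song set rather than one per song, can be sketched generically as below. A Gaussian mixture over pooled audio features stands in for the paper's four element-specific generative models, so everything here is an illustrative assumption rather than the authors' method.

```python
# Minimal sketch of the set-level idea: train one generative model on features
# pooled from a song set, then score a new song's features under that model.
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder frame-level features (e.g. MFCC-like) pooled from a song set.
song_set_features = np.random.randn(5000, 13)
model = GaussianMixture(n_components=16, random_state=0).fit(song_set_features)

new_song_features = np.random.randn(300, 13)
# Mean log-likelihood per frame serves as a commonness-style score.
commonness = model.score(new_song_features)
```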
In the present work, Latent Semantic Analysis of textual data was applied to texts related to courage in order to compare and contrast results and to evaluate the opportunity of integrating different data sets. To better understand how courage is defined in the Italian context, 1199 participants were recruited and asked to complete the sentence "Courage is…". The participants' definitions of courage were analyzed with Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) in order to identify the fundamental concepts arising from the population. An analogous comparison with Twitter posts was also carried out to assess whether the public opinion emerging from social media provides a challenging and rich context for exploring computational models of natural language.
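A minimal sketch of this kind of analysis, assuming scikit-learn and using invented sample answers, extracts the top terms of each LSA concept from the free-text definitions:

```python
# Hedged sketch: LSA over short free-text answers, then top terms per concept.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

answers = ["courage is facing fear to protect someone else",      # placeholders
           "courage is acting despite fear and uncertainty",
           "courage is standing up for what is right"]

vec = TfidfVectorizer()
X = vec.fit_transform(answers)
lsa = TruncatedSVD(n_components=2, random_state=0).fit(X)
terms = np.array(vec.get_feature_names_out())
for i, comp in enumerate(lsa.components_):
    print(f"concept {i}:", terms[np.argsort(comp)[::-1][:5]])
```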
Nowadays, Twitter has become one of the fastest-growing microblogging services; consequently, analyzing this rich and continuously user-generated content can reveal unprecedentedly valuable knowledge. In this paper, we propose a novel two-stage system to detect and track events from tweets by integrating a Latent Dirichlet Allocation (LDA)-based approach and an efficient density–contour-based spatio-temporal clustering approach. In the proposed system, we first divide the geotagged tweet stream into time windows; next, events are identified as topics in tweets using an LDA-based topic discovery step; then, each tweet is assigned an event label; finally, a density–contour-based spatio-temporal clustering approach is employed to identify spatio-temporal event clusters. In our approach, topic continuity is established by calculating KL divergences between topics, and spatio-temporal continuity is established by a family of newly formulated spatial cluster distance functions. Moreover, the proposed density–contour clustering approach considers two types of density, "absolute" and "relative", to identify event clusters that contain either a high density or a high percentage of event tweets. We evaluate our approach using real-world data collected from Twitter, and the experimental results show that the proposed system can not only detect and track events effectively but also discover interesting patterns from geotagged tweets.
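The topic-continuity step can be sketched as follows: each topic in one time window is linked to the closest topic, by KL divergence between topic-word distributions, in the next window. The divergence threshold and array shapes are illustrative assumptions.

```python
# Hedged sketch: link topics across consecutive time windows by minimum KL.
import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes KL(p || q)

def link_topics(topics_t, topics_t1, max_kl=1.0):
    """topics_*: (n_topics, vocab) arrays; rows are smoothed distributions
    summing to 1 (smoothing avoids infinite KL from zero entries)."""
    links = {}
    for i, p in enumerate(topics_t):
        kls = [entropy(p, q) for q in topics_t1]
        j = int(np.argmin(kls))
        if kls[j] < max_kl:          # continue the event only if close enough
            links[i] = j
    return links
```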
Designing effective semantics for dynamic interaction and search has proven to be challenging because of the dynamic nature of semantic searches and of the browsing and visualization interfaces needed for high-volume information. This has a direct impact on enhancing the capabilities of the web. To overcome the challenge of giving meaning to high-volume unstructured datasets, natural language processing (NLP) techniques and tools have proven promising; however, their effectiveness should be studied with respect to the objective of providing meaning to unstructured data. This paper demonstrates the working of five NLP techniques, namely bag-of-words, TF-IDF, NER, LSA, and LDA. The experiments show that the degree to which each technique identifies the meaning of the unstructured data varies from one technique to another. Nevertheless, NLP techniques can be efficient, as they provide insights into the data and make it human-readable, which will in turn help in building better human–machine interactive browsing and applications.
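A hedged sketch of the five techniques, using scikit-learn and spaCy on a placeholder three-sentence corpus (the spaCy model en_core_web_sm must be installed separately); this illustrates common open-source implementations, not the paper's exact setup:

```python
# Minimal demo of bag-of-words, TF-IDF, LSA, LDA, and NER on toy sentences.
import spacy
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation

docs = ["Semantic search helps users browse large document collections.",
        "Visualization interfaces make high-volume web data readable.",
        "Google and Wikipedia expose unstructured text to millions of users."]

bow = CountVectorizer().fit_transform(docs)               # bag-of-words counts
tfidf = TfidfVectorizer().fit_transform(docs)             # TF-IDF weights
lsa = TruncatedSVD(n_components=2).fit_transform(tfidf)   # LSA concept space
lda = LatentDirichletAllocation(n_components=2,
                                random_state=0).fit_transform(bow)  # LDA topics
ents = [(e.text, e.label_)                                # NER entities
        for e in spacy.load("en_core_web_sm")(docs[2]).ents]
```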
In this research, we propose a novel approach to propagation processes on networks carrying textual information, using topic modeling and pretopology theory. We first introduce a textual agent network in which each node represents an agent with specific properties, particularly the agent's interest. An agent's interest is represented by a probability distribution over topics, estimated from textual information using topic modeling. Based on the textual agent network, we propose two information diffusion models. The first model, namely Textual-Homo-IC, extends the independent cascade model so that the probability of infection is based on homophily, measured as the similarity of agents' interests. In addition to formulating Textual-Homo-IC on a static network, we also apply it to dynamic agent networks, where not only the structure but also the nodes' properties change during the spreading process. We conducted experiments on two datasets, collected from NIPS and from the social network platform Twitter, and obtained satisfactory results. We then study the dissemination process on a multi-relational agent network by integrating the pseudo-closure function from pretopology theory into the cascade model. By using pseudo-closure or stochastic pseudo-closure functions to define the set of neighbors, we can capture more complex kinds of neighborhoods of a set. On this basis, we propose the second model, namely Textual-Homo-PCM, which extends the pretopological cascade model, a general model for information diffusion in more complex networks such as multi-relational networks or stochastic graphs. In Textual-Homo-PCM, pretopology theory is applied to determine the neighborhood set of the multi-relational agent network through pseudo-closure functions, and a homophily-based threshold rule is used for activation. Experiments simulating Textual-Homo-PCM yield the expected results. The work in this paper is an extended version of our paper [T. K. T. Ho, Q. V. Bui and M. Bui, Homophily independent cascade diffusion model based on textual information, in Computational Collective Intelligence, eds. N. T. Nguyen, E. Pimenidis, Z. Khan and B. Trawiński, Lecture Notes in Computer Science, Vol. 11055 (Springer International Publishing, 2018), pp. 134–145] presented at the ICCCI 2018 conference.
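One cascade round of the Textual-Homo-IC idea can be sketched as below; cosine similarity of the agents' topic mixtures stands in for the paper's homophily measure, and the networkx attribute name "theta" is an assumption for illustration.

```python
# Hedged sketch: one independent-cascade round with homophily-based infection.
import random
import numpy as np
import networkx as nx

def homophily(theta_u, theta_v):
    """Cosine similarity of two topic-probability vectors (one common choice)."""
    return float(np.dot(theta_u, theta_v) /
                 (np.linalg.norm(theta_u) * np.linalg.norm(theta_v)))

def textual_homo_ic_step(G: nx.Graph, active: set) -> set:
    """Each node's 'theta' attribute holds its topic mixture from topic modeling."""
    newly_active = set()
    for u in active:
        for v in G.neighbors(u):
            if v not in active and v not in newly_active:
                p = homophily(G.nodes[u]["theta"], G.nodes[v]["theta"])
                if random.random() < p:   # infect with probability = homophily
                    newly_active.add(v)
    return active | newly_active
```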
The internationalization of the portfolio company is a key strategy used by private equity (PE) investors to create value and produce returns. In recent years, the focus on value creation through operational improvement has become essential to achieve the exponential growth required of the portfolio company, given low multiples and the market risk of leverage. In this paper, we define the key types of contribution that a PE investor can provide to support the internationalization process, and we assess their effects on the portfolio company's performance. The research is based on a survey administered to 47 PE fund managers, covering 156 deals involving Italian companies. The results offer insight into the contributions to corporate governance, strategy, and management that PE provides in addition to monetary support. The findings show that the non-financial support given to portfolio companies has a positive impact on performance, and that the most impactful contribution PE can make is support through its relational network when the company's strategy involves a foreign direct investment.
Traditional analyses of topic trends in IS journals have relied on manually coding target articles from chosen time periods. However, research efforts have been made to apply automatic bibliometric approaches, such as cluster analysis and probabilistic models, to find topics in academic articles in other research areas. The purpose of this study is thus to investigate research topic trends in Engineering Management (EM) from 1998 through 2017 using an LDA analysis model. By investigating topics in EM journals, we provide partial but meaningful trends in EM research topics. The trend analysis shows that there are hot topics with increasing numbers of articles, steady topics whose numbers remain constant, and cold topics with decreasing numbers of articles.
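One simple way to operationalize the hot/steady/cold labels, not necessarily the paper's exact procedure, is to fit a linear trend to each topic's yearly article counts and label by slope; the toy series and slope cutoffs below are placeholders.

```python
# Hedged sketch: classify topic trends by the slope of yearly article counts.
import numpy as np

years = np.arange(1998, 2018)                     # 20 years, 1998-2017
topic_counts = {"topic A": np.linspace(3, 25, 20),   # placeholder series
                "topic B": np.linspace(20, 4, 20)}

for topic, counts in topic_counts.items():
    slope = np.polyfit(years, counts, 1)[0]       # articles per year
    label = "hot" if slope > 0.5 else "cold" if slope < -0.5 else "steady"
    print(topic, label)
```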
Drug responses vary greatly among individuals due to human genetic variation; the study of this phenomenon is known as pharmacogenomics (PGx). Much PGx knowledge is embedded in the biomedical literature, and there is growing interest in developing text-mining approaches to extract it. In this paper, we present a study that ranks candidate gene-drug relations using a Latent Dirichlet Allocation (LDA) model. Our approach consists of three steps: 1) recognize gene and drug entities in MEDLINE abstracts; 2) extract candidate gene-drug pairs based on different levels of co-occurrence, including the abstract, sentence, and phrase levels; and 3) rank candidate gene-drug pairs using several methods, including term frequency, the Chi-square test, Mutual Information (MI), a previously reported Kullback-Leibler (KL) distance based on topics derived from LDA (LDA-KL), and a newly defined probabilistic KL distance based on LDA (LDA-PKL). We systematically evaluated these methods using a gold-standard data set of gene-drug relations derived from PharmGKB. Our results showed that the proposed LDA-PKL method achieved a better Mean Average Precision (MAP) than any other method, suggesting its promise for ranking and detecting PGx relations.
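A minimal sketch of a KL-based ranking step in the spirit of LDA-KL follows (the paper's exact definitions of LDA-KL and LDA-PKL may differ): score a candidate pair by the symmetrized KL distance between topic mixtures associated with the gene's and the drug's mentions, with smaller distances suggesting stronger candidates.

```python
# Hedged sketch: symmetrized KL distance between two LDA topic mixtures.
import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes KL(p || q)

def sym_kl(p, q):
    return 0.5 * (entropy(p, q) + entropy(q, p))

# Placeholder topic mixtures aggregated over abstracts mentioning each entity.
gene_theta = np.array([0.6, 0.3, 0.1])
drug_theta = np.array([0.5, 0.4, 0.1])

score = sym_kl(gene_theta, drug_theta)   # rank candidate pairs by this score
```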
Although dietary supplements are widely used and are generally considered safe, some supplements have been identified as causative agents of adverse reactions, some of which may even be fatal. The Food and Drug Administration (FDA) is responsible for monitoring supplements and ensuring that they are safe. However, current surveillance protocols are not always effective. Leveraging user-generated textual data, in the form of Amazon.com reviews of nutritional supplements, we use natural language processing techniques to develop a system for monitoring dietary supplements. We use topic modeling techniques, specifically a variation of Latent Dirichlet Allocation (LDA), together with background knowledge in the form of an adverse-reaction dictionary, to score products based on their potential danger to the public. Our approach generates topics that semantically capture adverse reactions from a document set consisting of reviews posted by users of specific products, and based on these topics we propose a scoring mechanism that categorizes products as "high potential danger", "average potential danger", or "low potential danger". We evaluate the system by comparing its categorizations with those of human annotators and find that it agrees with the annotators 69.4% of the time. These results demonstrate that our methods show promise and that the system represents a proof of concept for a viable, low-cost, active approach to dietary supplement monitoring.
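The dictionary-plus-topics scoring idea might look roughly like this sketch, in which a product's danger score is the largest adverse-reaction probability mass found in any of its review topics; the dictionary entries and category cutoffs are invented placeholders, not the paper's scoring mechanism.

```python
# Hedged sketch: score a product by adverse-reaction mass in its LDA topics.
import numpy as np

adverse_terms = {"nausea", "rash", "headache", "palpitations"}  # placeholder

def danger_score(topic_word_probs, vocab):
    """topic_word_probs: (n_topics, vocab_size) rows summing to 1;
    vocab: list of terms aligned with the columns."""
    mask = np.array([w in adverse_terms for w in vocab])
    return float(topic_word_probs[:, mask].sum(axis=1).max())

def danger_category(score, low=0.05, high=0.15):
    return ("low" if score < low else
            "average" if score < high else "high") + " potential danger"
```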