The digital media social behavior decision analysis method based on large-scale emotional data is a complex process involving natural language processing, data mining, machine learning, deep learning and other fields. For digital media social behavior data collection, NLP and CLIP are mainly used to process and analyze user-generated content, including multimedia data such as text, images and videos. These data come from various social media platforms, such as Weibo, WeChat, Douyin (TikTok), Instagram, Twitter and so on. Based on the NLP and CLIP models, this paper analyzes users' social behavior data (such as published content, interaction records and follower relationships) and constructs accurate user portraits. The portrait covers the user's interests, preferences, emotional tendencies and so on, supporting group-level decision judgments about users. At the same time, by analyzing a user's influence on social media (such as the number of followers, re-tweets and likes), the impact of their comments on other people's emotions and behavior is evaluated. Finally, the sentiment tendency of the user's comment text is judged using a predefined vocabulary list (such as a sentiment dictionary) and grammar rules. The experimental results show that the model based on NLP and CLIP proposed in this paper can scientifically analyze the behavior decisions of digital media users. Compared with existing models, it greatly improves computing speed and decision evaluation efficiency under limited computing resources. In experimental testing, the optimal solution probability of the proposed model is 0.96, reaching the level of leading models in the industry without requiring large-scale computational resources.
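As a rough illustration of the final step only, the sketch below scores a comment's sentiment tendency with a small, hypothetical sentiment dictionary and a single negation rule; the word lists and weights are placeholders, not the vocabulary used in the paper.

```python
# Minimal dictionary- and rule-based sentiment tendency scoring (sketch).
# The lexicon and the negation rule below are illustrative placeholders.
SENTIMENT_LEXICON = {"great": 1.0, "love": 1.0, "good": 0.5,
                     "bad": -0.5, "hate": -1.0, "terrible": -1.0}
NEGATIONS = {"not", "never", "no"}

def sentiment_tendency(comment: str) -> str:
    tokens = comment.lower().split()
    score, negate = 0.0, False
    for tok in tokens:
        if tok in NEGATIONS:              # simple grammar rule: flip the next lexicon hit
            negate = True
            continue
        if tok in SENTIMENT_LEXICON:
            value = SENTIMENT_LEXICON[tok]
            score += -value if negate else value
            negate = False
    if score > 0:
        return "positive"
    return "negative" if score < 0 else "neutral"

print(sentiment_tendency("I do not love this update"))  # -> negative
```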
Nowadays, records in various languages are accessible in digitized form. For easy retrieval of these digitized records, they should be assigned to a class according to their content. Text Categorization is an area of Text Mining which helps to overcome this challenge. Text Classification is the act of assigning classes to documents. This paper investigates Text Classification work done on foreign languages, regional languages and a list of books' content. Text available in different languages poses difficulties for NLP approaches. This study shows that supervised ML algorithms such as Logistic Regression, the Naive Bayes classifier, the k-Nearest-Neighbor classifier, Decision Trees and SVMs performed well for Text Classification tasks. The automated document classification technique is useful in day-to-day life to identify the language and the subject area of books based on their text content. We classify several foreign and regional languages, namely Tamil, Telugu, Kannada, Bengali, English, Spanish, French, Russian and German. Here, we use one-versus-all SVMs for multi-class classification with 3-fold cross-validation in all cases and observe that SVMs outperform the other classifiers. The implementation uses hybrid classifiers and reports analyses with soft-margin linear SVMs as well as kernel-based SVMs.
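A minimal sketch of the evaluation setup described above, using scikit-learn's one-vs-rest linear SVM with 3-fold cross-validation; the toy snippets, language labels, and the character n-gram feature choice are assumptions standing in for the multilingual corpus.

```python
# One-versus-all SVM text classification with 3-fold cross-validation (sketch).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# placeholder snippets and language labels standing in for the multilingual corpus
docs = ["hello how are you", "good morning to all", "see you tomorrow",
        "hola como estas", "buenos dias a todos", "hasta manana amigos",
        "bonjour tout le monde", "bonne nuit a tous", "merci beaucoup madame"]
labels = ["en", "en", "en", "es", "es", "es", "fr", "fr", "fr"]

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)),  # character n-grams suit language ID
    OneVsRestClassifier(LinearSVC(C=1.0)),                    # soft-margin linear SVM per class
)
scores = cross_val_score(model, docs, labels, cv=3)
print("3-fold accuracy per fold:", scores, "mean:", scores.mean())
```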
Utilizing deep learning for data mining in biomedicine often falls short in leveraging prior knowledge and adapting to the complexities of biomedical literature mining. Entity recognition, a fundamental task in information extraction, also provides data support for downstream Natural Language Processing (NLP) tasks. Bovine Viral Diarrhea Virus (BVDV) causes considerable economic losses in the cattle industry through calf diarrhea, bovine respiratory syndrome, and cow abortion. This study aims to extract information on BVDV from relevant literature and build a knowledge base. It enhances feature extraction in the BioBERT pre-trained model using the Machine Reading Comprehension (MRC) framework for information fusion, bi-directionally extracts corpus information through a Bi-LSTM network, and uses a CRF layer for decoding and prediction. The results include the construction of a BVDV Corpus with 22 biomedical entities and the BioBERT-Bi-LSTM-CRF Integrated with MRC (BBCM) model for Named Entity Recognition (NER), which combines prior knowledge with the reading comprehension framework. The BBCM model achieves F1-scores of 78.79% and 76.3% on the public datasets JNLPBA and GENIA, respectively, and 67.52% on the BVDV Corpus, outperforming other models. This research presents a targeted NER method for BVDV, effectively identifying related entities and exploring their relationships, thus providing valuable data support for NLP's downstream tasks.
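The CRF decoding step can be illustrated with a small standalone Viterbi pass over emission scores (as would come from the Bi-LSTM) and a tag-transition matrix; the random scores and the three-tag set below are placeholders, not the trained BBCM parameters.

```python
# Viterbi decoding over emission and transition scores (CRF prediction step, sketch).
import numpy as np

def viterbi_decode(emissions: np.ndarray, transitions: np.ndarray) -> list:
    """emissions: (seq_len, n_tags); transitions: (n_tags, n_tags), prev tag -> next tag."""
    seq_len, n_tags = emissions.shape
    score = emissions[0].copy()                       # best score ending in each tag so far
    backpointers = []
    for t in range(1, seq_len):
        # score of moving from every previous tag to every current tag at step t
        total = score[:, None] + transitions + emissions[t][None, :]
        backpointers.append(total.argmax(axis=0))
        score = total.max(axis=0)
    best_tag = int(score.argmax())
    path = [best_tag]
    for bp in reversed(backpointers):                 # trace the best path backwards
        best_tag = int(bp[best_tag])
        path.append(best_tag)
    return path[::-1]

rng = np.random.default_rng(0)
tags = ["O", "B-VIRUS", "I-VIRUS"]                    # placeholder tag set
pred = viterbi_decode(rng.normal(size=(5, 3)), rng.normal(size=(3, 3)))
print([tags[i] for i in pred])
```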
In natural language processing (NLP), a crucial subsystem in a wide range of applications is the part-of-speech (POS) tagger, which labels (or classifies) unannotated words of natural language with POS labels corresponding to categories such as noun, verb or adjective. This paper proposes a uniform-design genetic expression programming (UGEP) model for POS tagging. UGEP is used to search for appropriate structures in the function space of POS tagging problems. After evolving the sequence of tags, GEP finds the best individual as the solution. Experiments on the Brown Corpus show that (1) in closed lexicon tests, the UGEP model achieves a higher accuracy of 98.8%, much better than genetic algorithm, neural network and hidden Markov model (HMM) approaches; and (2) in open lexicon tests, the proposed model also achieves a high accuracy of 97.4% overall and 88.6% on unknown words.
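As a rough illustration of the evaluation protocol (not of GEP itself), the sketch below measures overall and unknown-word tagging accuracy on a Brown Corpus split using a simple NLTK baseline tagger; it assumes the Brown Corpus has been downloaded via nltk, and the split sizes are arbitrary.

```python
# Open-lexicon style tagging evaluation on the Brown Corpus with a baseline tagger (sketch).
# Assumes: nltk.download("brown") has been run; UGEP itself is not reproduced here.
import nltk
from nltk.corpus import brown

tagged = brown.tagged_sents(categories="news")
train, test = tagged[:3000], tagged[3000:3500]

baseline = nltk.UnigramTagger(train, backoff=nltk.DefaultTagger("NN"))

known = {w for sent in train for (w, _) in sent}
total = correct = unk_total = unk_correct = 0
for sent in test:
    words = [w for (w, _) in sent]
    for (word, gold), (_, pred) in zip(sent, baseline.tag(words)):
        total += 1
        correct += gold == pred
        if word not in known:                 # words unseen in training = open-lexicon case
            unk_total += 1
            unk_correct += gold == pred

print(f"overall accuracy: {correct / total:.3f}")
print(f"unknown-word accuracy: {unk_correct / max(unk_total, 1):.3f}")
```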
“Severity” is one of the essential features of software bug reports and a crucial factor for developers to decide which bugs should be fixed immediately and which can be deferred to a later release. Severity assignment is a manual process whose accuracy depends on the experience of the assignee. Prior research has proposed several models to automate this process, based on textual preprocessing of historical bug reports and classification techniques. Although bug repositories suffer from severity class imbalance, none of the prior studies investigated the impact of a class rebalancing technique on the accuracy of their models. In this paper, we propose a framework for predicting fine-grained severity levels which utilizes an over-sampling technique, SMOTE, to balance the severity classes, and a feature selection scheme to reduce the data scale and select the most informative features for training a K-nearest neighbor (KNN) classifier. The KNN classifier uses a distance-weighted voting scheme to predict the proper severity level of a newly reported bug. We investigated the effectiveness of our approach on two large bug repositories, Eclipse and Mozilla, and the experimental results showed that it outperforms cutting-edge studies in predicting the minority severity classes.
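A minimal sketch of the pipeline described above, assuming TF-IDF features over bug report text, chi-squared feature selection, SMOTE over-sampling from imbalanced-learn, and a distance-weighted KNN; the tiny report set and severity labels are placeholders, not the Eclipse or Mozilla data.

```python
# SMOTE rebalancing + feature selection + distance-weighted KNN (sketch).
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.neighbors import KNeighborsClassifier

reports = ["crash on startup", "ui misaligned label", "data loss on save",
           "typo in tooltip", "freeze when opening file", "minor color glitch"]  # placeholders
severity = ["critical", "minor", "critical", "minor", "critical", "critical"]

model = make_pipeline(
    TfidfVectorizer(),
    SelectKBest(chi2, k=10),                       # keep the most informative terms
    SMOTE(k_neighbors=1, random_state=42),         # over-sample the minority severity class
    KNeighborsClassifier(n_neighbors=3, weights="distance"),  # distance-weighted voting
)
model.fit(reports, severity)
print(model.predict(["application crash after update"]))
```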
User-generated content such as comments or tweets (also called social information) accompanying a Web document provides additional information for enriching the content of an event mentioned in sentences. This paper presents a framework named SoSVMRank, which integrates the user-generated content of a Web document to generate a high-quality summary. To do so, summarization is formulated as a learning-to-rank task in which comments or tweets support sentences in a mutual reinforcement fashion. To model the sentence-comment (or tweet) relation, a set of local and social features is proposed. After ranking, the top m ranked sentences and comments (or tweets) are selected as the summary. To validate the efficiency of our framework, sentence and story highlight extraction tasks were taken as a case study on three datasets in two languages, English and Vietnamese. Experimental results indicate that: (i) our new features improve the summarization performance of the framework in terms of ROUGE scores compared to state-of-the-art baselines, and (ii) the integration of user-generated content benefits single-document summarization.
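The sketch below illustrates one plausible social feature of the kind described, namely TF-IDF cosine similarity between each sentence and the accompanying comments, used to rank sentences and pick the top m; it is a simplification on assumed data, not the full SoSVMRank feature set or learned ranker.

```python
# Ranking document sentences by their support from comments (sketch).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ["The storm hit the coastal city overnight.",
             "Officials opened three emergency shelters.",
             "The mayor will hold a press conference."]      # placeholder document
comments = ["Stay safe, the storm looks terrible",
            "Glad the shelters are open"]                    # placeholder social content

vectorizer = TfidfVectorizer().fit(sentences + comments)
sent_vecs = vectorizer.transform(sentences)
comm_vecs = vectorizer.transform(comments)

# social feature: average similarity between a sentence and all comments
support = cosine_similarity(sent_vecs, comm_vecs).mean(axis=1)
m = 2
summary = [sentences[i] for i in np.argsort(-support)[:m]]   # top-m supported sentences
print(summary)
```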
With technological development, human-computer interaction (HCI) has improved, and spoken communication between machines and humans is one way to enhance and expedite this process. Researchers have explored several systems to improve speech and speaker recognition performance in recent decades. A crucial challenge in HCI is developing models that can effectively listen and respond like humans. This has led to automated speech emotion recognition (SER) methods, which recognize various emotional classes by selecting and extracting effective features from speech signals. The fundamental problem of automated speech detection is the considerable variation in speech signals caused by distinct speakers, language differences, speech and content differences, acoustic conditions, and voice modulation differences based on age and gender. With advances in deep learning (DL) and the affordability of computational resources, specifically graphical processing units (GPUs), research underwent a paradigm shift. Therefore, this study develops a multi-class automated speech language recognition using natural language processing with optimal deep learning (MASLR-NLPODL) technique. The MASLR-NLPODL technique aims to efficiently identify different spoken languages. In the MASLR-NLPODL technique, the initial preprocessing stage involves windowing, frame blocking, and a pre-emphasis block. Next, an adaptive time-frequency feature extractor based on the discrete fractional Fourier transform (DFrFT) is applied, which can be obtained by extending the discrete Fourier transform (DFT) with eigenvectors. An improved Harris hawks optimization (IHHO) technique is employed to select effective features. Moreover, the classification of spoken languages is performed by a gated recurrent unit (GRU) model. Finally, a salp swarm algorithm (SSA)-based hyperparameter selection process enhances the performance of the GRU model. The design of the IHHO-based feature selection and SSA-based hyperparameter tuning process demonstrates the novelty of the work. The performance evaluation of the MASLR-NLPODL technique takes place on the VoxForge dataset. The experimental validation of the MASLR-NLPODL technique exhibited a superior accuracy of 96.40% over existing techniques.
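A minimal numpy sketch of the preprocessing block mentioned above (pre-emphasis, frame blocking, and windowing); the frame length, frame shift, and pre-emphasis coefficient are common defaults assumed here, not values taken from the paper.

```python
# Pre-emphasis, frame blocking, and windowing of a speech signal (sketch).
import numpy as np

def preprocess(signal, sample_rate=16000, frame_ms=25, shift_ms=10, alpha=0.97):
    # pre-emphasis: boost high frequencies, y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    frame_len = int(sample_rate * frame_ms / 1000)
    frame_shift = int(sample_rate * shift_ms / 1000)
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // frame_shift)

    # frame blocking: slice the signal into overlapping frames
    frames = np.stack([emphasized[i * frame_shift: i * frame_shift + frame_len]
                       for i in range(n_frames)])

    # windowing: taper each frame with a Hamming window
    return frames * np.hamming(frame_len)

dummy = np.random.default_rng(0).normal(size=16000)   # 1 s of placeholder audio
print(preprocess(dummy).shape)                         # (n_frames, frame_len)
```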
Social media platforms have become vast repositories of user-generated content, offering an abundant data source for sentiment analysis (SA). SA is a natural language processing (NLP) task that identifies the sentiment or emotional tone expressed in a given text. It uses computational techniques to automatically detect and categorize sentiment as negative, positive, or neutral. Aspect-based SA (ABSA) systems leverage machine learning (ML) approaches to discern nuanced opinions within the text, breaking sentiment down by particular attributes or aspects of the subject matter. Businesses and researchers can gain deep insights into brand perception, public opinion, and product feedback by combining social media data with ABSA methodologies. This enables the extraction of sentiment polarity and more actionable, targeted insights. By applying ML approaches trained on the abundance of social media data, organizations can identify areas for improvement, tailor their strategies to their audience's evolving needs and preferences, and better understand customer sentiment. In this view, this study develops a new Fractal Snow Ablation Optimizer with Bayesian Machine Learning for Aspect-Level Sentiment Analysis (SAOBML-ALSA) technique for social media. The SAOBML-ALSA approach examines social media content to classify sentiments into distinct classes. In the first stage, the SAOBML-ALSA technique preprocesses the input social media content to transform it into a meaningful format. This is followed by a LeBERT-based word embedding process. The SAOBML-ALSA technique applies a Naïve Bayes (NB) classifier for aspect-level sentiment analysis. Eventually, the parameter selection of the NB classifier is performed using the SAO technique. The performance of the SAOBML-ALSA methodology was evaluated on a benchmark database. The experimental results show that the SAOBML-ALSA technique exhibits promising performance compared to other models.
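As a rough sketch of the classification stage only, the code below trains a Gaussian Naive Bayes classifier on dense sentence vectors; the random vectors stand in for the LeBERT embeddings, the tiny aspect-level label set is a placeholder, and the optimizer-driven parameter selection is not reproduced.

```python
# Naive Bayes aspect-level sentiment classification over dense embeddings (sketch).
import numpy as np
from sklearn.naive_bayes import GaussianNB

# placeholder embeddings: in the described pipeline these would come from LeBERT
rng = np.random.default_rng(1)
embeddings = rng.normal(size=(8, 32))                 # 8 aspect mentions, 32-dim vectors
labels = ["positive", "negative", "positive", "neutral",
          "negative", "positive", "neutral", "negative"]

clf = GaussianNB()                                    # var_smoothing could be tuned by an optimizer
clf.fit(embeddings, labels)
print(clf.predict(rng.normal(size=(1, 32))))
```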
The volume of data shared has surged rapidly with the usage of social networks, and this has become a crucial research area for environmental concerns. Sentiment analysis (SA) characterizes people's behavior and sensitivity to environmental problems. Ubiquitous learning and natural language processing (NLP) intersect to address climate change SA by analyzing vast textual data from scientific reports, social media, and news to gauge public attitudes and perceptions. This provides meaningful insight for organizations and policymakers working on climate change mitigation systems in rural areas and allows local communities to engage actively in environmental discourse, fostering a bottom-up approach to addressing climate problems. This study develops a new Twitter climate change sentiment analysis using the Bayesian machine learning (TCCSA-BML) technique to promote sustainable development in rural areas. This technique exploits ubiquitous learning with NLP technologies to identify climate change sentiment in rural areas. Also, the TCCSA-BML technique performs data preprocessing in several ways to make the input data compatible with processing. Besides, the TCCSA-BML technique utilizes the TF-IDF model for the word embedding process. Moreover, the classification of various kinds of sentiments is carried out using the Bayesian model averaging (BMA) technique comprising three classifiers, namely attention long short-term memory (ALSTM), extreme learning machine (ELM), and gated recurrent unit (GRU). Finally, the parameter tuning of the classifiers is implemented by the coyote optimization algorithm (COA) model. The performance analysis of the TCCSA-BML approach is evaluated on the Kaggle SA dataset. The experimental validation of the TCCSA-BML approach portrayed a superior accuracy value of 94.07% over other models.
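The ensemble step can be sketched as a posterior-weighted average of the three base models' class probabilities; the weights here are derived from assumed validation accuracies as a simple stand-in for BMA posterior weights, and the probability arrays are placeholders for ALSTM, ELM, and GRU outputs.

```python
# Posterior-weighted averaging of three classifiers' predictions (BMA-style sketch).
import numpy as np

def bma_predict(prob_list, weights):
    """prob_list: per-model arrays of shape (n_samples, n_classes)."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()                       # normalize to posterior-like weights
    avg = sum(w * p for w, p in zip(weights, prob_list))
    return avg.argmax(axis=1)                               # most probable class per sample

# placeholder class-probability outputs of ALSTM, ELM and GRU on 4 tweets, 3 classes
p_alstm = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4], [0.5, 0.4, 0.1]])
p_elm   = np.array([[0.6, 0.3, 0.1], [0.2, 0.6, 0.2], [0.2, 0.5, 0.3], [0.4, 0.4, 0.2]])
p_gru   = np.array([[0.8, 0.1, 0.1], [0.1, 0.7, 0.2], [0.1, 0.2, 0.7], [0.6, 0.3, 0.1]])

val_accuracy = [0.90, 0.86, 0.92]                           # assumed validation scores per model
print(bma_predict([p_alstm, p_elm, p_gru], val_accuracy))   # predicted class indices
```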
This paper describes a new approach to English-to-Bangla translation and the English-to-Bangla translator built on it. The system (BANGANUBAD) takes a paragraph of English sentences as input and produces equivalent Bangla sentences. The BANGANUBAD system comprises a preprocessor, a morphological recursive parser, a semantic parser using an English word ontology for context disambiguation, an electronic lexicon associated with grammatical information, and a discourse processor. It also employs a lexical disambiguation analyzer. This system does not rely on a stochastic approach. Rather, it is based on a special kind of amalgamated architecture combining transformer and rule-based NLE architectures along with various linguistic knowledge components.
A literature review on sarcasm detection has been carried out in this research work. To provide a meaningful study of existing works on sarcasm detection, a total of 65 research papers have been analyzed in diverse aspects such as the datasets utilized, language, pre-processing techniques, types of features, feature extraction techniques, and machine learning/deep learning-based sarcasm classification. These papers come from diverse international as well as national journals. Moreover, the performance of each work in terms of accuracy, F-score and recall is also presented. To show the superiority of the works, a comparative evaluation has been performed on the analyzed performance of each work. Finally, the works that achieve superior or improved results are highlighted. In addition, the current challenges faced by sarcasm detection systems are portrayed, which will be a milestone for future researchers.
In this paper, we investigate the dynamics of the social media response on Reddit to the COVID-19 pandemic during its first year (February 2020–2021). The emergence of region-specific subreddits allows us to compare the reactions of individuals posting their opinions on social media about the global pandemic from two perspectives — the UK and the US.
In particular, we look at the volume of posts and comments on these two subreddits, and at the sentiment expressed in these posts and comments over time, as measures of the public level of engagement and response. While an analysis of volume allows us to quantify how interested people are in the pandemic as it unfolds, sentiment analysis goes beyond this and informs us about how people respond to the pandemic based on the textual content of the posts and comments. The research develops a framework for analyzing the social response on Reddit to a large-scale event in terms of the level of engagement, measured through post and comment volumes, and opinion, measured through an analysis of sentiment applied to the post content. In order to compare the subreddits, we show the trend in the time series through the application of moving average methods. We also show how to identify the lag between time series and align them using cross-correlation. Moreover, once aligned, we apply moving correlations to the time series to measure their degree of correspondence and see whether there is a similar response to the pandemic across the two groups (UK and US). The results indicate that both subreddits were posting in high volumes at specific points during the pandemic and that, despite the generally negative sentiment in the posts and comments, a gradual decrease in negativity leading up to the start of 2021 is observed as measures are put in place by governments and organizations to contain the virus and provide necessary support to the affected populations.
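A minimal pandas/numpy sketch of the time-series steps described above: smoothing two daily sentiment series with a moving average, estimating the lag between them via cross-correlation, aligning them, and computing a moving correlation; the synthetic series stand in for the UK and US subreddit data, and the window sizes are assumed.

```python
# Moving average, cross-correlation lag estimation, and moving correlation (sketch).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = np.sin(np.linspace(0, 6 * np.pi, 365))
uk = pd.Series(base + rng.normal(0, 0.3, 365))              # placeholder daily UK sentiment
us = pd.Series(np.roll(base, 7) + rng.normal(0, 0.3, 365))  # same signal delayed 7 days (US)

# 14-day moving averages show the underlying trend in each series
uk_trend = uk.rolling(window=14, center=True).mean().dropna()
us_trend = us.rolling(window=14, center=True).mean().dropna()

# lag at which the cross-correlation of the de-meaned trends peaks
xcorr = np.correlate(uk_trend - uk_trend.mean(), us_trend - us_trend.mean(), mode="full")
lag = int(xcorr.argmax() - (len(uk_trend) - 1))
print("estimated lag (days):", lag)

# align the raw series by the estimated lag, then track a 30-day moving correlation
rolling_corr = uk.rolling(window=30).corr(us.shift(lag))
print("mean moving correlation after alignment:", round(rolling_corr.mean(), 2))
```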
Electronic Health Record (EHR) systems in healthcare organisations are primarily maintained in isolation from each other, which makes interoperability of the unstructured (text) data stored in these EHR systems challenging in the healthcare domain. Similar information may be described using different terminologies by different applications, which can be avoided by transforming the content into the Resource Description Framework (RDF) model, which is interoperable amongst organisations. RDF requires a document's contents to be translated into a repository of triplets (subject, predicate, object) known as RDF statements. Natural Language Processing (NLP) techniques can help derive actionable insights from these text data and create triplets for RDF model generation. This paper discusses two NLP-based approaches to generate RDF models from unstructured patient documents, namely a dependency structure-based and a constituent (phrase) structure-based parser. Models generated by both approaches are evaluated in two aspects: the exhaustiveness of the represented knowledge and the model generation time. The precision measure is used to compute the models' exhaustiveness in terms of the number of facts that are transformed into RDF representations.
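A rough sketch of the dependency-structure idea, assuming spaCy's small English model for parsing and rdflib for the RDF model; the subject-verb-object extraction rule is heavily simplified, and the namespace and clinical sentences are placeholders.

```python
# Dependency-parse based triple extraction into an RDF graph (sketch).
# Assumes: python -m spacy download en_core_web_sm has been run.
import spacy
from rdflib import Graph, Literal, Namespace

nlp = spacy.load("en_core_web_sm")
EX = Namespace("http://example.org/ehr/")            # hypothetical namespace

def text_to_rdf(text: str) -> Graph:
    graph = Graph()
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ != "VERB":
                continue
            subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.children if c.dep_ in ("dobj", "obj", "attr")]
            for s in subjects:
                for o in objects:
                    # (subject, predicate, object) triple from the dependency structure
                    graph.add((EX[s.lemma_], EX[token.lemma_], Literal(o.lemma_)))
    return graph

g = text_to_rdf("The patient reports chest pain. The doctor prescribed aspirin.")
print(g.serialize(format="turtle"))
```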
Remarkably, Singapore, one of today's hotspots for bioinformatics and computational biology research, appeared de novo out of the pioneering efforts of engaged local individuals in the early 1990s and, supported by increasing public funds from 1996 on, morphed into the present vibrant research community. This article brings to mind the pioneers, their first successes and the early institutional developments.
In many fields of science, IT applications and business environments have successfully evolved systems to receive vast amounts of electronic data and information. Due to this increase, much recent research has tried to find a solution to the crisis of information overload. These solutions combine techniques from data mining, machine learning, natural language processing, information retrieval, information extraction, and knowledge management. A great challenge is how to exploit these information and knowledge resources and turn them into useful knowledge available to the people concerned. The value of knowledge increases when people can share and capitalize on it. Thus, approaches that can help researchers benefit from existing hidden knowledge are needed. For this, tools that can analyze, extract and explore relevant and useful information together with its relations are required. The main contribution of this paper is to integrate XML technology with text analysis to introduce an efficient concept-based structure model, where this model can represent text in a form that can be easily understood, shared, managed and mined. This paper describes an efficient object-oriented text analysis (OOTA) approach that generates an object-oriented model transforming unstructured text into a specific structured form stored in XML format. The experimental results show that this approach yields promising results.
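A small sketch of the general idea of mapping analyzed text into an XML-based structure using Python's standard library; the element names and the toy concept detection are illustrative assumptions, not the OOTA schema.

```python
# Transforming analyzed text into a structured XML representation (sketch).
import xml.etree.ElementTree as ET

def text_to_xml(doc_id: str, text: str, keywords: set) -> str:
    root = ET.Element("document", attrib={"id": doc_id})
    for i, sentence in enumerate(s.strip() for s in text.split(".") if s.strip()):
        sent_el = ET.SubElement(root, "sentence", attrib={"index": str(i)})
        ET.SubElement(sent_el, "text").text = sentence
        concepts = ET.SubElement(sent_el, "concepts")
        for word in sentence.lower().split():
            if word in keywords:                     # toy concept detection by keyword lookup
                ET.SubElement(concepts, "concept").text = word
    return ET.tostring(root, encoding="unicode")

sample = "Data mining extracts knowledge. Knowledge sharing increases value."
print(text_to_xml("d1", sample, keywords={"knowledge", "mining", "value"}))
```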
This paper discusses principles for the design of natural language processing (NLP) systems to automatically extract data from doctors' notes, laboratory results and other medical documents in free-form text. We argue that rather than searching for "atomic units of meaning" in the text and then trying to generalize them to a broader set of documents through increasingly complicated systems of rules, an NLP practitioner should treat concepts as a whole and as the meaningful unit of text. This simplifies the rules and makes the NLP system easier to maintain and adapt. The departure point is purely practical; however, a deeper investigation of typical problems with the implementation of such systems leads us to a discussion of broader linguistic theories underlying NLP practice, such as theories of metaphor and models of human communication.
Cybersecurity is becoming indispensable for everyone and everything in the time of the Internet of Things (IoT) revolution. Every aspect of human society, be it political, financial, technological, or cultural, is affected by cyber-attacks or incidents in one way or another. Newspapers are an excellent source that captures this web of cybersecurity. By implementing various NLP techniques such as tf-idf, word embedding and sentiment analysis (SA) (a machine learning method), this research examines cybersecurity-related articles from 18 major newspapers (English-language online versions) from six countries (three newspapers from each country) collected over one year, from April 2018 to March 2019. The first objective is to extract the crucial events from each country, which we achieve in our first step, information extraction. The next objective is to find out what kind of sentiments those crucial issues garnered, which we accomplish in our second step, SA. SA of news articles also helps in understanding each nation's mood on critical cybersecurity issues, which can aid decision-makers in charting new policies.
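The information extraction step can be sketched with tf-idf: the highest-weighted terms in each country's collection of articles act as crucial-issue keywords; the two one-sentence "articles" and country codes below are placeholders, not the newspaper corpus.

```python
# tf-idf based extraction of salient cybersecurity terms per country (sketch).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

articles = {
    "US": "data breach exposes voter records ahead of the election",
    "IN": "banking malware campaign targets mobile payment users",
}  # placeholder one-article-per-country corpus

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(articles.values())
terms = np.array(vectorizer.get_feature_names_out())

for country, row in zip(articles, matrix.toarray()):
    top = terms[row.argsort()[::-1][:3]]          # three highest-weighted terms per country
    print(country, "->", list(top))
```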
The calculation of similarity is an important process in NLP. As the basis of sentence and article similarity, research on word similarity calculation is especially vital. There are two kinds of word similarity calculation methods: semantic-based and statistics-based. Previous semantic-based methods mostly rely on WordNet or HowNet. This article introduces a new method which is based on CSD and which also incorporates a statistics-based method. It offers good support for further processing of corpora.
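For reference, the snippet below shows the two classic families: a WordNet-based semantic measure (path similarity via NLTK) and a simple statistics-based measure (cosine over co-occurrence counts from a toy corpus). It illustrates the baselines, not the CSD-based method itself, and assumes the WordNet data has been downloaded.

```python
# Semantic-based vs statistics-based word similarity (sketch).
# Assumes: nltk.download("wordnet") has been run.
import math
from collections import Counter
from nltk.corpus import wordnet as wn

def wordnet_similarity(w1, w2):
    scores = [a.path_similarity(b) or 0.0
              for a in wn.synsets(w1) for b in wn.synsets(w2)]
    return max(scores, default=0.0)

def cooccurrence_similarity(w1, w2, sentences, window=2):
    def vector(word):
        counts = Counter()
        for sent in sentences:
            toks = sent.lower().split()
            for i, tok in enumerate(toks):
                if tok == word:
                    for j in range(max(0, i - window), min(len(toks), i + window + 1)):
                        if j != i:
                            counts[toks[j]] += 1       # count context words in the window
        return counts
    v1, v2 = vector(w1), vector(w2)
    dot = sum(v1[k] * v2[k] for k in v1)
    norm = math.sqrt(sum(c * c for c in v1.values())) * math.sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0

corpus = ["the cat chased the mouse", "the dog chased the cat", "a mouse ate the cheese"]
print(wordnet_similarity("cat", "dog"))
print(cooccurrence_similarity("cat", "dog", corpus))
```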
Events represent a crucial source of information in any text; dates and proper names (people, locations, organizations, products, diseases, etc.) combine to describe a certain event. In this paper, our focus is on these events, on how we find them in the text, extract them, organize them, save them in a database, and use them to answer any question about them. In our system, each event has three main types of identifying factors: keywords that indicate that an interesting event has occurred, dates that specify the time of the event, and proper nouns that specify the people, organizations, locations, products, etc., that are involved in the event. In this paper, we describe a system that uses these keys to find events, classify them, and save them in a database along with the identifier of the document that mentions that event. The retrieval process uses this information to provide the user with menus to form queries about the events; it then executes those queries, and finds the related documents.
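The keying and storage scheme can be sketched with only the standard library: a date regex, a small hypothetical keyword list, a naive capitalized-word heuristic for proper nouns, and an SQLite table keyed by document identifier; all names and the sample document are placeholders.

```python
# Keyword/date/proper-noun event extraction into an SQLite event database (sketch).
import re
import sqlite3

EVENT_KEYWORDS = {"merger", "outbreak", "earthquake", "acquisition"}   # hypothetical keys
DATE_RE = re.compile(r"\b(?:\d{1,2}\s)?(?:January|February|March|April|May|June|July|"
                     r"August|September|October|November|December)\s\d{4}\b")

def extract_events(doc_id, text):
    events = []
    for sentence in text.split("."):
        keywords = [k for k in EVENT_KEYWORDS if k in sentence.lower()]
        if not keywords:
            continue
        dates = DATE_RE.findall(sentence)
        # crude proper-noun heuristic: capitalized tokens not at sentence start
        proper = [t for t in sentence.split()[1:] if t.istitle()]
        events.append((doc_id, ",".join(keywords), ",".join(dates), ",".join(proper)))
    return events

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (doc_id TEXT, keywords TEXT, dates TEXT, entities TEXT)")
doc = "Acme announced a merger with Globex in March 2021. Shares rose afterwards."
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", extract_events("doc-42", doc))
print(conn.execute("SELECT * FROM events").fetchall())
```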
Sentiment Analysis (SA) is a computational study that examines people's opinions, attitudes, and emotions based on their written text. Keralites' mother tongue, Malayalam, is the language they most often use to express themselves on Twitter. As there is no automatic sentiment analyzer for Malayalam, SA of such Twitter messages is necessary. In this research work, a Malayalam sentiment analysis model is introduced. The input raw data, in the form of reviews, are fed to the pre-processing stage. The pre-processing module includes sentence tokenization via Sandhi splitting and part-of-speech (POS) tagging. Subsequently, Bag of Words, the proposed weightage-based Term Frequency-Inverse Document Frequency, unigrams with a dictionary, and unigrams with a dictionary including negation words are considered for feature vector formation. Finally, review classification is performed by the proposed hybrid classifier, constructed by hybridizing a Long Short-Term Memory (LSTM) network and a Convolutional Neural Network (CNN). To enhance classification performance, the weights of the CNN classifier are fine-tuned via Improved Bumble Bee Mating Optimization (IBBMO), an advanced version of standard BBMO. The performance of the proposed work is compared with other conventional models in terms of positive measures, negative measures, convergence, and other metrics. The accuracy of the proposed method is 95.436, which is much better than existing works such as DBN = 86.25, NN = 87.080, SVM = 87.234, WOA+Hybrid Classifier = 88.440, LA+Hybrid Classifier = 89.35285, CSO+Hybrid Classifier = 92.021, BBMO+Hybrid Classifier = 93.4248, and ML techniques = 94.014, at the 90th training percentage.
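A minimal Keras sketch of a hybrid CNN-plus-LSTM text classifier of the kind hybridized above; the vocabulary size, sequence length, layer sizes, and random training data are assumed values, and the IBBMO weight tuning is not reproduced.

```python
# Hybrid CNN + LSTM sentiment classifier (architecture sketch, assumed hyperparameters).
import numpy as np
from tensorflow.keras import layers, models

vocab_size, seq_len = 5000, 60            # placeholder vocabulary and padded review length

model = models.Sequential([
    layers.Input(shape=(seq_len,)),
    layers.Embedding(vocab_size, 64),
    layers.Conv1D(64, kernel_size=3, activation="relu"),   # local n-gram features (CNN part)
    layers.MaxPooling1D(pool_size=2),
    layers.LSTM(32),                                        # sequential context (LSTM part)
    layers.Dense(1, activation="sigmoid"),                  # positive vs negative review
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()

# toy training call on random token ids and labels, standing in for Malayalam reviews
x = np.random.randint(0, vocab_size, size=(16, seq_len))
y = np.random.randint(0, 2, size=(16,))
model.fit(x, y, epochs=1, verbose=0)
```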