The wide range of digital educational resources calls for an accurate and efficient method for categorizing and recommending English teaching materials. An automatic classification and recommendation system has been created and implemented using Natural Language Processing (NLP) techniques. Data from essays produced by English Language Learners (ELLs) in grades 8 through 12, together with components including content, competency levels and score notes, are organized for this study. Models for precise language proficiency assessment are to be developed to enhance automated feedback mechanisms. Preprocessing methods such as stop-word removal, stemming and tokenization were applied to clean the data. Term Frequency–Inverse Document Frequency (TF–IDF) and word embeddings were the two strategies used in the feature extraction process to convert textual data into numerical vectors. These vectors were then fed into the recently developed Support Vector Machine–Neural Network–Genetic Algorithm (SVM–NN–GA) classifier. The model’s performance was evaluated using F1-measure, accuracy, precision and recall metrics. Collaborative filtering and content-based filtering methods were studied for the recommendation system. In contrast to collaborative filtering, which used user interaction data to identify patterns and suggest relevant items, content-based filtering matched materials with user preferences based on attributes derived from the NLP models. A hybrid recommendation system combines the two approaches, increasing the personalization and relevance of recommendations. The results demonstrate that the combination of NLP-based categorization and hybrid recommendation markedly improves the effectiveness of selecting appropriate teaching materials, helping teachers enhance the learning process.
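To make the TF-IDF feature extraction and classification step concrete, here is a minimal Python sketch; a plain linear SVM stands in for the study's SVM–NN–GA hybrid, and the essays and labels are invented placeholders rather than the actual ELL data.

```python
# Minimal sketch of TF-IDF vectorization plus classification, with a plain
# linear SVM as a stand-in for the paper's SVM-NN-GA hybrid classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Hypothetical essays and proficiency labels; the real study uses ELL essays (grades 8-12).
essays = ["The experiment shows how plants grow toward light.",
          "Me and my friend goes to school every day.",
          "Climate change has far-reaching consequences for agriculture.",
          "He don't like the homework because it are hard."]
labels = ["advanced", "beginner", "advanced", "beginner"]

X_train, X_test, y_train, y_test = train_test_split(
    essays, labels, test_size=0.5, random_state=42, stratify=labels)

# TF-IDF with English stop-word removal, mirroring the preprocessing described above.
clf = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
clf.fit(X_train, y_train)

# Precision/recall/F1 metrics mirror those reported in the study.
print(classification_report(y_test, clf.predict(X_test), zero_division=0))
```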
Sarcasm is a linguistic expression that conveys the opposite of what is stated, often used for mocking or offending, and it appears on social media platforms every day. The opinion analysis process is susceptible to errors because sarcasm can invert a statement’s meaning. As automated social media research tools become more prevalent, the reliability problems of such analytics have also increased. According to prior studies, sarcastic posts alone have greatly diminished automatic Sentiment Analysis (SA) performance on complex systems platforms. Sarcasm detection using Deep Learning (DL) involves training models to identify the nuanced linguistic cues that indicate sarcasm in text. Typically, this process uses large datasets annotated with sarcastic and non-sarcastic samples to teach models to discriminate between them. DL methodologies, namely Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and transformer methods like BERT or GPT, are widely applied due to their ability to capture intricate patterns in language. Such models learn to detect sarcasm by discriminating exaggerated expressions, contextual incongruities, and semantic reversals frequently associated with sarcastic remarks. Therefore, this study presents a Fractal Red-Tailed Hawk Algorithm with Hybrid Deep Learning-Driven Sarcasm Detection (RTHHDL-SD) technique for complex systems and social media platforms. The purpose of the RTHHDL-SD technique is to identify and classify occurrences of sarcasm in social media text. In the RTHHDL-SD approach, data preprocessing is performed in four steps to transform the input data into a usable format. The RTHHDL-SD technique then applies the FastText approach to generate word embeddings. For sarcasm detection, it uses a Deep Neural Network (DNN) with bi-directional long short-term memory, called the deep BiLSTM model. The RTH method is utilized as the hyperparameter optimizer to enhance the detection performance of the deep BiLSTM model. Moreover, a large language model is used to estimate the outcomes on the social media corpora. The simulation outcomes of the RTHHDL-SD methodology are examined on the Twitter and Headlines datasets. The experimental results show that the RTHHDL-SD methodology achieves superior accuracy values of 89.10% and 92.77% compared with other approaches.
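As an illustration of the BiLSTM classification stage only, the following minimal Keras sketch trains a generic bidirectional LSTM on dummy integer-encoded text; the FastText embeddings, RTH hyperparameter optimization and LLM scoring of the RTHHDL-SD technique are not reproduced, and all sizes are illustrative assumptions.

```python
# Minimal sketch of a BiLSTM sarcasm classifier in Keras. A trainable embedding
# layer and fixed hyperparameters stand in for the paper's FastText embeddings
# and RTH-optimized settings.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, MAX_LEN, EMBED_DIM = 10_000, 40, 100

model = tf.keras.Sequential([
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    layers.Bidirectional(layers.LSTM(64)),       # BiLSTM encoder
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),       # sarcastic vs. non-sarcastic
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Dummy integer-encoded tweets/headlines; real inputs would come from the Twitter
# and Headlines datasets after the four preprocessing steps mentioned above.
X = np.random.randint(1, VOCAB_SIZE, size=(256, MAX_LEN))
y = np.random.randint(0, 2, size=(256, 1))
model.fit(X, y, epochs=1, batch_size=32, verbose=0)
```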
Automated sentiment analysis is becoming increasingly recognized due to the growing importance of social media and e-commerce platform review websites. Deep neural networks outperform traditional lexicon-based and machine learning methods by effectively exploiting contextual word embeddings to generate dense document representation. However, this representation model is not fully adequate to capture topical semantics and the sentiment polarity of words. To overcome these problems, a novel sentiment analysis model is proposed that utilizes richer document representations of word-emotion associations and topic models, which is the main computational novelty of this study. The sentiment analysis model integrates word embeddings with lexicon-based sentiment and emotion indicators, including negations and emoticons, and to further improve its performance, a topic modeling component is utilized together with a bag-of-words model based on a supervised term weighting scheme. The effectiveness of the proposed model is evaluated using large datasets of Amazon product reviews and hotel reviews. Experimental results prove that the proposed document representation is valid for the sentiment analysis of product and hotel reviews, irrespective of their class imbalance. The results also show that the proposed model improves on existing machine learning methods.
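The following minimal sketch illustrates the general idea of enriching a bag-of-words representation with topic-model features; plain TF-IDF, LDA topics and a logistic regression stand in for the paper's supervised term weighting, word-emotion lexicon and emoticon handling, and the reviews and labels are invented.

```python
# Minimal sketch: concatenate TF-IDF features with LDA topic proportions and
# train a simple sentiment classifier on the combined representation.
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

reviews = ["Great battery life, totally worth the price.",
           "The room was dirty and the staff was rude.",
           "Fast shipping and excellent build quality.",
           "Terrible breakfast, I would not stay here again."]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative (hypothetical)

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(reviews)

counts = CountVectorizer().fit_transform(reviews)
topics = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)

# Concatenate sparse TF-IDF features with dense topic proportions.
X = hstack([X_tfidf, csr_matrix(topics)])
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))
```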
Large-scale distributed computing infrastructures ensure the operation and maintenance of scientific experiments at the LHC: more than 160 computing centers all over the world execute tens of millions of computing jobs per day. ATLAS — the largest experiment at the LHC — creates an enormous flow of data which has to be recorded and analyzed by a complex, heterogeneous and distributed computing environment. Statistically, about 10–12% of computing jobs end in failure: network faults, service failures, authorization failures, and other error conditions trigger error messages which provide detailed information about the issue and can be used for diagnosis and proactive fault handling. However, this analysis is complicated by the sheer scale of textual log data, and often exacerbated by the lack of a well-defined structure: human experts have to interpret the detected messages and create parsing rules manually, which is time-consuming and does not allow identifying previously unknown error conditions without further human intervention. This paper describes a pipeline of methods for the unsupervised clustering of multi-source error messages. The pipeline is data-driven, based on machine learning algorithms, and executed fully automatically, allowing error messages to be categorized according to their textual patterns and meaning.
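As a rough illustration of clustering error messages by textual pattern, the sketch below vectorizes a few made-up log messages with TF-IDF and groups them with k-means; the actual pipeline is multi-stage and its specific vectorization and clustering algorithms may differ.

```python
# Minimal sketch of unsupervised clustering of error messages by textual pattern.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

errors = [
    "Connection to storage element timed out after 300 seconds",
    "Authorization failed: proxy certificate expired",
    "Connection to storage element refused by remote host",
    "Authorization failed: user not found in VO",
    "Transfer aborted: checksum mismatch on output file",
]

X = TfidfVectorizer().fit_transform(errors)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Messages sharing a textual pattern end up in the same cluster.
for label, msg in sorted(zip(km.labels_, errors)):
    print(label, msg)
```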
Machine learning (ML) architectures based on neural models have garnered considerable attention in the field of language classification. Code-mixing, the practice of mixing two or more languages in a single text, is a common phenomenon on social networking sites when expressing opinions on a topic. This paper describes the application of a code-mixing index to Indian social media texts and examines the complexity of identifying the language at the word level using a Bi-directional Long Short-Term Memory (BiLSTM) model. The major contribution of the work is a technique for identifying the language of Hindi–English code-mixed data from three social media platforms, namely Facebook, Twitter and WhatsApp. We demonstrate that the BiLSTM model is capable of learning and accurately predicting the languages used in social media texts.
This paper aims to address several problems in research on text emotion analysis: low utilization of the text, the difficulty of extracting effective information, and the failure to recognize word polysemy effectively. To this end, a text sentiment analysis method based on BERT and LSTM is adopted. Specifically, word embeddings for the dataset are trained using the skip-gram model. For each sample, the word embeddings are combined into a two-dimensional feature matrix that serves as the neural network input. Next, a text sentiment analysis model is constructed by combining the BERT pre-trained language model with a long short-term memory (LSTM) network; the word vectors pre-trained by BERT, rather than those trained in the traditional way, are used to dynamically generate semantic vectors according to word context. Finally, the semantic representation of words in the text is improved by effectively identifying their polysemy, and the semantic vectors are fed into the LSTM to capture semantic dependencies, thereby enhancing the ability to extract valid information. The Accuracy, Precision, Recall and F-Measure of the BERT–LSTM sentiment analysis method are 0.89, 0.9, 0.84 and 0.87, respectively, higher than those of the compared methods. Thus, the proposed method significantly outperforms the comparison methods in text sentiment analysis.
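A minimal sketch of the BERT-to-LSTM arrangement described above follows; the frozen encoder, hidden sizes and toy inputs are illustrative assumptions rather than the configuration of the original study.

```python
# Minimal sketch of feeding BERT contextual word vectors into an LSTM classifier.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

class BertLSTMClassifier(nn.Module):
    def __init__(self, hidden=128, num_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(bert.config.hidden_size, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, input_ids, attention_mask):
        with torch.no_grad():                       # BERT used as a frozen encoder here
            out = bert(input_ids=input_ids, attention_mask=attention_mask)
        _, (h_n, _) = self.lstm(out.last_hidden_state)
        return self.fc(h_n[-1])                     # final LSTM state -> class logits

texts = ["The plot was dull and predictable.", "Absolutely loved this film!"]
enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
logits = BertLSTMClassifier()(enc["input_ids"], enc["attention_mask"])
print(logits.softmax(dim=-1))
```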
In the era of information overload, text summarization has become a focus of attention in a number of diverse fields such as question answering systems, intelligence analysis, news recommendation systems, and search results in web search engines. A good document representation is the key to any successful summarizer, and learning this representation has become a very active research area in natural language processing (NLP). Traditional approaches mostly fail to deliver a good representation, whereas word embeddings have shown excellent performance in learning representations. In this paper, a modified BM25 weighting combined with word embeddings is used to build sentence vectors from word vectors, and the entire document is represented as a set of sentence vectors. Then, the similarity between every pair of sentence vectors is computed. After that, TextRank, a graph-based model, is used to rank the sentences. The summary is generated by picking the top-ranked sentences according to the compression rate. Two well-known datasets, DUC2002 and DUC2004, are used to evaluate the models. The experimental results show that the proposed models perform consistently better than the state-of-the-art methods.
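The rank-and-pick stage can be illustrated as follows; TF-IDF sentence vectors stand in for the paper's BM25-weighted word-embedding sentence vectors, and the sentences and compression rate are invented.

```python
# Minimal sketch: sentence vectors -> pairwise similarity graph -> PageRank
# (TextRank) -> pick the top-ranked sentences according to a compression rate.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The storm caused widespread flooding across the coastal region.",
    "Emergency services evacuated hundreds of residents overnight.",
    "Local schools will remain closed for the rest of the week.",
    "Officials estimate the damage at several million dollars.",
]

vectors = TfidfVectorizer().fit_transform(sentences)
sim = cosine_similarity(vectors)

graph = nx.from_numpy_array(sim)          # nodes = sentences, edge weights = similarity
scores = nx.pagerank(graph)

compression_rate = 0.5                    # keep the top 50% of sentences
k = max(1, int(len(sentences) * compression_rate))
top = sorted(scores, key=scores.get, reverse=True)[:k]
summary = " ".join(sentences[i] for i in sorted(top))
print(summary)
```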
Formula retrieval is an important research topic in Mathematical Information Retrieval (MIR). Most studies have focused on formula comparison to determine the similarity between mathematical documents. However, two similar formulae may appear in entirely different knowledge domains and have different meanings. Based on the N-ary Tree-based Formula Embedding Model (NTFEM), our previous work [Y. Dai, L. Chen, and Z. Zhang, An N-ary tree-based model for similarity evaluation on mathematical formulae, in Proc. 2020 IEEE Int. Conf. Systems, Man, and Cybernetics, 2020, pp. 2578–2584], we introduce a new hybrid retrieval model, NTFEM-K, which combines formulae with their surrounding keywords for more accurate retrieval. Using keyword extraction techniques, we extract keywords from the context to supplement the semantic information of the formula. We then obtain vector representations of the keywords with the FastText n-gram embedding model and vector representations of the formulae with NTFEM. Finally, documents are sorted according to keyword similarity, and the ranking results are refined by formula similarity. For performance evaluation, NTFEM-K is compared not only with NTFEM but also with hybrid retrieval models combining formulae with long text and with hybrid models combining formulae with keywords obtained by other keyword extraction algorithms. Experimental results show that the top-10 accuracy of NTFEM-K is at least 20% higher than that of NTFEM, and up to 50% higher for some specific topics.
Requirements-to-code tracing is an important and costly task that creates trace links from requirements to source code. These trace links help engineers reduce the time and complexity of software maintenance. Code comments play an important role in software maintenance tasks, yet few studies have focused intensively on their impact on requirements-to-code trace link creation. Different types of comments serve different purposes, so how do different types of code comments improve requirements-to-code trace link creation? We focus on learning whether code comments, and different types of comments, can improve the quality of trace link creation. This paper presents a study evaluating the contribution of code comments, and of different comment types, to the creation of trace links. More specifically, the paper first experimentally evaluates the impact of code comments on requirements-to-code trace link creation, and then divides code comments into six categories to evaluate their impact on trace link creation. The results show that precision increases by an average of 15% (at the same recall) after adding code comments, even across different trace link creation techniques, and that Purpose comments contribute more to the tracing task than the other five categories. This empirical study provides evidence that code comments are effective in trace link creation, that different types of code comments contribute differently, and that Purpose comments can be used to improve the accuracy of requirements-to-code trace link creation.
Smart contracts are programs running on a blockchain. In recent years, due to the persistent occurrence of security-related accidents in smart contracts, the effective detection of vulnerabilities in smart contracts has received extensive attention from researchers and engineers. Machine learning-based vulnerability detection techniques have the advantage that they do not need expert rules for determining vulnerabilities. However, existing approaches cannot identify vulnerabilities when the versions of smart contract compilers are updated. In this paper, we propose OC-Detector (Opcode Clustering Detector), a smart contract vulnerability detection approach based on clustering opcode instructions. OC-Detector learns the characteristics of opcode instructions to cluster them and replaces each opcode instruction with the ID of its cluster. After that, the similarity between the contract under analysis and contracts in the vulnerability database is calculated to identify vulnerabilities. The experimental results demonstrate that OC-Detector improves the F1 value for detecting vulnerabilities by 0.04 to 0.40 compared to DC-Hunter, Securify, SmartCheck and Osiris. Additionally, compared to DC-Hunter, the F1 value is improved by 0.27 when detecting vulnerabilities in smart contracts compiled by different compiler versions.
The application of deep learning techniques to natural language processing tasks has increased in recent years. Many studies have used deep learning techniques to obtain distributed representations of features. In particular, the convolutional neural network (CNN) combined with distributed representations has been shown to be effective for natural language processing tasks. This paper presents how to apply a CNN to speech-act classification. We then analyze the experimental results with respect to two issues: how to handle sparse speech-acts in the training data and out-of-vocabulary words, and how to exploit the advantages of the CNN in speech-act classification. As a result, we obtain significantly improved performance when the CNN is applied to speech-act classification.
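A minimal sketch of a CNN text classifier of this kind is given below, with an embedding layer, one convolutional layer and global max pooling; the vocabulary size, number of speech-act labels and hyperparameters are illustrative assumptions only.

```python
# Minimal sketch of a CNN speech-act classifier over word-embedding inputs.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, MAX_LEN, NUM_ACTS = 5000, 30, 5   # e.g. statement/question/request/...

model = tf.keras.Sequential([
    layers.Embedding(VOCAB_SIZE, 64),
    layers.Conv1D(128, kernel_size=3, activation="relu"),  # n-gram feature maps
    layers.GlobalMaxPooling1D(),
    layers.Dense(NUM_ACTS, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Dummy integer-encoded utterances and speech-act labels.
X = np.random.randint(1, VOCAB_SIZE, size=(128, MAX_LEN))
y = np.random.randint(0, NUM_ACTS, size=(128,))
model.fit(X, y, epochs=1, batch_size=32, verbose=0)
```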
Deep learning technology promotes the development of neural machine translation (NMT). End-to-End (E2E) models have become the mainstream in NMT; they use word vectors as the initial values of the input layer, so the quality of the word vector model directly affects the accuracy of E2E-NMT. Researchers have proposed many approaches to learning word representations and have achieved significant results, but the drawbacks of these methods still limit the performance of E2E-NMT systems. This paper focuses on word embedding technology and proposes the PW-CBOW word vector model, which captures richer semantic information. We apply these word vector models to the IWSLT14 German-English, WMT14 English-German, and WMT14 English-French corpora. The results demonstrate the performance of the PW-CBOW model: it can improve the performance of the latest E2E-NMT systems.
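For reference, the sketch below trains standard CBOW word vectors with Gensim on a toy corpus; the PW-CBOW variant itself is not implemented here, and the corpus and hyperparameters are illustrative.

```python
# Minimal sketch of learning CBOW word vectors, the baseline that PW-CBOW builds on.
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "lay", "on", "the", "rug"],
    ["a", "cat", "and", "a", "dog", "played"],
]

# sg=0 selects the CBOW architecture; vector_size and window are illustrative values.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=0, epochs=50)
print(model.wv.most_similar("cat", topn=3))
```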
Posting sarcastic comments on social media has become popular in the modern era. Sarcasm is a linguistic expression that typically conveys the contrary of what is literally said, making it challenging for machines to recover the intended meaning. It depends heavily on context, which makes computational analysis tedious. It is well known for its tonal modulation in speech and ironic undertone. In addition, sarcasm conveys negative sentiment using positive words, which easily confuses sentiment analysis (SA) models. Sarcasm detection is a natural language processing (NLP) task and is prevalent in SA, human–machine dialogue, and other NLP applications due to sarcasm’s ambiguity and complex nature. Concurrently, advances in machine learning (ML) techniques make it easier to develop robust sarcasm detection methods. This paper presents an automated sarcasm recognition using applied linguistics-driven deep learning with a large language model (ASR-ALDL3M) technique. The purpose of the ASR-ALDL3M technique is to recognize sarcastic data using the DL model. In the ASR-ALDL3M technique, an initial data preprocessing phase is applied, followed by GloVe word embedding. Next, sarcasm recognition is performed using the long short-term memory (LSTM) model. Moreover, hyperparameter selection for the LSTM model is performed using the fractal monarch butterfly optimization (MBO) technique. Finally, a large language model (LLM) is utilized to enhance the sarcasm recognition process. A comprehensive result analysis is conducted to validate the outcomes of the ASR-ALDL3M technique. The performance evaluation shows that the ASR-ALDL3M method achieves better performance than other models.
Natural language processing (NLP) is a domain of artificial intelligence (AI) that concentrates on the interaction between computers and human language. Detection of Arabic spam and ham tweets involves leveraging deep learning (DL) models and NLP techniques, such as brain-like computing and AI-driven tweet recognition, to automatically differentiate between spam and ham messages based on content semantics, linguistic patterns, and contextual information within the Arabic text. This study presents an optimal deep learning with natural language processing for Arabic spam and ham tweets recognition (ODLNLP-ASHTR) technique for various complex systems platforms. In the ODLNLP-ASHTR technique, data pre-processing is initially performed to convert the input tweets into a compatible format, and a BERT word embedding process is used. For Arabic ham and spam tweet recognition, the ODLNLP-ASHTR technique makes use of the self-attention bidirectional gated recurrent unit (SA-BiGRU) model. Finally, the detection performance of the SA-BiGRU model is boosted by the design of an improved salp swarm algorithm (ISSA). The experimental evaluation of the ODLNLP-ASHTR technique takes place using an Arabic tweets dataset. The experimental results demonstrate the improved performance of the ODLNLP-ASHTR model compared to recent approaches, with a maximum accuracy of 98.11%.
Social media texts like tweets and blogs are collaboratively created by human interaction. Rapidly changing trends lead to topic drift in social media text. This drift is usually associated with words and hashtags, but geotags also play an essential part in determining topic distribution with location context. The rates of change in the distributions of words, hashtags and geotags cannot be considered uniform and must be handled accordingly. This paper builds a topic model that associates each topic with a mixture of distributions over words, hashtags and geotags. A stochastic gradient Langevin dynamics model with varying mini-batch sizes is used to capture the changes due to the asynchronous distributions of words and tags. Topic representations with co-occurrence and location contexts are specified as a hashtag context vector and a geotag context vector, respectively. These two vectors are jointly learned to yield topical word embedding vectors over time, conditioned on hashtags and geotags, that can predict location-based topical variations effectively. When evaluated on Chennai and UK geolocated Twitter data, the proposed joint topical word embedding model, enhanced by the social tag context, outperforms other methods.
In this digital era, there is a tremendous increase in the volume of data, which makes it difficult for people to extract what they need from applications such as websites, email, and news services. Text summarization aims to reduce the effort of obtaining information from websites by compressing a textual document into a short summary without losing the relevant information. The crucial step in multi-document summarization is capturing the relationships between sentences, but conventional methods fail to determine inter-sentence relationships, especially in long documents. This research develops a graph-based neural network to capture inter-sentence relationships. The significant step in the proposed multi-document text summarization model is forming weighted graph embedding features, which are utilized to evaluate the relationships between the documents’ words and sentences. Finally, a bidirectional long short-term memory (BiLSTM) classifier is utilized to produce the multi-document summary. The experimental analysis uses three standard datasets: the Daily Mail dataset, the Document Understanding Conference (DUC) 2002 dataset, and the DUC 2004 dataset. The experimental outcomes demonstrate that the proposed weighted graph embedding feature + BiLSTM model exceeds all the conventional methods, with Precision, Recall, and F1 score of 0.5352, 0.6296, and 0.5429, respectively.
Molecular events normally carry significant meaning since they describe important biological interactions or alterations, such as the binding of a protein. As a crucial step of biological event extraction, event trigger identification has attracted much attention and many methods have been proposed. Traditionally, these methods can be categorized into rule-based and machine learning approaches, and machine learning-based approaches have demonstrated their potential and outperformed rule-based approaches in many situations. However, machine learning-based approaches still face several challenges, a notable one being how to model the semantic and syntactic information of different words and incorporate it into the prediction model. There are many ways to model semantic and syntactic information, among which word embedding is an effective one. Therefore, to address this challenge, this study proposes a word embedding-assisted neural network prediction model for event trigger identification. An experimental study on a commonly used dataset has shown its potential. It is believed that this study could offer researchers insights into semantic-aware solutions for event trigger identification.
Information on changes in a drug’s effect when taken in combination with a second drug, known as drug–drug interaction (DDI), is relevant in the pharmaceutical industry. DDIs can delay, decrease, or enhance absorption of either drug and thus decrease or increase their action or cause adverse effects. Information Extraction (IE) can be of great benefit in allowing identification and extraction of relevant information on DDIs. We here propose an approach for the extraction of DDI from text using neural word embedding to train a machine learning system.
Results show that our system is competitive against other systems for the task of extracting DDIs, and that significant improvements can be achieved by learning from word features and using a deep-learning approach. Our study demonstrates that machine learning techniques such as neural networks and deep learning methods can efficiently aid in IE from text. Our proposed approach is well suited to play a significant role in future research.
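One simple way to realize "word embeddings feeding a machine learning system" for this task is to average word vectors over a candidate sentence and train a classifier on top, as in the hedged sketch below; the random vectors, DRUG placeholders and labels are purely illustrative and do not reproduce the proposed system.

```python
# Minimal sketch: averaged word vectors as sentence features for a DDI / no-DDI
# decision. Random vectors stand in for real pretrained neural word embeddings.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
DIM = 50
vocab = {}  # word -> vector; in practice loaded from a pretrained embedding model

def sentence_vector(sentence):
    words = sentence.lower().split()
    for w in words:
        vocab.setdefault(w, rng.normal(size=DIM))
    return np.mean([vocab[w] for w in words], axis=0)

sentences = ["DRUG1 increases the plasma concentration of DRUG2",
             "DRUG1 and DRUG2 were both administered to patients",
             "Co-administration of DRUG1 reduces the effect of DRUG2",
             "DRUG1 was measured after DRUG2 treatment ended"]
labels = [1, 0, 1, 0]  # 1 = interaction described, 0 = none (hypothetical)

X = np.vstack([sentence_vector(s) for s in sentences])
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0).fit(X, labels)
print(clf.predict(X))
```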
Antimicrobial peptides (AMPs), as preferred alternatives to antibiotics, have wide application and good prospects. Identifying AMPs through wet lab experiments remains expensive, time-consuming and challenging. Many machine learning methods have been proposed to predict AMPs and have achieved good results. In this work, we combine two kinds of word embedding features with statistical features of peptide sequences to develop an ensemble classifier named EnAMP. In EnAMP, two deep neural networks are trained on Word2vec and GloVe word embedding features of peptide sequences, respectively, while random forest and support vector machine classifiers are trained on statistical features of the sequences. The average of the four classifiers’ outputs is the final prediction. Compared with other state-of-the-art algorithms on six datasets, EnAMP outperforms most existing models at similar computational cost; even when compared with high-computational-cost algorithms based on Bidirectional Encoder Representations from Transformers (BERT), the performance of our model is comparable. The EnAMP source code and data are available at https://github.com/ruisue/EnAMP.
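The ensemble-averaging step can be sketched as follows; the two embedding-based deep networks are omitted, and a random forest plus an SVM trained on synthetic "statistical" features merely illustrate how per-classifier probabilities are averaged into a final prediction.

```python
# Minimal sketch of averaging predicted probabilities from several classifiers.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))          # placeholder statistical features of peptides
y = rng.integers(0, 2, size=100)        # 1 = AMP, 0 = non-AMP (synthetic labels)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
svm = SVC(probability=True, random_state=0).fit(X, y)

# Average the predicted AMP probabilities across ensemble members.
proba = np.mean([rf.predict_proba(X)[:, 1], svm.predict_proba(X)[:, 1]], axis=0)
prediction = (proba >= 0.5).astype(int)
print(prediction[:10])
```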
As a core subtask of anaphora resolution, anaphoricity determination has aroused the interest of researchers. However, recent work has not taken into account the influence of deep semantic information and the context of coreference elements. In this paper, by incorporating semantic features of Uygur, we establish a Convolutional Neural Network & Long Short-Term Memory (CNN_LSTM) model for determining the anaphoricity of Uygur pronouns. First, the deep semantic feature representation is extracted via word2vec. Second, the shallow explicit feature representation of coreference elements is extracted by our system. The two kinds of features are then combined to recognize whether a coreference element is referential or not. The results show that this method can distinguish coreference elements accurately, with an ACC+ score of 90.18% and an ACC− score of 89.93%, which are higher than those of an Artificial Neural Network (ANN) and a Support Vector Machine (SVM), respectively.