In the current landscape, the Internet of Things (IoT) finds its utility across diverse sectors, including finance, healthcare, and beyond. However, security emerges as the principal obstacle impeding the advancement of IoT. Given the intricate nature of IoT cybersecurity, traditional security protocols fall short when addressing the unique challenges within the IoT domain. Security strategies anchored in the cybersecurity knowledge graph present a robust solution to safeguard IoT ecosystems. The foundation of these strategies lies in the intricate networks of the cybersecurity knowledge graph, with Named Entity Recognition (NER) serving as a crucial initial step in its implementation. Conventional cybersecurity entity recognition approaches in IoT grapple with the complexity of cybersecurity entities, characterized by their sophisticated structures and vague meanings. Additionally, these traditional models are inadequate at discerning all the interrelations between cybersecurity entities, rendering their direct application in IoT security impractical. This paper introduces an innovative Cybersecurity Entity Recognition Model, referred to as CERM, designed to pinpoint cybersecurity entities within IoT. CERM employs a hierarchical attention mechanism that proficiently maps the interdependencies among cybersecurity entities. Leveraging these mapped dependencies, CERM precisely identifies IoT cybersecurity entities. Comparative evaluation experiments illustrate CERM's superior performance over existing entity recognition models, marking a significant advancement in the field of IoT security.
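The abstract above does not spell out CERM's architecture; as a generic illustration of the hierarchical attention idea it refers to, the sketch below shows a minimal two-level (token-level, then sentence-level) attention encoder in PyTorch. The GRU encoders, layer sizes, and pooling scheme are illustrative assumptions, not details from the paper.

```python
# Minimal, generic hierarchical (token -> sentence) attention sketch in PyTorch.
import torch
import torch.nn as nn


class HierarchicalAttention(nn.Module):
    def __init__(self, emb_dim=128, hidden=64):
        super().__init__()
        self.token_rnn = nn.GRU(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.token_attn = nn.Linear(2 * hidden, 1)   # scores each token
        self.sent_rnn = nn.GRU(2 * hidden, hidden, bidirectional=True, batch_first=True)
        self.sent_attn = nn.Linear(2 * hidden, 1)    # scores each sentence

    def forward(self, x):
        # x: (batch, num_sentences, num_tokens, emb_dim)
        b, s, t, d = x.shape
        tok_out, _ = self.token_rnn(x.view(b * s, t, d))          # (b*s, t, 2h)
        tok_w = torch.softmax(self.token_attn(tok_out), dim=1)    # token attention weights
        sent_vecs = (tok_w * tok_out).sum(dim=1).view(b, s, -1)   # sentence vectors (b, s, 2h)
        sent_out, _ = self.sent_rnn(sent_vecs)                    # (b, s, 2h)
        sent_w = torch.softmax(self.sent_attn(sent_out), dim=1)   # sentence attention weights
        return (sent_w * sent_out).sum(dim=1)                     # document vector (b, 2h)


doc = torch.randn(2, 4, 12, 128)             # 2 documents, 4 sentences, 12 tokens each
print(HierarchicalAttention()(doc).shape)    # torch.Size([2, 128])
```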
Nowadays, records in various languages are available in digital form. For easy retrieval of these digitized records, documents should be assigned to a class according to their content. Text Categorization, an area of Text Mining, helps to overcome this challenge. Text Classification is the task of assigning classes to documents. This paper investigates Text Classification work done on foreign languages, regional languages, and collections of books' content. Texts available in different languages pose challenges for NLP approaches. This study shows that supervised ML algorithms such as Logistic Regression, the Naive Bayes classifier, the k-Nearest-Neighbor classifier, Decision Trees, and SVMs perform well for Text Classification tasks. The automated document classification technique is useful in our day-to-day life for finding out the language and subject area of books based on their text content. We classify documents in several foreign and regional languages, namely Tamil, Telugu, Kannada, Bengali, English, Spanish, French, Russian and German. Here, we use one-versus-all SVMs for multi-class classification with 3-fold cross-validation in all cases and observe that SVMs outperform the other classifiers. The implementation uses hybrid classifiers and reports analyses with soft-margin linear SVMs as well as kernel-based SVMs.
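As a concrete illustration of the setup described above (one-versus-all SVMs for multi-class text classification with 3-fold cross-validation, comparing soft-margin linear and kernel-based SVMs), the following scikit-learn sketch uses a tiny invented corpus; the character n-gram features and hyper-parameters are assumptions, not the paper's configuration.

```python
# One-vs-rest SVM language classification with 3-fold cross-validation (toy demo).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC, SVC

docs = ["bonjour le monde", "bonsoir mes amis", "merci beaucoup",
        "hola mundo", "buenos dias amigos", "muchas gracias",
        "hello world", "good morning friends", "thank you very much"]
labels = ["fr", "fr", "fr", "es", "es", "es", "en", "en", "en"]

# Soft-margin linear SVM in a one-vs-rest arrangement over character n-grams.
linear_clf = make_pipeline(TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
                           OneVsRestClassifier(LinearSVC(C=1.0)))
print(cross_val_score(linear_clf, docs, labels, cv=3).mean())

# Kernel-based SVM variant for comparison.
kernel_clf = make_pipeline(TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
                           OneVsRestClassifier(SVC(kernel="rbf", C=1.0)))
print(cross_val_score(kernel_clf, docs, labels, cv=3).mean())
```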
Current named entity recognition (NER) systems are mainly based on convolutional or recurrent neural networks. To achieve high performance, these networks require a large amount of training data in the form of feature-engineered corpora and lexicons. Chinese NER is particularly challenging because of the high contextual relevance of Chinese characters; that is, Chinese characters and phrases may have many possible meanings in different contexts. To this end, we propose a model that combines a pre-trained Bidirectional Encoder Representations from Transformers (BERT) language model with a joint bi-directional long short-term memory (Bi-LSTM) and conditional random field (CRF) model for Chinese NER. The underlying network layer embeds Chinese characters and outputs character-level representations. The output is then fed into a bidirectional LSTM to capture contextual sequence information. The top layer of the proposed model is a CRF, which takes into account the dependencies between adjacent tags and jointly decodes the optimal chain of tags. A series of extensive experiments were conducted to investigate the improvements offered by the proposed neural network architecture on different datasets without relying heavily on handcrafted features and domain-specific knowledge. Experimental results show that the proposed model is effective and that character-level representations are of great significance for Chinese NER tasks. In addition, through this work, we have compiled a new informal conversation message corpus, the autonomous bus information inquiry dataset, and our method achieves significant improvements over strong baselines.
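A condensed sketch of the tagging stack described above (BERT character-level encoder, Bi-LSTM, CRF decoding) is given below. It assumes the HuggingFace transformers package and the third-party pytorch-crf package; hyper-parameters and the tag set size are placeholders rather than the paper's settings.

```python
# BERT + BiLSTM + CRF tagging sketch (assumes `transformers` and `pytorch-crf`).
import torch
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF


class BertBiLstmCrf(nn.Module):
    def __init__(self, num_tags, bert_name="bert-base-chinese", hidden=256):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)       # character-level encoder
        self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden,
                            bidirectional=True, batch_first=True)
        self.emit = nn.Linear(2 * hidden, num_tags)             # emission scores per tag
        self.crf = CRF(num_tags, batch_first=True)              # joint decoding over tag chains

    def forward(self, input_ids, attention_mask, tags=None):
        seq = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        seq, _ = self.lstm(seq)                                  # contextual sequence features
        emissions = self.emit(seq)
        mask = attention_mask.bool()
        if tags is not None:                                     # training: negative log-likelihood
            return -self.crf(emissions, tags, mask=mask)
        return self.crf.decode(emissions, mask=mask)             # inference: best tag paths
```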
Multi-contextualized representation learning is vital for named entity recognition (NER), a fundamental task for effectively extracting structured information from unstructured text and forming knowledge bases. This task is particularly challenging when dealing with Chinese text, given the absence of evident word boundaries. Chinese word segmentation (CWS) can be leveraged to recognize word boundaries, but named entities often encompass multiple segmented words, making it crucial to use boundary information to correctly recognize and distinguish the relationships between these words. In this paper, we propose MCA-NER, a multi-contextualized adversarial-based attentional deep learning approach for Chinese NER, which combines CWS and part-of-speech (POS) tagging information with the classic BiLSTM-CRF NER model using adversarial multi-task learning. The model incorporates several self-attention components for adversarial and multi-task learning, effectively synthesizing task-specific and task-common information while improving performance across all three tasks. Experimental results on three datasets provide compelling evidence for the effectiveness and performance of our model.
Improving the recognition ability of clinical named entity recognition (CNER) on a limited number of Chinese electronic medical records provides meaningful support for advanced clinical knowledge extraction. In this paper, using the CCKS2019 Chinese electronic medical record corpus as the experimental data source, a fusion model enhanced by a knowledge graph (KG) is proposed and applied to specific Chinese CNER tasks. This study consists of three main parts: single-model construction and comparison experiments, a KG enhancement experiment, and a model fusion experiment. The results show that the model achieves good performance in CNER: the precision, recall, and F1 values are 83.825%, 84.705%, and 84.263%, respectively, the best among the compared models, which demonstrates the effectiveness of the approach. This provides useful support for further research in medical information processing.
In response to the continuous sophistication of cyber threat actors, it is imperative to make the best use of cyber threat intelligence derived from structured or semi-structured data, and Named Entity Recognition (NER) techniques contribute to extracting such critical intelligence. To promote NER research in the Cyber Threat Intelligence (CTI) domain, we provide a Large Dataset for NER in Cyber Threat Intelligence (LDNCTI). On the LDNCTI corpus, we investigate the feasibility of mainstream transformer-based models in the CTI domain. To address the problem of unbalanced label distribution, we introduce a transformer-based model with a Triplet Loss based on metric learning and a Sorted Gradient harmonizing mechanism (TSGL). Our experimental results show that the LDNCTI well represents critical threat intelligence and that our transformer-based model with the new loss function outperforms previous schemes on the Dataset for NER in Threat Intelligence (DNRTI) and the dataset for NER in Advanced Persistent Threats (APTNER).
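The sketch below illustrates only the metric-learning ingredient of the TSGL objective, namely a triplet loss over token representations grouped by tag. The anchor/positive/negative sampling strategy and the margin are assumptions, and the sorted gradient harmonizing mechanism is not reproduced here.

```python
# Triplet loss over token representations, grouped by NER tag (illustrative only).
import torch
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=1.0)

def token_triplet_loss(reprs, labels):
    """reprs: (num_tokens, dim) token embeddings; labels: (num_tokens,) integer tag ids."""
    losses = []
    for i in range(len(labels)):
        same = (labels == labels[i]).nonzero(as_tuple=True)[0]
        diff = (labels != labels[i]).nonzero(as_tuple=True)[0]
        if len(same) < 2 or len(diff) == 0:
            continue                                   # no valid triplet for this anchor
        pos = int(same[same != i][0])                  # another token with the same tag
        neg = int(diff[0])                             # a token with a different tag
        losses.append(triplet(reprs[i:i+1], reprs[pos:pos+1], reprs[neg:neg+1]))
    return torch.stack(losses).mean() if losses else reprs.new_zeros(())

reprs = torch.randn(6, 32)
labels = torch.tensor([1, 1, 2, 2, 0, 0])
print(token_triplet_loss(reprs, labels))
```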
The increasing sophistication of cyberattacks on blockchain systems has made it significantly harder for security experts to gain immediate insight into the security situation. The Cybersecurity Knowledge Graph (CKG) currently provides a novel technical solution for blockchain system situational awareness by integrating massive amounts of fragmented Cyber Threat Intelligence (CTI) about blockchain technology. However, the existing literature does not provide a solution for building a CKG appropriate for blockchain systems. Therefore, a method to construct a CKG for blockchain systems by efficiently extracting information from CTI is needed. This paper proposes PipCKG-BS, a pipeline-based approach that builds a CKG for blockchain systems. PipCKG-BS incorporates contextual features and Pre-trained Language Models (PLMs) to improve the performance of the information extraction process. Specifically, we develop Named Entity Recognition (NER) and Relation Extraction (RE) models for cybersecurity text in PipCKG-BS. In the NER model, we apply the prompt-based learning paradigm to cybersecurity text by constructing prompt templates. In the RE model, we employ external features and prior knowledge of sentences to improve entity relationship extraction accuracy. Experimental results demonstrate that PipCKG-BS outperforms advanced methods in extracting CTI information and is an appealing solution for building high-quality CKGs for blockchain systems.
Named Entity Extraction (NEE) is the process of identifying entities in texts and, very commonly, linking them to related (Web) resources. This task is useful in several applications, e.g. for question answering, annotating documents, post-processing of search results, etc. However, existing NEE tools lack an open or easy configuration, although this is very important for building domain-specific applications. For example, supporting a new category of entities, or specifying how to link the detected entities with online resources, is either impossible or very laborious. In this paper, we show how we can exploit semantic information (Linked Data) in real time for easily configuring a NEE system, and we propose a generic model for configuring such services. To explicitly define the semantics of the proposed model, we introduce an RDF/S vocabulary, called "Open NEE Configuration Model", which allows a NEE service to describe (and publish as Linked Data) its entity mining capabilities, but also to be dynamically configured. To allow relating the output of a NEE process with an applied configuration, we propose an extension of the Open Annotation Data Model which also enables an application to run advanced queries over the annotated data. As a proof of concept, we present X-Link, a fully-configurable NEE framework that realizes this approach. In contrast to existing tools, X-Link allows the user to easily define the categories of entities that are interesting for the application at hand by exploiting one or more semantic Knowledge Bases. The user is also able to update a category and specify how to semantically link and enrich the identified entities. This enhanced configurability allows X-Link to be easily configured for different contexts for building domain-specific applications. To test the approach, we conducted a task-based evaluation with users that demonstrates its usability, and a case study that demonstrates its feasibility.
Neural Machine Translation (NMT) has become the mainstream technology in machine translation. Supervised neural machine translation models are trained on abundant sentence-level parallel corpora, but for low-resource languages or dialects with no such corpus available, it is difficult to achieve good performance. Researchers have therefore begun to focus on unsupervised neural machine translation (UNMT), which uses monolingual corpora as training data. UNMT needs to construct a language model (LM) that learns semantic information from the monolingual corpus. This paper focuses on the pre-training of the LM in unsupervised machine translation and proposes a pre-training method, NER-MLM (named entity recognition masked language model). By performing NER, the proposed method can obtain better semantic information and language model parameters with better training results. In the unsupervised machine translation task, the BLEU scores on the WMT'16 English–French and English–German data sets are 35.30 and 27.30, respectively. To the best of our knowledge, these are the highest results reported in the field of UNMT so far.
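To make the NER-MLM idea concrete, the toy function below prefers tokens inside named entities when selecting positions to mask for masked-language-model pre-training. The BIO tag format and the 15% masking budget are assumptions for the demo, not the paper's exact procedure.

```python
# NER-guided token masking for MLM pre-training (toy sketch).
import random

def ner_guided_mask(tokens, ner_tags, mask_token="[MASK]", ratio=0.15):
    """tokens: list of str; ner_tags: BIO tags aligned with tokens."""
    budget = max(1, int(len(tokens) * ratio))
    entity_positions = [i for i, t in enumerate(ner_tags) if t != "O"]
    other_positions = [i for i, t in enumerate(ner_tags) if t == "O"]
    random.shuffle(entity_positions)
    random.shuffle(other_positions)
    chosen = (entity_positions + other_positions)[:budget]   # entity tokens are masked first
    masked = list(tokens)
    for i in chosen:
        masked[i] = mask_token
    return masked, sorted(chosen)

# Example: entity tokens ("Paris", "France") are preferred over ordinary words.
print(ner_guided_mask(["I", "visited", "Paris", "in", "France", "."],
                      ["O", "O", "B-LOC", "O", "B-LOC", "O"]))
```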
The primary goal of Sentiment Analysis (SA) is to recognize the emotions present in natural language text. Generally, in opinionated content, emotions are driven by several aspects of interest. Any SA task that groups data into various aspects and identifies sentiments is referred to as Aspect-Based Sentiment Analysis (ABSA). Recent advances in Deep Learning (DL) have brought revolutionary changes in the performance of Machine Learning models. Their ability to capture the semantic and syntactic traits of any intrinsic data model is highly appreciated. In this research work, we use DL techniques to address the challenges of ABSA, aiming to improve sentiment granularity at the aspect level. The proposed methodology works in two stages: (i) aspect term extraction and (ii) sentiment polarity classification. The task of aspect term extraction is achieved through the concept of Named-Entity Recognition (NER). However, most of the available NER models are domain dependent and utilize hand-crafted features for learning labeled data. Hence, for aspect term extraction, a joint model based on Bi-GRU and Conditional Random Fields (CRF) is proposed. Similarly, for sentiment polarity classification, we introduce a novel attention-based neural network called the Polarity Embedded Attention Network (PEAN). The intuition behind PEAN is that, when an aspect term appears in a sentence, its related sentiment term is represented by the polarity embedding. Hence, PEAN combines sentence embedding with aspect and polarity embeddings to learn the relationship between sentence and aspect terms. The effectiveness of the proposed model is demonstrated through a comparative study of different models on benchmark datasets. It yields better results compared to other baseline techniques.
Named Entity Recognition (NER) is an NLP field that deals with recognizing and classifying entities in written text. Most Arabic NER research studies address the Arabic NER challenge for the Modern Standard Arabic (MSA) language. However, the volume of dialectal Arabic textual resources in social media, blogs, TV shows, etc. is growing steadily. Therefore, the treatment of named entities is rapidly becoming a necessity, particularly for dialectal Arabic. In this paper, we are interested in the collection and annotation of a corpus as well as the realization of a NER system for Tunisian Arabic (TA), named TUNER. To the best of the researchers' knowledge, this is the first study that uses the suggested method for this purpose. In the present study, we adopt a hybrid method that combines a Bi-LSTM-CRF model with a rule-based method. The proposed TUNER system yields an F-measure of 91.43%. This is an interesting improvement over comparable related work on dialectal Arabic NER systems.
Transliteration is the process of mapping the characters of one language to the characters of another language based on phonetics. India is linguistically very diverse, with people speaking many different languages. Although people speak different languages, it can be difficult for them to read the scripts of all those languages. In such situations, the transliteration process plays a major role. It helps in various Natural Language Processing (NLP) applications such as information retrieval, machine translation, and speech recognition. These NLP applications enable computers to understand natural language in the way human beings interpret it. Transliteration also helps in translating technical terms and proper names from one language to another. Moreover, transliteration work has been carried out for languages such as Japanese, Chinese and English, but for Indian languages, especially Tamil, very little notable work has been done. In this paper, the transliteration process is carried out on Unicode Tamil characters. Phonetics-based forward list processing is implemented for transliterating from English to Tamil, which yields promising results.
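A toy sketch of phonetics-based forward (longest-match) list processing is shown below; the grapheme-to-Tamil mapping table is deliberately tiny and purely illustrative, not the actual transliteration table used in the paper.

```python
# Forward longest-match transliteration sketch (illustrative mapping only).
PHONEME_MAP = {  # Latin grapheme cluster -> Tamil character(s); not a linguistically complete table
    "ch": "ச்", "th": "த்", "aa": "ா", "a": "அ", "k": "க்", "m": "ம்", "i": "இ",
}

def transliterate(word, table=PHONEME_MAP):
    out, i = [], 0
    keys = sorted(table, key=len, reverse=True)       # try longest clusters first
    while i < len(word):
        for key in keys:
            if word[i:i + len(key)] == key:           # forward longest match at position i
                out.append(table[key])
                i += len(key)
                break
        else:
            out.append(word[i])                       # pass through unmapped characters
            i += 1
    return "".join(out)

print(transliterate("chakam"))
```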
As social media platforms have gained huge momentum in recent years, the amount of information generated on social media sites is growing exponentially, which makes extracting potential named entities a great challenge for information retrieval systems. Researchers have utilized the semantic annotation mechanism to retrieve entities from unstructured documents, but this mechanism returns too many ambiguous entities. In this work, the DBpedia knowledge base is adopted for entity extraction and categorization. To achieve the entity extraction task precisely, a two-step process is proposed: (a) train Word2Vec on the unstructured datasets and classify the entities into their respective categories, and (b) crawl web pages, forums, and other web sources to identify entities that are not present in DBpedia. The evaluation shows results with higher precision and a promising F1 score.
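The first step of the proposed pipeline (train Word2Vec on the unstructured data, then classify into categories) could look roughly like the sketch below, which uses gensim and scikit-learn on an invented toy corpus; the averaging-based document vectors and the logistic-regression classifier are assumptions for illustration.

```python
# Word2Vec training followed by category classification (toy sketch).
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

corpus = [["apple", "releases", "new", "phone"],
          ["team", "wins", "the", "final"],
          ["new", "phone", "battery", "review"],
          ["coach", "praises", "the", "team"]]
labels = ["tech", "sport", "tech", "sport"]

w2v = Word2Vec(corpus, vector_size=50, window=3, min_count=1, epochs=50)

def doc_vector(tokens):
    """Average the word vectors of the tokens that Word2Vec knows about."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

X = np.vstack([doc_vector(d) for d in corpus])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict([doc_vector(["phone", "review"])]))
```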
In this era, news is not only generated continuously and at high speed but is also growing in volume across different web sources such as talent-hunt sites, news agencies, and so on. To predict the exact class of news depending on its topic, GepH (Grouped entity predictor for Hindi) is proposed using entity extraction and grouping. Entity extraction is well established for English corpora, whereas Hindi, although a national language, has not been explored as much by researchers due to its resource scarcity. More than 1,270 news items are processed for entity extraction, clustering, and classification using the vector space model for Hindi (VSMH), the Synset vector space model for Hindi (SVSMH), and the grouped entity document matrix for Hindi (GEDMH). Synset-based dimension reduction techniques are used to improve accuracy. Evaluation of hierarchical agglomerative clustering (HAC) using the three matrices shows that GEDMH performs best across varied datasets. The labelled corpus obtained after applying HAC to GEDMH is then used as a training dataset, and predictions are made using random forest and Naïve Bayes classifiers. The Naïve Bayes classifier implemented using the proposed GEDMH performs best. GepH achieves 0.8 purity, 0.4 entropy, and a 0.3 error rate on 1,273 Hindi news items.
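A rough sketch of the described pipeline (document-term matrix, hierarchical agglomerative clustering to obtain labels, then Naïve Bayes classification on those labels) is given below with scikit-learn. The tiny English stand-in corpus and the two-cluster setting are assumptions for the demo; the paper works with Hindi news and the VSMH/SVSMH/GEDMH matrices.

```python
# HAC pseudo-labelling followed by Naive Bayes classification (toy sketch).
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["cricket match result", "election vote count", "cricket team captain",
        "minister election speech", "stadium cricket crowd", "vote minister party"]

vec = CountVectorizer()
X = vec.fit_transform(docs)                          # document-term matrix
hac = AgglomerativeClustering(n_clusters=2)
pseudo_labels = hac.fit_predict(X.toarray())         # cluster labels from HAC

nb = MultinomialNB().fit(X, pseudo_labels)           # train Naive Bayes on the HAC labels
print(nb.predict(vec.transform(["cricket final score"])))
```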
The influence of genetic variations on diseases or cellular processes is the main focus of many investigations, and the results of biomedical studies are often only accessible through scientific publications. Automatic extraction of this information requires recognition of the gene names and the accompanying allelic variant information. In previous work, the OSIRIS system for the detection of allelic variation in text, based on a query expansion approach, was presented. Challenges associated with this system are the relatively low recall for variation mentions and gene name recognition. To tackle this challenge, we integrate the ProMiner system, developed for the recognition and normalization of gene and protein names, with a conditional random field (CRF)-based recognition of variation terms in biomedical text. Using the newly developed normalization of variation entities, we can link textual entities to Single Nucleotide Polymorphism database (dbSNP) entries. The performance of this novel approach is evaluated, and improved results in comparison to state-of-the-art systems are reported.
The CALBC initiative aims to provide a large-scale biomedical text corpus that contains semantic annotations for named entities of different kinds. The generation of this corpus requires that the annotations from different automatic annotation systems be harmonized. In the first phase, the annotation systems from five participants (EMBL-EBI, EMC Rotterdam, NLM, JULIE Lab Jena, and Linguamatics) were gathered. All annotations were delivered in a common annotation format that included concept identifiers in the boundary assignments and that enabled comparison and alignment of the results. During the harmonization phase, the results produced from those different systems were integrated in a single harmonized corpus ("silver standard" corpus) by applying a voting scheme. We give an overview of the processed data and the principles of harmonization — formal boundary reconciliation and semantic matching of named entities. Finally, all submissions of the participants were evaluated against that silver standard corpus. We found that species and disease annotations are better standardized amongst the partners than the annotations of genes and proteins. The raw corpus is now available for additional named entity annotations. Parts of it will be made available later on for a public challenge. We expect that we can improve corpus building activities both in terms of the numbers of named entity classes being covered, as well as the size of the corpus in terms of annotated documents.
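As a minimal illustration of the voting idea behind a silver-standard corpus, the sketch below keeps an entity annotation only when at least k contributing systems produced the same span with the same type. The (start, end, type) annotation format and the threshold are assumptions, not the CALBC harmonization procedure itself.

```python
# Voting-based harmonization of annotations from multiple systems (toy sketch).
from collections import Counter

def harmonize(system_annotations, k=3):
    """system_annotations: list (one per system) of sets of (start, end, type) spans."""
    votes = Counter()
    for annotations in system_annotations:
        votes.update(annotations)                    # each system votes once per span
    return {span for span, count in votes.items() if count >= k}

systems = [
    {(0, 4, "DISEASE"), (10, 15, "GENE")},
    {(0, 4, "DISEASE"), (20, 26, "SPECIES")},
    {(0, 4, "DISEASE"), (10, 15, "GENE")},
    {(0, 4, "DISEASE")},
    {(0, 4, "DISEASE"), (10, 15, "GENE")},
]
print(harmonize(systems, k=3))   # {(0, 4, 'DISEASE'), (10, 15, 'GENE')}
```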
Although there are several corpora with protein annotation, incompatibility between the annotations in different corpora remains a problem that hinders the progress of automatic recognition of protein names in biomedical literature. Here, we report on our efforts to find a solution to the incompatibility issue, and to improve the compatibility between two representative protein-annotated corpora: the GENIA corpus and the GENETAG corpus. In a comparative study, we improve our insight into the two corpora, and a series of experimental results show that most of the incompatibility can be removed.
Since the Message Understanding Conferences on Information Extraction in the 80's and 90's, Named Entity ReCognition (NERC) has been a well-established task in the Natural Language Processing (NLP) community. However, very different systems seem to perform very similarly when applied to the same corpus. In this paper, we present a state-of-the-art NERC system. This tool is a hybrid system based on different resources and techniques. We then propose a protocol to "deconstruct" and evaluate the different components of a complex named entity recognition system. We examine the performance of such a system with learning capacities and reduced initial knowledge on medium-size unlabelled corpora.
This paper reports on the development of a Named Entity Recognition (NER) system in Indian languages, particularly for Bengali, Hindi, Telugu, Oriya and Urdu, using the statistical Maximum Entropy (ME) framework. We have used the annotated corpora, obtained from the IJCNLP-08 NER Shared Task for South and South East Asian Languages (NERSSEAL) and tagged with the twelve NE tags. An appropriate tag conversion routine has been developed in order to convert these corpora to the forms tagged with four NE tags, namely Person name, Location name, Organization name and Miscellaneous name. The system makes use of the different contextual information of the words along with the variety of orthographic word-level features that are helpful in predicting the four NE classes. In this work, we have considered language independent features as well as language specific features. Language independent features include the contextual words, prefixes and suffixes of all the words in the training corpus, several digit features depending upon the presence and/or the number of digits in a token, the first word of the sentence, and the frequency features of the words. The system considers linguistic features, particularly for Bengali and Hindi. Linguistic features of Bengali include the set of known suffixes that may appear with NEs, clue words that help in predicting the location and organization names, words that help to recognize measurement expressions, designation words that help to identify person names, various gazetteer lists like the first names, middle names, last names, location names, organization names, function words, month names, weekdays, etc. As part of linguistic features for Hindi, the system uses only the lists of first names, middle names, last names, function words, month names and weekdays along with the list of words that helps to recognize measurements. In addition to the other features, part-of-speech (POS) information of the word has also been considered for Bengali and Hindi. No linguistic features have been considered for Telugu, Oriya and Urdu. It has been observed from the evaluation results that the use of linguistic features improves the performance of the system. The system has been trained with 122,467 Bengali, 502,974 Hindi, 64,026 Telugu, 93,173 Oriya and 35,447 Urdu tokens. The system has demonstrated the highest overall average Recall, Precision, and F-Score values of 88.01%, 82.63%, and 85.22%, respectively, for Bengali with the 10-fold cross validation test. Experimental results of the 10-fold cross validation tests on the Hindi, Telugu, Oriya, and Urdu data have shown the overall average F-Score values of 82.66%, 70.11%, 70.13%, and 69.3%, respectively.
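The kind of language-independent, word-level features listed above (context words, prefixes and suffixes, digit features, sentence-initial flag) can be sketched as a simple feature-extraction function; the feature names and maximum affix length below are illustrative choices rather than the paper's exact feature set.

```python
# Language-independent word-level feature extraction for statistical NER (sketch).
def word_features(sentence, i, max_affix=3):
    word = sentence[i]
    feats = {
        "word": word.lower(),
        "prev_word": sentence[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": sentence[i + 1].lower() if i < len(sentence) - 1 else "<EOS>",
        "is_first_word": i == 0,                          # sentence-initial flag
        "contains_digit": any(c.isdigit() for c in word), # digit features
        "all_digits": word.isdigit(),
        "word_length": len(word),
    }
    for n in range(1, max_affix + 1):                     # prefixes and suffixes up to length 3
        feats[f"prefix_{n}"] = word[:n].lower()
        feats[f"suffix_{n}"] = word[-n:].lower()
    return feats

print(word_features(["Kolkata", "is", "in", "India"], 0))
```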
This paper reports on the development of a Named Entity Recognition (NER) system in Bengali by combining the outputs of three classifiers, namely Maximum Entropy (ME), Conditional Random Field (CRF) and Support Vector Machine (SVM). A part of the Bengali news corpus developed from the web-archive of a leading Bengali newspaper has been manually annotated with the four major named entity (NE) tags, namely Person name, Location name, Organization name and Miscellaneous name. We have also used the annotated corpus of the IJCNLP-08 NER Shared Task for South and South East Asian Languages (NERSSEAL). An appropriate tag conversion routine has been developed in order to convert the fine-grained NE tagged NERSSEAL corpus to the form tagged with the coarse-grained NE tagset of four tags. The system makes use of the different contextual information of the words along with the variety of orthographic word-level features that are helpful in predicting the four NE classes. In this work, we have considered language independent features as well as the language dependent features extracted from the various language specific resources. Lexical context patterns, which are generated from an unlabeled corpus of 10 million wordforms using an active learning technique, have been used for developing a baseline NER system as well as features of the classifiers in order to improve their performance. A number of post-processing techniques have been used in order to improve the performance of the classifiers. Finally, the classifiers are combined into a multi-engine NER system using three weighted voting techniques. The system has been trained and tested with datasets of 272K wordforms and 35K wordforms, respectively. Experimental results show the effectiveness of the proposed approach with the overall average Recall, Precision and F-Score values of 93.81%, 92.18% and 92.98%, respectively. The proposed system also outperforms the three other existing Bengali NER systems. The language independent versions of the ME, CRF and SVM based NER systems have been evaluated for four other popular Indian languages, namely Hindi, Telugu, Oriya and Urdu, with the datasets obtained from the NERSSEAL shared task data. The SVM based system yielded the best performance with the F-Score values of 76.35%, 72.65%, 69.34% and 65.66% for Hindi, Telugu, Oriya and Urdu, respectively.
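A minimal sketch of weighted per-token voting over the outputs of the three classifiers (ME, CRF, SVM) is given below; the weights are arbitrary placeholders, whereas the paper investigates three specific weighted voting techniques.

```python
# Weighted voting over per-token tag predictions from multiple classifiers (sketch).
from collections import defaultdict

def weighted_vote(predictions, weights):
    """predictions: dict classifier -> list of tags; weights: dict classifier -> float."""
    length = len(next(iter(predictions.values())))
    combined = []
    for i in range(length):
        scores = defaultdict(float)
        for name, tags in predictions.items():
            scores[tags[i]] += weights[name]          # each classifier casts a weighted vote
        combined.append(max(scores, key=scores.get))  # keep the highest-scoring tag
    return combined

preds = {"ME":  ["B-PER", "O", "B-LOC"],
         "CRF": ["B-PER", "O", "O"],
         "SVM": ["B-PER", "B-ORG", "B-LOC"]}
print(weighted_vote(preds, {"ME": 0.30, "CRF": 0.35, "SVM": 0.35}))
```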