Machine learning (ML) architectures based on neural models have garnered considerable attention in the field of language classification. Code-mixing, the practice of mixing two or more languages within a single text, is a common phenomenon on social networking sites when users express opinions on a topic. This paper describes the application of a code-mixing index to Indian social media texts and compares the complexity of identifying the language at the word level using a Bi-directional Long Short-Term Memory (Bi-LSTM) model. The major contribution of the work is a technique for identifying the language of Hindi–English code-mixed data from three social media platforms, namely Facebook, Twitter and WhatsApp. We demonstrate that the Bi-LSTM model is capable of learning and accurately predicting the languages used in these social media texts.
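For a concrete picture of the model class, the following is a minimal sketch of word-level language tagging with a Bi-directional LSTM in PyTorch; the vocabulary, the tagset {hi, en, other} and all hyperparameters are illustrative placeholders, not those used in the paper.

```python
# Minimal Bi-LSTM word-level language tagger (untrained toy example).
import torch
import torch.nn as nn

TAGS = ["hi", "en", "other"]

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=64, num_tags=len(TAGS)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_tags)  # forward + backward states

    def forward(self, token_ids):            # (batch, seq_len)
        h, _ = self.lstm(self.embed(token_ids))
        return self.out(h)                   # (batch, seq_len, num_tags)

# Toy usage: tag each word of a code-mixed sentence.
vocab = {"<unk>": 0, "yaar": 1, "this": 2, "movie": 3, "bahut": 4, "accha": 5, "hai": 6}
sentence = ["yaar", "this", "movie", "bahut", "accha", "hai"]
ids = torch.tensor([[vocab.get(w, 0) for w in sentence]])

model = BiLSTMTagger(vocab_size=len(vocab))
pred = model(ids).argmax(dim=-1)[0]          # untrained, so tags are arbitrary
print(list(zip(sentence, [TAGS[int(i)] for i in pred])))
```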
Transliteration is the process of mapping the characters of one language to the characters of another language based on their phonetics. India is linguistically diverse, and although people speak many different languages, it can be difficult for them to read the scripts of languages other than their own. In such situations the transliteration process plays a major role. It supports various Natural Language Processing (NLP) applications, such as information retrieval, machine translation and speech recognition, that enable computers to process natural language much as a human interprets it, and it helps in rendering technical terms and proper names from one language in another. Transliteration work has been carried out for languages such as Japanese, Chinese and English, but for Indian languages, especially Tamil, very little notable work has been reported. In this paper, the transliteration process is carried out on Unicode Tamil characters. A phonetics-based forward list processing approach is implemented for transliterating from English to Tamil, which yields promising results.
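A minimal sketch of the greedy longest-match ("forward list") idea, assuming a tiny illustrative Roman-to-Tamil mapping table rather than the paper's actual rule set:

```python
# Greedy forward scan: always consume the longest matching phonetic unit.
ROMAN_TO_TAMIL = {
    "ka": "க", "ma": "ம", "la": "ல", "ra": "ர", "na": "ந",
    "k": "க்", "m": "ம்", "l": "ல்", "r": "ர்", "n": "ன்",
    "a": "அ", "i": "இ", "u": "உ",
}
MAX_KEY = max(len(k) for k in ROMAN_TO_TAMIL)

def transliterate(word: str) -> str:
    out, i = [], 0
    while i < len(word):
        # Prefer the longest phonetic unit that matches at position i.
        for size in range(min(MAX_KEY, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in ROMAN_TO_TAMIL:
                out.append(ROMAN_TO_TAMIL[chunk])
                i += size
                break
        else:
            out.append(word[i])   # pass through unmapped characters
            i += 1
    return "".join(out)

print(transliterate("kamal"))   # -> கமல்
```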
In Korean text these days, the use of English words, with or without phonetic translation, is growing rapidly. To make matters worse, the Korean transliteration of an English word may vary greatly. The mixed use of English words and their various transliterations in the same document or document collection may cause severe word mismatch problems in Korean information retrieval. There are two possible approaches to tackling this problem: the transliteration method and the back-transliteration method. We argue that our newly proposed transliteration approach is more advantageous for resolving the word mismatch problem than the previously proposed back-transliteration approach. Our information retrieval experiments support this argument.
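To illustrate how a transliteration resource can mitigate word mismatch at query time, here is a minimal sketch with an invented two-entry dictionary; the paper's actual method and resources differ.

```python
# Query expansion with known transliteration variants (toy dictionary).
TRANSLIT = {
    "computer": ["컴퓨터"],
    "digital": ["디지털", "디지탈"],   # multiple variants appear in real text
}

def expand_query(terms):
    """Add every known transliteration variant of each query term."""
    expanded = []
    for t in terms:
        expanded.append(t)
        expanded.extend(TRANSLIT.get(t.lower(), []))
    return expanded

print(expand_query(["digital", "computer"]))
# -> ['digital', '디지털', '디지탈', 'computer', '컴퓨터']
```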
A method to automatically extract translational Japanese KATAKANA and English word pairs from bilingual corpora is proposed. The method applies all the existing back-transliteration rules to each mora unit in a KATAKANA word, and extracts as a translation the English word which matches or partially matches one of these back-transliteration candidates. The mora unit is a Japanese syllable unit, and one KATAKANA character often corresponds to one mora. For instance, if we have グラフ in the Japanese part of a bilingual corpus, we generate such back-transliteration candidates as <graf>, <graph>, <gulerph>, … and identify similar words in the English part of the corpus. The method performs reasonably well, achieving 80%–100% precision at 75% recall against the eight corpora we used for evaluation.
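A minimal sketch of the candidate-generation step, assuming a toy mora-to-Roman rule table; the real rule set and the partial-match scoring are richer than this.

```python
# Enumerate back-transliteration candidates mora by mora, then match.
from itertools import product

# Each KATAKANA mora maps to one or more possible Roman renderings.
MORA_RULES = {
    "グ": ["g", "gu"],
    "ラ": ["ra", "la"],
    "フ": ["f", "fu", "ph"],
}

def candidates(katakana: str):
    """Enumerate all back-transliteration candidates of a KATAKANA word."""
    options = [MORA_RULES.get(ch, [ch]) for ch in katakana]
    return {"".join(parts) for parts in product(*options)}

def match(katakana: str, english_words):
    cands = candidates(katakana)
    return [w for w in english_words if w.lower() in cands]

print(sorted(candidates("グラフ")))           # includes 'graf' and 'graph'
print(match("グラフ", ["graph", "table"]))    # -> ['graph']
```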
In this paper, we describe a Korean transliterated foreign word extraction algorithm. In the proposed method, we reformulate the foreign word extraction problem as a syllable-tagging problem in which each syllable is tagged as either a foreign syllable or a pure Korean syllable. Syllable sequences of Korean strings are modelled by a Hidden Markov Model whose states represent syllables with a binary mark indicating whether the syllable is part of a transliterated foreign word or not. The proposed method extracts transliterated foreign words with high recall and precision. Moreover, our method shows good performance even with small training corpora.
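A minimal sketch of the syllable-tagging idea using Viterbi decoding over two states, foreign (F) and pure Korean (K); all probabilities and the emission model below are illustrative placeholders, since a real system estimates them from a training corpus.

```python
# Two-state HMM tagging of syllables: F = foreign, K = pure Korean.
import math

STATES = ["F", "K"]
START = {"F": 0.3, "K": 0.7}
TRANS = {"F": {"F": 0.8, "K": 0.2}, "K": {"F": 0.1, "K": 0.9}}

def emit(state, syllable):
    # Placeholder emission model: pretend a few syllables look "foreign".
    foreign_like = {"컴", "퓨", "터"}
    p = 0.8 if (syllable in foreign_like) == (state == "F") else 0.2
    return math.log(p)

def viterbi(syllables):
    """Return the most probable F/K tag sequence for the syllables."""
    v = [{s: math.log(START[s]) + emit(s, syllables[0]) for s in STATES}]
    back = []
    for syl in syllables[1:]:
        row, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: v[-1][p] + math.log(TRANS[p][s]))
            row[s] = v[-1][best] + math.log(TRANS[best][s]) + emit(s, syl)
            ptr[s] = best
        v.append(row)
        back.append(ptr)
    # Trace back the best path.
    tag = max(STATES, key=lambda s: v[-1][s])
    path = [tag]
    for ptr in reversed(back):
        tag = ptr[tag]
        path.append(tag)
    return list(reversed(path))

syllables = ["컴", "퓨", "터", "가", "격"]   # "컴퓨터" (computer) + "가격" (price)
print(list(zip(syllables, viterbi(syllables))))
```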
This article addresses the problem of standard Romanization of Arabic names using undiacritized Arabic forms and their corresponding non-standard Romanizations. The Romanization of Arabic names has long been studied and standardized, yet huge non-standard databases of Romanized Arabic names exist and are in use in many private and government agencies. Examples of such applications are passport holder name databases, phone directories and geographic name databases. Dealing with such databases can be inefficient and can produce inconsistent results; converting them to their standard Romanization can help solve these problems.
In this paper, we present an efficient algorithmic software implementation that produces standard Romanizations of Arabic names by utilizing the hints present in the existing non-standard Romanized databases. The results of the software implementation have proven to be very promising.
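A minimal sketch of the hint-driven idea: each undiacritized Arabic letter admits several candidate standard renderings (short vowels are unwritten), and the non-standard spelling on file serves as a hint for choosing among the full candidates. The tables are tiny invented fragments, not the standard's rule set.

```python
# Pick the standard candidate closest to the non-standard spelling on file.
from difflib import SequenceMatcher
from itertools import product

CANDIDATES = {
    "م": ["m", "ma", "mu", "mo"],   # vowel unknown in undiacritized text
    "ح": ["h", "ha"],
    "د": ["d"],
}

def romanize(arabic: str, hint: str) -> str:
    options = [CANDIDATES.get(ch, [ch]) for ch in arabic]
    forms = ("".join(parts) for parts in product(*options))
    # Score each full candidate against the non-standard hint.
    return max(forms, key=lambda f: SequenceMatcher(None, f, hint.lower()).ratio())

print(romanize("محمد", "Mohamad"))   # -> 'mohamad' with these toy tables
```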
In this paper, we present methods for transliteration and back-transliteration. In Korean technical documents and web documents, many English and Japanese words are transliterated into Korean. These transliterated words are usually technical terms and proper nouns, so they are hard to find in a dictionary, and an automatic transliteration system is therefore needed. Previous transliteration models restrict the usable context to two or three letters per source letter. However, most transliteration phenomena cannot be explained by a single standard rule, especially in Korean: factors such as the origin of a word and the profession of its users shape each transliteration, and restricting the context length may discard the discriminative information of each transliteration rule. In this paper, we propose methods that find similar words having the longest overlap with an input word. To find similar words without losing any transliteration rule, phoneme chunks without a length limit are used, and an input word is transliterated by merging phoneme chunks. With the proposed method, we obtained 86% character accuracy and 53% word accuracy in an English-to-Korean transliteration test.
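A minimal sketch of chunk-based matching, assuming a toy English-to-Korean chunk table; the paper learns its phoneme chunks from a corpus and merges them with more sophisticated scoring.

```python
# Unbounded-length chunks: consume the longest matching chunk first.
CHUNKS = {
    "data": "데이터", "inter": "인터", "net": "넷",
    "com": "컴", "puter": "퓨터", "base": "베이스",
}

def transliterate(word: str) -> str:
    out, i = [], 0
    while i < len(word):
        for end in range(len(word), i, -1):      # longest chunk first
            if word[i:end] in CHUNKS:
                out.append(CHUNKS[word[i:end]])
                i = end
                break
        else:
            i += 1                               # skip uncovered letters
    return "".join(out)

print(transliterate("internet"))   # -> 인터넷
print(transliterate("database"))   # -> 데이터베이스
```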
One unique challenge in Chinese Language Processing is cross-strait named entity recognition. Because the PRC and Taiwan have adopted different transliteration strategies, the transliterations of foreign names can vary greatly between the two, creating difficulties for NLP tasks including data mining, translation and information retrieval. In this paper, we introduce a novel approach to the automatic extraction of divergent transliterations of foreign named entities that bootstraps co-occurrence statistics from tagged Chinese corpora, thereby producing higher-precision extractions.
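A minimal sketch of the underlying intuition, pairing candidate names by the similarity of their sentence contexts across the two corpora; the data, the cosine scoring and the prior script normalization are illustrative assumptions, not the paper's bootstrapping procedure.

```python
# Score a candidate name pair by the overlap of their co-occurring words.
from collections import Counter

def context_profile(corpus, name):
    """Count the words that co-occur with `name` in tagged sentences."""
    prof = Counter()
    for sent in corpus:
        if name in sent:
            prof.update(w for w in sent if w != name)
    return prof

def cosine(p, q):
    dot = sum(p[w] * q[w] for w in p)
    norm = (sum(v * v for v in p.values()) * sum(v * v for v in q.values())) ** 0.5
    return dot / norm if norm else 0.0

# "Obama": PRC transliteration 奥巴马 vs. Taiwan transliteration 欧巴马
# (shown after conversion of both corpora to simplified script).
prc = [["奥巴马", "总统", "访问"], ["奥巴马", "演讲"]]
twn = [["欧巴马", "总统", "演讲"], ["欧巴马", "访问"]]

score = cosine(context_profile(prc, "奥巴马"), context_profile(twn, "欧巴马"))
print(round(score, 2))   # a high score suggests the two names co-refer
```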
Romanization is used to phonetically render names and technical terms from languages written in non-Roman alphabets into languages written in Roman alphabets. Because dictionaries contain standard English forms for only some Arabic names, this problem has also been addressed using machine transliteration. Several programs exist to deal with transliteration; they are based either on a dictionary-based approach or on a rule-based approach. In this study, a comparison between these two approaches is presented, using test data from the Yarmouk University library. The results show that while a rule-based Romanizer can romanize all names, a dictionary-based Romanizer romanizes 86% of the tested names. A further test was performed on the Romanization rules used by each Romanizer; it shows that, in terms of accuracy and usability, the rules used by the dictionary-based Romanizer in this study are better than those used by the rule-based Romanizer.
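A minimal sketch contrasting the two approaches, combining a dictionary lookup with a rule-based fallback so that every name receives some Romanization; both tables are invented fragments for illustration.

```python
# Dictionary-first Romanization with a rule-based fallback.
DICTIONARY = {"محمد": "Muhammad", "خالد": "Khalid"}

RULES = {"م": "m", "ح": "h", "د": "d", "خ": "kh", "ا": "a", "ل": "l", "ع": "'"}

def rule_romanize(name: str) -> str:
    """Letter-by-letter rule application (no vowel recovery)."""
    return "".join(RULES.get(ch, ch) for ch in name)

def romanize(name: str) -> str:
    # Prefer the curated dictionary entry; fall back to the rules.
    return DICTIONARY.get(name) or rule_romanize(name)

print(romanize("محمد"))   # dictionary hit  -> 'Muhammad'
print(romanize("عماد"))   # dictionary miss -> rule output "'mad"
```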