We present results from our quantitative study of the statistical and network properties of literary and scientific texts written in two languages: English and Polish. We show that Polish texts are described by the Zipf law with a scaling exponent smaller than that for English. We also show that the scientific texts are typically characterized by rank-frequency plots with a relatively short range of power-law behavior as compared to the literary texts. We then transform the texts into their word-adjacency network representations and find another difference between the languages. For the majority of the literary texts in both languages, the corresponding networks reveal a scale-free structure, while this is not always the case for the scientific texts. However, all the network representations of the texts are hierarchical, and in this respect we do not observe any qualitative or quantitative difference between the languages. If we look at other network statistics, such as the clustering coefficient and the average shortest path length, the English texts turn out to possess a more clustered structure than the Polish ones. This result is attributed to differences in the grammar of the two languages, which are also visible in the Zipf plots. All the texts, however, show a network structure that differs from any of the Watts–Strogatz, Barabási–Albert, and Erdős–Rényi architectures.
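As a rough illustration of the kind of measurements compared above, the sketch below estimates a Zipf scaling exponent from the rank-frequency distribution and computes the clustering coefficient and average shortest path length of a word-adjacency network. It is a minimal sketch, not the authors' code: the tokenisation pattern, the fitting range, and the use of networkx are assumptions made here for illustration.

```python
import re
from collections import Counter

import numpy as np
import networkx as nx

def zipf_exponent(text, max_rank=1000):
    """Fit log(frequency) ~ -alpha * log(rank) over the top ranks (illustrative)."""
    words = re.findall(r"[a-ząćęłńóśźż']+", text.lower())
    freqs = np.array(sorted(Counter(words).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1)
    top = slice(0, min(max_rank, len(freqs)))
    slope, _ = np.polyfit(np.log(ranks[top]), np.log(freqs[top]), 1)
    return -slope  # estimated Zipf scaling exponent

def adjacency_network_stats(text):
    """Clustering coefficient and average shortest path of a word-adjacency network."""
    words = re.findall(r"[a-ząćęłńóśźż']+", text.lower())
    G = nx.Graph()
    G.add_edges_from(zip(words, words[1:]))              # link consecutive words
    G.remove_edges_from(list(nx.selfloop_edges(G)))      # drop immediate repetitions
    giant = G.subgraph(max(nx.connected_components(G), key=len))
    return nx.average_clustering(G), nx.average_shortest_path_length(giant)
```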
The syntax of different natural languages differs; hence the parsing of different natural languages also differs, leading to differences in the structures of their parsing trees. The reason that sentences in different natural languages can be translated into each other is that they have the same meaning. This paper discusses a new kind of sentence parsing, called semantic parsing, based on semantic units theory. It is a new theory in which a sentence of a natural language is not regarded as words and phrases arranged linearly; rather, it is taken to consist of semantic units with or without type parameters. It is a new parsing approach in which the syntax parsing tree and the semantic parsing tree are isomorphic, and a new approach in which the structure trees of sentences in all natural languages can be put into correspondence.
Since the end of last year, researchers at the Institute of Systems Science (ISS) have been considering a more ambitious project as part of the institute's multilingual programming objective. This project examines the domain of Chinese Business Letter Writing. With the problem defined as generating Chinese letters to meet business needs, our investigations suggest an intersection of three possible approaches: knowledge engineering, form processing, and natural language processing. This paper reports some of the findings and documents the design and implementation issues that have arisen and been tackled as prototyping work progresses.
The concept of K-stationarity of an environment in which a computer system (CS) works is introduced, and the wide prevalence of such environments is pointed out. Because such an environment, over a long period, produces only problems belonging to a small number K of stable types, it is enough for a CS to recognize the type of the current problem; the problem is then solved by an algorithm prepared in advance for that type. The problem of computer understanding of natural language (CUNL) can serve as an illustration. Any address made to a CS is a task, so a text analysis must reveal not its sense, as in known constructions, but only the data about the task. In a K-stationary environment, those tasks can relate only to K known types, which reduces the CUNL problem to recognizing, from the text, the proper algorithm (out of the K available) and the concrete values of its arguments. If a CS operates simultaneously in several subject areas (SA), then a stage of prerecognition of the SA must be added to the system. The CUNL systems in operation, which allow completely free and easy addressing within the available SAs, are described.
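As a toy illustration of the reduction described above (not the described system), the sketch below recognises which of K prepared task types a request belongs to and extracts the concrete argument values; the keyword patterns and handlers are invented placeholders.

```python
import re

# K prepared algorithms, one per stable task type (invented placeholders).
HANDLERS = {
    "balance":  lambda account: f"balance of account {account} requested",
    "transfer": lambda amount, account: f"transfer {amount} to account {account}",
}

# Keyword patterns used to recognise the task type and its arguments.
PATTERNS = {
    "balance":  re.compile(r"balance .*account (\w+)", re.I),
    "transfer": re.compile(r"transfer (\d+) .*account (\w+)", re.I),
}

def dispatch(request):
    """Recognise the task type and hand the extracted arguments to its algorithm."""
    for task_type, pattern in PATTERNS.items():
        match = pattern.search(request)
        if match:
            return HANDLERS[task_type](*match.groups())
    return "task type not among the K known types"

print(dispatch("Please transfer 200 into account A17"))
```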
Any system for natural language processing must be based on a lexicon. Once a model has been defined, there is the problem of acquiring and inserting words. This task is tedious for a human operator: on the one hand, he must not forget any of the words, and on the other hand, the acquisition of a new concept requires the input of a number of parameters.
In view of these difficulties, research work has been undertaken to integrate pre-existing “paper” dictionaries. Nevertheless, these are not faultless and are often incomplete when processing a highly specialized technical field. We have therefore sought to mitigate these problems by automating the enrichment of an already partially integrated lexicon.
We work in a technical field on which we have gathered different sorts of texts: written texts, specialist interviews, technical reports, etc. These documents are stored in an object-oriented database and form part of a wider project, called REX (“Retour d’EXpérience” in French, or “Feedback of Experience” in English).
Our system, called ANA, reads the documents, analyses them, and deduces new knowledge, allowing the enrichment of the lexicon. The group of words already integrated into the lexicon forms the “bootstrap” of the process of discovering new words: it collects instances of the different concepts thought to be interesting in order to gather semantic information. A special module makes it possible to avoid an explosion in the size of the database: it is responsible for forgetting certain instances and maintaining the database in such a way that the order in which the texts are introduced has no influence.
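The bootstrap idea described above can be pictured with a minimal sketch such as the following; the windowing, the frequency threshold standing in for the "forgetting" module, and the function names are assumptions made here, not the ANA implementation.

```python
from collections import Counter

def enrich_lexicon(documents, bootstrap, window=3, min_count=5):
    """Propose new lexicon entries from terms that co-occur with known words."""
    cooccurrence = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        for i, token in enumerate(tokens):
            if token in bootstrap:
                neighbourhood = tokens[max(0, i - window): i + window + 1]
                cooccurrence.update(t for t in neighbourhood if t not in bootstrap)
    # Dropping rare candidates plays the role of the "forgetting" module; because
    # counts are aggregated over the whole corpus, the text order has no influence.
    return {term for term, count in cooccurrence.items() if count >= min_count}
```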
This work presents a linguistic approach to the understanding of chaos. The idea comes from our work on translating into sounds and music the complexity of chaotic systems, such as Chua's attractors. Working with sounds, we have used a standardization criterion in order to detect only some of the features of the richness of chaos. This method involves the selection of a limited number of sounds, which, in turn, allow for the creation of other components of the system at different levels of organization, as in natural language. Thus, phonetic, morphological, syntactic and semantic levels, linked by a grammar of an "artificial chaotic language", are defined. Each linguistic unit of the artificial language presents networks of interrelated phenomena, since for each of them it is possible to detect its path length, tracing a graph of the mutual relationships of the system's components. We found that there is interaction among the different levels of the chaotic language. Furthermore, some traits of the dynamics of the evolution of language in human infants are found in the main routes to chaos. In this emergent dynamics, mutations, survival of the fittest, natural selection, and the relative distribution of linguistic entities in genetic landscapes are observed. On this basis, a possible bridge between the physical and the mental worlds may be developed.
In current multi-agent systems, the user typically interacts with a single agent at a time through relatively inflexible and modestly intelligent interfaces. As a consequence, these systems force users to submit only simplistic requests and suffer from problems such as the low-level nature of the system services offered to users, the weak reusability of agents, and the weak extensibility of the systems. In this paper, a framework for multi-agent systems called the open agent architecture (OAA), which reduces such problems, is discussed. The OAA is designed to handle complex requests that involve multiple agents. In some complex requests from users, the components of the request do not directly correspond to the capabilities of the various application agents, and therefore the system is required to translate the user's model of the task into the system's model before apportioning subtasks to the agents. To maximize users' efficiency in generating this type of complex request, the OAA offers an intelligent multi-modal user interface agent which supports a natural language interface with a mix of spoken language, handwriting, and gesture. The effectiveness of the OAA environment, including the intelligent distributed multi-modal interface, has been observed in our development of several practical multi-agent systems.
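As a generic illustration of the delegation idea (not the OAA API), the sketch below shows a facilitator routing the subtasks of an already-translated complex request to agents registered for the matching capabilities; the capability names and handlers are invented.

```python
class Facilitator:
    """Routes subtasks of a complex request to agents registered for them."""

    def __init__(self):
        self.registry = {}  # capability -> handler provided by some agent

    def register(self, capability, handler):
        self.registry[capability] = handler

    def handle(self, subtasks):
        """Apportion subtasks to the registered agents and gather their answers."""
        return [self.registry[capability](args)
                for capability, args in subtasks
                if capability in self.registry]

facilitator = Facilitator()
facilitator.register("lookup_email", lambda name: f"email address of {name}")
facilitator.register("schedule_meeting", lambda slot: f"meeting booked at {slot}")

# A complex multi-modal request, already translated into the system's model.
print(facilitator.handle([("lookup_email", "Kim"), ("schedule_meeting", "3pm")]))
```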
We aim at controlling the biases that exist in every description, in order to give the best possible image of one of the protagonists of an event. Starting from a supposedly complete set of propositions accounting for an event, we develop various argumentative strategies (insinuation, justification, reference to customary norms) to imply the facts that cannot be simply omitted but have the "wrong" orientation w.r.t. the protagonist we defend. By analyzing these different strategies, a contribution of this work is to provide a number of relevant parameters to take into account in developing and evaluating systems aiming at understanding natural language (NL) argumentations. The source of inspiration for this work is a corpus of 160 texts where each text describes a (different) car accident. Its result, for a given accident, is a set of first-order literals representing the essential facts of a description intended to defend one of the protagonists. An implementation in Answer Set Programming is underway. A couple of examples showing how to extract, from the same starting point, a defense for the two opposite sides are provided. Experimental validation of this work is in progress, and its first results are reported.
In order to respond to increasingly diverse requirements for network services, it would be best if the specifications could be defined by the users who actually have those requirements. If the users themselves are to define the specifications, the best case would be to have them use the natural languages that they are familiar and comfortable with. Since network services become visible to users through their terminals, when users write specifications for network services, those specifications may be defined in a form that includes specifications for the terminals in addition to the network services. In this paper, we propose a method of extracting network service specifications from natural-language descriptions made from various points of view. First we will organize the concepts of network services on the basis of a model for network services. Next we will extract a number of viewpoints from the relationships between the network service providers and the receivers, i.e. the terminals and users, and clarify the relationships between those viewpoints and the network-service concepts. On this basis, we will propose a method for describing network service specifications in a natural language and for understanding them.
This paper describes a temporal parser (TEMPO) introducing a semantic approach to the detection and analysis of quantifiers of time within simple texts. TEMPO processes expressions which implicitly contain a date, like "last Sunday", "two days ago", etc., and extracts the exact date. It uses a keyword-driven parsing technique to spot the quantifiers, then separates segments of the sentence around the keywords and passes them to a semantic DCG parser. Though developed for Greek, the presented scheme can easily be transferred to other languages. TEMPO might prove useful in the automatic processing of special kinds of documents which require the extraction of time information, such as applications, legal documents, contracts, etc.
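A hedged sketch of the kind of computation TEMPO performs is given below; the keyword table is a toy English analogue of the Greek patterns, chosen purely for illustration.

```python
import datetime
import re

def resolve_relative_date(expression, reference=None):
    """Map expressions like 'two days ago' or 'last Sunday' to an exact date."""
    reference = reference or datetime.date.today()
    numbers = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
    weekdays = ["monday", "tuesday", "wednesday", "thursday",
                "friday", "saturday", "sunday"]
    expr = expression.lower()

    match = re.search(r"(\w+) days? ago", expr)
    if match:
        n = numbers.get(match.group(1)) or int(match.group(1))
        return reference - datetime.timedelta(days=n)

    match = re.search(r"last (\w+)", expr)
    if match and match.group(1) in weekdays:
        delta = (reference.weekday() - weekdays.index(match.group(1))) % 7 or 7
        return reference - datetime.timedelta(days=delta)

    return None  # in the real system, hand over to the semantic DCG parser

print(resolve_relative_date("two days ago", datetime.date(2024, 5, 10)))  # 2024-05-08
```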
This paper considers the fluctuation analysis methods of Taylor and Ebeling & Neiman. While both have been applied to various phenomena in the statistical mechanics domain, their similarities and differences have not been clarified. After considering their analytical aspects, this paper presents a large-scale application of these methods to text. It is found that both methods can distinguish real text from independently and identically distributed (i.i.d.) sequences. Furthermore, it is found that the Taylor exponents acquired from words can roughly distinguish text categories; this is also the case for the Ebeling & Neiman exponents, but to a lesser extent. Additionally, both methods show some ability to capture the kind of script.
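For concreteness, the following minimal sketch computes a Taylor exponent for a word sequence; the window size, the tokenisation, and the fitting procedure are assumptions made here rather than the exact setup of the paper.

```python
from collections import Counter

import numpy as np

def taylor_exponent(words, window=5000):
    """Slope of log(std) vs. log(mean) of per-word counts over fixed-size windows."""
    windows = [Counter(words[i:i + window])
               for i in range(0, len(words) - window + 1, window)]
    means, stds = [], []
    for word in set(words):
        counts = np.array([w[word] for w in windows], dtype=float)
        if counts.mean() > 0 and counts.std() > 0:
            means.append(counts.mean())
            stds.append(counts.std())
    slope, _ = np.polyfit(np.log(means), np.log(stds), 1)
    return slope  # ~0.5 for an i.i.d. shuffle, typically larger for real text
```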
Written language is a complex communication signal capable of conveying information encoded in the form of ordered sequences of words. Beyond the local order ruled by grammar, semantic and thematic structures affect long-range patterns in word usage. Here, we show that a direct application of information theory quantifies the relationship between the statistical distribution of words and the semantic content of the text. We show that there is a characteristic scale, roughly around a few thousand words, which establishes the typical size of the most informative segments in written language. Moreover, we find that the words whose contribution to the overall information is larger are the ones more closely associated with the main subjects and topics of the text. This scenario can be explained by a model of word usage that assumes that words are distributed along the text in domains of a characteristic size where their frequency is higher than elsewhere. Our conclusions are based on the analysis of a large database of written language, diverse in subjects and styles, and are thus likely to be applicable to general language sequences encoding complex information.
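The segment-based picture described above can be made concrete with the sketch below, which computes each word's contribution to the mutual information between words and equal-sized segments; the segmentation scheme and base-2 logarithms are assumptions made here for illustration.

```python
from collections import Counter

import numpy as np

def word_segment_information(words, segment_size):
    """Per-word contribution to the mutual information between words and segments."""
    n_segments = max(1, len(words) // segment_size)
    segments = [Counter(words[i * segment_size:(i + 1) * segment_size])
                for i in range(n_segments)]
    total = sum(sum(seg.values()) for seg in segments)
    info = {}
    for word in set(words):
        p_w = sum(seg[word] for seg in segments) / total
        contribution = 0.0
        for seg in segments:
            p_wj = seg[word] / total          # joint probability of (word, segment)
            p_j = sum(seg.values()) / total   # probability of the segment
            if p_wj > 0:
                contribution += p_wj * np.log2(p_wj / (p_w * p_j))
        info[word] = contribution  # topic-bound words score highest
    return info
```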
Zadeh introduced the concept of Z-numbers in 2011 to deal with imprecise information. In this regard, many research works have been published in an attempt to introduce basic theoretical concepts of Z-numbers for modeling real-world problems. To understand the current challenges in dealing with Z-numbers and the feasibility of using Z-numbers to solve real-world problems, a comprehensive review of the existing work on Z-numbers is paramount. This paper gives an overview of the existing literature on Z-numbers and identifies some of the key areas that require further improvement.
The information coded in natural language is called natural language information. It can be employed to analyze risks by computing with words. The disjunction and Cartesian product of fuzzy sets are the basic operations used to compute a probability distribution representing the random uncertainty of the risk source. An approach to inferring a risk with words is to represent the risk system by probabilistic and possibilistic constraints. In this paper, using fuzzy logic, we give an example to verify that the suggested approach is more flexible and effective.
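As a minimal numerical illustration of the two operations mentioned above (with invented membership values), the disjunction of fuzzy sets is taken pointwise as a maximum and the Cartesian product pointwise as a minimum:

```python
def fuzzy_disjunction(a, b):
    """A ∪ B on a common universe: pointwise maximum of memberships."""
    return {x: max(a.get(x, 0.0), b.get(x, 0.0)) for x in set(a) | set(b)}

def fuzzy_cartesian_product(a, b):
    """A × B: the membership of a pair is the minimum of the two memberships."""
    return {(x, y): min(ma, mb) for x, ma in a.items() for y, mb in b.items()}

low = {"0 losses": 1.0, "1 loss": 0.6, "2 losses": 0.1}       # invented memberships
moderate = {"1 loss": 0.4, "2 losses": 1.0, "3 losses": 0.4}  # invented memberships
print(fuzzy_disjunction(low, moderate))
print(fuzzy_cartesian_product(low, moderate))
```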
This paper will focus on linguistic constructions involving intensional emotive predicates in order to assess the computational complexity of such constructions based on the features of the intensional elements they contain. It will first be shown that linguistic constructions vary as a function of the nature of the intensional elements involved. The important insights derived from these facts will then be carried forward for an exploration of the computational complexity of intensional emotive constructions, which can be seen to co-vary on complexity scales with the number of intensional elements or the features of intensional elements. Possible ramifications emanating from this for the cognitive tractability of semantic contents and for natural language processing will be offered at the end of the paper.
Current Requirements Engineering research must face the need to deal with the increasing scale of today's requirement specifications. One important and recent research direction is assuring consistency between large-scale specifications and the many additional regulations (e.g. national and international norms and standards) which the specifications must consider or satisfy. For example, the specification volume for a single electronic control unit (ECU) in the automotive domain amounts to 3000 to 5000 pages distributed over 30 to 300 individual documents (specifications and regulations).
In this work, we present an approach to automatically classify the requirements in a set of specification documents and regulations into content topics in order to improve review activities aimed at identifying cross-document inconsistencies. An essential success criterion for this approach from an industrial perspective is sufficient classification quality with minimal manual effort.
In this paper, we show the results of an evaluation in the domain of automotive specifications at Mercedes-Benz passenger cars. The results show that one manually classified specification is sufficient to derive automatic classifications for other documents within this domain with satisfactory recall and precision. Thus, the approach of using content topics is not only effective but also efficient in large-scale industrial environments.
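A hedged sketch of such a setup is shown below; the topic labels, example requirements, and the TF-IDF/naive-Bayes pipeline are assumptions made here for illustration, not the classifier used in the evaluation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Requirements from the one manually classified specification (invented examples).
train_requirements = [
    "The ECU shall report the battery voltage every 100 ms.",
    "Diagnostic messages shall conform to the UDS standard.",
    "The housing shall withstand temperatures from -40 to 85 degrees Celsius.",
]
train_topics = ["monitoring", "diagnostics", "environment"]

classifier = make_pipeline(TfidfVectorizer(), MultinomialNB())
classifier.fit(train_requirements, train_topics)

# A requirement taken from another document of the same domain.
print(classifier.predict(["The control unit must tolerate ambient heat up to 85 degrees."]))
```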
Parsing is an important process in Natural Language Processing (NLP) and Computational Linguistics, used to understand the syntax and semantics of natural language sentences with respect to a grammar. Parsing models need both syntactic and semantic coverage for better interpretation of natural language sentences. Although statistical parsing with trigram language models gives good performance through trigram probabilities and a large vocabulary size, it has disadvantages such as weak support for syntax, free word order, and long-distance relationships, which are the challenging features of the Tamil language. Grammar-based structural parsing provides solutions to some extent. To overcome these disadvantages, a structural component has to be incorporated into the statistical approach, which results in hybrid models such as phrase-structure and dependency models. To add the structural component, balance the vocabulary size, and meet the challenging features, lexicalized and statistical parsing (LSP) is employed with the assistance of hybrid models. To incorporate all the features of complex and large sentences, a phrase-structure model may not be suitable to a large extent; when dependency relations are applied among words, direct relationships can be established. Lexicalized and statistical parsing of Tamil text using a dependency model therefore gives better performance than a phrase-structure model. New part-of-speech (POS) and dependency tag sets for the Tamil language have been developed, and a Treebank has been built with 326 sentences comprising more than 5000 manually annotated words. It has been extended to 1000 sentences using bootstrapping and manual correction and used to train the dependency model. This LSP with a dependency model provides better results and covers all the features of the Tamil language.
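As a toy illustration of lexicalized statistical dependency attachment (far simpler than the LSP system described, and with an invented miniature treebank), the sketch below estimates head-dependent counts from annotated sentences and greedily assigns each word its most frequently observed head.

```python
from collections import Counter

# Miniature "treebank": each sentence is a list of (word, head_word) pairs,
# with "ROOT" marking the sentence root (transliterated toy examples).
treebank = [
    [("raman", "vanthan"), ("veettukku", "vanthan"), ("vanthan", "ROOT")],
    [("sita", "padithal"), ("puthagam", "padithal"), ("padithal", "ROOT")],
]

attachment_counts = Counter()
for sentence in treebank:
    for word, head in sentence:
        attachment_counts[(word, head)] += 1

def parse(words):
    """Assign each word the head it was most often attached to in the treebank."""
    return {word: max([w for w in words if w != word] + ["ROOT"],
                      key=lambda head: attachment_counts[(word, head)])
            for word in words}

print(parse(["raman", "veettukku", "vanthan"]))
# {'raman': 'vanthan', 'veettukku': 'vanthan', 'vanthan': 'ROOT'}
```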
We propose an interactive approach to data mining, understood as the derivation of linguistic summaries of databases. For interactively formulating the linguistic summaries, and then for searching the database, we employ Kacprzyk and Zadrożny's [6-11] fuzzy querying add-on, FQUERY for Access. We present an implementation for the derivation of linguistic summaries of sales data at a computer retailer.
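A short sketch of the kind of linguistic summary involved is given below; the membership functions, the fuzzy quantifier "most", and the sales figures are illustrative assumptions, not taken from FQUERY for Access.

```python
def mu_high(amount):
    """Membership of a sale amount in the fuzzy term 'high' (illustrative ramp)."""
    return min(1.0, max(0.0, (amount - 1000) / 2000))

def mu_most(proportion):
    """Fuzzy linguistic quantifier 'most' (illustrative ramp)."""
    return min(1.0, max(0.0, (proportion - 0.3) / 0.5))

sales = [500, 1500, 2800, 3200, 900, 2600]  # invented sales figures
proportion_high = sum(mu_high(s) for s in sales) / len(sales)
truth = mu_most(proportion_high)
print(f"'Most sales are high' holds to degree {truth:.2f}")
```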
The aim of this chapter is to refine some questions regarding AI and to provide partial answers to them. We analyze the state of the art in designing intelligent systems that are able to mimic complex human activities, including acts based on artificial consciousness. The analysis is performed to contrast human cognition and behavior with the similar processes in AI systems, and it includes elements of psychology, sociology, and communication science related to humans and lower-level beings. The second part of this chapter is devoted to human-human and man-machine communication as related to intelligence. We emphasize that relational aspects constitute the basis for the perception, knowledge, semiotic, and communication processes. Several consequences are derived. Subsequently, we deal with the tools needed to endow machines with intelligence and discuss the roles of knowledge and data structures. The results could help in building "sensitive and intelligent" machines.