  • CO-OCCURRENCE NETWORK OF REUTERS NEWS

    Networks describe various complex natural systems, including social systems. We investigate the social network of co-occurrence in the Reuters-21578 corpus, which consists of news articles that appeared on the Reuters newswire in 1987. People are represented as vertices, and two persons are connected if they co-occur in the same article. The network has small-world features with a power-law degree distribution. The network is disconnected, and the component size distribution has power-law characteristics. Community detection on a degree-reduced network provides meaningful communities. An edge-reduced network, which contains only the strong ties, has a star topology. The "importance" of persons is investigated. The network reflects the situation in 1987; twenty years later, the importance of these people can be judged more reliably. A number of ranking algorithms, including citation count and PageRank, are used to assign ranks to vertices. The ranks given by the algorithms are compared against how well a person is represented in Wikipedia. We find Spearman's rank correlations of up to medium strength. A noteworthy finding is that PageRank consistently performed worse than the other algorithms; we analyze this further and identify the reasons.
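
    As a rough illustration of the pipeline this abstract describes (not the authors' code), the sketch below builds a small person co-occurrence graph with networkx, ranks vertices by PageRank and by weighted degree (a simple alternative ranking), and compares each ranking against invented Wikipedia-based importance scores using Spearman's rank correlation; all names, articles and scores are hypothetical.

    ```python
    # Minimal sketch of a co-occurrence network pipeline; data are invented.
    from itertools import combinations

    import networkx as nx
    from scipy.stats import spearmanr

    # Each article reduced to the set of person names it mentions (hypothetical).
    articles = [
        {"reagan", "gorbachev"},
        {"reagan", "baker"},
        {"reagan", "gorbachev", "thatcher"},
    ]

    G = nx.Graph()
    for people in articles:
        for a, b in combinations(sorted(people), 2):
            # Edge weight counts how many articles the two persons share.
            w = G.get_edge_data(a, b, {"weight": 0})["weight"] + 1
            G.add_edge(a, b, weight=w)

    pagerank = nx.pagerank(G, weight="weight")
    degree = dict(G.degree(weight="weight"))

    # Hypothetical external importance scores (e.g., from Wikipedia coverage).
    wiki_score = {"reagan": 100, "gorbachev": 80, "thatcher": 70, "baker": 30}

    names = sorted(G.nodes())
    for label, scores in [("PageRank", pagerank), ("weighted degree", degree)]:
        rho, _ = spearmanr([scores[n] for n in names],
                           [wiki_score[n] for n in names])
        print(f"Spearman rho ({label} vs Wikipedia): {rho:.3f}")
    ```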

  • Crowdsourcing-Based Evaluation of Automatic References Between WordNet and Wikipedia

    The paper presents an approach to building references (also called mappings) between WordNet and Wikipedia. We propose four algorithms for the automatic construction of the references. Then, based on an aggregation algorithm, we produce an initial set of mappings, which is evaluated in a cooperative way. For that purpose, we implement a system that distributes evaluation tasks to be solved by the user community; to make the tasks more attractive, we embed them into a game. Results show that the initial mappings are of good quality and that they are further improved by the community. As a result, we deliver a high-quality dataset of mappings between two lexical repositories, WordNet and Wikipedia, which can be used in a wide range of NLP tasks. We also show that the framework for collaborative validation can be used in other tasks that require human judgments.
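
    The sketch below is a hypothetical illustration (not the paper's system) of the aggregation step: candidate WordNet-to-Wikipedia mappings proposed by several algorithms are combined by a simple vote, and the less certain candidates are set aside for the community evaluation described above; the synset and article identifiers are invented.

    ```python
    # Hypothetical vote-based aggregation of candidate mappings.
    from collections import Counter

    candidates_per_algorithm = [
        {("dog.n.01", "Dog"), ("bank.n.01", "Bank")},
        {("dog.n.01", "Dog"), ("bank.n.01", "Bank_(geography)")},
        {("dog.n.01", "Dog")},
        {("bank.n.01", "Bank")},
    ]

    votes = Counter(pair for proposals in candidates_per_algorithm for pair in proposals)

    # Accept a mapping if at least half of the algorithms proposed it; the rest
    # would go to the crowdsourced (game-based) evaluation described above.
    threshold = len(candidates_per_algorithm) / 2
    initial_mappings = {pair for pair, n in votes.items() if n >= threshold}
    uncertain = {pair for pair, n in votes.items() if n < threshold}

    print("accepted:", initial_mappings)
    print("to be verified by the community:", uncertain)
    ```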

  • A WIKIPEDIA-BASED FRAMEWORK FOR COLLABORATIVE SEMANTIC ANNOTATION

    The Semantic Web aims to automate web data-processing tasks that today only humans are able to do. To make this vision a reality, the information on web resources must be described in a way that computers can interpret, in a process known as semantic annotation. In this paper, a manual, collaborative semantic annotation framework is described. It is designed to take advantage of the benefits of manual annotation systems (such as the possibility of annotating formats that are difficult to annotate automatically) while addressing some of their limitations (reducing the burden on non-expert annotators). The framework is inspired by two principles: use Wikipedia as a facade for a formal ontology, and integrate the semantic annotation task with common user actions such as web search. The tools in the framework have been implemented, and empirical results obtained in experiments carried out with these tools are reported.
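
    A minimal, purely illustrative sketch of the "Wikipedia as a facade" principle follows (not the framework's actual API): a non-expert annotator picks a Wikipedia page for a web resource, and the system records the corresponding formal ontology concept behind the scenes; the lookup table and URIs are assumptions made up for the example.

    ```python
    # Illustrative only: Wikipedia pages as a facade for formal ontology concepts.
    from dataclasses import dataclass

    # Hypothetical mapping from Wikipedia pages to formal ontology URIs
    # (in practice this could come from a resource such as DBpedia).
    WIKIPEDIA_TO_ONTOLOGY = {
        "Jaguar": "http://example.org/ontology#Jaguar_Animal",
        "Jaguar Cars": "http://example.org/ontology#Jaguar_CarManufacturer",
    }

    @dataclass
    class SemanticAnnotation:
        resource_url: str    # the web resource being annotated
        wikipedia_page: str  # what the human annotator actually picks
        ontology_uri: str    # the formal concept the system stores

    def annotate(resource_url: str, wikipedia_page: str) -> SemanticAnnotation:
        """Turn a user's Wikipedia choice into a formal semantic annotation."""
        return SemanticAnnotation(
            resource_url=resource_url,
            wikipedia_page=wikipedia_page,
            ontology_uri=WIKIPEDIA_TO_ONTOLOGY[wikipedia_page],
        )

    print(annotate("http://example.org/photos/cat42.jpg", "Jaguar"))
    ```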

  • Automatic Extraction of Semantic Relations from Wikipedia

    We introduce a novel approach to extract semantic relations (e.g., is-a and part-of relations) from Wikipedia articles. These relations are used to build up a large and up-to-date thesaurus providing background knowledge for tasks such as determining semantic ontology mappings. Our automatic approach uses a comprehensive set of semantic patterns, finite state machines and NLP techniques to extract millions of relations between concepts. An evaluation for different domains shows the high quality and effectiveness of the proposed approach. We also illustrate the value of the newly found relations for improving existing ontology mappings.
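
    As a toy illustration of pattern-based extraction (far simpler than the semantic patterns, finite state machines and NLP pipeline described above), the sketch below applies two lexico-syntactic patterns, written as plain regular expressions, to definition-like sentences to propose is-a and part-of candidates.

    ```python
    # Toy pattern-based extraction of is-a and part-of relation candidates.
    import re

    PATTERNS = [
        # "X is a/an Y"      ->  is-a(X, Y)
        (re.compile(r"^(?P<x>[\w ]+?) is an? (?P<y>[\w ]+?)[.,]"), "is-a"),
        # "X is part of Y"   ->  part-of(X, Y)
        (re.compile(r"^(?P<x>[\w ]+?) is part of (?P<y>[\w ]+?)[.,]"), "part-of"),
    ]

    sentences = [
        "A violin is a string instrument, usually with four strings.",
        "The cerebellum is part of the brain.",
    ]

    for sentence in sentences:
        for pattern, relation in PATTERNS:
            m = pattern.search(sentence)
            if m:
                print(f"{relation}({m.group('x').strip()}, {m.group('y').strip()})")
    ```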

  • TOLERANCE ROUGH SET BASED ATTRIBUTE EXTRACTION APPROACH FOR MULTIPLE SEMANTIC KNOWLEDGE BASE INTEGRATION

    In the integration of multiple semantic knowledge bases (SKBs), the inconsistency of items, or of their attributes, across different SKBs remains an open challenge for researchers. To address this issue, this paper presents an innovative approach based on extracting common class attributes and establishing unified category-attribute templates. Because selecting a specific attribute from numerous candidates involves the uncertainty and vagueness inherent in semantic analysis, tolerance rough set (TRS) techniques are applied to construct class-attribute templates from online SKBs. Attribute extraction is carried out with statistical techniques and integrated into the TRS framework. Finally, experiments are conducted on randomly selected categories. Experimental results show the effectiveness of the proposed approach.
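
    The following sketch is a deliberately simplified illustration of the tolerance rough set idea (not the paper's algorithm): candidate attributes are grouped by a tolerance relation (similarity above a threshold), and the upper approximation of a seed attribute set serves as the unified class-attribute template; the attribute names, similarities and threshold are invented.

    ```python
    # Simplified tolerance-rough-set grouping of candidate attributes.
    THETA = 0.6  # tolerance threshold (assumed)

    # Hypothetical pairwise similarities between candidate attribute names
    # (e.g., derived from string or co-occurrence statistics).
    SIM = {
        ("birth date", "date of birth"): 0.9,
        ("birth date", "born"): 0.7,
        ("occupation", "profession"): 0.8,
        ("occupation", "date of birth"): 0.1,
    }

    attributes = ["birth date", "date of birth", "born", "occupation", "profession"]

    def sim(a, b):
        return 1.0 if a == b else SIM.get((a, b), SIM.get((b, a), 0.0))

    def tolerance_class(a):
        """All candidate attributes tolerant (similar enough) to attribute a."""
        return {b for b in attributes if sim(a, b) >= THETA}

    def upper_approximation(seed):
        """Every attribute whose tolerance class overlaps the seed set."""
        return {a for a in attributes if tolerance_class(a) & seed}

    # Template for a hypothetical category "Person", seeded from one SKB.
    print(sorted(upper_approximation({"birth date", "occupation"})))
    ```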

  • FUZZY ONTOLOGY ALIGNMENT USING BACKGROUND KNOWLEDGE

    We propose an ontology alignment framework with two core features: the use of background knowledge and the ability to handle vagueness in the matching process and in the resulting concept alignments. The procedure is based on a generic reference vocabulary, which is used for fuzzifying the ontologies to be matched. The choice of this vocabulary is in general problem-dependent, although Wikipedia represents a general-purpose source of knowledge that can be used in many cases and even allows cross-language matching. In the first step of our approach, each domain concept is represented as a fuzzy set of reference concepts. In the next step, the fuzzified domain concepts are matched to one another, resulting in fuzzy descriptions of the matches of the original concepts. Based on these concept matches, we propose an algorithm that produces a merged fuzzy ontology capturing what is common to the source ontologies. The paper describes experiments in the multimedia domain using ontologies containing tagged images, as well as an evaluation of the approach in an information retrieval setting. The fuzzy approach is compared to a classical crisp alignment with the help of a ground truth created from human judgments.
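
    A minimal sketch of the core matching step follows (my illustration, not the paper's system): two domain concepts, each fuzzified into a fuzzy set over a shared reference vocabulary, are compared with a fuzzy overlap measure, and a merged fuzzy concept is formed from their intersection; the reference concepts and membership degrees are invented.

    ```python
    # Fuzzified concepts as {reference concept: membership degree} dictionaries.
    jaguar_a = {"Jaguar": 0.9, "Felidae": 0.7, "Predator": 0.4}   # from ontology A
    big_cat_b = {"Felidae": 0.8, "Lion": 0.6, "Predator": 0.5}    # from ontology B

    def fuzzy_jaccard(u, v):
        """Sum of min-memberships over sum of max-memberships (a common fuzzy overlap)."""
        keys = set(u) | set(v)
        inter = sum(min(u.get(k, 0.0), v.get(k, 0.0)) for k in keys)
        union = sum(max(u.get(k, 0.0), v.get(k, 0.0)) for k in keys)
        return inter / union if union else 0.0

    print("fuzzy match degree:", round(fuzzy_jaccard(jaguar_a, big_cat_b), 3))

    # One way to merge for the fuzzy ontology: keep what both agree on
    # (min-intersection of the two fuzzy sets).
    merged = {k: min(jaguar_a[k], big_cat_b[k]) for k in set(jaguar_a) & set(big_cat_b)}
    print("merged fuzzy concept:", merged)
    ```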

  • EVOLUTION OF WIKIPEDIA'S CATEGORY STRUCTURE

    Wikipedia, as a social phenomenon of collaborative knowledge creation, has been studied extensively from various points of view. Its category system, introduced in 2004, has attracted relatively little attention. In this study, we focus on the documentation of knowledge and on how this documentation is transformed over time. We take Wikipedia as a sample of knowledge in general and its category system as an aspect of the structure of this knowledge. We investigate the evolution of the category structure of the English Wikipedia from its birth in 2004 to 2008. We treat the category system as if it were a hierarchical Knowledge Organization System, capturing the changes in the distributions of the top categories. We investigate how the clustering of articles defined by the category system matches the direct link network between the articles, and we show how this match changes over time. We find the Wikipedia category network to be mostly stable, with occasional reorganizations. We show that the clustering matches the link structure quite well, except for short periods preceding the reorganizations.
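
    As a small illustration of the kind of measurement described above (not the authors' code), the sketch below scores how well a category-induced partition of articles matches the article link network, using modularity from networkx; the articles, links and categories are invented, and tracking such a score across yearly snapshots would reveal the dips preceding reorganizations.

    ```python
    # How well does the category partition match the article link network?
    import networkx as nx
    from networkx.algorithms.community import modularity

    # Hypothetical article link network for one snapshot in time.
    links = [("Physics", "Quantum mechanics"), ("Quantum mechanics", "Photon"),
             ("Impressionism", "Claude Monet"), ("Claude Monet", "Water Lilies"),
             ("Photon", "Claude Monet")]  # one cross-topic link
    G = nx.Graph(links)

    # Partition of the same articles induced by the category system.
    category_partition = [
        {"Physics", "Quantum mechanics", "Photon"},         # category: Science
        {"Impressionism", "Claude Monet", "Water Lilies"},  # category: Arts
    ]

    # High modularity = the category clustering matches the link structure well.
    print("modularity:", round(modularity(G, category_partition), 3))
    ```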

  • A FRAMEWORK FOR THE CALIBRATION OF SOCIAL SIMULATION MODELS

    Simulation with agent-based models is increasingly used in the study of complex socio-technical systems and in social simulation in general. This paradigm offers a number of attractive features, notably the possibility of modeling emergent phenomena within large populations. As a consequence, the quantity to be calibrated is often a distribution over the population whose relation to the model parameters is analytically intractable; it can, however, be simulated. In this paper we present a simulation-based framework for the calibration of agent-based models with distributional output, based on indirect inference. We illustrate our method step by step on a model of norm emergence in an online community of peer production, using data from three large Wikipedia communities. Model fit and diagnostics are discussed.
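
    A highly simplified sketch of simulation-based calibration with distributional output follows (in the spirit of the framework, not its actual implementation): a toy model's parameter is chosen by minimizing a Kolmogorov-Smirnov distance between the simulated and observed distributions over a parameter grid; the toy model and the "observed" data are invented.

    ```python
    # Toy simulation-based calibration of a distributional output.
    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(0)

    def simulate(theta, n=2000):
        """Toy stand-in for an agent-based model: per-editor contribution counts."""
        return rng.pareto(theta, size=n)

    # Pretend these are observed per-editor contribution counts from a community.
    observed = np.random.default_rng(1).pareto(1.5, size=2000)

    # Grid search: keep the parameter whose simulated distribution is closest
    # to the observed one under the Kolmogorov-Smirnov statistic.
    grid = np.linspace(0.5, 3.0, 26)
    distances = [ks_2samp(simulate(t), observed).statistic for t in grid]
    print("calibrated theta ~", round(float(grid[int(np.argmin(distances))]), 2))
    ```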

  • Understanding Open Collaboration of Wikipedia Good Articles with Factor Analysis

    This research aims at understanding the open collaboration involved in producing Wikipedia Good Articles (GAs). To achieve this goal, it is necessary to analyse who contributes to the collaborative creation of GAs and how they are involved in the collaboration process. We propose an approach that first employs factor analysis to identify editing abilities and then uses the resulting ability scores to distinguish different types of editors. We then generate sequences of the editors participating in the work process to analyse the patterns of collaboration. Without loss of generality, we use GAs from three Wikipedia categories, covering two general topics and one science topic, to demonstrate our approach. The results show that we can successfully derive editing abilities and identify different types of editors, and we then examine the sequences of editor types involved in the creation process. For the three GA categories examined, we found that GAs are characterized by editors with high content-shaping ability becoming involved in the later stages of the collaboration process. The results demonstrate that our approach provides a clearer understanding of how Wikipedia GAs are created through open collaboration.
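
    The sketch below illustrates only the first step of the approach (not the authors' code): factor analysis over a matrix of per-editor edit-type counts, yielding a small number of latent "editing ability" scores per editor that could then be used to distinguish editor types; the edit-type columns and counts are invented.

    ```python
    # Factor analysis of per-editor edit-type counts (all data invented).
    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    # Rows: editors. Columns (hypothetical edit types): content added,
    # references added, copy-edits, reverts, talk-page posts.
    X = np.array([
        [120, 30, 10,  2,  5],   # content-focused editor
        [ 10,  2, 90,  1, 40],   # copy-editing / coordinating editor
        [  5,  1,  4, 60, 30],   # patrolling editor
        [100, 25, 15,  3, 10],
        [  8,  3, 70,  2, 35],
    ])

    fa = FactorAnalysis(n_components=2, random_state=0)
    ability_scores = fa.fit_transform(X)  # latent ability scores per editor

    print("factor loadings (edit types x factors):")
    print(np.round(fa.components_.T, 2))
    print("per-editor ability scores:")
    print(np.round(ability_scores, 2))
    ```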

  • MAPPING VERBAL ARGUMENT PREFERENCES TO DEVERBAL NOUNS

    We describe an experiment mapping semantic role preferences for transitive verbs to their deverbal nominal forms. The preferences are learned by mining large parsed corpora. Preferences are modeled for deverbal/argument pairs, falling back to a model for the deverbal alone when sufficient data are not available. Errors in role assignment are reduced by 30–40%.
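
    As a small illustration of the backoff idea (not the authors' model), the sketch below picks the preferred role for a deverbal/argument pair and falls back to deverbal-only counts when the pair is too sparse; all counts and the sparsity threshold are invented.

    ```python
    # Role preference with backoff from pair counts to deverbal-only counts.
    from collections import Counter

    MIN_PAIR_COUNT = 5  # assumed sparsity threshold

    # Role counts mined from a parsed corpus (hypothetical).
    pair_counts = {("destruction", "city"): Counter({"patient": 40, "agent": 2})}
    deverbal_counts = {"destruction": Counter({"patient": 300, "agent": 120})}

    def preferred_role(deverbal, argument):
        pair = pair_counts.get((deverbal, argument), Counter())
        counts = pair if sum(pair.values()) >= MIN_PAIR_COUNT else deverbal_counts[deverbal]
        return counts.most_common(1)[0][0]

    print(preferred_role("destruction", "city"))  # pair model
    print(preferred_role("destruction", "army"))  # backs off to deverbal-only model
    ```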

  • BUILDING SEMANTIC NETWORKS FROM PLAIN TEXT AND WIKIPEDIA WITH APPLICATION TO SEMANTIC RELATEDNESS AND NOUN COMPOUND PARAPHRASING

    The construction of suitable and scalable representations of semantic knowledge is a core challenge in Semantic Computing. Manually created resources such as WordNet have been shown to be useful for many AI and NLP tasks, but they are inherently restricted in their coverage and scalability. In addition, they have been challenged by simple distributional models on very large corpora, questioning the advantage of structured knowledge representations.

    We present a framework for building large-scale semantic networks automatically from plain text and Wikipedia articles using only linguistic analysis tools. Our constructed resources cover up to 2 million concepts and were built in less than 6 days. Using the task of measuring semantic relatedness, we show that we achieve results comparable to the best WordNet-based methods as well as the best distributional methods, while using a corpus several orders of magnitude smaller. In addition, we show that we can outperform both types of methods by combining the results of our two network variants. Initial experiments on noun compound paraphrasing show similar results, underlining the quality as well as the flexibility of our constructed resources.
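
    A minimal sketch of using such a network for semantic relatedness follows (not the paper's method): relatedness is taken as the inverse of shortest-path length in a tiny toy concept graph, which is one simple way a semantic network can be queried for this task; the graph is invented.

    ```python
    # Path-based relatedness over a toy semantic network.
    import networkx as nx

    G = nx.Graph()
    G.add_edges_from([
        ("car", "vehicle"), ("vehicle", "transport"), ("bicycle", "vehicle"),
        ("car", "engine"), ("banana", "fruit"),
    ])

    def relatedness(a, b):
        """1 / (1 + shortest path length); 0 if the concepts are disconnected."""
        try:
            return 1.0 / (1.0 + nx.shortest_path_length(G, a, b))
        except nx.NetworkXNoPath:
            return 0.0

    print(relatedness("car", "bicycle"))  # related via "vehicle"
    print(relatedness("car", "banana"))   # disconnected -> 0.0
    ```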

  • Multilingual Distant Supervised Relation Extractors Combining Multiple Feature Types

    A well-known drawback in building machine-learning semantic relation detectors for natural language is the lack of a large number of qualified training instances for the target relations in multiple languages. Even when good results are achieved, the datasets used by the state-of-the-art approaches are rarely published. In order to address these problems, this work presents an automatic approach to build multilingual semantic relation detectors through distant supervision, combining two of the largest resources of structured and unstructured content available on the Web, DBpedia and Wikipedia. We map the DBpedia ontology back to the Wikipedia text to extract more than 100,000 training instances for more than 90 DBpedia relations in English and Portuguese without human intervention. First, we mine Wikipedia articles to find candidate instances for relations described in the DBpedia ontology. Second, we preprocess and normalize the data, filtering out irrelevant instances. Finally, we use the normalized data to construct regularized logistic regression detectors that achieve an F-measure above 80% for both English and Portuguese. In this paper, we also compare the impact of different types of features on the accuracy of the trained detectors, demonstrating significant performance improvements when combining lexical, syntactic and semantic features. Both the datasets and the code used in this research are available online.
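
    The sketch below condenses the final training step into a few lines (not the authors' released code): a regularized logistic regression detector is trained on distantly supervised instances whose lexical, syntactic and semantic information is collapsed here into simple dictionary features; the sentences, features and labels are invented.

    ```python
    # Regularized logistic regression over distantly supervised instances.
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    # (features, does the sentence express dbo:birthPlace?) - invented examples
    instances = [
        ({"lex=born in": 1, "dep=nsubjpass": 1, "ner_pair=PER-LOC": 1}, 1),
        ({"lex=was born in": 1, "dep=nsubjpass": 1, "ner_pair=PER-LOC": 1}, 1),
        ({"lex=visited": 1, "dep=dobj": 1, "ner_pair=PER-LOC": 1}, 0),
        ({"lex=works for": 1, "dep=prep_for": 1, "ner_pair=PER-ORG": 1}, 0),
    ]

    vec = DictVectorizer()
    X = vec.fit_transform([feats for feats, _ in instances])
    y = [label for _, label in instances]

    # C controls the strength of the (L2) regularization.
    detector = LogisticRegression(C=1.0, max_iter=1000).fit(X, y)

    test = vec.transform([{"lex=born in": 1, "dep=nsubjpass": 1, "ner_pair=PER-LOC": 1}])
    print("P(birthPlace) =", round(detector.predict_proba(test)[0, 1], 3))
    ```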

  • Fine-Tuning an Algorithm for Semantic Search Using a Similarity Graph

    Given a set of documents and an input query expressed in natural language, the problem of document search is to retrieve the most relevant documents. Unlike most existing systems, which perform document search based on keyword matching, we propose a method that considers the meaning of the words in the queries and documents. As a result, our algorithm can return documents that have no words in common with the input query, as long as the documents are relevant. For example, a document that contains the words "Ford", "Chrysler" and "General Motors" multiple times is surely relevant for the query "car", even if the word "car" never appears in the document. Our information retrieval algorithm is based on a similarity graph that contains the degree of semantic closeness between terms, where a term can be a word or a phrase. Since the algorithm that constructs the similarity graph takes a myriad of parameters as input, in this paper we fine-tune the part of the algorithm that constructs the Wikipedia portion of the graph. Specifically, we experimentally fine-tune the algorithm on the Miller and Charles benchmark, which contains 30 pairs of terms and their similarity scores as determined by human users. We then evaluate the performance of the fine-tuned algorithm on the Cranfield benchmark, which contains 1400 documents and 225 natural language queries, together with the relevant documents for every query as determined by human judgment. The results show that the fine-tuned algorithm produces a higher mean average precision (MAP) score than traditional keyword-based search algorithms because our algorithm considers not only the words and phrases in the query and documents, but also their meaning.
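
    As a compact illustration of scoring documents through a term-similarity graph rather than by exact keyword matching (my sketch, not the fine-tuned algorithm from the paper), the code below lets a document term contribute to the query score in proportion to its graph similarity to the query term; the similarity values are invented stand-ins for what the Wikipedia-derived graph would provide.

    ```python
    # Keyword-free relevance scoring via a term-similarity graph (toy values).
    SIMILARITY = {  # hypothetical edges of the similarity graph
        ("car", "ford"): 0.7,
        ("car", "chrysler"): 0.7,
        ("car", "general motors"): 0.6,
    }

    def term_sim(a, b):
        return 1.0 if a == b else SIMILARITY.get((a, b), SIMILARITY.get((b, a), 0.0))

    def score(query_terms, doc_terms):
        """Sum, over query terms, of the best-matching document term's similarity."""
        return sum(max((term_sim(q, d) for d in doc_terms), default=0.0)
                   for q in query_terms)

    doc = ["ford", "chrysler", "general motors"]
    print(score(["car"], doc))          # relevant despite sharing no query word
    print(score(["car"], ["banana"]))   # unrelated document scores 0.0
    ```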