In this paper, we propose a new approach for managing domain-specific thesauri, in which the object-oriented paradigm is applied to thesaurus construction and query-based browsing. The approach provides an object-oriented mechanism to assist domain experts in constructing thesauri: it determines a considerable part of the relationship degrees between terms by inheritance and supplies the domain expert with information available from other parts of the thesaurus being constructed or already constructed. In addition, it enables domain experts to construct the thesaurus incrementally, since the automatically determined relationship degrees can be refined whenever a more sophisticated thesaurus is needed. It may minimize the burden on domain experts caused by the exhaustive specification of individual relationships. The approach also provides a query-based browsing facility, which enables users to find desired thesaurus terms without tedious browsing in the thesaurus. A browsing query can be formulated with rather ambiguous terms that are nevertheless capable of deriving the desired terms. Such a query is especially useful when users want precise results, that is, when they want to use only carefully selected thesaurus terms in reformulating Boolean queries. To demonstrate the feasibility of our approach, we fully implemented an object-based thesaurus system, which supports semiautomatic thesaurus construction and the query-based browsing facility.
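The idea of determining relationship degrees by inheritance can be illustrated with a minimal sketch. The abstract does not describe the actual data model, so the `Term` class, its `parent` link, and the `degree_to` lookup below are hypothetical names chosen for illustration only: a term with no explicit degree falls back to the nearest ancestor that has one, and the expert can later refine the value on the term itself.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Term:
    """Hypothetical thesaurus term; not the data model of the paper."""
    name: str
    parent: Optional["Term"] = None
    # Relationship degrees the domain expert has specified explicitly.
    degrees: Dict[str, float] = field(default_factory=dict)

    def degree_to(self, other: str) -> Optional[float]:
        """Return the explicit degree if present, else the inherited one."""
        node: Optional[Term] = self
        while node is not None:
            if other in node.degrees:
                return node.degrees[other]
            node = node.parent  # fall back to the broader term
        return None  # no degree known anywhere on the ancestor chain


vehicle = Term("vehicle", degrees={"transport": 0.8})
car = Term("car", parent=vehicle)

inherited = car.degree_to("transport")  # taken from "vehicle"
car.degrees["transport"] = 0.9          # incremental refinement by the expert
refined = car.degree_to("transport")    # now the explicit value on "car"
```

Under these assumptions, refinement never touches the ancestor, so other narrower terms keep inheriting the original degree.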
We introduce a novel approach to extract semantic relations (e.g., is-a and part-of relations) from Wikipedia articles. These relations are used to build up a large and up-to-date thesaurus providing background knowledge for tasks such as determining semantic ontology mappings. Our automatic approach uses a comprehensive set of semantic patterns, finite state machines and NLP techniques to extract millions of relations between concepts. An evaluation for different domains shows the high quality and effectiveness of the proposed approach. We also illustrate the value of the newly found relations for improving existing ontology mappings.
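As a rough illustration of pattern-based relation extraction, the sketch below matches two lexical patterns against a definition sentence. It is far simpler than the approach summarized above, which combines a comprehensive pattern set with finite state machines and NLP techniques; the two regular expressions and the `extract_relation` function are assumptions made for this example, not the authors' patterns.

```python
import re
from typing import Optional, Tuple

# Illustrative patterns only; real definition sentences need many more.
PART_OF = re.compile(
    r"^(?:An? |The )?(?P<part>[\w\- ]+?) is (?:an? )?part of "
    r"(?:an? |the )?(?P<whole>[\w\- ]+?)[,.]",
    re.IGNORECASE,
)
IS_A = re.compile(
    r"^(?:An? |The )?(?P<hypo>[\w\- ]+?) is (?:an? |the )"
    r"(?P<hyper>[\w\- ]+?)(?= that| which|[,.])",
    re.IGNORECASE,
)


def extract_relation(sentence: str) -> Optional[Tuple[str, str, str]]:
    """Return (source, relation, target) for the first matching pattern."""
    # Check part-of first, otherwise "is a part of" would be read as is-a.
    m = PART_OF.match(sentence)
    if m:
        return (m.group("part"), "part-of", m.group("whole"))
    m = IS_A.match(sentence)
    if m:
        return (m.group("hypo"), "is-a", m.group("hyper"))
    return None


print(extract_relation("A netbook is a small laptop that is cheap."))
print(extract_relation("A wheel is part of a car."))
```

The order of the checks matters in this sketch: the part-of pattern is more specific, so it is tried before the general is-a pattern.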
This paper proposes a framework for a man-computer cooperative information retrieval system that supplies users of online discussion systems (or group decision support systems) with immediate information from the Internet. The system can actively sense the discussion topic and sub-topics and send the sub-topics as search terms to a WWW search engine. A filter and feedback module checks the quality of the search results and interacts with the search engine if needed. The items that pass the filter and feedback module are sent to the online discussion users. Finally, an information recommendation module records the visit history of all items and then determines the most popular ones in order to recommend them selectively to each expert. Some preliminary experimental results show that this framework is feasible.
Thesauri for science and technology information are increasingly used in bibliometrics and scientometrics. However, the manual construction and maintenance of thesauri are costly and time-consuming; thus, methods for semi-automatic construction and maintenance are being actively studied. We propose a method that expands an existing thesaurus with specified terms extracted from the abstracts of articles. Specifically, we assign the terms to certain subcategories by our novel clustering method based on information entropy for word vectors. Then, we determine the hypernyms and hyponyms based on their relations with terms in the subcategories. The word vectors are constructed from 177,000 IEEE articles archived from 2012 to 2014 in the Scopus dataset. In experiments, the terms were correctly classified into the Japan Science and Technology thesaurus with 83.3% precision and 71.4% recall. In the future, we will develop a semi-automatic thesaurus maintenance system that recommends new terms in their proper relative positions.
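The abstract does not specify how information entropy is applied to the word vectors, so the following is only one plausible reading, stated as an assumption: a term's cosine similarities to subcategory centroids are normalized into a distribution, and the Shannon entropy of that distribution measures how ambiguous the assignment is. The function names and the toy two-dimensional vectors are hypothetical.

```python
import math
from typing import Dict, Optional, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def assign_with_entropy(
    term_vec: Sequence[float],
    centroids: Dict[str, Sequence[float]],
) -> Tuple[Optional[str], float]:
    """Return (best subcategory, entropy in bits of the assignment)."""
    # Negative similarities are clipped to zero (an assumption of this sketch).
    sims = {name: max(cosine(term_vec, c), 0.0) for name, c in centroids.items()}
    total = sum(sims.values())
    if total == 0.0:
        # No positive similarity at all: treat as maximally ambiguous.
        return None, math.log2(len(centroids))
    probs = {name: s / total for name, s in sims.items()}
    entropy = -sum(p * math.log2(p) for p in probs.values() if p > 0.0)
    return max(probs, key=probs.get), entropy


# Toy 2-D vectors standing in for real word vectors.
centroids = {"optics": [1.0, 0.0], "networks": [0.0, 1.0]}
clear_label, clear_entropy = assign_with_entropy([1.0, 0.0], centroids)
_, ambiguous_entropy = assign_with_entropy([1.0, 1.0], centroids)
```

In this reading, a term aligned with a single centroid has entropy near zero, while a term equally similar to two centroids has entropy of one bit and could be held back for manual review.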
This paper presents an overview of an ongoing study on the digitization of old European Renaissance documents at CESR, the research centre to which the author is attached. Three collections of digital resources are being developed at the CESR: Musicology (dealing with a corpus of Renaissance songs), Architecture for Art History, and the BVH program (Virtual Humanistic Libraries). The BVH contains a selection of Renaissance books located in different departmental libraries. Significant items that have so far not been digitized by other libraries are selected. About 2,000 books (comprising about 800,000 pages) out of about 50,000 books have been scanned by the CESR in collaboration with others. However, preservation of these documents is not the only requirement of this project. It also attempts to provide some kind of access inside the images of the texts. Document images are classified into different groups based mainly on the format (for printed material), such as folios, quartos, octavos, etc. Books are further classified according to their graphic contents (e.g. medical books with large anatomy engravings, technical books with schemas, folk books with used woodcuts, maps and atlases, cosmographies, and so on). An application called AGORA (Analyseur Graphique pour OuvRages Anciens = Graphic analyzer for rare books) is used for primary structure analysis and segmentation of imaged documents. Indexing of documents makes use of keywords from the Iconclass thesaurus, a set of semantically and hierarchically organized trees of keywords describing works of art and ornaments, down to the smallest details, actions, foregrounds and backgrounds. Manuscript zones are described mainly with metadata and annotations, and eventually with the same topics.
In the future, the names of personalities are also to be indexed, as mythological entities are, with an alphanumeric code, to allow multiple queries inside collections of artifacts, book illustrations, archives, paintings, etc. Therefore, on the BVH website, the images will comprise not only text in image form but also all the graphic elements extracted from the digitized documents, properly indexed and accessible by means of a double standard search engine connected to the Iconclass thesaurus. Simultaneous visualisation of images and texts will be made possible by the organization of metadata and XML/TEI encoding. However, in doing so, the first challenge is to build and connect a network of databases linked to the Iconclass thesaurus, so that one can find any item in any base from a range of hierarchized keywords and obtain immediate access to the images one is looking for. The second challenge is the treatment of the multilingualism offered by Iconclass. The final challenge is to test automatic encoding procedures: until now, encoding has been done by hand, but with automatic image retrieval and similarity analysis (processed in Tours), one can expect to reduce the time-consuming encoding work.