After decades of research activity, Chinese spoken language processing (CSLP) has advanced considerably in both practical technology and theoretical discovery. In this book, the editors provide an introduction to the field as well as unique research problems, with their solutions, in various areas of CSLP. The contributions represent pioneering efforts ranging from CSLP principles to technologies and applications, with each chapter encapsulating a single problem and its solutions.
A commemorative volume for the 10th anniversary of the international symposium on CSLP in Singapore, this is a valuable reference for established researchers and an excellent introduction for those interested in the area of CSLP.
Sample Chapter(s)
Chapter 1: Speech Analysis: The Production-Perception Perspective
https://doi.org/10.1142/9789812772961_fmatter
https://doi.org/10.1142/9789812772961_0001
This chapter introduces the basic concepts and techniques of speech analysis from the perspectives of the underlying mechanisms of human speech production and perception. Spoken Chinese language has special characteristics in its signal properties that can be well understood in terms of both the production and perception mechanisms. In this chapter, we will first outline the general linguistic, phonetic, and signal properties of spoken Chinese. We then introduce human production and perception mechanisms, and in particular, those relevant to spoken Chinese. We also present some recent brain research on the relationship between human speech production and perception. From the perspectives of human speech production and perception, we then describe popular speech analysis techniques and classify them based on the underlying scientific principles either from the speech production or perception mechanism or from both.
https://doi.org/10.1142/9789812772961_0002
This chapter provides the phonetic and phonological background of Chinese spoken languages, Mandarin especially. Before going into the main theme of the chapter, a brief clarification of the distinction between phonetics and phonology is provided for readers of non-linguistic background. After generally describing the major features of spoken Chinese, standard Mandarin, among other dialects, is chosen as a typical example to explore in detail the unique syllable structure of spoken Chinese. Both traditional initial-final analysis and modern linguistic analysis approaches such as phonetics, phonemics, and phonotactics are described with the goal of examining Mandarin syllables and phonemes in detail. Hopefully, this chapter can help lay the linguistic foundations of Chinese spoken language processing (CSLP) for both Chinese and non-Chinese speaking researchers and engineers.
https://doi.org/10.1142/9789812772961_0003
This chapter discusses why Mandarin speech prosody is not simply about tones and intonation, and how additional but crucial prosodic information could be analyzed. We present arguments with quantitative evidence to demonstrate that fluent speech prosody contains higher-level discourse information apart from segmental, tonal and intonation information. Discourse information is reflected through relative cross-phrase prosodic associations, and should be included and accounted for in prosody analysis. A hierarchical framework of Prosodic Phrase Grouping (PG) is used to explain how, in order to convey higher-level association, individual phrases are adjusted to form coherent multiple-phrase speech paragraphs. Only three PG relative positions (PG-initial, -medial and -final) are required to constrain phrase intonations and generate the prosodic association necessary for global output prosody, which independent phrase intonations could not produce. The discussion focuses on why the internal structuring of PG forms prosodic associations; how global prosody can be accounted for hierarchically; how the key feature of speech prosody is cross-phrase associative prosodic templates instead of unrelated linear strings of phrase intonations; and how speech data type, speech unit selection, and methods of analysis affect the outcome of prosody analysis. The implications are significant for both phonetic investigations and technology development.
https://doi.org/10.1142/9789812772961_0004
Tone modeling for speech synthesis aims at providing proper pitch, duration, and energy information to generate natural synthetic speech from input text. As speech processing technology has progressed rapidly in recent years, some advanced tone modeling techniques for Mandarin text-to-speech (MTTS) have been proposed. In this chapter, two modern tone modeling approaches for Mandarin speech synthesis are discussed in detail.
https://doi.org/10.1142/9789812772961_0005
This chapter introduces Mandarin Text-To-Speech (MTTS) synthesis. Beginning with a brief review of the development history and attributes of MTTS, three main constituents of the technology are presented: 1) text processing: word segmentation, disambiguation of polyphones, and analysis of rhythm structure; 2) prosodic processing: features of Mandarin prosody and prosody prediction; and 3) speech synthesis: parametric synthesis and concatenative synthesis. Perspectives and applications for MTTS synthesis are discussed in the final sections.
https://doi.org/10.1142/9789812772961_0006
This chapter briefly reviews Large Vocabulary Continuous Speech Recognition (LVCSR) for Mandarin Chinese, an important core technology in Chinese spoken language processing. The overall framework and basic principles of LVCSR for Mandarin Chinese are reviewed in light of the structural features of the Chinese language, serving as an introduction to the following four chapters of this book, which focus on the four key modules of this framework: acoustic modeling, tone modeling, language modeling and pronunciation modeling. The evolution of application tasks and a very recent prototype example are then summarized, and finally some lessons learned in the development of LVCSR for Mandarin Chinese are discussed.
https://doi.org/10.1142/9789812772961_0007
After reviewing the history of Mandarin speech recognition in the previous chapter, we now describe a few key technologies for building a highly accurate Mandarin Large Vocabulary Continuous Speech Recognition (LVCSR) system. LVCSR is the foundation for many useful speech-based applications, including keyword spotting, translation, and voice indexing. The core technologies developed for Western languages are easily applicable to Mandarin Chinese. However, as noted in the previous chapter, we need to take care of the special characteristics of the Chinese language in order to achieve very high accuracy. Our emphasis will be on the differences and extra features used in the Mandarin system, with a brief summary of the backbone technologies that are language independent. Finally, we present a state-of-the-art Mandarin speech recognizer jointly developed by the University of Washington (UW) and SRI International, and discuss unsolved challenges.
https://doi.org/10.1142/9789812772961_0008
Tone is an important linguistic component of spoken Chinese. For Chinese speech recognition, tone information is useful to differentiate words. This chapter is about automatic tone recognition for Chinese and its usefulness in automatic speech recognition. Our discussion focuses on Cantonese and Mandarin, which are the representatives of Chinese dialects that have been studied extensively. The key problematic issues in tone recognition are addressed and the major approaches in tone modeling are described. We also introduce various techniques of integrating tone recognition into state-of-the-art large vocabulary continuous speech recognition systems.
https://doi.org/10.1142/9789812772961_0009
Language modeling aims to extract linguistic regularities, which are crucial in areas such as information retrieval and speech recognition. For Chinese systems specifically, language-dependent properties should be considered in language modeling. In this chapter, we first survey work on word segmentation and new word extraction, which are essential for the estimation of Chinese language models. Next, we present several recent approaches to deal with the issues of parameter smoothing and long-distance limitation in statistical n-gram language models. To tackle long-distance insufficiency, we address association pattern language models. For the issue of model smoothing, we present a solution based on the latent semantic analysis framework. To effectively refine the language model, we also adopt the maximum entropy principle and integrate multiple knowledge sources from a collection of text corpora. Discriminative training is also discussed in this chapter. Some experiments on perplexity evaluation and Mandarin speech recognition are reported.
https://doi.org/10.1142/9789812772961_0010
Pronunciation variations in spontaneous speech can be classified into complete changes and partial changes. Complete changes are the replacement of a canonical phoneme by an alternative phone. Partial changes are variations within the phoneme such as nasalization, centralization and voicing. We propose a solution for modeling both complete changes and partial changes in spontaneous Mandarin speech. To model complete changes, we use decision-tree-based pronunciation modeling to predict alternate pronunciations with associated probabilities. To avoid lexical confusion with the augmented pronunciation dictionary, we propose using a likelihood ratio test as a confidence measure. To model partial changes, we propose partial change phone models (PCPMs) and acoustic model reconstruction. We treat PCPMs as hidden models and merge them into the pre-trained baseform model via model reconstruction through decision tree merging. It is shown that our phone-level pronunciation modeling results in an absolute 0.9% syllable error rate reduction, and the acoustic model reconstruction approach results in a significant 2.39% absolute syllable error rate reduction in spontaneous speech.
https://doi.org/10.1142/9789812772961_0011
In this chapter we focus primarily on Chinese speech corpora design, collection and annotation, oriented to speech synthesis, recognition and fundamental speech research. Speech corpora comprise not only the speech signals per se, but also the relevant documents, metadata, annotations and all related specifications. A complete corpus would be ready for use in the development of speech application systems and available for distribution to more than one research team in the community.
https://doi.org/10.1142/9789812772961_0012
In this chapter, the speech-to-speech translation problem is introduced and presented along with the research efforts and approaches towards a solution. Two different statistical approaches for speech-to-speech translation are described in considerable detail. The concept-based approach focuses on understanding: extracting the semantic meaning from the input speech, and then re-generating the same meaning in the output language. The approach requires a huge amount of human effort for linguistic information annotation, although the amount of annotated data needed is not very large (in our experiment only 10,000 sentences are annotated for each language). Detailed work on how to improve the translation quality of natural language generation is presented. The second approach, a novel phrase-based framework built on finite-state transducers, emphasizes system development and search speed as well as memory efficiency. It is significantly faster and more memory efficient than existing approaches. The entire translation model is statically optimized into a single weighted finite-state transducer (WFST). The approach is particularly suitable for converged real-time translation on scalable computing devices, ranging from high-end servers to mobile PDAs. This approach exploits un-annotated parallel corpora at the cost of potential meaning loss and the requirement of a large amount of parallel text data (in our experiment 240K sentences, on the order of millions of words for each language, are used).
https://doi.org/10.1142/9789812772961_0013
Huge, continually increasing quantities of multimedia content including speech information are filling up our computers, networks and lives. It is obvious that speech is one of the most important sources of information for multimedia content, as it is the speech of the content that tells us of the subjects, topics and concepts. As a result, the associated spoken documents of the multimedia content will be key for content retrieval and browsing. Substantial efforts along with very encouraging results for spoken document transcription, retrieval, and summarization have been reported. This chapter presents a concise yet comprehensive overview of information retrieval and automatic summarization technologies that have been developed in recent years for efficient spoken document retrieval and browsing applications. An example prototype system for voice retrieval of Chinese broadcast news collected in Taiwan will be introduced as well.
https://doi.org/10.1142/9789812772961_0014
Speech act, an essential element of conversation, underlies the principle that an utterance in a dialogue is an action being performed by a speaker. Since speech acts convey speakers' intentions and opinions, it is key for the computer to identify and verify the speech act of a user's utterance in a spoken dialogue system. This chapter presents a few approaches to speech act identification and verification in Chinese spoken dialogue systems. Approaches using ontology-based partial pattern trees and semantic dependency graphs (SDGs) for speech act modeling are described. A verification mechanism using a latent semantic analysis (LSA) based Bayesian belief model (BBM) is adopted to improve the performance of speech act identification. Experimental results show that the SDG-based approach outperforms the Bayes' classifier and the ontology-based partial pattern trees. Integrating discourse analysis into the SDG-based approach improves not only the speech act identification accuracy rate, but also the performance of semantic object extraction. Furthermore, the LSA-based BBM for speech act verification improves the performance of speech act identification.
https://doi.org/10.1142/9789812772961_0015
In speech and language processing, such as automatic speech recognition (ASR), text-to-speech (TTS), cross-lingual information retrieval (CLIR), and machine translation (MT), there is an increasing need to translate out-of-vocabulary words from one language into another, especially from Latin-scripted languages into ideographic graphemes of languages such as Chinese, Japanese or Korean (CJK). In practice, whenever semantic translation is not available, translation is done by transliteration. In general, transliteration refers to the method of translating from one language to another by preserving the way words sound in their original languages, also known as translation-by-sound. However, when translating from Latin-scripted words that are originally from Chinese and its dialects, Japanese, or Korean into Chinese, transliteration refers to the method of back-translating into their original ideographic graphemes, also known as back-transliteration. In this chapter, we will discuss an English to Chinese transliteration paradigm for proper nouns, in particular personal names, through the exploration of various transliteration and validation techniques.
https://doi.org/10.1142/9789812772961_0016
Cantonese is a major Chinese dialect spoken by tens of millions of people in southern China. Our research team at the Chinese University of Hong Kong (CUHK) has devoted great effort to the research and development of Cantonese speech recognition and speech synthesis. This chapter gives an overview of our work. The linguistic and acoustic properties of spoken Cantonese are discussed. A set of large-scale Cantonese speech corpora is described as an indispensable infrastructural component for research and development. The development of Cantonese large-vocabulary continuous speech recognition (LVCSR) and text-to-speech (TTS) systems is presented in detail.
https://doi.org/10.1142/9789812772961_0017
In this chapter, we review research efforts in automatic speech recognition (ASR), text-to-speech (TTS) and speech corpus design for Taiwanese, or Min-nan, a major native language spoken in Taiwan. Following an introduction of the orthography and phonetic structure of Taiwanese, we describe the various databases used for these tasks, including the Formosa Lexicon (ForLex), a phonetically transcribed database that uses the Formosa Alphabet (ForPA; an alphabet system designed with Taiwan's multi-lingual applications in mind), and the Formosa Speech Database (ForSDat), a speech corpus made up of microphone and telephone speech. For ASR, we propose a unified scheme that includes Mandarin/Taiwanese bilingual acoustic models, incorporates variations in pronunciation into pronunciation modeling, and creates a character-based tree-structured searching network. This scheme is especially suitable for handling multiple character-based languages, such as members of the CJKV (Chinese, Japanese, Korean, and Vietnamese) family. For speech synthesis, the Taiwanese TTS system draws on the bilingual lexicon information and is made up of three functional modules: a text analysis module, a prosody module, and a waveform synthesis module. An experiment conducted to evaluate the text analysis and tone sandhi modules reveals about 90% labeling and 65% tone sandhi accuracies. Multiple-level unit selection for a limited domain application of TTS is also proposed to improve the naturalness of synthesized speech.
https://doi.org/10.1142/9789812772961_0018
Putonghua Shuiping Ceshi (PSC) is the official Standard Mandarin proficiency test in China. Currently, evaluation of the PSC test is conducted entirely by human testers, which leads to variance between evaluators as a result of subjectivity. Furthermore, large-scale testing is practically impossible due to the low efficiency and high expense it would incur. There is therefore an urgent need to implement the PSC test with the aid of a computer. This chapter introduces a computer-aided evaluation system for the PSC test. In the system, several optimized algorithms are advanced to evaluate a speaker's proficiency, such as using typical dialect error patterns to restrict the recognition grammar, adjusting posterior probability by the duration ratio of initials and finals, selective adaptation, and F0 normalization using CDF-matching. The evaluation problem is also tested as a classification problem by defining different types of pronunciation according to the speaker's proficiency. Experiments on a PSC test database of 1,662 persons indicate that these methods can efficiently improve the performance of evaluation, and that the computer-aided PSC test is indeed feasible.
https://doi.org/10.1142/9789812772961_0019
In this digital era, digital media data is available everywhere, in forms such as internet online audio/video (AV) broadcasting, digital TV news, music, and telephone conversations, and its volume is ever-increasing. Digital content management and retrieval (DCMR) technology is thus expected to be helpful for the management and access of this huge amount of digital media data. Since the audio channel is an important component of digital media and contains rich information or features describing the digital media content, our studies are mainly focused on the techniques of audio-based digital content management and retrieval (AD-CMR), whose two important research directions are textual audio information retrieval (TAIR) and content-based audio retrieval (CBAR). This chapter introduces our recent studies on TAIR and CBAR, as well as their potential applications.
https://doi.org/10.1142/9789812772961_0020
Spoken dialog systems demonstrate a high degree of usability in many restricted domains, including air travel, train schedules, restaurant guides, ferry timetables, electronic automobile classifieds, weather information and email access. A user typically interacts with these systems to retrieve certain information, for example a train schedule, or to complete a task, such as booking a flight, reserving a restaurant table, or finding an apartment. Dialog modeling in these systems plays an important role in assisting users to achieve their goals effectively. This chapter presents an introduction to multilingual dialog systems in two parts. The first part describes the key components of a multilingual dialog system, using an illustrative example based on the CU FOREX system. The second part introduces the various kinds of dialog models, highlights the advantages of the mixed-initiative dialog model, and presents a possible data-driven approach for its implementation.
https://doi.org/10.1142/9789812772961_0021
Among all of the services provided by telecom companies, directory assistance (DA) is undoubtedly the most likely one to be automated. With increasing competition and revenue decline, directory assistance providers are seeking more cost-effective operational models. Speech technology is a key solution to automation. Providing automated DA draws on most speech recognition technologies, such as model training, large vocabulary continuous speech recognition, confidence measures, and noise/channel robustness. The more specific challenge is that this speech recognition application needs to handle a very large vocabulary which includes homonyms, abbreviations and variations of speaker expressions. Although there is much literature and many system deployments regarding automated DA in Western languages, only a few deal with Chinese languages. A trial system has been developed at Chunghwa Telecom since April 2004 and significantly benefits the telecom company. The speech technologies and business concerns in the system development of this application will be discussed in detail.
https://doi.org/10.1142/9789812772961_0022
For safer and more convenient operation, a car navigation and assistance system with speech-enabled functions is essential. This chapter addresses two challenges for this purpose. The first is that the speech signal is inevitably corrupted by the ambient noise coming from the car. The distance between a hands-free car microphone and the speaker makes this problem even more serious. In this chapter, an integration of a perceptual filterbank and subspace-based speech enhancement is presented to reduce the effects of this problem. The second challenge is that traditional spoken dialogue systems (SDS) concentrate merely on the interaction between the system and a single speaker. For this reason we are motivated to conduct a further study on multi-speaker dialogue systems. Here, the interactions between multiple speakers and the system are classified into three types: independent, cooperative, and conflicting interactions. An algorithm for multi-speaker dialogue management is proposed to determine the interaction type and to keep the interaction running smoothly.
https://doi.org/10.1142/9789812772961_0023
This chapter discusses the fundamental issues related to the development of language resources for Chinese spoken language processing (CSLP). Chinese dialects, transcription systems, and Chinese character sets are described. The general procedure for speech corpus production is introduced, along with the dialect-specific problems related to CSLP corpora. Some activities in the development of CSLP corpora are also presented here. Finally, available language resources for CSLP as well as their related websites are listed.
https://doi.org/10.1142/9789812772961_bmatter