A NEW KERNEL-BASED FORMALIZATION OF MINIMUM ERROR PATTERN RECOGNITION
Previous expositions of the Minimum Classification Error (MCE) framework for discriminative training of pattern recognition systems describe the use of a smoothed version of the error count as the criterion function for classifier design, but they do not specify the origin or nature of the smoothing used. In this chapter we show that the same optimization criterion can be derived from the classic Parzen window approach to smoothing in the context of non-parametric density estimation. The density estimated is not that of the category pattern distributions, as in conventional non-discriminative methods such as maximum likelihood estimation, but rather that of a transformed variable (a misclassification measure) comparing the correct category with the best incorrect one. This density estimate can easily be integrated over the domain corresponding to classification errors, yielding a cost function that is closely related to the original MCE cost function. The risk estimate formulation presented here thus provides a new link, via Parzen estimation, between the empirical cost function computed on a finite training set and the true theoretical classification risk. Moreover, the classic Parzen practice of tying the kernel width to the amount of training data carries over to discriminative training, where it expresses the intuitive notion that a smaller margin should be used as the amount of training data increases.
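To make the construction concrete, the following minimal sketch illustrates the idea in Python; the function name and the synthetic misclassification measures are purely illustrative assumptions, not part of the original formulation. It places a Gaussian Parzen kernel on each training token's misclassification measure and integrates the resulting density over the error domain. With a Gaussian kernel this integral has a closed form, the Gaussian CDF, so the smoothed error count reduces to a mean of sigmoid-like per-token losses of the kind used in MCE training.

    import numpy as np
    from scipy.stats import norm

    # Hypothetical misclassification measures for N training tokens,
    # d_i = -g_correct(x_i) + g_best_incorrect(x_i); d_i > 0 is an error.
    rng = np.random.default_rng(0)
    d = rng.normal(loc=-0.5, scale=1.0, size=200)

    def smoothed_error_count(d, h):
        """Parzen risk estimate: place a Gaussian kernel of width h on
        each d_i and integrate the density over the error domain d > 0.
        Each kernel contributes the Gaussian CDF evaluated at d_i / h,
        so the estimate is a mean of sigmoid-like per-token losses."""
        return np.mean(norm.cdf(d / h))

    # As h shrinks (e.g. as the amount of training data grows), the
    # smoothed count approaches the raw empirical error rate, which
    # corresponds to using a smaller margin.
    for h in (1.0, 0.5, 0.1):
        print(f"h = {h}: smoothed error = {smoothed_error_count(d, h):.4f}")
    print(f"raw error rate: {np.mean(d > 0):.4f}")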