LOGIC FORMULAS BASED KNOWLEDGE DISCOVERY AND ITS APPLICATION TO THE CLASSIFICATION OF BIOLOGICAL DATA
Classifiers built through supervised learning techniques are widely used in computational biology. Examples are neural networks, decision trees and support vector machines. Recently, an extension of Regularized Generalized Eigenvalues Classifier (ReGEC) has been proposed, in which prior knowledge is included. When knowledge is formalized as a set of linear constraints to the ReGEC, the resulting non linear classifier has a lower complexity and halves the misclassi-fication error with respect to the original method. In this work, we show how logic programming can extract knowledge from data to enhance classification models produced by ReGEC. The knowledge extraction method is based on two phases: a feature selection phase and a rules extraction phase. Feature selection is formulated as an integer programming problem that extends a set covering problem. The extraction phase is performed through the iterative solution of different instances of the same minimum cost satisfiability problem that models the logic separation rules used for classification. The overall method, that we call LF-ReGEC, guarantees that the number of points in the training set is not increased and the resulting model does not overfit the problem. Furthermore, the overall accuracy of the method is increased. Finally, the method is compared with other methods using genomic and proteomic data sets taken from the literature.