
LOGIC FORMULAS BASED KNOWLEDGE DISCOVERY AND ITS APPLICATION TO THE CLASSIFICATION OF BIOLOGICAL DATA

G. FELICI

Institute for System Analysis and Computer Science, National Research Council, Rome, Italy


P. BERTOLAZZI

Institute for System Analysis and Computer Science, National Research Council, Rome, Italy


M. R. GUARRACINO

High Performance Computing and Networking Institute, National Research Council, Naples, Italy


A. CHINCHULUUN

Industrial and Systems Engineering Department, University of Florida, Gainesville, FL, USA



P. M. PARDALOS

Industrial and Systems Engineering Department, University of Florida, Gainesville, FL, USA


https://doi.org/10.1142/9789814271820_0017

Cited by: 3 (Source: Crossref)

Abstract:

Classifiers built through supervised learning techniques are widely used in computational biology. Examples are neural networks, decision trees and support vector machines. Recently, an extension of the Regularized Generalized Eigenvalues Classifier (ReGEC) has been proposed, in which prior knowledge is included. When knowledge is formalized as a set of linear constraints on ReGEC, the resulting nonlinear classifier has lower complexity and halves the misclassification error with respect to the original method. In this work, we show how logic programming can extract knowledge from data to enhance classification models produced by ReGEC. The knowledge extraction method consists of two phases: a feature selection phase and a rule extraction phase. Feature selection is formulated as an integer programming problem that extends a set covering problem. The extraction phase is performed through the iterative solution of different instances of the same minimum cost satisfiability problem, which models the logic separation rules used for classification. The overall method, which we call LF-ReGEC, guarantees that the number of points in the training set is not increased and that the resulting model does not overfit the problem. Furthermore, the overall accuracy of the method is increased. Finally, the method is compared with other methods using genomic and proteomic data sets taken from the literature.
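To make the set covering view of feature selection concrete, the following is a minimal sketch and not the authors' formulation. It assumes binary features and one covering constraint per pair of samples from different classes: at least one selected feature must take different values on the two samples. The function names and the toy data are hypothetical, and exhaustive search stands in for an integer programming solver, so it is only practical for a handful of features.

```python
from itertools import combinations


def distinguishing_features(a, b):
    """Indices of the binary features on which samples a and b differ."""
    return {j for j, (x, y) in enumerate(zip(a, b)) if x != y}


def select_features(class_a, class_b):
    """Smallest feature subset that separates every cross-class pair.

    Assumption: plain set covering solved by exhaustive search, trying
    subsets in order of increasing size.
    """
    n_features = len(class_a[0])
    # One covering constraint per pair of samples from different classes.
    constraints = [distinguishing_features(a, b)
                   for a in class_a for b in class_b]
    if any(not c for c in constraints):
        raise ValueError("two samples from different classes are identical")
    for size in range(1, n_features + 1):
        for subset in combinations(range(n_features), size):
            chosen = set(subset)
            # A subset is feasible if it intersects every constraint.
            if all(chosen & c for c in constraints):
                return sorted(chosen)
    return list(range(n_features))


# Hypothetical toy data: four binary features, two samples per class.
class_a = [(1, 0, 1, 0), (1, 1, 1, 0)]
class_b = [(0, 0, 1, 1), (0, 1, 0, 1)]
selected = select_features(class_a, class_b)
print(selected)
```

On this toy data a single feature is enough, because feature 0 takes the value 1 on every sample of the first class and 0 on every sample of the second. The abstract describes the actual model as an extension of set covering, so it presumably carries additional constraints or costs that this sketch omits.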

BIOMAT 2008
