  • articleNo Access

    HYBRIDIZATION OF GENETIC AND QUANTUM ALGORITHM FOR GENE SELECTION AND CLASSIFICATION OF MICROARRAY DATA

    In this work, we hybridize the Genetic Quantum Algorithm with the Support Vector Machines classifier for gene selection and classification of high-dimensional microarray data. We named our algorithm GQASVM. Its purpose is to identify a small subset of genes that can be used to separate two classes of samples with high accuracy. The approach was compared with methods from the literature, in particular GASVM and PSOSVM [2], on six microarray datasets dealing with cancer (leukemia, breast, colon, ovarian, prostate, and lung) that are available on the Web. The experiments demonstrated the very good performance of the method. The first contribution shows that GQASVM is able to find genes of interest and improve classification in a meaningful way. The second important contribution is the discovery of new and challenging results on the datasets used.

  • articleNo Access

    ITERATIVE FEATURE PERTURBATION AS A GENE SELECTOR FOR MICROARRAY DATA

    Gene-expression microarray datasets often consist of a limited number of samples with a large number of gene-expression measurements, usually on the order of thousands. Therefore, dimensionality reduction is critical prior to any classification task. In this work, the iterative feature perturbation method (IFP), an embedded gene selector, is introduced and applied to four microarray cancer datasets: colon cancer, leukemia, Moffitt colon cancer, and lung cancer. We compare results obtained by IFP to those of support vector machine-recursive feature elimination (SVM-RFE) and the t-test as a feature filter, using a linear support vector machine as the base classifier. The intersection of the gene sets selected by the three methods across the four datasets was analyzed. Additional experiments included an initial pre-selection of the top 200 genes based on their p values. IFP and SVM-RFE were then applied to the reduced feature sets. These results showed up to 3.32% average performance improvement for IFP across the four datasets. A statistical analysis (using the Friedman/Holm test) for both scenarios showed that the highest accuracies came from the t-test as a filter in experiments without gene pre-selection. IFP and SVM-RFE had greater classification accuracy after gene pre-selection. Analysis showed the t-test is a good gene selector for microarray data. IFP and SVM-RFE showed performance improvement on a dataset reduced by the t-test. The IFP approach resulted in comparable or superior average class accuracy when compared to SVM-RFE on three of the four datasets. The same or similar accuracies can be obtained with different sets of genes.
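The t-test filter mentioned in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: it assumes a two-class problem, uses Welch's t-statistic, and ranks genes by |t| rather than by p value (a simplification, since the two orderings can differ slightly when degrees of freedom vary between genes). The function names and the toy data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two samples (unequal variances allowed)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def rank_genes_by_t(expr, labels, top_k):
    """Return indices of the top_k genes by |t|, best first.

    expr   -- list of samples, each a list of expression values (one per gene)
    labels -- class label (0 or 1) for each sample
    """
    n_genes = len(expr[0])
    scores = []
    for g in range(n_genes):
        class0 = [row[g] for row, y in zip(expr, labels) if y == 0]
        class1 = [row[g] for row, y in zip(expr, labels) if y == 1]
        scores.append(abs(welch_t(class0, class1)))
    return sorted(range(n_genes), key=lambda g: scores[g], reverse=True)[:top_k]


# Toy data (made up): 6 samples x 3 genes; only gene 1 separates the classes.
expr = [
    [5.0, 1.0, 3.1],
    [5.2, 1.2, 2.9],
    [4.9, 0.9, 3.0],
    [5.1, 4.0, 3.2],
    [5.0, 4.2, 2.8],
    [4.8, 3.9, 3.0],
]
labels = [0, 0, 0, 1, 1, 1]
top = rank_genes_by_t(expr, labels, top_k=2)
print(top)
```

In a pre-selection setting like the one described, the returned indices would be used to slice the expression matrix before a wrapper or embedded selector is run on the reduced feature set.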

  • articleNo Access

    CAN MARKOV CHAIN MODELS MIMIC BIOLOGICAL REGULATION?

    A fundamental question in biology is whether the network of interactions that regulate gene expression can be modeled by existing mathematical techniques. Studies of the ability to predict a gene's state based on the states of other genes suggest that it may be possible to abstract sufficient information to build models of the system that retain steady-state behavioral characteristics of the real system. This study tests this possibility by: (i) constructing a finite state homogeneous Markov chain model using a small set of interesting genes; (ii) estimating the model parameters based on the observed experimental data; (iii) exploring the dynamics of this small genetic regulatory network by analyzing its steady-state (long-run) behavior and comparing the resulting model behavior to the observed behavior of the original system. The data used in this study are from a survey of melanoma where predictive relationships (coefficient of determination, CoD) between 587 genes from 31 samples were examined. Ten genes with strong interactive connectivity were chosen to formulate a finite state Markov chain on the basis of their role as drivers in the acquisition of an invasive phenotype in melanoma cells. Simulations with different perturbation probabilities and different iteration times were run. Following convergence of the chain to steady-state behavior, millions of samples of the results of further transitions were collected to estimate the steady-state distribution of the network. In these samples, only a limited number of states possessed significant probability of occurrence. This behavior is nicely congruent with biological behavior, as cells appear to occupy only a negligible portion of the state space available to them. The model produced both some of the exact state vectors observed in the data, and also a number of state vectors that were near neighbors of the state vectors from the original data. By combining these similar states, a good representation of the observed states in the original data could be achieved. From this study, we find that, in this limited context, Markov chain simulation emulates well the dynamic behavior of a small regulatory network.
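The idea of a finite-state Markov chain with random perturbation and a steady-state distribution concentrated on few states can be illustrated on a toy network. The sketch below is not the model from the study: the three-gene update rules in `next_state` are invented, and the perturbation scheme (each gene flips independently with probability `p`; the rules apply only when no gene flips) is an assumption borrowed from the usual Boolean-network-with-perturbation setup. The steady state is found by power iteration rather than by sampling.

```python
from itertools import product


def next_state(s):
    """Hypothetical deterministic update rules for a 3-gene binary network."""
    a, b, c = s
    return (b, a & c, a | b)


def transition_matrix(p, n=3):
    """Build the 2^n x 2^n transition matrix for perturbation probability p."""
    states = list(product((0, 1), repeat=n))
    index = {s: i for i, s in enumerate(states)}
    P = [[0.0] * len(states) for _ in states]
    for s in states:
        for flips in product((0, 1), repeat=n):
            k = sum(flips)
            prob = p ** k * (1 - p) ** (n - k)
            if k == 0:
                target = next_state(s)  # no perturbation: follow the rules
            else:
                target = tuple(x ^ f for x, f in zip(s, flips))  # flip genes
            P[index[s]][index[target]] += prob
    return states, P


def steady_state(P, iters=2000):
    """Approximate the stationary distribution by repeated multiplication."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


states, P = transition_matrix(p=0.05)
pi = steady_state(P)
for s, prob in sorted(zip(states, pi), key=lambda t: -t[1]):
    print(s, round(prob, 4))
```

Because perturbation makes every state reachable from every other, the chain has a unique stationary distribution; in this toy network most of the probability mass sits on the attractors of the unperturbed rules, which mirrors the qualitative observation in the abstract.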

  • articleNo Access

    NONLINEAR PROBIT GENE CLASSIFICATION USING MUTUAL INFORMATION AND WAVELET-BASED FEATURE SELECTION

    We consider the problem of cancer classification from gene expression data. We propose using a mutual information-based gene or feature selection method where features are wavelet-based. The bootstrap technique is employed to obtain an accurate estimate of the mutual information. We then develop a nonlinear probit Bayesian classifier consisting of a linear term plus a nonlinear term, the parameters of which are estimated using the Gibbs sampler. These new methods are applied to analyze breast-cancer data and leukemia data. The results indicate that the proposed gene and feature selection method is very accurate in breast-cancer and leukemia classifications.

  • articleNo Access

    MULTISTAGE MUTUAL INFORMATION FOR INFORMATIVE GENE SELECTION

    An important issue in the design of gene selection algorithms for microarray data analysis is the formulation of a suitable criterion function for measuring the relevance between different gene expressions. Mutual information (MI) is a widely used criterion function, but it calculates the relevance over the entire sample set only once, which cannot precisely identify the informative genes. This paper proposes a novel idea of computing MI in stages. The proposed multistage mutual information (MSMI) initially computes MI using all the samples; then, based on the classification performance produced by an artificial neural network (ANN), MI is repeatedly recalculated using only the unclassified samples until there is no improvement in the classification accuracy. The performance of the proposed approach is evaluated using ten gene expression data sets. Simulation results show that the proposed approach helps to improve the discriminative power of the genes with regard to the target disease of a microarray sample. Statistical analysis of the test results shows that the proposed method selects highly informative genes and produces classification accuracy comparable to that of other approaches reported in the literature.

  • articleNo Access

    MINIMUM REDUNDANCY FEATURE SELECTION FROM MICROARRAY GENE EXPRESSION DATA

    Selecting a small subset out of the thousands of genes in microarray data is important for accurate classification of phenotypes. Widely used methods typically rank genes according to their differential expressions among phenotypes and pick the top-ranked genes. We observe that feature sets so obtained have certain redundancy and study methods to minimize it. We propose a minimum redundancy maximum relevance (MRMR) feature selection framework. Genes selected via MRMR provide a more balanced coverage of the space and capture broader characteristics of phenotypes. They lead to significantly improved class predictions in extensive experiments on 6 gene expression data sets: NCI, Lymphoma, Lung, Child Leukemia, Leukemia, and Colon. Improvements are observed consistently among 4 classification methods: Naïve Bayes, Linear discriminant analysis, Logistic regression, and Support vector machines.

    Supplementary: The top 60 MRMR genes for each of the datasets are listed in . More information related to MRMR methods can be found at .
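The contrast between plain top-ranked selection and a minimum redundancy maximum relevance criterion can be shown on a toy example. The sketch below assumes discretized expression values and a greedy "relevance minus mean redundancy" score built on mutual information, which is one common way to state the MRMR idea; it is not taken from the paper's code. The gene names and values are made up.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * log2(c * n / (px[a] * py[b])) for (a, b), c in pxy.items()
    )


def mrmr(features, target, k):
    """Greedy selection: maximize relevance minus mean redundancy.

    features -- dict mapping gene name to a list of discrete values
    target   -- list of class labels, same length as each feature
    """
    relevance = {f: mutual_information(v, target) for f, v in features.items()}
    selected = [max(relevance, key=relevance.get)]
    while len(selected) < k:
        def score(f):
            redundancy = sum(
                mutual_information(features[f], features[s]) for s in selected
            ) / len(selected)
            return relevance[f] - redundancy

        candidates = [f for f in features if f not in selected]
        selected.append(max(candidates, key=score))
    return selected


# Toy data (made up): g2 duplicates g1; g3 is weaker but not redundant.
target = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "g1": [0, 0, 0, 1, 1, 1, 1, 1],
    "g2": [0, 0, 0, 1, 1, 1, 1, 1],
    "g3": [0, 1, 0, 0, 1, 1, 0, 1],
}
selected = mrmr(features, target, k=2)
print(selected)
```

Ranking by relevance alone would pick `g1` and its duplicate `g2`; the redundancy penalty makes the second pick `g3` instead, which is the behavior the abstract describes as more balanced coverage.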

  • articleNo Access

    A THEORETICAL ANALYSIS OF THE SELECTION OF DIFFERENTIALLY EXPRESSED GENES

    A great deal of recent research has focused on the challenging task of selecting differentially expressed genes from microarray data ("gene selection"). Numerous gene selection algorithms have been proposed in the literature, but it is often unclear exactly how these algorithms respond to conditions like small sample sizes or differing variances. Choosing an appropriate algorithm can therefore be difficult in many cases. In this paper we propose a theoretical analysis of gene selection, in which the probability of successfully selecting differentially expressed genes, using a given ranking function, is explicitly calculated in terms of population parameters. The theory developed is applicable to any ranking function which has a known sampling distribution, or one which can be approximated analytically. In contrast to methods based on simulation, the approach presented here is computationally efficient and can be used to examine the behavior of gene selection algorithms under a wide variety of conditions, even when the number of genes involved runs into the tens of thousands. The utility of our approach is illustrated by comparing three widely-used gene selection methods.

  • articleNo Access

    BIOMARKER DISCOVERY AND VISUALIZATION IN GENE EXPRESSION DATA WITH EFFICIENT GENERALIZED MATRIX APPROXIMATIONS

    In most real-world gene expression data sets, there are often multiple sample classes with ordinals, which are categorized into the normal or diseased type. The traditional feature or attribute selection methods consider multiple classes equally without paying attention to the up/down regulation across the normal and diseased types of classes, while the specific gene selection methods particularly consider the differential expressions across the normal and diseased types, but ignore the existence of multiple classes. In this paper, to improve biomarker discovery, we propose to make the best use of these two aspects: the differential expressions (which can be viewed as the domain knowledge of gene expression data) and the multiple classes (which can be viewed as a kind of data set characteristic). Therefore, we simultaneously take into account these two aspects by employing the 1-rank generalized matrix approximations (GMA). Our results show that GMA can not only improve the accuracy of classifying the samples, but also provide a visualization method to effectively analyze the gene expression data on both genes and samples. Based on the mechanism of matrix approximation, we further propose an algorithm, CBiomarker, to discover compact biomarkers by reducing the redundancy.

  • articleNo Access

    MODEL-BASED CLUSTERING WITH GENE RANKING USING PENALIZED MIXTURES OF HEAVY-TAILED DISTRIBUTIONS

    Cluster analysis of biological samples using gene expression measurements is a common task which aids the discovery of heterogeneous biological sub-populations having distinct mRNA profiles. Several model-based clustering algorithms have been proposed in which the distribution of gene expression values within each sub-group is assumed to be Gaussian. In the presence of noise and extreme observations, a mixture of Gaussian densities may over-fit and overestimate the true number of clusters. Moreover, commonly used model-based clustering algorithms do not generally provide a mechanism to quantify the relative contribution of each gene to the final partitioning of the data. We propose a penalized mixture of Student's t distributions for model-based clustering and gene ranking. Together with a resampling procedure, the proposed approach provides a means for ranking genes according to their contributions to the clustering process. Experimental results show that the algorithm performs well compared to traditional Gaussian mixtures in the presence of outliers and longer-tailed distributions. The algorithm also identifies the true informative genes with high sensitivity, and achieves improved model selection. An illustrative application to breast cancer data is also presented which confirms established tumor sub-classes.

  • articleNo Access

    Identification of biomarker genes for resistance to a pathogen by a novel method for meta-analysis of single-channel microarray datasets

    The search for fast and reliable methods allowing for extraction of biomarker genes, e.g. those responsible for plant resistance to a certain pathogen, is one of the most important and highly exploited data mining problems in bioinformatics. Here we describe a simple and efficient method suitable for combining results from multiple single-channel microarray experiments for meta-analysis. The new technique presented here makes use of fuzzy set logic for the initial gene selection and of the machine learning algorithm AdaBoost to retrieve a set of genes whose expression profiles are the most different between the resistant and susceptible classes. As a proof of concept, our method has been applied to the analysis of a gene expression dataset composed of many independent microarray experiments on wheat head tissue, to identify genes that are biomarkers of resistance to the fungus Fusarium graminearum. We used microarray data from many experiments performed on wheat lines with various resistance levels. The resulting set of genes was validated by qPCR experiments.

  • articleNo Access

    A New Multi-objective Hybrid Gene Selection Algorithm for Tumor Classification Based on Microarray Gene Expression Data

    Tumor classification based on microarray gene expression data is prone to overfitting because such data are composed of many irrelevant, redundant, and noisy genes. Traditional gene selection methods cannot achieve satisfactory classification results. In this study, we propose a novel multi-objective hybrid gene selection method named RMOGA (ReliefF Multi-Objective Genetic Algorithm), which aims to select a few genes and obtain good tumor recognition accuracy. RMOGA consists of two phases. First, ReliefF is used to select the top 5% subset of genes from the original datasets. Second, a multi-objective genetic algorithm searches for the optimal gene subset within the gene subset obtained by the ReliefF method. To verify the validity of RMOGA, we conducted extensive experiments on 11 available microarray datasets and compared the proposed method with other previous methods. Two classical classifiers, Naive Bayes and Support Vector Machine, were used to measure the classification performance of all comparison methods. Experimental results show that the RMOGA algorithm can yield significantly better results than previous state-of-the-art methods in terms of classification accuracy and the number of selected genes.
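When two goals compete (few genes, high accuracy), multi-objective search typically compares candidates by Pareto dominance. The abstract does not say how RMOGA compares candidate subsets, so the sketch below is only a generic illustration of that comparison, with made-up candidate results given as (number of genes, accuracy) pairs and hypothetical function names.

```python
def dominates(a, b):
    """True if candidate a is at least as good as b on both objectives
    (fewer genes, higher accuracy) and strictly better on at least one."""
    return a[0] <= b[0] and a[1] >= b[1] and (a[0] < b[0] or a[1] > b[1])


def pareto_front(candidates):
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Made-up (n_genes, accuracy) results for five candidate gene subsets.
candidates = [(5, 0.90), (8, 0.95), (8, 0.91), (12, 0.95), (3, 0.80)]
front = pareto_front(candidates)
print(front)
```

Here the subsets with 8 genes at 0.91 and 12 genes at 0.95 are discarded because the 8-gene subset at 0.95 is no worse on either objective; the remaining candidates represent different trade-offs between subset size and accuracy.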

  • articleNo Access

    ON GENE SELECTION AND CLASSIFICATION FOR CANCER MICROARRAY DATA USING MULTI-STEP CLUSTERING AND SPARSE REPRESENTATION

    Microarray data profile gene expression on a whole-genome scale, and provide a good way to study associations between gene expression and the occurrence or progression of cancer. Many researchers have realized that microarray data are useful for predicting cancer cases. However, the high dimension of gene expressions, which is significantly larger than the sample size, makes this task very difficult. It is very important to identify the significant genes causing cancer. Many feature selection algorithms have been proposed that focus on improving cancer predictive accuracy at the expense of ignoring the correlations between the features. In this work, a novel framework (named SGS) is presented for significant gene selection and efficient cancer case classification. The proposed framework first performs a clustering algorithm to find gene groups in which the genes have high correlation coefficients, then selects (1) the significant genes in each group using the Bayesian Lasso method and (2) important gene groups using the group Lasso method, and finally builds a prediction model based on the shrunken gene space with an efficient classification algorithm (such as support vector machine (SVM), 1NN, and regression). Experimental results on publicly available microarray data show that the proposed framework often outperforms existing feature selection and prediction methods such as SAM, information gain (IG), and Lasso-type prediction models.