In this work, we hybridize the Genetic Quantum Algorithm with the Support Vector Machine classifier for gene selection and classification of high-dimensional microarray data. We named our algorithm GQASVM. Its purpose is to identify a small subset of genes that can be used to separate two classes of samples with high accuracy. The approach was compared with different methods from the literature, in particular GASVM and PSOSVM [2], on six different datasets from microarray experiments dealing with cancer (leukemia, breast, colon, ovarian, prostate, and lung) and available on the Web. The experiments demonstrated the very good performance of the method. The first contribution is to show that GQASVM is able to find genes of interest and improve classification in a meaningful way. The second important contribution is the discovery of new and challenging results on the datasets used.
A question has been raised in several publications as to whether or not the expression levels or their logarithms for different genes are normally distributed. To answer this question would require a large data set where both biological variability and technological noise are present. An earlier attempt to test this assumption was limited to technical replicates and did not take multiplicity of tests into account when assessing the net results of goodness-of-fit testing. Therefore, the problem calls for further exploration. We applied several statistical tests to a large set of high-density oligonucleotide microarray data in order to systematically test for log-normality of expression levels for all the reporter genes. The multiple testing aspect of the problem was addressed by designing a pertinent resampling procedure. The results of testing did not reject normality of log-intensities in the non-normalized data under study. However, the global log-normality hypothesis was rejected beyond all reasonable doubt when the data were normalized by the quantile normalization procedure. Our results are consistent with the hypothesis that non-normalized expression levels of different genes are approximately log-normally distributed. The quantile normalization causes dramatic changes in the shape of marginal distributions of log-intensities, which may be an indication that this procedure interferes not only with the technological noise but with the true biological signal as well. This possibility invites a special investigation.
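To make the last point concrete, the following is a minimal sketch of the standard quantile normalization procedure, assuming that this is the variant the abstract refers to. The function name `quantile_normalize` and the toy values are illustrative only, and ties are broken by position rather than averaged. After normalization every array has exactly the same set of values, which is why the marginal distributions of log-intensities can change so strongly.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of equal-length arrays (one list per microarray).

    Each array's values are replaced by the mean, across arrays, of the
    values that share the same rank. Ties are broken by position (an
    assumption of this sketch; other implementations average tied ranks).
    """
    n_genes = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the r-th smallest value over all arrays.
    reference = [
        sum(s[rank] for s in sorted_arrays) / len(arrays)
        for rank in range(n_genes)
    ]
    normalized = []
    for a in arrays:
        # Gene indices ordered from smallest to largest intensity.
        order = sorted(range(n_genes), key=lambda i: a[i])
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            out[gene] = reference[rank]
        normalized.append(out)
    return normalized


# Two hypothetical arrays with three genes each.
normalized = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
```

Within each array the ranking of genes is preserved, but the values themselves are forced onto a common reference distribution.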
DNA microarray technology allows researchers to monitor the expression level of thousands of genes under various conditions in microarray experiments. However, high-dimensional data in microarray is a major challenge as the irrelevant genes often reduce the detection capability and increase the computation time. Many learning algorithms are not specifically developed to deal with the noisy genes, thus, incorporating them with gene selection techniques has become a necessity. In this paper, we propose a combined method of Gram–Schmidt orthogonal forward selection (OFS) and FunCluster to search for putatively co-regulated biological processes that share the co-expressed genes. There were two datasets used in this research: human white adipose tissue and human skeletal muscle. This study aimed to find a small subset of strongly correlated genes from the raw datasets to maximize the detection capability of cluster analysis. This method was found able to detect the clusters of biological categories that were overlooked in the previous research. Some clusters represented minor functions of the datasets and indicated more specific biological processes. Further, the computation time for both datasets was reduced using this proposed method, as the Gram–Schmidt OFS significantly reduced the dimensionality of the datasets.
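The selection step can be illustrated with a generic Gram-Schmidt orthogonal forward selection loop. This is a sketch under assumptions, not the authors' implementation: the abstract does not say which target vector or scoring rule is used, so the code assumes a hypothetical numeric `target` and ranks genes by the squared cosine between the gene profile and the target, projecting out each chosen gene before the next round.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gram_schmidt_ofs(features, target, k):
    """Select up to k feature names by orthogonal forward selection.

    features: dict mapping a gene name to its list of expression values.
    target:   hypothetical response vector of the same length.
    """
    remaining = {name: list(values) for name, values in features.items()}
    y = list(target)
    selected = []
    for _ in range(k):
        yy = dot(y, y)
        best_name, best_score = None, -1.0
        for name, x in remaining.items():
            xx = dot(x, x)
            if xx < 1e-12 or yy < 1e-12:
                continue  # nothing left to explain, or feature already spanned
            score = dot(x, y) ** 2 / (xx * yy)  # squared cosine with the target
            if score > best_score:
                best_name, best_score = name, score
        if best_name is None:
            break
        q = remaining.pop(best_name)
        qq = dot(q, q)
        selected.append(best_name)
        # Project the chosen direction out of the target and the other features.
        cy = dot(q, y) / qq
        y = [yi - cy * qi for yi, qi in zip(y, q)]
        remaining = {
            name: [xi - (dot(q, x) / qq) * qi for xi, qi in zip(x, q)]
            for name, x in remaining.items()
        }
    return selected


# Toy data: g2 is largely redundant with g1, g3 carries new information.
genes = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [1.0, 2.0, 3.0, 5.0],
    "g3": [1.0, -1.0, 1.0, -1.0],
}
chosen = gram_schmidt_ofs(genes, target=[2.0, 1.0, 4.0, 3.0], k=2)
```

Because each selected gene is projected out of the remaining ones, a gene that is nearly a copy of an already selected gene scores poorly in later rounds, which is how the procedure reduces dimensionality.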
We propose a statistical method for estimating a gene network based on Bayesian networks from microarray gene expression data together with biological knowledge including protein-protein interactions, protein-DNA interactions, binding site information, existing literature and so on. Microarray data do not contain enough information for constructing gene networks accurately in many cases. Our method adds biological knowledge to the estimation method of gene networks under a Bayesian statistical framework, and also controls the trade-off between microarray information and biological knowledge automatically. We conduct Monte Carlo simulations to show the effectiveness of the proposed method. We analyze Saccharomyces cerevisiae gene expression data as an application.
Clustering time-course gene expression data (gene trajectories) is an important step towards solving the complex problem of gene regulatory network modeling and discovery, as it significantly reduces the dimensionality of the gene space required for analysis. Traditional clustering methods that perform hill-climbing from randomly initialized cluster centers are prone to produce inconsistent and sub-optimal cluster solutions over different runs. This paper introduces a novel method that hybridizes the genetic algorithm (GA) and the expectation maximization (EM) algorithm for clustering gene trajectories with mixtures of multiple linear regression models (MLRs), with the objective of improving the global optimality and consistency of the clustering performance.
The proposed method is applied to cluster the human fibroblasts and the yeast time-course gene expression data based on their trajectory similarities. It outperforms the standard EM method significantly in terms of both clustering accuracy and consistency. The biological implications of the improved clustering performance are demonstrated.
Since microarray gene expression data do not contain sufficient information for estimating accurate gene networks, other biological information has been considered to improve the estimated networks. Recent studies have revealed that highly conserved proteins that exhibit similar expression patterns in different organisms have almost the same function in each organism. Such conserved proteins are also known to play similar roles in terms of the regulation of genes. Therefore, this evolutionary information can be used to refine regulatory relationships among genes, which are estimated from gene expression data. We propose a statistical method for estimating gene networks from gene expression data by utilizing evolutionarily conserved relationships between genes. Our method simultaneously estimates two gene networks of two distinct organisms, with a Bayesian network model utilizing the evolutionary information so that gene expression data of one organism helps to estimate the gene network of the other. We show the effectiveness of the method through the analysis of Saccharomyces cerevisiae and Homo sapiens cell cycle gene expression data. Our method was successful in estimating gene networks that capture many known relationships as well as several unknown relationships which are likely to be novel. Supplementary information is available at .
Motivation: Many applications of microarray technology in clinical cancer studies aim at detecting molecular features for refined diagnosis. In this paper, we follow an opposite rationale: we try to identify common molecular features shared by phenotypically distinct types of cancer using a meta-analysis of several microarray studies. We present a novel algorithm to uncover that two lists of differentially expressed genes are similar, even if these similarities are not apparent to the eye. The method is based on the ordering in the lists.
Results: In a meta-analysis of five clinical microarray studies we were able to detect significant similarities in five of the ten possible comparisons of ordered gene lists. We included studies in which not a single gene could be significantly associated with outcome. The detection of significant similarities of gene lists from different microarray studies is a novel and promising approach. It has the potential to improve upon specialized cancer studies by exploiting the power of several studies in one single analysis. Our method is complementary to previous methods in that it does not rely on strong effects of differential gene expression in a single study but on consistent ones across multiple studies.
Many studies have used microarray technology to identify the molecular signatures of human cancer, yet the critical features of these often unmanageably large sets of signatures remain elusive. We have investigated co-expression patterns in four subtypes of ovarian cancer from 104 cancer patients using covariance analysis, treating each subtype of ovarian cancer as a distinct disease entity. We sought gene pairs that were transcriptionally co-expressed in one or multiple subtypes of ovarian cancer, establishing a high-confidence network of 87 genes interconnected by significantly high co-expression links that were observed in at least two subtypes of ovarian cancer. We have shown that certain groups of co-expressed gene pairs are cancer subtype specific by demonstrating significant differences in co-expression patterns of gene pairs between subtypes of ovarian cancer. In addition, we identified a set of 24 genes that classified patients into specific cancer subtypes with a misclassification error rate of less than 5%. Our findings illustrate how large public microarray gene expression datasets could be exploited for identification of cancer subtype specific molecular signatures, and how to classify cancer patients into specific subtypes of cancer using gene expression profiles.
The combined interaction of all the genes forms a central part of the functional system of a cell. Thus, data-based modeling of the gene expression network in particular is currently one of the main challenges in the field of systems biology. However, the problem is extremely high-dimensional and complex, so that conventional identification methods are usually not applicable, especially when aiming at dynamic models. We propose in this paper a subspace identification approach, which is well suited for high-dimensional system modeling; the modified version presented here can also handle the underdetermined case with fewer data samples than variables (genes). The algorithm is applied to two public stress-response data sets collected from the yeast Saccharomyces cerevisiae. The obtained dynamic state space model is tested by comparing the simulation results with the measured data. It is shown that the identified model can describe relatively well the dynamics of the general stress-related changes in the expression of the complete yeast genome. However, it seems inevitable that more precise modeling of the dynamics of the whole genome would require experiments especially designed for systemic modeling.
A test-statistic typically employed in the gene set enrichment analysis (GSEA) prevents this method from being genuinely multivariate. In particular, this statistic is insensitive to changes in the correlation structure of the gene sets of interest. The present paper considers the utility of an alternative test-statistic in designing the confirmatory component of the GSEA. This statistic is based on a pertinent distance between joint distributions of expression levels of genes included in the set of interest. The null distribution of the proposed test-statistic, known as the multivariate N-statistic, is obtained by permuting group labels. Our simulation studies and analysis of biological data confirm the conjecture that the N-statistic is a much better choice for multivariate significance testing within the framework of the GSEA. We also discuss some other aspects of the GSEA paradigm and suggest new avenues for future research.
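A permutation test of this kind can be sketched as follows. The code assumes one common form of the N-statistic built on the Euclidean distance between expression vectors (twice the mean between-group distance minus the two mean within-group distances). The exact kernel and scaling used in the paper are not given in the abstract, so the function names and defaults here are illustrative.

```python
import math
import random


def euclid(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean_distance(xs, ys):
    """Mean Euclidean distance over all pairs (x, y), x in xs and y in ys."""
    return sum(euclid(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def n_statistic(xs, ys):
    """Distance between two multivariate samples (assumed Euclidean kernel)."""
    return 2 * mean_distance(xs, ys) - mean_distance(xs, xs) - mean_distance(ys, ys)


def permutation_p_value(xs, ys, n_perm=200, seed=0):
    """Null distribution obtained by permuting group labels."""
    rng = random.Random(seed)
    observed = n_statistic(xs, ys)
    pooled = list(xs) + list(ys)
    n1 = len(xs)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if n_statistic(pooled[:n1], pooled[n1:]) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


# Each sample is the expression vector of a hypothetical two-gene set.
group_a = [(0.1 * i, 0.0) for i in range(6)]
group_b = [(5.0 + 0.1 * i, 5.0) for i in range(6)]
stat, p_value = permutation_p_value(group_a, group_b)
```

Because the statistic compares whole expression vectors rather than one gene at a time, it can in principle respond to differences in the joint distribution, including the correlation structure, and not only to shifts in individual means.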
The currently practiced methods of significance testing in microarray gene expression profiling are highly unstable and tend to be very low in power. These undesirable properties are due to the nature of multiple testing procedures, as well as extremely strong and long-ranged correlations between gene expression levels. In an earlier publication, we identified a special structure in gene expression data that produces a sequence of weakly dependent random variables. This structure, termed the δ-sequence, lies at the heart of a new methodology for selecting differentially expressed genes in nonoverlapping gene pairs. The proposed method has two distinct advantages: (1) it leads to dramatic gains in terms of the mean numbers of true and false discoveries, and in the stability of the results of testing; and (2) its outcomes are entirely free from the log-additive array-specific technical noise. We demonstrate the usefulness of this approach in conjunction with the nonparametric empirical Bayes method. The proposed modification of the empirical Bayes method leads to significant improvements in its performance. The new paradigm arising from the existence of the δ-sequence in biological data offers considerable scope for future developments in this area of methodological research.
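The second advantage follows from simple arithmetic: if the technical noise adds the same array-specific term to every log-expression value on an array, that term cancels in the difference between two genes measured on the same array. The sketch below only illustrates this cancellation. How the pairs of the δ-sequence are actually formed is not described in the abstract, so the pairing of consecutive genes in `nonoverlapping_pairs` is a placeholder assumption.

```python
def nonoverlapping_pairs(n_genes):
    """Pair genes (0, 1), (2, 3), ... so that no gene appears twice.

    Placeholder pairing; the original method may order genes differently.
    """
    return [(i, i + 1) for i in range(0, n_genes - 1, 2)]


def pairwise_log_differences(log_expr_by_array, pairs):
    """For each array, return the log-expression difference within each pair."""
    return [[arr[i] - arr[j] for i, j in pairs] for arr in log_expr_by_array]


# Hypothetical log-expression values: two arrays, four genes.
clean = [[1.0, 2.0, 4.0, 8.0], [2.0, 1.0, 3.0, 5.0]]
# Add a different additive offset to every value on each array.
offsets = [0.5, -1.25]
noisy = [[v + off for v in arr] for arr, off in zip(clean, offsets)]

pairs = nonoverlapping_pairs(4)
clean_diffs = pairwise_log_differences(clean, pairs)
noisy_diffs = pairwise_log_differences(noisy, pairs)
# The two results coincide: the array-specific offsets have cancelled.
```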
DNA microarrays (gene chips), frequently used in biological and medical studies, measure the expressions of thousands of genes per sample. Using microarray data to build accurate classifiers for diseases is an important task. This paper introduces an algorithm, called Committee of Decision Trees by Attribute Behavior Diversity (CABD), to build highly accurate ensembles of decision trees for such data. Since a committee's accuracy is greatly influenced by the diversity among its member classifiers, CABD uses two new ideas to "optimize" that diversity, namely (1) the concept of attribute behavior–based similarity between attributes, and (2) the concept of attribute usage diversity among trees. The ideas are effective for microarray data, since such data have many features and behavior similarity between genes can be high. Experiments on microarray data for six cancers show that CABD outperforms previous ensemble methods significantly and outperforms SVM, and show that the diversified features used by CABD's decision tree committee can be used to improve performance of other classifiers such as SVM. CABD has potential for other high-dimensional data, and its ideas may apply to ensembles of other classifier types.
Cluster analysis of biological samples using gene expression measurements is a common task which aids the discovery of heterogeneous biological sub-populations having distinct mRNA profiles. Several model-based clustering algorithms have been proposed in which the distribution of gene expression values within each sub-group is assumed to be Gaussian. In the presence of noise and extreme observations, a mixture of Gaussian densities may over-fit and overestimate the true number of clusters. Moreover, commonly used model-based clustering algorithms do not generally provide a mechanism to quantify the relative contribution of each gene to the final partitioning of the data. We propose a penalized mixture of Student's t distributions for model-based clustering and gene ranking. Together with a resampling procedure, the proposed approach provides a means for ranking genes according to their contributions to the clustering process. Experimental results show that the algorithm performs well compared to traditional Gaussian mixtures in the presence of outliers and longer-tailed distributions. The algorithm also identifies the true informative genes with high sensitivity, and achieves improved model selection. An illustrative application to breast cancer data is also presented which confirms established tumor sub-classes.
The correct inference of gene regulatory networks for the understanding of the intricacies of complex biological regulation remains an intriguing task for researchers. With the availability of high-dimensional microarray data, relationships among thousands of genes can be simultaneously extracted. Among the prevalent models for reverse engineering genetic networks, the S-system is considered to be an efficient mathematical tool. In this paper, the Bat algorithm, based on the echolocation of bats, has been used to optimize the S-system model parameters. A decoupled S-system has been implemented to reduce the complexity of the algorithm. Initially, the proposed method has been successfully tested on an artificial network with and without the presence of noise. Based on the fact that a real-life genetic network is sparsely connected, a novel Accumulative Cardinality based decoupled S-system has been proposed. The cardinality has been varied from zero up to a maximum value, and this model has been implemented for the reconstruction of the DNA SOS repair network of Escherichia coli. The obtained results have shown significant improvements in the detection of a greater number of true regulations, and in the minimization of false detections compared to other existing methods.
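For readers unfamiliar with the model, the standard S-system describes the rate of change of each gene as a production term minus a degradation term, each a product of power laws: dx_i/dt = alpha_i * prod_j x_j^g_ij - beta_i * prod_j x_j^h_ij. The sketch below evaluates this right-hand side and takes one explicit Euler step. The parameter names and the Euler integrator are choices made for illustration, and the Bat algorithm search over the parameters is not shown.

```python
import math


def s_system_rhs(x, alpha, beta, g, h):
    """Right-hand side of the S-system for expression levels x (all positive).

    alpha, beta: rate constants per gene.
    g, h:        kinetic-order matrices, g[i][j] and h[i][j] give the
                 influence of gene j on production and degradation of gene i.
    """
    n = len(x)
    derivatives = []
    for i in range(n):
        production = alpha[i] * math.prod(x[j] ** g[i][j] for j in range(n))
        degradation = beta[i] * math.prod(x[j] ** h[i][j] for j in range(n))
        derivatives.append(production - degradation)
    return derivatives


def euler_step(x, dt, alpha, beta, g, h):
    """Advance the state by one explicit Euler step (illustrative integrator)."""
    dx = s_system_rhs(x, alpha, beta, g, h)
    return [xi + dt * dxi for xi, dxi in zip(x, dx)]


# Hypothetical two-gene network: gene 1 activates production of gene 0.
alpha, beta = [3.0, 1.0], [1.0, 2.0]
g = [[0.0, 0.5], [1.0, 0.0]]
h = [[1.0, 0.0], [0.0, 1.0]]
rates = s_system_rhs([1.0, 4.0], alpha, beta, g, h)
```

Each gene's equation depends only on its own row of parameters (`alpha[i]`, `beta[i]`, `g[i]`, `h[i]`), which is what makes it possible to decouple the estimation problem gene by gene when the trajectories of the other genes are taken from the data.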
Correct inference of genetic regulations inside a cell from biological data such as time series microarray data is one of the greatest challenges of the post-genomic era for biologists and researchers. The Recurrent Neural Network (RNN) is one of the most popular and simple approaches to model the dynamics as well as to infer correct dependencies among genes. Inspired by the behavior of social elephants, we propose a new metaheuristic, namely the Elephant Swarm Water Search Algorithm (ESWSA), to infer Gene Regulatory Networks (GRNs). This algorithm is mainly based on the water search strategy of intelligent and social elephants during drought, utilizing different types of communication techniques. Initially, the algorithm is tested against benchmark small- and medium-scale artificial genetic networks with and without the presence of different noise levels, and its efficiency is observed in terms of parametric error, minimum fitness value, execution time, accuracy of prediction of true regulations, etc. Next, the proposed algorithm is tested against real gene expression data of the Escherichia coli SOS network, and the results are also compared with other state-of-the-art optimization methods. The experimental results suggest that ESWSA is very efficient for the GRN inference problem and performs better than other methods in many ways.
Biotechnological analysis of DNA microarray genes provides valuable insights into the discovery and treatment of diseases such as cancer. It may also be crucial for the prevention and treatment of other genetic diseases. However, due to the large number of features and dimensions in a DNA microarray, the “curse of dimensionality” problem is very common. Many machine learning methods require an effective subset of input genes to achieve high accuracy. Unfortunately, extracting features (genes) is an inherently NP-hard problem. Recently, the use of metaheuristics to overcome the NP-hardness of the feature extraction problem has attracted the attention of many researchers. In this paper, we use the combination of fuzzy entropy and Giza Pyramid Construction (GPC) for feature selection. First, redundant features in the microarray dataset are removed using the fuzzy entropy approach. GPC is then used to reduce the execution time. This results in the selection of a near-optimal subset of genes for cancer detection. Dimensionality reduction with GPC followed by classification with a Convolutional Neural Network (CNN) creates a synergy to increase efficiency. The proposed method is tested on five well-known cancer patient datasets: leukemia, lymphoma, MLL, ovarian, and SRBCT. The performance of the CNN was also compared with that of four well-known classifiers: K-nearest neighbor, naïve Bayesian, decision tree, and logistic regression. Our results show that, on average, the CNN has the highest accuracy, recall, precision, and F-measure on all datasets.
Microarray data can provide valuable results for a variety of gene expression profile problems and contribute to advances in clinical medicine. The application of microarray data to cancer-type classification has recently gained popularity. Microarray data contain a large number of features (genes), are high-dimensional, and often fall into the multi-class category. These facts make testing and training of general classification methods difficult. Reducing the number of genes and achieving lower classification error rates are the main issues to be solved. The classification of microarray data samples can be regarded as a feature selection and classifier design problem. The goal of feature selection is to select those subsets of differentially expressed genes that are potentially relevant for distinguishing the sample classes. Classical genetic algorithms (GAs) may suffer from premature convergence and thus lead to poor experimental results. In this paper, a combat genetic algorithm (CGA) is used to implement the feature selection, and a K-nearest neighbor classifier with leave-one-out cross-validation serves as the CGA fitness function for the classification problem. The proposed method was applied to 10 microarray data sets that were obtained from the literature. The experimental results show that the proposed method not only effectively reduced the number of gene expression levels but also achieved lower classification error rates.
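The fitness evaluation described above can be sketched as a function that scores a candidate gene subset by the leave-one-out accuracy of a K-nearest neighbor classifier restricted to that subset. This is a minimal illustration, not the authors' code: the function name, the squared Euclidean distance, and the boolean `gene_mask` encoding are assumptions, and the combat genetic algorithm that would search over masks is not shown.

```python
from collections import Counter


def loocv_knn_accuracy(samples, labels, gene_mask, k=1):
    """Leave-one-out accuracy of K-NN using only the genes flagged in gene_mask."""
    selected = [i for i, keep in enumerate(gene_mask) if keep]
    if not selected:
        return 0.0  # an empty gene subset cannot classify anything
    correct = 0
    for held_out, (x, true_label) in enumerate(zip(samples, labels)):
        neighbours = []
        for other, (z, label) in enumerate(zip(samples, labels)):
            if other == held_out:
                continue  # the held-out sample may not vote for itself
            dist = sum((x[gi] - z[gi]) ** 2 for gi in selected)
            neighbours.append((dist, label))
        neighbours.sort(key=lambda pair: pair[0])
        votes = Counter(label for _, label in neighbours[:k])
        if votes.most_common(1)[0][0] == true_label:
            correct += 1
    return correct / len(samples)


# Toy data: gene 0 separates the classes, gene 1 is misleading.
samples = [[0.0, 5.0], [0.2, 1.0], [0.1, 9.0],
           [5.0, 1.1], [5.2, 5.1], [5.1, 9.1]]
labels = ["A", "A", "A", "B", "B", "B"]
fitness_informative = loocv_knn_accuracy(samples, labels, [True, False])
fitness_misleading = loocv_knn_accuracy(samples, labels, [False, True])
```

A search procedure would call such a function once per candidate mask, typically combining the accuracy with a penalty on the number of selected genes.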
Although microarray technology has revealed transcriptomic diversities underlying various cancer phenotypes, the transcriptional programs controlling them have not been well elucidated. To decode transcriptional programs governing cancer transcriptomes, we have recently developed a computational method termed EEM, which searches for expression modules from prescribed gene sets defined by prior biological knowledge such as TF binding motifs. In this paper, we extend our EEM approach to predict cancer transcriptional networks. Starting from functional TF binding motifs and expression modules identified by EEM, we predict cancer transcriptional networks containing regulatory TFs, associated GO terms, and interactions between TF binding motifs. To systematically analyze transcriptional programs in broad types of cancer, we applied our EEM-based network prediction method to 122 microarray datasets collected from public databases. The data sets contain about 15000 experiments for tumor samples of various tissue origins including breast, colon, and lung. This EEM-based meta-analysis successfully revealed a prevailing cancer transcriptional network which functions in a large fraction of cancer transcriptomes; it includes cell-cycle and immune-related sub-networks. This study demonstrates the broad applicability of EEM, and opens a way to a comprehensive understanding of transcriptional networks in cancer cells.
Ensemble methods can be more effective when an ensemble is built using knowledge of the diversity among base learners. However, implementing an ensemble while evaluating the diversity of learners for a given data mining task can be very time consuming. This paper presents a framework for developing a flexible software platform for building an ensemble based on diversity measures. An ensemble classification system (ECS) has been implemented for mining biomedical data as well as general data. The ECS consists of Data Pre-process, Feature Selection, Classifiers Selection, Feature-Classifier Pair Evaluation and Selection, Combination and Decision Making. The ECS has been tested with several benchmark datasets and microarray data. The experimental results show that ECS is a practical program both in improving data mining performance and in reducing computational time.
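As one example of what a diversity measure looks like, the sketch below computes the pairwise disagreement measure, that is, the fraction of samples on which exactly one of two classifiers is correct, and averages it over all pairs of base learners. The abstract does not state which measures the ECS implements, so this particular measure and the function names are assumptions chosen for illustration.

```python
from itertools import combinations


def disagreement(preds_a, preds_b, truth):
    """Fraction of samples where exactly one of the two classifiers is correct."""
    differ = sum(
        (a == t) != (b == t) for a, b, t in zip(preds_a, preds_b, truth)
    )
    return differ / len(truth)


def mean_pairwise_disagreement(all_preds, truth):
    """Average disagreement over every pair of base learners."""
    pairs = list(combinations(all_preds, 2))
    return sum(disagreement(a, b, truth) for a, b in pairs) / len(pairs)


# Hypothetical predictions of three base learners on four samples.
truth = [1, 1, 0, 0]
learner_a = [1, 1, 0, 1]
learner_b = [1, 0, 0, 0]
learner_c = [1, 1, 0, 1]
diversity = mean_pairwise_disagreement([learner_a, learner_b, learner_c], truth)
```

Two learners that make identical predictions contribute zero diversity, so adding the second of them to an ensemble cannot correct any error made by the first.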