We show how the concept of an annotated ordered set can be used to model large taxonomically structured ontologies such as the Gene Ontology. By constructing a formal context consistent with a given annotated ordered set, we derive its concept lattice representation. We develop the fundamental mathematical relations present in this formulation, in particular deriving a conceptual pre-ordering of the taxonomy, and constructing a correspondence between the annotations of an ordered set and the closure systems of its filter lattice. We study an example from the Gene Ontology to demonstrate how the introduced technique can be utilized for taxonomy review.
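To make the step from a formal context to its concept lattice more concrete, the following is a minimal sketch that enumerates the formal concepts (extent, intent pairs) of a tiny context by brute force. The object names `g1`..`g3`, the attribute names `t1`..`t3`, and the function names are hypothetical; the sketch does not reproduce the abstract's construction of a context from an annotated ordered set, and the exhaustive enumeration is only practical for toy inputs.

```python
from itertools import combinations

def intent(objects, context, attributes):
    """Attributes shared by every object in the given collection."""
    return frozenset(a for a in attributes if all(a in context[o] for o in objects))

def extent(attrs, context):
    """Objects that carry every attribute in attrs."""
    return frozenset(o for o, have in context.items() if attrs <= have)

def concepts(context):
    """Enumerate all formal concepts by closing every subset of objects."""
    attributes = frozenset().union(*context.values())
    found = set()
    objs = list(context)
    for size in range(len(objs) + 1):
        for subset in combinations(objs, size):
            b = intent(subset, context, attributes)  # shared attributes
            a = extent(b, context)                   # closure of the subset
            found.add((a, b))
    return found

# Toy context, invented for illustration: object -> set of attributes.
context = {
    "g1": {"t1", "t2"},
    "g2": {"t2"},
    "g3": {"t2", "t3"},
}
lattice = concepts(context)
```

Ordering the resulting pairs by inclusion of their extents gives the concept lattice of this toy context.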
Functional and molecular characterization was performed on the major organs of damp-obstructed rats using expression datasets from microarray experiments and real-time RT-PCR. Gene ontology repertoires, i.e. cellular component, molecular function, and biological process, were used to classify differentially expressed genes in the major organs of rats following dampness treatment. As to the cellular component, over-expression of genes associated with the plasma membrane was observed in the stomach, spleen, kidney, heart, liver, and lung. Genes associated with translational machinery, endoplasmic reticulum membrane, Golgi apparatus, and nuclear envelope were down-regulated in the stomach. Concerning the molecular function, genes associated with oxidoreductase activity were up-regulated in the stomach, spleen, kidney, lung, and brain. Channel activity, membrane receptor, and electron transporter activity were up-regulated in the stomach, kidney, and lung. Regarding the biological process, genes associated with signal transduction were up-regulated in the stomach, while genes associated with biosynthesis and ATP metabolism were down-regulated. In the spleen, melanin biosynthesis was up-regulated while hormone-related activities were down-regulated. In the kidney, genes associated with nucleotide biosynthesis and ATP metabolism were depressed. In the heart and liver, apoptosis was up-regulated while immune response and RAS signal transduction were down-regulated. Interestingly, genes associated with oncogenesis were up-regulated in the stomach and kidney. Functional fingerprints indicated that dampness weakened membrane structures, depressed metabolic activity (especially ATP metabolism), damaged matrix proteins, enhanced signal transduction, and revealed a positive association with oncogenesis. To quantify the functional impact at the molecular level, mRNA levels of key genes were determined by real-time RT-PCR. The results indicated that ATP storage in the kidney, spleen, and stomach was depleted in damp-obstructed rats. We propose that oxidative stress, membrane integrity, melanin biosynthesis, ion channel activity, and ATP metabolism might be hallmarks for damp-obstructed rats. Our results also suggested that dampness is a pathogenic factor in rats, possibly associated with an increased susceptibility to cancer.
Essential proteins are important for the survival and development of organisms. Many centrality algorithms based on network topology have been proposed to detect essential proteins and have achieved good results. However, most of them focus only on network topology and ignore the false positive (FP) interactions in the protein–protein interaction (PPI) network. In this paper, gene ontology (GO) information is used to measure the reliability of the edges in the PPI network, and we propose a novel algorithm for identifying essential proteins, named the EGC algorithm. The EGC algorithm integrates the topological characteristics of the PPI network with GO information. To validate its performance, we use EGC and nine other methods (DC, BC, CC, SC, EC, LAC, NC, PEC and CoEWC) to identify essential proteins in two different yeast PPI networks: DIP and MIPS. The results show that EGC outperforms the other nine methods, which suggests that adding GO information can help in predicting essential proteins.
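The idea of using GO information to score edge reliability can be illustrated with a small sketch. It is not the EGC algorithm, whose details the abstract does not give. As an assumption, each edge is weighted by the Jaccard overlap of the GO term sets of its two proteins, and each protein is scored by the sum of its incident edge weights (a GO-weighted degree). The protein names, the placeholder GO labels, and the function names are all hypothetical.

```python
from collections import defaultdict

def jaccard(a, b):
    """Jaccard overlap of two GO term sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def weighted_degree(edges, go_terms):
    """Sum, per protein, the GO-based reliability of its incident edges."""
    score = defaultdict(float)
    for u, v in edges:
        w = jaccard(go_terms.get(u, set()), go_terms.get(v, set()))
        score[u] += w
        score[v] += w
    return dict(score)

# Toy network and placeholder annotations, invented for illustration only.
edges = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"), ("P3", "P4")]
go_terms = {
    "P1": {"GO:A", "GO:B"},
    "P2": {"GO:A", "GO:B"},
    "P3": {"GO:B", "GO:C"},
    "P4": {"GO:D"},
}
scores = weighted_degree(edges, go_terms)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy network P3 has the highest plain degree, yet it ranks below P1 and P2 once edges with little shared annotation are down-weighted, which is the kind of effect a reliability weighting is meant to produce.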
Diabetic retinopathy is the most common cause of blindness, associated with many biochemical pathways mediated by several genes and proteins. Disease gene identification can be achieved through several approaches, but it remains a challenging task. This study aimed to identify novel genes associated with diabetic retinopathy. All the well-known genes associated with diabetic retinopathy were collected from databases, and their protein interaction partners were identified. Interacting candidate genes were selected on the basis of chromosomal locations shared with the disease genes. The protein–protein interaction network was constructed and the key nodes (genes) were identified by degree, betweenness centrality, closeness centrality and eccentricity centrality. Further, the ontological terms (molecular function, biological process and cellular component) of the candidates were related to those of the disease genes at a p-value < 0.05. The genes UBC, FOS, ITGB1, FOXA2, CCND1, FOSL1, RXRA and NCAM1 were identified as potential genes associated with diabetic retinopathy. The molecular functions of these genes include protein binding, receptor activity, receptor binding, oxidoreductase activity, protein kinase activity, serine-type peptidase activity and growth factor activity. Many of the identified genes were clinically relevant, as evidenced by the literature.
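Three of the node measures named above can be sketched with a breadth-first search on an unweighted, connected graph. This is an illustrative sketch only: the node labels `g1`..`g5` and the edges are hypothetical and do not correspond to any real interaction data, betweenness centrality is omitted for brevity, and eccentricity is reported as the raw maximum distance (some tools report its reciprocal instead).

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def centralities(adj):
    """Degree, closeness and eccentricity for each node of a connected graph."""
    n = len(adj)
    out = {}
    for node in adj:
        dist = bfs_distances(adj, node)
        total = sum(dist.values())
        out[node] = {
            "degree": len(adj[node]),
            "closeness": (n - 1) / total if total else 0.0,
            "eccentricity": max(dist.values()),
        }
    return out

# Hypothetical adjacency list, invented for illustration.
adj = {
    "g1": ["g2"],
    "g2": ["g1", "g3", "g4"],
    "g3": ["g2"],
    "g4": ["g2", "g5"],
    "g5": ["g4"],
}
result = centralities(adj)
hub = max(result, key=lambda g: result[g]["closeness"])
```

Here `hub` is the node with the highest closeness; a real analysis would combine several such measures before nominating key nodes.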
In this day and age, conducting a biological experiment is presumably a very expensive procedure largely owing to the highly sophisticated and expensive equipment necessitated by the process. Conceivably, being capable of isolating and focusing on a smaller set of imperative genes or gene products that are of high relevance to the experiment, pathway, or biological system under investigation is very desirable largely owing to the potential savings in experimental costs. In this work, we propose an intelligent information system capable of generating a ranked list of genes and gene products that are most pertinent to a given biological pathway, experiment or system (referred to as a biological context henceforth). We assume that the biological context of interest can be described by various textual query terms and phrases from the biological domain which, in turn, relate to various molecular functions, biological processes and cellular components of genes and their products. Intelligent text-based analyses and mining are utilised for this purpose by using the published literature, in the form of publication abstracts downloaded from PubMed, with the intention of ranking genes and gene products having identified relationships to the specified description terms based on the gene ontology (GO) standard. At this stage, our approach is capable of producing promising results given all surrounding restrictions, one of which is the lack of similar work in the literature. For demonstration purposes, we report experimental results on the molting regulation pathway in Drosophila melanogaster (fruit fly).
The need for extracting general biological interactions of arbitrary types from the rapidly growing volume of the biomedical literature is drawing increased attention, while the need for this much diversity also requires both a robust treatment of complex linguistic phenomena and a method to consistently characterize the results. We present a biomedical information extraction system, BioIE, to address both of these needs by utilizing a full-fledged English grammar formalism, or a combinatory categorial grammar, and by annotating the results with the terms of Gene Ontology, which provides a common and controlled vocabulary. BioIE deals with complex linguistic phenomena such as coordination, relative structures, acronyms, appositive structures, and anaphoric expressions. In order to deal with real-world syntactic variations of ontological terms, BioIE utilizes the syntactic dependencies between words in sentences as well, based on the observation that the component words in an ontological term usually appear in a sentence with known patterns of syntactic dependencies.
Function prediction of uncharacterized protein sequences generated by genome projects has emerged as an important focus for computational biology. We have categorized several approaches beyond traditional sequence similarity that utilize the overwhelmingly large amounts of available data for computational function prediction, including structure-, association (genomic context)-, interaction (cellular context)-, process (metabolic context)-, and proteomics-experiment-based methods. Because they incorporate structural and experimental data that is not used in sequence-based methods, they can provide additional accuracy and reliability to protein function prediction. Here, first we review the definition of protein function. Then the recent developments of these methods are introduced with special focus on the type of predictions that can be made. The need for further development of comprehensive systems biology techniques that can utilize the ever-increasing data presented by the genomics and proteomics communities is emphasized. For the readers' convenience, tables of useful online resources in each category are included. The role of computational scientists in the near future of biological research and the interplay between computational and experimental biology are also addressed.
Cisplatin-induced drug resistance is known to involve a complex set of cellular changes whose molecular mechanisms remain unclear. In this study, we developed a systems biology approach to examine proteomics- and network-level changes between cisplatin-resistant and cisplatin-sensitive cell lines. This approach involves experimental investigation of differential proteomics profiles and computational study of activated enriched proteins, protein interactions, and protein interaction networks. Our experimental platform is based on label-free liquid chromatography/mass spectrometry proteomics. Our computational methods start with an initial list of 119 differentially expressed proteins. We expanded these proteins into a cisplatin-resistant activated sub-network using a database of human protein-protein interactions. An examination of network topology features revealed that the activated responses in the network are closely coupled. By examining sub-network proteins using gene ontology categories, we found significant enrichment of proton-transporting ATPase and ATP synthase complex activities in cisplatin-resistant cells in the form of cooperative down-regulations. Using two-dimensional visualization matrices, we further found significant cascading of endogenous, abiotic, and stress-related signals. Using a visual representation of activated protein categorical sub-networks, we showed that molecular regulation of cell differentiation and development caused by responses to proteome-wide stress is a key signature of the acquired drug resistance.
There is a strong need to systematically organize and comprehend the rapidly expanding stores of biomedical knowledge to formulate hypotheses on disease mechanisms. However, no method is available that automatically structures fragmentary knowledge along with domain-specific expressions for large-scale integration. A method presented here, cross-subspace analysis (CSA), produces a holistic view of over 3,000 human genes with a two-dimensional (2D) arrangement. The genes are plotted in relation to functions determined by machine learning from the occurrence patterns of various biomedical terms in MEDLINE abstracts. By focusing on the 2D distributions of gene plots that share the same biomedical concepts, as defined by databases such as Gene Ontology, relevant biomedical concepts can be computationally extracted. In an analysis where myocardial infarction and ischemic stroke were taken as examples, we found valid relations with lifestyle, diet-related metabolism, and host immune responses, all of which are known risk factors for the diseases. These results demonstrate that systematizing accumulated gene knowledge can lead to hypothesis generation and knowledge discovery, regardless of the area of inquiry or discipline.
A large number of biological pathways have been elucidated recently, and there is a need for methods to analyze these pathways. One class of methods compares pathways semantically in order to discover parts that are evolutionarily conserved between species or to discover intraspecies similarities. Such methods usually require that the topologies of the pathways being compared are known, i.e. that a query pathway is being aligned to a model pathway. However, sometimes the query only consists of an unordered set of gene products. Previous methods for mapping sets of gene products onto known pathways have not been based on semantic comparison of gene products using ontologies or other abstraction hierarchies. Therefore, we here propose an approach that uses a similarity function defined in Gene Ontology (GO) terms to find semantic alignments when comparing paths in biological pathways where the nodes are gene products. A known pathway graph is used as a model, and an evolutionary algorithm (EA) is used to evolve putative paths from a set of experimentally determined gene products. The method uses a measure of GO term similarity to calculate a match score between gene products, and the fitness value of each candidate path alignment is derived from these match scores. A statistical test is used to assess the significance of evolved alignments. The performance of the method has been tested using regulatory pathways for S. cerevisiae and M. musculus.
Recent proteome-wide screening efforts have made available genome-wide, high-throughput protein–protein interaction (PPI) maps for several model organisms. This has enabled the systematic analysis of PPI networks, which has become one of the primary challenges for the systems biology community. Here, we address the problem of predicting the functional classes of proteins (i.e. GO annotations) based solely on the structure of the PPI network. We present a maximum likelihood formulation of the problem and the corresponding learning and inference algorithms. The time complexity of both algorithms is linear in the size of the PPI network, and our experimental results show that their accuracy in functional prediction exceeds that of existing methods.
Reconstruction of signaling pathways is crucial for understanding cellular mechanisms. A pathway is represented as a path of a signaling cascade involving a series of proteins to perform a particular function. Since a protein pair involved in signaling and response has a strong interaction, putative pathways can be detected from protein–protein interaction (PPI) networks. However, predicting directed pathways from the undirected genome-wide PPI networks has been challenging. We present a novel computational algorithm to efficiently predict signaling pathways from PPI networks given a starting protein and an ending protein. Our approach integrates topological analysis of PPI networks and semantic analysis of PPIs using Gene Ontology data. An advanced semantic similarity measure is used for weighting each interacting protein pair. Our distance-wise algorithm iteratively selects an adjacent protein from a PPI network to build a pathway based on a distance condition. On each iteration, the strength of a hypothetical path passing through a candidate edge is estimated by a local heuristic. We evaluate the performance by comparing the resultant paths to known signaling pathways in yeast. The results show that our approach has higher accuracy and efficiency than previous methods.
The gene ontology (GO) is used extensively in the field of genomics. Like other large and complex ontologies, quality assurance (QA) efforts for GO’s content can be laborious and time consuming. Abstraction networks (AbNs) are summarization networks that reveal and highlight high-level structural and hierarchical aggregation patterns in an ontology. They have been shown to successfully support QA work in the context of various ontologies. Two kinds of AbNs, called the area taxonomy and the partial-area taxonomy, are developed for GO hierarchies and derived specifically for the biological process (BP) hierarchy. Within this framework, several QA heuristics, based on the identification of groups of anomalous terms which exhibit certain taxonomy-defined characteristics, are introduced. Such groups are expected to have higher error rates when compared to other terms. Thus, by focusing QA efforts on anomalous terms one would expect to find relatively more erroneous content. By automatically identifying these potential problem areas within an ontology, time and effort will be saved during manual reviews of GO’s content. BP is used as a testbed, with samples of three kinds of anomalous BP terms chosen for a taxonomy-based QA review. Additional heuristics for QA are demonstrated. From the results of this QA effort, it is observed that different kinds of inconsistencies in the modeling of GO can be exposed with the use of the proposed heuristics. For comparison, the results of QA work on a sample of terms chosen from GO’s general population are presented.
Protein–Protein Interactions (PPIs) are very important as they coordinate almost all cellular processes. This paper attempts to formulate the PPI prediction problem in a multi-objective optimization framework. The scoring functions for a trial solution deal with simultaneous maximization of functional similarity, strength of the domain interaction profiles, and the number of common neighbors of the proteins predicted to be interacting. This optimization problem is solved using the proposed Firefly Algorithm with Nondominated Sorting. Experiments reveal that the proposed PPI prediction technique outperforms existing methods, including gene ontology-based Relative Specific Similarity, multi-domain-based Domain Cohesion Coupling method, domain-based Random Decision Forest method, Bagging with REP Tree, and evolutionary/swarm algorithm-based approaches, with respect to sensitivity, specificity, and F1 score.
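The nondominated sorting component can be illustrated independently of the firefly search, which is not sketched here. The following minimal example splits candidate solutions into successive Pareto fronts under maximization of all objectives. The three-tuple scores stand in for (functional similarity, domain interaction strength, common neighbors); the values and function names are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def nondominated_sort(points):
    """Split points into successive Pareto fronts; returns lists of indices."""
    remaining = set(range(len(points)))
    fronts = []
    while remaining:
        # A point belongs to the current front if nothing left dominates it.
        front = sorted(
            i for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts

# Hypothetical objective vectors for four trial solutions (all maximized).
candidates = [
    (0.9, 0.2, 3),
    (0.5, 0.8, 1),
    (0.4, 0.1, 1),
    (0.9, 0.2, 2),
]
fronts = nondominated_sort(candidates)
```

The first front holds the trade-off solutions that no other candidate beats on every objective; later fronts are progressively worse. This quadratic-per-front version favors readability over speed.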
In complex disorders, the collaborative role of several genes accounts for the multitude of symptoms, and the discovery of molecular mechanisms requires a proper understanding of the pertinent genes. The majority of recent techniques either use a single information source or consolidate independent views from multiple knowledge sources to assist the discovery of candidate genes. Given that various heterogeneous sources are potentially significant for gene prioritization, each bearing information not conveyed by the others, we assert that an ideal strategy ought to discriminate among them in a genuinely integrative style that captures the contribution of each, instead of using a straightforward combination of sources. We propose a flexible approach that enables multi-source information integration for gene prioritization and exploits the complementary nature of the various knowledge sources so as to use the maximum information in the aggregated data. To illustrate the proposed approach, we took Autism Spectrum Disorder (ASD) as a case study and validated the framework on benchmark studies. We observed that the combined ranking based on integrated knowledge reduces false positive observations and boosts performance when compared with individual rankings. The clinical phenotype validation for ASD shows that there is a significant linkage between top-ranked genes and endophenotypes of ASD. Categorization of genes based on endophenotype associations by this method will be useful for further hypothesis generation leading to clinical and translational analysis. This approach may also be useful in other complex neurological and psychiatric disorders with a strong genetic component.
Predicting protein function from a Protein–Protein Interaction Network (PPIN) and physico-chemical features using the Gene Ontology (GO) classification is very useful for assigning biological or biochemical functions to a protein. Such predictions also lead to the identification of significant proteins responsible for various diseases for which drugs are yet to be discovered. The prediction of GO functional terms from PPIN and sequence is therefore an important field of study. In this work, we propose a methodology, Multi Label Protein Function Prediction (ML_PFP), which is based on neighborhood analysis empowered with physico-chemical features of constituent amino acids to predict the functional groups of unannotated proteins. A protein does not perform functions in isolation; rather, it performs functions in a group by interacting with others. A protein is thus involved in many functions or, in other words, may be associated with multiple functional groups, labels, or GO terms. Although the functional groups of known interacting partner proteins and their physico-chemical features provide useful information, assigning multiple labels to an unannotated protein is a very challenging task. Here, we use the Homo sapiens (human) PPIN and the Saccharomyces cerevisiae (yeast) PPIN along with their GO terms to predict functional groups or GO terms of unannotated proteins. The task is challenging because both the human and yeast protein datasets are voluminous and complex, and multi-label functional group assignment adds a further dimension to this challenge. Our algorithm achieves better performance in Cellular Function, Molecular Function and Biological Process for both the yeast and human networks when compared with existing state-of-the-art methodologies, as discussed in detail in the results section.
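The neighborhood-analysis idea can be sketched as a simple multi-label vote: an unannotated protein receives every label carried by a sufficient share of its annotated interaction partners. This is not ML_PFP itself; the physico-chemical features are left out, and the `min_fraction` threshold, the protein names, and the labels `F1`..`F3` are hypothetical choices made for illustration.

```python
from collections import Counter

def predict_labels(protein, adj, annotations, min_fraction=0.5):
    """Return labels carried by at least min_fraction of the annotated neighbours."""
    # Neighbours without any annotation cannot vote.
    annotated = [n for n in adj.get(protein, ()) if annotations.get(n)]
    if not annotated:
        return set()
    votes = Counter(label for n in annotated for label in annotations[n])
    return {
        label
        for label, count in votes.items()
        if count / len(annotated) >= min_fraction
    }

# Toy neighbourhood, invented for illustration.
adj = {"X": ["A", "B", "C", "D"]}
annotations = {
    "A": {"F1", "F2"},
    "B": {"F1"},
    "C": {"F2", "F3"},
    "D": set(),  # unannotated partner, ignored by the vote
}
predicted = predict_labels("X", adj, annotations)
```

Because several labels can clear the threshold at once, the output is naturally multi-label; raising `min_fraction` trades recall for precision.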
Protein complexes are the cornerstones of most of the biological processes. Identifying protein complexes is crucial in understanding the principles of cellular organization with several important applications, including in disease diagnosis. Several computational techniques have been developed to identify protein complexes from protein–protein interaction (PPI) data (equivalently, from PPI networks). These PPI data have a significant amount of false positives, which is a bottleneck in identifying protein complexes correctly. Gene ontology (GO)-based semantic similarity measures can be used to assign a confidence score to PPIs. Consequently, low-confidence PPIs are highly likely to be false positives. In this paper, we systematically study the impact of low-confidence PPIs on the performance of complex detection methods using GO-based semantic similarity measures. We consider five state-of-the-art complex detection algorithms and nine GO-based similarity measures in the evaluation. We find that each complex detection algorithm significantly improves its performance after the filtration of low-similarity scored PPIs. It is also observed that the percentage improvement and the filtration percentage (of low-confidence PPIs) are highly correlated.
We propose a new bi-clustering algorithm, LinCoh, for finding linear coherent bi-clusters in gene expression microarray data. Our method exploits a robust technique for identifying conditionally correlated genes, combined with an efficient density-based search for clustering sample sets. Experimental results on both synthetic and real datasets demonstrated that LinCoh consistently finds more accurate and higher quality bi-clusters than existing bi-clustering algorithms.
Enterovirus 71 infection has been associated with severe neurological disease in several clinical studies; however, the detailed gene network mechanisms of enterovirus 71-infected cells remain unclear. We present a new approach integrating microarray expression data, the KEGG database, gene ontology (GO), and OMIM information for efficiently deciphering pathways of enterovirus 71-infected cells. This approach includes the following steps: (1) profiling the significant gene-gene interactions through a pathway database, (2) using Fisher's exact test to analyze pathway information and to rank the ten most significant pathways, (3) annotating functions of genes in the pathways through gene ontology, and (4) investigating related genes and potentially associated diseases by referring to OMIM information. Our findings illustrate at least three possible pathways in enterovirus 71-infected human neural SF268 cells: Jak-STAT signaling, cell cycle, and apoptosis. Furthermore, we show that some genes are associated with neural development and neural apoptosis, such as c-Myc, BAX, NGF, and CPP32. These would be useful for profiling disease mechanisms and host response to virus in future research.
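Step (2) above can be sketched as a one-sided Fisher's exact test for over-representation, computed directly from the hypergeometric distribution. This is an illustrative sketch under assumed inputs: the background size, list size, pathway names, and counts are hypothetical, and the abstract does not state whether a one-sided or two-sided test was used.

```python
from math import comb

def fisher_enrichment_p(k, K, n, N):
    """One-sided p-value P(X >= k) under the hypergeometric null.

    N: genes in the background, K: background genes in the pathway,
    n: genes in the selected list, k: selected genes in the pathway.
    """
    upper = min(K, n)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

# Hypothetical counts: 20 background genes, 4 of them selected.
N, n = 20, 4
pathway_counts = {
    "pathway_a": (3, 5),  # (overlap k, pathway size K)
    "pathway_b": (1, 8),
}
p_values = {
    name: fisher_enrichment_p(k, K, n, N)
    for name, (k, K) in pathway_counts.items()
}
# Keep the ten most significant pathways, smallest p-value first.
top_pathways = sorted(p_values, key=p_values.get)[:10]
```

With many pathways tested at once, a multiple-testing correction would normally be applied before interpreting the ranked list.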
The similarity of two gene products can be used to solve many problems in information biology. Since one gene product corresponds to several GO (Gene Ontology) terms, one way to calculate gene product similarity is to use the similarity of their GO terms. This GO term similarity can be defined as the semantic similarity on the GO graph. Many definitions of the similarity of two GO terms exist, but they do not use the information in the GO graph efficiently. This paper presents a new way to mine more information from the GO graph by treating edges as carriers of information content and by using negation information on the semantic graph. A simple experiment shows that accuracy increased by 8.3 percent on average compared with the traditional method, which uses nodes as the information source.
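For orientation, a node-based baseline of the kind this abstract contrasts itself with can be sketched as follows. As an assumption, it uses Resnik-style similarity: the information content (IC) of a term is the negative log of its annotation frequency, and two terms are as similar as their most informative common ancestor. The abstract does not name its baseline, and the edge-based method it proposes is not reproduced here. The tiny graph, term names, and counts are hypothetical.

```python
from math import log

def ancestors(term, parents):
    """The term itself plus every term reachable through parent links."""
    seen = {term}
    stack = [term]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def information_content(direct_counts, parents):
    """IC(t) = log(total / freq(t)), with counts propagated to all ancestors."""
    freq = {}
    for term, count in direct_counts.items():
        for a in ancestors(term, parents):
            freq[a] = freq.get(a, 0) + count
    total = max(freq.values())  # the root accumulates every annotation
    return {t: log(total / f) for t, f in freq.items() if f > 0}

def resnik(t1, t2, ic, parents):
    """IC of the most informative common ancestor of t1 and t2."""
    common = ancestors(t1, parents) & ancestors(t2, parents)
    return max((ic.get(t, 0.0) for t in common), default=0.0)

# Hypothetical term graph (child -> parents) and direct annotation counts.
parents = {"a": ["root"], "b": ["root"], "a1": ["a"], "a2": ["a"]}
direct_counts = {"a1": 2, "a2": 2, "b": 4}
ic = information_content(direct_counts, parents)
sim_siblings = resnik("a1", "a2", ic, parents)
sim_distant = resnik("a1", "b", ic, parents)
```

Sibling terms under a specific parent score higher than terms that only share the root, which carries no information.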