This volume contains papers presented at the 19th International Conference on Genome Informatics (GIW 2008), held at the Marriott Surfers Paradise Resort, Gold Coast, Queensland, Australia, from December 1 to 3, 2008. The GIW Series provides an international forum for the presentation and discussion of original research papers on all aspects of bioinformatics, computational biology and systems biology. Its scope includes biological sequence analysis, protein structure prediction, genetic regulatory networks, bioinformatic algorithms, comparative genomics, and biomolecular data integration and analysis. Boasting a history of 19 years, GIW is the longest-running international bioinformatics conference.
A total of 18 contributed papers were selected for presentation at GIW 2008 and for inclusion in this book. The selected papers come from institutions in 18 countries. In addition, this book contains abstracts from the six invited speakers: Sean Grimmond (Institute for Molecular Bioscience, The University of Queensland, Australia), Eugene V Koonin (National Center for Biotechnology Information, National Institutes of Health, USA), Ming Li (University of Waterloo, Canada), Yi-Xue Li (Chinese Academy of Sciences and Shanghai Jiaotong University, China), John Mattick (Institute for Molecular Bioscience, The University of Queensland, Australia), and Eric Schadt (Rosetta Inpharmatics, USA).
https://doi.org/10.1142/9781848163324_fmatter
Preface
Acknowledgments
Committee
Contents
https://doi.org/10.1142/9781848163324_0001
Transcriptome analysis using high-throughput short-read sequencing technology is straightforward when the sequenced genome is the same species as, or extremely similar to, the reference genome. We present an analysis approach for when the sequenced organism does not have an already sequenced genome that can be used as a reference, as will be the case for many non-model organisms. As a proof of concept, data from Solexa sequencing of the polyploid plant Pachycladon enysii were analysed using our approach, with its nearest model reference genome being the diploid plant Arabidopsis thaliana. By using a combination of mapping and de novo assembly tools we could determine duplicate genes belonging to one or the other of the genome copies. Our approach demonstrates that transcriptome analysis using high-throughput short-read sequencing need not be restricted to the genomes of model organisms.
https://doi.org/10.1142/9781848163324_0002
We recently introduced a biologically realistic and reliable significance analysis of the output of a popular class of motif finders [16]. In this paper we further improve our significance analysis by incorporating local base composition information. Relying on realistic biological data simulation, as well as on FDR analysis applied to real data, we show that our method is significantly better than the increasingly popular practice of using the normal approximation to estimate the significance of a finder's output. Finally we turn to leveraging our reliable significance analysis to improve the actual motif finding task. Specifically, endowing a variant of the Gibbs Sampler [18] with our improved significance analysis we demonstrate that de novo finders can perform better than has been perceived. Significantly, our new variant outperforms all the finders reviewed in a recently published comprehensive analysis [23] of the Harbison genome-wide binding location data [9]. Interestingly, many of these finders incorporate additional information such as nucleosome positioning and the significance of binding data.
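As a rough illustration of the contrast drawn above, the sketch below builds an empirical null for a finder's score from background sequences resampled window by window, preserving local base composition (the paper's key refinement), and compares it with the normal approximation the paper argues against. All names and the windowing scheme are hypothetical stand-ins, not the authors' implementation.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def resample_background(seq, window=50):
    """Resample each window of a real sequence from that window's own base
    composition, so the simulated background preserves local GC content."""
    bases = np.array(list("ACGT"))
    out = []
    for i in range(0, len(seq), window):
        chunk = seq[i:i + window]
        freqs = np.array([chunk.count(b) + 1 for b in "ACGT"], float)  # +1 pseudocount
        out.append("".join(rng.choice(bases, size=len(chunk), p=freqs / freqs.sum())))
    return "".join(out)

def empirical_pvalue(observed, null_scores):
    """Significance from the simulated null; no distributional assumption."""
    null_scores = np.asarray(null_scores)
    return (np.sum(null_scores >= observed) + 1.0) / (len(null_scores) + 1.0)

def normal_pvalue(observed, null_scores):
    """The normal approximation, shown only for comparison."""
    return norm.sf(observed, loc=np.mean(null_scores), scale=np.std(null_scores))
```

Here `null_scores` would be the motif finder's best score on each resampled dataset.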
https://doi.org/10.1142/9781848163324_0003
Tag SNP selection is an important problem in computational biology and genetics because a small set of tag SNP markers may help reduce the cost of genotyping and thus of genome-wide association studies. Several methods for selecting a smallest possible set of tag SNPs, based on different formulations of tag SNP selection (block-based or genome-wide) and different mathematical models of marker correlation, have been investigated in the literature. In this paper, we propose a new model of multi-marker correlation for genome-wide tag SNP selection, and a simple greedy algorithm to select a smallest possible set of tag SNPs according to the model. Our experimental results on several real datasets from the HapMap project demonstrate that the new model yields more succinct tag SNP sets than the previous methods.
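The greedy procedure is not spelled out in the abstract; a minimal set-cover-style sketch in Python, assuming a precomputed `coverage` map from each candidate SNP to the set of SNPs it tags (itself included) under some correlation model such as r² ≥ 0.8, could look like this:

```python
def greedy_tag_snps(coverage):
    """Repeatedly pick the SNP that tags the most still-untagged SNPs.
    `coverage` is a dict: candidate SNP -> set of SNPs it tags. A
    multi-marker correlation model like the paper's would enlarge these
    sets, letting fewer tags cover the same markers."""
    uncovered = set().union(*coverage.values())
    tags = []
    while uncovered:
        best = max(coverage, key=lambda s: len(coverage[s] & uncovered))
        if not coverage[best] & uncovered:
            break
        tags.append(best)
        uncovered -= coverage[best]
    return tags

# Toy example: 'a' tags b and c, 'd' tags e -> two tags suffice.
print(greedy_tag_snps({'a': {'a', 'b', 'c'}, 'd': {'d', 'e'}, 'b': {'b'}}))
```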
https://doi.org/10.1142/9781848163324_0004
Phenotype MicroArray (PM) technology is a high-throughput phenotyping system [1] and is directly applicable to assaying the effects of genetic changes in cells. In this study, we performed comprehensive PM analysis using single-gene deletion mutants of central metabolic pathway and related genes. To elucidate the structure of the central metabolic network in Escherichia coli K-12, we focused on 288 different PM conditions of carbon and nitrogen sources and performed bioinformatic analysis. For data processing, we employed noise reduction procedures. The distance between mutants was defined by the Manhattan distance, and Ward's agglomerative hierarchical method was applied for clustering analysis. As a result, five clusters were revealed, representing activation or repression of cellular respiratory activities. Furthermore, the results suggest that glyceraldehyde-3P may play a key role as a molecular switch of the central metabolic network.
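The clustering step described above maps directly onto standard SciPy calls; a minimal sketch with placeholder data (the real input would be the mutant profiles over the 288 PM conditions):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

profiles = np.random.rand(40, 288)            # rows: mutants, cols: PM conditions

dists = pdist(profiles, metric="cityblock")   # Manhattan distance between mutants

# Ward's agglomerative method. Note that Ward's update is formally derived
# for Euclidean distances; pairing it with Manhattan follows the paper's
# description rather than the textbook derivation.
tree = linkage(dists, method="ward")

clusters = fcluster(tree, t=5, criterion="maxclust")   # cut into five clusters
```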
https://doi.org/10.1142/9781848163324_0005
This paper considers the problem of enumerating all non-isomorphic tree-like chemical graphs with given path frequency, where "tree-like" means that the graph can be viewed as a tree if multiple edges (i.e., edges with the same end points) and a benzene ring are treated as one edge and one vertex, respectively, and "path frequency" is a vector of the numbers of specified vertex-labeled paths that must be realized in every output. This and related problems have several potential applications such as classification of chemical compounds, structure determination using mass-spectrum and/or NMR and design of novel chemical compounds.
Several studies have addressed this problem. Recently, Fujiwara et al. (2008) gave two formulations and, for each of them, a branch-and-bound algorithm that combines efficient enumeration of non-isomorphic trees with bounding operations based on the path frequency and the atom-atom bonds to avoid the generation of invalid trees. In this paper, based on their work and a result of Nagamochi (2006), we introduce two new bounding operations, the detachment-cut and the H-cut, to further reduce the size of the search space. We performed computational experiments to compare our proposed algorithms with those of Fujiwara et al. (2008) using chemical compound data obtained from the KEGG LIGAND database (http://www.genome.jp/kegg/ligand.html). The results show that our proposed algorithms are much faster than their algorithms.
https://doi.org/10.1142/9781848163324_0006
Detection of ligand-binding sites in protein structures is a crucial task in structural bioinformatics, with applications in important areas such as drug discovery. Given the knowledge of the site in a particular protein structure that binds a specific ligand, we can search for similar sites in other protein structures to which the same ligand is likely to bind. In this paper, we propose a new method named "BSAlign" (Binding Site Aligner) for rapid detection of potential binding sites in target proteins that are similar to the query protein's ligand-binding site. We represent both the binding site and the protein structure as graphs, and employ a subgraph isomorphism algorithm to detect similar binding sites in a very time-efficient manner. Preliminary experimental results show that the proposed BSAlign binding site detection method is about 14 times faster than a well-known method called SiteEngine, while offering the same level of accuracy. Both BSAlign and SiteEngine achieve 60% search accuracy in finding adenine-binding sites from a data set of 126 proteins. The proposed method can be a useful contribution to speed-critical applications such as drug discovery, in which a large number of proteins need to be processed. The program is available for download at: http://www1.i2r.a-star.edu.sg/~azeyar/BSAlign/.
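The graph formulation can be sketched with NetworkX's generic (and much slower) subgraph isomorphism matcher; the node labels, distance bins, and toy sites below are illustrative stand-ins for BSAlign's actual representation:

```python
import networkx as nx
from networkx.algorithms import isomorphism as iso

def site_graph(residues, contacts):
    """residues: {id: chemical label}; contacts: [(id1, id2, distance bin)]."""
    g = nx.Graph()
    for rid, label in residues.items():
        g.add_node(rid, label=label)
    for a, b, dbin in contacts:
        g.add_edge(a, b, dist=dbin)
    return g

query = site_graph({1: "donor", 2: "acceptor", 3: "hydrophobic"},
                   [(1, 2, "short"), (2, 3, "medium")])
target = site_graph({10: "donor", 11: "acceptor", 12: "hydrophobic", 13: "donor"},
                    [(10, 11, "short"), (11, 12, "medium"), (12, 13, "long")])

m = iso.GraphMatcher(target, query,
                     node_match=iso.categorical_node_match("label", None),
                     edge_match=iso.categorical_edge_match("dist", None))
print(m.subgraph_is_isomorphic())   # True: the query site occurs in the target
```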
https://doi.org/10.1142/9781848163324_0007
The increasing amount of available Protein-Protein Interaction (PPI) data enables scalable methods for protein complex prediction. A protein complex is a group of two or more proteins formed by interactions that are stable over time, and it generally corresponds to a dense sub-graph in a PPI network (PPIN). However, dense sub-graphs correspond not only to stable protein complexes but also to sets of proteins involving dynamic interactions. As a result, conventional graph-theoretic clustering methods based on simple PPINs have high false positive rates in protein complex prediction. In this paper, we propose an approach to predict protein complexes based on the integration of PPI data and mutually exclusive interaction information drawn from structural interface data of protein domains. The extraction of Simultaneous Protein Interaction Clusters (SPICs) is the essence of our approach, which excludes interaction conflicts in network clusters by enforcing mutual exclusion among interactions. The concept of SPIC was applied to the conventional graph-theoretic clustering algorithms MCODE and LCMA to evaluate the density of clusters for protein complex prediction. Comparison with the original graph-theoretic clustering algorithms verified the effectiveness of our approach: SPIC-based methods refined false positives of the original methods into true positive complexes, without any loss of the true positive predictions yielded by the original methods.
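The core SPIC constraint (no two interactions in a cluster may compete for the same interface) can be stated in a few lines. The real extraction procedure is more involved, but a hedged sketch of the admissibility test might read:

```python
from itertools import combinations

def is_simultaneous(cluster_edges, exclusive_pairs):
    """A cluster is a candidate SPIC only if none of its interaction pairs
    is mutually exclusive. `exclusive_pairs` holds frozensets of two
    conflicting interactions, derived from domain-interface structures."""
    return not any(frozenset((e1, e2)) in exclusive_pairs
                   for e1, e2 in combinations(cluster_edges, 2))

edges = [("A", "B"), ("A", "C"), ("B", "C")]
conflicts = {frozenset([("A", "B"), ("A", "C")])}   # B and C compete for A's interface
print(is_simultaneous(edges, conflicts))            # False: not simultaneously realizable
```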
https://doi.org/10.1142/9781848163324_0008
Genome-scale metabolic modeling is a systems-based approach that attempts to capture the metabolic complexity of the whole cell, for the purpose of gaining insight into metabolic function and regulation. This is achieved by organizing the metabolic components and their corresponding interactions into a single context. The reconstruction process is a challenging and laborious task, especially during the stage of manual curation. For the mouse genome-scale metabolic model, however, we were able to rapidly reconstruct a compartmentalized model from well-curated metabolic databases online. The prototype model was comprehensive. Apart from minor compound naming and compartmentalization issues, only nine additional reactions without gene associations were added during model curation before the model was able to simulate growth in silico. Further curation led to a metabolic model that consists of 1399 genes mapped to 1757 reactions, with a total of 2037 reactions compartmentalized into the cytoplasm and mitochondria, capable of reproducing metabolic functions inferred from the literature. The reconstruction is made more tractable by developing a formal system to update the model against online databases. Effectively, we can focus our curation efforts on establishing better model annotations and gene–protein–reaction associations within the core metabolism, while relying on genome and proteome databases to build new annotations for peripheral pathways, which may bear less relevance to our modeling interest.
https://doi.org/10.1142/9781848163324_0009
We propose a statistical strategy to predict differentially regulated genes between case and control samples from time-course gene expression data, by leveraging the unpredictability of expression patterns under the regulatory system inferred by a state space model. The proposed method can screen out genes that show different patterns but are generated by the same regulation in both samples, since these patterns can be predicted by the same model. Our strategy consists of three steps. First, a gene regulatory system is inferred from the control data by a state space model. Then, the obtained model for the underlying regulatory system of the control sample is used to predict the case data. Finally, by assessing the significance of the difference between the case and predicted-case time-course data of each gene, we are able to detect the unpredictable genes, which are candidates for the key differences between the regulatory systems of case and control cells. We illustrate the whole process of the strategy with an actual example, in which human small airway epithelial cell gene regulatory systems were generated from novel time courses of gene expression following treatment with (case) or without (control) the drug gefitinib, an inhibitor of the epidermal growth factor receptor tyrosine kinase. In the gefitinib response data we succeeded in finding unpredictable genes that are candidate specific targets of gefitinib. We also discuss differences in the regulatory systems for the unpredictable genes. The proposed method would be a promising tool for identifying biomarkers and drug target genes.
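The third step (scoring unpredictability) admits a simple schematic, assuming the state space model has already produced per-gene predictions and a per-gene residual noise level from the control fit; all names here are hypothetical:

```python
import numpy as np
from scipy import stats

def unpredictability(case, predicted_case, control_residual_sd):
    """case, predicted_case: (genes x timepoints) arrays;
    control_residual_sd: per-gene noise level from the control fit.
    The sum of squared standardized residuals across timepoints is
    compared to a chi-square null: large values flag unpredictable genes."""
    z = (case - predicted_case) / control_residual_sd[:, None]
    stat = (z ** 2).sum(axis=1)
    pvals = stats.chi2.sf(stat, df=case.shape[1])
    return stat, pvals
```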
https://doi.org/10.1142/9781848163324_0010
Thorough knowledge of the model organism S. cerevisiae has fueled efforts in developing theories of cell ageing since the 1950s. Models of these theories aim to provide insight into the general biological processes of ageing, as well as to have predictive power for guiding experimental studies such as cell rejuvenation. Current efforts in in silico modeling are frustrated by the lack of efficient simulation tools that admit precise mathematical models at both cell and population levels simultaneously. We developed a novel hierarchical simulation tool that allows dynamic creation of entities while rigorously preserving the mathematical semantics of the model. We used it to expand a single-cell model of protein damage segregation to a cell population model that explicitly tracks mother-daughter relations. Large-scale exploration of the resulting tree of simulations established that daughters of older mothers show a rejuvenation effect, consistent with experimental results. The combination of a single-cell model and a simulation platform permitting parallel composition and dynamic node creation has proved to be an efficient tool for in silico exploration of cell behavior.
https://doi.org/10.1142/9781848163324_0011
Recent development of cluster of differentiation (CD) antibody arrays has enabled expression levels of many leukocyte surface CD antigens to be monitored simultaneously. Such membrane-proteome surveys have provided a powerful means to detect changes in leukocyte activity in various human diseases, such as cancer and cardiovascular diseases. The challenge is to devise a computational method to infer differential leukocyte activity among multiple biological states based on antigen expression profiles. Standard DNA microarray analysis methods cannot accurately infer differential leukocyte activity because they often fail to take the cell-to-antigen relationships into account. Here we present a novel latent variable model (LVM) approach to tackle this problem. The idea is to model each cell type as a latent variable, and represent the class-to-cell and cell-to-antigen relationships as an LVM. Once the parameters of the LVM are learned from the data, differentially active leukocytes can be easily identified from the model. We describe the model formulation and assumptions which lead to an efficient expectation-maximization algorithm. Our LVM method was applied to re-analyze two cardiovascular disease datasets. We show that our results match existing biological knowledge better than other methods such as gene set enrichment analysis. Furthermore, we discuss how our approach can be extended to become a general framework for gene set analysis for DNA microarrays.
https://doi.org/10.1142/9781848163324_0012
High-throughput protein interaction data, with ever-increasing volume, are becoming the foundation of many biological discoveries. However, high-throughput protein interaction data are often associated with high false positive and false negative rates. It is desirable to develop scalable methods to identify these errors. In this paper, we develop a computational method to identify spurious interactions and missing interactions from high-throughput protein interaction data. Our method uses both local and global topological information of protein pairs, and it assigns a local interacting score and a global interacting score to every protein pair. The local interacting score is calculated based on the common neighbors of the protein pairs. The global interacting score is computed using globally interacting protein group pairs. The two scores are then combined to obtain a final score called LGTweight to indicate the interacting possibility of two proteins. We tested our method on the DIP yeast interaction dataset. The experimental results show that the interactions ranked top by our method have higher functional homogeneity and localization coherence than existing methods, and our method also achieves higher sensitivity and precision under 5-fold cross validation than existing methods.
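The local component lends itself to a compact sketch. The exact LGTweight formula, and its combination with the global group-based score, is the paper's contribution, so the Jaccard-style overlap below is just one plausible instantiation of a common-neighbor score:

```python
import networkx as nx

def local_score(g, u, v):
    """Overlap of the two proteins' neighborhoods, excluding each other:
    high overlap supports a true interaction, zero overlap is suspicious."""
    nu, nv = set(g[u]) - {v}, set(g[v]) - {u}
    union = nu | nv
    return len(nu & nv) / len(union) if union else 0.0

g = nx.Graph([("p1", "p2"), ("p1", "p3"), ("p2", "p3"), ("p2", "p4")])
print(local_score(g, "p1", "p2"))   # p3 is a shared neighbor -> 0.5
```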
https://doi.org/10.1142/9781848163324_0013
Models of nucleotide or amino acid sequence evolution that implement homogeneous and stationary Markov processes of substitutions are mathematically convenient but are unlikely to represent the true complexity of evolution. With the large amounts of data that next generation sequencing promises, appropriate models of evolution are important, particularly when data are collected from ancient and sub-fossil remains, where changes in evolutionary parameters are the norm and not the exception. In this paper, we describe a new codon-based model of evolution that applies to Measurably Evolving Populations (MEPs). A MEP is defined as a population from which it is possible to detect a statistically significant accumulation of substitutions when sequences are obtained at different times. The new model of codon evolution permits changes to the substitution process, including changes to the intensity of selection and the proportions of sites undergoing different selective pressures. In our serial model of codon evolution, changes in the selective regime occur simultaneously across all lineages. Different regions of the protein may also evolve under distinct selective patterns. We illustrate the application of the new model to a dataset of HIV-1 sequences obtained from an infected individual before and after the commencement of antiretroviral therapy.
https://doi.org/10.1142/9781848163324_0014
Gene transfer is a major contributing factor to functional innovation in genomes. Endosymbiotic gene transfer (EGT) is a specific instance of lateral gene transfer (LGT) in which genetic material is acquired by the host genome from an endosymbiont that has been engulfed and retained in the cytoplasm. Here we present a comprehensive approach for detecting gene transfer within a phylogenetic framework. We applied the approach to examine EGT of red algal genes into Thalassiosira pseudonana, a free-living diatom for which a complete genome sequence has recently been determined. Out of 11,390 predicted protein-coding sequences from the genome of T. pseudonana, 124 (1.1%, clustered into 80 gene families) are inferred to be of red algal origin (bootstrap support ≥ 75%). Of these 80 gene families, 22 (27.5%) encode novel, unknown functions. We found 21.3% of the gene families to putatively encode non-plastid-targeted proteins. Our results suggest that EGT of red algal genes provides a relatively minor contribution to the nuclear genome of the diatom, but the transferred genes have functions that extend beyond photosynthesis. This assertion awaits experimental validation. Whereas the current study is focused within the context of secondary endosymbiosis, our approach can be applied to large-scale detection of gene transfer in any system.
https://doi.org/10.1142/9781848163324_0015
Using cis-regulatory motifs known to regulate the plant osmotic stress response, an artificial neural network model was built to identify other functionally related genes involved in the same process. The rationale behind our approach is that gene expression is largely controlled at the transcriptional level through the interactions between transcription factors and cis-regulatory elements. Gene Ontology enrichment analysis on the 500 top-scoring predictions showed that 60% of the enriched GO classifications were related to stress response. RT-PCR analysis showed that nearly 70% of the top-scoring predictions exhibited altered expression under various stress treatments. We expect that a similar approach is widely applicable for inferring gene function in various cellular processes in different species.
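The network's architecture is not given in the abstract; a minimal scikit-learn sketch of the general setup (motif presence/absence features in, a ranking of candidate stress genes out) with random placeholder data:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# rows: promoters; cols: presence/absence of known osmotic-stress motifs.
X = np.random.randint(0, 2, size=(200, 15))
y = np.random.randint(0, 2, size=200)       # 1 = known stress-response gene

net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
net.fit(X, y)

# Score an unannotated promoter; the top of such a genome-wide ranking is
# what would be followed up with GO enrichment and RT-PCR.
candidate = np.random.randint(0, 2, size=(1, 15))
print(net.predict_proba(candidate)[0, 1])
```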
https://doi.org/10.1142/9781848163324_0016
Regulation of transcription is controlled by sets of transcription factors binding specific sites in the regulatory regions of genes. It is therefore believed that regulatory regions driving similar expression profiles share some common structural features. Here we introduce a computational approach for finding a small set of rules describing the presence and positioning of motifs in a set of promoter sequences. This rule set is subsequently used for finding promoters that drive similar expression profiles from a genomic set of sequences. We applied our approach to muscle-expressed genes in Caenorhabditis elegans. We obtained a high average performance, and in the best case we found that almost 50% of true positive test genes scored higher than 90% of the true negative test genes. High-scoring non-training sequences were enriched for muscle-expressed genes, and predicted motifs fitting the rules showed a significant tendency to be present in experimentally verified regulatory regions. Our model is more general than existing cis-regulatory module models, as rules selected by our model contain a variety of information, including not only proximal but also distal positioning of pairs of motifs, positioning with regard to the translation start site, and the simple presence of motifs. We believe our model can help increase our understanding of transcription factor cooperation and transcription initiation.
https://doi.org/10.1142/9781848163324_0017
Robust cancer molecular pattern identification from microarray data not only plays an essential role in modern clinical oncology, but also presents a challenge for statistical learning. Although principal component analysis (PCA) is a widely used feature selection algorithm in microarray analysis, its holistic mechanism prevents it from capturing the latent local data structure needed in subsequent cancer molecular pattern identification. In this study, we investigate the benefit of enforcing non-negativity constraints on PCA and propose a nonnegative principal component analysis (NPCA) based classification algorithm for cancer molecular pattern analysis of gene expression data. This novel algorithm conducts classification by classifying meta-samples of the input cancer data with support vector machines (SVM) or other classic supervised learning algorithms. The meta-samples are low-dimensional projections of the original cancer samples in a purely additive meta-gene subspace generated from the NPCA-induced nonnegative matrix factorization (NMF). We report strong, leading classification results from the NPCA-SVM algorithm in cancer molecular pattern identification for five benchmark gene expression datasets, under 100 trials of 50% hold-out and leave-one-out cross validations. We demonstrate the superiority of the NPCA-SVM algorithm by direct comparison with seven classification algorithms (SVM, PCA-SVM, KPCA-SVM, NMF-SVM, LLE-SVM, PCA-LDA and k-NN) on the five cancer datasets in terms of classification rates, sensitivities and specificities. Our NPCA-SVM algorithm overcomes the over-fitting problem associated with SVM-based classification of gene expression data under a Gaussian kernel. As a more robust high-performance classifier, NPCA-SVM can be used to replace general SVM and k-NN classifiers in cancer biomarker discovery to capture more meaningful oncogenes.
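NPCA itself is not available in standard libraries, but the overall pipeline (factorize samples into a non-negative meta-gene space, then classify the meta-samples with an SVM) can be sketched via the closely related NMF-SVM baseline the paper compares against; data here are random placeholders:

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

X = np.abs(np.random.rand(60, 500))          # samples x genes, non-negative
y = np.random.randint(0, 2, size=60)         # tumor subtype labels

# Meta-samples = NMF projections of the samples; classify them with an SVM.
clf = make_pipeline(NMF(n_components=5, max_iter=500, random_state=0),
                    SVC(kernel="rbf"))
clf.fit(X, y)
print(clf.score(X, y))
```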
https://doi.org/10.1142/9781848163324_0018
Circadian rhythms of living organisms are 24-hour oscillations found in behavior, biochemistry and physiology. Under constant conditions, the rhythms continue with their intrinsic period length, which is rarely exactly 24 hours. In this paper, we examine the effects of light on the phase of the gene expression rhythms derived from the interacting feedback network of a few clock genes, taking advantage of computer simulation with Cell Illustrator. The simulation results suggest that the interacting circadian feedback network at the molecular level is essential for the phase dependence of light effects observed in mammalian behavior. Furthermore, the simulation reproduced the biological observation that the range of entrainment to light-dark cycles shorter or longer than 24 hours is limited, centering around 24 hours. Application of our model to inter-time-zone flight successfully demonstrated that 6 to 7 days are required to recover from jet lag when traveling from Tokyo to New York.
https://doi.org/10.1142/9781848163324_0019
Since the sequencing of the mouse and human genomes, there has been a concerted effort to define their complete transcriptional output. EST and full-length cDNA sequencing, together with transcriptome annotation efforts by FANTOM, ENCODE and other consortia, surveyed mammalian expression space, revealing that loci on average generate 6-10 transcripts. Alternative promoters, splicing and 3'UTRs are commonplace.
While these data have provided an excellent atlas of what can be generated from mammalian genomes, we have not had, until recently, the right genomic tools to place this transcriptional complexity into a biological context. Array based profiling has been an excellent tool for assessing overall gene activity, but lacks the sensitivity and resolution required to study complete transcriptome content.
RNA sequencing (RNAseq) has recently been demonstrated in several eukaryotic species and is redefining our understanding of mRNA transcriptome content and mRNA dynamics, all at single-nucleotide resolution. We have developed methods for performing multi-gigabase shotgun sequencing of human and mouse transcriptomes, developed approaches to assess locus activity, and demonstrated improved sensitivity relative to the current "gold standard" array platforms. We also use RNAseq to assess the expression levels of variant transcripts via diagnostic sequences. Thirdly, we are able to perform genome-wide transcriptome discovery. Finally, we have established approaches to identify alterations to the reference sequence content, allowing us to search for expressed polymorphisms, mutations or events such as RNA editing.
These data are combined with RNAseq surveys of other fractions of the transcriptome (i.e. small RNA and polysome-associated RNAs) to gain a fuller picture of coding and functional RNA content. This is being used to define, at unprecedented resolution, the transcriptional networks driving specific biological states.
https://doi.org/10.1142/9781848163324_0020
Dynamic programming [1] has full sensitivity but is too slow for large-scale homology search. FASTA/BLAST-type heuristics [2] trade sensitivity for speed. Can we have both sensitivity and speed?
We present the mathematical theory of optimized spaced seeds, which allows modern homology search to achieve high sensitivity and high speed simultaneously. The spaced seed methodology is implemented in our PatternHunter software [3, 4], as well as in many other modern homology search tools, serving thousands of queries daily.
The theory is then extended and implemented in ZOOM [5] to perform fast genome-scale read mapping for second-generation sequencers.
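The core idea of a spaced seed is small enough to show directly: match the positions marked '1', ignore those marked '0'. The toy scan below checks one ungapped diagonal of two sequences; real search software instead indexes seed keys over all position pairs:

```python
def seed_hits(seed, s1, s2):
    """Count offsets where s1 and s2 agree at every '1' of the seed.
    PatternHunter's optimal weight-11 seed is '111010010100110111'; at equal
    weight, spaced seeds hit true alignments with higher probability than a
    consecutive seed like '11111111111', the theory's central observation."""
    care = [i for i, c in enumerate(seed) if c == "1"]
    n = min(len(s1), len(s2)) - len(seed) + 1
    return sum(all(s1[i + j] == s2[i + j] for j in care) for i in range(max(n, 0)))

print(seed_hits("1110101", "ACGTACGTACGT", "ACGAACGTACGT"))
```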
https://doi.org/10.1142/9781848163324_0021
MicroRNAs are short endogenous non-coding transcripts that regulate their target mRNAs by translational inhibition or mRNA degradation. Recent microRNA transfection experiments show strong evidence that microRNAs influence not only their target but also non-target genes, but how the regulatory signals are transduced from microRNAs to downstream genes remains to be elucidated. We suspect that primary and secondary regulatory mechanisms, initially triggered by microRNAs, form refined local networks in the cell. In light of this hypothesis, a comprehensive strategy was developed to reconstruct combinatorial networks of primary and secondary microRNA regulatory cascades, using microRNA target and non-target gene expression profiles together with information on microRNA-regulated transcription factors (TFs) and TF-regulated genes. This strategy was then applied to 53 microRNA transfection expression datasets and led to the discovery of combinatorial regulatory networks triggered by 20 microRNAs. Many of these networks were enriched with genes whose functional roles are consistent with known regulatory roles of microRNAs. More importantly, a tumor-related regulatory network and related pathways were discovered, in which novel discoveries were integrated with existing knowledge on the regulatory mechanisms of four microRNAs. In the network, by activating the mir-34 family, the tumor suppressor gene p53 can inhibit five target oncogenes, four of which have never been reported. Our approach was carried out on a sizeable number of public microRNA transfection experiment datasets, enabling a global view of the combinatorial regulatory networks triggered by microRNAs. By reconstructing microRNA-triggered combinatorial regulatory networks, this work helps identify the true degradation targets of mammalian microRNAs and, more importantly, aids the fundamental understanding of microRNA-related biological processes.
https://doi.org/10.1142/9781848163324_0022
Common human diseases and drug response are complex traits that involve entire networks of changes at the molecular level, driven by genetic and environmental perturbations. Efforts to elucidate disease and drug response traits have focused on single dimensions of the system. Studies focused on identifying changes in DNA that correlate with changes in disease or drug response traits, changes in gene expression that correlate with disease or drug response traits, or changes in other molecular traits (e.g., metabolite levels, methylation status, protein phosphorylation status, and so on) that correlate with disease or drug response are fairly routine and have met with great success in many cases. However, to further our understanding of the complex network of molecular and cellular changes that impact disease risk, disease progression, severity, and drug response, these multiple dimensions must be considered together. Here I present an approach for integrating a diversity of molecular and clinical trait data to uncover models that predict complex system behavior. By integrating diverse types of data on a large scale, I demonstrate that some forms of common human disease are most likely the result of perturbations to specific gene networks that in turn cause changes in the states of other gene networks, both within and between tissues, that drive biological processes associated with disease. These models elucidate not only primary drivers of disease and drug response, but also provide a context within which to interpret biological function, beyond what could be achieved by looking at one dimension alone. That some forms of common human disease result from complex interactions among networks has significant implications for drug discovery: the aim becomes designing drugs or drug combinations that impact entire network states rather than drugs that target specific disease-associated genes.
https://doi.org/10.1142/9781848163324_0023
The first bacterial genome was sequenced in 1995, and the first archaeal genome in 1996. Soon after these breakthroughs, an exponential rate of genome sequencing was established, with a doubling time of approximately 18 months for bacteria and approximately 34 months for archaea. Comparative analysis of the hundreds of sequenced bacterial and dozens of archaeal genomes leads to several generalizations on the principles of genome organization and evolution. A crucial finding that enables functional characterization of the sequenced genomes and evolutionary reconstruction is that the majority of archaeal and bacterial genes have conserved orthologs in other, often, distant organisms. However, comparative genomics also shows that horizontal gene transfer (HGT) is a dominant force of prokaryotic evolution, along with the loss of genetic material resulting in genome streamlining. A crucial component of the prokaryotic world is the mobilome, the enormous collection of viruses, plasmids and other selfish elements which are in constant exchange with more stable chromosomes and serve as HGT vehicles. Thus, the prokaryotic genome space is a tightly connected, although compartmentalized, network, a new notion that undermines the "Tree of Life" model of evolution and requires a new conceptual framework and tools for the study of prokaryotic evolution.
https://doi.org/10.1142/9781848163324_0024
It appears that the genetic programming of mammals and other complex organisms has been misunderstood for the past 50 years, because of the assumption – largely true in prokaryotes, but not in complex eukaryotes – that most genetic information is transacted by proteins. The numbers of protein-coding genes do not change appreciably across the metazoa, whereas the relative proportion of non-protein-coding sequences increases markedly. Moreover, while only a tiny fraction encodes proteins, it is now evident that the majority of the mammalian genome is transcribed in a developmentally regulated manner, and that most complex genetic phenomena in eukaryotes are RNA-directed. Evidence will be presented that (i) regulatory information scales quadratically with functional complexity and hence the majority of the genomes of the higher organisms comprises regulatory information; (ii) there are thousands of non-protein-coding transcripts in mammals that are dynamically expressed during differentiation and development, including in embryonal stem cell and neuronal cell differentiation, and T-cell and macrophage activation, among others, many of which show precise expression patterns and subcellular localization in the brain; (iii) many 3'UTRs are not only linked to but are also expressed in a regulated manner separately from their associated protein-coding sequences to transmit genetic information in trans; (iv) there are large numbers of small RNAs, including new classes, expressed from the human and mouse genomes, that may be discerned from bioinformatic analysis of genomic and deep sequencing transcriptomic datasets; and (v) much, if not most, of the mammalian genome may not be evolving neutrally, but rather is composed of different types of sequences (including transposon-derived sequences) that are evolving at different rates under different selection pressures and different structure-function constraints. There is also genome-wide evidence of editing of noncoding RNA sequences, especially in the brain and especially in humans (Alu elements), which may constitute a key part of the molecular basis of memory and cognition. Taken together, these and other observations suggest that the majority of the human genome is devoted to a very sophisticated RNA regulatory system that directs developmental trajectories and mediates gene-environment interactions via the control of chromatin architecture and epigenetic memory, transcription, splicing, RNA modification and editing, mRNA translation and RNA stability.
https://doi.org/10.1142/9781848163324_bmatter
Author Index