Since the beginning of the genome project, the necessary involvement of scientists of widely divergent backgrounds has been evident. The proper handling, analysis, and dissemination of information, as well as the control of and data gathering from automated processes, are areas where computers are directly involved. Thus computers are intimately tied into the production and analysis of biological data. However, many challenges lie ahead.
This volume is a collection of selected oral and poster presentations given at The Second International Conference on Bioinformatics, Supercomputing and Complex Genome Analysis, organized to address some of these challenges. The topics include the current status and future prospects of genome mapping and sequencing, complex genome analysis, linguistic and neural network approaches, database issues, and computer tools in the genome project. The volume will be ideal for students, newcomers, young researchers and experts alike, whether computationally or experimentally oriented.
Keynote Speakers: C L Smith, D Grothues, T Ito, T Sano, D Wang, Y-W Zhu, C R Cantor & R J Robbins.
https://doi.org/10.1142/9789814503655_fmatter
The following sections are included:
https://doi.org/10.1142/9789814503655_0001
At some point, analysis of the data will certainly become the rate-limiting step in the Human Genome Project. Our goal is to get to that point as quickly as possible. The purpose is not to make life hard. What we are interested in doing is developing methods that will provide mapping and sequencing information ever more rapidly. This presentation is mostly a progress report on a couple of new tricks that we are trying to develop that could, in principle, speed mapping and sequencing tremendously, possibly by factors of 10³.
https://doi.org/10.1142/9789814503655_0002
Informatics of some kind will play a role in every aspect of the Human Genome Project (HGP): data acquisition, data analysis, data exchange, data publication, and data visualization. What are the real requirements and challenges? The primary requirement is clear thinking and the main challenge is design. If good design is lacking, the price will be failure of genome informatics and ultimately failure of the genome project itself. We need good designs to deliver the tools necessary for acquiring and analyzing DNA sequences. As these tools become more efficient, we will need new tools for comparative genomic analyses. To make the tools work, we will need to address and solve nomenclature issues that are essential, if also tedious. We must devise systems that will scale gracefully with the increasing flow of data. We must be able to move data easily from one system to another, with no loss of content. As scientists, we will have failed in our responsibility to share results, should repeating experiments ever become preferable to searching the literature. Our databases must become a new kind of scientific literature and we must develop ways to make electronic data publishing as routine as traditional journal publishing. Ultimately, we must build systems so advanced that they are virtually invisible. In summary, the HGP can be considered the most ambitious, most audacious information-management project ever undertaken. In the HGP, computers will not merely serve as tools for cataloging existing knowledge. Rather, they will serve as instruments, helping to create new knowledge by changing the way we see the biological world. Computers will allow us to see genomes, just as radio telescopes let us see quasars and electron microscopes let us see viruses.
https://doi.org/10.1142/9789814503655_0003
The theme of this section of the “Second International Conference on Bioinformatics, Supercomputing, and Complex Genome Analysis” is linguistic approaches to understanding the meaning of DNA. Webster's New World Dictionary defines linguistics as: ‘(1) the science of language, including phonology, morphology, syntax, and semantics; often “general linguistics”; usually subdivided into descriptive, historical, comparative, and geographical linguistics; (2) the study of the structure, development, etc., of a particular language and of its relationship to other languages, as the English language.’ When the international attendees of this conference were queried to find out how many spoke German, Russian, French, etc., the expected response was a serious show of hands. But, when the same group was asked how many “spoke” DNA, there was the predictable response of laughter.
https://doi.org/10.1142/9789814503655_0004
Many biological questions require reasoning with structure and function. We have begun by developing representations for structures which can be extended in a symmetric manner to functions. Each structure is represented as an arrangement of functional groups in a layered grammar. Expressing the function of these groups will then be equivalent to forming the antigerund of a noun. We illustrate these points with a biochemical example.
https://doi.org/10.1142/9789814503655_0005
To sequence all the genes in human DNA and analyze their functions, the Human Genome Project has been collecting a large body of data in data banks such as GenBank©. One of the important issues for computational biologists is the availability of tools which can provide easy and efficient access to these data banks. In this paper, we give a brief overview of different approaches to the design of a relational database interface. The paper concentrates on the design of GenEng,1 a dialogue-based natural language interface for information retrieval from the GenBank relational database.2
https://doi.org/10.1142/9789814503655_0006
The overlapping word paradox, known in combinatorics for 20 years, is to this day disregarded in many papers on DNA statistics. We consider Conway's equation for the best bet for simpletons as an example of the overlapping word paradox. We give a new short proof of Conway's equation and discuss the implications of the overlapping word paradox for DNA statistics. In particular, we demonstrate that ignoring the overlapping word paradox in DNA statistics can easily lead to 500% errors in estimates of statistical significance. We also present formulas allowing one to find ‘anomalous’ words in DNA texts.
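As an illustration of the overlap effect behind Conway's equation, here is a minimal Python sketch (not from the chapter) that computes the leading numbers and the resulting odds in the "best bet for simpletons", assuming independent, equiprobable letters; the function names are ours.

```python
# A minimal sketch (not from the chapter) of Conway's leading numbers and the
# odds they give in Penney's game, assuming independent, equiprobable letters.
def leading_number(x, y, q=2):
    """Sum of q**(k-1) over every k where the length-k suffix of x equals
    the length-k prefix of y (the overlap correlation of x against y)."""
    return sum(q ** (k - 1)
               for k in range(1, min(len(x), len(y)) + 1)
               if x[-k:] == y[:k])

def odds_second_beats_first(a, b, q=2):
    """Conway's equation: the odds that pattern b occurs before pattern a in a
    random stream are (AA - AB) : (BB - BA), written with leading numbers."""
    return (leading_number(a, a, q) - leading_number(a, b, q),
            leading_number(b, b, q) - leading_number(b, a, q))

if __name__ == "__main__":
    # Classic coin-flipping example: THH beats HHH with odds 7:1.
    print(odds_second_beats_first("HHH", "THH"))          # -> (7, 1)
    # The same overlap structure biases word statistics in DNA (q = 4 bases).
    print(odds_second_beats_first("AATT", "TTAA", q=4))   # -> (59, 59), an even race
```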
https://doi.org/10.1142/9789814503655_0007
The following sections are included:
https://doi.org/10.1142/9789814503655_0008
We review both theoretical and practical results of a linguistic approach to studying the structure of features of DNA sequences. Using generative grammars, complex assemblages can not only be described and analyzed abstractly, but also concretely, such that features can be searched for by a general-purpose parser. Our parser, called GENLANG, uses an extended logic grammar formalism and has found features as complex as tRNA genes, group I introns, and protein-encoding genes, within input sequences on a genomic scale.
https://doi.org/10.1142/9789814503655_0009
The classical triplet code is not the only code carried by the sequences. They contain, for example, the gene-splicing code, transcription codes and many other codes. By analyzing a large volume of the available nucleotide sequences, i.e., by performing various computer experiments with the sequences, one can decipher them and extract from them valuable biological information. At the DNA level there are at least two more codes — the DNA shape code and the chromatin code. The overall DNA shape is sequence-dependent and can be described by a set of angles characteristic of the various dinucleotide elements — the codons of the DNA shape code. The chromatin code provides instructions to histone octamers on where along the DNA to form the nucleosomes. This code is expressed as a positional periodicity of, primarily, AA and TT dinucleotides. A new RNA code has been described — the translation framing code. The frame seems to be maintained by a synchronizing pattern GCUGCUGCU… hidden in mRNA. Most enigmatic of all is, perhaps, the gene-splicing code. An interesting recent development indicates that the gene-splicing pattern in the sequences and the nucleosomal pattern have some common features. This has to do with the superposition of patterns that is characteristic of the sequence language in general, which carries many codes simultaneously in one and the same text. This results in an increased complexity of the sequences. Analysis of protein-coding sequence complexity in eukaryotes and in prokaryotes revealed that the former are simpler. This is interpreted as the result of a spatial separation of the triplet code (carried by exons) and the chromatin code (carried by introns). Perhaps the necessity of separating otherwise conflicting codes is one of the reasons why the intervening sequences were introduced at all. The nucleotide sequences are written in an unbroken manner. One way to detect “words” in such a continuous text is to evaluate the degree of internal correlation by calculating contrast values for the words. This technique allows one to derive vocabularies which are species- and function-specific. The nucleotide sequences thus carry numerous superimposed messages. We understand only a few of these messages, while many more are waiting for their turn to be deciphered.
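To make the chromatin-code signal concrete, the following Python sketch (ours, not the chapter's) tabulates the distances between AA dinucleotide occurrences; in real nucleosomal DNA such a spectrum shows peaks recurring at roughly 10-11 bp. The input sequence below is a placeholder.

```python
# A sketch (ours, not the chapter's) of probing the chromatin code: histogram
# the distances between AA dinucleotide starts; in nucleosomal DNA the counts
# peak at multiples of roughly 10.4 bp. The input sequence is a placeholder.
from collections import Counter

def dinucleotide_positions(seq, dinuc):
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == dinuc]

def distance_spectrum(seq, dinuc="AA", max_dist=50):
    """Count ordered pairs of dinucleotide occurrences at each separation."""
    pos = dinucleotide_positions(seq.upper(), dinuc)
    spectrum = Counter()
    for i, p in enumerate(pos):
        for q in pos[i + 1:]:
            if q - p > max_dist:
                break
            spectrum[q - p] += 1
    return spectrum

if __name__ == "__main__":
    seq = "AATTGCAGCTAAGTCGAATTCGAA" * 40   # placeholder, not real chromatin DNA
    spec = distance_spectrum(seq, "AA")
    for d in range(2, 25):
        print(d, spec.get(d, 0))
```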
https://doi.org/10.1142/9789814503655_0010
A major goal of our laboratory is to identify as completely as possible the expressed gene complement of the human brain. Toward this end, we have collected partial sequence data (expressed sequence tags, ESTs) from over 6000 human brain cDNA clones using automated fluorescence-based sequencers. Over 80% of these sequences represent genes not previously described in humans and up to one third may represent coding sequences with no matches in the existing databases. Database searches have identified several hundred clones with significant similarity to known genes, including novel genes similar to Notch/Tan-1 and the Drosophila discs-large tumor suppressor, a new neurotransmitter transporter gene, a new member of the multidrug resistance gene family, new members of the Ca++−ATPase, ADP-ribosylation factor, alpha-actinin, and neural cell adhesion molecule gene families, and at least seven new C2-H2 zinc finger proteins. Various brain libraries contain an excellent diversity of clones; with judicious screening of a relatively small number of highly and moderately represented sequences, large numbers of different genes can be tagged. Thus, normalization is not required at this stage to reduce the redundancy of sequencing. We have constructed a relational EST database for storage and integration of cDNA sequence analysis data (see abstract by Fields et al.). EST sequence data and cDNA clones are available to researchers through GenBank and the American Type Culture Collection.
https://doi.org/10.1142/9789814503655_0011
A variant of sequencing by hybridization (SBH) is under development with a potential to inexpensively determine up to 100 million base pairs per year. The method comprises five experimental steps: 1) arraying short clones in 864-well plates; 2) growth of the M13 clones or PCR of the inserts; 3) automated spotting of DNAs by corresponding pin-arrays; 4) hybridization of dotted samples with 200-3000 32P- or 33P-labeled 6- to 8-mer probes; and 5) scoring hybridization signals using storage phosphor plates. The method opens up intriguing possibilities for genome analysis. Some 200 7- to 8-mers can provide an inventory of the genes if cDNA clones are hybridized, or can define the order of 2-kb genomic clones, creating physical and structural maps with 100-bp resolution; the distribution of G+C, LINEs, SINEs, and gene families would be revealed. cDNAs that represent new genes and genomic clones in regions of interest selected by SBH can be sequenced by a gel method. Uniformly distributed clones from the previous step (20% of all) will be hybridized with 2000–3000 6- to 8-mers. As a result, approximately 50-60% of the genomic regions containing members of large repetitive and gene families and those families represented in GenBank would be completely sequenced. In the less redundant regions, every base pair is expected to be read with 3-4 probes, but the complete sequence can not be reconstructed. Such partial sequences allow the inference of similarity and the recognition of coding, regulatory, and repetitive sequences, as well as study of the evolutionary processes all the way up to the species delineation. Targeted gel sequencing with up to 10% error can be effectively used to complete sequences of genomic segments more than 70% similar to the treated ones. A 1000-bp read from a single gel strip would be sufficient to complete several thousand base pairs of sequence generated by duplications or present in genomes of closely related species. More interestingly, partial sequences generated with the same probes on three to four 70– to 90%-similar genomes may mutually complete each other, minimizing additional experimental data. This SBH variant can effectively fill a gap between expensive 300- to 600-bp runs on sequencing gels and the 10- to 100-kb resolution of the presently available mapping techniques.
https://doi.org/10.1142/9789814503655_0012
This paper describes the mathematical and computational techniques used at Généthon to obtain within one year a YAC contig map covering more than 50% of the whole human genome. The fingerprinting approach used has already yielded more than 1,000 contigs totalling more than 3,600 clones, after 16,896 clones from the CEPH YAC library have been analysed. The resulting map will be a powerful tool for the identification of unknown genes, particularly those responsible for genetic diseases.
https://doi.org/10.1142/9789814503655_0013
Gene mapping assigns chromosomal coordinates to genetic loci based on analysis of fragmentary ordering and metric data. In assembling genetic maps, geneticists use rules of inference to derive new facts about order and distance between loci from experimentally derived conclusions about order and distance. They construct comprehensive maps by merging related sets of data and resolving conflicts between them. In this article we describe software which formalizes and automates some of these rules of inference to yield a useful map construction utility called CPROP.
https://doi.org/10.1142/9789814503655_0014
This article presents an application of the simulated annealing algorithm used at Généthon for the STS-content map of chromosome 21. This algorithm is part of an integrated system which starts from PCR gel analysis and produces ordered contigs that can be handled with a graphical user interface. For this project, 250 STSs have been used to screen a 14-genome-equivalent YAC library. The result is a map of the whole long arm of chromosome 21 (21q). This map contains 210 STSs and 770 YACs and covers a 45 megabase region with an average resolution of 1 STS per 220 kb.1 The order obtained by simulated annealing is consistent both with genetic data and with other methods of physical mapping (RCRF1, Alu-PCR). This map will be a powerful tool for gene analysis, especially in the study of Down syndrome and Alzheimer's disease.
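A minimal sketch of the underlying optimization, assuming hypothetical clone-by-STS incidence data rather than the Généthon pipeline: simulated annealing searches for an STS order that minimizes breaks in each YAC's run of positive hits (the consecutive-ones criterion).

```python
# A minimal sketch (not the Généthon implementation) of ordering STS markers
# by simulated annealing. Data below are hypothetical.
import math
import random

def cost(order, clones):
    """Total number of internal gaps across clones for a given STS order."""
    pos = {sts: i for i, sts in enumerate(order)}
    total = 0
    for hits in clones:
        idx = sorted(pos[s] for s in hits)
        # each jump > 1 between consecutive hit positions is a gap
        total += sum(1 for a, b in zip(idx, idx[1:]) if b - a > 1)
    return total

def anneal(stss, clones, steps=20000, t0=2.0, alpha=0.9995, seed=0):
    rng = random.Random(seed)
    order = list(stss)
    rng.shuffle(order)
    best = cur = cost(order, clones)
    best_order, t = list(order), t0
    for _ in range(steps):
        i, j = sorted(rng.sample(range(len(order)), 2))
        order[i:j + 1] = reversed(order[i:j + 1])        # propose a segment reversal
        new = cost(order, clones)
        if new <= cur or rng.random() < math.exp((cur - new) / t):
            cur = new
            if cur < best:
                best, best_order = cur, list(order)
        else:
            order[i:j + 1] = reversed(order[i:j + 1])    # undo the move
        t *= alpha
    return best_order, best

if __name__ == "__main__":
    stss = list("ABCDEFGH")                               # hypothetical markers
    clones = [set("ABC"), set("BCD"), set("DEF"), set("FGH"), set("EFG")]
    order, gaps = anneal(stss, clones)
    print("order:", "".join(order), "gaps:", gaps)
```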
https://doi.org/10.1142/9789814503655_0015
The following sections are included:
https://doi.org/10.1142/9789814503655_0016
In order to improve our ability to simulate the complex behavior of polymers, we have introduced a new “two-space” algorithm that is well suited to parallel processing. This algorithm can simulate the dynamics of abstract polymers and is suitable for modeling a variety of polymeric systems including dilute, dense, confined and grafted polymers. A medium such as a model of a gel may be implemented through the initial conditions without change in the algorithm. Using the two-space algorithm, the microscopic behavior of DNA during electrophoretic separation may be simulated in the diffusive regime. The diffusive motion of long polymers in a medium is simulated using microscopic Monte-Carlo steps. We describe preliminary simulations of polymers migrating under an external field through a random medium of obstacles in 2-dimensions. Two sequences of simulations are performed with different obstacle densities corresponding to pore sizes larger and smaller than the polymer radius of gyration. In the dilute medium, polymers are characteristically draped on single obstacles. In the denser medium, draping across multiple obstacles results in reduced orientation in the field direction. Simulations of 90° field direction switching at different rates demonstrate the reorientation time and the influence of field pulse duration. The preliminary simulations were performed on a Cellular Automaton Machine, CAM-6, and further investigations are to be performed on the newly completed CAM-8, as well as other massively parallel computers.
https://doi.org/10.1142/9789814503655_0017
We are presently sequencing the entire genome of Mycoplasma capricolum, one of the smallest free-living organisms, by a Multiplex Genomic Walking strategy. This technique involves the repetitive hybridization of sequencing membranes with oligonucleotide probes to acquire sequence data in discrete steps along the genome. The technique allows one to walk a genome in a directed manner, eliminating the problems associated with random shotgun assembly. Furthermore, the repetitive stripping and hybridization process is relatively simple to reproduce and has the potential to be easily automated. The Genetic Data Environment (GDE), an X Windows-based graphical user interface, has allowed the seamless integration of a core multiple sequence editor with pre-existing external sequence analysis programs and internally developed programs into a single prototypic environment. This system has facilitated linkage of the Harvard Genome Lab's internal database and automated data control systems into one graphical user interface which can handle the archiving and analysis of both random fluorescent sequencing data and genomic walking data from the Mycoplasma project. Finally, it has facilitated the integration of the genomic sequence data into a PROLOG database environment for the comparative analysis of Mycoplasma capricolum and other organisms.
https://doi.org/10.1142/9789814503655_0018
The human genome project has greatly stimulated the advancement of techniques to sequence large fragments of DNA. The development of improved molecular methods has also simplified the process of comparing shorter, homologous DNA sequences from different individuals and species. This process of ‘re-sequencing’ DNA has applications in medical genetics, in evolutionary studies, and for the identification of complex molecular variation that may explain multifactorial traits. Intrinsic differences in the processes of ‘sequencing’ and ‘re-sequencing’ suggest new requirements for data management tools. A data management scheme for a ‘re-sequencing’ project is demonstrated using the Virtual Notebook System, a flexible multi-user tool designed as a metaphor of the laboratory notebook.
https://doi.org/10.1142/9789814503655_0019
The success of an effort to produce an overlapping set of clones covering all of a complex genome depends, to a large degree, on the efficiency of detecting overlaps. Two kinds of approach exist for this purpose. The first involves a quantitative and statistical matching of restriction fragments, where the more extensive the overlap or the better the accuracy of fragment sizing, the more confident is the assignment. The second approach is more qualitative in nature, involving the use of distinct, unique reference points which, when found to be shared between clones, demonstrate an overlap between them. My colleagues and I have developed such a landmark-based overlap detection strategy, and we have used it to complete a map of the 4.1 Mbp genome of the archaeon Haloferax volcanii. A second map, that of Halobacterium sp. GRB (2 Mbp), is nearly complete. In mapping genomes of this size using our method, there is but a minimal need for computer assistance. For larger, more complex genomes, the method is amenable to less tedious, more computer-intensive analysis. With the business of detecting overlaps simplified, finding solutions to the bigger problems in mapping, such as cloning the unclonable, can now occupy the bulk of one's effort.
https://doi.org/10.1142/9789814503655_0020
Protein coding and non-coding regions of the DNA primary structure can be represented by non-homogeneous and homogeneous Markov chain models, respectively. These models can be employed by an algorithm predicting gene locations in a newly sequenced DNA. The key notion of this algorithm is the a posteriori probability of protein coding function for a given fragment of DNA sequence. We use Markov chain models from the first through the fifth order for the calculation of this probability. The parameters of the non-homogeneous and homogeneous Markov chain models have been derived from a training set of 479,589 bp of coding and 245,307 bp of non-coding prokaryotic (E. coli) DNA sequences. The predictive accuracy of the method has been determined on a control set of 373,845 bp of coding and 131,538 bp of non-coding E. coli DNA sequences. For instance, the version of the algorithm that employs fourth order Markov chain models gives a 10.0% false negative rate and a 25.2% false positive rate when coding function is identified for fragments 96 bp in length.
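The scoring idea can be sketched as follows; this is our simplified illustration, not the authors' code, and it uses homogeneous models for both classes (the chapter's coding model is non-homogeneous, i.e., phase-dependent). Training strings are placeholders.

```python
# A simplified illustration (not the authors' code) of the Bayesian scoring:
# k-th order Markov models trained on coding and non-coding DNA are combined
# by Bayes' rule into a posterior probability of coding function.
from collections import defaultdict
from math import exp, log

def train_markov(seqs, k):
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        s = s.upper()
        for i in range(k, len(s)):
            counts[s[i - k:i]][s[i]] += 1
    model = {}
    for ctx, nxt in counts.items():
        total = sum(nxt.values()) + 4.0                      # add-one smoothing
        model[ctx] = {b: (nxt.get(b, 0.0) + 1.0) / total for b in "ACGT"}
    return model

def log_likelihood(seq, model, k):
    return sum(log(model.get(seq[i - k:i], {}).get(seq[i], 0.25))
               for i in range(k, len(seq)))                  # back off to uniform

def posterior_coding(frag, coding_model, noncoding_model, k, prior=0.5):
    lc = log_likelihood(frag.upper(), coding_model, k)
    ln = log_likelihood(frag.upper(), noncoding_model, k)
    return 1.0 / (1.0 + (1.0 - prior) / prior * exp(ln - lc))

if __name__ == "__main__":
    coding = ["ATGGCTGCTAAAGCTGCTGGTGCT" * 5]                 # placeholder training data
    noncoding = ["TTATATAAATTTATATTAATATAT" * 5]
    mc, mn = train_markov(coding, 2), train_markov(noncoding, 2)
    print(round(posterior_coding("GCTGCTAAAGCTGCTGGT", mc, mn, 2), 3))
```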
https://doi.org/10.1142/9789814503655_0021
A massively parallel computing system is one tool that has been adopted by researchers in the Human Genome Project. This tool is one of many in a toolbox of theories, algorithms, and systems that are used to attack the many questions posed by the project. A good tool functions well when applied alone to the problem for which it was devised. A superior tool achieves its solitary goal, and supports and interacts with other tools to achieve goals beyond the scope of any individual tool. It is our thesis that Intel's massively parallel Paragon™ XP/S system is a superior tool. This paper presents specific requirements for a superior computing tool for the Human Genome Project (HGP) and shows how the Paragon system addresses these requirements.
Computing requirements for HGP are based on three factors:
1. computing requirements of algorithms currently used in sequence homology, protein folding, and database insertion/retrieval;
2. estimates of the computing requirements of new applications arising from evolving biological theories; and
3. the requirements for facilities that support collaboration among scientists in a project of this magnitude.
The Paragon system provides many hardware and software features that effectively address these requirements. These features include high-performance RISC processors for compute-intensive applications, a scalable architecture for increasing problem sizes, flexible resource management for ease of use and machine sharing, adherence to communication standards, and remote access facilities for interactive and collaborative work with applications and researchers at remote facilities.
The power of these system features is exemplified by the results of applications in sequence alignment, sequence analysis, and molecular phylogeny, developed by computer scientists at Argonne National Laboratory and biologists at the University of Illinois, and executed on the massively parallel Intel Touchstone Delta system, a one-of-a-kind prototype of the Paragon XP/S system.
https://doi.org/10.1142/9789814503655_0022
There is a growing need for a well-designed database system for searching and analyzing nucleotide sequence data. We developed the Overlapping Oligonucleotide Database for Signal Sequence Search (ODS), the first relational database that integrates information on biological features into the search for signal sequences. Furthermore, based on it, a deductive database system to search and analyze nucleotide sequence data was developed. A deductive database system is a next-generation database system that contains an inference system. Database queries in ODS are described in both SQL and logical rules. These queries are simple even for molecular biologists who are not expert computer programmers. In particular, queries based on logical rules are declarative and more powerful than those of SQL. Recursive rules are suitable for examining secondary structures of nucleotide sequences. In our analysis of TfR's IRE, we noted five stem-and-loop structures.
https://doi.org/10.1142/9789814503655_0023
Physical map assembly typically begins with a number of pairwise relationships between clones, and from these produces an overall arrangement of the clones. When there are only a few clones, an investigator can keep in mind all of the relevant data, and can weigh the evidence to produce a map that fits all the experimental results reasonably well. Today, however, it is common to build maps with thousands of clones and millions of pairwise relationships. Computer aided map assembly is thus required. Current computer algorithms typically use only a small fraction of available experimental results, and sometimes fail to deal adequately with inconsistency in the data. The assembly problem is here framed as optimizing a map to fit all the experimental data, and a genetic algorithm to search for optimal maps is described (in genetic algorithms, possible solutions to a problem are treated as individuals in an evolving population). The method has been used to construct or improve ordered clone maps for large parts of human chromosome 16.
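A minimal sketch of the genetic-algorithm idea, with hypothetical clones and overlap weights (not the authors' implementation): individuals are clone orders, and fitness rewards placing strongly supported overlaps adjacently.

```python
# A minimal sketch (not the authors' system) of a genetic algorithm for map
# assembly. Overlap weights below are hypothetical.
import random

def fitness(order, weights):
    """Sum of pairwise overlap evidence between adjacent clones."""
    return sum(weights.get(frozenset((a, b)), 0.0) for a, b in zip(order, order[1:]))

def order_crossover(p1, p2, rng):
    """Copy a slice from one parent, fill the rest in the other parent's order."""
    i, j = sorted(rng.sample(range(len(p1)), 2))
    child = [None] * len(p1)
    child[i:j] = p1[i:j]
    rest = [c for c in p2 if c not in child]
    for k in range(len(child)):
        if child[k] is None:
            child[k] = rest.pop(0)
    return child

def evolve(clones, weights, pop_size=60, generations=300, seed=0):
    rng = random.Random(seed)
    pop = [rng.sample(clones, len(clones)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda ind: fitness(ind, weights), reverse=True)
        next_pop = pop[:10]                                   # elitism
        while len(next_pop) < pop_size:
            a, b = rng.sample(pop[:30], 2)                    # parents from the fitter half
            child = order_crossover(a, b, rng)
            if rng.random() < 0.3:                            # mutation: swap two clones
                x, y = rng.sample(range(len(child)), 2)
                child[x], child[y] = child[y], child[x]
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=lambda ind: fitness(ind, weights))

if __name__ == "__main__":
    clones = list("ABCDEF")
    weights = {frozenset(p): w for p, w in
               [("AB", 5), ("BC", 4), ("CD", 5), ("DE", 3), ("EF", 4), ("AC", 1)]}
    print(evolve(clones, weights))
```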
https://doi.org/10.1142/9789814503655_0024
We are developing a computer representation of pathways of intermediary metabolism in cells. We have established a database of enzymes and associated information including reactants, cofactors, and subcellular reactant space. This information is maintained under dBASE, to which we have recently added a related database of enzyme inhibitors. The original programs were coded in dBASE III+. We have now developed a graphical user interface using MS-Windows and ‘C’ to traverse the metabolic network. The structural formulae of pathway substrates and of inhibitors are now displayed in a window, and associated information is displayed in other windows in the program. It is now possible to step through pathways, store trip logs, bring up ancillary information on each step, and maintain a ‘scratch pad’. We have also formulated a simple method to handle the myriad of cell types present in the biological kingdom: a single field database consisting of only the key field of the main database (Enzyme Commission number) serves as a mask to create a subset database. Future development includes more complete population of the databases, new cross search features for the information, and new graphical representations of layers of the metabolic pathways. The database representation will also be used as the underlying structure of quantitative metabolic models.
https://doi.org/10.1142/9789814503655_0025
One of the goals of any large scale DNA sequencing project is to understand the molecular details about the metabolic control sites that will be found in the sequence of the chromosome region being studied. In addition, once an interesting observation has been made, questions will quickly arise concerning the distribution of such sites within the genome and how well the same observations hold between related species.
This paper will discuss our approach toward building a flexible analysis environment that facilitates the analysis of genomic sequence data. The Integrated Genomic Database (IGD),5 developed by Ray Hagstrom, Ross Overbeek, Morgan Price and Dave Zawada at the Argonne National Laboratory, organizes genome mapping and sequencing data to provide a global chromosome view for multiple genomes. We describe here our use of the IGD system and how we employ it for relational analysis of sequence features that are found distributed throughout the genome under study. The primary goal of this work is to provide a system to support research on the global organization of genomic regulation patterns.
https://doi.org/10.1142/9789814503655_0026
Like many other domains of research and application, scientific research requires the use of databases and the support of database management systems. The data characteristics and access requirements of scientific databases are quite different from those of business databases. Clearly, existing database management systems, which were mainly developed for business applications, do not provide scientists with adequate modeling, processing, and analysis tools and capabilities to meet their database needs. This paper introduces an object-oriented knowledge base management technology which has a number of desirable features. First, an object-oriented semantic association model, OSAM*, provides general structural constructs to model complex objects and their various types of semantic associations. It also allows the user to define the behavioral properties of objects through user-defined operations and knowledge rules, which results in an active knowledge base management system (KBMS). Second, a pattern-based query language, OQL, allows complex search conditions and constraints to be easily specified. Third, a set of intelligent graphical interface tools greatly eases scientists' tasks in defining and querying complex knowledge bases. Fourth, the system can be extended to meet the changing requirements of applications by extending the modeling capabilities of the data model, and by modifying the structure of system components. Lastly, the efficiency of processing large knowledge bases is achieved by using a transputer-based multiprocessor system and some multi-wavefront parallel processing algorithms. A prototype KBMS with the above features has been developed which runs on IBM and SUN workstations.
https://doi.org/10.1142/9789814503655_0027
The following sections are included:
https://doi.org/10.1142/9789814503655_0028
A neural network classification method has been developed as an alternative approach to the large database search/organization problem. The system, termed Protein Classification Artificial Neural System (ProCANS), is implemented on a Cray Y-MP8/864 supercomputer for rapid superfamily classification of unknown proteins based on the information content of the neural interconnections. The system employs an n-gram hashing function for sequence encoding and modular back-propagation networks for classification. The system was developed with the first 2,724 entries in 690 superfamilies of the annotated PIR (Protein Identification Resource) protein sequence database. Three prediction sets were used to evaluate the system performance. The first consists of 651 annotated entries randomly chosen from the 690 superfamilies. The second set consists of 482 unclassified entries from the preliminary PIR database, whose superfamilies were identified by the fasta, blastp and sp database search methods. The third set is a subset of data set 2 with only superfamilies of more than 20 entries. At a low cut-off score of 0.01, the sensitivity is 92, 82 and 100%, respectively, for the three prediction sets. At a high cut-off score of 0.9, on the other hand, close to 100% specificity is achieved with a reduced sensitivity. The classification accuracy is determined by three factors: the degree of similarity, the sequence length, and the size of the superfamily. The classification on neural nets is fast (i.e., less than 0.5 Cray CPU second per sequence on a full-scale system). The speed would not be constrained by database size because the search time grows with the number of superfamilies, which is likely to remain low. Therefore, ProCANS can be used as a filter program to provide a reduced search space and speed up database searches. The rapid superfamily identification provided by ProCANS would be particularly valuable to the organization of protein sequence databases and to gene recognition in large sequencing projects. A current extension to ProCANS is the incorporation of motif information to further improve its sensitivity. The design concept has also been applied to the classification of nucleic acid sequences. A preliminary result showed a 96% accuracy for 16S ribosomal RNA classification. The software tool is generally applicable to any second generation databases that are organized according to family relationships.
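The n-gram encoding step can be illustrated with a short sketch (ours, not ProCANS): a protein sequence becomes a fixed-length vector of n-gram counts that can feed a back-propagation classifier.

```python
# A minimal sketch (not ProCANS itself) of n-gram sequence encoding: each
# sequence maps to a fixed-length count vector indexed by all n-grams over a
# chosen alphabet, suitable as input to a feed-forward classifier.
from itertools import product

def ngram_vector(seq, n=2, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    index = {"".join(g): i for i, g in enumerate(product(alphabet, repeat=n))}
    vec = [0.0] * len(index)
    seq = [c for c in seq.upper() if c in alphabet]
    for i in range(len(seq) - n + 1):
        vec[index["".join(seq[i:i + n])]] += 1.0
    total = max(sum(vec), 1.0)
    return [v / total for v in vec]          # normalize so length does not dominate

if __name__ == "__main__":
    v = ngram_vector("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")  # hypothetical fragment
    print(len(v), max(v))                    # 400 bigram features for 20 letters
```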
https://doi.org/10.1142/9789814503655_0029
We have used a large scale backpropagation neural network simulator, BigNet, running on Cray 2 and X-MP machines, to learn, recall and predict protein structures from sequence. In the present study, we extended previous work with a revised training/testing set (20 training, 4 testing) and a more detailed analysis of BigNet's operation. We describe an enhanced training development environment and new methods, including training data preprocessing and sequence shifting, that maximize generalization and minimize artifacts in distance matrices produced by the neural network. The results demonstrate improved learning and generalization performance relative to previous reports. The trained network produced good predictions of distance matrices when presented with novel sequence data from proteins homologous to proteins in the training set.
https://doi.org/10.1142/9789814503655_0030
Artificial neural networks have proven to be a useful technique for analyzing biological data and automatically producing accurate pattern recognizers. However, most applications of neural networks have not taken advantage of existing knowledge about the task at hand. This paper presents a method for using such problem-specific knowledge. The KBANN algorithm uses inference rules about the current biological problem, which need only be approximately correct, to initially configure a neural network. This network is then refined by analyzing sample examples and counterexamples of the concept being learned. An application of KBANN to the prediction of E. coli transcriptional promoters demonstrates its superiority to alternative techniques, taken from both the machine-learning and molecular biology literatures. In addition, since KBANN uses a human comprehensible “theory” of the current problem to define the initial neural network topology, it is possible to extract a refined set of inference rules following training. A refined theory for promoter recognition is presented; the extracted rules are roughly as accurate on novel data as the trained neural network from which they came.
https://doi.org/10.1142/9789814503655_0031
Spatial structures, characterized by invariant geometric feature vectors, can be learned by neural networks or other statistical methods. The protein folding problem is formulated as a learnable mapping from amino acid sequences to invariant geometric feature vectors which are calculated from atomic coordinates of the polypeptide backbone and side chains. This extends to tertiary structures directly the usual neural network method of a mapping from amino acid sequences to secondary structures: α-helix, β-strand, and random coil.
https://doi.org/10.1142/9789814503655_0032
Studies of homolog evolution and interpretation of mutational patterns are useful approaches for investigating the structural and functional information contained in sequences. To increase the accuracy and reliability of these approaches, a systematic comparative analysis of the evolutionary modes of sequence families is needed. The first step in such an analysis is the compilation of possible families from databases. The goal of this work is to develop algorithms and software for compiling homologous genes coding for proteins from the GenBank database. Stages in the database compilation are described, and the resulting database is used for studying evolutionary modes of gene families.
https://doi.org/10.1142/9789814503655_0033
An algorithm for contig joining which can be used in programs for primary analysis of the results of automatic DNA sequencing is described. It permits us to find the subset of contigs which can be joined into one sequence and then to reconstruct the consensus sequence, which provides the best choice of bases for the regions of contig overlap.
https://doi.org/10.1142/9789814503655_0034
A fast algorithm for multiple sequence alignment based on new approaches to tree construction and sequence comparison is suggested. We developed a version of the pairwise sequence alignment algorithm,1 which was based on analysis of DOT matrix Diagonal fragments (Df) followed by joining of significant Dfs in the final alignment. The algorithm maintains some methodological features of Needleman-Wunsch (NW) type algorithms and uses statistical estimations of similarity of various Dfs. The estimations were entered into a compact “competition matrix” (CM). Homology of sequence positions for multiple alignment was changed to homology of the corresponding rows in the aligned subsets. In addition, instead of one-iteration filling of the CM by Df information, a multi-iteration method was suggested. We assumed that the minimal length of Df used for each iteration must be selected so that the probability of occurrence of homologous subsequences of a given length, by chance, would be low. On the basis of these significant Dfs we reconstructed an initial rough alignment. In the next iteration, calculations were repeated for all gaps in the previous alignment, where no significant homology between aligned sets was fixed. For each such gap, its length was used for the estimation of a new minimal Df length. The method of multiple alignment presented here also uses a new approach for tree reconstruction based on the analysis of the relatively conserved oligonucleotides in a given set. This approach has some advantages compared to traditional methods of phylogenetic tree reconstruction, which, as a rule, attribute a definite weight to mutations independent of their location along the sequence. The tree construction is a very fast procedure which does not require preliminary alignment of sequences.2 The method was tested on 5S RNA sequences and its application to contig joining is discussed.
https://doi.org/10.1142/9789814503655_0035
Currently-available software tools are capable of predicting the locations of most protein-coding genes in anonymous genomic DNA sequences. The use of predicted exons to select primers for PCR amplification from cDNA libraries allows the complete structures of novel genes to be determined efficiently. As the number of expressed sequence tag (EST) sequences increases, the fraction of genes that can be localized in genomic sequences by searching EST databases will rapidly approach unity. The challenge for automated DNA sequence analysis is now to develop methods for accurately predicting gene structure and alternative splicing patterns. Substantially improving current accuracies in gene structure prediction will require retrospective comparative analysis of sequences from different organisms and gene families.
https://doi.org/10.1142/9789814503655_0036
In the interest of providing more efficient computer-based analysis of DNA and protein sequences, Cray Research has developed a high performance implementation of the sequence alignment method of Needleman and Wunsch using the programming technique of pocket arithmetic. The basis for this implementation is the program SEQHDP, which finds locally homologous subsequences of a protein sequence pair and determines the statistical significance of the homology. Pocket arithmetic takes advantage of the 64-bit width of an operand on the Cray Y-MP by packing more than one integer value per word, then performing logical or integer operations on the packed word to yield multiple results per operation. This technique, in combination with the vector processing capabilities of the Cray Y-MP CPU, produces substantially improved performance over the conventionally coded version of the same algorithm. We will introduce the programming technique of pocket arithmetic, then describe its implementation in the Needleman-Wunsch sequence comparison function in SEQHDP. Performance results based on actual protein sequence comparisons are presented.
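The packing idea can be illustrated in a few lines; this is a generic sketch of packed ("pocket") integer arithmetic, not the Cray SEQHDP code, and it assumes lane sums always fit within the lane width so no carries cross lane boundaries.

```python
# A minimal, generic sketch of "pocket" (packed) arithmetic: several small
# non-negative integers share one 64-bit word, so one addition updates all
# lanes at once. Lane width and overflow handling are simplified.
LANES, WIDTH = 4, 16
MASK = (1 << WIDTH) - 1

def pack(values):
    assert len(values) == LANES and all(0 <= v <= MASK for v in values)
    word = 0
    for v in reversed(values):
        word = (word << WIDTH) | v
    return word

def unpack(word):
    return [(word >> (WIDTH * i)) & MASK for i in range(LANES)]

def packed_add(a, b):
    """One addition updates all lanes, provided each lane's sum fits in WIDTH bits."""
    return (a + b) & ((1 << (LANES * WIDTH)) - 1)

if __name__ == "__main__":
    a = pack([3, 7, 1, 250])          # e.g. four cells of a score matrix row
    b = pack([2, 5, 9, 4])            # four match/mismatch increments
    print(unpack(packed_add(a, b)))   # -> [5, 12, 10, 254]
```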
https://doi.org/10.1142/9789814503655_0037
Patterns of short oligonucleotide distribution within DNA and RNA functional sites have been analysed using the “Site-Video” computer system. The group of DNA functional sites comprised nucleosome binding sites, gyrase cleavage sites, and promoters of E. coli and man. The group of RNA functional sites comprised donor and acceptor splice sites of man, translation initiation sites of E. coli and man, and translation frameshift sites. For each type of site, a specific set of oligonucleotide distribution patterns important for functioning and recognition has been revealed. At the same time, the number of specific patterns revealed in RNA sites was significantly higher than in DNA sites. On the basis of the results obtained, a scenario for the evolutionary emergence of functional sites is proposed. According to it, two types of context feature selection took place: (1) positive selection, leading to the appearance of definite types of context features in particular regions of functional sites; and (2) negative selection, leading to the elimination of definite types of context features in particular regions of functional sites. We suppose that the evolutionary formation of any functional site is a multistep process realized via a combination of positive and negative selections. Negative selection, via fixation of a specific pattern of mutations, eliminates false signals for regulatory proteins binding to the functional site. Positive selection leads to the appearance of local context features (signals) which provide for the specificity and efficiency of site functioning.
https://doi.org/10.1142/9789814503655_0038
GRAIL is a comprehensive system being constructed to analyze and characterize the genetic structure of DNA sequences. A number of program modules supply information to the system including the Coding Recognition Module (CRM), which forms the basis of the current e-mail GRAIL server system. Additional modules determine the positions and scores of possible splice junctions, the positions of potential translation-initiation sites, the coding strand for each gene, and the probable-translation-frame function over the sequence. The Gene Assembly Program module (GAP) attempts to predict the sequence of the spliced mRNA for a gene from the genomic DNA sequence. It constructs and scores gene models, given a DNA sequence and the outputs of the other GRAIL modules for the sequence. GAP tests combinations of those splice junctions which are within acceptable distance from the initial predicted edges of the coding regions. Every complete gene model, comprising translation-initiation site, splice junctions and stop codon, which agrees with GAP's set of rules is scored, and the ten highest-scoring models are saved. Each gene model's score depends on the input scores of splice junctions used in the model, their positions relative to the initial predicted edges of the included coding regions, and the degree of agreement of the entire model with the probable-translation-frame function. If error conditions are detected, the present version of GAP attempts to correct them by the insertion and/or deletion of one or more coding regions. These insertions and deletions have resulted in a net improvement of gene models, and a particularly large improvement in the recognition and characterization of very short coding regions. The results of GRAIL including the GAP module for 26 sequences from GenBank, each with an experimentally characterized gene, are quite promising and demonstrate the feasibility of constructing largely accurate gene models strictly on the basis of DNA sequence data.
https://doi.org/10.1142/9789814503655_0039
The availability of new ‘non-Von Neumann’ hardware architectures, and the consequent speed-up in performance, permits the development and implementation of more efficient algorithms for massive analysis of biosequences. At present, one of the most successful classes of new computational methodologies is artificial neural networks. In this paper we present an application of Kohonen's Self-Organizing Maps (KFM) to the recognition of uncommon domains in cDNA sequences using an FPS 500 EA vector computer.
https://doi.org/10.1142/9789814503655_0040
One of the major theoretical concerns associated with the Human Genome Project is that of the methodology to decipher “raw” sequences of DNA. This work is concerned with a subsequent problem, that of how the huge amounts of already deciphered information that will emerge in the near future can be integrated in order to enhance our biological understanding. The formal foundations for a linguistic theory of the regulation of gene expression will be discussed. The linguistic analysis presented here is restricted to sequences with known biological function since: i) there is no way to obtain, from DNA sequences alone, a regulatory representation of transcription units, and ii) the elements of substitution, methodologically equivalent to phonemes, are the complete sequences of the binding sites of proteins.
We have recently collected and analyzed the regulatory regions of a large number of E. coli promoters. The collection of sigma 70 promoters studied may well represent the largest homogeneous body of knowledge of gene regulation at present. This collection is a data set for the construction of a grammar of the sigma 70 system of transcription and regulation. This grammatical model generates all the arrays of the collection, as well as novel combinations predicted to be consistent with the principles of the data set. This grammar is testable, as well as expandable if the analysis of emerging data requires it. The elaboration of a linguistic methodology capable of integrating prokaryotic data constitutes a preliminary step towards the analysis and integration of the more complex eukaryotic systems of regulation.
https://doi.org/10.1142/9789814503655_0041
A new agarose gel model is introduced, which corresponds to what we believe agarose gels look like microscopically. While the scientific literature is filled with studies of the microscopic structure of agarose, the fact remains that there is no unambiguous and exact model of its underlying structure. Given this, we are left to construct our own model numerically.
https://doi.org/10.1142/9789814503655_0042
Producing a structure/function prediction from amino acid sequence is currently a major problem in sequence interpretation. We have developed a visual method to display and demonstrate conserved structural domains in protein and nucleic acid sequences, and have used this method to describe structural conservation in vertebrate lipase families3 and to define structural domains for subsequent database searching. We have developed strategies to search the databases using previously defined consensus domains in order to derive an empirical basis for the determination of higher order structures in proteins. This method was used successfully to define important structurally conserved regions in proteins.
https://doi.org/10.1142/9789814503655_0043
Several computer programs now available will predict exons based upon naive genomic sequence data, but they generally require access to a UNIX workstation or e-mail access to the Internet. We have developed a program, called SORFIND, which predicts vertebrate internal exons at 5 different confidence levels and which runs on an IBM-PC computer. The program reads sequence data in several formats, identifies “spliceable open reading frames” (SORFs) possessing high consensus scores with known acceptor and donor splice junctions, and analyzes codon usage. Potential exons are filtered through successive stages; in a data set of 130 human genes this results in the identification of 89.6% of the internal exons greater than 60 base pairs in length (62.5% predicted with exact splice junctions and reading frame, and a further 27.1% predicted with at least one exact splice junction and an average 77.3% overlap with true internal exons). Specificity (the percentage of SORFs that either completely or partially match a true exon) is 91.8%, 90%, 75.5%, 53.2% and 38.4% for the combined confidence levels 1, 1 and 2, 1 to 3, 1 to 4 and 1 to 5, respectively. The program's output displays nucleotide position, confidence level, reading frame phase at the 5′ and 3′ ends, acceptor and donor sequences and scoring statistics. It also generates an amino acid translation which can be used in protein database homology searches. The program compares favourably with the CRM module of GRAIL and with the GeneID program on an analysis of a 105 kilobase contig from human chromosome 4. It also successfully predicts exons from other vertebrates.
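Splice-junction consensus scoring of the kind SORFIND relies on is commonly done with a position weight matrix; the sketch below (ours, with a toy donor-site training set) shows the log-odds scoring step.

```python
# A minimal sketch (not SORFIND) of scoring a candidate donor splice junction
# with a position weight matrix built from aligned known sites. The tiny
# training set below is hypothetical; real tables come from curated databases.
from math import log

def build_pwm(sites, pseudocount=1.0):
    length = len(sites[0])
    pwm = []
    for j in range(length):
        col = [s[j] for s in sites]
        freqs = {b: (col.count(b) + pseudocount) / (len(sites) + 4 * pseudocount)
                 for b in "ACGT"}
        pwm.append({b: log(freqs[b] / 0.25) for b in "ACGT"})   # log-odds vs background
    return pwm

def score(window, pwm):
    return sum(col.get(b, log(0.25)) for b, col in zip(window.upper(), pwm))

if __name__ == "__main__":
    donors = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGG", "TCGGTAAGT"]   # toy examples
    pwm = build_pwm(donors)
    print(round(score("CAGGTAAGT", pwm), 2))   # a consensus-like window scores high
    print(round(score("CATTCCAGA", pwm), 2))   # an unrelated window scores low
```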
https://doi.org/10.1142/9789814503655_0044
The problems of functional site analysis and recognition are considered in this article. Methods for the selection of context features important for site functioning and recognition, via interactive and automatic analysis of the nucleotide sequences of functional sites, are described. The first method, which is based on utility theory, is applied for the generation and estimation of hypotheses on functional site context features with a high level of recognition ability and specificity. The second method permits us to reveal nonrandom patterns of short oligonucleotide distribution within the functional sites.
The construction of methods for functional site recognition using the revealed context features is also presented. Two different approaches are described: the “Site-Video” system for functional site analysis and recognition, and the method of large-scale oligonucleotide distributions. Using these approaches, recognition programs for splice sites and promoters of eukaryotes were constructed. The mean error of recognition is from 10 to 15% for the functional sites studied.
https://doi.org/10.1142/9789814503655_0045
We have developed a system, GeneID, for prediction of coding sequences in vertebrate DNA sequences. The approach relies on an ensemble of simple algorithms, the output of which is hierarchically assembled into likely models of gene structure. When tested on vertebrate genes smaller than 15,000 base pairs, the actual coding sequence is predicted with an average correlation coefficient of 0.70. This accuracy is higher than that of any other method currently available. On average, 69% of the coding region is predicted, which in most cases will be sufficient for a search against protein databases. GeneID is available as an electronic mail server; the predicted gene is automatically compared to protein databases, and any similarities to known proteins are reported.
https://doi.org/10.1142/9789814503655_0046
The sequencing by hybridization (SBH) method has been developed for assaying millions of 0.5- to 2-kb-long clones. This opens up an efficient way for defining the order of short clones and creating a physical map at 100-bp resolution. Moreover, complete sequences can be obtained using a modest number (about 3000) of probes if hybridization and gel sequence data from overlapped or similar sequences are used. In light of these possibilities, various heuristic algorithms have been developed and tested in simulation experiments. This approach can influence the interpretation of the intuitively obvious term, “known sequence”.
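A minimal sketch of the combinatorial core, with a hypothetical clone insert (not the authors' heuristics): the clone's k-mer spectrum is computed and a greedy walk over (k-1)-mer overlaps attempts a reconstruction, stopping where the spectrum branches.

```python
# A minimal sketch of the combinatorial core of SBH: approximate a clone's
# hybridization data by its k-mer spectrum, then walk (k-1)-mer overlaps
# greedily. Real data are unordered and noisy, which is why the chapter
# resorts to heuristics and partial sequences.
def spectrum(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def greedy_reconstruct(spec, start, k):
    used, seq = {start}, start
    while True:
        nexts = [w for w in spec if w not in used and w[:k - 1] == seq[-(k - 1):]]
        if len(nexts) != 1:              # stop at a branch or a dead end
            return seq, len(nexts)
        seq += nexts[0][-1]
        used.add(nexts[0])

if __name__ == "__main__":
    target = "ATGGCGTACGTTAGC"           # hypothetical clone insert
    k = 7
    rebuilt, branches = greedy_reconstruct(spectrum(target, k), target[:k], k)
    print(rebuilt == target, branches)   # -> True 0 for this unambiguous toy case
```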
https://doi.org/10.1142/9789814503655_0047
This paper presents a time- and space-efficient algorithm that searches for all homologous regions between two DNA sequences. A homologous region is defined by exact match segments interspersed with poorly matched segments (gaps) that contain a high proportion of errors. This definition is subject to two input parameters of the algorithm: the minimum length of exact match segments and the maximum length of gaps. We believe that the distribution of the exact match segments in a homologous region reveals more about the quality of homology than an error rate measure based on base-to-base substitution, deletion, and insertion errors.
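A minimal sketch of the stated model, not the paper's algorithm: shared exact segments of at least `min_match` bases are found by seeding and extension, then neighboring segments whose separations on both sequences stay within `max_gap` are chained into candidate homologous regions. The input sequences are placeholders.

```python
# A minimal sketch (not the paper's algorithm) of the homology model described
# above: exact segments of length >= min_match chained across gaps <= max_gap.
from collections import defaultdict

def exact_matches(a, b, min_match):
    """Seed with shared min_match-mers and extend each seed to a maximal exact segment."""
    index = defaultdict(list)
    for j in range(len(b) - min_match + 1):
        index[b[j:j + min_match]].append(j)
    seeds = set()
    for i in range(len(a) - min_match + 1):
        for j in index.get(a[i:i + min_match], ()):
            x, y = i, j
            while x > 0 and y > 0 and a[x - 1] == b[y - 1]:
                x, y = x - 1, y - 1
            k = min_match
            while x + k < len(a) and y + k < len(b) and a[x + k] == b[y + k]:
                k += 1
            seeds.add((x, y, k))                 # (start in a, start in b, length)
    return sorted(seeds)

def chain(segments, max_gap):
    """Group segments whose gaps on both sequences are between 0 and max_gap."""
    regions, current = [], []
    for seg in segments:
        if current:
            pi, pj, pk = current[-1]
            gap_a, gap_b = seg[0] - (pi + pk), seg[1] - (pj + pk)
            if not (0 <= gap_a <= max_gap and 0 <= gap_b <= max_gap):
                regions.append(current)
                current = []
        current.append(seg)
    if current:
        regions.append(current)
    return regions

if __name__ == "__main__":
    a = "ATGCCGTAAAACCCGGGTTACTGAC"       # hypothetical sequences
    b = "ATGCCGTATTTCCCGGGTTAGTGAC"
    print(chain(exact_matches(a, b, min_match=5), max_gap=4))
```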
https://doi.org/10.1142/9789814503655_0048
In this paper a new computing tool, Gen Viewer, is described. It is designed for recognition of gene coding regions in human nucleotide sequences. The recognition is based on a combined technique which involves searching for potential splice sites and creating a set of potential coding regions along with an estimation of their coding potential. Finally, through combined selection of the potential coding regions, a search for the gene with the maximal average coding potential is made. Gen Viewer works in two modes: automated or dialog. The automated version, ORFJ, allows a linear recognition scenario to be realized using its standard programs. An advanced interface provides for careful analysis of the obtained results and ensures an alternative, interactive prediction.
https://doi.org/10.1142/9789814503655_0049
The frequency and distribution of all possible purine/pyrimidine reverse complement hexamer pairs were determined in DNA sequences representing two different types of mammalian chromatin and in random sequence. Consistent deviations from random, both in the frequencies and in the clustering of specific patterns in the real sequences, were detected. Filtering the data for frequencies and distributions correlating with known chromatin structure revealed the informative pattern rryyrr/yyrryy. This motif has a periodic distribution in the range of 0.9–1.4 kb in a variety of sequences and is also in period with the distribution of 90% of the transcriptional regulatory domains. The periodic distribution of this motif is on the order of the repeat length of one turn (∼1.2 kb) of the 30 nm chromatin fiber. The observed distributions would result in a vertical alignment of clusters of this motif and transcriptional regulatory domains along one face of a 30 nm fiber. The relationship of this informative motif (rryyrr/yyrryy) to other non-randomly occurring conformational motifs is discussed.
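A minimal sketch (ours, with a placeholder sequence) of the counting step: the sequence is reduced to the purine/pyrimidine alphabet, hexamers are tallied together with their reverse complements, and occurrences of a chosen pair such as RRYYRR/YYRRYY are located so their spacing can be examined.

```python
# A sketch (ours) of purine/pyrimidine hexamer-pair counting and motif location.
from collections import Counter

PURINES = set("AG")

def to_ry(seq):
    return "".join("R" if b in PURINES else "Y" for b in seq.upper())

def ry_revcomp(word):
    swap = {"R": "Y", "Y": "R"}
    return "".join(swap[c] for c in reversed(word))

def hexamer_pair_counts(seq):
    ry = to_ry(seq)
    counts = Counter(ry[i:i + 6] for i in range(len(ry) - 5))
    pairs = Counter()
    for word, c in counts.items():
        pairs[min(word, ry_revcomp(word))] += c     # one key per complementary pair
    return pairs

def motif_positions(seq, motif="RRYYRR"):
    ry, rc = to_ry(seq), ry_revcomp(motif)
    return [i for i in range(len(ry) - 5) if ry[i:i + 6] in (motif, rc)]

if __name__ == "__main__":
    seq = "ATGGTTAACGCTTGGCCATCGAAC" * 10             # hypothetical input
    print(hexamer_pair_counts(seq).most_common(3))
    print(motif_positions(seq)[:10])
```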
https://doi.org/10.1142/9789814503655_0050
An effective interchange between biologists and scientists in other disciplines is essential to the success of the Human Genome Initiative. A hypertext bibliography is seen as a unique resource designed to bring non-biologists, and biologists with other specialties, up to speed in the field of Sequence Analysis. The character, advantages, and problems with hypertext are reviewed, and a specific plan is presented.
https://doi.org/10.1142/9789814503655_0051
By applying a fractal representation of nucleotide sequences to the plotting of sets of functionally similar sequences, a new approach for the classification of nucleic acid sequences was suggested1,2 and some measures of sequence similarity were introduced. Many examples of good separation of sequences belonging to different gene families were shown, among them good classification of such subfamilies as α- and β-actins and α-, β-, and γ-interferons. The method does not require an alignment procedure, either for generating a recognition matrix from the learning set or for searching for homologous regions. The computer time does not depend on the length of the query sequence, and the fractal images of sequence sets can be compared easily by computer procedures as well as visually. The latter is especially convenient when the density of the fractal mask is represented as a third coordinate of the image. The method has been successfully applied both to searching for genes (globins, histones, etc.) and for different kinds of repetitive DNA sequences (Alu, LTR, etc.).3 The FRS approach is also used for revealing gene structure in uncharacterized sequences. The fractal images for exons, introns, and 5′- and 3′-regions have significantly different patterns, which permits preliminary localization of these gene regions. We obtained functions which recognize small coding regions with 90% accuracy for 54 bp windows and 95% for 108 bp windows, and functions which recognize 5′- and 3′-regions of eukaryotic genes based on 8–9 bp oligonucleotide statistics, which can be simply recomputed as the current database grows. Some examples of the application of the fractal representation of amino acid sequences to protein classification are also given.
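One widely used fractal representation is the chaos game representation (CGR); the chapter's FRS method is in this spirit, though not necessarily identical. The sketch below maps a sequence into the unit square and coarse-grains the point cloud into a density grid that can serve as the sequence's "image".

```python
# A minimal sketch of the chaos game representation (CGR) of DNA, given here
# as an illustration of fractal sequence images; it is not necessarily the
# chapter's FRS construction. Each base pulls the current point halfway toward
# its corner of the unit square.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_points(seq):
    x, y, pts = 0.5, 0.5, []
    for base in seq.upper():
        if base not in CORNERS:
            continue
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        pts.append((x, y))
    return pts

def cgr_density(seq, grid=16):
    """Coarse-grain the point cloud onto a grid; cell counts double as k-mer counts."""
    counts = [[0] * grid for _ in range(grid)]
    for x, y in cgr_points(seq):
        counts[min(int(y * grid), grid - 1)][min(int(x * grid), grid - 1)] += 1
    return counts

if __name__ == "__main__":
    density = cgr_density("ATGGCGTACGTTAGCATTGCCGATATATGCGC" * 8)   # placeholder
    print(sum(map(sum, density)))     # one grid point per base consumed
```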
https://doi.org/10.1142/9789814503655_0052
The aim of this work is to model the Dayhoff matrix from an experimentally estimated matrix of nucleotide substitutions. The close resemblance observed between the Dayhoff matrix and the modeled matrix suggests that the frequencies of transitions and transversions in the genome determine the frequencies of amino acid substitutions. The substitutions apparently are not a consequence of selection for interchangeable amino acids.
https://doi.org/10.1142/9789814503655_0053
The fractal dimension (FD) of oligonucleotide composition is presented as an analog of genetic text complexity. FD values for prokaryotic and eukaryotic sequences are estimated. Reliable differences between the FD of coding and non-coding sequences in higher organisms are demonstrated. At the same time, the similar FD values of coding regions from different sources illustrate the stability of such sequences against evolutionary processes. The proposed method provides fast calculation of the FD value for sequences of any length.
https://doi.org/10.1142/9789814503655_bmatter
The following sections are included: