The Pacific Symposium on Biocomputing brings together key researchers from the international biocomputing community. It is designed to be maximally responsive to the need for critical mass in subdisciplines within biocomputing. These proceedings contain peer-reviewed articles in computational biology and bioinformatics.
https://doi.org/10.1142/9789814447300_fmatter
https://doi.org/10.1142/9789814447300_0001
Biology is currently undergoing a shift from a mostly qualitative to an information rich, quantitative science. Using large-scale biological technologies, we are gaining global views of structural and dynamic information in the form of whole genome sequences and the corresponding gene activity patterns at the RNA and protein level…
https://doi.org/10.1142/9789814447300_0002
Complementary DNA microarray and high density oligonucleotide arrays opened the opportunity for massively parallel biological data acquisition. Application of these technologies will shift the emphasis in biological research from primary data generation to complex quantitative data analysis. Reverse engineering of time-dependent gene-expression matrices is amongst the first complex tools to be developed. The success of reverse engineering will depend on the quantitative features of the genetic networks and the quality of information we can obtain from biological systems. This paper reviews how the (1) stochastic nature, (2) the effective size, and (3) the compartmentalization of genetic networks as well as (4) the information content of gene expression matrices will influence our ability to perform successful reverse engineering.
https://doi.org/10.1142/9789814447300_0003
Liang, Fuhrman and Somogyi (PSB98, 18-29, 1998) have described an algorithm for inferring genetic network architectures from state transition tables which correspond to time series of gene expression patterns, using the Boolean network model. Their computational experiments suggested that a small number of state transition (INPUT/OUTPUT) pairs is sufficient to infer the original Boolean network correctly. This paper gives a mathematical proof for their observation. Precisely, this paper devises a much simpler algorithm for the same problem and proves that, if the indegree of each node (i.e., the number of input nodes to each node) is bounded by a constant, only O(log n) state transition pairs (from 2^n pairs) are necessary and sufficient to identify the original Boolean network of n nodes correctly with high probability. We conducted computational experiments in order to expose the constant factor involved in O(log n) notation. The computational results show that a Boolean network of size 100,000 can be identified by our algorithm from about 100 INPUT/OUTPUT pairs if the maximum indegree is bounded by 2. Another merit of our algorithm is that it is conceptually so simple that it can be extended to more realistic network models.
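The abstract above does not spell out the algorithm, so the following is only a minimal sketch of one consistency-based way to identify a Boolean network from INPUT/OUTPUT pairs, assuming every node has exactly `indegree` inputs. The function name `consistent_input_sets` and the 3-gene demo network are hypothetical and are not taken from the paper.

```python
from itertools import combinations, product


def consistent_input_sets(pairs, n, indegree=2):
    """For each node, list the input sets (and partial truth tables) that
    agree with every observed (input_state, output_state) pair."""
    candidates = []
    for node in range(n):
        found = []
        for inputs in combinations(range(n), indegree):
            table = {}          # partial truth table implied by the data
            ok = True
            for state_in, state_out in pairs:
                key = tuple(state_in[i] for i in inputs)
                # The same input pattern must always give the same output.
                if table.setdefault(key, state_out[node]) != state_out[node]:
                    ok = False
                    break
            if ok:
                found.append((inputs, table))
        candidates.append(found)
    return candidates


# Hypothetical 3-gene network, used only to exercise the sketch.
def demo_step(state):
    x0, x1, x2 = state
    return (x1 & x2, x0 ^ x2, x0 | x1)


all_pairs = [(s, demo_step(s)) for s in product((0, 1), repeat=3)]
candidates = consistent_input_sets(all_pairs, n=3)
for node, found in enumerate(candidates):
    print(node, [inputs for inputs, _ in found])
```

A node is identified once exactly one candidate input set remains. In this sketch a node whose rule depends on fewer than `indegree` inputs would leave several supersets consistent, which a fuller treatment would need to handle.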
https://doi.org/10.1142/9789814447300_0004
We propose a differential equation model for gene expression and provide two methods to construct the model from a set of temporal data. We model both transcription and translation by kinetic equations with feedback loops from translation products to transcription. Degradation of proteins and mRNAs is also incorporated. We study two methods to construct the model from experimental data: Minimum Weight Solutions to Linear Equations (MWSLE), which determines the regulation by solving under-determined linear equations, and Fourier Transform for Stable Systems (FTSS), which refines the model with cell cycle constraints. The results suggest that a minor set of temporal data may be sufficient to construct the model at the genome level. We also give a comprehensive discussion of other extended models: the RNA Model, the Protein Model, and the Time Delay Model.
https://doi.org/10.1142/9789814447300_0005
Large-scale gene expression data sets are revolutionizing the field of functional genomics. However, few data analysis techniques fully exploit this entirely new class of data. We present a linear modeling approach that allows one to infer interactions between all the genes included in the data set. The resulting model can be used to generate interesting hypotheses to direct further experiments.
https://doi.org/10.1142/9789814447300_0006
Since A. M. Turing's paper proposing a mathematical basis for pattern formation in developing organisms, many mathematical approaches have been proposed to model biological phenomena. Continued laboratory study and recent improvements in measurement capabilities have provided an immense quantity of raw gene expression data. The level of data now available demands the development of well-characterized and tested computational tools. Thus, we have examined one mathematical model's sensitivity to errors in estimating its parameters. Errors in parameter estimation can arise from noise in the laboratory measurements and recasting of laboratory data. We elected to examine the rule-based mathematical model of Mjolsness et al. for its sensitivity to errors in estimated parameters. We have used the technique of sensitivity equations as generally applied in nonlinear systems analysis.
https://doi.org/10.1142/9789814447300_0007
A stochastic model of ColE1 plasmid replication is presented. It is implemented by using UltraSAN, a simulation tool based on an extension of stochastic Petri nets (SPNs). It allows an exploration of the variation in plasmid number per bacterium, which is not possible using a deterministic model. In particular, the rate at which plasmid-free bacteria arise during bacterial division is explored in some detail since spontaneous plasmid loss is a widely observed empirical phenomenon. The rate of spontaneous plasmid loss provides an evolutionary explanation for the maintenance of Rom protein. The presence of Rom acts to reduce variance in plasmid copy number, thereby reducing the rate of plasmid loss at bacterial division. The ability of stochastic models to link biochemical function with evolutionary considerations is discussed.
https://doi.org/10.1142/9789814447300_0008
The formation of Drosophila wings and legs is a major research topic in Drosophila development, and several hypotheses, such as the polar-coordinate model and the boundary model, have been proposed to explain the mechanisms behind these phenomena. A series of recent studies has revealed complex interactions among genes involved in establishing the three principal axes (A-P, D-V, and P-D) of leg formation. In this paper, we present a simulation system for leg formation that simulates the gene interactions involved. We use this simulator to investigate a mathematical framework of leg formation which is otherwise well-founded from a molecular perspective. In particular, we focus on the formation of the expression patterns of the dpp, wg, dll, dac, al, en, hh and ci genes, which are involved in the development of the third instar Drosophila leg disc. The most interesting part of this research is showing how the coaxial gene expression patterns behind the P-D axis can be formed, and how positional information, as postulated in the polar-coordinate model, can be conveyed to each cell. Our results suggest that the P-D axis can be formed by a set of genes with different activation thresholds; the process involves different chemical gradients of dpp and wg products, forming a bi-polar contour. Interestingly, this combination of chemical gradients can specify unique positions of cells for the hemisphere, leaving the A-P axis determiner to decide only whether the cells are anterior or posterior. All in all, our so-called Bi-Polar Model describes axial formation of the leg disc well.
https://doi.org/10.1142/9789814447300_0009
Ligand-binding sites in homologous protein domains can diverge greatly during evolution. This poses a particularly interesting problem in those cases where the ligand-binding site is situated in, or close to, the domain core, or where ligand-docking induces dramatic conformational changes. These features are present in many receptors and enzymes; the hormone-binding domain of the nuclear receptors for steroids and retinoids, for example, exhibits both characteristics. It is therefore of great interest to determine how binding sites for diverse ligands evolve in core regions of structurally dynamic domains. Are evolutionary changes locally restricted to the ligand-binding site, or are they distributed throughout the domain? We describe here an information-theoretic approach for the study of covariation between ligand-contacting residues and compensatory mutations that preserve the structural integrity and the conformational dynamics of ligand-binding domains. We apply this method to the analysis of the nuclear receptor ligand-binding domain and show that the ligand-contacting residues in the hormone-binding pocket are evolutionarily linked to an extensive network of covarying positions.
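The abstract does not name the exact covariation measure. As an illustration only, the sketch below assumes mutual information between two alignment columns, a common information-theoretic choice; the function name `column_mutual_information` and the toy columns are hypothetical.

```python
from collections import Counter
from math import log2


def column_mutual_information(col_a, col_b):
    """Mutual information (in bits) between two equal-length alignment
    columns, estimated from observed residue frequencies."""
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in count_ab.items():
        p_ab = c / n
        mi += p_ab * log2(p_ab / ((count_a[a] / n) * (count_b[b] / n)))
    return mi


# Toy columns: residues at two positions across four aligned sequences.
covarying = column_mutual_information("AAGG", "LLVV")
independent = column_mutual_information("AAGG", "LVLV")
print(covarying, independent)
```

High values flag position pairs that change together across the family. With small alignments such estimates are biased upward, so a real analysis would need a correction or a significance test.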
https://doi.org/10.1142/9789814447300_0010
We designed a Java applet called Net Work which enables a user to interactively construct and visualize a genetic network of interest, and to evaluate and explore its dynamics in the framework of a Boolean network model. Net Work displays the mechanism of gene interactions at the level of gene expression and enables the visualization of large genetic networks. Net Work can serve as an interactive interface to tools for the analysis of genetic network structure and behavior.
https://doi.org/10.1142/9789814447300_0011
Systematic gene expression analyses provide comprehensive information about the transcriptional response to different environmental and developmental conditions. With enough gene expression data points, computational biologists may eventually generate predictive computer models of transcription regulation. Such models will require computational methodologies that are consistent with the behavior of known biological systems and remain tractable. We represent regulatory relationships between genes as linear coefficients or weights, with the “net” regulation influence on a gene's expression being the mathematical summation of the independent regulatory inputs. Test regulatory networks generated with this approach display stable and cyclically stable gene expression levels, consistent with known biological systems. We include variables to model the effect of environmental conditions on transcription regulation and observe various alterations in gene expression patterns in response to environmental input. Finally, we use a derivation of this model system to predict the regulatory network from simulated input/output data sets and find that it accurately predicts all components of the model, even with noisy expression data.
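The weighted-sum idea described above can be sketched as follows. The summation comes from the abstract, while the logistic squashing step, the function names, and the example weights are assumptions added here for illustration.

```python
from math import exp


def net_regulation(weights, expression):
    """Net regulatory input to each gene: the sum of weighted expression
    levels of its regulators (row i of `weights` holds gene i's inputs)."""
    return [sum(w * x for w, x in zip(row, expression)) for row in weights]


def step(weights, expression):
    """One update of the expression levels.

    Assumption: a logistic function maps the net input into (0, 1);
    the abstract only specifies the linear summation."""
    return [1.0 / (1.0 + exp(-r)) for r in net_regulation(weights, expression)]


# Hypothetical 2-gene example: gene 0 activates itself and is repressed by gene 1.
weights = [[1.0, -2.0],
           [0.5, 0.0]]
levels = [1.0, 0.5]
print(net_regulation(weights, levels))
print(step(weights, levels))
```

Iterating `step` from a starting state is one way to look for the stable or cyclic expression patterns the abstract mentions.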
https://doi.org/10.1142/9789814447300_0012
The development and growth of molecular databases over the last decade has brought a growing problem to the biocomputing community. Our ability to analyze, summarize and extract information from these databases has lagged far behind our ability to collect and store data. As well, traditional methods for handling data (either automated or manual) cannot be effectively applied because of the volume and complexity of these emerging databases…
https://doi.org/10.1142/9789814447300_0013
There are various cases where the biological function of an RNA molecule involves a reversible change of conformation. paRNAss is a software approach to the prediction of such structural switching in RNA. It is based on three hypotheses about the secondary structure space of a switching RNA molecule, which can be evaluated by RNA folding and structure comparison. In the positive case, the predicted structures must be verified experimentally. Additionally, we give an animated visualization of an energetically favourable transition between the predicted structures. paRNAss is available via the Bielefeld Bioinformatics Server.
This paper explains the underlying model and shows that the approach performs well in a variety of applications.
https://doi.org/10.1142/9789814447300_0014
The detection of motifs within and among families of protein sequences can provide useful information regarding the function, structure and evolution of a protein. With the increasing number of computer programs available for motif detection, a comparative evaluation of the programs from a biological perspective is warranted. This study uses a set of 20 reverse transcriptase (RT) protein sequences to test and compare the ability of 7 different computational methods to locate the ordered-series-of-motifs that are well characterized in the RT sequences. The results provide insight to biologists as to the usage, value, and reliability of the numerous methods available.
https://doi.org/10.1142/9789814447300_0015
We consider the problem of obtaining the maximum a posteriori probability (MAP) estimate of a consensus ancestral sequence for a set of DNA sequences. Our maximization method, called ASA (dnA Sequence Alignment), can be applied to the refinement of noisy regions of a DNA assembly, to the alignment of genomic functional sites, or to the alignment of any set of DNA sequences related by a star-like phylogeny. Along with the optimal consensus, ASA finds suboptimal solutions together with their relative probabilities. The probabilistic approach makes it possible to establish the limits to which an ancestor can in principle be recovered from diverged sequences. In simulations on rather short synthetic sequences (of length up to 80) with different coverage and error rates ranging from 5% to 30%, ASA restored the consensus from noisy observations essentially as well as is theoretically possible for the given error rates. We also illustrate the performance of ASA on the alignment of E. coli promoters and the Alu-Sb subfamily of human repeat sequences. Since our model is a special case of a profile HMM, we give a comparison between these two approaches, as well as with other DNA alignment methods.
https://doi.org/10.1142/9789814447300_0016
Hidden Markov Models (HMMs) provide a flexible method for representing protein sequence data. Highly divergent data require a more complex approach to HMM generation than previously demonstrated. We describe a strategy of motif anchoring and sub-class modeling that aids in the construction of more informative HMMs as determined by a new algorithm called a stability measure.
https://doi.org/10.1142/9789814447300_0017
We propose a novel method to detect 5' splice sites of eukaryotic mRNA. We have grouped the 5' splice sites into various classes. The clustered sites are represented by a set of PWMs. The clustering algorithm is similar to the k-means clustering algorithm but the distance definition and the training score function were arranged. The clustered PWMs were applied to 5' splice site detection. The results showed an improvement in comparison with a traditional single PWM. The result of the clusters suggests there are new motifs of 5' splice sites.
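To make the role of a set of PWMs concrete, here is a minimal sketch of log-odds scoring against several position weight matrices, taking the best class score. The uniform background, the pseudocount, the function names, and the toy sites are assumptions; the paper's distance definition and training score are not reproduced.

```python
from math import log2


def build_pwm(sites, alphabet="ACGT", pseudocount=1.0):
    """Log-odds PWM from equal-length aligned sites.

    Assumption: uniform background frequency of 0.25 per base."""
    pwm = []
    for pos in range(len(sites[0])):
        counts = {base: pseudocount for base in alphabet}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: log2((counts[b] / total) / 0.25) for b in alphabet})
    return pwm


def score(pwm, seq):
    """Sum of per-position log-odds for a candidate of the PWM's length."""
    return sum(column[base] for column, base in zip(pwm, seq))


def best_class_score(pwms, seq):
    """Score a candidate against every class PWM and keep the maximum."""
    return max(score(pwm, seq) for pwm in pwms)


# Two hypothetical site classes, standing in for clustered 5' splice sites.
class_a = build_pwm(["GTAAGT", "GTGAGT", "GTAAGA"])
class_b = build_pwm(["GTATGT", "GTACGT", "GTATGC"])
print(best_class_score([class_a, class_b], "GTAAGT"))
print(best_class_score([class_a, class_b], "CCCCCC"))
```

Using the maximum over classes lets a candidate match any one subfamily of sites rather than an average of all of them, which is the intuition behind replacing a single PWM with several.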
https://doi.org/10.1142/9789814447300_0018
Recognition of short peptides of 8 to 10 mer bound to MHC class I molecules by cytotoxic T lymphocytes forms the basis of cellular immunity. While the sequence motifs necessary for binding of intracellular peptides to MHC have been well studied, little is known about sequence motifs that may cause preferential affinity to the T cell receptor and/or preferential recognition and response by T cells. Here we demonstrate that computational learning systems can be useful to elucidate sequence motifs that affect T cell activation. Knowledge of T cell activation motifs could be useful for targeted vaccine design or immunotherapy. With the BONSAI computational learning algorithm, using a database of previously reported MHC bound peptides that had positive or negative T cell responses, we were able to identify sequence motif rules that explain 70% of positive T cell responses and 84% of negative T cell responses.
https://doi.org/10.1142/9789814447300_0019
In recent years, there has been an explosive amount of molecular biology information obtained and deposited in various databases. Identifying and interpreting interesting patterns from this massive amount of information has become an essential component in directing further molecular biology research.
The goal of this research is to discover structural regularities in protein sequences by applying the SUBDUE discovery system to databases found in the Brookhaven Protein Data Bank. In this paper we report the results of applying SUBDUE to several classes of protein structures and discuss the potential significance of these results to the study of proteins.
https://doi.org/10.1142/9789814447300_0020
Genomic science and structural biology meet in the relationship between the sequence and the structure of nucleic acids. The structure that supports each function is preserved in the process of evolution as specific sequences. In particular, the same sequence appearing in a different place, such as a palindromic or repetitive sequence, has biophysical meaning: recognition sites of dimers, formation of stem-loops, and contributions to the global structure of nucleic acids. Also, the genetic network, transduction pathway, and tissue specificity largely depend on these. Although the relationship between them can be found experimentally, there is increasing demand for automated analysis. In particular, it is desirable to extract identical character sequences of arbitrary length (especially very long ones) which co-occur at an arbitrary separation. We propose an algorithm to identify the maximum match sequence at each position with a calculation cost of O(N log N) and memory space of O(N). Applying it to some sequences, we found unexpectedly large palindromes and repeats in DNA.
https://doi.org/10.1142/9789814447300_0021
To improve the accuracy of rapid homology searching it is common practice to filter all queries to mask low complexity regions prior to searching. We show in this paper, through a large-scale study of querying the PIR database, that applying popular filtering techniques unselectively to all queries may reduce retrieval effectiveness. We also show that masking queries with our new technique, cafefilter, which uses the overall distribution of motifs in a database, is at least as effective as current popular query filtering tools in large-scale tests.
https://doi.org/10.1142/9789814447300_0022
Over the last two decades there have been tremendous advances with quantitative experimental techniques in physiology and signal transduction. On the cellular level these techniques include the scanning confocal microscopy, single channel recordings from intracellular channels, fluorescent dyes that allow visualization of the spatial and temporal dynamics of second messengers, and bioengineered indicators that can be specifically targeted to particular intracellular organelles…
https://doi.org/10.1142/9789814447300_0023
This paper describes a computational framework for cell biological modeling and simulation that is based on the mapping of experimental biochemical and electrophysiological data onto experimental images. The framework is designed to enable the construction of complex general models that encompass the general class of problems coupling reaction and diffusion.
https://doi.org/10.1142/9789814447300_0024
This simulation study presents an inquiry into the mechanisms by which a strong electric shock halts life-threatening cardiac arrhythmias. It examines the “extension of refractoriness” hypothesis for defibrillation which postulates that the shock induces an extension of the refractory period of cardiac cells thus blocking propagating waves of arrhythmia and fibrillation. The present study uses a model of the defibrillation process that represents a sheet of myocardium as a bidomain with unequal anisotropy ratios. The tissue consists of curved fibers in which spiral wave reentry is initiated. The defibrillation shock is delivered via two line electrodes that occupy opposite tissue boundaries. Simulation results demonstrate that a large-scale region of depolarization is induced throughout most of the tissue. This depolarization extends the refractoriness of the cells in the region. In addition, new wavefronts are generated from the regions of induced hyperpolarization that further restrict the spiral wave pathway and cause its termination.
https://doi.org/10.1142/9789814447300_0025
Notions of information and/or complexity have been applied to the analysis of a broad spectrum of biologically relevant problems, such as protein structure prediction, DNA motif discovery and compression, and neural spike trains…
https://doi.org/10.1142/9789814447300_0026
Suppose that a biologist wishes to study some local property P of genetic sequences. If he can design (with a computer scientist) an algorithm C which efficiently compresses parts of the sequence which satisfy P, then our algorithm TURBOOPTLIFT locates very quickly where property P occurs by chance on a sequence, and where it occurs as a result of a significant process. Under some conditions, the time complexity of TURBOOPTLIFT is O(n log n). We illustrate its use on the practical problem of locating approximate tandem repeats in DNA sequences.
https://doi.org/10.1142/9789814447300_0027
The combination of a wealth of structural data and impressive computational power provides detailed information pertaining to the structure and dynamics of biomacromolecules. A natural inclination is to incorporate this information into models to gain added predictive power on protein folding and stability. There has been considerable recent interest in developing “knowledge-based” potentials to describe internal interactions in proteins. In these approaches, probability distribution functions are inferred from existing knowledge. A common assumption has been the “quasi-chemical approximation” or “Boltzmann device”. This method relates statistical mechanical probabilities to observed frequencies. The validity of this approach is discussed in detail from a statistical mechanics perspective. Because statistical mechanics is a form of statistical inference based on a lack of knowledge of the system, the “Boltzmann device” does not have a rigorous theoretical justification. In the present work, a statistical mechanics based on partial knowledge of the system is employed. This statistical mechanical scheme uses the minimum description length (MDL) of phase space as its main tool. With this approach, “knowledge-based” potentials can be derived in a rigorous fashion. In practical calculations, these potentials are best obtained using Bayesian inference methods similar to those used in image reconstruction.
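For orientation, the “Boltzmann device” is usually written as an inverse Boltzmann relation between an effective energy and observed frequencies. The form below is the common textbook statement, not a formula quoted from the paper, and the symbols are assumptions: $f_{\mathrm{obs}}(x)$ is the frequency of feature $x$ in known structures and $f_{\mathrm{ref}}(x)$ is its frequency in a chosen reference state.

```latex
\Delta E(x) = -k_B T \,\ln \frac{f_{\mathrm{obs}}(x)}{f_{\mathrm{ref}}(x)}
```

Features seen more often than in the reference state receive a negative (favourable) energy. The choice of reference state is the step the abstract's criticism bears on.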
https://doi.org/10.1142/9789814447300_0028
An understanding of the regularities in the side chain conformations of proteins and how these are related to local backbone structures is important for protein modeling and design. Previous work using regular secondary structures and regular divisions of the backbone dihedral angle data has shown that these rotamers are sensitive to the protein's local backbone conformation. In this preliminary study, we demonstrate a method for combining a more general backbone structure model with an objective clustering algorithm to investigate the effects of backbone structures on side chain rotamer classes and distributions. For the local structure classification, we use the Structural Building Blocks (SBB) categories, which represent all types of secondary structure, including regular structures, capping structures, and loops. For classification of side chain data, we use Minimum Message Length (MML) clustering from information theory. We show an example of how MML clustering on data classified by backbone SBBs can reveal different distributions of rotamer classes among the SBBs. Using these preliminary results, some of the characteristics of a rotamer library created using MML clustering on SBB dependent rotamer data are demonstrated.
https://doi.org/10.1142/9789814447300_0029
This paper describes a new approach to problem solving by splitting up problem component parts between software and hardware. Our main idea arises from the combination of two previously published works. The first one proposed a conceptual environment of concept modelling in which the machine and the human expert interact. The second one reported an algorithm based on a reconfigurable hardware system which outperforms any kind of previously published genetic database scanning hardware or algorithms. Here we show how efficient the interaction between the machine and the expert is when the concept modelling is based on a reconfigurable hardware system. Their cooperation is thus achieved at a real-time interaction speed. The designed system has been partially applied to the recognition of primate splice junction sites in genetic sequences.
https://doi.org/10.1142/9789814447300_0030
Mutual correlation between segments of DNA or protein sequences can be detected by Smith-Waterman local alignments. We present a statistical analysis of alignment of such sequences, based on a recent scaling theory. A new fidelity measure is introduced and shown to capture the significance of the local alignment, i.e., the extent to which the correlated subsequences are correctly identified. It is demonstrated how the fidelity may be optimized in the space of penalty parameters using only the alignment score data of a single sequence pair.
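As background for the penalty parameters mentioned above, here is a minimal sketch of the Smith-Waterman local alignment score with a linear gap penalty. The default values for `match`, `mismatch`, and `gap` are arbitrary assumptions, and the fidelity measure from the paper is not implemented.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b.

    Uses two rows of the dynamic programming table; every cell is floored
    at 0 so an alignment can start anywhere."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


print(smith_waterman_score("TTACGTT", "GGACGGG"))
```

Varying `mismatch` and `gap` changes which subsequences are reported as aligned, which is the parameter space the abstract refers to.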
https://doi.org/10.1142/9789814447300_0031
A minimum description length (MDL) and stochastic complexity approach for model selection in robust linear regression is studied in this paper. Computational aspects and implementation of this approach to practical problems are the focuses of the study. Particularly, we provide both algorithms and a package of S language programs for computing the stochastic complexity and proceeding with the associated model selection. A simulation study is then presented for illustration and comparing the MDL approach with the commonly used AIC and BIC methods. Finally, an application is given to a physiological study of triathlon athletes.
https://doi.org/10.1142/9789814447300_0032
In the reconstruction of phylogenetic trees from molecular data, it has been pointed out that multifurcate phylogenetic trees are difficult to reconstruct correctly with conventional methods such as the maximum likelihood (ML) method. To resolve this problem, we have been developing a new phylogenetic tree reconstruction method based on the minimum complexity principle widely used in inductive inference. Our method, which we call the “minimum model-based complexity (MBC) method”, has so far proved efficient in estimating multifurcate branching when the tree is described as a rooted tree. In this study, we further investigate the efficiency of the MBC method in estimating multifurcation in unrooted phylogenetic trees. To do so, we conduct computer simulations in which the estimates of the MBC method are compared with those of ML, AIC and a statistical test approach. The results show that the MBC method also provides good estimates even in the case of multifurcate unrooted trees and suggest that it could be generally used for the reconstruction of phylogenetic trees having arbitrary multifurcations.
https://doi.org/10.1142/9789814447300_0033
Data continues to accumulate rapidly from the various genome projects and from experimental methods such as X-ray crystallography, NMR spectroscopy, and electron and confocal microscopy. The vast volume of sequence, structural, and functional data, the wide variety of analyses and annotations to be performed, and the variety of organisms and projects represented in this deluge of information combine to present the bioinformatics community with unprecedented challenges…
https://doi.org/10.1142/9789814447300_0034
Proteinmorphosis is a physically-based interactive modeling system for simulating large or small conformational changes of proteins and protein complexes. It takes advantage of the cross-linked one-dimensional nature of protein chains. The user can, based on her chemical knowledge, pull pairs of points (lying either on a single protein or on different molecules) together by specifying geometric distance constraints. The resulting conformation(s) of the molecule(s) of interest is computed by an efficient finite element formalism taking into account elasticity of the protein backbone, van der Waals repulsions, hydrogen bonds, salt bridges and the imposed distance constraints. The conformational change is computed incrementally and the result can be visualized as an animation; complete interactivity is provided to position and view the proteins as desired by the user. Physical properties of regions on the protein can also be chosen interactively. The conformational change of calmodulin upon peptide binding is examined as a first experiment. It is found that the result is satisfactory in reproducing the conformational change that follows on peptide binding. We use Proteinmorphosis to study the cooperative hemoglobin oxygen binding mechanism in a second, more sophisticated, experiment. Different modeling strategies are designed to understand the allosteric (cooperative) binding process in this system and the results are found to be consistent with existing hypotheses.
https://doi.org/10.1142/9789814447300_0035
Protein fold recognition (sometimes called threading) is the prediction of a protein's 3-dimensional shape based on its similarity to a protein of known structure. Fold predictions are low resolution; that is, no effort is made to rotate the protein's component amino acid side chains into their correct spatial orientations. The goal is simply to recognize the protein family member that most closely resembles the target sequence of unknown structure and to create a sensible alignment of the target to the known structure (i.e., a structure-sequence alignment). To facilitate this type of structure prediction, we have designed a low resolution molecular graphics tool. ProtAlign introduces the ability to interact with and edit alignments directly in the 3-dimensional structure as well as in the usual 2-dimensional layout. It also contains several functions and features to help the user assess areas within the alignment. ProtAlign implements an open pipe architecture to allow other programs to access its molecular graphics capabilities. In addition, it is capable of “driving” other programs. Because amino acid side chain orientation is not relevant in fold recognition, we represent amino acid residues as abstract shapes or glyphs much like Lego (tm) blocks and we borrow techniques from comparative flow visualization using streamlines to provide clean depictions of the entire protein model. By creating a low resolution representation of protein structure, we are able to at least double the amount of information on the screen. At the same time, we create a view that is not as busy as the corresponding representations using traditional high resolution visualization methods which show detailed atomic structure. This eliminates distracting and possibly misleading visual clutter resulting from the mapping of protein alignment information onto a high resolution display of the known structure. 
This molecular graphics program is implemented in OpenGL to facilitate porting to other platforms.
https://doi.org/10.1142/9789814447300_0036
We present and evaluate PROMUSE: an integrated visualization/sonification system for analyzing pairwise protein structural alignments (superpositions of two protein structures in three-dimensional space). We also explore how the use of sound can enhance the perception and recognition of specific aspects of the local environment at given positions in the represented molecular structure.
Sonification presents several opportunities to researchers. For those with visual impairment, data sonification can be a useful alternative to visualization. Sonification can further serve to improve understanding of information in several ways. One use for data sonification is in tasks such as background monitoring, in which case sounds can be used to indicate thresholding events. With PROMUSE, data represented visually may be enhanced or disambiguated by adding sound to the presentation. This aspect of data representation is particularly important for showing features that are difficult to represent visually, due to occlusion or other factors. Another feature of our system is that by representing some variables through sound and others visually, the amount of information that may be represented simultaneously is extended. Our tool aims to augment the power of data visualization rather than replace it.
To maximize the utility of our sonifications to represent data, we employed musical voices and melodic components with unique characteristics. We also used sound effects such as panning a voice to the left or right speaker and changing its volume to maximize the individuality of the sonification elements. By making the sonification parameters distinct, we allow the user to focus on those portions of the sonification necessary to resolve possible ambiguities in the visual display.
Sonifications of low level data such as raw protein or DNA sequences tend to sound random, and not very musical. We chose instead to sonify an analysis of data features, and thereby present a higher level view of the data. We also used brief melodic phrases rather than single notes in order to generate sounds that were more pleasing and musically idiomatic.
To validate the utility of our system, we present the results of an experiment in which PROMUSE was used to test the use of sound as an aid for clarifying visual information. We also compare the overall effectiveness of visual versus aural information delivery.
https://doi.org/10.1142/9789814447300_0037
Computer-based multimedia technology for distance learning and research has come of age: the price point is acceptable, domain experts using off-the-shelf software can prepare compelling materials, and the material can be efficiently delivered via the Internet to a large audience. While not presenting any new scientific results, this paper outlines experiences with a variety of commercial and free software tools and the associated protocols we have used to prepare protein documentaries and other multimedia presentations relevant to molecular biology. A protein documentary is defined here as a description of the relationship between structure and function in a single protein or in a related family of proteins, using text and images and further enhanced by sound and interactive graphics. Examples of documentaries prepared to describe cAMP-dependent protein kinase, the founding structural member of the protein kinase family, for which there are now over 40 structures, can be found at http://franklin.burnham-inst.org/rcsb. A variety of other prototype multimedia presentations for molecular biology described in this paper can be found at http://franklin.burnham-inst.org/rcsb
https://doi.org/10.1142/9789814447300_0038
The BioJAKE program has been created for the visualization, creation and manipulation of metabolic pathways. It has been designed to provide a familiar and easy-to-use interface while still allowing for the input and manipulation of complex and detailed metabolic data. In recognition of the detailed and diverse sources of data available across the Internet, it also provides a mechanism by which remote database queries can be stored and performed with respect to individual molecules within a pathway. This remote database access functionality is offered in addition to comprehensive local database creation, management and querying capability. The program has been developed in Java so as to provide for platform independence and maximum extendibility.
https://doi.org/10.1142/9789814447300_0039
One of the challenges in biocomputing is to enable the efficient use of a wide variety of fast-evolving computational methods to simulate, analyze, and understand the complex properties and interactions of molecular systems. Our laboratory investigates several areas including molecular visualization, protein-ligand docking, protein-protein docking, molecular surfaces, and the derivation of phenomenological potentials. In this paper we present an approach based on the Python programming language to achieve a high level of integration between these different computational methods and our primary visualization system, AVS. This approach removes many limitations of AVS while increasing dramatically the inter-operability of our computational tools. Several examples are shown to illustrate how this approach enables a high level of integration and inter-operability between different tools, while retaining modularity and avoiding the creation of a large monolithic package that is difficult to extend and maintain.
https://doi.org/10.1142/9789814447300_0040
Computer-aided drug design (CADD) is an exciting and diverse discipline where various aspects of applied and basic research merge and stimulate each other. The latest technological advances (QSAR/QSPR, structure-based design, combinatorial library design, cheminformatics & bioinformatics); the growing number of chemical and biological databases; and an explosion in currently available software tools are providing a much improved basis for the design of ligands and inhibitors with desired specificity.
https://doi.org/10.1142/9789814447300_0041
A new field-based similarity forcing procedure for matching conformationally-flexible molecules is presented. The method extends earlier work on similarity matching of molecules based upon the program MIMIC, by directly coupling a similarity function to a molecular mechanics force field. In this way conformational energetics are fully accounted for in the similarity matching process. Simultaneous similarity/conformational searches can then be undertaken within a Monte Carlo or molecular dynamics framework. Here, a Monte Carlo approach is used to provide a simple example of two HIV-1 reverse transcriptase inhibitors, nevirapine and αAPA, that illustrates the basic characteristics of the method and suggests areas for further investigation.
https://doi.org/10.1142/9789814447300_0042
The thermodynamics of ligand-protein molecular recognition is investigated by the energy landscape approach for two systems: methotrexate (MTX)-dihydrofolate reductase (DHFR) and biotin-streptavidin. The temperature-dependent binding free energy profile is determined using the weighted histogram analysis method. Two different force fields are employed in this study: a simplified model of ligand-protein interactions and the AMBER force field with a soft core smoothing component, used to soften the repulsive part of the potential. The results of multiple docking simulations are rationalized from the shape of the binding free energy profile that characterizes the thermodynamics of the binding process.
https://doi.org/10.1142/9789814447300_0043
Empirical and theoretical approaches to drug discovery have often been perceived as mutually exclusive. Our experience has rather demonstrated that they can be complementary. The structure-based approach to design of compound libraries is clearly helpful; however, testing large libraries continues to reveal unanticipated actives in many of our programs. A rationale for these observations is offered.
https://doi.org/10.1142/9789814447300_0044
A new method, particularly suited to the analysis of High Throughput Screening data, is presented for the determination of quantitative structure activity relationships. The method, termed “Binary QSAR,” accepts binary activity measurements (e.g., pass/fail or active/inactive) and molecular descriptor vectors as input. A Bayesian inference technique is used to predict whether a new compound will be active or inactive. Experiments were conducted on a data set of 1947 molecules. The results show that the method exhibits high accuracy and is robust to measurement errors.
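As a rough illustration of how binary activity labels and descriptor vectors can feed a Bayesian classifier, the following minimal sketch fits a Bernoulli naive Bayes model. The abstract does not specify the inference technique, so the conditional-independence assumption, the binarized descriptors, the Laplace smoothing, and the names `fit_binary_nb` and `prob_active` are all assumptions made for this example and should not be read as the Binary QSAR method itself.

```python
import math


def fit_binary_nb(X, y, alpha=1.0):
    """Fit a Bernoulli naive Bayes model (an illustrative stand-in only).

    X: list of binary descriptor vectors (0/1 entries, equal length).
    y: list of binary activity labels (1 = active, 0 = inactive).
    alpha: Laplace smoothing constant (assumed, not from the abstract).
    """
    n_features = len(X[0])
    counts = {0: 0, 1: 0}
    ones = {0: [0] * n_features, 1: [0] * n_features}
    for row, label in zip(X, y):
        counts[label] += 1
        for j, bit in enumerate(row):
            ones[label][j] += bit
    total = len(y)
    model = {"prior": {}, "p_bit": {}}
    for c in (0, 1):
        # Smoothed class prior and per-descriptor probability of a set bit.
        model["prior"][c] = (counts[c] + alpha) / (total + 2 * alpha)
        model["p_bit"][c] = [
            (ones[c][j] + alpha) / (counts[c] + 2 * alpha)
            for j in range(n_features)
        ]
    return model


def prob_active(model, x):
    """Posterior probability that descriptor vector x belongs to the active class."""
    log_post = {}
    for c in (0, 1):
        lp = math.log(model["prior"][c])
        for bit, p in zip(x, model["p_bit"][c]):
            lp += math.log(p if bit else 1.0 - p)
        log_post[c] = lp
    # Normalize in log space to avoid underflow with many descriptors.
    m = max(log_post.values())
    w0 = math.exp(log_post[0] - m)
    w1 = math.exp(log_post[1] - m)
    return w1 / (w0 + w1)
```

A compound would then be called active when `prob_active` exceeds a chosen threshold such as 0.5; real descriptor vectors are usually continuous and would need binning or a different likelihood model.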
https://doi.org/10.1142/9789814447300_0045
A new method for 3-D similarity is presented based on the multiple potential 4-point 3-D pharmacophores expressed by ligands and complementary to receptors. These are calculated for ligands taking conformational flexibility into account, and for receptors through the use of complementary site-points. Through this common frame of reference both ligand-ligand and ligand-receptor similarity studies are possible. The application of the method to selectivity between different serine proteases (thrombin, factor Xa and trypsin) is discussed, and the need to use 4-point pharmacophores rather than 3-point pharmacophores is illustrated. A novel refinement to the potential pharmacophore method that uses a “special” feature to give a relative measure of similarity/diversity is also discussed.
https://doi.org/10.1142/9789814447300_0046
Computer simulations offer critical insights into the reaction of biological macromolecules, especially when the molecular shapes are too complex to be amenable to analytical solution. In this work, the Weighted-Ensemble Brownian (WEB) Dynamics simulation algorithm is adapted to a reaction of two unlike biological molecules, with the interaction modeled by a two-parameter system: a spherical molecule depositing on a target region of an infinite cylinder with periodic boundary conditions. The original algorithm of Huber and Kim¹ is streamlined for this class of reactive models. The reaction rate constant is calculated as a function of the relative sizes of the reactive and non-reactive regions of the cylindrical molecule. An analytical expression for the rate constant is also obtained from the solution of the diffusion equation for the special case of a constant-flux boundary condition. Good agreement between analytical and simulation results validates the applicability of WEB Dynamics to a reaction of molecules of complicated shape. In addition, the simple form of our analytical expression makes it useful as a test case for other simulation and numerical techniques.
https://doi.org/10.1142/9789814447300_0047
Protein structure prediction from sequence remains a major unsolved problem despite decades of work by the best minds in science. However, the availability of genome-scale data is changing the basis for the computational analysis of biological systems, and organizational patterns previously obscured by limited data sets now are becoming apparent. The opportunity to relate sequence, structure, and function at the full organism level is at hand.
https://doi.org/10.1142/9789814447300_0048
Various bioinformatics comparison problems require optimizing several different properties simultaneously. Often linear objective functions combine the values for different properties of solution candidates into a single score to allow for multivariate optimization. In this context, an essential question is how each property should be weighted. Frequently, no apparent measure is available to serve as a model for the score. However, if preferences of certain solution candidates over others in a training set are available, the implied partial ordering may be used to best possibly adjust the weights. We apply different strategies to optimize the parameterization of empirical scoring functions used for two molecular comparison problems, protein threading and small molecule superposition. Using well established evaluation methods, it can be shown that the results of both comparison methods are significantly improved by systematically choosing appropriate weights for the scoring function contributions.
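To make the idea of fitting weights from preferences concrete, here is a minimal sketch of one possible strategy: a perceptron-style update that nudges the weights of a linear score whenever a preferred candidate fails to outscore a less preferred one. The abstract does not say which optimization strategies were used, so the update rule, the margin, the learning rate, and the names `score` and `learn_weights` are assumptions chosen for illustration.

```python
def score(weights, features):
    """Linear objective: weighted sum of the property values of one candidate."""
    return sum(w * f for w, f in zip(weights, features))


def learn_weights(preferences, n_features, epochs=100, lr=0.1, margin=1.0):
    """Adjust weights so that preferred candidates outscore the others.

    preferences: list of (better, worse) pairs of feature vectors, i.e. the
                 partial ordering taken from a training set.
    margin, lr, epochs: illustrative hyperparameters (assumed).
    """
    weights = [0.0] * n_features
    for _ in range(epochs):
        violations = 0
        for better, worse in preferences:
            if score(weights, better) - score(weights, worse) < margin:
                violations += 1
                # Move the weights toward the difference (better - worse).
                for j in range(n_features):
                    weights[j] += lr * (better[j] - worse[j])
        if violations == 0:
            break  # every preference is satisfied with the requested margin
    return weights
```

If the preferences cannot all be satisfied by any linear score, the loop simply stops after `epochs` passes and returns the last weights, so the result should be checked against a held-out set of preferences.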
https://doi.org/10.1142/9789814447300_0049
It is shown that there are two types of conserved residues in evolutionarily and functionally related proteins whose sequences have diverged substantially in evolution. The first group consists of residues forming the active center, while the second (first established in this work) has nothing to do with function and therefore should be related to protein structure and/or protein folding. The latter group consists of 4 residues in c-type cytochromes and 6 residues in globins. All these residues belong to α-helices and occupy positions (i, i+4) or (i, i+3), stabilizing one helical turn in some helices. These residues form an interface between the N- and C-terminal helices in c-type cytochromes and helices A, G, H in globins. These helical complexes form early in protein folding and are relatively stable in both equilibrium and kinetic folding intermediates. The attractive hypothesis is that these helices form folding nuclei in proteins within the framework of the nucleation-growth mechanism of protein folding.
https://doi.org/10.1142/9789814447300_0050
An approach to construct low resolution models of protein structure from sequence information using a combination of different methodologies is described. All possible compact self-avoiding Cα conformations (≈ 10 million) of a small protein chain were exhaustively enumerated on a tetrahedral lattice. The best scoring 10,000 conformations were selected using a lattice-based scoring function. All-atom structures were then generated by fitting an off-lattice four-state ϕ/ψ model to the lattice conformations, using idealised helix and sheet values based on predicted secondary structure. The all-atom conformations were minimised using ENCAD and scored using a second hybrid scoring function. The best scoring 50, 100, and 500 conformations were input to a consensus-based distance geometry routine that used constraints from each of the conformation sets and produced a single structure for each set (total of three). Secondary structures were again fitted to the three structures, and the resulting structures were minimised and scored. The lowest scoring conformation was taken to be the “correct” answer. The results of application of this method to twelve proteins are presented.
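The exhaustive enumeration step can be sketched as a depth-first search over self-avoiding walks on a tetrahedral (diamond) lattice. This is only a minimal illustration: it enumerates every self-avoiding walk of a given length and applies no compactness filter, scoring function, or symmetry reduction, all of which the described method would need. The bond-vector representation and the names `BONDS`, `enumerate_walks`, and `count_walks` are assumptions made for this example.

```python
# Four bond directions of the tetrahedral (diamond) lattice. Sites on one
# sublattice step along +BONDS, sites on the other along -BONDS, so the
# direction sign alternates with every step of the walk.
BONDS = ((1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1))


def enumerate_walks(n_steps):
    """Yield every self-avoiding walk of n_steps bonds starting at the origin."""

    def extend(path, visited):
        if len(path) == n_steps + 1:
            yield tuple(path)
            return
        # Even-numbered steps use +BONDS, odd-numbered steps use -BONDS.
        sign = 1 if (len(path) - 1) % 2 == 0 else -1
        x, y, z = path[-1]
        for dx, dy, dz in BONDS:
            nxt = (x + sign * dx, y + sign * dy, z + sign * dz)
            if nxt in visited:
                continue  # would revisit a site: not self-avoiding
            path.append(nxt)
            visited.add(nxt)
            yield from extend(path, visited)
            path.pop()
            visited.remove(nxt)

    origin = (0, 0, 0)
    yield from extend([origin], {origin})


def count_walks(n_steps):
    """Number of self-avoiding walks with n_steps bonds from a fixed origin."""
    return sum(1 for _ in enumerate_walks(n_steps))
```

The number of walks grows roughly threefold with each added bond, which is why exhaustive enumeration of this kind is practical only for short chains or with additional restrictions such as compactness.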
https://doi.org/10.1142/9789814447300_0051
It is commonly assumed that a protein must attain a stable, folded conformation in order to carry out its specific biological function. Not all proteins conform to this simple view of protein structure and function, however. Certain regions within proteins, and in some cases entire proteins, are not ordered into a unique tertiary structure, but instead appear to exist as ensembles of structures…
https://doi.org/10.1142/9789814447300_0052
Co-chaperonins from diverse organisms exhibit mobile loops which fold into a β hairpin conformation upon binding to the chaperonin. GroES, Gp31, and human Hsp10 mobile loops exhibit a preference for the β hairpin conformation in the free co-chaperonins, and the conformational dynamics of the human Hsp10 mobile loop appear to be restricted by nascent hairpin formation. Backbone conformational entropy must weigh against binding of co-chaperonins to chaperonins, and thus the conformational preferences of the loops may strongly influence chaperonin-binding affinity. Indeed, subtle mutations in the loops change GroEL-binding affinity and cause defects in chaperonin function, and these defects can be suppressed by mutations in GroEL which compensate for the changes in affinity. The fact that high-affinity co-chaperonin binding impairs chaperonin function has implications for the mechanism of chaperonin-assisted protein folding.
https://doi.org/10.1142/9789814447300_0053
The anti-cancer drug taxol is known to bind to and induce the polymerization of tubulin and has recently been shown to bind to the anti-apoptotic protein Bcl-2, but not to its homolog, Bcl-XL. Libraries of random peptides displayed on the surface of a bacteriophage were screened to select those exhibiting affinity for taxol. The sequences of these peptides were compared to sequences of proteins involved in mitosis and apoptosis. No significant similarities were detected between the sequences of tubulins and the taxol-selected peptides. However, a high level of similarity exists between the selected peptides and the disordered loop of Bcl-2. Conversely, there was little similarity between the sequences of the selected peptides and Bcl-XL. These results indicate that peptides displayed on the surface of a bacteriophage can mimic the ligand-binding behavior of a disordered protein loop and that comparison of the sequences of affinity-selected peptides with protein sequences can be predictive for ligand binding.
https://doi.org/10.1142/9789814447300_0054
A combination of experimental NMR ³Jαβ coupling constant measurements and theoretical predictions from a statistical model for a random coil has been used to characterise the conformations of amino acid side-chains in an unfolded fibronectin binding protein. The statistical model uses the distribution of torsion angles in a database of native folded protein structures to provide a description of the torsion angle populations of each residue in a random coil. For all but three of the residues studied, a close agreement is observed between the experimental ³Jαβ data and the model predictions (correlation coefficient 0.90; RMSD 0.70 Hz). In these cases the populations about the χ1 torsion angles in the conformational ensemble defining the fibronectin binding protein are well described by those present in the protein database. For Phe 69, Asp 92 and Asp 105, however, significant deviations are observed between the predictions and experimental data. Each of these side-chains is found to be involved in persistent non-random structural features arising from clustering of hydrophobic groups or interactions between charged side-chains. The analysis demonstrates the detailed insight that can be provided into conformationally disordered states by combining experimental and theoretical approaches.
https://doi.org/10.1142/9789814447300_0055
On the basis of available X-ray structures, A-class glutathione S-transferases (GSTs) contain at their C-termini a short α-helix that provides a ‘lid’ over the active site in the presence of the reaction products, glutathione-conjugates. However, in the ligand-free enzyme this helix is disordered and crystallographically invisible. An aromatic cluster including Phe-10, Phe-220, and the catalytic Tyr-9 within the C-terminal strand controls the order of this helix. Here, preliminary X-ray crystallographic analyses of the wild type and F220Y rGSTA1-1 in the presence of GSH are described. Also, a transition state analysis is presented for ligand-dependent formation of the helix, based on variable temperature stopped-flow fluorescence. Together, the results suggest that the ligand-dependent ordering of the C-terminal strand occurs with a transition state that is highly desolvated, but with few intramolecular hydrogen bonds or electrostatic interactions. However, substitutions at Phe-220 modulate the activation parameters through interactions with the side chain of Tyr-9.
https://doi.org/10.1142/9789814447300_0056
Recombinant forms of the N-terminal domain of the cell adhesion receptor CD2 adopt a variety of folds by exchange of β-sheets between adjacent polypeptide chains. Although these interdigitated forms are normally metastable, we have used site-directed mutagenesis to alter the kinetics of formation and relative stabilities of these states, leading to spontaneous formation of monomeric, dimeric, trimeric and tetrameric intertwined folded states. A characteristic feature of these fold-disorder-alternative fold transitions is the independence of each domain folding event, as deduced from kinetic analysis of folding data. Structures for fully interdigitated trimeric and tetrameric forms have been modelled, consistent with both the crystallographic and kinetic data. Although the biological role of these alternative folded states remains unclear, these structures form a remarkable demonstration of the fluidity of structure generated from a single polypeptide chain.
https://doi.org/10.1142/9789814447300_0057
The amino-terminus of eucaryotic DNA topoisomerase I and the carboxy-terminus of eucaryotic DNA topoisomerase II contain sequences that are enriched in charged amino acid residues, hyper-sensitive to protease digestion, not required for the in vitro topoisomerase activities, and able to tolerate insertion and deletion mutations, and thus may have a disordered structure. In an interesting contrast to the catalytically essential core domain, the sequences in these terminal hydrophilic domains are not conserved among the topoisomerases from different species. However, many lines of evidence, including those presented here, demonstrate that the topoisomerase tail domains have critical intracellular functions. The biological functions of the amino-terminus of topoisomerase I include nuclear import and targeting to transcriptionally active loci. The carboxy-terminus of topoisomerase II also contains the sequences necessary for nuclear localization and possibly sequences necessary for other critical functions.
https://doi.org/10.1142/9789814447300_0058
Advances in structural biology have provoked a re-evaluation of the biological significance of the disordered state of proteins. We believe that the rules that govern structure, stability and kinetics in the molecular recognition between disordered polypeptide chains can be elucidated by studying processes that couple association with folding. The reassembly of single domain proteins by fragment complementation provides an excellent opportunity to study them. Since almost the complete sequence is available, although not on a single chain, most of the complementary fragments are expected to reassemble. However, that happens not to be the case. We have chosen E. coli thioredoxin (Trx), a small, single α/β-domain protein, as a model system to study the effect of the site and number of cleavages on the reassembly of complementary fragments. We have shown in atomic detail the reassembly after cleavage of a loop (1-73, 74-108)¹ and after cleavage of an α-helix (1-37, 38-108).² Although both sets of fragments produce native-like complexes, there are clear differences in the interface geometry, apparent stability of the folded state and mechanism of association/folding: (i) the apparent equilibrium dissociation constant for the 1-37/38-108 complex (4 μM) is higher than the one for the 1-73/74-108 complex (49 nM), (ii) the apparent rate constants of non-self-association are similar (about 10³ M⁻¹ s⁻¹), and (iii) only the 1-37 fragment self-associates under these experimental conditions. Here the competition between self- and non-self-association leads to an apparently less stable 1-37/38-108 complex.
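The gap between the two apparent dissociation constants can be put on an energy scale with the standard relation ΔG° = RT ln(Kd), taken relative to a 1 M standard state. The short sketch below does this arithmetic. The temperature of 298.15 K is an assumption, since the abstract does not state the experimental temperature, and the function name `binding_free_energy` is hypothetical.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)


def binding_free_energy(kd_molar, temperature_k=298.15):
    """Standard free energy of association in kcal/mol from a dissociation
    constant in mol/L. The default temperature is an assumed value."""
    return R_KCAL * temperature_k * math.log(kd_molar)


# Apparent dissociation constants quoted in the abstract.
dg_helix_cut = binding_free_energy(4e-6)   # 1-37/38-108 complex, 4 uM
dg_loop_cut = binding_free_energy(49e-9)   # 1-73/74-108 complex, 49 nM

# Positive value: the helix-cleaved complex is the less stable of the two.
ddg = dg_helix_cut - dg_loop_cut
```

Under these assumptions the roughly 80-fold difference in Kd corresponds to about 2.6 kcal/mol of apparent stability, with the loop-cleaved complex being the more stable one.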