In this paper, we investigate the central problem of finding recombination events. It is commonly assumed that a present population is descended from a small number of specific sequences called founders. The recombination process, given two equal-length sequences, generates a third sequence of the same length by concatenating a prefix of one sequence with a suffix of the other. Due to recombination, a present sequence (called a recombinant) is thus composed of blocks from the founders. A major question related to founder sequences is the so-called Minimum Mosaic problem: using the natural parsimony criterion for the number of recombinations, find the "best" founders. In this article, we prove that the Minimum Mosaic problem, given haplotype recombinants with no missing values, is NP-hard when the number of founders is given as part of the input, and we propose some exact exponential-time algorithms for the problem, which can be considered polynomial provided some extra information. Note that Rastas and Ukkonen proved that the Minimum Mosaic problem is NP-hard using a somewhat unrealistic mutation cost function. The aim of this paper is to provide better insight into the complexity of the problem.
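As a concrete illustration of the recombination operation defined above, the following minimal Python sketch (not taken from the paper) builds a recombinant from two founders at a hypothetical breakpoint k:

```python
# Minimal sketch of single-crossover recombination: the recombinant is the
# prefix of one founder concatenated with the suffix of the other. The
# breakpoint k is a hypothetical illustration parameter.
def recombine(founder_a: str, founder_b: str, k: int) -> str:
    assert len(founder_a) == len(founder_b), "founders must have equal length"
    return founder_a[:k] + founder_b[k:]

# A recombinant composed of one block from each founder:
print(recombine("AAAAAA", "TTTTTT", 4))  # -> "AAAATT"
```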
We propose a model selection method to estimate the relationship between multiple SNPs, environmental factors and a binary disease trait. We apply a combination of logistic regression and a genetic algorithm for this purpose. The logistic regression model can capture the continuous effects of environmental factors without categorization, which would cause a loss of information. To construct an accurate prediction rule for the binary trait, we adopt Akaike's information criterion (AIC) to find the most effective set of SNPs and environmental factors; that is, the set of SNPs and environmental factors that gives the smallest AIC is chosen as the optimal set. Since the number of combinations of SNPs and environmental factors is usually huge, we propose the use of a genetic algorithm for choosing the SNPs and environmental factors that are optimal in the sense of AIC. We show the effectiveness of the proposed method through the analysis of case/control populations of diabetes, Alzheimer's disease and obesity patients. We succeeded in finding an efficient set for predicting types of diabetes and some SNPs that have strong interactions with age even though they are not significant as single loci.
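A minimal sketch of the selection criterion described above, assuming a statsmodels logistic regression; the column indices and toy data are illustrative, and a genetic algorithm would then search over candidate subsets to minimize this score:

```python
# Score a candidate subset of SNP/environment predictors by the AIC of a
# fitted logistic regression. Toy data only; not the authors' implementation.
import numpy as np
import statsmodels.api as sm

def aic_of_subset(X, y, columns):
    """Fit a logistic regression on the chosen columns and return its AIC."""
    design = sm.add_constant(X[:, columns])
    result = sm.Logit(y, design).fit(disp=0)
    return 2 * (len(columns) + 1) - 2 * result.llf  # AIC = 2k - 2 log-likelihood

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # 200 subjects, 10 candidate predictors
y = (X[:, 2] + 0.5 * X[:, 7] + rng.normal(size=200) > 0).astype(int)
print(aic_of_subset(X, y, [2, 7]), aic_of_subset(X, y, [0, 1]))
```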
We study haplotype reconstruction under the Mendelian law of inheritance and the minimum recombination principle on pedigree data. We prove that the problem of finding a minimum-recombinant haplotype configuration (MRHC) is in general NP-hard. This is, to our knowledge, the first complexity result concerning the problem. An iterative algorithm based on blocks of consecutive resolved marker loci (called block-extension) is proposed. It is very efficient and can be used for large pedigrees with a large number of markers, especially for data sets requiring few recombinants (or recombination events). A polynomial-time exact algorithm for haplotype reconstruction without recombinants is also presented. This algorithm first identifies all the necessary constraints based on the Mendelian law and the zero-recombinant assumption, and represents them using a system of linear equations over the cyclic group Z2. Using a simple method based on Gaussian elimination, we can obtain all feasible haplotype configurations. A C++ implementation of the block-extension algorithm, called PedPhase, has been tested on both simulated data and real data. The results show that the program performs very well on both types of data and will be useful for large-scale haplotype inference projects.
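The following is a minimal sketch, not the PedPhase implementation, of the linear-algebra step the abstract describes: row-reducing a system of constraints over Z2 so that an inconsistency (no zero-recombinant configuration) is detected and the free variables of the reduced system enumerate all feasible configurations. The bitmask encoding of equations is an assumption made for brevity:

```python
# Gaussian elimination over GF(2). Each equation is (coefficient-bitmask, rhs-bit);
# the real algorithm would derive these constraints from the Mendelian law and
# the zero-recombinant assumption.
def gf2_eliminate(rows):
    pivots = []  # list of (pivot bit position, (mask, rhs))
    for mask, rhs in rows:
        for pivot_bit, (pmask, prhs) in pivots:
            if (mask >> pivot_bit) & 1:   # eliminate this pivot variable
                mask ^= pmask
                rhs ^= prhs
        if mask == 0:
            if rhs:                        # 0 = 1: system is inconsistent
                return None
            continue
        pivots.append((mask.bit_length() - 1, (mask, rhs)))
    return pivots  # reduced system; remaining free variables span all solutions

# Toy system over x0, x1, x2:  x0 + x1 = 1,  x1 + x2 = 0,  x0 = 1
print(gf2_eliminate([(0b011, 1), (0b110, 0), (0b001, 1)]))
```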
Genome Analyzer (GenoA), a data mining and visualization tool-set with a relational database back-end, was developed to extract information from mammalian genomic sequences. It enables laboratory bench scientists to identify and assemble virtual cDNA from genomic exon sequences, and provides a starting point for identifying potential alternative splice variants and polymorphisms in silico. The study described in this paper demonstrates the use of GenoA to study the human brain hyperpolarization-activated cation channel genes HCN1 and HCN3.
A phylogenetic network is a generalization of a phylogenetic tree, allowing structural properties that are not tree-like. In a seminal paper, Wang et al.1 studied the problem of constructing a phylogenetic network, allowing recombination between sequences, with the constraint that the resulting cycles must be disjoint. We call such a phylogenetic network a "galled-tree". They gave a polynomial-time algorithm that was intended to determine whether or not a set of sequences could be generated on a galled-tree. Unfortunately, the algorithm by Wang et al.1 is incomplete and does not constitute a necessary test for the existence of a galled-tree for the data. In this paper, we completely solve the problem. Moreover, we prove that if there is a galled-tree, then the one produced by our algorithm minimizes the number of recombinations over all phylogenetic networks for the data, even allowing multiple-crossover recombinations. We also prove that when there is a galled-tree for the data, the galled-tree minimizing the number of recombinations is "essentially unique". We also note two additional results: first, any set of sequences that can be derived on a galled tree can be derived on a true tree (without recombination cycles), where at most one back mutation per site is allowed; second, the site compatibility problem (which is NP-hard in general) can be solved in polynomial time for any set of sequences that can be derived on a galled tree.
Perhaps more important than the specific results about galled-trees, we introduce an approach that can be used to study recombination in general phylogenetic networks.
This paper greatly extends the conference version that appears in an earlier work.8 PowerPoint slides of the conference talk can be found at our website.7
The problem of resolving genotypes into haplotypes, under the perfect phylogeny model, has been under intensive study recently. All studies so far have handled missing data entries in a heuristic manner. We prove that the perfect phylogeny haplotype problem is NP-complete when some of the data entries are missing, even when the phylogeny is rooted. We define a biologically motivated probabilistic model for genotype generation and for the way missing data occur. Under this model, we provide an algorithm that takes expected polynomial time. In tests on simulated data, our algorithm quickly resolves the genotypes under high rates of missing entries.
The existence of haplotype blocks transmitted from parents to offspring has been suggested recently. This has created an interest in inferring the block structure and length. The motivation is that well-characterized haplotype blocks should make it considerably easier to quickly map all the genes carrying human diseases. To study haplotype block inference systematically, we propose a statistical framework. In this framework, the optimal haplotype block partitioning is formulated as a problem of statistical model selection; missing data can be handled in a standard statistical way; population strata can be incorporated; block structure inference/hypothesis testing can be performed; and prior knowledge, if present, can be incorporated to perform a Bayesian inference. The algorithm is linear in the number of loci, in contrast to many such algorithms, for which the underlying problem is NP-hard. We illustrate the applications of our method to both simulated and real data sets.
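Not the authors' statistical model, but a generic sketch of the dynamic program that such a block-partitioning criterion can be plugged into; with a bound on block length the scan is linear in the number of loci. Here block_score is a hypothetical placeholder for the model-selection score of a block:

```python
# Generic optimal partitioning by dynamic programming; block_score(i, j) is a
# placeholder for the statistical score of a block spanning loci i..j-1.
def best_partition(n_loci, block_score, max_block=30):
    INF = float("inf")
    best = [0.0] + [INF] * n_loci      # best[j] = optimal score of loci 0..j-1
    back = [0] * (n_loci + 1)
    for j in range(1, n_loci + 1):
        for i in range(max(0, j - max_block), j):
            score = best[i] + block_score(i, j)
            if score < best[j]:
                best[j], back[j] = score, i
    blocks, j = [], n_loci             # recover block boundaries
    while j > 0:
        blocks.append((back[j], j))
        j = back[j]
    return best[n_loci], blocks[::-1]

# Toy score: fixed cost per block plus a per-locus cost.
print(best_partition(10, lambda i, j: 1.0 + 0.1 * (j - i)))
```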
The influence of genetic variations on diseases or cellular processes is the main focus of many investigations, and the results of biomedical studies are often only accessible through scientific publications. Automatic extraction of this information requires recognition of the gene names and the accompanying allelic variant information. In previous work, the OSIRIS system for the detection of allelic variation in text, based on a query expansion approach, was presented. Challenges associated with this system are the relatively low recall for variation mentions and gene name recognition. To tackle this challenge, we integrate the ProMiner system, developed for the recognition and normalization of gene and protein names, with a conditional random field (CRF)-based recognition of variation terms in biomedical text. Using a newly developed normalization of variation entities, we can link textual entities to Single Nucleotide Polymorphism database (dbSNP) entries. The performance of this novel approach is evaluated, and improved results in comparison to state-of-the-art systems are reported.
Two grand challenges in the postgenomic era are to develop a detailed understanding of heritable variation in the human genome, and to develop robust strategies for identifying the genetic contribution to diseases and drug responses. Haplotypes of single nucleotide polymorphisms (SNPs) have been suggested as an effective representation of human variation, and various haplotype-based association mapping methods for complex traits have been proposed in the literature. However, humans are diploid and, in practice, genotype data instead of haplotype data are collected directly. Therefore, efficient and accurate computational methods for haplotype reconstruction are needed and have recently been investigated intensively, especially for tightly linked markers such as SNPs. This paper reviews statistical and combinatorial haplotyping algorithms using pedigree data, unrelated individuals, or pooled samples.
Evolutionary trends have been examined in 146 HIV-1 forms (2662 copies, 2311 isolates) polymorphic for the TATA box using the "DNA sequence→affinity for TBP" regression (TBP is the TATA binding protein). As a result, a statistically significant excess of low-affinity TATA box HIV-1 variants, corresponding to a low level of both basal and TAT-dependent expression and, consequently, slow replication of HIV-1, has been detected. A detailed analysis revealed that the excess of slowly replicating HIV-1 is associated with the subtype E-associated TATA box core sequence "CATAAAA". Principal Component Analysis performed on 2662 HIV-1 TATA box copies from 70 countries revealed two principal components, PC1 (75.7% of the variance) and PC2 (23.3% of the variance). They indicate that each of these countries is specifically associated with one of the following trends in HIV-1 evolution: neutral drift around the normal TATA box; neutral drift around the slowly replicating TATA box core sequence (phylogenetic inertia); or an adaptive increase in the frequency of the slowly replicating form.
Haplotypes can provide significant information in many research fields, including molecular biology and medical therapy. However, haplotyping is much more difficult than genotyping using only biological techniques. With the development of sequencing technologies, it has become possible to obtain haplotypes by combining sequence fragments. The haplotype reconstruction problem for a diploid individual has received considerable attention in recent years: it assembles the two haplotypes of a chromosome from the collection of fragments coming from those two haplotypes. Fragment errors significantly increase the difficulty of the problem, which has been shown to be NP-hard. In this paper, a fast and accurate algorithm, named FAHR, is proposed for haplotyping a single diploid individual. FAHR reconstructs the SNP sites of a pair of haplotypes one after another. The SNP fragments that cover a SNP site are partitioned into two groups according to their alleles at that site, and the SNP values of the pair of haplotypes are ascertained using the fragments in the group that contains more SNP fragments. Experimental comparisons were conducted among the FAHR, Fast Hare and DGS algorithms using the haplotypes on chromosome 1 of 60 individuals in the CEPH samples released by the International HapMap Project. Experimental results under different parameter settings indicate that the reconstruction rate of FAHR is higher than those of Fast Hare and DGS, and that the running time of FAHR is shorter than those of Fast Hare and DGS. Moreover, FAHR remains efficient even for the reconstruction of long haplotypes and is very practical for realistic applications.
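One possible reading of the per-site step described in the abstract (not the published FAHR code) is sketched below: the fragments covering a site are partitioned by the allele they carry there, and the allele of the larger group is assigned to one haplotype, with the complementary allele going to the other. Fragments are modelled here as dicts mapping SNP site index to allele, which is an assumption made for illustration:

```python
# Sketch of one SNP-site call: partition covering fragments by allele and let
# the larger group decide. The fragment representation is illustrative only.
def call_site(fragments, site):
    group0 = [f for f in fragments if f.get(site) == 0]
    group1 = [f for f in fragments if f.get(site) == 1]
    allele = 0 if len(group0) >= len(group1) else 1
    return allele, 1 - allele  # alleles on haplotype 1 and haplotype 2

frags = [{0: 1, 1: 0}, {0: 1, 2: 1}, {0: 0, 1: 1}]
print(call_site(frags, 0))  # -> (1, 0): two of three fragments carry allele 1 at site 0
```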
The identification of haplotypes, which encode the SNP alleles on a single chromosome, makes it possible to perform a haplotype-based association test with disease. Given a set of genotypes from a population, the process of recovering the haplotypes that explain the genotypes is called haplotype inference (HI). We propose an improved preprocessing method for solving haplotype inference by pure parsimony (HIPP), which excludes a large number of redundant haplotypes by detecting groups of haplotypes that are dispensable for optimal solutions. The method uses only inclusion relations between groups of haplotypes, yet it dramatically reduces the number of candidate haplotypes and therefore reduces the computational time and memory usage of real HIPP solvers. The proposed method can easily be coupled with a wide range of optimization methods that consider a set of candidate haplotypes explicitly. For simulated and well-known benchmark datasets, the experimental results show that our method coupled with a classical exact HIPP solver runs much faster than the state-of-the-art solver and can solve a large number of instances that were previously unaffordable in a reasonable time.
The t-distributed stochastic neighbor embedding (t-SNE) is a new dimension reduction and visualization technique for high-dimensional data. t-SNE has rarely been applied to human genetic data, even though it is commonly used in other data-intensive biological fields, such as single-cell genomics. We explore the applicability of t-SNE to human genetic data and make the following observations: (i) similar to previously used dimension reduction techniques such as principal component analysis (PCA), t-SNE is able to separate samples from different continents; (ii) unlike PCA, t-SNE is more robust to the presence of outliers; (iii) t-SNE is able to display both continental and sub-continental patterns in a single plot. We conclude that the ability of t-SNE to reveal population stratification at different scales could be useful for human genetic association studies.
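A minimal sketch of the kind of comparison the abstract reports, using scikit-learn; the random genotype matrix is only a stand-in for real data and the parameter choices are illustrative:

```python
# Embed a genotype matrix (samples x SNPs, coded 0/1/2) with PCA and t-SNE.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(1)
genotypes = rng.integers(0, 3, size=(300, 1000)).astype(float)  # toy data: 300 samples

pca_coords = PCA(n_components=2).fit_transform(genotypes)
tsne_coords = TSNE(n_components=2, perplexity=30, init="pca",
                   random_state=1).fit_transform(genotypes)
print(pca_coords.shape, tsne_coords.shape)  # (300, 2) (300, 2)
```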
Mapping short reads to a reference genome is an essential step in many next-generation sequencing (NGS) analyses. In plants with large genomes, a large fraction of the reads can align to multiple locations of the genome with equally good alignment scores. How to map these ambiguous reads to the genome is a challenging problem with a large impact on downstream analysis. Traditionally, the default method is to assign an ambiguous read randomly to one of the many potential locations. In this study, we explore two alternative methods that are based on the hypothesis that the probability of an ambiguous read being generated by a location is proportional to the total number of reads produced by that location: (1) the enrichment method, which assigns an ambiguous read to the location that has produced the most reads among all the potential locations; and (2) the probability method, which assigns an ambiguous read to a location with a probability proportional to the number of reads the location produces. We systematically compared the performance of the proposed methods with that of the default random method. Our results showed that the enrichment method produced better results than the default random method and the probability method in the discovery of single nucleotide polymorphisms (SNPs). Not only did it produce more SNP markers, but it also produced SNP markers of better quality, as demonstrated by multiple mainstay genomic analyses, including genome-wide association studies (GWAS), minor allele distribution, population structure, and genomic prediction.
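A minimal sketch of the two proposed assignment rules (not the authors' pipeline); read_counts would come from the reads already assigned to each location, and the function and variable names are illustrative:

```python
# Assign an ambiguous read to one of its candidate locations.
import random

def assign_enrichment(candidate_locations, read_counts):
    """Enrichment rule: pick the candidate location with the most reads so far."""
    return max(candidate_locations, key=lambda loc: read_counts.get(loc, 0))

def assign_probability(candidate_locations, read_counts):
    """Probability rule: sample a location with probability proportional to its read count."""
    weights = [read_counts.get(loc, 0) + 1e-9 for loc in candidate_locations]
    return random.choices(candidate_locations, weights=weights, k=1)[0]

counts = {"chr1:100": 40, "chr5:900": 10}
print(assign_enrichment(["chr1:100", "chr5:900"], counts))   # -> "chr1:100"
print(assign_probability(["chr1:100", "chr5:900"], counts))  # "chr1:100" ~80% of the time
```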
Single nucleotide polymorphisms (SNPs) are the most frequently occurring genetic variations. Biologists use identified SNPs to investigate genetic diseases and heredity markers, and SNPs are also used to prevent side effects of medication; thus, SNPs play an important role in personalized medicine. However, many association studies report only the relationship among SNPs, diseases and cancers, without giving an SNP ID. In order to identify SNPs in a sequence, this research built dbSNP, SNP fasta and SNP flanking marker databases for the rat, mouse and human genomes from the NCBI database. The proposed method utilizes SNP flanking markers that are extracted from an SNP fasta sequence and combines a Boyer–Moore algorithm with a dynamic programming method. The Boyer–Moore algorithm helps to select possible SNPs from the SNP fasta database using unknown sequences, and the dynamic programming method then validates these SNPs. This method is very reliable in retrieving SNP IDs from an unknown sequence. The experimental results show that this method is indeed able to determine exact SNP IDs from a sequence. It constitutes a novel application for the identification of SNP IDs from the literature and can be used in systematic association studies.
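A minimal sketch of the two-stage idea, under the assumption that the marker is a prefix of the stored flanking sequence; Python's built-in str.find stands in for the Boyer–Moore search, and a simple edit-distance dynamic program performs the validation. The data structures and thresholds are illustrative, not the published implementation:

```python
# Stage 1: exact marker search screens candidate SNPs.
# Stage 2: edit-distance DP validates the full flanking sequence.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def candidate_snp_ids(query, snp_db, marker_len=15, max_dist=3):
    """snp_db maps SNP ID -> flanking sequence whose prefix serves as the marker."""
    hits = []
    for snp_id, flank in snp_db.items():
        pos = query.find(flank[:marker_len])  # fast screening (Boyer-Moore stand-in)
        if pos != -1 and edit_distance(query[pos:pos + len(flank)], flank) <= max_dist:
            hits.append(snp_id)               # DP validation
    return hits
```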
Identification of causal noncoding single nucleotide polymorphisms (SNPs) is important for maximizing the knowledge dividend from human genome-wide association studies (GWAS). Recently, diverse machine learning-based methods have been used for functional SNP identification; however, this task remains a fundamental challenge in computational biology. We report CERENKOV3, a machine learning pipeline that leverages clustering-derived and molecular network-derived features to improve prediction accuracy of regulatory SNPs (rSNPs) in the context of post-GWAS analysis. The clustering-derived feature, locus size (number of SNPs in the locus), derives from our locus partitioning procedure and represents the sizes of clusters based on SNP locations. We generated two molecular network-derived features from representation learning on a network representing SNP-gene and gene-gene relations. Based on empirical studies using a ground-truth SNP dataset, CERENKOV3 significantly improves rSNP recognition performance in AUPRC, AUROC, and AVGRANK (a locus-wise rank-based measure of classification accuracy we previously proposed).
We report the development of a homogeneous assay for the genotyping of single-nucleotide polymorphisms (SNPs) utilizing a fluorescent dsDNA-binding dye. Termed Tm-shift genotyping, this method combines multiplex allele-specific PCR with sequence differentiation based on the melting temperatures of the amplification products. Allele-specific primers differing in length were used with a common reverse primer in a single-tube assay. PCR amplification followed by melt curve analysis was performed with a fluorescent dsDNA-binding dye on a real-time capillary thermocycler. Genotyping was carried out in a single-tube homogeneous assay in 25 minutes. We compared the accuracy and efficiency of this Tm-shift genotyping method with conventional restriction fragment genotyping of a novel single nucleotide polymorphism in the Jagged1 (JAG1) gene. The flexibility, economy and accuracy of this new method for genotyping polymorphisms could make it useful for a variety of research and diagnostic applications.
Eighty percent of the DNA outside protein-coding regions was shown to be biochemically functional by the ENCODE project, enabling studies of its interactions. Studies have since explored how convergent downstream mechanisms arise from independent genetic risks of one complex disease. However, the cross-talk and epistasis between intergenic risks associated with distinct complex diseases have not been comprehensively characterized. Our recent integrative genomic analysis unveiled downstream biological effectors of disease-specific polymorphisms buried in intergenic regions, and we then validated their genetic synergy and antagonism in distinct GWAS. We extend this approach to characterize convergent downstream candidate mechanisms of distinct intergenic SNPs across distinct diseases within the same clinical classification. We construct a multipartite network consisting of 467 diseases organized in 15 classes, 2,358 disease-associated SNPs, 6,301 SNP-associated mRNAs identified by eQTL, and mRNA annotations to 4,538 Gene Ontology mechanisms. Functional similarity between two SNPs (similar SNP pairs) is imputed using a nested information-theoretic distance model, for which p-values are assigned by conservative scale-free permutation of network edges without replacement (node degrees held constant). At FDR ≤ 5%, we prioritized 3,870 intergenic SNP pairs, among which 755 are associated with distinct diseases sharing the same disease class, implicating 167 intergenic SNPs, 14 classes, 230 mRNAs, and 134 GO terms. Co-classified SNP pairs were more likely to be prioritized than pairs from distinct classes, confirming a noncoding genetic underpinning to clinical classification (odds ratio ∼3.8; p ≤ 10^-25). The prioritized pairs were also enriched in regions bound to the same or interacting transcription factors and/or interacting in long-range chromatin interactions suggestive of epistasis (odds ratio ∼2,500; p ≤ 10^-25). This prioritized network implicates complex epistasis between intergenic polymorphisms of co-classified diseases and offers a roadmap for a novel therapeutic paradigm: repositioning medications that target proteins within downstream mechanisms of intergenic disease-associated SNPs. Supplementary information and software: http://lussiergroup.org/publications/disease_class
In recent years, the study of the influence of genetic factors on susceptibility to some common diseases has been producing satisfactory results. These results contribute to the prevention of these diseases as well as to the design of personalized treatments. The present work introduces a technique based on 2D representations of the genetic data that can help to find these disease associations. A real-case application is presented in which we analyze the relation between the allele pair values of 748 Single Nucleotide Polymorphisms (SNPs) and the susceptibility to seven common psychiatric disorders.