Search name | Searched On | Run search |
---|---|---|
[Keyword: Proteomics] AND [All Categories: Bioinformatics & Computational Biol... (23) | 28 Mar 2025 | Run |
[Keyword: Permeation] AND [All Categories: Analytical Chemistry] (1) | 28 Mar 2025 | Run |
[Keyword: Statistics] AND [All Categories: Quantum Mechanics and Quantum Information] (3) | 28 Mar 2025 | Run |
[Keyword: Statistics] AND [All Categories: Entrepreneurship] (1) | 28 Mar 2025 | Run |
[Keyword: Microprobe] AND [All Categories: Physical Chemistry] (1) | 28 Mar 2025 | Run |
You do not have any saved searches
Two-dimensional gel electrophoresis (2DGE) images are an important support for the analysis of proteins in proteomics. The registration of 2DGE images is considered as one of key elements in protein identification while it is a difficult problem. This paper proposes a new accurate nonlinear registration approach for 2DGE images, based on the exploitation of both spot distance measure and spot intensity. The method consists of three steps: multi-resolution affine registration, spot pairing and thin-plate spline interpolation. The results on both simulated and real gel images show that the proposed method significantly improves registration accuracy in comparison with thin-plate spline registration techniques.
Proteomics research programs typically comprise the identification of protein content of any given cell, their isoforms, splice variants, post-translational modifications, interacting partners and higher-order complexes under different conditions. These studies present significant analytical challenges owing to the high proteome complexity and the low abundance of the corresponding proteins, which often requires highly sensitive and resolving techniques. Mass spectrometry plays an important role in proteomics and has become an indispensable tool for molecular and cellular biology. However, the analysis of mass spectrometry data can be a daunting task in view of the complexity of the information to decipher, the accuracy and dynamic range of quantitative analysis, the availability of appropriate bioinformatics software and the overwhelming size of data files. The past ten years have witnessed significant technological advances in mass spectrometry-based proteomics and synergy with bioinformatics is vital to fulfill the expectations of biological discovery programs. We present here the technological capabilities of mass spectrometry and bioinformatics for mining the cellular proteome in the context of discovery programs aimed at trace-level protein identification and expression from microgram amounts of protein extracts from human tissues.
As whole genome sequences continue to expand in number and complexity, effective methods for comparing and categorizing both genes and species represented within extremely large datasets are required. Methods introduced to date have generally utilized incomplete and likely insufficient subsets of the available data. We have developed an accurate and efficient method for producing robust gene and species phylogenies using very large whole genome protein datasets. This method relies on multidimensional protein vector definitions supplied by the singular value decomposition (SVD) of a large sparse data matrix in which each protein is uniquely represented as a vector of overlapping tetrapeptide frequencies. Quantitative pairwise estimates of species similarity were obtained by summing the protein vectors to form species vectors, then determining the cosines of the angles between species vectors. Evolutionary trees produced using this method confirmed many accepted prokaryotic relationships. However, several unconventional relationships were also noted. In addition, we demonstrate that many of the SVD-derived right basis vectors represent particular conserved protein families, while many of the corresponding left basis vectors describe conserved motifs within these families as sets of correlated peptides (copeps). This analysis represents the most detailed simultaneous comparison of prokaryotic genes and species available to date.
The precise excision of introns from mRNAs is executed by the spliceosome, a cellular machinery composed by five small nuclear RNAs and hundreds of proteins. In the last few years, several groups have used proteomics and computational biology tools to characterize the components of the human spliceosome. These reports have identified basically all known splicing factors and several new proteins. The composition of the human spliceosome confirms the link between splicing and other steps in gene expression. Here we comment on these reports and discuss the perspectives for the coming years.
In proteomics, tandem mass spectrometry is the key technology for peptide sequencing. However, partially due to the deficiency of peptide identification software, a large portion of the tandem mass spectra are discarded in almost all proteomics centers because they are not interpretable. The problem is more acute with the lower quality data from low end but more popular devices such as the ion trap instruments.
In order to deal with the noisy and low quality data, this paper develops a systematic machine learning approach to construct a robust linear scoring function, whose coefficients are determined by a linear programming. A prototype, PRIMA, was implemented. When tested with large benchmarks of varying qualities, PRIMA consistently has higher accuracy than commonly used software MASCOT, SEQUEST and X! Tandem.
Cisplatin-induced drug resistance is known to involve a complex set of cellular changes whose molecular mechanism details remain unclear. In this study, we developed a systems biology approach to examine proteomics- and network-level changes between cisplatin-resistant and cisplatin-sensitive cell lines. This approach involves experimental investigation of differential proteomics profiles and computational study of activated enriched proteins, protein interactions, and protein interaction networks. Our experimental platform is based on a Label-free liquid Chromatography/mass spectrometry proteomics platform. Our computational methods start with an initial list of 119 differentially expressed proteins. We expanded these proteins into a cisplatin-resistant activated sub-network using a database of human protein-protein interactions. An examination of network topology features revealed the activated responses in the network are closely coupled. By examining sub-network proteins using gene ontology categories, we found significant enrichment of proton-transporting ATPase and ATP synthase complexes activities in cisplatin-resistant cells in the form of cooperative down-regulations. Using two-dimensional visualization matrixes, we further found significant cascading of endogenous, abiotic, and stress-related signals. Using a visual representation of activated protein categorical sub-networks, we showed that molecular regulation of cell differentiation and development caused by responses to proteome-wide stress as a key signature to the acquired drug resistance.
Determining glycan structures is vital to comprehend cell-matrix, cell-cell, and even intracellular biological events. Glycan sequencing, which determines the primary structure of a glycan using tandem mass spectrometry (MS/MS), remains one of the most important tasks in proteomics. Analogous to peptide de novo sequencing, glycan de novo sequencing determines the structure without the aid of a known glycan database. We show in this paper that glycan de novo sequencing is NP-hard. We then provide a heuristic algorithm and develop a software program to solve the problem in practical cases. Experiments on real MS/MS data of glycopeptides demonstrate that our heuristic algorithm gives satisfactory results on practical data.
Tandem mass spectrometry (MS/MS) combined with protein database searching has been widely used in protein identification. A validation procedure is generally required to reduce the number of false positives. Advanced tools using statistical and machine learning approaches may provide faster and more accurate validation than manual inspection and empirical filtering criteria. In this study, we use two feature selection algorithms based on random forest and support vector machine to identify peptide properties that can be used to improve validation models. We demonstrate that an improved model based on an optimized set of features reduces the number of false positives by 58% relative to the model which used only search engine scores, at the same sensitivity score of 0.8. In addition, we develop classification models based on the physicochemical properties and protein sequence environment of these peptides without using search engine scores. The performance of the best model based on the support vector machine algorithm is at 0.8 AUC, 0.78 accuracy, and 0.7 specificity, suggesting a reasonably accurate classification. The identified properties important to fragmentation and ionization can be either used in independent validation tools or incorporated into peptide sequencing and database search algorithms to improve existing software programs.
In proteomics, useful signal may be unobserved or lost due to the lack of confident peptide-spectral matches. Selection of differential spectra, followed by associative peptide/protein mapping may be a complementary strategy for improving sensitivity and comprehensiveness of analysis (spectra-first paradigm). This approach is complementary to the standard approach where functional analysis is performed only on the finalized protein list assembled from identified peptides from the spectra (protein-first paradigm). Based on a case study of renal cancer, we introduce a simple spectra-binning approach, MZ-bin. We demonstrate that differential spectra feature selection using MZ-bin is class-discriminative and can trace relevant proteins via spectra associative mapping. Moreover, proteins identified in this manner are more biologically coherent than those selected directly from the finalized protein list. Analysis of constituent peptides per protein reveals high expression inconsistency, suggesting that the measured protein expressions are in fact, poor approximations of true protein levels. Moreover, analysis at the level of constituent peptides may provide higher resolution insight into the underlying biology: Via MZ-bin, we identified for the first time differential splice forms for the known renal cancer marker MAPT. We conclude that the spectra-first analysis paradigm is a complementary strategy to the traditional protein-first paradigm and can provide deeper level insight.
Identifying reproducible yet relevant features is a major challenge in biological research. This is well documented in genomics data. Using a proposed set of three reliability benchmarks, we find that this issue exists also in proteomics for commonly used feature-selection methods, e.g. t-test and recursive feature elimination. Moreover, due to high test variability, selecting the top proteins based on p-value ranks — even when restricted to high-abundance proteins — does not improve reproducibility. Statistical testing based on networks are believed to be more robust, but this does not always hold true: The commonly used hypergeometric enrichment that tests for enrichment of protein subnets performs abysmally due to its dependence on unstable protein pre-selection steps. We demonstrate here for the first time the utility of a novel suite of network-based algorithms called ranked-based network algorithms (RBNAs) on proteomics. These have originally been introduced and tested extensively on genomics data. We show here that they are highly stable, reproducible and select relevant features when applied to proteomics data. It is also evident from these results that use of statistical feature testing on protein expression data should be executed with due caution. Careless use of networks does not resolve poor-performance issues, and can even mislead. We recommend augmenting statistical feature-selection methods with concurrent analysis on stability and reproducibility to improve the quality of the selected features prior to experimental validation.
Proteomic challenges, stirred up by the advent of high-throughput technologies, produce large amount of MS data. Nowadays, the routine manual search does not satisfy the “speed” of modern science any longer. In our work, the necessity of single-thread analysis of bulky data emerged during interpretation of HepG2 proteome profiling results for proteoforms searching. We compared the contribution of each of the eight search engines (X!Tandem, MS-GF+, MS Amanda, MyriMatch, Comet, Tide, Andromeda, and OMSSA) integrated in an open-source graphical user interface SearchGUI (http://searchgui.googlecode.com) into total result of proteoforms identification and optimized set of engines working simultaneously. We also compared the results of our search combination with Mascot results using protein kit UPS2, containing 48 human proteins. We selected combination of X!Tandem, MS-GF+ and OMMSA as the most time-efficient and productive combination of search. We added homemade java-script to automatize pipeline from file picking to report generation. These settings resulted in rise of the efficiency of our customized pipeline unobtainable by manual scouting: the analysis of 192 files searched against human proteome (42153 entries) downloaded from UniProt took 11h.
Reverse Phase Protein Arrays (RPPA) is a high-throughput technology used to profile levels of protein expression. Handling the large datasets generated by RPPA can be facilitated by appropriate software tools. Here, we describe RPPAware, a free and intuitive software suite that was developed specifically for analysis and visualization of RPPA data. RPPAware is a portable tool that requires no installation and was built using Java. Many modules of the tool invoke R to utilize the statistical features. To demonstrate the utility of RPPAware, data generated from screening brain regions of a mouse model of Down syndrome with 62 antibodies were used as a case study. The ease of use and efficiency of RPPAware can accelerate data analysis to facilitate biological discovery. RPPAware 1.0 is freely available under GNU General Public License from the project website at http://downsyndrome.ucdenver.edu/iddrc/rppaware/home.htm along with a full documentation of the tool.
Functional Class Scoring (FCS) is a network-based approach previously demonstrated to be powerful in missing protein prediction (MPP). We update its performance evaluation using data derived from new proteomics technology (SWATH) and also checked for reproducibility using two independent datasets profiling kidney tissue proteome. We also evaluated the objectivity of the FCS p-value, and followed up on the value of MPP from predicted complexes. Our results suggest that (1) FCS p-values are non-objective, and are confounded strongly by complex size, (2) best recovery performance do not necessarily lie at standard p-value cutoffs, (3) while predicted complexes may be used for augmenting MPP, they are inferior to real complexes, and are further confounded by issues relating to network coverage and quality and (4) moderate sized complexes of size 5 to 10 still exhibit considerable instability, we find that FCS works best with big complexes. While FCS is a powerful approach, blind reliance on its non-objective p-value is ill-advised.
Research suggests that individuals who experience prolonged exposure to stress may be at higher risk for developing psychological stress disorders. Currently, psychological stress is primarily evaluated by professional physicians using rating scales, which may be prone to subjective biases and limitations of the scales. Therefore, it is imperative to explore more objective, accurate, and efficient biomarkers for evaluating the level of psychological stress in an individual. In this study, we utilized 4D data-independent acquisition (4D-DIA) proteomics for quantitative protein analysis, and then employed support vector machine (SVM) combined with SHAP interpretation algorithm to identify potential biomarkers for psychological stress levels. Biomarkers validation was subsequently achieved through machine learning classification and a substantial amount of a priori knowledge derived from the knowledge graph. We performed cross-validation of the biomarkers using two batches of data, and the results showed that the combination of Glyceraldehyde-3-phosphate dehydrogenase and Fibronectin yielded an average area under the curve (AUC) of 92%, an average accuracy of 86%, an average F1 score of 79%, and an average sensitivity of 83%. Therefore, this combination may represent a potential approach for detecting stress levels to prevent psychological stress disorders.
Determining the structure of a protein is not an easy task, which usually involved a time-consuming and costly process in the web lab. Using computational methods to predict a protein's tertiary structure from its primary structure (the amino acid sequence) is desirable. Disordered regions are segments of a protein that do not have a fixed conformation, which makes the structure prediction harder. Also, these disordered regions are functionally important for a protein. In this research, we would like to identify such regions with a focus on selecting a proper feature set. Three feature selection methods, namely F-score, information gain (IG), and k-medoids clustering, are used for feature selection. The support vector machine (SVM) is then used for classification. The results show that the classification accuracy can be raised with a smaller feature set. The k-medoids clustering feature selection can reduce the number of features from 440 to 150 and improve the accuracy from 84.66 to 86.81% in five-fold cross validation. It also has a more stable performance than F-score and IG.
Integration of transcriptomic and proteomic data should reveal multi-layered regulatory processes governing cancer cell behaviors. Traditional correlation-based analyses have demonstrated limited ability to identify the post-transcriptional regulatory (PTR) processes that drive the non-linear relationship between transcript and protein abundances. In this work, we ideate an integrative approach to explore the variety of post-transcriptional mechanisms that dictate relationships between genes and corresponding proteins. The proposed workflow utilizes the intuitive technique of scatterplot diagnostics or scagnostics, to characterize and examine the diverse scatterplots built from transcript and protein abundances in a proteogenomic experiment. The workflow includes representing gene-protein relationships as scatterplots, clustering on geometric scagnostic features of these scatterplots, and finally identifying and grouping the potential gene-protein relationships according to their disposition to various PTR mechanisms. Our study verifies the efficacy of the implemented approach to excavate possible regulatory mechanisms by utilizing comprehensive tests on a synthetic dataset. We also propose a variety of 2D pattern-specific downstream analyses methodologies such as mixture modeling, and mapping miRNA post-transcriptional effects to explore each mechanism further. This work suggests that the proposed methodology has the potential for discovering and categorizing post-transcriptional regulatory mechanisms, manifesting in proteogenomic trends. These trends subsequently provide evidence for cancer specificity, miRNA targeting, and identification of regulation impacted by biological functionality and different types of degradation. (Supplementary Material - https://github.com/arunima2/PTRE_PSB_2020)
Molecular mechanisms characterizing cancer development and progression are complex and process through thousands of interacting elements in the cell. Understanding the underlying structure of interactions requires the integration of cellular networks with extensive combinations of dysregulation patterns. Recent pan-cancer studies focused on identifying common dysregulation patterns in a confined set of pathways or targeting a manually curated set of genes. However, the complex nature of the disease presents a challenge for finding pathways that would constitute a basis for tumor progression and requires evaluation of subnetworks with functional interactions. Uncovering these relationships is critical for translational medicine and the identification of future therapeutics. We present a frequent subgraph mining algorithm to find functional dysregulation patterns across the cancer spectrum. We mined frequent subgraphs coupled with biased random walks utilizing genomic alterations, gene expression profiles, and protein-protein interaction networks. In this unsupervised approach, we have recovered expert-curated pathways previously reported for explaining the underlying biology of cancer progression in multiple cancer types. Furthermore, we have clustered the genes identified in the frequent subgraphs into highly connected networks using a greedy approach and evaluated biological significance through pathway enrichment analysis. Gene clusters further elaborated on the inherent heterogeneity of cancer samples by both suggesting specific mechanisms for cancer type and common dysregulation patterns across different cancer types. Survival analysis of sample level clusters also revealed significant differences among cancer types (p < 0.001). These results could extend the current understanding of disease etiology by identifying biologically relevant interactions.
Supplementary Information: Supplementary methods, figures, tables and code are available at https://github.com/bebeklab/FSM_Pancancer.
Subcellular protein localization is important for understanding functional states of cells, but measuring and quantifying this information can be difficult and typically requires high-resolution microscopy. In this work, we develop a metric to define surface protein polarity from immunofluorescence (IF) imaging data and use it to identify distinct immune cell states within tumor microenvironments. We apply this metric to characterize over two million cells across 600 patient samples and find that cells identified as having polar expression exhibit characteristics relating to tumor-immune cell engagement. Additionally, we show that incorporating these polarity-defined cell subtypes improves the performance of deep learning models trained to predict patient survival outcomes. This method provides a first look at using subcellular protein expression patterns to phenotype immune cell functional states with applications to precision medicine.
De novo peptide sequencing that determines the amino acid sequence of a peptide via tandem mass spectrometry (MS/MS) has been increasingly used nowadays in proteomics for protein identification. Current de novo methods generally employ a graph theory which usually produces a large number of candidate sequences and causes heavy computational cost while trying to determine a sequence with less ambiguity. In this paper, we will present an efficient and effective de novo sequencing algorithm which greatly reduces the number of candidate sequences. By utilizing certain properties of b- and y-ion series in MS/MS spectrum, we first propose a reliable two-way parallel searching algorithm to filter out the peptide candidates which are further pruned by an intensity evidence based screening criterion. Then, the best candidate is singled out using a scoring function by consideration of total intensity evidence within certain local region. Results of our algorithm were compared with those of PEAKS, a well-known de novo sequencing software. Experimental results demonstrated the efficiency and potency of our approach.
MALDI-based Imaging Mass Spectrometry (IMS) is an analytical technique that provides the opportunity to study the spatial distribution of biomolecules including proteins and peptides in organic tissue. IMS measures a large collection of mass spectra spread out over an organic tissue section and retains the absolute spatial location of these measurements for analysis and imaging. The classical approach to IMS imaging, producing univariate ion images, is not well suited as a first step in a prospective study where no a priori molecular target mass can be formulated. The main reasons for this are the size and the multivariate nature of IMS data. In this paper we describe the use of principal component analysis as a multivariate pre-analysis tool, to identify the major spatial and mass-related trends in the data and to guide further analysis downstream. First, a conceptual overview of principal component analysis for IMS is given. Then, we demonstrate the approach on an IMS data set collected from a transversal section of the spinal cord of a standard control rat.
Please login to be able to save your searches and receive alerts for new content matching your search criteria.