Quantitative biomedical data analysis is a fast-growing interdisciplinary area of applied and computational mathematics, statistics, computer science, and biomedical science, leading to new fields such as bioinformatics, biomathematics, and biostatistics. In addition to traditional statistical techniques and mathematical models using differential equations, new developments with a very broad spectrum of applications, such as wavelets, spline functions, curve and surface subdivisions, sampling, and learning theory, have found their mathematical home in biomedical data analysis.
This book gives a new and integrated introduction to quantitative medical data analysis from the viewpoint of biomathematicians, biostatisticians, and bioinformaticians. It offers a definitive resource to bridge the disciplines of mathematics, statistics, and biomedical sciences. Topics include mathematical models for cancer invasion and clinical sciences, data mining techniques and subset selection in data analysis, survival data analysis and survival models for cancer patients, statistical analysis and neural network techniques for genomic and proteomic data analysis, and wavelet and spline applications for mass spectrometry data preprocessing and statistical computing.
Sample Chapter(s)
Chapter 1: An Overview on Variable Selection for Longitudinal Data (200 KB)
https://doi.org/10.1142/9789812772121_fmatter
https://doi.org/10.1142/9789812772121_0001
During the past two decades, there have been many new developments in longitudinal data analysis. Researchers have put considerable effort into developing diverse models, along with inference procedures, for longitudinal data. More recently, researchers in longitudinal modeling have begun addressing the vital issue of variable selection. Model selection criteria such as AIC, BIC, Cp, LASSO and SCAD can be extended to longitudinal data, although care is required to adapt the classical ideas and formulas to deal with within-subject correlation. This chapter presents a review of recent developments in variable selection criteria for longitudinal data.
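As a rough illustration of the classical criteria mentioned in this abstract, the sketch below computes AIC and BIC in their standard cross-sectional forms, AIC = -2 log L + 2k and BIC = -2 log L + k log n. It is not the chapter's longitudinal adaptation. The candidate models, their log-likelihoods, and the choice of the number of subjects as the sample size in BIC are all assumptions made for illustration; with correlated repeated measurements, the appropriate sample size is one of the points that requires care.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: -2 log L + 2k."""
    return -2.0 * log_likelihood + 2.0 * n_params


def bic(log_likelihood, n_params, n_effective):
    """Bayesian information criterion: -2 log L + k log n.

    n_effective is the sample size used in the penalty. For longitudinal
    data this choice is not obvious (subjects vs. total observations);
    here it is simply passed in by the caller.
    """
    return -2.0 * log_likelihood + n_params * math.log(n_effective)


# Hypothetical fitted models: name -> (maximized log-likelihood, parameter count).
candidates = {"small": (-120.0, 3), "large": (-118.5, 6)}
n_subjects = 50  # assumed effective sample size

# Smaller criterion values are preferred.
best = min(candidates, key=lambda name: bic(*candidates[name], n_subjects))
print(best)
```

The larger model fits slightly better here, but the BIC penalty on its three extra parameters outweighs the gain in log-likelihood.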
https://doi.org/10.1142/9789812772121_0002
In statistics, model selection has a long-standing history, and new results in this area keep coming. At this point, it is nearly impossible and not helpful to give a comprehensive survey of all the available theorems. We take a particular angle: where more results are likely to be generated. Our perspective is based on some recent interesting findings in applied mathematics; namely, in some cases a subset of NP-hard problems can be solved effectively by convex optimization approaches, which require only polynomial time. We discuss the potential of this approach. For readers who would like to know more about the existing ideas in model selection, we provide a summary at the end.
https://doi.org/10.1142/9789812772121_0003
This article illustrates how to develop state space models for the AIDS epidemic in homosexual populations. A generalized Bayesian procedure is proposed to estimate the unknown parameters and the state variables. As an application, the model and the method are applied to the AIDS incidence data of homosexual and bisexual men in Switzerland. The analysis of these data clearly indicates that the model and methods can solve many difficult problems that cannot be handled by other currently available models and approaches.
https://doi.org/10.1142/9789812772121_0004
By surveying recent studies by molecular biologists and cancer geneticists, in this chapter we have proposed general stochastic models of carcinogenesis and provided biological evidence for these models. Because most of these models are quite complicated, far beyond the scope of the MVK two-stage model, the traditional Markov theory approach becomes too complicated to obtain analytical results. To develop these stochastic models, in this chapter we thus propose an alternative approach through stochastic differential equations. Given observed cancer incidence data, we further combine these stochastic models with statistical models to develop state space models for carcinogenesis. By using these state space models, we then develop a generalized Bayesian procedure to estimate the unknown parameters and to predict state variables via multi-level Gibbs sampling procedures. In this chapter we have used the multi-event model as an example to illustrate our modeling approach and some basic theories.
https://doi.org/10.1142/9789812772121_0005
Outcome-dependent sampling (ODS) is a cost-effective way to enhance study efficiency. The case-control design for binary outcomes is a mainstay of epidemiology research. As the field of epidemiology expands and evolves, an increasing number of studies are conducted using the ODS design with a “continuous” outcome. In an ODS design, observations made on a judiciously chosen subset of the base population can provide nearly the same statistical efficiency as observing the entire base population. Different statistical inference procedures are needed in order to reap the benefits of such sampling. We review recently developed methods that account for the ODS design. These methods are all semi-parametric approaches.
https://doi.org/10.1142/9789812772121_0006
The high-throughput capabilities of protein mass fingerprint measurements have made mass spectrometry one of the standard tools for proteomic research, such as biomarker discovery. However, the analysis of large raw data sets produced by time-of-flight (TOF) spectrometers creates a bottleneck in the discovery process. One specific challenge is the preprocessing and identification of mass peaks corresponding to important biological molecules. The accuracy of mass assignment is another limitation when comparing mass fingerprints with databases. Under survey conditions, where the positions of the desired mass peaks are not known beforehand, a TOF instrument requires a peak-picking procedure to distinguish mass peaks from a slowly varying background. We have developed an automated peak identification algorithm based on a maximum likelihood approach that effectively and efficiently detects peaks in a TOF spectrum. This approach produces maximum likelihood estimates of peak positions and intensities, and simultaneously develops estimates of the uncertainties in each of these quantities. Shifts in arrival time of the same peak in different spectra have been observed. Using the quantities from this peak detection procedure, different spectra can be brought into alignment.
https://doi.org/10.1142/9789812772121_0007
Microarray technology has advanced genomic research. Among various platforms, Affymetrix gene chips have been the most widely used to study thousands of genes simultaneously through mRNA expression. Analysis of Affymetrix gene expression data requires multiple steps, including data quality assessment, gene selection, and gene function classification. We describe a 2D image plot approach to assess data quality by examining array comparability. This approach uses a percentile method to group data, and then applies the 2D image plot to display the grouped microarray data with an invariant band to quantify degrees of array comparability. The method provides an efficient way of visually identifying incomparable arrays. Next, we describe a probe rank approach to selecting differentially-expressed genes. The probe rank approach uses rank scores to normalize and analyze probe intensity to control for probe effect, and uses a filter of percentage of probe fold change to account for cross-hybridization and alternative splicing. In the gene function classification, we describe an integrated bioinformatics tool to organize the genomic information of selected genes systematically so that their functional information is readily available for search objectives. The tool integrates a series of major genomic databases, such as Affymetrix's NetAffx Analysis center and Entrez Gene database. The tool classifies genes and generates readable web-based outputs for investigators to easily associate significant genes with biological pathways.
https://doi.org/10.1142/9789812772121_0008
High-throughput mass spectrometry (MS) has been motivated greatly by recent developments in both chemistry and biology. Its technology has been extended to proteomics as a tool for rapid protein identification and is emerging as a leading technology in the proteomics revolution. However, key challenges still remain in the processing of proteomic MS data. It is essential to develop a comprehensive set of mathematical and computational tools for proteomic MS data analysis. The processing goal is to effectively and correctly obtain the true information from the raw MS data for further statistical analysis. To provide a final peak list for future statistical analysis, the whole processing procedure usually takes the following steps: data registration (calibration), de-noising (smoothing), baseline correction, normalization, peak detection, and peak alignment (binning). In this chapter, a wavelet-based approach for data denoising is discussed and a so-called projecting spectrum binning (PSB) method for cross-sample peak alignment in proteomic MS data is introduced. Applications to real MS datasets from different cancer research projects at the Vanderbilt Ingram Cancer Center show that the approach is efficient and satisfactory.
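To give a flavor of what wavelet-based de-noising involves, here is a minimal sketch. The abstract does not say which wavelet, decomposition depth, or thresholding rule the chapter uses, so the single-level Haar transform, the soft-thresholding rule, the threshold value, and the toy intensity values below are all assumptions chosen for brevity, not the chapter's method.

```python
import math

_S = 1.0 / math.sqrt(2.0)  # normalization for the orthonormal Haar transform


def haar_forward(signal):
    """One level of the Haar transform on an even-length sequence."""
    approx = [(signal[i] + signal[i + 1]) * _S for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * _S for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Invert haar_forward, interleaving the reconstructed samples."""
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) * _S)
        out.append((a - d) * _S)
    return out


def soft_threshold(value, threshold):
    """Shrink a coefficient toward zero; small coefficients become zero."""
    return math.copysign(max(abs(value) - threshold, 0.0), value)


def denoise(signal, threshold):
    """Shrink the detail coefficients, keep the approximation, reconstruct."""
    approx, detail = haar_forward(signal)
    detail = [soft_threshold(d, threshold) for d in detail]
    return haar_inverse(approx, detail)


# Toy intensities (made up): small jitter around 10, then a sharp rise.
noisy = [10.1, 9.9, 10.2, 9.8, 10.0, 30.0]
smoothed = denoise(noisy, threshold=0.5)
```

Small detail coefficients, which mostly reflect sample-to-sample jitter, are set to zero, while the large coefficient at the sharp rise is only shrunk, so the feature survives the smoothing.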
https://doi.org/10.1142/9789812772121_0009
Asthma remains one of the most common chronic childhood illnesses and a leading cause of hospital admissions. Our clinical objective was to assess the effect of gender, birth characteristics and neonatal respiratory disorders on pre-school asthma rates of (i) hospitalization and (ii) days hospitalized. The proportional rates (PR) model is a flexible adaptation to recurrent event data of the well-known Cox proportional hazards model. The PR model is related to Poisson regression, but relaxes the often untenable assumption that the events within a subject are independent. Despite having been originally proposed over a decade ago, examples of the use of the proportional rates model in the medical literature are quite rare. Moreover, little attention has been devoted to the extension of the PR model to accommodate covariate effects which vary over time. We evaluate the non-proportional rates model through simulation. We then apply the non-PR model to asthma data from a retrospective birth cohort study.
https://doi.org/10.1142/9789812772121_0010
Lung cancer is the most frequently occurring fatal cancer in the United States. By assuming a form for the hazard function for a group of lung cancer patients in a survival study, the covariates in the hazard function are estimated by maximum likelihood estimation following the proportional hazards regression analysis. Although the proportional hazards model does not give an explicit baseline hazard function, the function can be estimated by fitting the data with a nonlinear least-squares technique. The survival model is then examined by a neural network simulation. The neural network learns the survival pattern from available hospital data and gives survival predictions for random covariate combinations. The simulation results support the covariate estimation in the survival model.
https://doi.org/10.1142/9789812772121_0011
Some nonparametric regression techniques for estimating hazard or log-hazard functions and functional forms of covariate effects in Cox's proportional hazard model are introduced. Some nonparametric and semi-parametric regression models for a conditional hazard function are discussed as alternatives to the proportional hazard model.
https://doi.org/10.1142/9789812772121_0012
Boundary value problems in PDEs usually require determination of the eigenvalues and Fourier coefficients for a series, the latter of which are often intractable. A method was found that simplified both analytic and numeric solutions for Fourier coefficients based on the slope of the eigenvalue function at each eigenvalue (eigenslope). Analytic solutions by the eigenslope method resulted in the same solutions, albeit in different form, as other methods. Numerical solutions obtained by calculating the slope of the eigenvalue function at each root (hand graphing, Euler's, Runge-Kutta, and others) also matched. The method applied to all classes of separable PDEs (parabolic, hyperbolic, and elliptic), orthogonal (Sturm-Liouville) or non-orthogonal expansions, and to complex eigenvalues. As an example, the widespread assumption of uniform capacitance was tested. An analytic model of cylindrical brain cell structures with an exponential distribution of membrane capacitance was developed with the eigenslope method. The stimulus-response properties of the models were compared under different configurations and shown to fit experimental data from dendritic neurons. We addressed the long-standing question of whether the amount of variation of membrane capacitance measured in experimental studies is sufficient to markedly alter the vital neuron characteristic of passive signal propagation. We concluded that the degree of membrane capacitance variation measured in cells does not alter electrical responses at levels that are physiologically significant. The widespread assumption of uniform membrane capacitance is likely to be a valid approximation.
https://doi.org/10.1142/9789812772121_0013
A mathematical model for non-invasive pressure support ventilation (NIPSV) is presented. The model consists of two differential equations describing the volume in a one-compartment lung. In this study, we used the model to simulate spontaneously breathing patients with obstructive lung disease undergoing pressure support ventilation at 22 cmH2O and PEEP of 5 cmH2O. NIPSV can give rise to unintended instability with potential adverse effects. Tidal volume instability is defined as a situation in which the tidal volume delivered over 100 consecutive breaths has a coefficient of variation higher than 10%, or a skipped breath occurs. To explore the tidal volume instability, we investigated the variability of tidal volume (VT) delivery during NIPSV under combinations of respiratory resistance, R = 10, 15, 20 and 25 cmH2O/L/s, compliance, C = 0.06, 0.08, 0.10 and 0.12 L/cmH2O, and frequency, f = 14, 16, 18, 20, and 22 breaths/min, at inspiratory flow cut-off levels of 5% to 80% and pressure triggering levels of 1, 3, 5, 10 and 15 cmH2O. We found that lower pressure sensitivity, higher lung compliance, higher flow resistance, and higher breathing frequency increase the likelihood of instability.
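The instability criterion stated in this abstract can be written down directly. The sketch below is one possible reading, not the chapter's code: it assumes the tidal volumes for the breaths in the window are already available as a list in litres, treats a breath with volume at or below a cutoff (zero by default) as "skipped", and uses the sample standard deviation for the coefficient of variation. Those three choices, and the function names, are assumptions.

```python
from statistics import mean, stdev


def coefficient_of_variation(volumes):
    """Sample standard deviation divided by the mean."""
    return stdev(volumes) / mean(volumes)


def is_unstable(volumes, cv_limit=0.10, skipped_at_or_below=0.0):
    """Apply the two-part criterion to one window of breaths.

    volumes: tidal volume per breath (the abstract uses 100 consecutive breaths).
    cv_limit: the 10% coefficient-of-variation limit from the abstract.
    skipped_at_or_below: assumed cutoff for calling a breath skipped.
    """
    if len(volumes) < 2:
        raise ValueError("need at least two breaths")
    # A skipped breath makes the window unstable regardless of the CV.
    if any(v <= skipped_at_or_below for v in volumes):
        return True
    return coefficient_of_variation(volumes) > cv_limit


# Toy windows of 100 breaths each (values are made up).
steady = [0.50] * 100
alternating = [0.40, 0.60] * 50
one_skipped = [0.50] * 99 + [0.0]

print(is_unstable(steady), is_unstable(alternating), is_unstable(one_skipped))
```

Checking for a skipped breath first also avoids dividing by a mean of zero when every breath in the window is skipped.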
https://doi.org/10.1142/9789812772121_0014
This paper discusses mathematical models dealing with the growth of solid tumors. Tumor growth is a very complex process, involving many different phenomena that occur at different scales: subcellular, cellular, and extracellular. We survey models that address the problem at each of these scales. We then discuss multi-scale models and the unification of model results from different scales.
https://doi.org/10.1142/9789812772121_0015
Artificial Neural Networks are models of interacting neurons that can be used as classifiers with large data sets. They can also be used for feature extraction and for reducing the dimensionality of large data sets. Dendritic electrotonic models can be used to suggest more robust artificial neural network models that are amenable to data mining and feature extraction.
https://doi.org/10.1142/9789812772121_0016
Multifractality present in high-frequency pupil diameter measurements, usually connected with the irregular scaling behavior and self-similarity, is modeled with statistical accuracy and discriminatory power. The Multifractal Discrimination Model (MDM) is proposed to determine ocular pathology based on the pupillary response behavior (PRB) exhibited by older adults with and without ocular disease during the performance of a computer-based task. The MDM consists of two parts: (1) a discriminatory summary of the multifractal spectrum and (2) a combined k-nearest-neighbor classifier. The multifractal spectrum is used to discriminate the PRB from four groups of older adult users, differing in ocular pathology. Spectral Mode, Broadness, and left Slope (the M.B.S. summary), three measures characterizing the multifractal spectrum of observations, are proposed as distinguishing features of PRB across the groups. The combined k-nearest neighbor classifier is shown to be a valid classifier for the accurate prediction of ocular pathology from the PRB measurements.
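For readers unfamiliar with the classifier family named in this abstract, the sketch below shows a plain k-nearest-neighbor majority vote on three-number feature vectors standing in for the (Mode, Broadness, left Slope) summary. It does not reproduce the chapter's "combined" classifier, whose construction the abstract does not describe. The feature values, the group labels "A" and "B", the Euclidean distance, and k = 3 are all made up for illustration.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Majority vote among the k training points closest to query.

    train: list of (feature_tuple, label) pairs.
    query: feature tuple of the same length.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical (mode, broadness, left slope) summaries for two groups.
train = [
    ((0.90, 0.30, 1.20), "A"),
    ((1.00, 0.35, 1.10), "A"),
    ((0.95, 0.32, 1.15), "A"),
    ((1.50, 0.60, 0.70), "B"),
    ((1.60, 0.65, 0.60), "B"),
    ((1.55, 0.62, 0.65), "B"),
]

print(knn_predict(train, (0.92, 0.31, 1.18)))
```

In practice the features would usually be rescaled before computing distances so that no single measure dominates the vote; that step is omitted here.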
https://doi.org/10.1142/9789812772121_bmatter