There have been major developments in the field of statistics over the last quarter century, spurred by the rapid advances in computing and data-measurement technologies. These developments have revolutionized the field and have greatly influenced research directions in theory and methodology. Increased computing power has spawned entirely new areas of research in computationally-intensive methods, allowing us to move away from narrowly applicable parametric techniques based on restrictive assumptions to much more flexible and realistic models and methods. These computational advances have also led to the extensive use of simulation and Monte Carlo techniques in statistical inference. All of these developments have, in turn, stimulated new research in theoretical statistics.
This volume provides an up-to-date overview of recent advances in statistical modeling and inference. Written by renowned researchers from across the world, it discusses flexible models, semi-parametric methods and transformation models, nonparametric regression and mixture models, survival and reliability analysis, and re-sampling techniques. With its coverage of methodology and theory as well as applications, the book is an essential reference for researchers, graduate students, and practitioners.
Sample Chapter(s)
Chapter 1: Modelling Some Norwegian Soccer Data (341 KB)
https://doi.org/10.1142/9789812708298_fmatter
https://doi.org/10.1142/9789812708298_0001
Results of Norwegian Elite Division soccer games are studied for the year 2003. Previous writers have modelled the number of goals a given team scores in a game and then moved on to evaluating the probabilities of a win, a tie and a loss. However, in this work the probabilities of a win, a tie and a loss are modelled directly. Attempts are made to improve the fit by including various explanatory variables.
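As a rough illustration of modelling the three outcome probabilities directly, a multinomial (softmax) logistic regression of the kind sketched below could be used; the covariates and data here are hypothetical placeholders, not the chapter's actual specification.

    # Sketch: direct modelling of win/tie/loss probabilities with a
    # multinomial logistic regression. Covariates (home-field indicator,
    # strength difference) and outcomes are simulated placeholders.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n_games = 200
    X = np.column_stack([
        rng.integers(0, 2, n_games),    # home-field indicator
        rng.normal(0.0, 1.0, n_games),  # difference in team strength
    ])
    y = rng.choice(["loss", "tie", "win"], size=n_games)  # simulated outcomes

    model = LogisticRegression(max_iter=1000).fit(X, y)
    print(model.predict_proba(X[:5]))  # columns: P(loss), P(tie), P(win)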
https://doi.org/10.1142/9789812708298_0002
The objects studied in survival and event history analysis are stochastic phenomena developing over time. It is therefore natural to use the highly developed theory of stochastic processes. We argue that this theory should be used more in event history analysis. Some specific examples are treated: Markov chains, martingale-based counting processes, birth type processes, diffusion processes and Lévy processes. Some less well known applications are given, with the internal memory of the process as a connecting issue.
https://doi.org/10.1142/9789812708298_0003
Doksum and Gasko (1990) described a one-to-one correspondence between regression models for binary outcomes and those for continuous-time survival analyses. This correspondence has been exploited heavily in the analysis of current status data (Jewell and van der Laan, 2004; Shiboski, 1998). Here, we explore similar correspondences for complex survival models and categorical regression models for polytomous data. We include discussion of competing risks and progressive multi-state survival random variables.
https://doi.org/10.1142/9789812708298_0004
Standard use of Cox's regression model and other relative risk regression models for censored survival data requires collection of covariate information on all individuals under study, even when only a small fraction of them die or become diseased. For such situations, risk set sampling designs offer useful alternatives. For cohort data, methods based on martingale residuals are useful for assessing the fit of a model. Here we introduce grouped martingale residual processes for sampled risk set data, and show that plots of these processes provide a useful tool for checking model fit. Further, we study the large sample properties of the grouped martingale residual processes, and use these to derive a formal goodness-of-fit test to go along with the plots. The methods are illustrated using data on lung cancer deaths in a cohort of uranium miners.
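For orientation, the cohort-data martingale residual process that the grouped versions build on has the standard Cox-model form (the sampled-risk-set analogue studied in the chapter differs in its details):

    M_i(t) = N_i(t) − ∫_0^t Y_i(s) exp(β′Z_i) dΛ_0(s),

where N_i counts observed failures for individual i, Y_i is the at-risk indicator and Z_i the covariate vector; the residual process plugs in the estimated regression coefficients and cumulative baseline hazard, and a grouped process sums these residuals over a group of individuals.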
https://doi.org/10.1142/9789812708298_0005
The aim of this paper is to create a platform for developing an interface between the mathematical theory of reliability and the mathematics of finance. This we are able to do because there exists an isomorphic relationship between the survival function of reliability, and the asset pricing formula of fixed income investments. This connection suggests that the exponentiation formula of reliability theory and survival analysis be reinterpreted from a more encompassing perspective, namely, as the law of a diminishing resource. The isomorphism also helps us to characterize the asset pricing formula in non-parametric classes of functions, and to obtain its crossing properties. The latter provides bounds and inequalities on investment horizons. More generally, the isomorphism enables us to expand the scope of mathematical finance and of mathematical reliability by importing ideas and techniques from one discipline to the other. As an example of this interchange we consider interest rate functions that are determined up to an unknown constant so that the set-up results in a Bayesian formulation. We may also model interest rates as “shot-noise processes”, often used in reliability, and conversely, the failure rate function as a Lévy process, popular in mathematical finance. A consideration of the shot noise process for modelling interest rates appears to be new.
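The isomorphism referred to can be sketched in its most basic form (stated here only for orientation):

    Reliability:    S(t) = P(T > t) = exp( − ∫_0^t r(u) du ),   with r the failure rate function;
    Fixed income:   P(0, t) = exp( − ∫_0^t r(u) du ),           with r the interest rate function;

so the survival function of a lifetime and the price of a default-free zero-coupon bond share the same exponentiation formula, the "law of a diminishing resource" mentioned above.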
https://doi.org/10.1142/9789812708298_0006
The performance (lifetime, failure rate, etc.) of a coherent system in iid components is completely determined by its “signature” and the common distribution of its components. A system's signature, defined as a vector whose ith element is the probability that the system fails upon the ith component failure, was introduced by Samaniego (1985) as a tool for indexing systems in iid components and studying properties of their lifetimes. In this paper, several new applications of the signature concept are developed for the broad class of mixed systems, that is, for stochastic mixtures of coherent systems in iid components. Kochar, Mukerjee and Samaniego (1999) established sufficient conditions on the signatures of two competing systems for the corresponding system lifetimes to be stochastically ordered, hazard-rate ordered or likelihood-ratio ordered, respectively. Partial results are obtained on the necessity of these conditions, but all are shown not to be necessary in general. Necessary and sufficient conditions (NASCs) on signature vectors for each of the three order relations above to hold are then discussed. Examples are given showing that the NASCs can also lead to information about the precise number and locations of crossings of the systems' survival functions or failure rates in (0, ∞) and about intervals over which the likelihood ratio is monotone. New results are established relating the asymptotic behavior of a system's failure rate, and the rate of convergence to zero of a system's survival function, to the signature of the system.
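The representation underlying these results expresses the lifetime distribution of a system in iid components through its signature s = (s_1, …, s_n); in standard form,

    P(T > t) = Σ_{i=1}^n s_i P(X_(i) > t) = Σ_{i=1}^n s_i Σ_{j=0}^{i−1} C(n, j) F(t)^j (1 − F(t))^{n−j},

where X_(1) ≤ … ≤ X_(n) are the ordered component lifetimes, F is the common component distribution and s_i = P(T = X_(i)).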
https://doi.org/10.1142/9789812708298_0007
In this article we review recent work on generalizations of the total time on test transform, and on stochastic orders that are based on these generalizations. Applications in economics, statistics, and reliability theory are described as well.
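For reference, the classical total time on test transform of a life distribution F with finite mean is

    H_F^{-1}(u) = ∫_0^{F^{-1}(u)} (1 − F(x)) dx,   0 ≤ u ≤ 1,

and the generalizations reviewed in the chapter modify this basic construction.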
https://doi.org/10.1142/9789812708298_0008
The unrestricted least squares estimator for the means of a two-way layout is usually inadmissible under quadratic loss and the model of homoscedastic independent Gaussian errors. In statistical practice, this least squares estimator may be modified by fitting hierarchical submodels and, for ordinal factors, by fitting polynomial submodels. ASP, an acronym for Adaptive Shrinkage on Penalty bases, is an estimation (or denoising) strategy that chooses among submodel fits and more general shrinkage or smoothing fits to two-way layouts without assuming that any submodel is true. ASP fits distinguish between ordinal and nominal factors; respect the ANOVA decomposition of means into overall mean, main effects, and interaction terms; and are designed to reduce risk substantially over the unrestricted least squares estimator. Multiparametric asymptotics, in which the number of factor-level pairs tends to infinity, and numerical case studies both support the methodology.
https://doi.org/10.1142/9789812708298_0009
Varying-coefficient partially linear (VCPL) models are very useful tools. This chapter focuses on inferences for the VCPL model when the errors are serially correlated and modeled as an AR process. A penalized spline least squares (PSLS) estimation is proposed based on the penalized spline technique. This approach is then improved by a weighted PSLS estimation. We investigate the asymptotic theory under the assumption that the number of knots is fixed, though potentially large. The weighted PSLS estimators of all parameters are shown to be √n-consistent, asymptotically normal and asymptotically more efficient than the unweighted ones. The proposed method can be used to make simultaneous inference for the parametric and nonparametric components by virtue of the sandwich formula for the joint covariance matrix. Simulations are conducted to demonstrate the finite sample performance of the proposed estimators. A real data analysis is used to illustrate the application of the proposed method.
https://doi.org/10.1142/9789812708298_0010
This chapter develops a flexible dimension-reduction model that incorporates both discrete and continuous covariates. Under this model, some covariates, Z, are related to the response variable, Y, through a linear relationship, while the remaining covariates, X, are related to Y through k indices, X′B, and an unknown function g of X′B. To avoid the curse of dimensionality, k should be much smaller than p. This is often realistic, as the key features of a high-dimensional variable can often be extracted through a low-dimensional subspace. We develop a simple approach that separates the dimension-reduction stage to estimate B from the remaining model components when the two covariates Z and X are independent. For instance, one can apply any suitable dimension-reduction approach, such as the average derivative method, projection pursuit regression or sliced inverse regression, to get an initial estimator for B which is consistent at the √n rate, and then estimate the regression coefficient of Z and the link function g through a profile approach such as partial regression. All three estimates can be refined by iterating the procedure once. Such an approach is computationally simple and yields efficient estimates for both parameters at the √n rate. We provide both theoretical proofs and empirical evidence.
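In generic notation (not necessarily the chapter's exact formulation), a model of this kind can be written as

    Y = θ′Z + g(B′X) + ε,

where B is a p × k matrix of index coefficients, g is an unknown link function on R^k and ε is a noise term; the dimension-reduction step estimates B, after which θ and g are estimated by a profile (partial regression) step.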
https://doi.org/10.1142/9789812708298_0011
An extension of the rank-based partial likelihood method of Cox (1975) to general transformation models was introduced by Doksum (1987), and Chaudhuri, Doksum and Samarov (1997) introduced average derivative quantile regression estimates of parameters in semiparametric single index regression models that generalize transformation models. An important requirement for rank and quantile based methods to be applicable to any such model is an intrinsic monotonicity property of the underlying link function. In this note, we explore certain extensions of such semiparametric single index models for multivariate lifetime data and the possibility of estimating the index coefficients by average derivative quantile regression techniques. Monotonicity properties of the link functions associated with such models are also investigated.
https://doi.org/10.1142/9789812708298_0012
We consider the problem of model-selection-type aggregation of arbitrary density estimators using MISE risk. Given a collection of arbitrary density estimators, we propose a data-based selector of the best estimator in the collection and prove a general ready-to-use oracle inequality for the selected aggregate estimator. We then apply this inequality to the adaptive estimation of a multivariate density in a “multiple index” model. We show that the proposed aggregate estimator adapts to the unknown index space of unknown dimension in the sense that it allows us to estimate the density with the optimal rate attainable when the index space is known.
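Schematically (with constants and remainder term depending on the setting), an oracle inequality of this type states that the selected aggregate f̃ satisfies

    E || f̃ − f ||^2 ≤ (1 + ε) min_{1 ≤ k ≤ K} E || f̂_k − f ||^2 + remainder(K, n, ε),

where f̂_1, …, f̂_K are the given density estimators, f is the true density and || · || is the L2 norm underlying the MISE risk; that is, the aggregate performs nearly as well as the best estimator in the collection.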
https://doi.org/10.1142/9789812708298_0013
A general class of semiparametric transformation models is considered. A second order differential equation of Sturm-Liouville type is derived that determines the semiparametric information on the Euclidean parameter involved. Under quite general conditions, properties of the solution of the resulting boundary value problem are proved. A frailty model of Clayton and Cuzick for survival data is studied in some detail.
https://doi.org/10.1142/9789812708298_0014
The chapter considers the semiparametric transformation model and compares the finite sample properties of the modified partial likelihood estimator with those of a simple unweighted estimating equation estimator. For the semiparametric transformation model, resampling methods may be used to provide uniform confidence bands for the nonparametric baseline function and the survival function. It is also shown how a score process (defined by the estimating equations) may be used to validate the assumption of constant proportional odds. Sometimes the transformation model will not be sufficiently flexible to deal with, for example, time-varying effects, and an extension of the transformation model is suggested. The extended model specifies a time-varying regression structure for the transformation, and this may be thought of as a first-order Taylor series expansion of a completely nonparametric covariate-dependent baseline. Special cases include a stratified version of the usual semiparametric transformation model. The method is illustrated by a simulation study. The added flexibility increases the practical use of the model considerably.
https://doi.org/10.1142/9789812708298_0015
In this paper we discuss, via some specific examples, some of the issues associated with embedding a standard model in a larger family of models, indexed by an additional parameter. The examples considered are the Box-Cox transformation family, a family of models for binary responses that includes the logit and complementary log log as special cases, and a new family that includes two formulations of cure models as special cases. We discuss parameter interpretations, inflation in variance due to the addition of the extra parameter, predictions on an observable scale, ratios of parameters and score tests. We review the literature on these topics for the Box-Cox and binary response models and provide more details for the cure model.
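The Box-Cox family mentioned here indexes the transformation of the response by a single parameter λ:

    y^(λ) = (y^λ − 1) / λ   for λ ≠ 0,      y^(λ) = log y   for λ = 0,

so λ = 1 corresponds (up to centering) to the untransformed response and λ = 0 to the log transform; the issues discussed in the chapter arise from treating λ as an additional estimated parameter.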
https://doi.org/10.1142/9789812708298_0016
We consider a random-design regression model with vector-valued observations and develop nonparametric estimation of smooth conditional moment functions in the predictor variable. This includes estimation of higher order mixed moments and also functionals of the moments, such as conditional covariance, correlation, variance, and skewness functions. Our asymptotic analysis targets the limit distributions. We find that some seemingly reasonable procedures do not reproduce the identity or other linear functions without undesirable bias components, i.e., they are linearly biased. Alternative linearly unbiased estimators are developed which remedy this bias problem without increasing the variance. A general linearly unbiased estimation scheme is introduced for arbitrary smooth functionals of moment functions.
https://doi.org/10.1142/9789812708298_0017
This paper establishes an asymptotic representation for regression and autoregression rank score statistics of the serial type. Applications include rank-based versions of the Durbin-Watson test, tests of AR(p) against AR(p + 1) dependence, or detection of the presence of random components in AR processes.
https://doi.org/10.1142/9789812708298_0018
We present a convolution smoother for nonparametric regression. Its asymptotic behavior is examined, and its asymptotic total squared error is found to be smaller than that of standard kernel estimators, such as the Nadaraya-Watson and local linear regression estimators. Results based on some simulation studies are given, including a comparison with a fourth order kernel. Asymptotic normality for the proposed estimator is proved.
https://doi.org/10.1142/9789812708298_0019
We review gene mapping, or inference for quantitative trait loci, in the context of recent research in semi-parametric and non-parametric inference for mixture models. Gene mapping studies the relationship between a phenotypic trait and inherited genotype. Semi-parametric gene mapping using the exponential tilt covers most standard exponential families and improves estimation of genetic effects. Non-parametric gene mapping, including a generalized Hodges-Lehmann shift estimator and a Kaplan-Meier survival curve, provides a general framework for model selection for the influence of genotype on phenotype. Examples and summaries of reported simulations show the power of these methods when data are far from normal.
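In its simplest form (stated here only for orientation), the exponential tilt relates the phenotype density in genotype group j to an unspecified baseline density f_0 by a parametric distortion,

    f_j(y) = f_0(y) exp( α_j + β_j y ),

where β_j captures the genetic effect and α_j is fixed by the requirement that f_j integrates to one.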
https://doi.org/10.1142/9789812708298_0020
Work in the last two decades on Bayesian nonparametric methods for mixture models finds that a posterior distribution is a double mixture. One first selects a partition of the objects based on a distribution on the partitions, and then performs a traditional parametric posterior analysis on the data corresponding to each cluster of the given partition. It is known that such a partition distribution favors partitions for which the clustering process is defined by predictive quantities such as predictive densities or weights. If a posterior distribution is a statistical guide to the unknown, this partition distribution could be used as a basis for a statistical model for clustering in which the partition is a parameter. The corresponding maximum likelihood estimator or posterior mode is used as an estimator of the partition. We also discuss methods to approximate these estimators based on a weighted Chinese restaurant process. A numerical example on a leukemia data set is given.
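As a minimal illustration of the kind of partition distribution involved, the sketch below draws a random partition from the ordinary (unweighted) Chinese restaurant process; the weighted variant used for approximating the estimators additionally tilts these seating probabilities by predictive weights.

    # Sketch: draw one partition of n objects from a Chinese restaurant
    # process with concentration parameter alpha. This is the unweighted
    # process; the weighted CRP tilts the seating probabilities below.
    import numpy as np

    def crp_partition(n, alpha, rng):
        clusters = []                         # clusters[k] = list of object indices
        for i in range(n):
            sizes = np.array([len(c) for c in clusters], dtype=float)
            probs = np.append(sizes, alpha) / (i + alpha)
            k = rng.choice(len(clusters) + 1, p=probs)
            if k == len(clusters):
                clusters.append([i])          # open a new cluster
            else:
                clusters[k].append(i)         # join an existing cluster
        return clusters

    rng = np.random.default_rng(1)
    print(crp_partition(10, alpha=1.0, rng=rng))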
https://doi.org/10.1142/9789812708298_0021
This paper describes briefly how one may utilize a class of species sampling mixture models derived from Doksum's (1974) neutral to the right processes. For practical implementation we describe an ordered/ranked variant of the generalized weighted Chinese restaurant process.
https://doi.org/10.1142/9789812708298_0022
We investigate the posterior distribution of a percentile and of several percentiles in the Dirichlet process nonparametric setting. Our main result is an asymptotic expansion for the posterior distribution of a percentile that has a leading normal term. We also introduce a procedure for sampling from the posterior distribution.
https://doi.org/10.1142/9789812708298_0023
This chapter deals with nonparametric inference for quantiles from a Bayesian perspective, using the Dirichlet process. The posterior distribution for quantiles is characterised, enabling also explicit formulae for posterior mean and variance. Unlike the Bayes estimator for the distribution function, our Bayes estimator for the quantile function is a smooth curve. A Bernshtein–von Mises type theorem is given, exhibiting the limiting posterior distribution of the quantile process. Links to kernel-smoothed quantile estimators are provided. As a side product we develop an automatic nonparametric density estimator, free of smoothing parameters, with support exactly matching that of the data range. Nonparametric Bayes estimators are also provided for other quantile-related quantities, including the Lorenz curve and the Gini index, for Doksum's shift curve and for Parzen's comparison distribution in two-sample situations, and finally for the quantile regression function in situations with covariates.
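For context, with a Dirichlet process prior having base measure M·F_0 (total mass M), the Bayes estimator of the distribution function under squared error loss is the familiar mixture

    E[ F(t) | data ] = ( M F_0(t) + n F_n(t) ) / ( M + n ),

where F_n is the empirical distribution function; this estimator inherits the jumps of F_n, whereas the Bayes estimator of the quantile function derived in the chapter is a smooth curve.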
https://doi.org/10.1142/9789812708298_0024
A major aim of most income distribution studies is to make comparisons of income inequality across time for a given country and/or compare and rank different countries according to the level of income inequality. However, most of these studies lack information on sampling errors, which makes it difficult to judge the significance of the attained rankings. This chapter derives the asymptotic properties of the empirical rank-dependent family of inequality measures. A favorable feature of this family of inequality measures is that it includes the Gini coefficients, and that any member of this family can be given an explicit and simple expression in terms of the Lorenz curve. By relying on a result of Doksum (1974), it is easily demonstrated that the empirical Lorenz curve converges to a Gaussian process. This result forms the basis of the derivation of the asymptotic properties of the empirical rank-dependent measures of inequality.
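For example, the Gini coefficient can be expressed directly as a functional of the Lorenz curve L,

    G = 1 − 2 ∫_0^1 L(u) du,

so the convergence of the empirical Lorenz curve to a Gaussian process transfers, through such functionals, to the empirical rank-dependent inequality measures.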
https://doi.org/10.1142/9789812708298_0025
Different studies on the same objects under the same conditions often result in nearly uncorrelated rankings of the objects, especially when dealing with a large number of objects. The problem arises mainly from the fact that the data contain only a small proportion of “interesting” or “important” objects which hold the answers to the scientific questions. This paper proposes a modified Kendall rank-order association test for evaluating the repeatability of two studies on a large number of objects, most of which are undifferentiated. Since the repeatability between two datasets is reflected in the association between the two sets of observed values, evaluating the extent and the significance of such association is one way to measure the strength of the signals in the data. Due to the complex nature of the data, we consider rank association, which is distribution-free. Using simulation results, we show that the proposed modification to the classic Kendall rank-order correlation coefficient has desirable properties that can address many of the issues that arise in current statistical studies.
https://doi.org/10.1142/9789812708298_0026
Stock prices have been modeled in the literature as either discrete or continuous versions of geometric Brownian motions (GBM). This chapter uses rank statistics of the GBM to define a new exotic derivative called a stochastic corridor. This rank statistic measures how much time, during a given period, the stock prices stay below the price of a prefixed day. The properties of the stochastic corridor and its applications in finance are studied.
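A minimal Monte Carlo sketch of the quantity involved, namely the fraction of time a simulated stock price spends below its level on a prefixed day, might look as follows; the drift, volatility and reference day are hypothetical choices, not values from the chapter.

    # Sketch: simulate a geometric Brownian motion and estimate how much of
    # the period the price stays below its level on a prefixed day.
    import numpy as np

    rng = np.random.default_rng(2)
    mu, sigma, s0 = 0.05, 0.2, 100.0   # annual drift, volatility, initial price
    n_steps, dt = 250, 1.0 / 250       # one year of daily steps
    ref_day = 50                       # prefixed day defining the corridor level

    z = rng.normal(size=n_steps)
    increments = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z
    prices = s0 * np.exp(np.cumsum(increments))

    level = prices[ref_day]
    fraction_below = np.mean(prices < level)   # occupation-time rank statistic
    print(fraction_below)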
https://doi.org/10.1142/9789812708298_0027
We review and complement a general approach for Monte Carlo computations of conditional expectations given a sufficient statistic. The problem of direct sampling from the conditional distribution is considered in particular. This can be done by a simple parameter adjustment of the original statistical model if certain conditions are satisfied, but in general one needs to use a weighted sampling scheme. Several examples are given in order to demonstrate how the general method can be used under different distributions and observation plans. In particular we consider cases with, respectively, truncated and type I censored samples from the exponential distribution, and also conditional sampling for the inverse Gaussian distribution. Some new theoretical results are presented.
https://doi.org/10.1142/9789812708298_0028
Monte Carlo estimation of an integral is usually based on the method of moments or on an estimating equation. Recently, Kong et al. (2003) proposed a likelihood based theory, which puts Monte Carlo estimation of integrals on a firmer, less ad hoc, basis by formulating the problem as a likelihood inference problem for the baseline measure with simulated observations as data. In this paper, we provide further exploration and development of this theory. After an overview of the likelihood formulation, we first demonstrate the power of the likelihood-based method by presenting a universally improved importance sampling estimator. We then prove that the formal, infinite-dimensional Fisher-information based variance calculation given in Kong et al. (2003) is asymptotically the same as the sampling based “sandwich” variance estimator. Next, we explore the gain in Monte Carlo efficiency when the baseline measure can be parameterized. Furthermore, we show how the Monte Carlo integration problem can also be dealt with by the method of empirical likelihood, and how the baseline measure parameter can be properly profiled out to form a profile likelihood for the integrals only. As a byproduct, we obtain four equivalent conditions for the existence of a unique maximum likelihood estimate for mixture models with known components. We also discuss an apparent paradox for Bayesian inference with Monte Carlo integration.
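For orientation, a plain importance sampling estimator of an integral, the baseline that such likelihood-based methods aim to improve upon, can be sketched as follows; the target, proposal and integrand are arbitrary illustrative choices.

    # Sketch: plain importance sampling estimate of I = ∫ h(x) p(x) dx,
    # drawing from a proposal q and weighting each draw by p/q.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    h = lambda x: x**2                     # integrand (illustrative choice)
    p = stats.norm(loc=0.0, scale=1.0)     # target density
    q = stats.norm(loc=0.0, scale=2.0)     # heavier-tailed proposal

    x = q.rvs(size=100_000, random_state=rng)
    w = p.pdf(x) / q.pdf(x)                # importance weights
    print(np.mean(w * h(x)))               # estimate of E_p[h(X)], here 1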
https://doi.org/10.1142/9789812708298_0029
A confidence distribution for a scalar parameter provides confidence intervals by its quantiles. A confidence net represents a family of nested confidence regions indexed by degree of confidence. Confidence nets are obtained by mapping the deviance function into the unit interval. For high-dimensional parameters, product confidence nets, represented as families of simultaneous confidence bands, are obtained from bootstrapping utilizing the abc-method. The method is applied to Norwegian personal income data.
https://doi.org/10.1142/9789812708298_0030
L1 penalties have proven to be an attractive regularization device for nonparametric regression, image reconstruction, and model selection. For function estimation, L1 penalties, interpreted as measuring the roughness of the candidate function by its total variation, are known to be capable of capturing sharp changes in the target function while still maintaining a general smoothing objective. We explore the use of penalties based on the total variation of the estimated density, its square root, and its logarithm – and their derivatives – in the context of univariate and bivariate density estimation, and compare the results to some other density estimation methods, including L2 penalized likelihood methods. Our objective is to develop a unified approach to total variation penalized density estimation offering methods that are capable of identifying qualitative features like sharp peaks, extendible to higher dimensions, and computationally tractable. Modern interior point methods for solving convex optimization problems play a critical role in achieving the final objective, as do piecewise linear finite element methods that facilitate the use of sparse linear algebra.
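In the univariate case, a penalized likelihood of this general kind takes the schematic form

    maximize  Σ_{i=1}^n log f(x_i) − λ TV(√f)    subject to  f ≥ 0,  ∫ f = 1,

where TV(g) = ∫ |g′(x)| dx denotes total variation (penalizing log f, f itself, or derivatives is analogous) and λ ≥ 0 is a tuning parameter; after a piecewise linear finite element discretization this becomes a convex program amenable to interior point methods.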
https://doi.org/10.1142/9789812708298_0031
The bounded normal mean problem, which has important applications in nonparametric function estimation, is to estimate the mean of a normal distribution when the mean is restricted to a bounded interval. The minimax risk for such a problem is generally unknown. It is shown in Donoho, Liu and MacGibbon (1990) that the linear minimax risk provides a good approximation to the minimax risk. We show in this note that a better approximation can be obtained by a simple truncation of the minimax linear estimator and that the minimax linear estimator is itself inadmissible. The gain of the truncated minimax linear estimator is significant when the mean interval is of moderate size, where no analytical expression for the minimax risk is available. In particular, we show that the truncated minimax linear estimator performs no more than 13% worse than the minimax estimator, compared with 25% for the minimax linear estimator.
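Concretely, for a single observation X ~ N(θ, 1) with θ restricted to [−m, m], the minimax linear estimator has the standard form δ_L(x) = ( m² / (1 + m²) ) x, and the truncated version clamps it back to the parameter interval,

    δ_T(x) = max( −m, min( m, δ_L(x) ) ),

which can only reduce the risk since the true mean lies in [−m, m]; the percentages quoted above compare the worst-case risks of δ_T and δ_L with the exact minimax risk.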
https://doi.org/10.1142/9789812708298_0032
A distribution function F is more peaked about a point a than the distribution G is about the point b if F((x + a)−) − F(−x + a) ≥ G((x + b)−) − G(−x + b) for every x > 0. The problem of estimating symmetric distribution functions F or G, or both, under this constraint is considered in this paper. It turns out that the proposed estimators are projections of the empirical distribution function onto suitable convex sets of distribution functions. As a consequence, the estimators are shown to be strongly uniformly consistent. The asymptotic distribution theory of the estimators is also discussed.
https://doi.org/10.1142/9789812708298_bmatter