Minimum description length revisited
Abstract
This is an up-to-date introduction to and overview of the Minimum Description Length (MDL) Principle, a theory of inductive inference that can be applied to general problems in statistics, machine learning and pattern recognition. While MDL was originally based on data compression ideas, this introduction can be read without any knowledge thereof. It takes into account all major developments since 2007, the last time an extensive overview was written. These include new methods for model selection and averaging and hypothesis testing, as well as the first completely general definition of MDL estimators. Incorporating these developments, MDL can be seen as a powerful extension of both penalized likelihood and Bayesian approaches, in which penalization functions and prior distributions are replaced by more general luckiness functions, average-case methodology is replaced by a more robust worst-case approach, and in which methods classically viewed as highly distinct, such as AIC versus BIC and cross-validation versus Bayes can, to a large extent, be viewed from a unified perspective.
1. Introduction
The Minimum Description Length (MDL) Principle1,2,3,4 is a theory of inductive inference that can be applied to general problems in statistics, machine learning and pattern recognition. Broadly speaking, it states that the best explanation for a given set of data is provided by the shortest description of that data. In 2007, one of us published the book The Minimum Description Length Principle (Ref. 4, referred to as G07 from now on), giving a detailed account of most work done in the MDL area until then. During the last 10 years, several new practical MDL methods have been designed, and there have been exciting theoretical developments as well. It therefore seemed time to present an up-to-date combined introduction and review.
Why read this overview?
While the MDL idea has been shown to be very powerful in theory, and there have been a fair number of successful practical implementations, massive deployment has been hindered by two issues: first, in order to apply MDL, one needs to have basic knowledge of both statistics and information theory. To remedy this situation, here we present, for the first time, the MDL Principle without resorting to information theory: all the material can be understood without any knowledge of data compression, which should make it a much easier read for statisticians and machine learning researchers new to MDL. A second issue is that many classical MDL procedures are either computationally highly intensive (for example, MDL variable selection as in Example 4 below), and hence less suited for our big data age, or they seem to require somewhat arbitrary restrictions of parameter spaces (e.g., NML with v ≡ 1 as in Sec. 2). Yet, over the last 10 years, there have been exciting developments — some of them very recent — which mostly resolve these issues. Incorporating these developments, MDL can be seen as a powerful extension of both penalized likelihood and Bayesian approaches, in which penalization functions and prior distributions are replaced by more general luckiness functions, average-case methodology is replaced by a more robust worst-case approach, and in which methods classically viewed as highly distinct, such as AIC versus BIC and cross-validation versus Bayes, can, to some extent, be viewed from a unified perspective; as such, this paper should also be of interest to researchers working on the foundations of statistics and machine learning.
History of the field, recent advances and overview of this paper
MDL was introduced in 1978 by Jorma Rissanen in his paper Modeling by the Shortest Data Description. The paper coined the term MDL and introduced and analyzed the two-part code for parametric models. The two-part code is the simplest instance of a universal code or, equivalently, universal probability distribution, the cornerstone concept of MDL theory. MDL theory was greatly extended in the 1980s, when Rissanen published a sequence of ground-breaking papers at a remarkable pace, several of which introduced new types of universal distributions. It came to full blossom in the 1990s, with further major contributions from, primarily, Jorma Rissanen, Andrew Barron and Bin Yu, culminating in their overview paper3 and the collection5 with additional chapters by other essential contributors such as Kenji Yamanishi. The book G07 provides a more exhaustive treatment of this early work, including discussion of important precursors/alternatives to MDL such as MML,6 “ideal”, Kolmogorov complexity-based MDL7 and Solomonoff’s theory of induction.8 Universal distributions are still central to MDL. We introduce them in a concise yet self-contained way, including substantial underlying motivation, in Sec. 2, incorporating the extensions to and new insights into these basic building blocks that have been gathered over the last 10 years. These include more general formulations of arguably the most fundamental universal code, the Normalized Maximum Likelihood (NML) Distribution, including faster ways to calculate it as well. We devote a separate section to new universal codes, with quite pleasant properties for practical use, most notably the switch distribution (Sec. 3.1), which can be used for model selection combining almost the best of AIC and BIC; and the Reverse Information Projection (RIPr)-universal code (Sec. 3.3) specially geared to hypothesis testing with composite null hypotheses, leading to several advantages over classical Neyman–Pearson tests. In Sec. 4 we review recent developments on fast calculation of NML-type distributions for model selection for graphical models (Bayesian networks and the like), leading to methods which appear to be more robust in practice than the standard Bayesian ones. Recent extensions of MDL theory and practical implementations to latent variable and irregular models are treated in Sec. 5. Then, in Sec. 6 we review developments relating to consistency and convergence properties of MDL methods. First, while originally MDL estimation was formulated solely in terms of discretized estimators (reflecting the fact that coding always requires discretization), it has gradually become clear that a much larger class of estimators (including maximum likelihood for “simple” models, and, in some circumstances, the Lasso — see Example 4) can be viewed from an MDL perspective, and this becomes clearest if one investigates asymptotic convergence theorems relating to MDL. Second, it was found that MDL (and Bayes), without modification, can behave sub-optimally under misspecification, i.e., when all models under consideration are wrong, but some are useful — see Sec. 6.3. Third, very recently, it was shown how some of the surprising phenomena underlying the deep learning revolution in machine learning can be explained from an MDL-related perspective; we briefly review these developments in Sec. 6.4. 
Finally, we note that G07 presented many explicit open problems, most of which have been resolved — we mention throughout the text whenever a new development solved an old open problem, deferring some of the most technical issues to the Appendix.
Notational preliminaries
We shall mainly be concerned with statistical models (families of probability distributions) of the form ℳ = {p_θ : θ ∈ Θ}, parametrized by some set Θ which is usually but not always a subset of Euclidean space; and families of models {ℳγ : γ ∈ Γ}, where each ℳγ = {p_θ : θ ∈ Θγ} is a statistical model, used to model the data z^n := (z_1, …, z_n) with each z_i ∈ 𝒵, for some outcome space 𝒵. Each p_θ represents a probability density function (pdf) or probability mass function, defined on sequences of arbitrary length. With slight abuse of notation we also denote the corresponding probability distribution by p_θ (rather than the more common P_θ). In the simple case that the data are i.i.d. according to each p_θ under consideration, we have p_θ(z^n) = ∏_{i=1}^n p_θ(z_i).
We denote the maximum likelihood (ML) estimator given the model ℳ = {p_θ : θ ∈ Θ} by θ̂ML, whenever it exists and is unique; the ML estimator relative to model ℳγ is denoted by θ̂ML|γ. We shall, purely for simplicity, generally assume its existence and uniqueness, although nearly all results can be generalized to the case where it does not. We use θ̆ to denote more general estimators, and θ̂v to denote what we call the MDL estimator with luckiness function v, see (5).
2. The Fundamental Concept: Universal Modeling
MDL is best explained by starting with one of its prime applications, model comparison — we will generalize to prediction and estimation later, in Secs. 2.3 and 2.4. Assume then that we are given a finite or countably infinite collection of statistical models ℳ1, ℳ2, …, each consisting of a set of probability distributions. The fundamental idea of MDL is to associate each ℳγ with a single distribution p̄γ, often called a universal distribution relative to ℳγ. We call the minus-log-likelihood −log p̄γ(z^n) the code length of data z^n under the universal code p̄γ. This terminology, and how MDL is related to coding (lossless compression of data), is briefly reviewed in Secs. 2.3 and 2.4; but a crucial observation at this point is that the main MDL ideas can be understood abstractly, without resorting to the code length interpretation. We also equip the model indices Γ := {1, 2, …, γmax} (where we allow |Γ| = γmax = ∞) with a distribution, say π; if the number of models to be compared is small (e.g., bounded independently of n or at most a small polynomial in n), we can take π to be the uniform distribution — for large (exponential in n) and infinite Γ, see Sec. 2.3 and Example 4. We then take, as our best explanation of the given data z^n, the model ℳγ minimizing

−log π(γ) − log p̄γ(z^n).   (1)
(1) The Bayesian universal distribution
The reader may recognize this as being formally equivalent to the standard Bayesian way of model selection, the Bayes factor method,9 as long as the p̄γ are defined as Bayesian marginal distributions, i.e., for each γ, we set p̄γ = p^bayes_wγ, where

p^bayes_wγ(z^n) := ∫ p_θ(z^n) wγ(θ) dθ,   (2)

with wγ a prior probability density on Θγ.
Example 1 (Bernoulli). Let ℳ = {p_θ : θ ∈ [0, 1]} represent the Bernoulli model, extended to n outcomes by independence. We then have for each z^n ∈ {0, 1}^n that p_θ(z^n) = θ^{n1}(1 − θ)^{n0}, where n1 = ∑_{i=1}^n z_i and n0 = n − n1. Most standard prior distributions one encounters in the literature are beta priors, for which w(θ) ∝ θ^α(1 − θ)^β, so that p^bayes_w(z^n) ∝ ∫ θ^{n1+α}(1 − θ)^{n0+β} dθ. Note that p^bayes_w is not itself an element of the Bernoulli model. One could use p^bayes_w to compare the Bernoulli model, via (1), to, for example, a first-order Markov model, with Bayesian marginal likelihoods defined analogously. We shall say a lot more about the choice of prior below.
Example 2 (Gauss and general improper priors). A second example is the Gaussian location family ℳgauss with fixed variance (say 1), in which 𝒵 = ℝ and p_θ(z^n) ∝ exp(−∑_{i=1}^n (z_i − θ)²/2). A standard prior for such a model is the uniform prior, w(θ) = 1, which is improper (it does not integrate to a finite value, hence does not define a probability distribution). Improper priors cannot be directly used in (2), and hence they cannot be directly used for model comparison as in (1) either. Still, we can use them in an indirect manner, as long as we are guaranteed that, for all ℳγ under consideration, after some initial number m of observations, the Bayesian posterior wγ(θ | z^m) is proper. We can then replace p^bayes_wγ(z^n) in (2) by p^bayes_wγ(z_{m+1}, …, z_n | z^m) := ∫ p_θ(z_{m+1}, …, z_n) wγ(θ | z^m) dθ. We extend all these conditional universal distributions to distributions on 𝒵^n by defining p^bayes_wγ(z_1, …, z_n) := p^bayes_wγ(z_{m+1}, …, z_n | z^m) · p_0(z^m) for some distribution p_0 on 𝒵^m that is taken to be the same for all models ℳγ under consideration. We can now use (1) again for model selection based on p^bayes_wγ(z_1, …, z_n), where we note that the choice of p_0 plays no role in the minimization, which is equivalent to minimizing −log π(γ) − log p^bayes_wγ(z_{m+1}, …, z_n | z^m).
Now comes the crux of the story, which makes MDL, in the end, quite different from Bayes: defining the ˉpγ as in (2) is just one particular way to define an MDL universal distribution — but it is by no means the only one. There are several other ways, and some of them are sometimes preferable to the Bayesian choice. Here we list the most important ones.
(2) NML or Shtarkov10 distribution, and MDL estimators
This is perhaps the most fundamental universal distribution, leading also to the definition of an MDL estimator. In its general form, the NML distribution and “MDL estimators” depend on a function v: Θ → ℝ⁺₀, the luckiness function. The definition is then given by

p^nml_v(z^n) := v(θ̂v(z^n)) p_{θ̂v(z^n)}(z^n) / ∑_{z̃^n∈𝒵^n} v(θ̂v(z̃^n)) p_{θ̂v(z̃^n)}(z̃^n) = p_{θ̂ML(z^n)}(z^n) / ∑_{z̃^n∈𝒵^n} p_{θ̂ML(z̃^n)}(z̃^n),   (3)

where the second equality holds for v ≡ 1 and sums are to be replaced by integrals for continuous 𝒵. Here

θ̂v(z^n) := arg max_{θ∈Θ} v(θ) p_θ(z^n) = arg min_{θ∈Θ} [−log p_θ(z^n) − log v(θ)]   (5)

is the v-MDL estimator, and the logarithm of the normalizing term in (3) is denoted

COMP(ℳ, v) := log ∑_{z̃^n∈𝒵^n} v(θ̂v(z̃^n)) p_{θ̂v(z̃^n)}(z̃^n).   (7)
The v-MDL estimator is a penalized ML estimator, which coincides with the Bayes MAP estimator based on prior v whenever v is a probability density. Although this has only become clear gradually over the last 10 years, estimators of the form (5) are the prime way of using MDL for estimation; there is, however, a second, “improper” way of estimating distributions within MDL, see Sec. 2.4. In practice, we will choose a v that is sufficiently smooth so that, if the number of parameters is small relative to n, θ̂v will usually be almost indistinguishable from the ML estimator θ̂ML. COMP(ℳ, v) indeed measures something one could call a “complexity” — this is easiest to see if v ≡ 1, for then, if ℳ contains just a single distribution, we must have COMP(ℳ, v) = 0, and the more distributions we add to ℳ, the larger COMP(ℳ, v) gets — this is explored further in Sec. 2.2.
Now suppose we have a collection of models ℳγ indexed by finite Γ, we have specified luckiness functions vγ on Θγ for each γ ∈ Γ, and we pick a uniform distribution π on Γ. As can be seen from the above, if we base our model choice on NML, we pick the model minimizing

−log p_{θ̂vγ(z^n)}(z^n) − log vγ(θ̂vγ(z^n)) + COMP(ℳγ, vγ).
Originally, the NML distribution was defined by Shtarkov10 for the special case with v≡1, leading to the rightmost definition in (3), and hence the term NML (in the modern version, perhaps “normalized penalized ML” would be more apt). This is also the version that Rissanen12 advocated as embodying the purest form of the MDL Principle. However, the integral in (3) is ill-defined for just about every parametric model defined on unbounded outcome spaces (such as ℕ,ℝ or ℝ+), including the simple normal location family. Using nonuniform v allows one to deal with such cases in a principled manner after all, see Sec. 2.5. For finite outcome spaces though, v≡1 usually “works”, and (3) is well defined, as we illustrate for the Bernoulli model (see Sec. 4 for more examples).
Example 3 (Continuation of Example 1). For the Bernoulli model, θ̂ML(z^n) = n1/n and COMP(ℳ, v) as in (7) with v ≡ 1 can be rewritten as log ∑_{n1=0}^n (n choose n1) (n1/n)^{n1} (n0/n)^{n0}, which, as we shall see in Sec. 2.2, is within a constant of (1/2) log n. As reviewed in that sub-section, the resulting p^nml_v is asymptotically (essentially) indistinguishable from p^bayes_wJ, where the latter is equipped with Jeffreys’ prior, defined as wJ(θ) ∝ √I(θ) = θ^{−1/2}(1 − θ)^{−1/2}, with I(θ) being the Fisher information at θ.
(3) The two-part (sub-)distribution1
Here one first discretizes Θ to some countable subset Θ̈ which one equips with a probability mass function w; in contrast to the v above, this function must sum to 1. One then considers

p^2-p(z^n) := max_{θ̈∈Θ̈} w(θ̈) p_{θ̈}(z^n),

which is in general a sub-distribution: summing over all data sequences,

∑_{z^n∈𝒵^n} max_{θ̈∈Θ̈} w(θ̈) p_{θ̈}(z^n) ≤ ∑_{θ̈∈Θ̈} w(θ̈) = 1,   (9)

so that, in the terminology introduced above, COMP(ℳ, w) ≤ 0 whenever the luckiness function w is a probability mass function.
(4) The prequential plug-in distribution14,15
Here, one first takes any reasonable estimator θ̆ for the given model ℳ. One then defines

p^preq(z^n) := ∏_{i=1}^n p_{θ̆(z^{i−1})}(z_i),   (10)

i.e., each outcome is predicted by the member of ℳ indexed by the estimate computed from all previous outcomes.
Except for the “hybrids” discussed in Sec. 3.2 below, these first four universal distributions were all brought into MDL theory by Rissanen; they are extensively treated by G07, in which one chapter is devoted to each, and to which we refer for details. The following two are much more recent.
(5) The switch distribution p^sw (Ref. 16)
In a particular type of nested model selection, this universal distribution behaves arguably better than the other ones. It will be treated in detail in Sec. 3.1.
(6) Universal distributions p^ripr based on the Reverse Information Projection
These universal distributions17 lead to improved error bounds and optional stopping behavior in hypothesis testing and allow one to forge a connection with group-invariant Bayes factor methods; see Sec. 3.3.
2.1. Motivation
We first give a very high-level motivation that avoids direct use of data compression arguments. For readers interested in data compression, Sec. 2.3 does make a high-level connection, but for more extensive material we refer to G07. We do, in Sec. 2.4, give a more detailed motivation in predictive terms, and, in Sec. 6, we shall review mathematical results indicating that MDL methods are typically consistent and enjoy fast rates of convergence, providing an additional motivation in itself.
Consider then models ℳ1, ℳ2, …, where for simplicity we assume discrete data, and let θ̂ML|γ be the maximum likelihood estimator within ℳγ. Define “the fit of model ℳγ to the data z^n” in the standard way, as p_{θ̂ML|γ}(z^n), the likelihood assigned to the data by the best-fitting distribution within the model. Now if we enlarge the model ℳγ, e.g., by adding several distributions to it, p_{θ̂ML|γ}(z^n) can only increase; and if we make ℳγ big enough such that for each z^n it contains a distribution p with p(z^n) = 1, we can even have p_{θ̂ML|γ}(z^n) = 1 on all data. If we simply picked the γ maximizing p_{θ̂ML|γ}(z^n), we would be prone to severe overfitting. For example, if models are nested, then, except for very special data, we would automatically pick the largest one.
As we have seen, a central MDL idea is to instead associate each model ℳγ with a single corresponding distribution p̄γ, i.e., we measure the fit of ℳγ to z^n as p̄γ(z^n). Then the total probability mass on all potential outcomes cannot be larger than 1, which makes it impossible to assign overly high fit to overly many data sequences: no matter what distribution p̄γ we chose, we must now have ∑_{z^n∈𝒵^n} p̄γ(z^n) ≤ 1, so a good fit on some z^n necessarily implies a worse fit on others, and we will not select a model simply because it accidentally contained some distribution that fitted our data very well — thus, measuring fit by a distribution p̄γ instead of p_{θ̂ML|γ} inherently prevents overfitting. This argument to measure fit relative to a model ℳγ with a single p̄γ is similar to Bayesian Occam’s Razor arguments18 used to motivate the Bayes factor; the crucial difference is that we do not restrict ourselves to p̄γ of the form (2); inspecting the “Bayesian” Occam argument, there is, indeed, nothing in there which forces us to use distributions of Bayesian form.
The next step is thus to decide which p̄γ are best associated with a given ℳγ. To this end, we define the fitness ratio of p̄ for data z^n as

FR(z^n) := p̄(z^n) / (v(θ̂v(z^n)) p_{θ̂v(z^n)}(z^n)),

the ratio between the fit achieved by p̄ and the best (luckiness-weighted) fit achievable within the model; since p̄ sums to at most 1 while the denominator in general does not, FR(z^n) < 1 for at least some z^n. The NML distribution (3) is exactly the p̄ maximizing the worst-case fitness ratio min_{z^n∈𝒵^n} FR(z^n): it equalizes FR(z^n) = exp(−COMP(ℳ, v)) over all z^n.
A few remarks concerning this high-level motivation of MDL procedures are in order.
(1) It is clear that, by requiring the p̄γ(z^n) to add to at most 1 over all z^n, we will be less prone to overfitting than by setting it simply to the best fit v(θ̂v(z^n)) p_{θ̂v(z^n)}(z^n); whether the requirement to add (at most) to 1, making p̄γ essentially a probability density function, is a clever way to avoid overfitting (leading to good results in practice) is not clear yet. For this, we need additional arguments, which we very briefly review. First, the sum-to-1 requirement is the only choice for which the procedure can be interpreted as selecting the model which minimizes code length of the data (the original interpretation of MDL); second, it is the only choice which has a predictive interpretation, which we review in Sec. 2.4 below; third, it is the only choice under which time-tested Bayesian methods fit into the picture; and fourth, with this choice we get desirable frequentist statistical properties such as consistency and convergence rates, see Sec. 6.
(2) The motivation above only applies to the NML universal distributions. How about the other five types? Originally, in the pure MDL approach mainly due to Rissanen, NML was viewed as the optimal choice per se; other universal distributions should be used only for pragmatic reasons, such as them being easier to calculate. One would then design them so as to be as close as possible to the NML distributions in terms of the fitness ratio they achieve. In the following sub-section we show that all six of them satisfy the same MDL/BIC asymptotics, meaning that their fitness ratio is never smaller than a constant factor of the NML one, either again in the worst case over all z^n or in some weaker expectation sense. Thus, they are all “kind of ok” in a rather weak sense, and in practice one would simply revert to the one that is closest to NML and still usable in practice; with the Bayesian p^bayes_w, as we shall see, one can even get arbitrarily close to NML as n gets larger. This classical story notwithstanding, it has become more and more apparent that in practice one sometimes wants or needs properties of model selection methods that are not guaranteed by NML — such as near-optimal predictions of future data or strong frequentist Type-I error guarantees. This translates itself into universal codes p^sw and p^ripr that, for some special sequences, achieve much higher fitness ratio than p^nml, while for all other sequences having only very slightly smaller fitness ratio. This more recent and pragmatic way of MDL is briefly reviewed in Secs. 3.1 and 3.3. This raises the question how we should define a universal distribution: what choices for p̄ are still “universal” (and define an MDL method) and what choices are not? Informally, every distribution p̄ whose fitness ratio does not become exponentially small in n for any z^n is “universal” relative to ℳ. For parametric models such as exponential families, this is partially formalized by requiring that, at the very least, they should satisfy (14) below (G07 is much more precise on this).
(3) Third, we have not yet said how one should choose the luckiness function v — and one needs to make a choice to apply MDL in practice. The interpretation of v is closely tied to the predictive interpretation of MDL, and hence we postpone this issue to the end of Sec. 2.4.
(4) Fourth, the motivation so far is incomplete — we still need to explain why and how to incorporate the distribution π on the model index γ. This is done in Sec. 2.3 below.
2.2. Asymptotic expansions
Now let p̄ be defined relative to a single parametric model ℳ = {p_θ : θ ∈ Θ}. It turns out that all universal codes we mentioned have in common that, for “sufficiently regular” k-dimensional parametric models, the log-likelihood −log p̄(z^n) for given data z^n satisfies the following celebrated asymptotics, often called the MDL or BIC expansion: for all “sufficiently regular” data sequences z_1, z_2, …, there exists a constant C independent of n such that for all n,

−log p̄(z^n) ≤ −log p_{θ̂ML(z^n)}(z^n) + (k/2) log n + C.   (14)
The NML/COMP expansion and the Jeffreys (Fisher information) integral
For the case that the model ℳ is a k-dimensional exponential family and p̄ is the NML or Bayes distribution, we can be significantly more precise and evaluate the constant in (14) up to o(1): we get, under some weak additional regularity conditions on v and ℳ,

COMP(ℳ, v) = (k/2) log(n/2π) + log ∫_Θ √(det I(θ)) v(θ) dθ + o(1),   (15)

where I(θ) denotes the Fisher information matrix at θ.
Analogously to (15) (in fact much easier mathematically), we can expand −log p^bayes_w(z^n) using a classical Laplace approximation; under the same conditions as before, with now the additional restriction that there exists an arbitrary INECCSI subset Θ_0 of Θ such that for all large n, the data have ML estimator within Θ_0, we find that

−log p^bayes_w(z^n) = −log p_{θ̂ML(z^n)}(z^n) + (k/2) log(n/2π) + log (√(det I(θ̂ML(z^n))) / w(θ̂ML(z^n))) + o(1).   (16)
2.3. Unifying model selection and estimation
Suppose we are given a countable collection of models {ℳγ : γ ∈ Γ}. Recall that the basic idea above was to associate each individual model ℳγ with a single distribution p̄γ. It seems reasonable to do the same at the level of the “meta-parameter” γ: we set ℳ := ⋃_{γ∈Γ} ℳγ and, in complete analogy to (3), we define the meta-universal distribution

p̄π(z^n) := π(γ̂π(z^n)) p̄_{γ̂π(z^n)}(z^n) / ∑_{z̃^n∈𝒵^n} π(γ̂π(z̃^n)) p̄_{γ̂π(z̃^n)}(z̃^n), with γ̂π(z^n) := arg max_{γ∈Γ} π(γ) p̄γ(z^n).   (17)
Let us assume that all p̄γ have already been defined. Then we can use any luckiness function π on Γ such that the overarching p̄π in (17) exists. We can now formulate a general MDL Principle for model selection: we start with a (potentially huge) set of candidate distributions ℳ. We next carve up ℳ into interesting sub-models ℳγ with γ ∈ Γ, so that ℳ = ⋃_{γ∈Γ} ℳγ. We then associate each ℳγ with a universal distribution p̄γ, and we equip Γ as defined above with luckiness function π [note that ℳ, a countable union of (usually) uncountable sets, consists of all distributions under consideration, while Γ is a countable set]. We then base the selection of a sub-model on (17). What we earlier called the “general MDL Principle” underneath (1) was the special case in which ∑_{γ∈Γ} π(γ) = 1, i.e., π is a probability mass function. Via (9) we see that for any such probability mass function π, the denominator in (17) is well-defined, hence π is a valid luckiness function.
Now consider the special case in which every p̄γ is chosen to be an NML distribution p^nml_vγ for some luckiness function vγ on Θγ. We take some function v on Γ (which we will relate to the π above later on) and we set, for θ ∈ Θγ, v(γ, θ) := v(γ) · vγ(θ). We can use v for parameter estimation on the joint parameters (γ, θ) just as we did earlier for parametric models, by using the MDL estimator (5), i.e., picking the (γ, θ) with θ ∈ Θγ minimizing, over γ ∈ Γ and θ ∈ Θγ,

−log p_θ(z^n) − log v(γ) − log vγ(θ).
Two-part versus one-part codes: The role of data compression
In the oldest (1978) version of MDL, only two-part codes on countable sets were used: the minimum over θ was taken over a discretized grid Θ̈γ ⊂ Θγ, and vγ was a probability mass function over this grid; then for all γ, COMP(ℳγ, vγ) ≤ 0, and similarly for the complexity associated with Γ at the meta-level [see (9)], and they were both approximated by 0. From the Kraft inequality20 we see where the name “two-part code” comes from: this inequality says that for every probability mass function w on a countable set 𝒜, there exists a lossless code such that for all a ∈ 𝒜, the number of bits needed to encode a is given by −log w(a) (up to rounding). Thus the resulting method can be interpreted as picking the (γ, θ̈) minimizing the two-stage code length of the data, where first the parameters (γ, θ̈) are encoded using −log v(γ) − log vγ(θ̈) bits, and then z^n is encoded “with the help of (γ, θ̈)”, using −log p_θ̈(z^n) bits [in fact, the encoding of (γ, θ̈) itself has two sub-stages here, so we really have a two-part code where the first part itself has two parts as well].
The discretization involved in using a probability mass function/code for continuous-valued θ makes things (unnecessarily, as was gradually discovered over the last 30 years) very complicated in general. Also, if one combines the choice of the grid with the choice of vγ, the approximation of the COMP terms as 0 introduces some sub-optimalities. Thus, one would like to code the data in a way that avoids these two issues. It turns out that this can be achieved by replacing two-part by one-part codes for the data, namely, to use codes with length −log p̄(z^n): assuming for simplicity that data are discrete, the same Kraft inequality implies that there must also be a code, directly defined on 𝒵^n, which achieves code length −log p̄(z^n) for each given z^n. Thus, even though for general luckiness functions this code length cannot be decomposed into two sub-code lengths, it remains a valid code length, and the name MDL for the resulting procedure remains, we feel, justified. In the past, it was sometimes thought by some MDL fans that two-part codes on countable sets would somehow lead to inherently better estimates than estimators for general luckiness functions as in (5). However, after 30 years it turned out there is nothing either conceptually or mathematically that indicates the need for two-part codes and countable sets: for any luckiness function v, the resulting procedure has a code length interpretation, and Grünwald and Mehta13 show that all consistency and convergence results that hold for two-part estimators also hold for general MDL estimators (Sec. 6.1) — thus invalidating the conjecture in Open Problem 13 of G07 that postulated a special status for luckiness functions that are probability mass functions on countable sets. For the luckiness function π on the discrete structure Γ, however, it is quite reasonable to choose a probability mass function: no probability mass is wasted [since the denominator in (17) plays no role in choosing γ], and designing π by thinking about code lengths comes very naturally, as the following example illustrates.
Example 4 (Variable selection: ℓ0- versus ℓ1-penalties). Suppose that each data point z_i = (x_i, y_i), where y_i ∈ ℝ denotes the variable to be predicted and x_i = (x_{i1}, …, x_{ip}) is a vector of p covariates or “features” that may or may not help for predicting y_i. We consider a linear model Y_i = ∑_{j=1}^p β_j X_{ij} + ε_i with i.i.d. Gaussian noise ε_i, decomposed into sub-models ℳγ indexed by subsets γ ⊆ {1, …, p}, with ℳγ expressing that β_j = 0 for all j ∉ γ.
In light of the above, a straightforward way to use MDL here is to pick the γ minimizing

−log p^nml_vγ(y^n | x^n) − log π(γ),   (19)

where the universal distributions are conditioned on the observed covariates and where π is, for instance, the code/prior that first encodes the number of selected covariates |γ| and then the subset γ itself, so that −log π(γ) = log(p + 1) + log (p choose |γ|).
As to the left part of the code length (19), if the variance σ² is known, a natural luckiness function to use is a |γ|-dimensional Gaussian with mean 0 and covariance matrix τ²Σ for some (usually diagonal) covariance matrix Σ. This gives (see Chap. 12 of G07) a code length that, up to terms not depending on the data, equals the least-squares error of the best fit within ℳγ plus a ridge-type quadratic penalty on the fitted coefficients plus the corresponding COMP(ℳγ, vγ) term.
Note that the code/prior π we used here induces sparsity: if there exists a β with mostly zero components that already fits the data quite well, we will tend to select the corresponding small γ, since, for |γ| ≪ p, −log π(γ) increases approximately linearly in |γ|. That does not mean that we necessarily believe that the “truth” is sparse — it just expresses that we hope that we can already make reasonably good predictions with a small number of features.
An alternative, and very popular, approach to this problem is the celebrated Lasso,22 in which we consider only the full model and we pick the β minimizing

∑_{i=1}^n (y_i − ∑_{j=1}^p β_j x_{ij})² + λ ∑_{j=1}^p |β_j|,

for some regularization parameter λ > 0. In its turn, the Lasso estimator can be cast in the form (5): it is the MDL estimator with luckiness function v(β) ∝ exp(−λ′ ∑_{j=1}^p |β_j|) for an appropriate λ′, so that ℓ0- and ℓ1-approaches correspond to different luckiness/penalization choices within the same framework.
2.4. Log-loss prediction and universal distributions
Now consider the simple case again with a finite set of models ℳ1, …, ℳγmax, where γmax is small compared to n and we use the uniform prior π, picking the γ maximizing p̄γ(z^n). It was the fundamental insight of Rissanen14 and Dawid15 that such model choice by maximizing p̄(z^n) for a single distribution p̄ can be motivated in a different way as well — in essence, it selects the model with the best predictive performance on unseen data. This approach shows that MDL is quite similar in spirit to cross-validation, the main difference with leave-one-out cross-validation being that the “cross” in cross-validation is replaced by a “forward”, and that the loss function used to measure prediction error is restricted to be the logarithmic score, also commonly known as log-loss (which, however, is often used in cross-validation as well).
Formally, the log-loss of predicting a single outcome z with a distribution p is defined as −log p(z): the larger the probability density, the smaller the loss. If one predicts a sequence of outcomes z_1, …, z_n with predictions p_1, …, p_n, then the cumulative log-loss is defined as the sum of the individual losses: ∑_{i=1}^n −log p_i(z_i).
Now, if we adopt a probabilistic world view and represent our beliefs about z^n by a probability distribution p̄, then the obvious way to make sequential predictions is to set p_i := p̄(· | z^{i−1}), so that p_i(z_i) = p̄(z_i | z^{i−1}). For arbitrary probability distributions, we have, by the formula for conditional probability: for all z^n, p̄(z^n) = ∏_{i=1}^n p̄(z_i | z^{i−1}). Taking logarithms gives

−log p̄(z^n) = ∑_{i=1}^n −log p̄(z_i | z^{i−1}),   (23)

i.e., the minus-log-likelihood equals the cumulative log-loss of the induced prediction strategy.
Conversely, if we are given an arbitrary sequential prediction strategy which, when input with a sequence z^{i−1} of arbitrary length, outputs a prediction for the next outcome in the form of a probability distribution p̄(· | z^{i−1}) on 𝒵, we can define p̄(z_i | z^{i−1}) accordingly and then further define p̄(z^n) := ∏_{i=1}^n p̄(z_i | z^{i−1}). A simple calculation shows that we must have ∑_{z^n∈𝒵^n} p̄(z^n) = 1, so we have constructed a probability distribution which once again satisfies (23). The fundamental insight here is that, when the log-loss is used, every probability distribution defines a sequential prediction strategy and — perhaps more surprisingly — vice versa, every sequential prediction strategy defines a probability distribution, such that on all sequences of outcomes, the minus-log-likelihood is equal to the cumulative loss.
Example 5. Consider again the Bernoulli model ℳ = {p_θ : θ ∈ [0, 1]}. Each element p_θ defines a prediction strategy which, no matter what happened in the past, predicts that the probability that the next outcome is 1 is equal to θ. It incurs cumulative loss, on a sequence with n1 ones and n0 zeros, given by −n1 log θ − n0 log(1 − θ). The Bayesian universal distribution with Jeffreys’ prior that we considered in Example 1 satisfies, as a standard calculation shows, p^bayes_wJ(Z_{i+1} = 1 | z^i) = (n1(z^i) + 1/2)/(i + 1), where n1(z^i) is the number of 1s in z^i, so it “learns” the probability of 1 based on past data and does not treat the data as i.i.d. any more. The asymptotic expansion (16) then shows that its cumulative loss exceeds that of the best θ in hindsight by only (1/2) log n + O(1).
We may now ask: given a parametric model ℳ, what distribution (i.e., prediction strategy) p̄ leads to the best predictions of data z^n? For simplicity we will assume that data are i.i.d. according to all p_θ ∈ ℳ. We have to distinguish between the best sequential prediction strategy with hindsight and the best prediction strategy that can be formulated before actually seeing the data. The former is given by the p_θ achieving max_{θ∈Θ} p_θ(z^n), i.e., by the ML estimator θ̂ML(z^n); the excess cumulative log-loss of any strategy p̄ over this hindsight optimum,

−log p̄(z^n) − [−log p_{θ̂ML(z^n)}(z^n)],

is called the regret of p̄ on z^n. The latter, the best strategy that can be fixed in advance, is, in the worst case over all z^n, exactly the NML distribution (3), which by construction equalizes, hence minimizes, the worst-case regret.
We now also see where the idea to use the prequential plug-in distribution p^preq instead of p^nml comes from: if calculating p^nml(z_i | z^{i−1}) is too difficult, or if the horizon n (which is needed to calculate p^nml) is unknown, we might simply pick any estimator θ̆ which we think is “reasonable” and replace our prediction of z_i by p_{θ̆(z^{i−1})}(z_i) — if the estimator was chosen cleverly, we can expect the resulting cumulative regret to be small. Reconsidering (14), we see that all the universal distributions, viewed as prediction strategies, with the right choice of luckiness functions, priors and/or estimators, can be made to achieve a logarithmic (in n) worst-case regret — since the cumulative log-loss achieved by the best predictor in hindsight usually grows linearly in n, a logarithmic regret is quite satisfactory. Returning to our Bernoulli example, we see that the cumulative log-loss obtained by p_{θ̂ML(z^n)}, the best p_θ with hindsight, is equal to n·H(n1/n), where H is the binary entropy, H(μ) := −μ log μ − (1 − μ) log(1 − μ). Note that, in line with the above discussion, n·H(n1/n) is linear in n unless n1/n tends to 0 or 1, but the regret of p^bayes with Jeffreys’ prior is logarithmic in n.
We thus get a novel interpretation of MDL: it associates each model ℳγ with a sequential prediction strategy p̄γ that is designed to achieve small regret compared to the hindsight-optimal prediction strategy within ℳγ; it then picks the model for which the corresponding prediction strategy achieves the smallest cumulative loss on the data.
Related works
Dawid (see Ref. 15 and many subsequent works) suggests using this prequential model choice also with respect to loss functions other than the logarithmic loss. Minimax-optimal cumulative prediction strategies that make no stochastic assumptions about the data, with log-loss but (mainly) with other loss functions, are one of the main topics in machine learning theory (see for example Ref. 24), but there they are generally not used for model comparison or selection.
Why the logarithmic score?
Why does it make sense to minimize cumulative log-loss? Outside of the MDL world, the log-loss is often used for two reasons: first, it is (essentially) the only local proper scoring rule.25 Second, it has an interpretation in terms of money: for every sequential prediction strategy, there is a corresponding “sequential investment” strategy such that the smaller the cumulative log-loss, the larger the monetary gains made with this strategy (“Kelly Gambling”17,20).
Within the MDL field, however, the use of the log-loss comes from the Kraft inequality, which directly relates it to lossless data compression. As we already saw before Example 4, for any sequential prediction strategy, i.e., every distribution p̄ on sequences of length n, there is a lossless code such that, for all sequences z^n of length n, the number of bits needed to encode z^n is, up to rounding, −log p̄(z^n); and, conversely, every lossless code corresponds to some distribution p̄ in this way.
MDL prediction and “improper” estimation.
As is clear from the prequential interpretation of MDL given above, once a universal distribution p̄ has been fixed, one can use it to predict z_{i+1} given z^i by p̄(z_{i+1} | z^i). At least for i.i.d. data, we can estimate the underlying “true” distribution based on such predictions directly, by simply interpreting p̄(· | z^i) as an estimate of it! This is different from the previous form of MDL estimation described in Sec. 2.3, which was based on MDL (penalized ML) estimators θ̂v. Note that this standard MDL estimator is “in-model” or proper (to use machine learning terminology26), whereas p̄(· | z^i) is out-model or improper: in general, there may be no θ ∈ Θ such that p̄(· | z^i) = p_θ. For example, with Bayes universal distributions, p̄(· | z^i) will be a mixture of distributions in ℳ rather than a single element; see G07 for more discussion.
2.5. The luckiness function
The choice of a luckiness function is somewhat akin to the choice of a prior in Bayesian statistics, yet — as explained at length in Chap. 17 of G07 — there are very important differences, both technically (luckiness functions are not always integrable) and philosophically. Basically, a luckiness function just determines for what type of data one will be “lucky” [v(θ̂v(z^n)) large] and get small cumulative regret based on small samples (and presumably, good model selection results as well), and for what data one will be less lucky and get good results only when the dataset grows much larger — v may thus be chosen for purely pragmatic reasons. For example, as in Example 4 (see the italicized text there), if one assigns a large value to some model ℳγ, within a large collection of models, for which the coefficient vector is sparse, one may do this because one hopes that this sub-model will already lead to reasonable predictions of future data, even though one feels that, at the same time, when more data becomes available, a model with a much larger number of nonzero parameters may at some point almost certainly become better (Example 4). Such an interpretation is not possible with a Bayesian prior π, where a large value of π(γ) indicates a strong belief that ℳγ is true, or at least, that predictions based on acting as if it were true will be optimal — a Bayesian with high prior on ℳγ considers ℳγ likely rather than just useful — a distinction worked out in detail by Grünwald.27 Nevertheless, just like a Bayesian prior, the luckiness function has to be chosen by the user/statistician, and often contains a subjective element. Still, in contrast to Bayesian priors, since we invariably take a worst-case log-loss stance in MDL, there often is a uniquely preferable choice of luckiness function for parametric models ℳγ. First, if COMP(ℳγ, v) is finite with uniform v and no clear prior knowledge or preference is available, then uniform v is usually preferable over other v, since it achieves the worst-case optimal prediction performance. Second, if COMP(ℳγ, v) = ∞ with uniform v, for some m we can often still set the first few, say m, outcomes aside and pick the luckiness function v(θ) := p_θ(z^m) for the remaining data. The corresponding estimator (for fixed γ) based on the data z_{m+1}, …, z_n as given by (5) will then be equal to the ML estimator based on the full data, θ̂ML(z^n), and by choosing m large enough, one can often get that COMP(ℳγ, v) is finite after all for all γ, and one may then compare models by picking the γ maximizing p^nml_vγ(z_{m+1}, …, z_n). Thus, the luckiness function is now determined by the first few “start-up” data points, and one uses NML based on the remaining data points with an estimator that coincides with the ML estimator based on the full data. G07 argues why this data-driven luckiness function is the best default choice available; note that it is analogous to our use of improper Bayes priors as described in Example 2.
3. Novel Universal Distributions
3.1. The switch distribution and the AIC–BIC dilemma
The AIC–BIC dilemma (see for example Ref. 28 and the many citations therein) is a classic conundrum in the area of statistical model selection: if one compares a finite number of models, the two standard benchmark methods, with (different) asymptotic justifications, are AIC and BIC. Suppose one first selects a model using AIC or BIC. One then predicts a future data point based on, e.g., maximum likelihood estimation, or by adopting the Bayes/MDL predictive distribution, within the chosen model. If one compares a finite number of models, then AIC tends to select the one which is optimal for prediction (compared to BIC, the predictions converge faster, by a factor of order log n, to the optimal ones). On the other hand, BIC is consistent: with probability 1, it selects the smallest model containing the true distribution, for all large n; the probability that AIC selects an overly large model does not go to 0 for large n. Both the predictive-optimality and the consistency property are desirable, but, like AIC and BIC, common methods all fail on one of the two. For example, MDL, with each of the four classical universal distributions, and Bayes factor model selection will behave like BIC for large n and be consistent but prediction-sub-optimal; for any fixed k, leave-k-out and k-fold cross-validation will tend to behave like AIC and have the reverse behavior. Yang28 shows that, in general, this dilemma cannot be solved: every consistent method has to be slightly prediction-sub-optimal in some situations; he also shows that prediction by model averaging cannot solve this dilemma either.
Nevertheless, as first shown by van Erven et al.16 (who thereby solved Open Problems 8 and 17 of G07), one can design universal distributions that “almost” get the best of both worlds: basing MDL model selection on them using (1), one gets a criterion which is strongly consistent while at the same time losing only an exceedingly small-order factor in terms of prediction quality compared to the AIC-type methods. Although it can be applied to arbitrarily large model collections, the idea of this so-called switch distribution is best explained by considering the simplest case with just two nested models ℳ1 ⊂ ℳ2: one starts with two standard universal distributions (say, Bayesian or luckiness-NML), p̄1 for ℳ1 and p̄2 for ℳ2. For every t, p̄2 defines a conditional distribution p̄2(z_{t+1}, …, z_n | z^t). One now picks a “prior” distribution π on the integers [typically one that decreases polynomially, e.g., π(t) ∝ 1/((t + 1)(t + 2))], and one defines a new universal distribution for ℳ2 by

p^sw(z^n) := ∑_{t≥0} π(t) · p̄1(z^{min(t,n)}) · p̄2(z_{t+1}, …, z_n | z^t),

where the final factor is to be read as 1 when t ≥ n: with prior weight π(t), one predicts the first t outcomes using the simple model’s p̄1 and switches to the complex model’s p̄2 from time t + 1 onward.
In case the number of models on the list is larger than two or even infinite, one has to associate each model with a separate switch distribution. The technique for doing so is described by van Erven et al.29 who also give an efficient implementation and prove consistency and prediction-optimality of the switch distribution in a weak, cumulative sense for both finite and infinite numbers of models. van der Pas and Grünwald30 mathematically show the “almost” prediction-optimality for a finite number of models.
3.2. Hybrids between NML, Bayes and prequential plug-in
A problem for NML: Unknown horizon. Bayesian universal distributions with fixed priors have the property that the probability assigned to any initial sequence z^m, where m < n, is independent of the total length n of the sequence. For other universal models, such as NML, this is not always the case. Take for example the Bernoulli model extended to sequences by independence. For sequence length n = 2, the normalizing term in the NML equals 1 + 2·(1/2)·(1/2) + 1 = 5/2. For sequence length n = 3, the normalizing term equals 1 + 6·(1/3)·(2/3)² + 1 = 26/9. For n = 2, the NML probability of the sequence 00 is therefore 1/(5/2) = 0.4. However, for sequence length n = 3, the probability of the initial sequence 00 is obtained as the sum of the probabilities of the sequences 000 and 001, which becomes (1 + (1/3)·(2/3)²)/(26/9) = 31/78 ≈ 0.397. As shown by Ref. 21 (see also Ref. 31), there do exist cases in which NML is, like Bayes, horizon-independent, but these are very rare — see the Appendix.
The above discrepancy between the initial sequence probabilities for different sequence lengths may be a problem in situations where we need to obtain predictions without necessarily knowing the total length of the sequence, or horizon. Another, related issue is that even if the total sequence length is given, it can be computationally expensive to obtain marginal and conditional probabilities of initial sequences. One possible solution would be to restrict to Bayesian universal distributions. However, while these solve the horizon issue, (a) they are still often computationally inefficient, and (b) they lack NML-style worst-case regret interpretations. This has spurred research into universal codes that can be calculated without knowing the horizon in advance and that behave better as regards (a) or (b), which we now review.
3.2.1. Prequential plug-in and its expected regret
The most straightforward way to deal with issue (a) is to use the prequential plug-in distribution p^preq, which, by construction, is horizon-independent. However, for the prequential plug-in (10), the BIC asymptotics (14) only hold in expectation if the data are sampled from one of the distributions in the model ℳ. This makes the result much weaker than for the other five universal distributions considered, for which the asymptotics hold for every individual sequence in some large set, i.e., without making any stochastic assumptions at all. One might thus wonder what happens for general data. Extending earlier work by Takeuchi and Barron,32 Kotłowski et al.33 show that, if data are sampled from a distribution p* outside the model, and p_θ̃ is the distribution in ℳ that is closest in Kullback–Leibler (KL) divergence to p*, then (14) holds in expectation, with a correction term involving the variances of both distributions; for one-dimensional models, we get

E_{p*}[−log p^preq(Z^n)] = E_{p*}[−log p_θ̃(Z^n)] + (1/2)·(σ₁²/σ₂²)·log n + O(1),

where σ₁² and σ₂² denote the variance (of the relevant sufficient statistic) under p* and under p_θ̃, respectively; the standard (1/2) log n term is recovered when the two variances coincide, as they do whenever p* ∈ ℳ.
Solving Open Problem 3 from G07, Kotłowski et al.33 also provide a new universal distribution in which, for any given estimator θ̆, the prediction p_{θ̆(z^{i−1})} is turned into a slightly “flattened” version, which is not in ℳ any more (it is not an in-model estimator), but which does achieve the standard asymptotics without correction. For example, in case of the normal location family with fixed variance σ², it coincides with a Bayesian predictive distribution based on a standard conjugate prior, which in this case is a normal with mean θ̆(z^{i−1}) (the Bayes MAP estimate) but with slightly inflated variance. More generally, the flattened distribution becomes a hybrid between the estimator and a Bayes predictive distribution, but it has the advantage over the latter that it can be calculated without performing an integral over the parameter space. It thus provides an alternative to the NML distribution that is horizon-free and often faster to compute.
Roos and Rissanen35 and Rissanen et al.36 developed other prequential, horizon-free universal codes that are non-Bayesian, yet remain closer to NML in spirit than p^preq. They work out the details for discrete models including Bernoulli as well as linear regression models. For Bernoulli models, the resulting universal code coincides with the so-called one-step lookahead model proposed earlier by Takimoto and Warmuth.37 For linear regression models, the asymptotic consistency of the resulting model selection criterion was studied by Määttä et al.38 and Määttä and Roos.39 Relatedly, Watanabe and Roos40 show that no horizon-independent strategy can be asymptotically minimax in the multinomial case and propose simple Bayesian universal models with a horizon-dependent Dirichlet prior that achieve asymptotic minimaxity and simplify earlier proposals. Among the proposed priors is the Dirichlet prior with hyper-parameter 1/2 − ln 2/(2 ln n), which converges to Jeffreys’ prior but has a mild dependency on the horizon n.
3.3. Hypothesis testing: Universal distributions based on the reverse information projection
Suppose we compare just two models, ℳ0 and ℳ1, as explanations for data z^n, a situation similar to classical null hypothesis testing, the standard method for evaluating new treatments in the medical sciences and scientific hypotheses in most applied sciences such as psychology, biology and the like: we can think of ℳ0 and ℳ1 as two hypotheses, where, usually, ℳ0 represents the status quo (“treatment not effective”, “coin unbiased”). In case ℳ0 = {p_0} represents a simple (singleton) hypothesis, there is a strong additional motivation for using MDL as a hypothesis testing method, and in particular, for quantifying the evidence against ℳ0 in terms of the code length difference

−log p_0(z^n) − [−log p̄1(z^n)],

the number of bits saved when encoding the data with the universal code for ℳ1 rather than with p_0: by the no-hypercompression inequality,

for all K ≥ 0: p_0(−log p̄1(Z^n) ≤ −log p_0(Z^n) − K) ≤ 2^{−K},   (27)

the probability under the null that the alternative compresses the data by K or more bits better than p_0 is exponentially small in K, so that large observed code length differences come with a frequentist Type-I error guarantee.
But all this holds only for simple ℳ0. Yet the tests most used in practice, such as the t-test and contingency table tests, all involve composite ℳ0. For composite ℳ0, the no-hypercompression inequality (27) usually only holds for some p_0 ∈ ℳ0, but for Type-I error guarantees and the like we would want to have it hold for all p_0 ∈ ℳ0. That is, we would like to employ universal distributions p̄0 and p̄1 such that we have

for all p_0 ∈ ℳ0 and all K ≥ 0: p_0(−log p̄1(Z^n) ≤ −log p̄0(Z^n) − K) ≤ 2^{−K}.   (28)

The Reverse Information Projection (RIPr) of p̄1 onto ℳ0 provides, in quite some generality, a p̄0 for which (28) holds.17
Example 6 (Right-Haar priors and the Bayesian t-test). In a series of papers (highlights include Refs. 43 and 44), Berger and collaborators established Bayes factor testing methods for composite ℳ0 where the only free parameters in ℳ0 are “nuisance” parameters that are shared with ℳ1 and are governed by a group structure. A prime example is the unknown variance in the t-test. Berger uses a special type of improper prior, the so-called right-Haar prior, which can be defined for every such type of nuisance parameter. While Bayes factors usually do not combine well with improper priors, the Bayes factors for group invariance parameters equipped with the right-Haar prior behave remarkably well. Grünwald et al.17 show that, even though the right-Haar priors are usually improper, they can also be understood from a purely MDL perspective: if ℳ0 and ℳ1 are equipped with the right-Haar prior on the nuisance parameters, and the prior on the additional parameters in ℳ1 satisfies some additional requirements, then both p̄0 and p̄1 can be interpreted as sequential prediction strategies, and the log of the Bayes factor can be interpreted as the code length/cumulative-log-loss difference. Moreover, p̄0 is (essentially) the RIPr of p̄1 onto ℳ0, and the no-hypercompression inequality (28) that is so desirable from a frequentist perspective holds uniformly for all p_0 ∈ ℳ0.
Let us consider the one-sample Bayesian t-test as an example. Here ℳ0 is the set of all normal distributions with mean 0; the variance σ² is a free parameter. ℳ1 is the set of all normal distributions with both the mean μ and the variance σ² as free parameters. The question of interest is to establish whether μ = 0 or not; σ² is an unknown “nuisance” parameter — it determines the scale of the data but is not itself of intrinsic interest. In the Bayesian t-test one equips both ℳ0 and ℳ1 with the improper right-Haar prior on the variance, w(σ) ∝ 1/σ. To complete the definition of p̄1, the effect size δ := μ/σ is equipped with a conditional prior density (given σ). This second density has to be symmetric around zero and proper (this is what we called the “additional requirement on the prior on ℳ1”, instantiated to the case where the nuisance parameter is a variance). One now proceeds by testing using the Bayes factor p̄1(z^n)/p̄0(z^n). In this special case, the procedure was already suggested by Jeffreys,45 and the right-Haar prior coincides with Jeffreys’ prior on the variance. Berger et al. extend the method to general group-invariant parameter vectors such as the joint mean and variance in the two-sample t-test, testing a Weibull against a log-normal, and many other scenarios.
4. Graphical Models
Graphical models are a framework for representing multivariate probabilistic models in a way that encompasses a wide range of well-known model families, such as Markov chains, Markov random fields and Bayesian networks; for a comprehensive overview, see Ref. 46. A key property of a graphical model is parsimony, which can mean, for instance, a low-order Markov chain or more generally a sparse dependency graph that encodes conditional independence assumptions. Choosing the right level of parsimony in graphical models is an ideal problem for MDL model selection.
In Bayesian network model selection the prevailing paradigm is, unsurprisingly, the Bayesian one. Especially the works of Geiger and Heckerman47 and Heckerman et al.48 have been extremely influential. The main workhorse of this approach is the so-called Bayesian Dirichlet (BD) family of scores, which is applicable in the discrete case where the variables being modeled are categorical. Given a data sample, such scores assign a goodness value to each model structure. Exhaustive search for the highest scoring structure is possible when the problem instance (characterized by the number of random variables) is of limited size, but heuristic search techniques such as variants of local search or “hill-climbing” can be used for larger problems.
Different forms of the BD score imply different Dirichlet priors (different hyper-parameters) for the local multinomial distributions that comprise the joint distribution. For example, in the commonly used BDeu score, the priors are determined by a single hyper-parameter, the equivalent sample size α. For a variable with r distinct values and parents that can take q possible combinations of values (configurations), the BDeu prior is the Dirichlet distribution with all hyper-parameters equal to α/(qr). One of the main motivations for adopting this prior is that it leads to likelihood equivalence, i.e., it assigns equal scores to all network structures that encode the same conditional independence assumptions. In light of the fact that Bayesian model selection embodies a particular variation of MDL, these methods fit, at least to some extent, in the MDL framework as well. However, there also exist more “pure”, non-Bayesian MDL methods for model selection in Bayesian networks; we mention Refs. 49 and 50 as early representative examples. These early methods are almost invariably based on the two-part coding framework. More recently, several studies have proposed new model selection criteria that exploit the NML distribution. One approach is a continuous relaxation of NML-type complexities proposed by Miyaguchi et al.51 in which the model selection problem takes on a tractable Lasso-type ℓ1-minimization form (see also Example 4). In other approaches, NML [or usually, approximations (but not relaxations) thereof] is used directly for encoding parts of the model; we now describe these latter approaches in a bit more detail.
4.1. Factorized NML and variants
Silander et al.52 propose the factorized NML (fNML) score for Bayesian network model selection, which was designed to be decomposable, meaning that it can be expressed as a sum that includes a term for each variable in the network. This property facilitates efficient search among the super-exponential number of possible model structures; see, e.g., Ref. 48. The fNML score factors the joint likelihood not only in terms of the variables but also in terms of the distinct configurations of each variable’s parents. Each factor in the product is given by a multinomial NML probability, for which a linear-time algorithm by Kontkanen and Myllymäki53 can be used.
A similar idea, where a Bayesian network model selection criterion is constructed by piecing together multiple NML models under the multinomial model, was proposed recently by Silander et al.54 In the proposed quotient NML (qNML) score, the local scores corresponding to each variable in the network are defined as log-quotients of the form

log ( p^nml_{X_i ∪ Pa_i}(z^n) / p^nml_{Pa_i}(z^n) ),

where Pa_i denotes the parent set of variable X_i, and p^nml_S is the NML distribution obtained by treating the joint value configurations of the variables in S as the outcomes of a single multinomial model.
Eggeling et al.55 apply similar ideas to a different model class, namely parsimonious Markov chains. There too, the likelihood is decomposed into factors depending on the configurations of other variables, and each part in the partitioning is modeled independently using the multinomial NML formula. The authors demonstrate that the fNML-style criterion they propose leads to parsimonious models with good predictive accuracy for a wide range of different scenarios, whereas the corresponding Bayesian scores are sensitive to the choice of the prior hyper-parameters, which is important in the application where parsimonious Markov chains are used to model DNA binding sites.56
In all these papers, both simulated and real-world data experiments suggest that the MDL-based criteria are quite robust with respect to the parameters in the underlying data source. In particular, the commonly used Bayesian methods (such as the BDeu criterion) that are being used as benchmarks are much more sensitive and fail when the assumed prior is a poor match to the data-generating model, whereas the MDL methods are invariably very close to the Bayesian methods with the prior adapted to fit the data. This poses interesting questions concerning the proper choice of priors in the Bayesian paradigm.
In fact, the prevalence of the Bayesian paradigm and the commonly used BD scores is challenged by two recent observations. First, Silander et al.52 show that the BD score based on the Dirichlet prior with all hyper-parameters equal to 1/2 — the invariant Jeffreys’ prior for the multinomial model, which is, however, not likelihood-equivalent when used in the BD score — is very close to the fNML score and, consequently, enjoys better robustness properties than the BDeu score, which is the likelihood-equivalent BD score variant. Second, Suzuki57 shows that the BDeu criterion is irregular, i.e., prone to extreme overfitting in situations where a deterministic relationship between one variable and a set of other variables holds in the data sample. The MDL scores discussed above are regular in this respect, and their robustness properties seem to be better than those of the BD scores; see Ref. 54.
4.2. Asymptotic expansions for graphical models
Asymptotic results concerning MDL-based criteria in graphical models are interesting for several reasons. For one, they lead to efficient scores that can be evaluated for thousands of different model structures. Second, asymptotic expansions can lead to insights about the relative complexity of different model structures.
Various asymptotic forms exist for the point-wise and the expected regret, depending on the model class in question. For convenience we repeat the classical expansion of the NML regret/model complexity (shared by the Bayesian marginal likelihood with Jeffreys’ prior) that applies for regular model classes ℳ for which COMP(ℳ, v) is finite with uniform v ≡ 1 (see Sec. 2.2 above):

COMP(ℳ, 1) = (k/2) log(n/2π) + log ∫_Θ √(det I(θ)) dθ + o(1).   (29)
For discrete data scenarios, by far the most interesting case is the multinomial model (the extension of the Bernoulli distribution to an i.i.d. sequence of K-valued categorical random variables), since it is a building block of a number of MDL criteria such as fNML and qNML (see above). There are many asymptotic expansions for the NML regret under the multinomial model. Probably the most useful is the one proposed by Szpankowski and Weinberger,58 which remains accurate even in the regime where the number of values K grows with, and may be comparable to, the sample size n.
Roos59 and Zou and Roos60 studied the second term in the expansion (29), namely the Fisher information integral, under Markov chains and Bayesian networks using Monte Carlo sampling techniques. This approach reveals systematic differences between the complexities of models even if they have the same number of parameters.
5. Latent Variable and Irregular Models
Although thus far we have highlighted exponential family and regression applications, NML and other universal distributions can of course be used for model selection and estimation in complete generality, and many practical applications are in fact based on highly irregular models. Often, "classical" two-part distributions (based on discretized models) are used, since NML distributions often pose computational difficulties. However, Yamanishi and collaborators have developed tractable approximations of NML-type distributions for some of the most important irregular (i.e., non-exponential-family) models, such as hierarchical latent variable models61 and the related Gaussian mixture models.62,63 Suzuki et al.64 provide an NML approach to nonnegative matrix factorization. Two-part codes (and corresponding MDL estimators) for mixture families that come close to achieving the minimax regret were considered very recently by Miyamoto et al.65
When it comes to asymptotic approximations for code lengths/log-likelihoods based on NML and other universal distributions, all approximations so far (those in Sec. 2.2) were derived essentially under the assumption that the model under consideration is an exponential family. Extensions to curved exponential families and generalized linear models are relatively straightforward (see G07 for details). For more irregular models, Watanabe has proposed the widely applicable information criterion (WAIC) and the widely applicable Bayesian information criterion (WBIC); see Refs. 66 and 67. The latter can be viewed as an asymptotic expansion of the log-likelihood based on a Bayesian universal distribution; it coincides with BIC when applied to regular models but is applicable even for singular (irregular) models. The asymptotic form of WBIC is

$$\mathrm{WBIC} = n L_n(w_0) + \lambda \log n + O_p(\sqrt{\lambda \log n}),$$

where $n L_n(w_0)$ denotes the negative log-likelihood at the optimal parameter $w_0$ and $\lambda$ is the real log canonical threshold (learning coefficient) of the model; for regular models $\lambda = k/2$, so that WBIC reduces to BIC.
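As an illustration of the definition behind this expansion, the following sketch computes WBIC for a toy one-parameter Gaussian model as the posterior expectation of the empirical log-loss at inverse temperature $\beta = 1/\log n$, approximated here on a grid with a flat prior; all names are ours, and the model is deliberately regular, so WBIC should come out close to BIC.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(1.0, 1.0, size=200)  # data from N(1, 1)
n = len(x)

# Model: N(mu, 1), flat prior for mu on a bounded grid.
mus = np.linspace(-5.0, 5.0, 2001)
# nll[j] = n * L_n(mu_j): negative log-likelihood of the data under mu_j.
nll = np.array([0.5 * np.sum((x - m) ** 2) + 0.5 * n * np.log(2 * np.pi)
                for m in mus])

# WBIC: posterior mean of n * L_n at inverse temperature beta = 1/log(n).
beta = 1.0 / np.log(n)
logw = -beta * nll
w = np.exp(logw - logw.max())
w /= w.sum()
wbic = float(np.sum(w * nll))

# For a regular model, WBIC should approximate BIC = min nll + (k/2) log n.
bic = float(nll.min() + 0.5 * np.log(n))
print(f"WBIC = {wbic:.2f}, BIC = {bic:.2f}")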
6. Frequentist Convergence of MDL and Its Implications
Rissanen first formulated the MDL Principle as, indeed, a Principle: one can simply start by assuming, as an axiom, that modeling by data compression (or, equivalently, sequential predictive log-loss minimization) is the right thing to do. One can also take a more conventional, frequentist approach and check whether MDL procedures behave desirably under standard frequentist assumptions. We now review results showing that, in general, they do, thus providing a frequentist justification of MDL ideas: with some interesting caveats, MDL model selection is typically consistent (the smallest model containing the true distribution is eventually chosen, with probability one), and MDL prediction and estimation achieve good rates of convergence (the Hellinger distance between the estimated and the true density goes to zero, with high probability, quite fast). In this section we review the most important convergence results. In particular, Sec. 6.1 shows that the link between data compression and consistent estimation is in fact very strong, and Sec. 6.4 shows that, by taking MDL as a principle, one can get useful intuitions about hard questions concerning deep learning; these intuitions can then, as a second step, be validated once again by frequentist results.
Thus, let us assume, as is standard in frequentist statistics, that data are drawn from a distribution in one of the models under consideration. We consider consistency and convergence properties of the main MDL procedures in their main applications: model selection, prediction and estimation.
Model selection
For model selection between a finite number of models, all universal codes mentioned here are consistent in wide generality; for example, this has been explicitly proven for the case in which the data are i.i.d. and all models on the list are exponential families, but results for more complex models with dependent data have also been known for a long time; see G07 for an overview of results. If the collection of models is countably infinite, then results based on associating each model $\mathcal{M}_j$ with a prior weight $\pi(j)$ (equivalently, a code length for the model index) have also been known for a long time; such results typically hold for "almost all" (suitably defined) distributions in all the $\mathcal{M}_j$; again, see G07 for a discussion of the (nontrivial) "almost all" requirement. These countable-$\mathcal{M}$ consistency results were extended to the switch distribution by van Erven et al.29
Prediction and “improper” estimation
As to sequential prediction (Sec. 2.4), rate of convergence results are very easy to show (see Chap. 15 of G07), but these typically only demonstrate that the cumulative log-loss prediction error of sequentially predicting with a universal distribution behaves well as $n$ increases. Thus, since the sum of prediction errors is small, say (for parametric models) of order $\log n$, for most $i$ the individual prediction error at the $i$th sample point must be of order $1/i$, since $\sum_{i=1}^{n} 1/i \approx \log n$. Still, it remains an open question how to prove, for an individual $i$, what exactly the expected prediction error at that specific $i$ is. Since one can view each prediction as an "improper" estimate (end of Sec. 2.4), the convergence rates of the resulting estimators, which estimate the underlying distribution based on a sample of size $i$, usually also behave well in a cumulative sense, but again it is very hard to say anything about an individual $i$. The asymptotic expansions (15) and (16) imply that, for fixed parametric models, the NML and Bayesian universal distributions achieve optimal cumulative prediction and estimation errors. If, however, they are defined relative to a full model class consisting of at least two nested models, then they may fail to achieve the optimal rates by a $\log n$ factor. van Erven et al.29 show that sequential prediction/estimation based on the switch distribution achieves the minimax-optimal cumulative prediction/estimation error rates even in such cases. van der Pas and Grünwald30 show that, if only two models are compared, then the optimal rate obtainable at each individual sample size by any consistent procedure is achieved as well.
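The cumulative-versus-individual distinction can be seen in a small simulation: a Bayesian universal distribution for the Bernoulli model (the Krichevsky–Trofimov predictor, i.e., the Bayes predictive distribution under Jeffreys' prior) has cumulative log-loss regret growing like $(1/2)\log n$, so most individual prediction errors at sample point $i$ must shrink like $1/i$. The script below (our own illustration, not taken from the cited works) checks the cumulative part.

import random
from math import log

random.seed(1)
n, theta = 10_000, 0.3
xs = [random.random() < theta for _ in range(n)]

# Sequential KT/Bayes predictor: P(next = 1 | k ones among i outcomes)
# = (k + 1/2) / (i + 1).
cum_logloss, k = 0.0, 0
for i, x in enumerate(xs):
    p1 = (k + 0.5) / (i + 1)
    cum_logloss += -log(p1 if x else 1.0 - p1, 2)
    k += x

# Regret = cumulative log-loss minus that of the best Bernoulli in hindsight.
ml = k / n
best = -sum(log(ml if x else 1.0 - ml, 2) for x in xs)
print(f"regret: {cum_logloss - best:.2f} bits, "
      f"(1/2) log2 n: {0.5 * log(n, 2):.2f} bits")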
6.1. Frequentist convergence of MDL estimation
Very strong results exist concerning the convergence of MDL estimation based on an MDL estimator as given by (5). A first, classical result was already stated in the ground-breaking work of Barron and Cover,68 establishing that consistency and good convergence rates can be obtained for the special case of a two-part-code estimator based on a prior probability mass function $W$, as long as $W$ satisfies a suitable condition and puts sufficient prior mass in a KL neighborhood of the true distribution. These results were greatly extended by Zhang69,70 and further, very recently, by Grünwald and Mehta.13 The latter consider (5) for general luckiness functions. Let $\mathcal{M}$ be a statistical model and suppose data are i.i.d. according to some distribution in $\mathcal{M}$. They find that a sufficient condition for consistency is that, for some $\eta$, the generalized model complexity (32) is finite.
6.2. From MDL to Lasso
As illustrated in Example 4, when used for high-dimensional variable selection, the original MDL approach would be to use a mixed two-part/one-part code as in (1) with a term to account for the model index $\gamma$. In such settings, there may well be $p$ variables of interest, each of which may or may not be included in the model, so that the minimization over $\gamma$ requires trying out $2^p$ choices, which is practically infeasible. For this reason, in practice people have strongly preferred the Lasso and related methods based on $\ell_1$-penalties, which take linear rather than exponential time in $p$ (note that classic MDL essentially penalizes by an $\ell_0$-penalty). However, Barron and Luo,71 and Barron et al.72 showed that, under some conditions on the true distribution (such as Gaussian noise), the Lasso method can be re-interpreted in terms of code length minimization after all; see also Ref. 73 and, for further extensions, Refs. 74 and 75. For a different approach to unifying model selection with very high-dimensional models via the luckiness NML, see Miyaguchi and Yamanishi.23
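Schematically (our sketch, assuming Gaussian noise with known variance $\sigma^2$), the codelength reading of the Lasso is that its objective is, up to constants not depending on $\beta$, a two-part-style description length in which the $\ell_1$ penalty plays the role of $-\log$ of a Laplace-type luckiness function/prior $w$:

$$-\log p(y^n \mid \beta) - \log w(\beta) = \frac{1}{2\sigma^2} \sum_{i=1}^{n} \bigl(y_i - x_i^\top \beta\bigr)^2 + \lambda \sum_{j=1}^{p} |\beta_j| + \text{const}, \qquad w(\beta) \propto e^{-\lambda \|\beta\|_1}.$$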
Although some of the details may differ, it seems that most of these works are subsumed by the aforementioned result of Grünwald and Mehta,13 who show that general penalized estimators can be re-interpreted as minimizing a one-part code length as long as the penalty is suitably bounded, and can be proven consistent under the (still quite weak) condition that the generalized complexity as in (32) is bounded for some $\eta$. Thus, the connection between MDL and general penalization methods (including the Lasso and other $\ell_1$-penalties, but also methods with entirely different penalties) is substantially stronger than it seemed before these developments took place.
Supervised machine learning
Importantly, all the works mentioned here except Ref. 13 cannot show convergence under misspecification; for example, when applied to the Lasso, they would require an assumption of normal noise (corresponding to the squared error used in the Lasso fit, which is equivalent to the log-loss under a normal distribution for the noise). In practice, though, the Lasso (with the squared error) is often used in cases in which one cannot assume normally distributed errors. Reference 13 contains results that can still be used in such cases [although the formula for the generalized complexity changes], based on ideas which we sketch in the following sub-section.
More generally, one of the major areas within machine learning is supervised learning, in which one assumes that data $(X_1, Y_1), (X_2, Y_2), \ldots$ are i.i.d., with $X_i \in \mathcal{X}$ and $Y_i \in \mathcal{Y}$, and one aims to use the data to learn a predictor function $f$ that has small expected loss or risk, defined as $\mathbf{E}[\ell(Y, f(X))]$, where $\ell$ is some loss function of interest and $f$ is a member of some "predictor model" $\mathcal{F}$. For example, the statistical notion of "regression with random design" corresponds, in machine learning, to a supervised learning problem with $\mathcal{Y} = \mathbb{R}$ and $\ell$ the squared error loss. Early MDL convergence results do not cover this "supervised" situation: they are not equipped to handle either random design or loss functions beyond the log-loss. Some of the more recent works mentioned above are able to handle random design but not general loss functions (for example, for Lasso-type applications they require the noise to be normally distributed). Reference 13 seems to be the first that can fully handle supervised learning scenarios: the convergence results can be used with random design, and they can also be used with large classes of loss functions, including the squared error (without normality assumption) and the zero/one-loss. This is achieved by associating predictors $f$ with densities $p_f$, so that the log-loss of $p_f$ on data $(x, y)$ becomes linearly related to the loss of $f$ on $(x, y)$; the analysis then proceeds by analyzing convergence of MDL for the densities $p_f$ as a misspecified probability model.
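One standard construction achieving this (a sketch; the precise construction in Ref. 13 differs in some details) associates with each predictor $f$ and a learning rate $\eta > 0$ the conditional density

$$p_f(y \mid x) = \frac{e^{-\eta\, \ell(y, f(x))}}{Z(\eta, x)}, \qquad Z(\eta, x) = \int e^{-\eta\, \ell(y, f(x))}\, dy,$$

so that $-\log p_f(y \mid x) = \eta\, \ell(y, f(x)) + \log Z(\eta, x)$; whenever $Z$ does not depend on $f$ (as it does not for the squared error with $\mathcal{Y} = \mathbb{R}$), the log-loss is indeed linearly related to the loss of $f$.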
6.3. Misspecification
As beautifully explained by Rissanen,76 one of the main original motivations for MDL-type methods is that they have a clear interpretation independent of whether any of the models under consideration is "true" in the sense that it generates the data: one chooses a model minimizing a code length, i.e., a prediction error on unseen data, which is meaningful and presumably useful irrespective of whether the model is true (Rissanen even argues that the whole notion of a "true model" is misguided). This model-free paradigm also leads one to define the NML distribution as minimizing prediction error in a stringent worst-case-over-all-data sense [Eq. (13)] rather than a stochastic sense. Nevertheless, it is of interest to see what happens if one samples data from a distribution for which all models under consideration are wrong, but some are quite useful in the sense that they lead to pretty good predictions. Doing this leads to rather unpleasant surprises: as first noted by Grünwald and Langford,77 MDL (and Bayesian inference) can become inconsistent: one can give examples of a countably infinite collection of models $\mathcal{M}_1, \mathcal{M}_2, \ldots$ and a "true" data-generating distribution $P^*$ outside all of them such that, when data are sampled i.i.d. from $P^*$, MDL will tend to select a sub-optimal model for all large $n$: while all sub-models are wrong, one of them is optimal in several intuitive respects (closest in KL divergence to $P^*$, leading to the best predictions under a number of loss functions), yet it will not be selected for large $n$. While the models considered by Grünwald and Langford77 were quite artificial, Grünwald and van Ommen78 showed that the same can happen in a more natural linear regression setting; moreover, they also showed that even if the collection of models is finite, so that eventually MDL will select the best sub-model, for even relatively large $n$ it may select arbitrarily bad sub-models. De Heide79 shows that the problem also occurs with MDL and Bayesian regression on some real-world datasets.
It turns out that the root of the problem is related to the no-hypercompression property (27). If the collection of models contains the density $p^*$ of the "true" distribution $P^*$, then any distribution $\bar{p}$ will satisfy no-hypercompression relative to the true $p^*$: for every $K > 0$,

$$P^*\bigl( -\log \bar{p}(X^n) \leq -\log p^*(X^n) - K \bigr) \leq 2^{-K},$$

that is, the probability that $\bar{p}$ assigns the data a code length more than $K$ bits shorter than $p^*$ does is exponentially small in $K$.
6.4. PAC-MDL bounds and deep learning
One of the great mysteries of modern deep learning methods in machine learning is the following82: deep learning is based on neural network models which can have many millions of parameters. Although such networks are typically trained on very large samples, the sample size $n$ is usually still small enough relative to the number of parameters that the data can be fit perfectly, with zero error on the training set. Still, the trained models often perform very well on future test sets. How is this possible? At first sight this contradicts the tenet, shared by MDL and just about every other statistical approach, that good generalization requires the models to be "small" or "simple" [small complexity in MDL analyses, small VC dimension or small entropy numbers in statistical learning analyses] relative to the sample size. One of several explanations (which presumably all form a piece of the puzzle) is that the local minimum of the error function found by the training method is often very broad: if one moves around in parameter space near the minimum, the fit hardly changes. Hochreiter and Schmidhuber83 already observed that describing the weights in sharp minima requires high precision in order not to incur a nontrivial excess error on the data, whereas flat minima can be described with substantially lower precision, thus forging a connection to the MDL idea; in fact, related ideas already appear in Ref. 84. In these papers, the MDL Principle is used in a manner that is less direct than what was done thus far in this paper: we (and, usually, Barron and Rissanen) directly hunt for the shortest description of the data. In contrast, the aforementioned authors simply note that, no matter how a vector of parameters for a model was obtained, if the data can be compressed substantially with the help of those parameters, for example by coding first the parameters and then the data given the parameters, then, if we believe the MDL Principle, the model (network) with these parameters should generalize well to future data. In modern practice, neural networks are often trained with stochastic gradient descent (SGD), and it has been found empirically that networks that generalize well do tend to have parameters lying in very flat minima.
While this use of the MDL Principle seems less precise than what we reviewed earlier in this paper, it can once again be given a frequentist justification, and this justification is mathematically precise after all: the so-called PAC-Bayesian generalization bounds85 show that the generalization performance of any classifier can be directly linked to a quantity that gets smaller as soon as one needs (a) fewer bits to describe the parameters and (b) fewer bits to describe the data given the parameters; both the results and their proofs are very similar to the MDL convergence results of Barron and Cover,68 Zhang,69,70 and Grünwald and Mehta.13 Although in general the formulation is not as straightforward as a simple sum of the two description lengths (a) and (b), the connections to both the two-part code length and the Bayesian code length are quite strong, as was already noticed by Blum and Langford.86 In particular, for discrete parameter sets, such PAC-Bayes bounds contain a term that can be interpreted as the number of bits needed to encode the parameters using a code based on some distribution; for general, uncountable parameter sets, this term gets replaced by a KL divergence term that can still be related to a code length via a so-called "bits back" argument pioneered by Hinton and van Camp.84 Dziugaite and Roy,87 and Zhou et al.,82 inspired by earlier work by Langford and Caruana,88 indeed show that, for some real-world datasets, one can predict nontrivial generalization of deep neural nets by looking at the number of bits needed to describe the parameters and applying PAC-Bayesian bounds.
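For concreteness, a representative PAC-Bayesian bound (a McAllester-style form; this is a sketch, not the exact bound used in Refs. 82 and 87) reads as follows: for any prior $\pi$ on predictors, with probability at least $1 - \delta$ over an i.i.d. sample of size $n$, simultaneously for all distributions $\rho$ on predictors,

$$\mathbf{E}_{h \sim \rho}\bigl[L(h)\bigr] \;\leq\; \mathbf{E}_{h \sim \rho}\bigl[\hat{L}_n(h)\bigr] + \sqrt{\frac{D(\rho \,\|\, \pi) + \log(2\sqrt{n}/\delta)}{2n}},$$

where $L$ is the risk, $\hat{L}_n$ the empirical risk, and $D(\rho \,\|\, \pi)$ the KL divergence term that generalizes the "number of bits to describe the parameters": for $\rho$ concentrated on a single discrete $h$, it reduces to $\log(1/\pi(h))$, i.e., a code length.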
7. Concluding Remarks
We have given a self-contained introduction to MDL, incorporating and highlighting recent developments. Of necessity, we had to make a choice as to what to cover in detail, and there are many things we omitted. We would like to end by briefly mentioning three additional developments. First, there has always been the question of how MDL relates to other complexity notions such as those considered in the statistical learning theory literature26: Vapnik–Chervonenkis dimension, entropy numbers, Rademacher complexity and so on. A major step towards understanding the relation was made by Grünwald and Mehta,13 who show that for probability models with members of the form $p_f \propto e^{-\eta \ell_f}$, where $\ell$ is an arbitrary bounded loss function and $f$ ranges over a predictor class $\mathcal{F}$, the NML complexity can be precisely bounded in terms of the Rademacher complexity defined relative to $\mathcal{F}$. Second, we should note that Rissanen's own views and research agenda have steered in a direction somewhat different from the developments we describe: Rissanen89 published Information and Complexity in Statistical Modeling, which proposes foundations of statistics in which no underlying "true model" is ever assumed to exist. As Rissanen writes, "even such a well-meaning statement as "all models are wrong, but some are useful", is meaningless unless some model is 'true'." Rissanen expands MDL and NML ideas in the direction of the Kolmogorov structure function, taking the idea of distinguishable distributions underlying Ref. 19 as fundamental; while presumably compatible with the developments we describe here, the emphasis of this work is quite different.
We end with a word about applications: since 2007, numerous applications of MDL and MDL-like techniques have been described in the literature; as discussed in Sec. 6.2, highly popular methods such as the Lasso and Bayes factor methods can often be seen as "MDL-like". Even as to specific "pure" MDL applications (such as those based on NML and two-part codes), the number and scope of applications are simply too large to give a succinct representative overview. However, there is one particular area which we would like to mention specifically, since it had hardly seen any MDL applications before 2007, whereas nowadays such applications are flourishing: the field of data mining. Some representative publications are Refs. 90–92. Most of this work centers on the use of two-part codes, but sometimes NML and other sophisticated universal distributions/codes are used as well.93
Notes
a In this sub-section we view, for notational convenience, the elements of $\mathcal{M}$ as probability distributions with densities or mass functions $p$.
Appendix A. When the Original NML is Undefined: Details, Open Problems and Their Solutions
The original NML distribution with uniform $v \equiv 1$ relies on the finiteness of the Shtarkov integral; its asymptotic expansion (15) relies on the finiteness of the Jeffreys integral $\int_{\Theta} \sqrt{\det I(\theta)}\, d\theta$, the latter being equivalent to the requirement that Jeffreys' prior is proper. Both are quite strong requirements; for infinite sample spaces, they "usually" (that is, in most models one considers in practice, such as the normal, exponential and Poisson models) do not hold, but once one restricts the parameter space to an INECCSI set, they generally do hold. This may lead one to conjecture that the Shtarkov integral is finite if and only if the corresponding Jeffreys integral is finite. Resolving this conjecture was posed as an open problem by G07; Grünwald and Harremoës94 and Bar-Lev et al.95 show that, in general, the conjecture is wrong, though, for exponential families, under a very mild additional condition, it holds true.
From a more practical perspective, one would of course like to know which universal distribution to use when the standard NML is undefined. Several proposals floated around in the early 2000s; for an overview, see Chap. 11 of G07. By now, the dominant method has become to factor in a nonuniform luckiness function and calculate the luckiness NML as in (11). This method was originally called luckiness NML-2 by G07, which (among many other methods) identified several "luckiness" versions of NML that had been proposed by various authors; luckiness NML-2 turned out to be both more practically usable and more mathematically analyzable than the other methods, and in this text we simply call it luckiness NML. In particular, Suzuki and Yamanishi11 show that, for exponential family models, the $n$-dimensional integral in the luckiness NML can be replaced by a $k$-dimensional one ($k$ being the number of parameters), which in many cases can be evaluated explicitly. As we indicated in Sec. 2.5, one can sometimes set the first few examples aside as start-up data to define a luckiness function, leading to conditional NML. Again, G07 defined different forms of conditional NML, and again conditional NML-2 (directly based on luckiness NML-2) turned out to be the most natural one: Bartlett et al.21 show that for some important classes of models, the conditional NML distributions and the Bayes marginal distributions with improper Jeffreys' prior coincide exactly, and not just asymptotically, for each sample size. Moreover, for the case of one-dimensional families, they completely characterize the class of models for which this holds: essentially, it holds for exponential families that are also location or scale families, i.e., the normal and gamma distributions and monotone transformations thereof (such as, e.g., the Rayleigh distributions), as well as for one curious additional family. This correspondence between objective Bayesian and conditional NML-2 approaches notwithstanding, Kojima and Komaki96 show that "conditional NML-3", which G07 considered the most intuitive version but at the same time mathematically too complicated for practical use, can be given a practical implementation after all, thereby solving Open Problem 7 of G07.
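For reference, the luckiness NML of (11) has, schematically, the following form (a sketch of the general shape, writing $\max$ for $\sup$ and assuming it is attained): the luckiness-weighted maximized likelihood, normalized over all data sequences,

$$\bar{p}_v(x^n) = \frac{\max_{\theta}\, v(\theta)\, p_{\theta}(x^n)}{\int \max_{\theta}\, v(\theta)\, p_{\theta}(z^n)\, dz^n},$$

which reduces to the original NML when $v \equiv 1$ and is well defined whenever the normalizing integral, the luckiness Shtarkov integral, is finite.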
Acknowledgments
We would like to thank Kenji Yamanishi, Matthijs van Leeuwen and Bin Yu for providing references to new MDL work, Jun-Ichi Takeuchi for prompting us to write this paper and Jorma Rissanen and Andrew Barron for many fruitful conversations over the last 20 years. This work is part of the research programme Safe Bayesian Inference with Project No. 617.001.651, which is financed by the Dutch Research Council (NWO).