Internal validation

Split-sample, cross-validation, and the bootstrap

Several techniques are at the analyst's disposal for internal validation. In a classic JCE paper (Steyerberg, Harrell, 2001), we compared variants of split-sample validation, cross-validation, and bootstrap validation.
We found that:

  1. Split-sample validation analyses gave overly pessimistic estimates of performance, with large variability;
  2. Cross-validation on 10% of the sample had low bias and low variability;
  3. Bootstrapping provided stable estimates with low bias.

We concluded that split-sample validation is inefficient, and recommended bootstrapping for estimating the internal validity of a predictive logistic regression model. This conclusion was largely confirmed in a later study, with a minor note of caution: 'with lower events per variable or lower C statistics, bootstrapping tended to be optimistic but with lower absolute and mean squared errors than <..> cross-validation'. In a more recent JCE Commentary (Steyerberg & Harrell, 2015), we stated that 'split-sample validation only works when not needed', i.e. when the sample size is very large. Only then can a stable prediction model be developed on the training part of the data, and a reliable assessment of performance be obtained in the test part. But when both the training and the test part are that large, a better solution is to develop the model on the total data set, with the apparent performance as the best estimate of future performance.
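The bootstrap procedure recommended above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used in the 2001 study: the data are simulated (two real and two noise predictors, coefficients chosen arbitrarily), the logistic model is fitted with a simple hand-written gradient ascent, and the helper names (`fit_logistic`, `bootstrap_validate`) are hypothetical. The key point is that the full modelling procedure is repeated in every bootstrap sample, and the average difference between bootstrap performance and performance in the original sample is subtracted from the apparent C-statistic.

```python
import math
import random

def predict(beta, x):
    """Predicted probability from a logistic model; beta[0] is the intercept."""
    lp = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    lp = max(-30.0, min(30.0, lp))  # guard against overflow in exp
    return 1.0 / (1.0 + math.exp(-lp))

def fit_logistic(X, y, n_iter=50, step=2.0):
    """Approximate maximum likelihood by plain gradient ascent.
    Adequate for a few roughly standardised predictors; not a production fitter."""
    n, p = len(X), len(X[0])
    beta = [0.0] * (p + 1)
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for x, yi in zip(X, y):
            resid = yi - predict(beta, x)
            grad[0] += resid
            for j in range(p):
                grad[j + 1] += resid * x[j]
        beta = [b + step * g / n for b, g in zip(beta, grad)]
    return beta

def c_statistic(pred, y):
    """Share of event/non-event pairs ranked correctly; ties count one half."""
    ev = [p for p, yi in zip(pred, y) if yi == 1]
    ne = [p for p, yi in zip(pred, y) if yi == 0]
    conc = sum(1.0 if e > n else 0.5 if e == n else 0.0 for e in ev for n in ne)
    return conc / (len(ev) * len(ne))

def bootstrap_validate(X, y, n_boot=40, seed=1):
    """Return (apparent C, estimated optimism, optimism-corrected C)."""
    rng = random.Random(seed)
    n = len(y)
    beta = fit_logistic(X, y)
    apparent = c_statistic([predict(beta, x) for x in X], y)
    optimism = 0.0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]       # sample with replacement
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        beta_b = fit_logistic(Xb, yb)                    # repeat all modelling steps
        c_boot = c_statistic([predict(beta_b, x) for x in Xb], yb)
        c_orig = c_statistic([predict(beta_b, x) for x in X], y)
        optimism += (c_boot - c_orig) / n_boot
    return apparent, optimism, apparent - optimism

# Simulated development data (assumption): 150 patients, 4 candidate predictors,
# of which the last two are pure noise.
rng = random.Random(2001)
true_beta = [-0.5, 0.8, 0.5, 0.0, 0.0]
X = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(150)]
y = [1 if rng.random() < predict(true_beta, x) else 0 for x in X]

apparent, optimism, corrected = bootstrap_validate(X, y)
print(f"apparent C = {apparent:.3f}, optimism = {optimism:.3f}, "
      f"corrected C = {corrected:.3f}")
```

All 150 patients are used both for model development and for validation; no data are set aside. In practice more bootstrap repetitions would be used than the 40 chosen here to keep the sketch fast.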

Machine learning and high-dimensional model validation

Machine learning techniques are increasingly used for prediction in high-dimensional data, e.g. all kinds of 'omics' data. Most researchers realize that machine learning techniques are relatively data hungry, and that high-dimensional data pose a far greater risk of overfitting than classical situations with, say, 5 to 20 predictors. Hence, validation is very important.
Researchers in this field often resort to old-fashioned split-sample techniques. An extreme example was published in Nature Medicine in 2018. Here, researchers split a data set of 54 patients with leukemia into a training part with 44 patients and a validation part with 10 patients. The endpoint of interest was relapse, which occurred in only 3 of the 10 patients in the validation set. One does not have to be a statistician to realize that such a small validation sample leads to a highly unreliable assessment of performance.

Sample size for validation studies

Sample size recommendations suggest at least 100 (Vergouwe 2005) or 200 (Collins 2014) events in validation samples. Others have suggested that lower numbers might suffice, depending on the specific requirements for validation. Some recent, additional simulations confirm that 100 events is an absolute minimum for reliable assessment of performance (Steyerberg 2018).
Figure: Estimates of C-statistics in 100,000 simulations of validation of a prediction model with a true C-statistic (indicating discriminative ability) of either 0.7, 0.8, or 0.9, in a situation with 500 events (1167 non-events), 100 events (233 non-events), or 3 events (7 non-events). We note an extremely wide distribution of estimates with 3 events, with a spike at 1.0.
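The pattern in the figure can be reproduced roughly with a small simulation. The sketch below is not the code behind the figure. It assumes a binormal model (scores are standard normal in non-events and shifted by delta in events, so that the true C equals Phi(delta / sqrt(2))), considers only a true C of 0.8, and uses far fewer repetitions than 100,000 to keep the run time short. The function names are hypothetical.

```python
import random
from statistics import NormalDist, mean, pstdev

def c_statistic(event_scores, nonevent_scores):
    """Share of event/non-event pairs ranked correctly; ties count one half."""
    conc = sum(1.0 if e > n else 0.5 if e == n else 0.0
               for e in event_scores for n in nonevent_scores)
    return conc / (len(event_scores) * len(nonevent_scores))

def simulate_c(true_c, n_events, n_nonevents, n_sim, seed=1):
    """Simulate estimated C-statistics in validation samples of a fixed size."""
    rng = random.Random(seed)
    # Binormal assumption: C = Phi(delta / sqrt(2)), hence delta = sqrt(2) * Phi^-1(C)
    delta = 2 ** 0.5 * NormalDist().inv_cdf(true_c)
    estimates = []
    for _ in range(n_sim):
        events = [rng.gauss(delta, 1.0) for _ in range(n_events)]
        nonevents = [rng.gauss(0.0, 1.0) for _ in range(n_nonevents)]
        estimates.append(c_statistic(events, nonevents))
    return estimates

# 3 events among 10 patients versus 100 events among 333 patients
small = simulate_c(0.8, n_events=3, n_nonevents=7, n_sim=2000)
large = simulate_c(0.8, n_events=100, n_nonevents=233, n_sim=200)

share_perfect = sum(c == 1.0 for c in small) / len(small)
print(f"3 events:   mean {mean(small):.2f}, SD {pstdev(small):.3f}, "
      f"share with C = 1.0: {share_perfect:.2f}")
print(f"100 events: mean {mean(large):.2f}, SD {pstdev(large):.3f}")
```

With only 3 events and 7 non-events there are just 21 event/non-event pairs, so the estimate can take few distinct values, and a perfect C of 1.0 occurs regularly by chance alone.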

Internal vs external validation: what is the purpose?

  1. Internal validation aims to quantify optimism in model performance; we consider performance for a single population.
  2. External validation aims to assess generalizability to 'similar, related populations' (Justice 1999; Debray 2015).

If random splits are made, we assess internal validity; as argued above, random splitting is inefficient, and this practice should be abolished. If non-random splits are made, e.g. in time, or by place (centers, countries), we assess generalizability. Here our aim should be to quantify heterogeneity in performance rather than to report a single estimate of 'performance in new data' (Austin 2016, 2017; Riley 2016).
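As a minimal illustration of looking at heterogeneity rather than one pooled number, the sketch below computes the C-statistic separately per center. Everything in it is an assumption made for illustration: the three center names, the sample size of 400 per center, the event rate of 30%, and the degree to which the risk score separates events from non-events in each center. A formal analysis would typically go further, for example with a random-effects meta-analysis of the center-specific estimates, which is not shown here.

```python
import random

def c_statistic(pred, y):
    """Share of event/non-event pairs ranked correctly; ties count one half."""
    ev = [p for p, yi in zip(pred, y) if yi == 1]
    ne = [p for p, yi in zip(pred, y) if yi == 0]
    conc = sum(1.0 if e > n else 0.5 if e == n else 0.0 for e in ev for n in ne)
    return conc / (len(ev) * len(ne))

# Hypothetical centers: a larger value means the (arbitrary-scale) risk score
# separates events from non-events better in that center.
separation = {"center_A": 1.5, "center_B": 1.0, "center_C": 0.4}

rng = random.Random(2016)
data = {}
for center, delta in separation.items():
    y = [1 if rng.random() < 0.3 else 0 for _ in range(400)]
    score = [rng.gauss(delta * yi, 1.0) for yi in y]
    data[center] = (score, y)

# Center-specific performance versus a single pooled estimate
c_by_center = {center: c_statistic(s, y) for center, (s, y) in data.items()}
pooled_scores = [s for scores, _ in data.values() for s in scores]
pooled_y = [v for _, ys in data.values() for v in ys]
pooled_c = c_statistic(pooled_scores, pooled_y)
spread = max(c_by_center.values()) - min(c_by_center.values())

for center, c in c_by_center.items():
    print(f"{center}: C = {c:.2f}")
print(f"pooled C = {pooled_c:.2f}, range across centers = {spread:.2f}")
```

The pooled C-statistic hides the fact that discrimination differs substantially between centers; the center-specific estimates and their spread are the more informative summary.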

Some references on validation

Austin, geographic and temporal validation JCE 2016; BMC Diag Progn Research 2017
Bleeker, internal validation needs all modeling steps and external validation is different JCE 2013
Collins (BMC 2014) and Vergouwe (JCE 2005) on sample size for external validation
Debray, framework for external validation JCE 2015
Justice, internal vs external validity Ann Int Med 1999
Riley, external validation in Big Data BMJ 2016
Steyerberg, internal validation JCE 2001
Steyerberg & Harrell, perspective on internal and external validation JCE 2015
Steyerberg, role of small and large sample size for independent validation JCE 2018

additional/chapter17.txt · Last modified: 2018/08/01 19:07 by ewsteyerberg