We consider some extra scenarios of invalidity of model predictions in section 19.9.
Table 19.4 summarizes the results with respect to calibration, discrimination, and clinical usefulness. Below, more details are provided for scenario 1 (change of setting: more or less severe case-mix in z, slightly different X effects) and scenario 2 (RCT vs survey: difference in heterogeneity of x, more or less severe case-mix in z, slightly different X effects).
The graph supporting the first scenario from Table 19.4 is:
Interpretation: The first scenario is that a prediction model is applied in a different context that is not fully represented in the predictors X. For example, we may try to determine the validity of a model to predict indolent prostate cancer in a screening setting, while the model was developed in a clinical setting (J Urol 2007). The case-mix will be different, and may not be fully captured by the predictors. The combination of a more severe case-mix in z and slightly different coefficients makes the prediction model poorly calibrated (a|b=1 = 0.67, slope = 0.87), lowers the c statistic (0.78), and makes decision making worse than a 'treat all' strategy (Fig., left panel). Remarkably, a less severe case-mix in z combined with different coefficients is still associated with substantial clinical usefulness (Net Benefit 0.098) despite the miscalibration (Fig., right panel).
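The mechanics of this scenario can be sketched with a small simulation. In the sketch below, outcomes depend on an observed predictor x and an unobserved predictor z; the model is developed where z is centred at 0, and then validated where z is more severe and the x effect differs. The specific coefficients, shifts, and sample sizes are illustrative assumptions, not the values underlying Table 19.4.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate(n, beta_x, mu_z):
    """Outcomes depend on observed x and an unobserved predictor z."""
    x = rng.normal(0.0, 1.0, n)
    z = rng.normal(mu_z, 1.0, n)
    p = 1.0 / (1.0 + np.exp(-(beta_x * x + z)))
    y = (rng.random(n) < p).astype(float)
    return x, y

def fit_logit(lp, y, iters=25):
    """Newton-Raphson fit of y ~ a + b*lp; returns (intercept, slope)."""
    X = np.column_stack([np.ones_like(lp), lp])
    beta = np.zeros(2)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)
        beta += np.linalg.solve((X * W[:, None]).T @ X, X.T @ (y - p))
    return beta

def intercept_given_slope1(lp, y):
    """Calibration-in-the-large a|b=1: solve mean(expit(a + lp)) = mean(y)."""
    lo, hi = -5.0, 5.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if (1.0 / (1.0 + np.exp(-(mid + lp)))).mean() < y.mean():
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def c_statistic(lp, y):
    """Rank-based c statistic (area under the ROC curve)."""
    order = np.argsort(lp)
    ranks = np.empty(len(lp))
    ranks[order] = np.arange(1, len(lp) + 1)
    n1 = y.sum()
    n0 = len(y) - n1
    return (ranks[y == 1].sum() - n1 * (n1 + 1) / 2) / (n1 * n0)

# develop the model where z is centred at 0 and the x effect is 1
x_dev, y_dev = simulate(100_000, beta_x=1.0, mu_z=0.0)
a_dev, b_dev = fit_logit(x_dev, y_dev)

# validate where the case-mix in z is more severe and the x effect differs
x_val, y_val = simulate(100_000, beta_x=0.8, mu_z=0.5)
lp = a_dev + b_dev * x_val

print("a|b=1:", intercept_given_slope1(lp, y_val))  # > 0: risks underestimated
print("slope:", fit_logit(lp, y_val)[1])            # < 1: effects too extreme
print("c    :", c_statistic(lp, y_val))
```

With a more severe z, observed risks exceed the model's predictions (a|b=1 > 0), and the different x effect attenuates the calibration slope below 1, mirroring the pattern of miscalibration described above.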
The graph supporting the second scenario from Table 19.4 is:
Interpretation: The second scenario concerns developing a model on data from an RCT and applying it in a less selected population. Again, we may think of RCTs and surveys in traumatic brain injury. Differences in x are obvious, with a broader case-mix in surveys. But some missed predictors may have a more severe distribution in the surveys, since physicians may tend to exclude patients with a very poor prognosis from RCTs, using clinical judgments that are hard to capture in formal criteria. We find that such a combination leads to a systematically miscalibrated model (a|b=1 = 0.64), with a substantial c statistic (0.88) but poor clinical usefulness (Net Benefit 0.027, upper left panel).
If a more severe case-mix in z coincides with a less heterogeneous distribution of x, model performance is poor on all counts: poor calibration, only modest discrimination, and harm to decision making (Net Benefit -0.036, upper right panel).
A more favorable scenario is a less severe case-mix in z combined with a more heterogeneous case-mix in x. This causes a calibration-in-the-large problem, but adequate discrimination and surprisingly high clinical usefulness (lower left panel).
Applying a model from a survey in an RCT might lead to a less severe distribution of z and less heterogeneity in x. This would imply poor calibration and only modest discrimination, but some clinical usefulness (Net Benefit 0.037, lower right panel).
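The Net Benefit values quoted for these scenarios weigh true positives against false positives at a chosen risk threshold, with 'treat all' as the reference strategy. A minimal sketch of the calculation (the toy data and the threshold of 0.5 are illustrative assumptions, not the simulated populations above):

```python
import numpy as np

def net_benefit(p_pred, y, pt):
    """Net Benefit of treating patients with predicted risk >= pt:
    NB = TP/n - FP/n * pt / (1 - pt)."""
    treat = p_pred >= pt
    n = len(y)
    tp = np.sum(treat & (y == 1)) / n
    fp = np.sum(treat & (y == 0)) / n
    return tp - fp * pt / (1.0 - pt)

def net_benefit_treat_all(y, pt):
    """Reference strategy: treat everyone regardless of predicted risk."""
    prev = np.mean(y)
    return prev - (1.0 - prev) * pt / (1.0 - pt)

# illustrative check: a perfectly discriminating model at threshold 0.5
y = np.array([0, 0, 0, 1, 1])
p_perfect = y.astype(float)
print(net_benefit(p_perfect, y, 0.5))  # equals the event rate: 0.4
print(net_benefit_treat_all(y, 0.5))   # 0.4 - 0.6 = -0.2
```

A model is clinically useful at threshold pt when its Net Benefit exceeds both 0 (treat none) and the 'treat all' value, which is how the panels above separate useful from harmful scenarios.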
We note that, in addition to causing systematically poorer performance with respect to calibration and discrimination, applying a model in a different case-mix makes predictions for individual subjects more uncertain. We can view such application as extrapolation in the multivariate space defined by the predictor values: the model is applied to patterns of X that were relatively sparse at model development.
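This notion of extrapolation can be made concrete by measuring how far each new patient's predictor pattern lies from the development data, for instance with a squared Mahalanobis distance to the development centroid. A sketch with hypothetical data (the shift, heterogeneity, and 99th-percentile cutoff are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical development and validation predictor matrices (3 predictors)
X_dev = rng.normal(0.0, 1.0, size=(2000, 3))
X_val = rng.normal(0.8, 1.2, size=(1000, 3))  # shifted, broader case-mix

mu = X_dev.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X_dev, rowvar=False))

def sq_mahalanobis(X):
    """Squared Mahalanobis distance to the development centroid."""
    d = X - mu
    return np.einsum('ij,jk,ik->i', d, cov_inv, d)

# flag validation patterns beyond the 99th percentile of development distances
cutoff = np.quantile(sq_mahalanobis(X_dev), 0.99)
frac_sparse = (sq_mahalanobis(X_val) > cutoff).mean()
print(f"fraction of validation patterns in sparse regions: {frac_sparse:.2f}")
```

Patterns flagged this way lie in regions that were sparse at development, where individual predictions rest on extrapolation and should be treated with extra caution.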