A simulation study was done to explore benefits and limitations of imputation for predictive regression models. We consider estimates of regression coefficients in a simple linear regression model (Y~X1+X2), and also consider estimates of predictive performance.
The simple linear regression model Y~X1+X2 was discussed in section 7.3 of the book. The impact of 4 missing data mechanisms (MCAR, MAR on x, MAR on y, MNAR) was illustrated in Fig 7.1.
We first generate X1 and X2 as uncorrelated predictors. In the second series of simulations, we impose a correlation of 0.707 (covariance 50%) between X1 and X2. We consider 4 mechanisms to generate 50% missing values in X1: MCAR, MAR on x, MAR on y, MNAR, as in Fig 7.1. Below we illustrate the correlation of X1 with X2 and of X1 MAR on X2, respectively. Original data are plotted with ‘-’; complete data are indicated with ‘o’.
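The data-generating step and the four missingness mechanisms can be sketched in Python with NumPy. This is a minimal illustration, not the code used for the study. The true coefficients of 1 for X1 and X2, the unit residual variance, the logistic form of the missingness probability, its `slope`, the direction of missingness (higher values more often missing), and the names `simulate_data` and `missing_mask` are all assumptions made here.

```python
import numpy as np

def simulate_data(n=500, rho=0.0, rng=None):
    """Draw standard-normal X1, X2 with correlation rho and Y = X1 + X2 + noise.

    The coefficients of 1 and the unit noise variance are assumptions.
    """
    rng = np.random.default_rng(rng)
    x2 = rng.standard_normal(n)
    # Mix X2 with independent noise so that corr(X1, X2) = rho and var(X1) = 1
    x1 = rho * x2 + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)
    y = 1.0 * x1 + 1.0 * x2 + rng.standard_normal(n)
    return x1, x2, y

def missing_mask(x1, x2, y, mechanism, rng=None, slope=2.0):
    """Return a boolean mask (True = X1 missing) with roughly 50% missingness.

    MCAR: constant probability; MAR_x: depends on X2; MAR_y: depends on Y;
    MNAR: depends on X1 itself. The logistic link is an assumption.
    """
    rng = np.random.default_rng(rng)
    n = len(x1)
    if mechanism == "MCAR":
        p = np.full(n, 0.5)
    else:
        driver = {"MAR_x": x2, "MAR_y": y, "MNAR": x1}[mechanism]
        z = (driver - driver.mean()) / driver.std()
        # Centred logistic curve, so the average probability is about 0.5
        p = 1.0 / (1.0 + np.exp(-slope * z))
    return rng.random(n) < p
```

Calling `simulate_data(rho=0.707)` gives the correlated series, and `missing_mask(x1, x2, y, "MAR_y")` marks which X1 values would be set to missing under that mechanism.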
We calculate the estimated regression coefficients and their estimated standard errors (SEs) with different approaches to dealing with missing values. We use the mean squared error (MSE) as a summary measure for the quality of estimation of the regression coefficients. The MSE is calculated as mean((estimated b - true β)²) over the simulations. The MSE combines bias (systematic difference between estimated and true value of the regression coefficient) and precision (random variability of estimates). For comparison on the same scale as the coefficients, we take the square root (‘sqrt(MSE)’). Smaller values of sqrt(MSE) indicate better estimation of the regression coefficients. Furthermore, we calculate the adjusted R² statistic to indicate the estimated predictive performance of the model with X1 and X2. Simulations were done 1000 times for data sets with 500 simulated subjects.
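The two summary measures can be written down compactly. The sketch below is illustrative only. The helper names `fit_ols` and `root_mse` are hypothetical, and the least-squares fit uses NumPy rather than whatever software produced the reported results.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with intercept; returns coefficients, SEs, adjusted R^2."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    p = A.shape[1] - 1                      # predictors, excluding the intercept
    sigma2 = resid @ resid / (n - p - 1)    # residual variance
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return beta, se, adj_r2

def root_mse(estimates, true_value):
    """sqrt(MSE): root of the mean squared deviation from the true coefficient."""
    estimates = np.asarray(estimates, dtype=float)
    return np.sqrt(np.mean((estimates - true_value) ** 2))
```

Because the MSE equals the squared bias plus the variance of the estimates across simulations, `root_mse` penalises both a systematic shift and random scatter in the estimated coefficients.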
We first consider the hypothetical situation of complete data from the simulations (‘original data’). These results provide a gold standard reference. Next, we consider a complete case (CC) analysis, where only patients with complete values on X1 (and X2 and Y) are analysed. As single imputation procedures, we consider conditional mean (CM) imputation, using X2 to impute X1 values. In addition, we consider a stochastic regression imputation, where single imputations for X1 are based on random draws from an imputation model using X2 and Y; this approach is referred to as single imputation (SI) below. Finally, we apply multiple imputation (MI) with the mice procedure at its default settings to generate 5 imputed data sets. The mice algorithm assumes linearity in the associations between variables, which is the case in our simulation design. The imputed data sets are analysed with standard least squares methods, and results are pooled using the standard formula for MI results.
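The two single imputation procedures described above can be sketched as follows. This is a simplified illustration under stated assumptions: the imputation models are fitted on the complete cases, the stochastic version adds a normal residual draw without also drawing the model parameters, and the function names `impute_conditional_mean` and `impute_stochastic` are hypothetical. It does not reproduce the mice algorithm.

```python
import numpy as np

def _ols(A, b):
    """Least-squares coefficients for design matrix A and response b."""
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coef

def impute_conditional_mean(x1, x2, miss):
    """Fill missing X1 with its predicted mean given X2 (fitted on complete cases)."""
    obs = ~miss
    A = np.column_stack([np.ones(len(x2)), x2])
    coef = _ols(A[obs], x1[obs])
    out = x1.copy()
    out[miss] = A[miss] @ coef
    return out

def impute_stochastic(x1, x2, y, miss, rng=None):
    """Fill missing X1 with a prediction from X2 and Y plus a random residual."""
    rng = np.random.default_rng(rng)
    obs = ~miss
    A = np.column_stack([np.ones(len(x2)), x2, y])
    coef = _ols(A[obs], x1[obs])
    resid = x1[obs] - A[obs] @ coef
    # Residual SD of the imputation model, estimated on the complete cases
    sigma = np.sqrt(resid @ resid / (obs.sum() - A.shape[1]))
    out = x1.copy()
    out[miss] = A[miss] @ coef + sigma * rng.standard_normal(miss.sum())
    return out
```

Only the observed entries of `x1` are used for fitting, so the values at the masked positions may be anything (for example NaN). The conditional mean version places all imputed values exactly on the regression line, which shrinks the variability of X1; the stochastic version restores that variability by adding noise.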
Results are shown in a large table. Unbiased estimation of the effect of a predictor is an important goal in prediction research. In particular, we may want to estimate the effect of one predictor, adjusted for the other. Here, this refers to estimation of the effect of X2 (with regression coefficient b2), while adjusting for X1 (which has 50% missing values). When X1 and X2 were uncorrelated, b2 was estimated without bias by all approaches, except by CC or CM under MAR on y. The coefficient b2 was also estimated well under MNAR with all approaches considered. When X1 and X2 were correlated, the MNAR situation posed more problems for CM, SI and MI. CC analysis was then the only unbiased approach, although its MSE was not much smaller than that of the other approaches.
This simple simulation study confirms that a complete case (CC) analysis gives quite reasonable estimates of the regression coefficients b1 and b2 under most missing value mechanisms. The problematic situation is MAR on y, where single imputation (SI) or multiple imputation (MI with mice) had clear advantages. In the MNAR situation, CC analysis led to unbiased regression coefficients, in contrast to all other approaches. However, CC analysis did not give a good impression of the predictive performance of the regression model in the original population. This occurs because, in the MAR on x, MAR on y, and MNAR situations, the performance of the model with CC analysis is assessed on a selection of patients. The limited spectrum of patients in the CC analysis led to lower estimates of the predictive performance (explained variation, adjusted R²).
Conditional mean (CM) imputation led on average to estimates of the regression coefficients that were quite similar to those from CC analysis (Table 4). Coefficients were however biased under MAR on y, and under MNAR when X1 and X2 were correlated. The adjusted R² was estimated too low, especially in the simulations with uncorrelated X1 and X2.
Results with SI or MI were quite similar in many respects. The regression coefficients were on average unbiased, except under MNAR. However, SI led to an underestimation of the SEs: the SEs were estimated as if the data were complete, and indeed were similar to the SE estimates for the original data. The MSE for estimation of the coefficients was similar for SI and MI when X1 and X2 were uncorrelated, but with correlated X1 and X2, MI had more substantial advantages in terms of MSE. The predictive performance (adjusted R²) was estimated quite well with SI or MI, except under MNAR in uncorrelated data.
Overall, MI gave good results in this simple simulation study: regression coefficients were estimated at least as well as with a CC analysis, and the estimated SEs were approximately correct. As a next best, SI was reasonable, with slightly poorer estimation of regression coefficients, but good estimation of predictive performance. Both SI and MI rely on the missingness mechanism being MCAR, MAR on x, or MAR on y. Under MNAR, CC analysis is unbiased for the regression coefficients, but underestimates predictive performance for the original, complete data.
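The difference in SEs between SI and MI can be made concrete with the pooling step. The sketch below assumes that the "standard formula" for MI results is Rubin's rules, which is how pooling is commonly done, and the function name `pool_rubin` is hypothetical. The total variance adds the between-imputation variance to the average within-imputation variance; a single imputed data set has no between-imputation term, which is why its SE looks like that of complete data.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed data sets (Rubin's rules).

    estimates: the m coefficient estimates
    variances: the m squared standard errors
    Returns the pooled estimate and its pooled standard error.
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    q_bar = q.mean()                  # pooled point estimate
    within = u.mean()                 # average within-imputation variance
    between = q.var(ddof=1)           # variance of the estimates across imputations
    total = within + (1.0 + 1.0 / m) * between
    return q_bar, np.sqrt(total)

# Illustrative numbers (not results from the simulation study)
est, se = pool_rubin([1.0, 1.2, 0.8, 1.1, 0.9], [0.01] * 5)
```

In this made-up example each imputed data set reports an SE of 0.1, while the pooled SE is 0.2 once the spread between the five estimates is taken into account.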