Prevalence Shift and ICU Mortality Miscalibration

Background and Motivation

When a machine learning model trained at one hospital is deployed at another, a phenomenon known as dataset shift causes its performance to degrade. The most commonly reported metric, AUROC, which measures whether the model correctly ranks higher-risk patients above lower-risk ones, tends to hold up reasonably well across institutions. Calibration, which measures whether the predicted probability numbers themselves are accurate, tends to collapse.

One underappreciated cause of this calibration collapse is prevalence shift, also called label shift or prior probability shift. This occurs when the fraction of patients who actually die differs between the hospital the model was trained on and the hospital it is deployed at. A model trained at a hospital with 10.8% mortality implicitly encodes that rate as a prior belief. When deployed at a hospital with only 3% mortality, it will systematically overestimate risk for all patients. When deployed at a hospital with 18% mortality, it will systematically underestimate risk.

Crucially, this form of miscalibration is invisible to AUROC. Because AUROC only measures the relative ranking of patients, not the absolute probability values, it remains unaffected even when the probability estimates are badly wrong. This creates a dangerous blind spot: a model can appear adequate by the standard reported metric while generating systematically biased risk estimates that mislead clinical decision-making. This paper investigates that blind spot empirically at scale: 185 real hospitals, 159,224 patients, four model architectures, and a theoretically grounded correction that requires no labeled data from the new hospital.

Proposed Correction

The paper derives a closed-form Bayesian prevalence correction. It assumes that the distribution of patient features conditional on the outcome, p(x | y), is stable across institutions, which is the standard assumption in the label shift literature. Under that assumption, the model's predicted probability can be adjusted analytically using only the local hospital's observed mortality rate. The correction needs no labeled patient data from the new hospital, no retraining, and no access to the original training data. Its only inputs are the model's output probabilities, the prevalence the model was trained at, and a single population-level statistic: the local mortality rate.

The correction is model-agnostic and can be applied post-hoc to any probabilistic classifier. Any hospital with access to its own administrative mortality statistics, which every hospital maintains, can apply it immediately.
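The closed-form expression itself is not reproduced in this summary. What follows is a minimal sketch, assuming the correction takes the standard prior-shift form from the label shift literature, in which the predicted log-odds are shifted by the difference between the local and training log-odds: logit(p') = logit(p) + logit(pi_local) - logit(pi_train). The function name, argument names, and the clipping constant eps are illustrative assumptions, not taken from the paper.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def prevalence_correct(p: float, train_prev: float, local_prev: float,
                       eps: float = 1e-6) -> float:
    """Re-express a predicted probability under the local outcome rate.

    Assumes p(x | y) is the same at both hospitals, so only the prior changes.
    `eps` is a hypothetical clipping constant that keeps the log-odds finite.
    """
    p = min(max(p, eps), 1.0 - eps)
    train_prev = min(max(train_prev, eps), 1.0 - eps)
    local_prev = min(max(local_prev, eps), 1.0 - eps)
    # Shift the log-odds from the training prior to the local prior.
    z = logit(p) + logit(local_prev) - logit(train_prev)
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    # Model trained at 10.8% mortality, deployed at 3% and 18% hospitals.
    for local in (0.03, 0.18):
        print(local, prevalence_correct(0.30, train_prev=0.108, local_prev=local))
```

Under this form, a raw prediction equal to the training base rate (0.108) maps to the local base rate, and every other prediction moves in the same direction on the log-odds scale. The clipping matters in practice because some hospitals in the evaluation set report 0% mortality, where the log-odds would otherwise be undefined.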

Experimental Setup

Models are trained on MIMIC-IV (45,587 patients, 10.8% mortality) and evaluated across 185 hospitals in the eICU Collaborative Research Database (159,224 patients total). Hospital-level mortality rates range from 0% to 19.7% with a mean of 5.2%, providing the natural variation needed to study prevalence shift at scale. The four models evaluated are Logistic Regression, Random Forest, XGBoost, and LightGBM.

Five calibration approaches are compared: uncorrected raw predictions, the proposed zero-label Bayesian correction, and three supervised methods (Platt scaling, isotonic regression, and temperature scaling), each fitted on 30% of the labeled data from each hospital. A synthetic validation experiment holds the class-conditional feature distribution fixed while artificially varying prevalence across a controlled range, so that the effect of prevalence on calibration can be isolated causally rather than inferred from correlation alone.

Results

ECE Correlates Strongly with Prevalence Gap; AUROC Does Not

The central empirical finding is a strong, statistically significant correlation between a hospital's prevalence gap (how far its mortality rate deviates from the training prevalence) and its Expected Calibration Error (ECE), across all four model architectures. Pearson correlations range from r = 0.52 for LightGBM to r = 0.75 for Random Forest, all with p < 10^-14. The relationship is consistent and approximately linear.
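The summary does not state how ECE is binned or whether the prevalence gap is signed. The quantities involved can be sketched as follows, assuming 10 equal-width bins for ECE and an absolute gap relative to the 10.8% training prevalence. All function names are hypothetical.

```python
import math


def expected_calibration_error(probs, labels, n_bins=10):
    """Equal-width-bin ECE: size-weighted mean of |observed rate - mean prediction|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_pred = sum(p for p, _ in b) / len(b)
        obs_rate = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(obs_rate - mean_pred)
    return ece


def prevalence_gap(local_prev, train_prev=0.108):
    """Absolute distance between local and training mortality (assumed definition)."""
    return abs(local_prev - train_prev)


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

The hospital-level analysis would then compute one (gap, ECE) pair per hospital and pass the two lists to `pearson_r`. The same call with per-hospital AUROC in place of ECE gives the contrasting correlation.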

In stark contrast, AUROC shows weak, inconsistent, and largely non-significant correlation with the same prevalence gap. For LightGBM, the correlation is essentially zero (r = -0.027, p = 0.748). This empirically confirms the theoretical dissociation: the metric that clinical ML studies most commonly report is blind to the shift that most commonly causes calibration failure.

Bayesian Correction: Zero Labels, Substantial Improvement

Applying the proposed Bayesian correction using only each hospital's local mortality rate reduces mean ECE by 56 to 65% for the three well-calibrated model families. LightGBM ECE drops from 0.085 to 0.037; XGBoost from 0.080 to 0.028; Logistic Regression from 0.071 to 0.027. The corrected ECE values are directly comparable to those achieved by Platt scaling and isotonic regression fitted on 30% labeled data from each hospital (ECE approximately 0.022 to 0.024). The zero-label correction achieves roughly equivalent results to supervised methods while requiring none of their infrastructure.

Random Forest's larger residual error after correction (ECE = 0.261, down from 0.464) reflects a separate source of miscalibration (inherent overconfidence from probability compression) that prevalence correction alone cannot address. This is consistent with findings from the preceding papers in this research programme.
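The relative reductions follow directly from the mean ECE values reported above. A short check of the arithmetic, using only those reported figures (the dictionary layout and function name are illustrative):

```python
# Mean ECE (before, after) the zero-label correction, as reported in the text.
reported_ece = {
    "LightGBM": (0.085, 0.037),
    "XGBoost": (0.080, 0.028),
    "Logistic Regression": (0.071, 0.027),
    "Random Forest": (0.464, 0.261),
}


def relative_reduction(before, after):
    """Fraction of the original ECE removed by the correction."""
    return (before - after) / before


reductions = {name: relative_reduction(*pair) for name, pair in reported_ece.items()}

if __name__ == "__main__":
    for name, r in reductions.items():
        print(f"{name}: {100 * r:.1f}% lower mean ECE")
```

The three gradient-boosted and linear models land between roughly 56% and 65%, matching the stated range, while Random Forest works out to roughly 44%.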

The correction is most beneficial precisely where it is most needed: hospitals with prevalence gaps exceeding 0.10 achieve 93.5% ECE reduction; hospitals with gaps between 0.05 and 0.10 achieve 68.9%; hospitals where prevalence is already close to the training rate see only 24.5% reduction, and the correction does not harm them.

Causal Validation

The synthetic experiment provides causal rather than merely correlational evidence. When only prevalence is varied while all other aspects of the data distribution are held constant, raw ECE rises linearly and predictably with the size of the prevalence gap, matching the theoretical derivation closely. The Bayesian correction reduces ECE back to near zero throughout the entire prevalence range tested, confirming that it fully recovers calibration when the conditional stability assumption holds.
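The logic of this experiment can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the paper's actual protocol: a single Gaussian feature with class means of -1 and +1 stands in for the patient features, the "model" is the exact posterior at the training prevalence, ECE uses 10 equal-width bins, and the correction is assumed to be the standard log-odds prior shift. All function names are hypothetical.

```python
import math
import random

TRAIN_PREV = 0.108  # training mortality reported for MIMIC-IV


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    return math.log(p / (1.0 - p))


def sample_hospital(prev, n, rng):
    """Draw (x, y) pairs; p(x | y) is fixed, only the outcome rate changes."""
    data = []
    for _ in range(n):
        y = 1 if rng.random() < prev else 0
        data.append((rng.gauss(1.0 if y else -1.0, 1.0), y))
    return data


def model(x):
    """Exact posterior at TRAIN_PREV; the log-likelihood ratio here is 2x."""
    return sigmoid(logit(TRAIN_PREV) + 2.0 * x)


def correct(p, local_prev, eps=1e-9):
    """Assumed prior-shift correction on the log-odds scale."""
    p = min(max(p, eps), 1.0 - eps)
    local_prev = min(max(local_prev, eps), 1.0 - eps)
    return sigmoid(logit(p) + logit(local_prev) - logit(TRAIN_PREV))


def ece(probs, labels, n_bins=10):
    """Equal-width-bin ECE, accumulated as |sum(y) - sum(p)| / n per bin."""
    acc = [[0, 0.0, 0] for _ in range(n_bins)]  # count, sum of p, sum of y
    for p, y in zip(probs, labels):
        b = acc[min(int(p * n_bins), n_bins - 1)]
        b[0] += 1
        b[1] += p
        b[2] += y
    return sum(abs(sy - sp) for c, sp, sy in acc if c) / len(probs)


def auroc(scores, labels):
    """Rank-based AUROC (Mann-Whitney U); assumes no tied scores."""
    order = sorted(range(len(scores)), key=scores.__getitem__)
    rank_sum = sum(r + 1 for r, i in enumerate(order) if labels[i] == 1)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def run(prevalences=(0.03, 0.108, 0.18), n=20000, seed=0):
    """Return (prev, raw ECE, corrected ECE, raw AUROC, corrected AUROC) rows."""
    rng = random.Random(seed)
    rows = []
    for prev in prevalences:
        data = sample_hospital(prev, n, rng)
        labels = [y for _, y in data]
        raw = [model(x) for x, _ in data]
        local_prev = sum(labels) / n  # the only local statistic used
        fixed = [correct(p, local_prev) for p in raw]
        rows.append((prev, ece(raw, labels), ece(fixed, labels),
                     auroc(raw, labels), auroc(fixed, labels)))
    return rows


if __name__ == "__main__":
    for prev, e_raw, e_fix, a_raw, a_fix in run():
        print(f"prev={prev:.3f} ECE raw={e_raw:.4f} corrected={e_fix:.4f} "
              f"AUROC raw={a_raw:.4f} corrected={a_fix:.4f}")
```

Under these assumptions, raw ECE should grow as the simulated hospital's prevalence moves away from 10.8%, corrected ECE should stay near the sampling-noise floor, and AUROC should be the same before and after correction because the adjustment is a monotone transform of the score.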

Prevalence Attribution

An empirical attribution analysis estimates what fraction of each hospital's calibration error is attributable to prevalence shift specifically. For the three well-calibrated models, prevalence shift explains approximately 55 to 65% of the total calibration error observed. The remaining 35 to 45% is attributable to other forms of dataset shift (feature distribution differences, measurement heterogeneity, and concept drift) that require labeled data or more sophisticated domain adaptation methods to address.

Contribution and Significance

The paper makes five contributions. First, it provides the first large-scale, multi-hospital empirical quantification of the relationship between outcome prevalence and calibration error in ICU mortality prediction, across 185 hospitals and four model families. Second, it formally proves and empirically confirms that AUROC cannot detect prevalence shift, establishing that calibration metrics are a necessary complement to discrimination metrics for deployment assessment. Third, it derives and validates a zero-label Bayesian correction that matches the performance of supervised recalibration methods, making practical calibration maintenance accessible to resource-constrained settings. Fourth, it establishes causality through controlled synthetic experiments. Fifth, it provides a practical attribution framework quantifying how much of observed miscalibration is attributable to prevalence shift versus other sources.

The practical implication is direct: any hospital can apply this correction using only its own observed mortality rate, with no machine learning expertise, no labeled patient data, and no model retraining. For new ICUs, low-volume centers, and institutions in low- and middle-income countries, this provides a principled and immediately deployable pathway to calibration maintenance.

In the context of the broader research programme, this paper identifies the primary causal mechanism behind the calibration collapse documented in Paper 4, and provides a mathematically grounded, operationally simple remedy. Where Paper 4 said calibration collapses and can be fixed, this paper says why it collapses and how to fix it with nothing more than a single hospital statistic.

Limitations

The eICU database predominantly comprises US academic medical centers; generalisation to community hospitals, international systems, and non-ICU tasks requires further validation. The conditional stability assumption underlying the correction (that feature distributions within each outcome class are consistent across institutions) may not hold perfectly, which would explain the 35 to 45% residual error after correction. Random Forest remains problematic even after correction due to overconfidence unrelated to prevalence, and is not recommended for deployment contexts where prevalence shift is anticipated. The temperature variable was excluded as a model input due to 89% missingness in eICU. Future work should augment the Bayesian correction with domain adaptation techniques to address feature drift at the same time, validate on additional international databases, and develop automated prevalence monitoring systems for continuous deployment environments.
