Under Review

Severity-aware Evaluation of ICU Mortality Prediction Models using Risk-Weighted Error Metrics

This paper directly extends the preceding work on the Clinical Risk-Weighted Score (CRWS). The earlier paper established the foundational argument that standard metrics fail to account for asymmetric error costs in ICU mortality prediction. This work advances that framework in three ways: it incorporates patient-level clinical severity into the evaluation metric itself, validates the metric across three independent datasets and five model classes, and demonstrates a practical deployment strategy that improves clinical safety without any model retraining.

Background and Motivation

Standard evaluation metrics for ICU mortality prediction, including F1-score and AUC-ROC, assess model performance under the implicit assumption that all prediction errors carry equal clinical consequence. This assumption is flawed in two respects. First, a missed death is far more consequential than a false alarm. Second, and less commonly recognised, not all missed deaths are equally consequential: failing to identify a patient with a SOFA score of 19 (mortality risk exceeding 70%) represents a substantially greater clinical failure than missing a patient with a SOFA score of 2 (mortality risk below 10%). No widely adopted evaluation framework accounts for this patient-level severity dimension, and none can be applied retrospectively to an already-trained model without modifying its training pipeline.

Proposed Framework

This paper introduces the Risk-Weighted Evaluation Metric (RWEM), a severity-aware evaluation framework that incorporates patient acuity, measured via the Sequential Organ Failure Assessment (SOFA) score, directly into the error scoring at evaluation time. RWEM assigns a higher penalty to missed deaths among critically ill patients and a lower penalty to missed deaths among lower-severity patients, scaled continuously by the patient's normalised SOFA score.

A tunable parameter alpha controls the degree of severity sensitivity. When alpha is set to zero, RWEM reduces to a standard asymmetric cost-sensitive misclassification rate. When alpha is set to one, the primary experimental setting, severity weighting is fully active. The cost ratio between false negatives and false positives is set at 10:1, grounded in published economic evidence on the differential costs of missed high-severity ICU cases. Critically, RWEM requires no modification to any trained model; it is applied exclusively at evaluation time, making it immediately applicable to any existing clinical AI system.
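The summary above does not give the exact RWEM formula, so the following is only a minimal sketch of one form consistent with the description. It assumes a false-negative cost of c_fn * (1 + alpha * SOFA / 24), a flat false-positive cost of c_fp, and averaging over patients; the function name `rwem`, the normalisation by 24, and the multiplicative form are illustrative assumptions rather than the paper's definition.

```python
def rwem(y_true, y_pred, sofa, alpha=1.0, c_fn=10.0, c_fp=1.0, sofa_max=24.0):
    """Mean severity-weighted error cost (lower is better).

    Hypothetical form: the exact RWEM definition is not given in the text.
    """
    if not (len(y_true) == len(y_pred) == len(sofa)):
        raise ValueError("inputs must have equal length")
    if len(y_true) == 0:
        raise ValueError("inputs must be non-empty")
    total = 0.0
    for y, p, s in zip(y_true, y_pred, sofa):
        if y == 1 and p == 0:
            # Missed death: base 10:1 cost, scaled up by normalised SOFA.
            total += c_fn * (1.0 + alpha * s / sofa_max)
        elif y == 0 and p == 1:
            # False alarm: flat cost, independent of severity (assumption).
            total += c_fp
    return total / len(y_true)


# Toy cohort: two missed deaths (SOFA 19 and SOFA 2) and one false alarm.
y_true = [1, 1, 0, 0]
y_pred = [0, 0, 1, 0]
sofa = [19, 2, 5, 3]

plain = rwem(y_true, y_pred, sofa, alpha=0.0)      # 5.25: plain asymmetric cost
weighted = round(rwem(y_true, y_pred, sofa, alpha=1.0), 4)  # 7.4375
```

With alpha = 0 the SOFA term vanishes and both missed deaths cost the same, which matches the stated reduction to an asymmetric cost-sensitive misclassification rate. With alpha = 1 the SOFA 19 miss contributes most of the increase.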

A complementary Severity-Aware Threshold (SAT) strategy is also introduced, which selects the RWEM-optimal decision threshold independently for each SOFA severity band at inference time, again without model retraining.
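A per-band threshold search of this kind might look like the sketch below. It is an illustration under stated assumptions, not the paper's implementation: the severe band (SOFA 16-24) comes from the text, but the mild/moderate cut at SOFA 8, the threshold grid, the cost form, and the names `band_of`, `fit_sat`, and `predict_sat` are all hypothetical.

```python
def band_of(sofa):
    # Severe = SOFA 16-24 as stated in the text; the cut at 8 is an assumption.
    if sofa >= 16:
        return "severe"
    if sofa >= 8:
        return "moderate"
    return "mild"


def weighted_cost(y_true, y_prob, sofa, threshold, alpha=1.0, c_fn=10.0, c_fp=1.0):
    """Total severity-weighted cost at one threshold (assumed cost form)."""
    total = 0.0
    for y, p, s in zip(y_true, y_prob, sofa):
        pred = 1 if p >= threshold else 0
        if y == 1 and pred == 0:
            total += c_fn * (1.0 + alpha * s / 24.0)
        elif y == 0 and pred == 1:
            total += c_fp
    return total


def fit_sat(y_true, y_prob, sofa, grid=None):
    """Pick the cost-minimising threshold separately for each severity band."""
    grid = grid or [i / 100 for i in range(1, 100)]
    by_band = {}
    for y, p, s in zip(y_true, y_prob, sofa):
        by_band.setdefault(band_of(s), []).append((y, p, s))
    thresholds = {}
    for band, rows in by_band.items():
        ys, ps, ss = zip(*rows)
        # On ties, min() keeps the lowest threshold in the grid.
        thresholds[band] = min(grid, key=lambda t: weighted_cost(ys, ps, ss, t))
    return thresholds


def predict_sat(y_prob, sofa, thresholds, default=0.5):
    """Apply band-specific thresholds; bands unseen during fitting use `default`."""
    return [1 if p >= thresholds.get(band_of(s), default) else 0
            for p, s in zip(y_prob, sofa)]
```

The thresholds would be fitted on held-out validation data and then applied to the frozen model's probabilities, so the model itself is never retrained.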

Experimental Setup

Three publicly available ICU databases are used. MIMIC-IV v3.1 covers 72,001 admissions with 12.0% in-hospital mortality and serves as the primary dataset. MIMIC-III v1.4 covers 45,278 admissions with 11.8% mortality and provides temporal validation from the same institution a decade earlier. The eICU Collaborative Research Database covers 12,135 patients across multiple hospital sites with 29.3% mortality and provides multi-institutional external validation. Five models are evaluated: Logistic Regression, Random Forest, XGBoost, LightGBM, and LSTM. Fourteen clinical features are used for MIMIC-IV and MIMIC-III, including SOFA score, vital signs, and laboratory values. Statistical validation employs 1,000-iteration bootstrap confidence intervals and pairwise Wilcoxon signed-rank tests.
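For illustration, a 1,000-iteration bootstrap interval for a mean per-patient cost could be computed roughly as follows. The percentile method, the resampling unit (patients), and the name `bootstrap_ci` are assumptions; the summary does not state which bootstrap variant was used.

```python
import random


def bootstrap_ci(per_patient_costs, n_iter=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean cost (assumed variant)."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(per_patient_costs)
    if n == 0:
        raise ValueError("need at least one patient")
    means = []
    for _ in range(n_iter):
        # Resample patients with replacement, same size as the original cohort.
        sample = [per_patient_costs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = int(n_iter * (1.0 - level) / 2.0)
    hi_idx = n_iter - 1 - lo_idx
    return means[lo_idx], means[hi_idx]
```

Two models whose intervals overlap cannot be separated on aggregate cost by this check alone, which is the situation the Limitations section reports for the top four MIMIC-IV models.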

Results

Ranking Divergence

The central finding is consistent divergence between model rankings under standard metrics and under RWEM across all three datasets. On MIMIC-IV, 4 out of 5 models change rank; on MIMIC-III and eICU, all 5 models change rank.

The most striking example concerns LightGBM. On MIMIC-IV, LightGBM achieves the highest weighted F1, AUC, and H-measure simultaneously; it is the clear winner under every standard evaluation criterion. Under RWEM-Adaptive, it ranks fourth. The reason is structural: LightGBM's leaf-wise growth concentrates discriminative power on the numerically dominant mild-severity subgroup (84.6% of patients), at the cost of missing a disproportionate number of high-severity patients. Its false negatives carry the highest average SOFA score of any model. Logistic Regression, ranked last by every standard metric, misses zero deaths among patients with SOFA 16-24, the severe band with 63.8% mortality.

This pattern replicates on MIMIC-III, where LightGBM again ranks first by F1 and last by RWEM, and on eICU, where all five models diverge between the two evaluation frameworks. The temporal span of the MIMIC replication (same institution, a decade apart, independent cohorts with matched mortality rates) provides strong evidence that the divergence reflects a consistent behavioural pattern of gradient boosting methods under severity imbalance, rather than a dataset-specific artefact.
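The rank-change counts reported above rest on a simple comparison of two orderings, which can be sketched as follows. The sketch assumes RWEM is an error score (lower is better) and F1 is a quality score (higher is better), and it does not handle ties; the function names are hypothetical.

```python
def ranks(scores, higher_is_better=True):
    """Map each model name to its 1-based rank under one metric (no tie handling)."""
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    return {model: i + 1 for i, model in enumerate(ordered)}


def rank_changes(f1_scores, rwem_scores):
    """Return the models whose rank differs between F1 and RWEM."""
    f1_rank = ranks(f1_scores, higher_is_better=True)
    # RWEM is treated as an error here, so the lowest value ranks first.
    rwem_rank = ranks(rwem_scores, higher_is_better=False)
    return [m for m in f1_rank if f1_rank[m] != rwem_rank[m]]


# Placeholder values for three unnamed models, NOT results from the paper.
f1 = {"model_a": 0.9, "model_b": 0.8, "model_c": 0.7}
err = {"model_a": 3.0, "model_b": 2.0, "model_c": 4.0}
changed = rank_changes(f1, err)  # model_a and model_b swap places
```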

Severity-Aware Threshold Optimisation

Applying the SAT strategy to the Random Forest model on MIMIC-IV (13,888 patients, 16.6% mortality) achieves a statistically significant 26.2% reduction in severity-weighted error (95% CI: 16.8-34.0%, Wilcoxon p < 0.001, non-overlapping bootstrap confidence intervals), at the cost of a 13.5% reduction in aggregate F1-score. The gain comes from lowering the decision threshold in the lower-severity bands: the mild-band threshold is lowered from 0.319 to 0.192, reducing false negatives by 57%; the moderate-band threshold is lowered to 0.227, reducing false negatives by 83%. AUC is unchanged, confirming that the improvement is attributable entirely to threshold repositioning rather than any increase in model capacity.

Sensitivity and Robustness

RWEM rankings are perfectly stable across all alpha values from 0 to 1 with the full 14-feature set (Spearman rank correlation = 1.00 throughout). Instability appears only beyond alpha = 1, confirming a robust operating range. Replacing SOFA with APACHE III as the severity instrument produces zero rank changes, confirming that the findings are not specific to the choice of severity score. Feature engineering decisions that improve aggregate F1 by 53% (from the minimal to the full feature set) produce less than 1% change in RWEM, and ranking divergence persists in 3 out of 4 feature configurations. A richer feature set does not necessarily produce a clinically safer model.
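As a rough illustration of the stability check, the Spearman correlation between two model rankings without ties follows the standard formula rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the per-model rank difference. The helper below is a sketch with a hypothetical name; it assumes both rankings cover the same models and contain no ties.

```python
def spearman_no_ties(rank_a, rank_b):
    """Spearman rank correlation between two tie-free rankings of the same models."""
    models = list(rank_a)
    n = len(models)
    if n < 2:
        raise ValueError("need at least two models")
    if set(models) != set(rank_b):
        raise ValueError("rankings must cover the same models")
    d2 = sum((rank_a[m] - rank_b[m]) ** 2 for m in models)
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))
```

Comparing the ranking at each alpha value against the ranking at alpha = 0 and obtaining 1.0 every time would correspond to the perfect stability described above.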

Two alternative strategies, a band-switching ensemble and severity-weighted training, were tested and found to offer no improvement over SAT alone, confirming that the bottleneck RWEM addresses is evaluation and threshold selection, not model training.

Contribution and Significance

The paper makes three substantive contributions. First, RWEM provides a severity-aware evaluation framework that is applicable retrospectively to any pre-trained model and requires no changes to training pipelines, a practically important property that distinguishes it from cost-sensitive learning approaches. Second, the paper demonstrates empirically, across multiple independent datasets, that the choice of evaluation metric alone can determine which model is selected for deployment, with clinically meaningful consequences. Third, the SAT strategy operationalises RWEM as a deployment tool, achieving a statistically validated improvement in clinical safety through threshold adjustment alone.

The broader implication is that evaluation metrics are not neutral instruments: they encode implicit assumptions about the relative importance of different errors. In severity-imbalanced clinical cohorts, optimising for aggregate discrimination tends to produce models that perform well on the majority of patients while failing systematically on those most at risk of death. RWEM makes this trade-off explicit, quantifiable, and communicable to clinical decision-makers.

Limitations

Bootstrap confidence intervals for the top four models on MIMIC-IV overlap, meaning that rank differences among those models reflect differences in the distribution of errors across severity bands rather than statistically significant aggregate superiority; the SAT result is the study's primary statistically significant finding. SOFA is measured at admission only, and does not capture temporal changes in patient condition during the ICU stay. Both MIMIC datasets originate from a single institution, representing temporal rather than geographically independent validation. The 10:1 cost ratio, while supported by sensitivity analysis and published economic data, may vary across healthcare systems. Calibration analysis was not conducted, which is a meaningful limitation for the SAT strategy, particularly in the small severe band. Future work will address dynamic severity trajectories, multi-institutional prospective validation, calibration-aware threshold selection, and extension to multi-label and time-series prediction settings.