Clinical Risk-Weighted Evaluation of ICU Mortality Models (CRWS)

Background and Motivation

Machine learning models deployed in ICU settings for mortality prediction are conventionally evaluated using the F1 score, which treats false negatives and false positives as equally costly errors. In clinical practice, this assumption is fundamentally flawed. A false negative in ICU mortality prediction represents a patient predicted to survive who subsequently dies, a missed opportunity for life-saving intervention. A false positive, by contrast, triggers an unnecessary clinical review, which is resource-intensive but rarely fatal. Despite this well-established asymmetry in error costs, virtually no published study operationalises it at the level of the evaluation metric. As a result, standard metrics continue to obscure clinically dangerous model behaviour, particularly under class imbalance.

Proposed Metric

This paper introduces the Clinical Risk-Weighted Score (CRWS), a confusion-matrix-level evaluation metric that directly encodes the asymmetric cost of false negative and false positive errors through tunable penalty weights: wFN for missed deaths and wFP for false alarms. The metric is grounded in the F-beta family and reduces exactly to the standard F1 score when both weights are set to one, establishing it as a strict generalisation rather than a replacement. For ICU mortality prediction, the authors adopt wFN = 3 and wFP = 1, reflecting a clinical judgment that a missed death warrants three times the penalty of a false alarm. The weights are interpretable to clinical stakeholders in a way that abstract statistical parameters are not, and can be adapted to other tasks such as sepsis detection or readmission prediction by adjusting the cost ratio accordingly.
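The summary above does not reproduce the CRWS formula, so the following is only a minimal sketch of one cost-weighted F-score that has the stated property of reducing to F1 when both weights equal one. The functional form, the function names, and the confusion-matrix counts are assumptions made for illustration and should not be read as the authors' definition.

```python
def crws(tp: int, fp: int, fn: int, w_fn: float = 3.0, w_fp: float = 1.0) -> float:
    """Illustrative cost-weighted F-score (assumed form, not the paper's definition).

    Each false negative is charged w_fn and each false positive w_fp in the
    denominator of the usual F1 expression 2TP / (2TP + FN + FP).
    """
    denom = 2 * tp + w_fn * fn + w_fp * fp
    return 2 * tp / denom if denom else 0.0


def f1(tp: int, fp: int, fn: int) -> float:
    """Standard F1 score computed from confusion-matrix counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical counts, chosen only to show the behaviour of the weights.
tp, fp, fn = 80, 40, 20

f1_score = f1(tp, fp, fn)
crws_unit = crws(tp, fp, fn, w_fn=1, w_fp=1)   # identical to F1 by construction
crws_icu = crws(tp, fp, fn, w_fn=3, w_fp=1)    # lower: each miss counts three times

print(f"F1={f1_score:.4f}  CRWS(1,1)={crws_unit:.4f}  CRWS(3,1)={crws_icu:.4f}")
```

Under this assumed form, raising wFN leaves a model with no missed deaths untouched and lowers the score of any model with false negatives, which is the qualitative behaviour the metric is described as having.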

Experimental Setup

The study is conducted on MIMIC-IV v3.1, a publicly available critical care database comprising 65,366 unique adult ICU patients with a mortality rate of 10.84%. The prediction target is in-hospital mortality. Four classifiers are evaluated: Logistic Regression (LR), Random Forest, XGBoost, and a multilayer perceptron (MLP) neural network. Ten static tabular features are used, including ICU length of stay, age, admission type, and care unit. Class imbalance is addressed using SMOTE applied exclusively to the training partition. The evaluation protocol reports accuracy, precision, recall, F1, F2, CRWS, MCC, balanced accuracy, AUPRC, and full confusion matrix counts.
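The point of restricting SMOTE to the training partition is that the test set must keep the original class balance, otherwise the reported metrics no longer reflect deployment conditions. The following is a minimal standard-library sketch of that ordering (split first, then oversample), using a simplified SMOTE-style interpolation. The toy data, the split sizes, and the helper name smote_like are assumptions for illustration; the study's actual pipeline is not described at this level of detail, and in practice a library implementation such as the SMOTE class in imbalanced-learn would normally be used.

```python
import random


def smote_like(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (the core SMOTE idea)."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest neighbours by squared Euclidean distance, excluding the base.
        others = [m for m in minority if m is not base]
        neighbours = sorted(
            others,
            key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to mate
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, mate)))
    return synthetic


rng = random.Random(42)

# Hypothetical toy data: two numeric features, 10% positive class.
pos = [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(20)]
neg = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(180)]

# 1. Stratified 80/20 split first, so the test partition keeps the 10% rate.
train_pos, test_pos = pos[:16], pos[16:]
train_neg, test_neg = neg[:144], neg[144:]

# 2. Oversample the minority class inside the training partition only.
new_pos = smote_like(train_pos, len(train_neg) - len(train_pos), rng=rng)
train_pos_balanced = train_pos + new_pos

print("train:", len(train_pos_balanced), "pos /", len(train_neg), "neg")
print("test: ", len(test_pos), "pos /", len(test_neg), "neg (untouched)")
```

Because the synthetic points are built only from training positives, no information from the test partition leaks into model fitting.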

Results

The central finding is a pronounced divergence between model rankings under conventional metrics (accuracy and F1) and under CRWS. Random Forest achieves the highest accuracy among all models at 87.23%, yet produces 1,220 missed deaths out of 1,417 actual deaths in the test set, an 86.1% miss rate. Under CRWS, it ranks last with a score of 0.147. XGBoost, with the lowest accuracy at 59.32%, misses only 289 deaths and achieves the highest recall at 0.796, ranking first under CRWS with a score of 0.597.
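The miss rate and recall quoted above follow directly from the reported counts of missed deaths. The short check below reproduces that arithmetic; only the figures stated in the text are used, and the variable names are illustrative.

```python
actual_deaths = 1417  # deaths in the test set, as reported

# False negatives (missed deaths) reported for each model.
missed_deaths = {"Random Forest": 1220, "XGBoost": 289}

summary = {}
for model, fn in missed_deaths.items():
    tp = actual_deaths - fn  # deaths correctly flagged
    summary[model] = {
        "tp": tp,
        "recall": tp / actual_deaths,     # sensitivity
        "miss_rate": fn / actual_deaths,  # 1 - recall
    }
    print(f"{model}: TP={tp}, recall={tp / actual_deaths:.3f}, "
          f"miss rate={fn / actual_deaths:.1%}")
```

Random Forest therefore flags only 197 of the 1,417 deaths, whereas XGBoost flags 1,128.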

Ranking Under F1

XGBoost > MLP > LR > Random Forest

Ranking Under CRWS

XGBoost > LR > MLP > Random Forest

The shift in ranking between LR and MLP reflects LR's superior sensitivity under risk-weighted evaluation, a distinction that F1 fails to surface. Bootstrap confidence intervals (95%, n = 500) indicate that CRWS rankings are statistically distinguishable across all model pairs, whereas the F1 intervals for LR and MLP partially overlap, leaving the two models indistinguishable under standard evaluation.
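A percentile bootstrap is one common way to obtain such intervals for a confusion-matrix metric. The sketch below assumes that n = 500 denotes the number of bootstrap resamples, uses an assumed cost-weighted F-score in place of the authors' exact CRWS formula, and runs on hypothetical labels and predictions; none of these details is confirmed by the text.

```python
import random


def crws_from_labels(y_true, y_pred, w_fn=3.0, w_fp=1.0):
    """Assumed cost-weighted F-score computed from binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    denom = 2 * tp + w_fn * fn + w_fp * fp
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(y_true, y_pred, metric, n_resamples=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample cases with replacement,
    recompute the metric each time, and read off the empirical quantiles."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical test set: 30 deaths, 170 survivors; TP=22, FN=8, FP=25.
y_true = [1] * 30 + [0] * 170
y_pred = [1] * 22 + [0] * 8 + [1] * 25 + [0] * 145

point = crws_from_labels(y_true, y_pred)
lo, hi = bootstrap_ci(y_true, y_pred, crws_from_labels)
print(f"score={point:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

Two models would be treated as distinguishable in the sense used above when intervals computed this way do not overlap.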

Threshold optimisation guided by CRWS rather than F1 yields substantial reductions in missed deaths across all models. Random Forest reduces its false negative count from 1,220 to 188, a reduction of roughly 85%, by lowering the decision threshold to favour sensitivity. Sensitivity analysis across wFN values of 1, 2, 3, and 5 confirms that CRWS rankings remain stable for every tested wFN above the F1-equivalent baseline of wFN = 1. An ablation study comparing SMOTE against class-weighting demonstrates that SMOTE consistently produces higher CRWS scores and fewer missed deaths, even in cases where class-weighting yields a marginally higher F1.
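Metric-guided threshold selection can be sketched as a simple sweep: score every candidate threshold with the chosen metric and keep the best one. The example below uses an assumed cost-weighted F-score (not necessarily the authors' CRWS formula), a hypothetical grid of thresholds from 0.05 to 0.95, and made-up predicted probabilities, and repeats the sweep for the wFN values named in the sensitivity analysis.

```python
def crws(tp, fp, fn, w_fn=3.0, w_fp=1.0):
    """Assumed cost-weighted F-score; equals F1 when w_fn = w_fp = 1."""
    denom = 2 * tp + w_fn * fn + w_fp * fp
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, scores, w_fn, w_fp=1.0):
    """Sweep thresholds 0.05, 0.10, ..., 0.95 and return
    (best score, threshold, false negatives at that threshold)."""
    best = (-1.0, None, None)
    for i in range(1, 20):
        thr = i / 20
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= thr)
        fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < thr)
        fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= thr)
        score = crws(tp, fp, fn, w_fn, w_fp)
        if score > best[0]:  # strict comparison keeps the lowest tied threshold
            best = (score, thr, fn)
    return best


# Hypothetical predicted probabilities: 5 deaths followed by 10 survivors.
y_true = [1] * 5 + [0] * 10
scores = [0.92, 0.81, 0.62, 0.37, 0.28,
          0.71, 0.56, 0.44, 0.41, 0.33, 0.31, 0.24, 0.18, 0.12, 0.07]

results = {w: best_threshold(y_true, scores, w_fn=w) for w in (1, 2, 3, 5)}
for w, (score, thr, fn) in results.items():
    print(f"wFN={w}: threshold={thr:.2f}, score={score:.3f}, missed deaths={fn}")
```

On this toy data the F1-equivalent setting (wFN = 1) selects a threshold of 0.60 and misses two deaths, while wFN = 2, 3, and 5 all select 0.25 and miss none, which mirrors the mechanism described above: penalising false negatives more heavily pushes the chosen threshold down.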

Contribution and Significance

The paper makes a methodological contribution to clinical AI evaluation. It demonstrates that a model can simultaneously achieve the highest accuracy and the worst clinical safety profile, a phenomenon the authors term the accuracy-safety paradox. CRWS provides a principled, mathematically grounded mechanism to detect and penalise such behaviour. The metric is positioned not as a universal replacement for F1 but as a necessary complement in any evaluation pipeline where minimising high-consequence errors is the primary operational objective. Recommended parameter settings are provided for common clinical prediction tasks, derived from the clinical cost-ratio literature.

Limitations

The study is restricted to a single dataset, a single prediction task, ten static tabular features, and one primary cost assumption. Future work will extend CRWS to multi-class settings, incorporate clinical severity scores such as SOFA and APACHE as adaptive penalty weights, and validate the metric across additional tasks including sepsis onset detection and unplanned hospital readmission.