<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Leo Alvarino Senior Project</title>
<link>https://leonardoalva98.github.io/senior_project/</link>
<atom:link href="https://leonardoalva98.github.io/senior_project/index.xml" rel="self" type="application/rss+xml"/>
<description>A blog built with Quarto</description>
<generator>quarto-1.9.23</generator>
<lastBuildDate>Mon, 30 Mar 2026 06:00:00 GMT</lastBuildDate>
<item>
  <title>What the Data Revealed: Results, Clinical Evaluation, and What I Learned</title>
  <dc:creator>Leonardo Alvarino</dc:creator>
  <link>https://leonardoalva98.github.io/senior_project/posts/results/</link>
  <description><![CDATA[ 




<p>In this final post I’ll share the results, explain what the model actually learned, and reflect honestly on the limitations and lessons from building this project.</p>
<section id="the-results" class="level2">
<h2 class="anchored" data-anchor-id="the-results">The Results</h2>
<p>I trained two models and compared them directly:</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Metric</th>
<th>Logistic Regression</th>
<th>XGBoost</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>AUC</td>
<td>0.742</td>
<td><strong>0.885</strong></td>
</tr>
<tr class="even">
<td>Recall (sepsis)</td>
<td>0.61</td>
<td><strong>0.73</strong></td>
</tr>
<tr class="odd">
<td>Precision (sepsis)</td>
<td>0.05</td>
<td><strong>0.10</strong></td>
</tr>
<tr class="even">
<td>Clinical Precision</td>
<td>-</td>
<td><strong>0.49</strong></td>
</tr>
</tbody>
</table>
<p>XGBoost significantly outperformed the linear baseline across every metric. AUC improved from 0.742 to 0.885 — meaning the model correctly ranks a sepsis patient as higher risk than a non-sepsis patient 88.5% of the time.</p>
<p>Recall of 0.73 means the model catches 73% of actual sepsis cases, the most important metric in a clinical context where missing a case is far more dangerous than a false alarm.</p>
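<p>For concreteness, here is how these row-level metrics are computed with scikit-learn. The labels and scores below are toy values for illustration, not the project's data:</p>

```python
from sklearn.metrics import roc_auc_score, recall_score, precision_score

# Toy hourly labels and model risk scores (illustrative values only).
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.3, 0.2, 0.6, 0.7, 0.4, 0.2, 0.9]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]  # alert threshold of 0.5

auc = roc_auc_score(y_true, y_score)    # how well sepsis rows are ranked above non-sepsis rows
rec = recall_score(y_true, y_pred)      # share of sepsis rows that triggered an alert
prec = precision_score(y_true, y_pred)  # share of alerts that landed on a sepsis row
```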
</section>
<section id="why-standard-metrics-mislead" class="level2">
<h2 class="anchored" data-anchor-id="why-standard-metrics-mislead">Why Standard Metrics Mislead</h2>
<p>The precision number deserves a careful explanation.</p>
<p>Standard precision of 0.10 means that for every 10 alerts the model fires, only 1 corresponds to a row officially labeled as sepsis. That sounds bad. But it misses something important.</p>
<p>The <code>SepsisLabel</code> in this dataset flips to 1 at a specific hour — but sepsis doesn’t begin at that exact moment. The physiological deterioration starts hours before the official diagnosis. When the model fires an alert at hour 37 and the label doesn’t flip until hour 41, the standard metric counts that as a false positive. Clinically, it’s an early correct detection.</p>
<p>To measure this more honestly, I re-evaluated the model against patient-level outcomes: whether each patient eventually developed sepsis at any point during their stay, regardless of timing. Under this evaluation, precision improves from 0.10 to <strong>0.49</strong>. Nearly half of the model’s alerts correspond to patients who genuinely develop sepsis.</p>
<p>This is a more clinically meaningful way to evaluate a sequential prediction model. The label timing problem is a known limitation of standard evaluation metrics applied to sepsis prediction.</p>
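<p>The patient-level re-evaluation can be sketched in a few lines of pandas. The frame below is a toy example, and the column names (<code>patient_id</code>, <code>SepsisLabel</code>, <code>alert</code>) are assumptions rather than the project's exact variables:</p>

```python
import pandas as pd

# Toy hourly records: an alert counts as clinically correct if the patient
# EVER develops sepsis, regardless of which hour the alert fired.
df = pd.DataFrame({
    "patient_id":  [1, 1, 1, 2, 2, 3, 3],
    "SepsisLabel": [0, 0, 1, 0, 0, 0, 0],  # official hourly label
    "alert":       [0, 1, 1, 1, 0, 0, 0],  # model fired at this hour
})

# Row-level precision: alerts where the label is 1 at that same hour.
alerts = df[df["alert"] == 1]
row_precision = (alerts["SepsisLabel"] == 1).mean()

# Patient-level ("clinical") precision: alerts on patients who ever develop sepsis.
ever_septic = df.groupby("patient_id")["SepsisLabel"].max()
clinical_precision = alerts["patient_id"].map(ever_septic).mean()
```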
</section>
<section id="how-early-did-the-model-actually-detect" class="level2">
<h2 class="anchored" data-anchor-id="how-early-did-the-model-actually-detect">How Early Did the Model Actually Detect?</h2>
<p>This was the most important question — not just whether the model was right, but how early.</p>
<p>For each patient in the test set, I measured the first hour the model fired an alert and compared it to the official onset hour. Filtering for clinically meaningful detections within a 48-hour window:</p>
<ul>
<li><strong>210 patients</strong> caught meaningfully early</li>
<li><strong>Average: 26.1 hours</strong> before the label flipped</li>
<li><strong>Median: 23 hours</strong> before the label flipped</li>
</ul>
<p>Adding the 6-hour label shift already built into the dataset, the median early detection translates to approximately <strong>29 hours before clinical diagnosis</strong>.</p>
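<p>The measurement itself is simple: for each septic patient, subtract the hour of the first alert from the official onset hour. A minimal pandas sketch on toy data (column names are assumptions):</p>

```python
import pandas as pd

# Toy hourly records for two septic patients.
df = pd.DataFrame({
    "patient_id":  [1, 1, 1, 1, 2, 2, 2],
    "hour":        [0, 1, 2, 3, 0, 1, 2],
    "SepsisLabel": [0, 0, 0, 1, 0, 0, 1],
    "alert":       [0, 1, 0, 1, 0, 0, 1],
})

onset = df[df["SepsisLabel"] == 1].groupby("patient_id")["hour"].min()
first_alert = df[df["alert"] == 1].groupby("patient_id")["hour"].min()

# Positive lead time = the alert fired before the label flipped.
lead_time = (onset - first_alert).dropna()
median_lead = lead_time.median()
```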
<p>Of the 358 total sepsis patients in the test set:</p>
<ul>
<li><strong>78%</strong> were caught before onset</li>
<li><strong>10%</strong> were caught at or after onset</li>
<li><strong>12%</strong> were missed entirely</li>
</ul>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://leonardoalva98.github.io/senior_project/posts/results/early_detection_distribution.png" class="img-fluid figure-img"></p>
<figcaption>Sepsis detection timing — distribution of how many hours before onset the model fired its first alert.</figcaption>
</figure>
</div>
</section>
<section id="what-the-model-actually-learned-shap" class="level2">
<h2 class="anchored" data-anchor-id="what-the-model-actually-learned-shap">What the Model Actually Learned — SHAP</h2>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://leonardoalva98.github.io/senior_project/posts/results/shap_importance.png" class="img-fluid figure-img"></p>
<figcaption>SHAP feature importance — the top predictors of sepsis risk.</figcaption>
</figure>
</div>
<p>The SHAP analysis revealed which features drove predictions most strongly. The top predictors were:</p>
<p><strong>ICULOS</strong> (hours in ICU) ranked first by a large margin. This deserves scrutiny: it means the model is partly learning that patients who stay in the ICU longer are more likely to develop sepsis. That’s statistically true but not entirely actionable. A nurse can’t intervene on how long a patient has been in the ICU. This is a legitimate limitation and a direction for future work.</p>
<p><strong>Shock index</strong> ranked 4th — validating the composite feature engineering. The ratio of HR to SBP captured something neither raw HR nor SBP captured alone.</p>
<p><strong>Resp_rolling6</strong> ranked 7th — the 6-hour rolling respiratory rate average was more predictive than raw Resp, confirming that trends matter more than snapshots.</p>
<p><strong>SIRS_score</strong> appeared in the top 10 — the clinically validated checklist that nurses already use was meaningful to the model as well, providing a useful sanity check.</p>
<p><strong>Age</strong> ranked 10th — older patients carry higher sepsis risk, consistent with clinical literature.</p>
</section>
<section id="limitations" class="level2">
<h2 class="anchored" data-anchor-id="limitations">Limitations</h2>
<p>I want to be honest about what this model cannot do.</p>
<p><strong>Single hospital system generalization.</strong> The PhysioNet challenge showed that models trained on data from two hospital systems performed significantly worse on a third hidden hospital system. My model was trained on data from Beth Israel Deaconess Medical Center. It would need revalidation before being applied at any other institution.</p>
<p><strong>ICULOS dominance.</strong> As discussed above, the model’s heaviest reliance on time-in-ICU rather than vital sign patterns limits its actionability. A better approach would be to train only on the hours before onset, forcing the model to learn from physiological signals rather than temporal patterns.</p>
<p><strong>Label uncertainty.</strong> The Sepsis-3 definition requires both clinical suspicion and organ dysfunction to be confirmed retrospectively. The label marks when sepsis was recognized, not when it began. This means the model is predicting a conservative, retrospective estimate of onset; the true physiological signal likely starts even earlier.</p>
<p><strong>No external validation.</strong> The model has never been tested on patients outside the PhysioNet dataset. Any real-world deployment would require prospective validation in a clinical setting.</p>
</section>
<section id="what-i-would-do-differently" class="level2">
<h2 class="anchored" data-anchor-id="what-i-would-do-differently">What I Would Do Differently</h2>
<p>If I were to extend this project, the highest-value improvements would be:</p>
<p>Train on pre-onset hours only — excluding post-onset rows from training to force the model to learn early warning patterns rather than confirming an already-deteriorating patient.</p>
<p>Add a temporal model — XGBoost treats each row independently. A recurrent neural network or transformer architecture could learn sequential patterns across hours more naturally, potentially improving early detection.</p>
<p>Systematic hyperparameter tuning — I used near-default XGBoost settings. A structured search using Optuna could squeeze out additional performance, particularly on the precision-recall tradeoff.</p>
<p>Patient-level cross-validation — instead of a single train/test split, k-fold cross-validation at the patient level would give a more robust estimate of model performance.</p>
</section>
<section id="final-reflection" class="level2">
<h2 class="anchored" data-anchor-id="final-reflection">Final Reflection</h2>
<p>This project started as a portfolio piece and became something more interesting — a genuine encounter with the messiness of real clinical data and the gap between a working model and a deployable tool.</p>
<p>The hardest parts were not technical. They were conceptual: understanding why forward fill is more honest than interpolation, why the standard precision metric misleads in a sequential prediction setting, why ICULOS dominating SHAP is a problem worth acknowledging rather than hiding.</p>
<p>A model with 0.885 AUC catching sepsis a median of 23 hours before the label flips — approximately 29 hours before clinical diagnosis — is genuinely promising. The path from here to a system a nurse would actually trust and use is long and complex. But the signal is there.</p>
</section>
<section id="what-i-would-do-next" class="level2">
<h2 class="anchored" data-anchor-id="what-i-would-do-next">What I Would Do Next</h2>
<p>If I were to continue this project, the next steps would be:</p>
<p><strong>Train on both hospital datasets.</strong> This project used only training set A (Beth Israel Deaconess Medical Center). The PhysioNet challenge also includes training set B from Emory University Hospital. Training on both would expose the model to more diverse patient populations and likely improve generalization.</p>
<p><strong>Explore neural network architectures.</strong> XGBoost treats each row independently. A recurrent neural network (LSTM) or transformer architecture could learn sequential patterns across hours more naturally, potentially catching earlier and subtler trends.</p>
<p><strong>Patient-level cross-validation.</strong> Instead of a single train/test split, k-fold cross-validation at the patient level would give a more statistically robust estimate of model performance.</p>
<p><strong>Address ICULOS dominance.</strong> Training only on pre-onset hours would force the model to learn from physiological signals rather than relying on how long a patient has been in the ICU.</p>
<p><em>The full code, notebooks, and analysis for this project are available at <a href="https://github.com/leonardoalva98/senior_project/blob/main/code">https://github.com/leonardoalva98/senior_project/blob/main/code</a>. Dataset: PhysioNet CinC Challenge 2019.</em></p>


</section>

 ]]></description>
  <category>healthcare</category>
  <category>data science &amp; ML</category>
  <category>reflection</category>
  <guid>https://leonardoalva98.github.io/senior_project/posts/results/</guid>
  <pubDate>Mon, 30 Mar 2026 06:00:00 GMT</pubDate>
</item>
<item>
  <title>Building a Sepsis Predictor</title>
  <dc:creator>Leonardo Alvarino</dc:creator>
  <link>https://leonardoalva98.github.io/senior_project/posts/Building-the-model/</link>
  <description><![CDATA[ 




<p>In my <a href="../../posts/Introduction/index.html">first post</a>, I explained why sepsis is so hard to catch early and why machine learning could help. In this post I’ll walk through exactly how I built the model: the messy data, the engineering decisions, and the trade-offs along the way.</p>
<section id="the-dataset" class="level2">
<h2 class="anchored" data-anchor-id="the-dataset">The Dataset</h2>
<p>The data comes from the PhysioNet Computing in Cardiology Challenge 2019 — a publicly available dataset of <strong>20,336 de-identified ICU patients</strong> from two US hospital systems. Each patient file contains one row per hour in the ICU, with columns for vital signs, lab values, and demographics.</p>
<p>After combining all 20,336 files into one dataframe, I had <strong>790,215 rows and 42 columns</strong>. That’s every hourly ICU reading across all patients, a complete picture of what it looks like to monitor someone in critical care.</p>
<p>One important detail about the labels: the PhysioNet challenge shifts <code>SepsisLabel</code> six hours ahead of clinical diagnosis, so when the label flips to 1, the official diagnosis is still six hours away. This means that if my model predicted sepsis 3 hours before the label flipped for patient X, the warning actually came 9 hours before the clinical diagnosis.</p>
</section>
<section id="the-mess-missing-values" class="level2">
<h2 class="anchored" data-anchor-id="the-mess-missing-values">The Mess: Missing Values</h2>
<p>Vital signs like heart rate and blood pressure are recorded continuously by bedside monitors, but even those have gaps when monitors are disconnected or recalibrated. Lab values like lactate, WBC, and creatinine are drawn periodically, sometimes once a day, sometimes less — so entire stretches of a patient’s stay have no lab readings at all.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://leonardoalva98.github.io/senior_project/posts/Building-the-model/missing_values.png" class="img-fluid figure-img"></p>
<figcaption>Missing value rate across all columns. Red bars were dropped (&gt;80% missing), blue bars were forward-filled.</figcaption>
</figure>
</div>
<p>Here’s what the missingness looked like across key columns:</p>
<ul>
<li><code>HR</code> — 7.7% missing</li>
<li><code>Temp</code> — 66% missing</li>
<li><code>Lactate</code> — 97% missing</li>
<li><code>EtCO2</code> — 100% missing (completely empty)</li>
</ul>
<p>Five columns — <code>EtCO2</code>, <code>TroponinI</code>, <code>Bilirubin_direct</code>, <code>Fibrinogen</code>, and <code>Bilirubin_total</code> — were more than 80% missing. These were dropped entirely. The rest were handled through forward filling.</p>
</section>
<section id="forward-fill-what-it-is-and-why-not-interpolation" class="level2">
<h2 class="anchored" data-anchor-id="forward-fill-what-it-is-and-why-not-interpolation">Forward Fill: What It Is and Why Not Interpolation</h2>
<p>For vital signs, I used <strong>forward fill</strong> — carrying the last known reading forward in time within each patient’s hourly record.</p>
<p>If a patient’s heart rate was recorded as 97 at hour 2 and the next reading wasn’t until hour 5, hours 3 and 4 get filled with 97. The assumption is that the last known value is the best estimate of what was happening in the interim.</p>
<p>A natural question is why not use <strong>linear interpolation</strong> — gradually transitioning between the hour 2 value and the hour 5 value. Interpolation is more mathematically elegant, but it has a fundamental problem in this context: it uses future information to fill past values. At hour 3, you wouldn’t actually know what the hour 5 reading would be. Using that future data to fill the gap would be <strong>data leakage</strong> — the model would be trained on information it couldn’t have in real deployment. Forward fill avoids this entirely.</p>
<p>Lab values like lactate and creatinine were left with their remaining NaNs after forward filling. XGBoost handles missing values natively, so these don’t need to be filled.</p>
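<p>In pandas, the key detail is grouping by patient before filling, so the last reading of one patient never bleeds into the next patient's record. A minimal sketch on toy data:</p>

```python
import numpy as np
import pandas as pd

# Two patients' hourly heart-rate readings with gaps (illustrative values).
df = pd.DataFrame({
    "patient_id": [1, 1, 1, 1, 2, 2],
    "HR": [97, np.nan, np.nan, 103, np.nan, 88],
})

# groupby().ffill() carries the last known value forward per patient;
# patient 2's leading NaN stays NaN because no earlier reading exists.
df["HR"] = df.groupby("patient_id")["HR"].ffill()
```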
</section>
<section id="feature-engineering-teaching-the-model-to-see-trends" class="level2">
<h2 class="anchored" data-anchor-id="feature-engineering-teaching-the-model-to-see-trends">Feature Engineering: Teaching the Model to See Trends</h2>
<p>The single most important insight driving the feature engineering was this: <strong>a snapshot is less informative than a trend</strong>.</p>
<p>A heart rate of 102 bpm is mildly elevated. A heart rate that has risen from 82 to 102 over the last 6 hours is a warning sign. The raw value doesn’t tell that story — the rate of change does.</p>
<p>For each of the five key vitals (HR, Temp, Resp, MAP, O2Sat), I created two new features using a 6-hour rolling window:</p>
<p><strong>Rolling average</strong> — the mean of the last 6 hourly readings. This smooths out noise and captures the sustained level of a vital sign rather than a one-off spike.</p>
<p><strong>Rate of change</strong> — the difference between the current reading and the reading 6 hours ago. A positive value means the vital is rising; negative means it’s falling. This is the early warning signal.</p>
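<p>Both features come from grouped rolling operations in pandas. A sketch for a single vital on toy data; the real pipeline repeats this for HR, Temp, Resp, MAP, and O2Sat, and the column names follow the post's convention:</p>

```python
import pandas as pd

# One patient's hourly heart rate, creeping upward (illustrative values).
df = pd.DataFrame({
    "patient_id": [1] * 8,
    "HR": [82, 84, 88, 90, 95, 99, 101, 102],
})

g = df.groupby("patient_id")["HR"]
# Rolling average: mean of the last 6 hourly readings.
df["HR_rolling6"] = g.transform(lambda s: s.rolling(6, min_periods=1).mean())
# Rate of change: current reading minus the reading 6 hours ago.
df["HR_delta6"] = g.diff(6)
```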
<p>I also encoded three established clinical scores directly as features:</p>
<p><strong>SIRS Score</strong> — counts how many of the four SIRS criteria are met at each hour (HR &gt; 90, Temp &gt; 38 or &lt; 36, Resp &gt; 20, WBC &gt; 12 or &lt; 4). This is the same checklist nurses already use — turning it into a single number gives the model a compact summary of clinical risk.</p>
<p><strong>Shock Index</strong> — heart rate divided by systolic blood pressure. Normal is around 0.5. Above 1.0 indicates significant cardiovascular stress. This ratio captures something neither HR nor SBP captures alone.</p>
<p><strong>Pulse Pressure</strong> — systolic minus diastolic blood pressure. A narrowing pulse pressure indicates the heart is struggling to maintain output.</p>
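<p>All three scores are simple arithmetic on existing columns. A sketch on toy values, with thresholds taken from the criteria described above:</p>

```python
import pandas as pd

# Two illustrative patients: one meeting all four SIRS criteria, one healthy.
df = pd.DataFrame({
    "HR": [95, 70], "Temp": [38.5, 37.0], "Resp": [22, 14],
    "WBC": [13.0, 8.0], "SBP": [100, 120], "DBP": [60, 80],
})

# SIRS: count how many of the four criteria are met at each hour.
df["SIRS_score"] = (
    (df["HR"] > 90).astype(int)
    + ((df["Temp"] > 38) | (df["Temp"] < 36)).astype(int)
    + (df["Resp"] > 20).astype(int)
    + ((df["WBC"] > 12) | (df["WBC"] < 4)).astype(int)
)
df["shock_index"] = df["HR"] / df["SBP"]      # above 1.0 signals cardiovascular stress
df["pulse_pressure"] = df["SBP"] - df["DBP"]  # narrowing suggests the heart is struggling
```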
<p>The scatter plots below show each SIRS variable against the sepsis label. Notice how no single variable cleanly separates the two groups — this is exactly why machine learning is needed over simple threshold rules.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://leonardoalva98.github.io/senior_project/posts/Building-the-model/sirs_scatter.png" class="img-fluid figure-img"></p>
<figcaption>SIRS criteria plotted against sepsis label. No single variable cleanly separates the groups — confirming that ML is needed to find the combined pattern.</figcaption>
</figure>
</div>
</section>
<section id="the-early-signal" class="level2">
<h2 class="anchored" data-anchor-id="the-early-signal">The Early Signal</h2>
<p>Before modeling, the EDA revealed something critical. When I aligned all sepsis patients to their onset time and looked at the 12 hours before the label flipped, their vitals were already diverging from non-sepsis patients — from the very beginning of the window.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://leonardoalva98.github.io/senior_project/posts/Building-the-model/12hr_before_sepsis.png" class="img-fluid figure-img"></p>
<figcaption>Vital signs in the 12 hours before sepsis onset (orange) vs non-sepsis patients (blue). The red dashed line marks official sepsis onset. The divergence begins well before hour 0.</figcaption>
</figure>
</div>
<p>This is known as <strong>label uncertainty</strong> — the label marks when sepsis was recognized, not when it started. The physiological deterioration precedes the diagnosis. Combined with the 6-hour label shift already built into the dataset, the detectable signal begins approximately <strong>18 hours before clinical recognition</strong>.</p>
</section>
<section id="the-class-imbalance-problem" class="level2">
<h2 class="anchored" data-anchor-id="the-class-imbalance-problem">The Class Imbalance Problem</h2>
<p>Of the 20,336 patients in the dataset, only 1,790 — about 8.8% — ever develop sepsis. At the row level, since most hours for even sepsis patients are labeled 0, the sepsis rate drops to just <strong>2.2%</strong>. For every sepsis row the model sees during training, it sees 45 non-sepsis rows.</p>
<p>If left unaddressed, the model would learn to predict “no sepsis” for everything and achieve 97.8% accuracy while being completely useless clinically.</p>
<p>XGBoost handles this through a parameter called <code>scale_pos_weight</code>, which instructs the model to treat each sepsis row as if it were 45 non-sepsis rows. This forces the model to pay equal attention to both classes despite the imbalance.</p>
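<p>The weight is typically set to the ratio of negative to positive rows. A small sketch at roughly this dataset's 2.2% positive rate (the labels below are synthetic):</p>

```python
import pandas as pd

# Synthetic labels at roughly the dataset's 2.2% positive row rate.
y = pd.Series([0] * 978 + [1] * 22)

neg, pos = (y == 0).sum(), (y == 1).sum()
ratio = neg / pos  # roughly 45 negative rows per positive row

# The ratio is then passed to the model, e.g.:
# model = xgboost.XGBClassifier(scale_pos_weight=ratio)
```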
</section>
<section id="why-xgboost" class="level2">
<h2 class="anchored" data-anchor-id="why-xgboost">Why XGBoost</h2>
<p>I trained two models: a logistic regression as a baseline, and XGBoost as the primary model.</p>
<p>Logistic regression is a linear model — it can only find straight-line relationships between features and outcome. Sepsis risk doesn’t work that way. A heart rate of 90 combined with rising lactate and dropping blood pressure is far more dangerous than any of those values in isolation. Capturing that interaction requires a non-linear model.</p>
<p>XGBoost is a gradient boosting algorithm that builds an ensemble of decision trees, with each tree correcting the errors of the previous ones. It has two specific advantages for this problem: it handles missing values natively (important given how sparse the lab data is), and it captures complex non-linear interactions between features automatically.</p>
<p>One important methodological note on the train/test split: I split at the <strong>patient level</strong>, not the row level. This means every row for a given patient ended up entirely in the training set or entirely in the test set — never both. A row-level split would allow the model to see hours 1-40 of a patient during training and then predict hours 41-54 in the test set. Since the model has already seen that patient, this inflates performance metrics. Patient-level splitting gives a more honest estimate of how the model would perform on genuinely new patients.</p>
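<p>One way to implement such a split is scikit-learn's <code>GroupShuffleSplit</code> with patient IDs as the grouping key. This is a self-contained sketch on synthetic data, not necessarily the exact splitting code used in the project:</p>

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Synthetic data: 100 patients with 10 hourly rows each.
rng = np.random.default_rng(0)
patient_ids = np.repeat(np.arange(100), 10)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# Split whole patients, never individual rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_ids))

# No patient appears on both sides of the split.
overlap = set(patient_ids[train_idx]) & set(patient_ids[test_idx])
```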
</section>
<section id="shap-opening-the-black-box" class="level2">
<h2 class="anchored" data-anchor-id="shap-opening-the-black-box">SHAP: Opening the Black Box</h2>
<p>A model that outputs a risk score but can’t explain it is not useful in a clinical setting. A nurse who sees “HIGH RISK” without knowing which vitals are driving that alert has no clear action to take.</p>
<p>SHAP (SHapley Additive exPlanations) solves this by assigning each feature a contribution score for every individual prediction. A positive SHAP value means that feature pushed the risk score up; a negative value means it pushed it down.</p>
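<p>Once per-row SHAP values exist (for a tree model they typically come from <code>shap.TreeExplainer</code>), a global importance ranking is just the mean absolute contribution per feature. The array and feature names below are toy values for illustration:</p>

```python
import numpy as np

# Toy SHAP matrix: rows = individual predictions, columns = features.
# In practice this would come from shap.TreeExplainer(model).shap_values(X_test).
feature_names = ["ICULOS", "Shock_index", "Resp_rolling6"]
shap_values = np.array([
    [ 0.9, -0.2, 0.1],
    [-0.8,  0.3, 0.0],
    [ 0.7, -0.1, 0.2],
])

# Global importance: mean absolute contribution per feature, sorted descending.
importance = np.abs(shap_values).mean(axis=0)
ranking = [feature_names[i] for i in np.argsort(importance)[::-1]]
```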
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://leonardoalva98.github.io/senior_project/posts/Building-the-model/shap_importance.png" class="img-fluid figure-img"></p>
<figcaption>SHAP feature importance — the top predictors of sepsis risk ranked by their average impact on model output.</figcaption>
</figure>
</div>
<p>The top features by importance were ICULOS (hours in ICU), HospAdmTime, ICU unit type, respiratory rate rolling average, age, temperature, and blood pressure. Several of the engineered features — particularly <code>Resp_rolling6</code> and <code>MAP_rolling6</code> — ranked in the top 10, validating the time-based feature engineering approach.</p>
<p>One finding worth flagging: ICULOS (hours in the ICU) ranking as the most important feature suggests the model is partly learning that patients who stay longer are more likely to develop sepsis — a real but not entirely actionable signal. This is an area for future refinement.</p>
</section>
<section id="whats-next" class="level2">
<h2 class="anchored" data-anchor-id="whats-next">What’s Next</h2>
<p>In the third and final blog post, I’ll share the model’s performance results, walk through the clinical evaluation — including how we measured hours of early detection — and reflect on the limitations and considerations of the results.</p>
<hr>
<p><em>The full code for this project is available at <a href="https://github.com/leonardoalva98/senior_project/tree/main/code">https://github.com/leonardoalva98/senior_project/tree/main/code</a>. Dataset: PhysioNet CinC Challenge 2019.</em></p>


</section>

 ]]></description>
  <category>healthcare</category>
  <category>data science &amp; ML</category>
  <category>XGBoost</category>
  <guid>https://leonardoalva98.github.io/senior_project/posts/Building-the-model/</guid>
  <pubDate>Fri, 20 Mar 2026 06:00:00 GMT</pubDate>
</item>
<item>
  <title>Sepsis in the ICU — And How ML Could Help Stop It</title>
  <dc:creator>Leonardo Alvarino</dc:creator>
  <link>https://leonardoalva98.github.io/senior_project/posts/Introduction/</link>
  <description><![CDATA[ 




<p>Imagine you’re a nurse managing six patients in an intensive care unit. It’s hour nine of a twelve-hour shift. You’re charting vitals, managing medications, responding to alarms. One of your patients, a 72-year-old man recovering from abdominal surgery, looks stable. His numbers aren’t alarming. Nothing is flagging. Over the next six hours, his heart rate creeps up. His blood pressure dips slightly. His breathing becomes a little faster. Each change, on its own, looks unremarkable. Together, they tell a story that won’t become obvious until it’s almost too late: he is developing sepsis.</p>
<section id="what-is-sepsis" class="level2">
<h2 class="anchored" data-anchor-id="what-is-sepsis">What Is Sepsis?</h2>
<p>Sepsis is not a disease you catch. It’s what happens when your body’s response to an infection spirals out of control. Instead of fighting the infection, the immune system begins attacking the body’s own tissues and organs. Blood pressure drops. Organs start to fail, and time becomes the enemy. Every hour of delayed treatment increases the risk of death by roughly 7%. And yet, because the early signs are subtle and easy to miss — especially in an ICU where every patient has abnormal vitals — sepsis is often caught too late.</p>
<p>In the United States alone, an estimated 350,000 people die from sepsis every year,<sup>1</sup> more than die from stroke, prostate cancer, breast cancer, and opioid overdoses combined. It is the leading cause of death in hospitals and the most expensive condition to treat, costing the healthcare system over $62 billion annually.<sup>2</sup></p>
</section>
<section id="why-is-it-so-hard-to-catch-early" class="level2">
<h2 class="anchored" data-anchor-id="why-is-it-so-hard-to-catch-early">Why Is It So Hard to Catch Early?</h2>
<p>The challenge with sepsis is that it doesn’t announce itself. It develops gradually, through patterns that are hard for humans to track in real time, especially when one nurse is responsible for multiple critically ill patients simultaneously.</p>
<p>Current clinical tools like qSOFA and SIRS criteria give nurses simple checklists to assess sepsis risk. Check three boxes, add up the score. These tools are better than nothing, but they have real limitations. They look at a single snapshot in time. They don’t account for trends. And they rely on a nurse remembering to run the check in the first place.</p>
<p>Some hospitals, including those using Epic’s electronic health record system, do have automated sepsis alerts built in. But these come with a problem: alert fatigue. A study of one hospital’s sepsis alert system found that nearly 88% of alerts were cancelled or timed out without action.<sup>3</sup> A University of Michigan study of Epic’s own Sepsis Model found it missed 67% of actual sepsis patients while still firing alerts on 18% of all hospitalized patients.<sup>4</sup> An alert that gets ignored — or fires so often it loses meaning — isn’t saving anyone.</p>
<p>The deeper issue is explainability. Current automated alerts tell a nurse that something might be wrong, but not why. Without knowing which specific vitals are driving the concern, the nurse has no clear action to take. The signal is there. It’s just buried in noise, spread across hours of data, and too often delivered without context.</p>
</section>
<section id="what-if-a-machine-was-watching" class="level2">
<h2 class="anchored" data-anchor-id="what-if-a-machine-was-watching">What If a Machine Was Watching?</h2>
<p>For my data science senior project, I’m building a Sepsis Early Warning System using real ICU data from the PhysioNet Computing in Cardiology Challenge 2019, a dataset containing over 20,000 de-identified patient records from actual intensive care units.</p>
<p>Each patient record contains hourly readings of vital signs and lab values: heart rate, temperature, blood pressure, oxygen saturation, respiration rate, white blood cell count, lactate levels, and others. Each row is labeled — sepsis eventually developed, or it didn’t.</p>
<p>The goal is to train a machine learning model to recognize the pattern that precedes sepsis onset, up to 12 hours before a clinician would traditionally identify it. Not by looking at one number but by looking at how all the numbers move together over time.</p>
<p>My motivation for building this system goes beyond the technical challenge. ICU nurses already carry one of the most demanding workloads in healthcare, managing multiple critically ill patients, documenting vitals, responding to alarms, all within a 12-hour shift that leaves little room for error. A tool that flags risk early doesn’t just help patients. It gives nurses better information at the right moment, reducing the cognitive burden of having to catch what the data is quietly trying to say. Better tools mean better care, and a less impossible job.</p>
</section>
<section id="why-12-hours-matters" class="level2">
<h2 class="anchored" data-anchor-id="why-12-hours-matters">Why 12 Hours Matters</h2>
<p>Twelve hours of advance warning in the ICU is huge. It’s the difference between a nurse proactively administering antibiotics and fluids before a crisis, versus scrambling to stabilize a patient who is already in septic shock.</p>
<p>Early intervention means shorter ICU stays, fewer organ failures, lower costs, and fewer deaths.</p>
</section>
<section id="what-comes-next" class="level2">
<h2 class="anchored" data-anchor-id="what-comes-next">What Comes Next</h2>
<p>This is the first of three posts documenting my journey building this system from scratch.</p>
<p>In the next post, I’ll get into the technical details: how I cleaned and processed 20,000 messy ICU patient files, what features the model actually learns from, and how XGBoost compares to simpler baseline approaches.</p>
<p>In the final post, I’ll walk through the model’s results and clinical evaluation, and reflect on what building this project taught me about the gap between research and real world clinical deployment.</p>
<hr>
</section>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<p><sup>1</sup> Sepsis Alliance. (2022). Sepsis Alliance Updates Key Number of Annual Sepsis Casualties. https://www.sepsis.org/news/sepsis-alliance-updates-key-number-of-annual-sepsis-casualties/</p>
<p><sup>2</sup> End Sepsis. Sepsis Fact Sheet. https://www.endsepsis.org/what-is-sepsis/sepsis-fact-sheet-2/</p>
<p><sup>3</sup> Parajon et al.&nbsp;(2020). Study of Alert Fatigue, Effectiveness, and Accuracy in the Development of a New Sepsis Best Practice Alert. Society of Hospital Medicine. https://shmabstracts.org/abstract/study-of-alert-fatigue-effectiveness-and-accuracy-in-the-development-of-a-new-sepsis-best-practice-alert/</p>
<p><sup>4</sup> Infectious Disease Advisor. (2021). Epic Sepsis Model Poorly Predictive Due to Low Sensitivity, Inadequate Calibration. https://www.infectiousdiseaseadvisor.com/news/epic-sepsis-model-is-poor-predictor-and-has-tendency-to-cause-alert-fatigue/</p>
<p><em>This project was developed using the PhysioNet CinC Challenge 2019 dataset.</em></p>


</section>

 ]]></description>
  <category>healthcare</category>
  <category>data science &amp; ML</category>
  <guid>https://leonardoalva98.github.io/senior_project/posts/Introduction/</guid>
  <pubDate>Mon, 09 Mar 2026 06:00:00 GMT</pubDate>
</item>
</channel>
</rss>
