Machine Learning & Deep Learning

Reasoning About a Learning Problem

The preceding eight lessons have built a complete conceptual framework: functions and hypothesis spaces, parameters, training data as constraint, loss functions, generalization, the bias-variance trade-off, gradient descent, and the role of data quantity. This lesson does not introduce new concepts — it integrates all of them through a single, carefully developed problem. Reasoning rigorously about one problem end-to-end is more valuable than memorizing isolated facts.

The Problem: Predicting Hospital Readmission

A hospital wants a model that predicts, at the time of a patient's discharge, whether the patient will be readmitted within 30 days. The goal is to flag high-risk patients for additional follow-up care and so reduce preventable readmissions. Let us work through this problem at every layer of the framework.

Step 1: Function signature. Input space X: patient features available at discharge, such as age, diagnosis code, number of prior admissions, medications prescribed, days hospitalized, vital signs and lab values (blood pressure, glucose, hemoglobin), insurance type, and discharge destination (home, rehab facility, etc.). Output space Y: the probability of readmission within 30 days, a value in [0, 1]. This is a binary classification problem (readmitted or not) framed probabilistically. Function: f: patient_features → probability_of_readmission.

Step 2: Hypothesis space. A logistic regression model has one parameter per input feature plus a bias: roughly 15-20 parameters for this feature set, assuming each feature enters as a single numeric column (one-hot encoding categorical features such as diagnosis code would raise the count). The hypothesis space is all sigmoid-transformed linear combinations of the features. The model assumes the log-odds of readmission is a linear function of the input features. This is a strong assumption that may not hold. A gradient boosted decision tree ensemble typically has thousands of parameters (the split thresholds and leaf values of its trees) and can represent complex nonlinear interactions among features. Its hypothesis space is far richer. A deep neural network can have millions of parameters. Whether this expressiveness is beneficial depends on how much data is available and how much signal there is in the input features.
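To make the function signature and the logistic regression hypothesis space concrete, here is a minimal Python sketch. The feature values, weights, and bias are made-up numbers chosen for illustration, not fitted to any data, and the name predict_readmission is hypothetical.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Map a real-valued score (the log-odds) to a probability in (0, 1)."""
    # Two branches avoid overflow in math.exp for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_readmission(
    features: Sequence[float], weights: Sequence[float], bias: float
) -> float:
    """f: patient_features -> probability_of_readmission (logistic regression).

    One weight per feature plus a bias, so the parameter count is
    len(weights) + 1.
    """
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    log_odds = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(log_odds)


# Hypothetical, already-scaled features: age, prior admissions, days hospitalized.
example_features = [0.5, 2.0, 1.0]
# Hypothetical parameters; in practice these are learned from training data.
example_weights = [0.8, 0.6, 0.1]
example_bias = -2.0

probability = predict_readmission(example_features, example_weights, example_bias)
parameter_count = len(example_weights) + 1
```

Every choice of weights and bias picks out one function from the hypothesis space; training searches that space for the choice that best fits the data.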

Architecture Encodes Assumptions

Choosing logistic regression assumes the readmission probability is a sigmoid function of a linear combination of features. Choosing a neural network makes far weaker assumptions about the functional form but requires far more data to constrain the larger hypothesis space. Neither choice is neutral — both are hypotheses about the structure of the problem.
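Written out, the assumption that logistic regression encodes looks like this. The symbols are introduced here for illustration: x is the feature vector, w the weight vector, b the bias, and p the predicted readmission probability.

```latex
\[
p = \sigma(w^\top x + b) = \frac{1}{1 + e^{-(w^\top x + b)}}
\quad\Longleftrightarrow\quad
\log\frac{p}{1 - p} = w^\top x + b
\]
```

The right-hand form makes the restriction visible: each feature can only shift the log-odds by an amount proportional to its value, independently of every other feature.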

Step 3: Training data. The hospital has 50,000 discharge records from the past five years with confirmed 30-day readmission outcomes. This is the training dataset. Representativeness concern: patients from five years ago may differ from current patients, because treatment protocols change, demographics shift, and new conditions emerge. The training distribution may not match the deployment distribution. Label quality: readmission within 30 days to the same hospital is objectively observed. But patients who were readmitted to a different hospital, or who died within the 30 days, would be recorded as not readmitted. This is a real label-quality issue in this domain. Class imbalance: if only 12% of discharges result in readmission, the dataset is imbalanced. A model that always predicts 'not readmitted' achieves 88% accuracy, which sounds impressive but is clinically useless. The loss function and evaluation metric must account for this.

Step 4: Loss function. Binary cross-entropy is appropriate for probabilistic binary classification. But class imbalance motivates weighting the positive class (readmitted) more heavily in the loss, e.g., giving a readmitted example about 7× the loss weight of a non-readmitted example (reflecting the 88/12 split, since 88/12 ≈ 7.3). This encodes the value judgment that missing a true readmission is more costly than a false alarm. The clinical cost asymmetry may warrant an even higher weight: a false negative (missing a high-risk patient) has worse consequences than a false positive (extra follow-up for a low-risk patient). This is a value judgment that should involve clinicians, not just data scientists.
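The accuracy trap and the weighted loss can both be checked with a few lines of Python. This is a sketch on synthetic labels with the lesson's 12% positive rate, not real hospital data; the function names and the choice to average over the number of examples (rather than the sum of weights) are assumptions made for illustration.

```python
import math
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    correct = sum(1 for y, yhat in zip(y_true, y_pred) if y == yhat)
    return correct / len(y_true)


def weighted_bce(
    y_true: Sequence[int],
    p_pred: Sequence[float],
    pos_weight: float = 1.0,
    eps: float = 1e-12,
) -> float:
    """Binary cross-entropy with extra weight on positive (readmitted) examples.

    Averaged over the number of examples; pos_weight=1.0 gives the
    ordinary unweighted loss.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        if y == 1:
            total += -pos_weight * math.log(p)
        else:
            total += -math.log(1.0 - p)
    return total / len(y_true)


# Synthetic labels: 12 readmitted, 88 not readmitted.
labels = [1] * 12 + [0] * 88

# A "model" that always predicts 'not readmitted'.
always_negative = [0] * 100
baseline_accuracy = accuracy(labels, always_negative)

# Inverse-frequency weight from the 88/12 split, roughly 7.3.
pos_weight = 88 / 12

# A model that confidently predicts a 1% risk for everyone.
confident_negative = [0.01] * 100
unweighted_loss = weighted_bce(labels, confident_negative)
weighted_loss = weighted_bce(labels, confident_negative, pos_weight=pos_weight)
```

The always-negative model scores 0.88 accuracy while catching no readmissions, and the weighted loss penalizes the confidently negative model far more than the unweighted loss does, because each missed readmission now counts about seven times.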

Prompt Challenge

Write a one-paragraph problem statement for a new ML system that frames it in terms of: the function it must learn, the hypothesis space chosen, the key data challenges, and the choice of loss function.

Your prompt should…

  • Identify the input space and output space precisely using function notation
  • Name the model architecture and explain what assumption about functional form it encodes
  • State one data quality or distribution challenge and one loss function choice with justification

Generalization Strategy and Model Evaluation

Step 5: Generalization strategy. Split the 50,000 records: 70% training (35,000), 15% validation (7,500), 15% test (7,500). Critically, hold out the most recent records as the test set rather than a random sample, because temporal order mimics deployment: the model is always predicting future patients from past data. A random split would allow the model to 'see the future' during training. During development, tune hyperparameters (learning rate, regularization, tree depth) using the validation set. Touch the test set exactly once.

Step 6: Bias-variance diagnosis. Start with logistic regression. If training and validation AUC are both low (e.g., 0.65 for both), the model has high bias: the linear hypothesis space is not expressive enough. Move to a richer architecture. If training AUC is 0.90 and validation AUC is 0.72, the model has high variance and is overfitting. Remedies: regularization (L2 penalty on weights), reducing feature count, or gathering more data.

Step 7: Data quantity assessment. With 50,000 examples for a 15-20 parameter logistic regression model, the parameters are already heavily over-constrained by the data, so more data will help minimally. For a 10,000-parameter gradient boosted model, 50,000 examples is a reasonable match. For a deep neural network with millions of parameters, variance will be high: either regularize heavily or acquire far more data.

Step 8: Honest limitations. Even a perfectly trained model only predicts 30-day readmission risk for patients who resemble those in the training data. It cannot account for unprecedented circumstances (a new pandemic, a policy change), and it will underperform for demographic groups underrepresented in the training data. The model should be deployed as decision support, not as a replacement for clinical judgment.
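The temporal split from Step 5 and the diagnosis rule from Step 6 can be sketched in Python as follows. The function names are hypothetical, and the thresholds in diagnose (an AUC below 0.70 counts as low, a gap above 0.10 counts as large) are illustrative assumptions rather than standard values. Here the validation set is also taken from records later than the training set, which is one reasonable reading of the split described above.

```python
from datetime import date, timedelta
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def temporal_split(
    records: Sequence[T],
    discharge_dates: Sequence[date],
    train_frac: float = 0.70,
    val_frac: float = 0.15,
) -> Tuple[List[T], List[T], List[T]]:
    """Split by discharge date: oldest -> train, middle -> validation, newest -> test."""
    if len(records) != len(discharge_dates):
        raise ValueError("records and discharge_dates must have the same length")
    # Indices ordered from earliest to latest discharge.
    order = sorted(range(len(records)), key=lambda i: discharge_dates[i])
    n_train = round(len(order) * train_frac)
    n_val = round(len(order) * val_frac)
    train = [records[i] for i in order[:n_train]]
    val = [records[i] for i in order[n_train:n_train + n_val]]
    test = [records[i] for i in order[n_train + n_val:]]
    return train, val, test


def diagnose(
    train_auc: float,
    val_auc: float,
    low_auc: float = 0.70,   # assumed threshold for "low" performance
    max_gap: float = 0.10,   # assumed threshold for a "large" train/val gap
) -> str:
    """Rough bias-variance diagnosis from training and validation AUC."""
    if train_auc - val_auc > max_gap:
        return "high variance"
    if train_auc < low_auc and val_auc < low_auc:
        return "high bias"
    return "no clear problem"


# Synthetic example: 20 records whose discharge dates arrive out of order.
start = date(2020, 1, 1)
patient_ids = [f"patient_{i}" for i in range(20)]
dates = [start + timedelta(days=(7 * i) % 20) for i in range(20)]

train_ids, val_ids, test_ids = temporal_split(patient_ids, dates)

both_low = diagnose(0.65, 0.65)
large_gap = diagnose(0.90, 0.72)
```

With the lesson's numbers, AUC 0.65 on both sets comes out as high bias, and 0.90 versus 0.72 comes out as high variance. On the full 50,000 records the same fractions give the 35,000 / 7,500 / 7,500 split.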

Model Outputs Are Predictions, Not Facts

A model predicting 78% readmission probability does not mean this patient will be readmitted. If the model is well calibrated, it means that roughly 78% of patients with similar features were readmitted in the training data. Communicating this uncertainty clearly to clinicians, and designing workflows that use the output appropriately, is as important as the model's accuracy.
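One way to check whether predicted probabilities can be read as observed frequencies is to group patients by predicted risk and compare against the actual readmission rate in each group. The sketch below uses synthetic data; calibration_table is a hypothetical helper written for this lesson, not a library function.

```python
from typing import List, Sequence, Tuple


def calibration_table(
    y_true: Sequence[int], p_pred: Sequence[float], n_bins: int = 10
) -> List[Tuple[float, float, int]]:
    """Return (mean predicted probability, observed rate, count) per non-empty bin."""
    bins: List[List[Tuple[int, float]]] = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, p_pred):
        if not 0.0 <= p <= 1.0:
            raise ValueError("predicted probabilities must lie in [0, 1]")
        # Equal-width bins over [0, 1]; p == 1.0 goes into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((y, p))
    rows = []
    for members in bins:
        if members:
            count = len(members)
            mean_p = sum(p for _, p in members) / count
            observed = sum(y for y, _ in members) / count
            rows.append((mean_p, observed, count))
    return rows


# Synthetic group: 100 patients all scored at 0.78, of whom 78 were readmitted.
outcomes = [1] * 78 + [0] * 22
scores = [0.78] * 100
rows = calibration_table(outcomes, scores)
```

For this synthetic group the mean predicted probability and the observed rate agree, which is what a calibrated 78% means. A large mismatch between the two columns on held-out data would be a reason not to present the raw scores to clinicians as risk percentages.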

The team evaluates their readmission model and finds: training AUC = 0.91, validation AUC = 0.76. They propose adding L2 regularization. What is their diagnosis, and is the proposed remedy appropriate?

The hospital proposes evaluating the readmission model by accuracy: (correct predictions / total predictions). Why is accuracy a misleading metric here?

End-to-End Problem Analysis

  1. Choose one of the following ML problems and work through all eight steps from today's lesson. Write a structured one-page analysis.
     • Option A: Predict whether a student will drop a course before the end of the semester, using data from the first two weeks of enrollment.
     • Option B: Predict the fuel consumption (liters per 100 km) of a vehicle given engine specifications, weight, and aerodynamic coefficients.
     • Option C: Predict whether a social media post will go viral (exceed 10,000 shares) within 24 hours of posting.
  2. For your chosen problem, address all eight steps:
     1. Function signature (input space, output space)
     2. Three candidate model architectures and what assumption each encodes
     3. Data sources and two representativeness or label-quality concerns
     4. Loss function choice and justification
     5. Train/validation/test split strategy and why you split that way
     6. How you would diagnose high bias vs. high variance after initial training
     7. Whether more data would help for each architecture you named
     8. One honest limitation of the final model in deployment
  3. Present your analysis to the class. Defend your choices when challenged.