How to evaluate a machine learning model
You are a botanist. You are interested in iris flowers, and in particular in their petal width.
Different types of prediction
TODO Diagram of a straight line with the regression points in 2D
TODO Diagram of a wiggly line which goes through every point in 2D
How can we quantify how good a prediction is?
\[ AE = |y_i - \hat{y}_i| \]
\[ MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \]
\[ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]
\[ RMSE = \sqrt{MSE} \]
Mean Absolute Error (MAE) = the mean of the absolute differences between predictions and true values.
Mean Squared Error (MSE) = the mean of the squared residuals; large errors are penalised more heavily.
Root Mean Squared Error (RMSE) = the standard deviation of the residuals, in the same units as y.
R-squared (value between 0 and 1) = the proportion of the variation in y explained by your model.
Adjusted R-squared = R^2 always increases as you add predictor variables, which encourages overfitting; adjusted R^2 penalises the model for each extra predictor.
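The regression metrics above can be computed directly from the residuals; a minimal sketch with hypothetical petal-width values (the numbers are made up for illustration):

```python
import numpy as np

# Hypothetical petal-width measurements (true) and model predictions.
y_true = np.array([1.4, 0.2, 2.1, 1.8, 0.3])
y_pred = np.array([1.5, 0.4, 1.9, 1.7, 0.2])

residuals = y_true - y_pred

mae = np.mean(np.abs(residuals))   # Mean Absolute Error
mse = np.mean(residuals ** 2)      # Mean Squared Error
rmse = np.sqrt(mse)                # Root Mean Squared Error

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, mse, rmse, r2)
```

scikit-learn provides the same quantities as `mean_absolute_error`, `mean_squared_error`, and `r2_score` if you prefer library functions.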
Moving to classification
Going beyond log-loss
|  | Model Predicted as Yes | Model Predicted as No |
|---|---|---|
| True value is Yes | A | B |
| True value is No | C | D |
True Positives = A
True Negatives = D
False Positives = C
False Negatives = B
|  | Model Predicted as Yes | Model Predicted as No |
|---|---|---|
| True value is Yes | A | B |
| True value is No | C | D |
Total sample size = A + B + C + D
Total cases = A + B
Total not-cases = C + D
Prevalence = total cases / total sample size
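These totals follow directly from the four cells; a minimal sketch using the hypothetical counts from the worked example below:

```python
# Confusion-matrix cells (hypothetical counts):
# A = TP, B = FN, C = FP, D = TN
A, B, C, D = 80, 30, 20, 40

total_sample_size = A + B + C + D
total_cases = A + B          # everyone whose true value is Yes
total_not_cases = C + D      # everyone whose true value is No
prevalence = total_cases / total_sample_size

print(total_sample_size, total_cases, total_not_cases, round(prevalence, 3))
```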
|  | Model = 1 | Model = 0 |
|---|---|---|
| True = 1 | 80 | 30 |
| True = 0 | 20 | 40 |
Precision = Positive Predictive Value = A / (A + C) = 80 / 100 = 0.80
Recall = True Positive Rate = A / (A + B) = 80 / 110 ≈ 0.73
Specificity = True Negative Rate = D / (C + D) = 40 / 60 ≈ 0.67
|  | Model = 1 | Model = 0 |
|---|---|---|
| True = 1 | 80 | 30 |
| True = 0 | 20 | 40 |
F1 score = 2 × (precision × recall) / (precision + recall) = 2 × (0.80 × 0.73) / (0.80 + 0.73) ≈ 0.76
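The table above can be turned into these metrics in a few lines; a minimal sketch using the same hypothetical counts:

```python
# Confusion-matrix cells from the example table (hypothetical counts).
A = 80  # True Positives:  true = 1, model = 1
B = 30  # False Negatives: true = 1, model = 0
C = 20  # False Positives: true = 0, model = 1
D = 40  # True Negatives:  true = 0, model = 0

precision = A / (A + C)       # Positive Predictive Value
recall = A / (A + B)          # Sensitivity / True Positive Rate
specificity = D / (C + D)     # True Negative Rate
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"specificity={specificity:.2f} f1={f1:.2f}")
```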
An ROC curve showing the classification performance of different blood infection markers for predicting serious bacterial infection (SBI) in febrile infants. Milcent K, Faesch S, Gras-Le Guen C, et al. Use of Procalcitonin Assays to Predict Serious Bacterial Infection in Young Febrile Infants. JAMA Pediatr. 2016;170(1):62–69. doi:10.1001/jamapediatrics.2015.3210
The Youden index (J = sensitivity + specificity − 1) identifies the cut-off on the ROC curve that best balances sensitivity and specificity.
TODO find example image of ROC with this annotated.
Putting it all together
Example 2: Future information (i.e., data that would not be available at prediction time) leaks into the training dataset.
TODO clarify examples with clinical information
NOTE: Data leakage can sometimes be very difficult to detect.
Example of data leakage in Chest X-Ray (CXR) data:
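One common form of leakage in CXR datasets is splitting by image rather than by patient: the same patient's images land in both train and test, so the model can memorise the patient rather than the pathology. A minimal sketch of a patient-level split (record names and counts are hypothetical):

```python
import random

# Hypothetical records: (image_id, patient_id), several images per patient.
records = [(f"img_{i}", f"patient_{i % 4}") for i in range(12)]

# Leaky split: shuffling *images* lets one patient's scans appear on
# both sides. Safe split: assign whole *patients* to one side.
patients = sorted({pid for _, pid in records})
rng = random.Random(0)
rng.shuffle(patients)
test_patients = set(patients[: len(patients) // 2])

train = [r for r in records if r[1] not in test_patients]
test = [r for r in records if r[1] in test_patients]

# No patient appears on both sides.
assert not ({p for _, p in train} & {p for _, p in test})
```

scikit-learn's `GroupShuffleSplit` implements the same idea when the group (here, patient ID) is passed explicitly.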
Practical and Ethical Considerations
The ethics of AI/ML are complex:
AI/ML don’t always get it right, but neither do humans
Figure 1
NHS app and deprivation: See this study
In ML, consider:
Different OSs parse data differently
Reference: Bhandari Neupane et al., “Characterization of Leptazolines A–D” (2019)
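In the case referenced above, a processing script produced different results on different operating systems because it relied on the order in which the OS returned files. `os.listdir` makes no ordering guarantee, so sort explicitly; a minimal sketch:

```python
import os
import tempfile

# Create a few files whose processing order matters.
d = tempfile.mkdtemp()
for name in ("b.txt", "a.txt", "c.txt"):
    open(os.path.join(d, name), "w").close()

# os.listdir's order can differ across operating systems and file
# systems. Sorting makes the pipeline deterministic everywhere.
files = sorted(os.listdir(d))
print(files)  # ['a.txt', 'b.txt', 'c.txt'] on any OS
```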
Revisiting the scenarios
You are a GP.
You use an AI-based clinical decision support (CDS) tool to help manage patients with “breast lump” presentations.
The tool suggests options like invasive investigation, imaging, or watchful waiting.
A 17-year-old male patient presents with a “breast lump.”
Question: Should you trust the model’s recommendations for this patient?
You are a clinical lead for your organization.
Your goal is to improve pathways for acutely unwell patients.
You have a 100,000 GBP budget.
Two options are presented:
Question: Which option would you choose?
Nurses:
AI System:
Conclusion