AI: Evaluating an ML Model

DAIM Team

Module 4: Part 2 Learning Objectives

  • Why evaluate a model, and understand the key metrics
  • Appreciate evaluation of regression tasks and transfer this to classification tasks
  • Understand the pros and cons of each metric
  • Understand the limitations of each metric
  • More specifically, understand
    • mean
    • variance
    • standard deviation
    • standard error
    • model deviance

You fitted a model - now what?

  • What do you do next?
    • Depends on what you are building it for.
  • Consider
    • how do we get people to use this model
    • how do we generalise the findings to other datasets
    • how do we get it approved by peers and medico-legally

How to evaluate a machine learning model

Complexity of the Model

  • In ML, you need to trade off between bias and variance:
    • Very biased model = assuming everyone is sick
    • High-variance model = responding to every bit of noise
  • This is the difference between underfitting and overfitting

Overfitting

  • The more parameters a model has, the more information it can represent
  • However, this can lead to overfitting and inefficiency

Evaluate Your Model: How?

  • Validation is key to model reliability:
    • Validation set = marking your own homework!
    • Test dataset = using a separate, fresh dataset
  • Beware of data leakage
    • We will return to this later…

Task

You are a botanist. You are interested in iris flowers and, in particular, their petal width.

  • You want to build a model that can predict petal width.
  • Your helpful colleagues have collected some data.

Theoretical: Why evaluate models?

  • Models are not truth.
    • They approximate the real world.
  • As a good scientist, you should have a healthy skepticism of your hypotheses
  • By extension, you should be skeptical of the models you have fitted.

Null hypothesis and Model Evaluation: parallels

  • Hypothesis testing: is our finding ‘by chance’?
    • Null hypothesis: first assume there is no effect.
  • Similarly, we will start by evaluating a naive model.
    • We will compare our models against naive models.
  • This allows us to generalise our model findings.

Why is generalisation important?

  • Generalisation is efficient
    • Same for your own brain and for your code
    • generalisation allows for wider applicability
    • e.g., dog = a four-legged creature with a tail that likes to bark
    • means you can transfer to all instances of dogs, not just your pet
  • Saves you memorising class animalDog vs. Dog0, Dog1, Dog2, etc…

Generalisation in models

  • This means your model has learned patterns rather than memorised the training data.
  • This means your model findings are applicable in other settings.
  • It works on external data that it has not seen.

A good model?

  • A well-performing model in one setting may not work in others.
    • e.g. screening test for cancer vs. confirmation test.
  • A good model is generalisable and thus applicable.

Different types of prediction

Refresher: regression vs. classification

  • Regression
    • How do multiple independent variables affect a dependent variable?
  • Classification
    • Does an outcome fit into a certain class or not?
    • Y is a binary variable, i.e., 0 or 1.

Theory: Residuals

  • The difference between the true value and the predicted value is called the residual.
    • Residuals can be positive or negative.
  • Here the dotted lines are residuals.

Theory: Residuals

  • Intuitively, the worse your model is, the bigger your residuals will be.
    • What is the worst model?

The naive model

  • The worst model is your naive model.
    • Predicting the mean at every value gives the largest residuals.
    • This is your null model.
  • For regression, you want a model with the smallest residuals.
    • But how small? (See the sketch below.)

TODO Diagram of a straight line with the regression points in 2D
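A minimal sketch of the null model's residuals in Python (using NumPy and illustrative petal-width values, not the actual workshop data):

```python
import numpy as np

# Illustrative petal widths (cm) - placeholder values, not the workshop dataset
y = np.array([0.2, 1.3, 1.5, 1.8, 2.1, 2.5])

# Naive (null) model: predict the mean for every observation
y_naive = np.full_like(y, y.mean())

# Residuals = true value - predicted value (positive or negative)
residuals = y - y_naive
print(residuals)
print(np.abs(residuals).sum())  # total size of the null model's errors
```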

Saturated Model

  • What is a saturated model?
  • A theoretical model with as many parameters as there are data points.
  • e.g. you have 5 data points and fit a model with 5 parameters.
    • Using all 5 parameters, one per point, gives a saturated model.

TODO Diagram of a wiggly line which goes through every point in 2D

Saturated vs null models

  • Worst (null) model (residuals = max) → very biased
  • “Saturated” model (residuals = 0) → very high variance.
    • Why do you not want a saturated model?

Overfitting

  • A saturated model is overfitted.
  • Therefore, you want a balance between a naive model and saturated model.

Comparing the models

How can we quantify how good a prediction is?

Metrics

  • We must use a metric for evaluating the model’s predictions.

Absolute Error

  1. Absolute Error, where \(y_i\) is the actual value and \(y_p\) is the predicted value.

\[ AE = |y_i - y_p| \]

Mean absolute error (MAE)

  • Divide the sum of the absolute differences between each prediction and the actual value by the number of predictions.

\[ MAE = \frac{1}{n} * \sum_{i=1}^{n}|y_i - y_p| \]

Mean squared error

\[ MSE = \frac{1}{n} * \sum_{i=1}^{n}(y_i - y_p)^2 \]

  • Take each error and square it (why?)
  • Sum over all data points from i = 1 to n
  • Then divide by the sample size (see the sketch below)

Root mean squared error

\[ RMSE = \sqrt{MSE} \]

  • TODO add more explanation about squaring/unsquaring if we are going to include this level of detail
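A minimal sketch of these three metrics in Python, assuming NumPy arrays of true and predicted values (the numbers are illustrative, not from the workshop data):

```python
import numpy as np

y_true = np.array([0.2, 1.3, 1.5, 1.8, 2.1, 2.5])   # illustrative true values
y_pred = np.array([0.4, 1.1, 1.6, 1.7, 2.3, 2.2])   # hypothetical predictions

errors = y_true - y_pred

mae = np.mean(np.abs(errors))    # mean absolute error
mse = np.mean(errors ** 2)       # mean squared error
rmse = np.sqrt(mse)              # root mean squared error, back on the original scale

print(f"MAE={mae:.3f}, MSE={mse:.3f}, RMSE={rmse:.3f}")
```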

Summary: Regression metrics

  • Mean Absolute Error = mean absolute differences

  • Mean Squared Error = measures the variance of the residuals.

  • Root Mean Squared Error = measures standard deviation of the residuals

  • R-squared (value between 0 and 1) = how much variation in y is explained by your model.

  • Adjusted R-squared = R^2 increases as more predictor variables are added (i.e., it can reward overfitting); adjusted R^2 penalises the model for having more predictors (see the sketch below).
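A minimal sketch of R-squared and adjusted R-squared, assuming n observations and p predictor variables (the function names are our own, for illustration):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    """Penalise R^2 for the number of predictors p, given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)
```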

Moving to classification

Moving to classification

  • Residuals → Deviance
    • Residuals are specific to the case where y is continuous.
    • Deviance is a generalisation of the residual.
  • Deviance quantifies the difference between model and observations.

Moving to classification

  • In classification problems, the ground truth is 0 or 1.
    • Pneumonia / no pneumonia
  • The model will produce a probability that the example belongs to a class
    • 0.2 - low probability of pneumonia
  • How can we evaluate a classification model with the metrics we have discussed?

Log Loss

  • Log loss is roughly the classification equivalent of MSE.
  • Log loss is linked to the likelihood function.
    • Log loss = −1 × the log of the likelihood function, averaged over observations (TODO add explanation for this)
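For a binary outcome \(y_i \in \{0, 1\}\) with predicted probability \(p_i\), one standard way of writing log loss is:

\[ LogLoss = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right] \]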

Likelihood

Likelihood, Logloss, Deviance

  • When predicting outcomes with a classifier, we want to minimise deviance
  • Deviance = 2 × n × log loss
    • Thus, minimising log loss = minimising deviance (see the worked relation below).
    • Maximising the likelihood = the statistics framing
    • Minimising log loss = the machine learning framing
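As a worked relation (taking the saturated model's log-likelihood as zero, which holds for observed 0/1 outcomes):

\[ LogLoss = -\frac{1}{n} \log L \quad \Rightarrow \quad Deviance = -2 \log L = 2 \, n \, LogLoss \]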

Overall metrics

  • In regression, we use MSE and RMSE
  • In classification, we use log loss
    • These are the most important metrics.
  • We used log-loss as our loss function to train our model in the last workshop.
    • The log-loss shows how incorrect the predictions are.

Extending to Crossentropy Loss

  • We have been discussing classification with a single binary outcome (cat vs dog)
    • This is binary classification
  • How about multiple classes?
    • e.g. car vs cat vs dog.
  • Here we generalise log loss → cross-entropy loss (see the sketch below).
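A minimal sketch of cross-entropy for multiple classes (NumPy, one-hot labels; the three classes and the numbers are made up for illustration):

```python
import numpy as np

def cross_entropy(y_true_onehot, y_prob):
    """Mean cross-entropy over n examples.

    y_true_onehot: (n, k) one-hot true labels, e.g. car / cat / dog
    y_prob:        (n, k) predicted class probabilities (each row sums to 1)
    """
    eps = 1e-12                               # avoid taking log(0)
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(np.sum(y_true_onehot * np.log(y_prob), axis=1))

# Hypothetical 3-class example: columns = [car, cat, dog]
y_true = np.array([[0, 1, 0], [0, 0, 1]])                 # cat, dog
y_prob = np.array([[0.2, 0.7, 0.1], [0.1, 0.2, 0.7]])
print(cross_entropy(y_true, y_prob))
```

With two classes, this reduces to the binary log loss written out earlier.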

Break!

Going beyond log-loss

Discrete metrics

  • Models will output a classification probability
    • Probabilities are continuous (e.g. 0.2, 0.92)
  • To make a prediction (pneumonia/no pneumonia), we need to set a threshold
    • Where we set the threshold will affect the classification performance
  • Usually this defaults to 0.5 (see the sketch below)
    • This is what we see when Keras shows us the accuracy of our model
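A minimal sketch of thresholding in Python (the probabilities are hypothetical):

```python
import numpy as np

probs = np.array([0.05, 0.2, 0.55, 0.92])   # hypothetical model output probabilities

predictions = (probs >= 0.5).astype(int)    # the usual default threshold
print(predictions)                          # [0 0 1 1]

# A lower threshold flags more positives (at the cost of more false alarms)
print((probs >= 0.1).astype(int))           # [0 1 1 1]
```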

Confusion Matrix

                     Model Predicted as Yes    Model Predicted as No
True value is Yes    A                         B
True value is No     C                         D

True Positives = A

True Negatives = D

False Positives = C

False Negatives = B
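A minimal sketch of extracting these cells with scikit-learn (assuming 0/1 ground-truth labels and thresholded predictions; the values are made up):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # hypothetical ground truth
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]   # hypothetical thresholded predictions

# For labels [0, 1], scikit-learn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```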

Confusion Matrix

                     Model Predicted as Yes    Model Predicted as No
True value is Yes    A                         B
True value is No     C                         D

Total sample size = A + B + C + D

Total cases = A + B

Total not-cases = C + D

Prevalence = total cases / total sample size

Exercise

           Model = 1    Model = 0
True = 1   80           30
True = 0   20           40
  • Work out the following:
    • Total Sample size, Total Cases, Prevalence
    • Accuracy

There are more metrics beyond accuracy…

  • What are the limitations with accuracy?
  • Other metrics give us a more holistic assessment of how the classifier works:
    • Precision, specificity, recall/sensitivity (see the sketch below)
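A minimal sketch of these metrics in terms of the A/B/C/D cells from the confusion matrix slides (the function name is our own, for illustration):

```python
def classification_metrics(A, B, C, D):
    """Metrics from the confusion matrix cells used in these slides.

    A = true positives, B = false negatives,
    C = false positives, D = true negatives.
    """
    accuracy = (A + D) / (A + B + C + D)
    precision = A / (A + C)        # positive predictive value
    recall = A / (A + B)           # sensitivity / true positive rate
    specificity = D / (C + D)      # true negative rate
    return accuracy, precision, recall, specificity
```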

Exercise - going further

           Model = 1    Model = 0
True = 1   80           30
True = 0   20           40

Precision = Positive Predictive Value =

Recall = True Positive Rate =

Specificity = True Negative Rate =

Balancing evaluation

  • The F1 score
    • The harmonic mean of precision and recall (written out below).
    • Nothing to do with cars or UK resident doctors.
  • Gives an idea of the overall predictive performance of the classifier.
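Written out, with precision \(P\) and recall \(R\):

\[ F_1 = \frac{2 \, P \, R}{P + R} \]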

Exercise - going even further

           Model = 1    Model = 0
True = 1   80           30
True = 0   20           40

F1 score = 2 × (precision × recall) / (precision + recall) =

What happens if we change the threshold?

  • As discussed earlier, many models predict probabilities which are continuous values
    • Classification relies on discrete, binary predictions (Yes/No)

What happens if we change the threshold?

  • How can we evaluate how the prediction of the model changes as we modify the threshold for detection?
  • This works for clinical metrics too - what level of procalcitonin best predicts bacterial infection?

Area under Curve

  • Sensitivity and specificity are a trade-off.
  • If you plot sensitivity against 1 − specificity across different thresholds, you generate a receiver operating characteristic (ROC) curve.
  • The area under this curve is the AUC (see the sketch below).
  • 0.5 indicates random guessing, and 1.0 is perfect classification performance.
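A minimal sketch with scikit-learn (hypothetical labels and probabilities; roc_curve returns the false positive rate, i.e. 1 − specificity):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])                     # hypothetical labels
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.5])    # model probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_prob)   # points on the ROC curve
print(roc_auc_score(y_true, y_prob))               # 0.5 = chance, 1.0 = perfect
```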

Example of AUC curve

An ROC curve showing the classification performance of different blood infection markers for predicting serious bacterial infection (SBI) in febrile infants. Milcent K, Faesch S, Gras-Le Guen C, et al. Use of Procalcitonin Assays to Predict Serious Bacterial Infection in Young Febrile Infants. JAMA Pediatr. 2016;170(1):62–69. doi:10.1001/jamapediatrics.2015.3210

Youden index

  • The Youden index (J = sensitivity + specificity − 1) gives the optimal cut-off for balancing sensitivity and specificity (a small computation sketch follows below).

  • TODO find example image of ROC with this annotated.
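A minimal computation sketch, reusing the same hypothetical arrays as the ROC example above:

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])                     # hypothetical labels
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.5])    # model probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_prob)

j = tpr - fpr                                  # Youden's J at each threshold
best_threshold = thresholds[np.argmax(j)]      # cut-off balancing sens. and spec.
print(best_threshold)
```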

Putting it all together

Evaluating models

  • You now know how to fit models
  • You now know how to evaluate the models
  • What do you evaluate it on?
    • The gold standard is an external data set.
  • Short of that:
    • we will split the data into training and test sets (see the sketch below).
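A minimal sketch of a train-test split with scikit-learn (placeholder data, not the workshop dataset):

```python
from sklearn.model_selection import train_test_split

# X = feature matrix, y = outcomes (placeholders for illustration)
X = [[0.2], [1.3], [1.5], [1.8], [2.1], [2.5], [0.4], [1.9]]
y = [0, 0, 1, 1, 1, 1, 0, 1]

# Hold out 25% as the unseen test set; fix the random seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
```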

Train-Test Split

  • The training and test datasets should receive the same treatments (preprocessing).
  • The test set is unseen data for your model.
  • You will report performance on the test set.
  • The test set most closely resembles an external independent dataset.
    • But beware of data leakage…

What is data leakage?

  • Data leakage is when information your model should not have (e.g. from the test set or the future) slips into training, giving ridiculously good results.
  • It is a false discovery.
  • Data leakage can happen for many reasons.

Examples of data leakage

  • Example 1
    • You scale the training and test data using the max values computed from the entire dataset.
    • Information about the test set leaks into the training process, biasing the evaluation.
  • TODO clarify examples with clinical information

Examples of data leakage

  • Example 2: Future information is somehow leaked in a training dataset.

    • Let's say you are building a model to predict whether a patient will receive surgery.
    • Let's say all sick patients who got antibiotics went on to have surgery.
    • If you include antibiotics as a feature in your training dataset, it will cause leakage.
  • TODO clarify examples with clinical information

  • NOTE: Data leakage can sometimes be very difficult to detect.

Understanding Data Leakage

Example of data leakage in Chest X-Ray (CXR) data:

  • Suppose a model is trained to detect pneumonia on resus scans
    • Sicker patients often get AP films and are placed in resus
    • The model might “cheat” by detecting resus environment instead of pneumonia itself
    • Real scans may not have the AP sticker, and the model will fail in a real-world environment

Assuming no leakage

  • The test set is untouched.
  • The test set should be split off right at the beginning.
  • Data transformation / engineering processes should be encapsulated in a reproducible manner, fitted on the training set only, and then applied separately to the train and test sets (see the sketch below).
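A minimal sketch of leakage-safe preprocessing with scikit-learn: the scaler is fitted on the training data only, then the same fitted transformation is applied to both sets (placeholder data for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Placeholder data; split *before* any preprocessing
X = np.array([[0.2], [1.3], [1.5], [1.8], [2.1], [2.5], [0.4], [1.9]])
y = np.array([0, 0, 1, 1, 1, 1, 0, 1])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Fit the min/max scaling on the training set only...
scaler = MinMaxScaler().fit(X_train)

# ...then apply the same fitted transformation to train and test sets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```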

Common pitfalls

  • TODO Common pitfalls
    • TODO How to deal with overfitting. (LL2)
    • TODO How to deal with multicollinearity (LL2)
    • TODO How to deal with missing data and sensitivity analysis (LL2)

Break!

Extending the workshop task

  • TODO Appreciate that the task we are performing is fairly simple and that more complex techniques exist (LL2)

Other Forms of Machine Learning

  • TODO Be able to discuss the uses, advantages, and disadvantages of more modern approaches to machine learning (stable diffusion, large language models) (LL2) (LL4)

Break!

Practical and Ethical Considerations

Scenario 1

  • You are a GP.
  • You use an AI-based clinical decision support (CDS) tool to help manage patients with “breast lump” presentations.
  • The tool helps decide amongst options: invasive investigation, imaging, or watchful waiting.
  • A 17-year-old male patient presents with a “breast lump.”
  • Should you trust the model recommendations for this patient?

Scenario 2

  • You are a clinical lead for your organization.
  • You want to improve pathways for acutely unwell patients.
  • You have a £100,000 budget for this.
  • Your team presents two options:
    • Option 1: Recruit two specialist nurses
    • Option 2: Deploy an AI-based CDS and risk management system

Gold Standards in Medical AI

Ethics of AI

  • The ethics of AI/ML are complex:

    • Should generative AI models (like ChatGPT) be allowed in research writing?
    • Should clinical judgments factor in AI recommendations?
  • AI/ML don’t always get it right, but neither do humans

    • Does AI/ML need to be better than humans?
    • Which humans? FY1 or an average consultant?

Examples of ‘Bad’ AI

  • Read the abstract of this paper (Wang 2018)
  • Is this an ethical use of image processing machine learning systems?

How Do We Do Good and Stop Bad AI?

  • Recognize inherent biases and discrimination
  • Assume bias is present, then find and address it
  • Maximize individual autonomy and privacy
  • Reproducibility and transparency are crucial

Even if AI is Good?

Figure 1: NHS app and deprivation (see this study)

Reproducibility: An Analogy

Figure 2: Kitchen and recipe (panels a and b)

Reproducibility: Important Questions

  • What type of oven?
  • What mode: fan or gas?
  • Which oil: olive oil or rapeseed?
  • Type of milk?

In ML, consider:

  • Technical details like Python version and framework
  • Clinical details like film projection and sample types

Real-Life Example

Different OSs parse data differently
Reference: Bhandari Neupane et al., “Characterization of Leptazolines A–D” (2019)

Reproducibility Analogy

How to Make Research Reproducible

  • Aim for reproducibility, even if it is challenging at first
  • Visit RAPS with R for reproducibility principles

Revisiting the scenarios

Scenario 1 Revisited

  • You are a GP.

  • You use an AI-based clinical decision support (CDS) tool to help manage patients with “breast lump” presentations.

  • The tool suggests options like invasive investigation, imaging, or watchful waiting.

  • A 17-year-old male patient presents with a “breast lump.”

  • Question: Should you trust the model’s recommendations for this patient?

Scenario 1: Critical Questions

  • Was the AI CDS trained on both male and female images?
  • Were patients under 18 included in the training set?
  • Is the model optimized for sensitivity or specificity?
  • What does your clinical judgment suggest?
  • What are the local governance policies regarding AI use?

Scenario 2 Revisited

  • You are a clinical lead for your organization.

  • Your goal is to improve pathways for acutely unwell patients.

  • You have a £100,000 budget.

  • Two options are presented:

    • Option 1: Recruit two specialist nurses
    • Option 2: Deploy an AI-based CDS and risk management system
  • Question: Which option would you choose?

Scenario 2: Considerations

  • Nurses:

    • Would they provide versatility and fill in during staff shortages?
    • Would they be happy in their roles, or might they leave?
  • AI System:

    • Does your organization have the digital infrastructure for AI?
    • Are clinical pathways aligned with AI integration?
    • If AI flags deterioration, can your pathways respond effectively?
    • AI offers scalability and continuous availability, but what if it fails?

Conclusion

What we have covered

  • Today, we have covered:
    • The basics of regression and classification
    • Comprehensively evaluating binary classifiers
    • Ethical and practical considerations of AI in clinical settings
  • We will delve into these further in the second part of the workshop.

Further Reading and Resources

Discussion and Q&A

  • Let’s discuss any remaining questions on ML/AI in clinical applications!