Slides and other materials are available at
https://tinyurl.com/UseR2022

Part 1: Introduction

About the team

Meet your guide to ResponsibleML



More about MI²: https://mi2.ai/.

Agenda

Feel free to post any questions in the chat during the workshop. From time to time we will address these questions.

  • 16:00 About the team

  • 16:10 Agenda + motivation

  • 16:20 EDA

  • 16:30 Let’s train some models

  • 16:40 Evaluate performance + examples

  • 16:50 Do it yourself

  • 17:00 XAI pyramid - introduction

  • 17:10 Permutational Variable Importance + examples

  • 17:20 Do it yourself

  • 17:30 BREAK

  • 18:00 Break-down + examples

  • 18:10 SHAP + examples

  • 18:20 Do it yourself

  • 18:30 Ceteris Paribus + examples

  • 18:40 Partial Dependence Profile + examples

  • 18:50 Do it yourself

  • 19:00 modelStudio + examples

  • 19:10 Do it yourself

  • 19:20 Closing remarks

Design Principles

This workshop presents a set of methods for the exploration of complex predictive models. We assume that participants are familiar with R and have some basic knowledge of predictive models; the focus of the workshop is on how to explore such models.

The workshop consists of 1/3 lecture, 1/3 code examples discussed by the tutor and 1/3 computer-based exercises for participants.

As the group is large, some participants may run into problems with the tasks. If this happens to you, please write about it in the chat.

It may also happen that you finish the tasks much earlier than others. In such a situation, it would be great if you could help those who have questions.

Materials

The workshop is based on material from a comic book on Responsible Machine Learning. The book can be read online and is also available in paperback on Amazon and Lulu.

Part 1: Introduction to predictive modelling + EDA

Get prepared. Install the following packages.

install.packages(c("tableone", "DALEX", "ggplot2", "ggmosaic", "partykit", "ranger", "rms"))

The purpose of this tutorial is to present techniques for model exploration, visualisation and explanation. To do this we will use some interesting real-world data, train a few models on the data and then use XAI (eXplainable artificial intelligence) techniques to explore these models. Along the way, we will tackle various interesting topics such as model training, model verification, model visualisation, model comparison and exploratory model analysis.

We assume that users have some basic knowledge about predictive modelling, so we can focus on model exploration. If you want to learn more about predictive models, you will find an in-depth introduction in An Introduction to Statistical Learning: with Applications in R. In this tutorial we present only the basics of explanatory model analysis; if you are looking for details, you will find them in Explanatory Model Analysis.

Why should I care?

Predictive models have been used throughout human history. Priests in ancient Egypt predicted when the Nile would flood or when a solar eclipse would come. Developments in statistics, the increasing availability of datasets, and growing computing power allow predictive models to be built faster and faster.

Today, predictive models are used virtually everywhere: planning the supply chain of a large corporation, recommending lunch or a movie for the evening, or predicting traffic jams in a city. Newspapers are full of interesting applications.

But how are such predictive models developed? In the following sections, we will go through the life cycle of a predictive model: from the concept phase, through design, training, and checking, to deployment. For this example, we will use a data set on the risk of death for Covid-19 patients after SARS-CoV-2 infection. But keep in mind that the data presented here is artificial. It is generated to mirror relations in real data but does not contain observations of real patients. Still, it is an interesting use case for discussing the typical life cycle of a predictive model.

Tools

In this tutorial we will work with three types of models: logistic regression with splines, implemented in the rms package; a simple decision tree, implemented in the partykit package; and a random forest, implemented in the ranger package.

Models will be explained and visualized with the DALEX package. Note that there are other packages with similar functionality: for modelling, other popular choices include mlr, tidymodels and caret, while for model explanation you will find many interesting features in flashlight and iml.
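
As a quick preview of the interface used throughout the workshop, the sketch below wraps a hastily trained model in a DALEX explainer. It uses the titanic_imputed dataset shipped with DALEX and a plain logistic regression rather than the covid data and models discussed later, so treat it only as an illustration of the explain() call, not as part of the workshop pipeline.

library("DALEX")

# a quick logistic regression on the titanic_imputed data shipped with DALEX
model_glm <- glm(survived ~ ., data = titanic_imputed, family = "binomial")

# wrap the model in an explainer: validation data, true labels, human-readable label
explainer_glm <- explain(model_glm,
                         data  = titanic_imputed[, colnames(titanic_imputed) != "survived"],
                         y     = titanic_imputed$survived,
                         label = "logistic regression (preview)")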

The problem

The life cycle of a predictive model begins with a well-defined problem. In this example, we are looking for a model that assesses the risk of death after a covid diagnosis. We do not want to guess who will survive and who will not; we want to construct a score that allows us to sort patients by their risk of death.

Why do we need such a model? It could have many applications! Those at higher risk of death could be given more protection, for example by providing them with pulse oximeters or by vaccinating them first.

Load packages

library("tableone")
library("ggplot2")
library("partykit")
library("ranger")

set.seed(1313)

Conception

Before we build any model, even before we touch any data, we should first determine the purpose for which the predictive model will be built.

It is very important to define the objective before we sit down to programming, because later it is easy to get lost in setting function parameters and dealing with all the technical details, and to lose sight of the long-term goal.

So, first: Define the objective.

For these exercises, we have selected data on the covid pandemic. Imagine that we want to determine the order of vaccination. In this example, we want to create a predictive model that assesses individual risk, because we would like to rank patients according to their risk.

To get a model that gives the best ranking, we will use the AUC measure to evaluate model performance. We will explain exactly what the AUC is a little later; for now, the key point is that we are interested in ranking patients based on their risk score.
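
As a quick illustration, the AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The toy sketch below (with made-up scores and labels, not the workshop data) computes it through the rank-sum (Mann-Whitney) formula.

# toy sketch: AUC via the rank-sum formula, on hypothetical made-up data
scores <- c(0.9, 0.3, 0.7, 0.4, 0.1)   # hypothetical risk scores
labels <- c(1, 0, 0, 1, 0)             # 1 = event (death), 0 = no event
n_pos  <- sum(labels == 1)
n_neg  <- sum(labels == 0)
auc <- (sum(rank(scores)[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
auc
## [1] 0.8333333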

Read the data

To build a model we need good data. In Machine Learning, the word good means a large amount of representative data. Collecting representative data is not easy and often requires designing an appropriate experiment.

The best possible scenario is that one can design and run an experiment to collect the necessary data. In less comfortable situations, we look for “natural experiments,” i.e., data that have been collected for another purpose but can be used to build a model. Here we will use data collected through epidemiological interviews. There are a lot of data points and they should be fairly representative, although unfortunately the data only covers symptomatic patients who tested positive for SARS-CoV-2.

For this exercise, we have prepared two datasets with characteristics of patients infected with covid. It is important to note that these are not real patient data. The data is simulated, generated to have relationships consistent with real data (obtained from NIH), but the records themselves are not real. Fortunately, this is sufficient for our exercise.

The data is divided into two sets, covid_spring and covid_summer. The first was collected in spring 2020 and will be used as training data, while the second was collected in summer 2020 and will be used for validation. In machine learning, model validation is performed on a separate data set; this controls the risk of overfitting a flexible model to the data. If we do not have a separate set, it is created using cross-validation, out-of-sample or out-of-time techniques (a minimal sketch of such a split follows the list below).

  • covid_spring corresponds to covid mortality data from spring 2020. We will use this data for model training.
  • covid_summer corresponds to covid mortality data from summer 2020. We will use this data for model validation.
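
If we only had covid_spring, a validation set could be carved out of it. The sketch below is hypothetical and not part of the workshop pipeline; it shows a simple random holdout split and fold assignments for 5-fold cross-validation.

# hypothetical sketch: creating a validation split when no second dataset exists
library("DALEX")          # provides covid_spring (also loaded below)
set.seed(1313)

# random 70/30 holdout split
in_train  <- sample(seq_len(nrow(covid_spring)),
                    size = floor(0.7 * nrow(covid_spring)))
train_set <- covid_spring[in_train, ]
test_set  <- covid_spring[-in_train, ]

# fold assignments for 5-fold cross-validation
folds <- sample(rep(1:5, length.out = nrow(covid_spring)))
table(folds)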

Both datasets are available in the DALEX package.

library("DALEX")

head(covid_spring)
##   Gender Age Cardiovascular.Diseases Diabetes Neurological.Diseases
## 1   Male  29                      No       No                    No
## 2   Male  50                      No       No                    No
## 3   Male  39                      No       No                    No
## 4   Male  40                      No       No                    No
## 5   Male  53                      No       No                    No
## 6 Female  36                      No       No                    No
##   Kidney.Diseases Cancer Hospitalization Fever Cough Weakness Death
## 1              No     No              No    No    No       No    No
## 2              No     No              No   Yes   Yes      Yes    No
## 3              No     No              No    No    No       No    No
## 4              No     No              No    No    No       No    No
## 5              No     No              No   Yes   Yes      Yes    No
## 6              No     No              No    No    No       No    No
head(covid_summer)
##   Gender Age Cardiovascular.Diseases Diabetes Neurological.Diseases
## 1 Female  57                      No       No                    No
## 2   Male  34                      No       No                    No
## 3   Male  73                      No       No                    No
## 4 Female  48                     Yes       No                    No
## 5   Male  29                      No       No                    No
## 6   Male  54                      No       No                    No
##   Kidney.Diseases Cancer Hospitalization Fever Cough Weakness Death
## 1              No     No             Yes   Yes   Yes       No    No
## 2              No     No              No    No    No       No    No
## 3              No     No             Yes    No    No       No   Yes
## 4              No     No              No   Yes    No       No    No
## 5              No     No             Yes   Yes    No      Yes    No
## 6              No     No             Yes   Yes   Yes       No    No

Explore the data

Before we start any serious modelling, it is worth looking at the data first. To do this, we will perform a simple exploratory data analysis (EDA). In R there are many tools for data exploration; we particularly value packages that support the so-called “table one”.

library("tableone")

table1 <- CreateTableOne(vars = colnames(covid_spring)[1:11],
                         data = covid_spring,
                         strata = "Death")
print(table1)
##                                    Stratified by Death
##                                     No            Yes           p      test
##   n                                  9487           513                    
##   Gender = Male (%)                  4554 (48.0)    271 (52.8)   0.037     
##   Age (mean (SD))                   44.19 (18.32) 74.44 (13.27) <0.001     
##   Cardiovascular.Diseases = Yes (%)   839 ( 8.8)    273 (53.2)  <0.001     
##   Diabetes = Yes (%)                  260 ( 2.7)     78 (15.2)  <0.001     
##   Neurological.Diseases = Yes (%)     127 ( 1.3)     57 (11.1)  <0.001     
##   Kidney.Diseases = Yes (%)           111 ( 1.2)     62 (12.1)  <0.001     
##   Cancer = Yes (%)                    158 ( 1.7)     68 (13.3)  <0.001     
##   Hospitalization = Yes (%)          2344 (24.7)    481 (93.8)  <0.001     
##   Fever = Yes (%)                    3314 (34.9)    335 (65.3)  <0.001     
##   Cough = Yes (%)                    3062 (32.3)    253 (49.3)  <0.001     
##   Weakness = Yes (%)                 2282 (24.1)    196 (38.2)  <0.001

During modelling, the part related to exploration often takes the most time. In this case, we will limit ourselves to some simple graphs.

ggplot(covid_spring, aes(Age)) +
  geom_histogram() +
  ggtitle("Histogram of age")

ggplot(covid_spring, aes(Age, fill = Death)) +
  geom_histogram(color = "white") +
  ggtitle("Histogram of age") + 
  DALEX::theme_ema() +
  scale_fill_manual("", values = c("grey", "red3"))

library("ggmosaic")
ggplot(data = covid_spring) + 
  geom_mosaic(aes(x=product(Diabetes), fill = Death)) + 
  DALEX::theme_ema() +
  scale_fill_manual("", values = c("grey", "red3"))

Transform the data

One of the most important rules to remember when building a predictive model is: Do not condition on the future!

Variables like Hospitalization or Cough are not good predictors, because they are not known in advance.

# keep only characteristics known at the time of diagnosis, plus the target Death
covid_spring <- covid_spring[,c("Gender", "Age", "Cardiovascular.Diseases", "Diabetes",
               "Neurological.Diseases", "Kidney.Diseases", "Cancer",
               "Death")]
covid_summer <- covid_summer[,c("Gender", "Age", "Cardiovascular.Diseases", "Diabetes",
               "Neurological.Diseases", "Kidney.Diseases", "Cancer",
               "Death")]