To illustrate how auditor can be applied to regression problems, we will use the artificial dataset dragons, available in the DALEX2 package. Our goal is to predict the life length of dragons.
##   year_of_birth   height   weight scars colour year_of_discovery
## 1         -1291 59.40365 15.32391     7    red              1700
## 2          1589 46.21374 11.80819     5    red              1700
## 3          1528 49.17233 13.34482     6    red              1700
## 4          1645 48.29177 13.27427     5  green              1700
## 5            -8 49.99679 13.08757     1    red              1700
## 6           915 45.40876 11.48717     2    red              1700
##   number_of_lost_teeth life_length
## 1                   25   1368.4331
## 2                   28   1377.0474
## 3                   38   1603.9632
## 4                   33   1434.4222
## 5                   18    985.4905
## 6                   20    969.5682
Each analysis begins with the creation of a modelAudit object, which can then be used to audit a model.
In this section we give a short overview of the visual validation of model errors and propose several validation scores. Auditor helps to answer questions that may be crucial for further analyses:
Does the model fit the data? Is it missing any information?
Which model performs better?
How similar are the models?
In the following sections, we review the auditor functions for the analysis of model residuals. They are discussed in alphabetical order.
The auditor package provides two pipelines for auditing observation influence.
model %>% audit() %>% observationInfluence() %>% plot(type=…)
This pipeline is recommended. The observationInfluence() function creates an observationInfluence object, which may be passed to the plot() function with a specified plot type. This approach requires one additional function in the pipeline. However, once created, the observationInfluence object contains all the calculations that the plots require, so generating multiple plots is fast. This is useful because calculating Cook’s distances for models other than linear ones may take a long time.
Alternative: model %>% audit() %>% observationInfluence() %>% plotType()
model %>% audit() %>% plot(type=…)
This pipeline is shorter than the previous one and quicker to write. However, the calculations are carried out every time a plotting function is called.
Alternative: model %>% audit() %>% plotType()
The help pages of the plot[Type]() functions contain additional information about the plots.
In this vignette we use the first pipeline. First, we need to create a
##       cooks.dist label index
## 1744 0.016856928    lm  1744
## 1100 0.010160460    lm  1100
## 29   0.009422205    lm    29
## 1908 0.008938871    lm  1908
## 7    0.006788849    lm     7
## 826  0.006061949    lm   826
Some plots may require specified variable or fitted values for
Cook’s distance is used to estimate the influence of a single observation. It is a tool for identifying observations that may negatively affect the model.
Data points flagged by Cook’s distances are worth checking for validity. Cook’s distances may also be used to indicate regions of the design space where it would be good to obtain more observations.
Cook’s distances are calculated by removing the i-th observation from the data and refitting the model. Each distance shows how much all the fitted values of the model change when the i-th observation is removed.
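For a linear model, the definition above is commonly written in the following textbook form. It is given here only as a sketch: the symbols are introduced for illustration and are not taken from the auditor documentation.

$$
D_i = \frac{\sum_{j=1}^{n} \left( \hat{y}_j - \hat{y}_{j(i)} \right)^2}{p \, s^2}
$$

Here $n$ is the number of observations, $\hat{y}_j$ is the fitted value for observation $j$ from the model trained on all data, $\hat{y}_{j(i)}$ is the fitted value for observation $j$ from the model refitted without observation $i$, $p$ is the number of estimated coefficients, and $s^2$ is the residual variance estimate of the full model.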
For models of classes other than glm, the distances are computed directly from the definition, so this may take a while. In this example we will compute them for a linear model.
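To make the cost of the definition-based approach concrete, the sketch below refits a linear model once per observation and compares the result with the closed-form values from stats::cooks.distance(). It is not auditor's implementation: the helper name cooks_by_definition is hypothetical, and the built-in mtcars data is used instead of dragons so that the example needs only base R.

```r
# Hypothetical helper (not part of auditor): Cook's distances computed
# from the definition, refitting the model without each observation in turn.
cooks_by_definition <- function(formula, data) {
  full <- lm(formula, data = data)
  fitted_full <- fitted(full)
  p <- length(coef(full))                           # number of estimated coefficients
  s2 <- sum(residuals(full)^2) / df.residual(full)  # residual variance of the full model

  vapply(seq_len(nrow(data)), function(i) {
    # Refit without observation i, then predict for all observations
    reduced <- lm(formula, data = data[-i, , drop = FALSE])
    fitted_reduced <- predict(reduced, newdata = data)
    sum((fitted_full - fitted_reduced)^2) / (p * s2)
  }, numeric(1))
}

# Assumption: mtcars stands in for the dragons data
d_manual  <- cooks_by_definition(mpg ~ wt + hp, mtcars)
d_builtin <- cooks.distance(lm(mpg ~ wt + hp, data = mtcars))

# For a linear model both approaches should agree up to numerical tolerance
all.equal(unname(d_builtin), d_manual)
```

The loop performs one model fit per observation, which is why the definition-based route becomes slow for large datasets or for models that are expensive to train.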