Mapping explanation - Python toolchain for spatial interpretative machine learning - Trump vs. Clinton 2016 case study¶

Topics¶

  • Data loading and browsing (Trump vs Clinton, 2016 election)
  • Machine learning models. Can we predict the result, and to what extent?
  • Decomposition of a black-box model. Partial dependence.
  • Building an explanatory model
  • More about explainers
  • Towards interpretation. How machine learning helps understand complex processes.

Required Libraries¶

  • matplotlib - Python graphics library
  • seaborn - visualisation front-end for matplotlib
  • numpy, scipy (newest versions)
  • pandas - Python data frames
  • geopandas - spatial extension of pandas
  • scikit-learn - machine learning library
  • xgboost - machine learning model (scikit-learn plugin)
  • shap - interpretation of machine learning models

Note:

Install all of these libraries and check their versions; version conflicts are possible.
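
A minimal version check can save time before starting (a sketch, not part of the original materials; it simply prints the installed versions so conflicts are visible up front):

import numpy, scipy, pandas, geopandas, sklearn, xgboost, shap
for mod in (numpy, scipy, pandas, geopandas, sklearn, xgboost, shap):
    print(mod.__name__, mod.__version__)  # compare against the versions used in class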

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as clt
import geopandas as gpd
import seaborn as sns

Download all materials from Dropbox or create a directory called data containing the dataset in the working directory. Make sure all data files are there:

  • counties.p - large file, over 280MB, spatial data
  • explained.p - dependent variable (Trump vs Clinton election)
  • explanatory.p - social, economic and demographic data
  • shapvalues.p - data transformed into explanation (impact) - will be discussed in detail

Also create a directory called vis for visualisations. It will come in handy.
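
A small setup sketch (assuming the data/ and vis/ paths above, relative to the working directory) that creates vis/ and checks that the files are in place:

import os
os.makedirs("vis", exist_ok=True)  # directory for the generated figures
for fname in ["counties.p", "explained.p", "explanatory.p", "shapvalues.p"]:
    path = os.path.join("data", fname)
    print(path, "OK" if os.path.exists(path) else "MISSING")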

Loading and browsing the data¶

Comments:

what is pickle

pandas and geopandas

In [2]:
explanatory = pd.read_pickle("data/explanatory.p")
explained = pd.read_pickle("data/explained.p")
counties = pd.read_pickle("data/counties.p")

The raw data are hard to review as a table:
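
A quick overview of dimensions and column types can still help before plotting (a small sketch assuming the frames loaded above):

print(explanatory.shape, explained.shape, counties.shape)
explanatory.info()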

In [3]:
explanatory.head()
Out[3]:
race_white race_black race_hispanic sex_female age_0_17 age_over65 ind_divorce marriage_never_married lang_second_eng empl_civilian_unemployed ... transp_public pov_under_1 insured_under65 house_vacant housevalue_median POP16dens MEDIAN_INCOME08_16 POP08_16 edu_high housevalue_IQR
GISJOIN
G0100010 0.756835 0.183709 0.025723 0.511762 0.251649 0.139785 0.196471 0.248933 0.041779 0.033644 ... 0.000782 0.122660 0.896521 0.084265 141000.0 35.167013 -0.012534 0.008408 0.245928 136300.0
G0100030 0.831788 0.092256 0.043667 0.511949 0.221894 0.187149 0.231375 0.246486 0.069489 0.036725 ... 0.002143 0.129938 0.856273 0.301453 173400.0 47.226343 0.012996 0.088874 0.295471 157800.0
G0100050 0.458856 0.478883 0.043098 0.464981 0.215488 0.165289 0.281972 0.345559 0.060732 0.061604 ... 0.003871 0.263737 0.842386 0.227080 90300.0 11.360474 0.064819 -0.031126 0.128678 116300.0
G0100070 0.747652 0.212121 0.022240 0.464646 0.210704 0.148857 0.231938 0.296369 0.023695 0.034764 ... 0.004831 0.164539 0.892983 0.214445 97200.0 13.918126 0.080165 -0.008652 0.120000 112300.0
G0100090 0.876577 0.015580 0.087273 0.504852 0.235703 0.171929 0.182036 0.198090 0.068917 0.029713 ... 0.001733 0.165344 0.867668 0.135472 124200.0 34.243279 0.022570 0.004142 0.130498 119800.0

5 rows × 25 columns

We will use geopandas to browse the data:

  • join geodataframe with other data (gisjoin)
  • set visual parameters
  • set spatial limits.
In [3]:
# this takes some time
ct = counties.join(explained.rename("exp"))
fig,ax = plt.subplots(figsize=(12,7))
ct.plot(column='exp',ax = ax, cmap="bwr",legend=True,norm=clt.CenteredNorm(halfrange=0.7))
ax.set_xlim(-2500000,2300000)
ax.set_ylim(-1500000,1800000)
Out[3]:
(-1500000.0, 1800000.0)

The explained variable is the difference between the percentage of votes for Trump and for Clinton. Positive values mean Trump got more votes, negative values mean Clinton got more votes, and zero means a balance.
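
A short numeric summary of the explained variable complements the map (a sketch using the series loaded above):

explained.describe()  # mean, spread and range of the Trump-minus-Clinton margin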

For the explanatory variables, it is much better to browse the file vis/vars.pdf.

Note:

vars.pdf was created with an external Python script (available as supplementary_visual_map_series.ipynb). In general, visualisation in Python is harsh work and is not the topic of this course. The file supplementary_visual_map_series.ipynb shows how the map series was created; finally the list of *.png files is converted to a PDF using ImageMagick.

Machine learning models¶

Now we will select three ensemble methods:

  • Random forest
  • Extremely randomised trees
  • XGBoost

We create the models in a slightly unusual way (to be discussed): the bias-variance trade-off is shifted towards variance, and each model is fitted to the entire data set.

In [4]:
from sklearn.ensemble import RandomForestRegressor as RFR
from sklearn.ensemble import ExtraTreesRegressor as ETR
from xgboost import XGBRegressor as XGB

modelRFR = RFR(max_features=.2)
modelETR = ETR(max_features=.2)
modelXGB = XGB(objective='reg:squarederror')
In [5]:
modelRFR.fit(explanatory,explained)
pred_RFR = modelRFR.predict(explanatory)

modelETR.fit(explanatory,explained)
pred_ETR = modelETR.predict(explanatory)

modelXGB.fit(explanatory,explained)
pred_XGB = modelXGB.predict(explanatory)

Note:

when the dependent variable is categorical, we replace the .predict() method with .predict_proba() and interpret the probability
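
A hedged sketch of the categorical case (the binary "Trump ahead" target below is hypothetical, built only to illustrate .predict_proba()):

from sklearn.ensemble import ExtraTreesClassifier
clf = ExtraTreesClassifier(max_features=.2)
clf.fit(explanatory, (explained > 0).astype(int))        # hypothetical binary target: Trump ahead
prob_trump_ahead = clf.predict_proba(explanatory)[:, 1]  # probability of class 1 is interpreted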

Now we can compare the models.

We can see that extremely randomised trees (middle panel) tend to give the most consistent results:

In [6]:
fig,axes = plt.subplots(ncols=3, figsize=(14,6))
axes[0].scatter(explained,pred_RFR,s=2)
axes[1].scatter(explained,pred_ETR,s=2)
axes[2].scatter(explained,pred_XGB,s=2)
Out[6]:
<matplotlib.collections.PathCollection at 0x7f695c2d4f70>
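
A numeric counterpart to the scatter plots (a sketch assuming the models fitted above; .score() returns the in-sample R²):

for name, m in [("RFR", modelRFR), ("ETR", modelETR), ("XGB", modelXGB)]:
    print(name, round(m.score(explanatory, explained), 3))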

Decomposition - model level explanation (Partial dependence plots)¶

Partial dependence shows the model outcome when the feature of interest takes given values while the remaining features are averaged out. The mean model response here is ~0.317.

Black-box models are difficult to interpret due to their complex structure. Partial dependence plots (a.k.a. PDP) can be used to visualize and analyze the interaction between the target response [1] and one or more input features of interest. A PDP shows the dependence between the target response and a feature (variable) of interest, while all other features (the 'complement' features) are marginalized (averaged) out. A PDP can thus be described as a function of the input feature of interest. We can also analyse interactions between variables (up to 2, due to human perceptual limitations).

We can see whether relations are linear or non-linear, monotonic or not, in which range of values a feature is important and where it is not, etc.

Note:

there are a lot of issues with PDPs. A PDP assumes that features are independent, which almost never happens, so it can lead to absurd results. ALE plots (Accumulated Local Effects) address that problem.

In [7]:
from sklearn.inspection import PartialDependenceDisplay
In [8]:
# list column names just to select variables of interest
explanatory.columns
Out[8]:
Index(['race_white', 'race_black', 'race_hispanic', 'sex_female', 'age_0_17',
       'age_over65', 'ind_divorce', 'marriage_never_married',
       'lang_second_eng', 'empl_civilian_unemployed', 'ind_agriculture',
       'income_median', 'public_asist', 'gini_index', 'transp_car',
       'transp_public', 'pov_under_1', 'insured_under65', 'house_vacant',
       'housevalue_median', 'POP16dens', 'MEDIAN_INCOME08_16', 'POP08_16',
       'edu_high', 'housevalue_IQR'],
      dtype='object')
In [9]:
# long calculation
fig, axes = plt.subplots(ncols=3,figsize=(16,6),sharey=True)
PartialDependenceDisplay.from_estimator(modelETR, explanatory, features=["edu_high","ind_agriculture","race_black"],ax=axes)
Out[9]:
<sklearn.inspection._plot.partial_dependence.PartialDependenceDisplay at 0x7f4bf47a94c0>

What we can say:

  • edu_high decreases monotonically, meaning: the more educated people, the more votes for Clinton
  • ind_agriculture increases linearly, meaning: the more farmers, the more Trump voters. Moreover, assuming the mean response is 0.31, Clinton gets more votes than Trump only in areas where there are no farmers
  • race_black is flat up to 15% and shows a small benefit for Trump; when the share of African Americans exceeds 15%, support for Clinton increases
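
As mentioned above, PDPs can also be drawn for pairs of features. A sketch of a two-feature interaction plot (the feature pair is chosen only as an illustration):

fig, ax = plt.subplots(figsize=(7, 6))
PartialDependenceDisplay.from_estimator(
    modelETR, explanatory,
    features=[("edu_high", "ind_agriculture")],  # 2D grid over both features
    ax=ax)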

Building an explanation¶

Explanation of a single case¶

SHAP (an abbreviation of SHapley Additive exPlanations) values allow us to explain the output of any machine learning model. It is a model-agnostic method and does not require any assumptions about the model or the data used for its training. It uses a game-theoretic approach that measures each variable's contribution to the model outcome for a single case. The range of SHAP values for each feature determines its importance: the greater the range, the greater the significance of the variable. Features with positive SHAP values have a positive impact (they increase the outcome or probability), while those with negative values have a negative impact (they decrease the outcome or probability).

Explanation is the process of transforming the explanatory data into a new data set that contains the impact of each feature across all cases.

$$ X \sim y \rightarrow X' \sim \hat{y} $$

https://shap.readthedocs.io/en/latest/

In [8]:
import shap
Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. in jupyter console)
In [9]:
model = modelETR
n=23
explainer = shap.explainers.Permutation(model.predict, explanatory) # X masker
shap_values = explainer(explanatory[n:n+1],max_evals=51,error_bounds=True)

There are a few types of explainers:

  • tree - dedicated to tree-based models. These explainers tend to create non-linear explanatory models (a short sketch follows this list).
  • permutation - the main agnostic explainer
  • kernel - more or less similar to the one implemented in shapr (R)
  • partition - respects a hierarchical structure of the data
  • other: linear, Additive, Deep (deep learning)
  • wrappers: Maple, Lime, Gain

We will briefly explore the permutation and partition explainers.
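
The tree explainer from the list is not used later in this notebook, but a short sketch looks like this (exact, fast SHAP values for tree ensembles; shown here for the XGBoost model):

tree_explainer = shap.TreeExplainer(modelXGB)
tree_shap = tree_explainer(explanatory)   # Explanation object for all counties
print(tree_shap.values.shape)             # (n_samples, n_features)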

Waterfall plots¶

Waterfall plots display explanations for individual cases. The plot's starting point is the expected value of the model output over the background dataset. Each row displays the positive (red) or negative (blue) contribution of a feature, moving the value from the expected model output to the output for this particular prediction.

In [10]:
# test a few examples
shap.plots.waterfall(shap_values[0])

The waterfall shows only SHAP values; we can compare it with the original feature values:

In [13]:
explanatory.iloc[n]
Out[13]:
race_white                       0.280693
race_black                       0.694974
race_hispanic                    0.009583
sex_female                       0.538261
age_0_17                         0.251871
age_over65                       0.156761
ind_divorce                      0.294962
marriage_never_married           0.414861
lang_second_eng                  0.026648
empl_civilian_unemployed         0.078192
ind_agriculture                  0.033730
income_median                28136.000000
public_asist                     0.327536
gini_index                       0.517000
transp_car                       0.945817
transp_public                    0.002810
pov_under_1                      0.343000
insured_under65                  0.852488
house_vacant                     0.204673
housevalue_median            81000.000000
POP16dens                       16.094373
MEDIAN_INCOME08_16               0.074796
POP08_16                        -0.049252
edu_high                         0.137843
housevalue_IQR              109200.000000
Name: G0100470, dtype: float64

Create explanation for entire data set¶

In [14]:
# Very long calculation, don't run...
model = modelETR
model.fit(explanatory,explained)
shap_values = explainer(explanatory,max_evals=51,error_bounds=True) 
shapvalues = pd.DataFrame(shap_values.values,columns=explanatory.columns,index=explanatory.index)
Permutation explainer: 3136it [07:17,  7.02it/s]                                
In [11]:
#... simply load precalculated data
shapvalues = pd.read_pickle("data/shapvalues.p")

Compare real value with impact - dependency plots¶

Dependence plots (not partial) show the effect of a single feature across the entire dataset. They plot a feature's value vs. the SHAP value of that feature across all samples, each represented by a dot. SHAP dependence plots account for the interaction effects present in the features and are thus defined only in the regions of the data domain actually covered by data. The vertical dispersion of SHAP values at a single feature value is the result of interactions between variables.

Comments:

cmap

norm

Note:

the original SHAP scatter plot has poor defaults and is not useful with a larger number of features

In [16]:
scatter_kws = dict(cmap="bwr",norm=clt.CenteredNorm(halfrange=0.4),edgecolor="none",s=30)
fig,axes = plt.subplots(ncols=3,figsize=(14,6),sharey=True)
axes[0].scatter(explanatory.edu_high,shapvalues.edu_high,c=shapvalues.edu_high,**scatter_kws)
axes[1].scatter(explanatory.race_white,shapvalues.race_white,c=shapvalues.race_white,**scatter_kws)
axes[2].scatter(explanatory.race_black,shapvalues.race_black,c=shapvalues.race_black,**scatter_kws)
Out[16]:
<matplotlib.collections.PathCollection at 0x7f4bd4aca790>

Dependency plots for the entire data¶

In [13]:
# do not run...
scatter_kws = dict(cmap="bwr",norm=clt.CenteredNorm(halfrange=0.4),edgecolor="none",s=20)
fig,axes = plt.subplots(ncols=5,nrows=5,figsize=(25,25),sharey=True)
for column,ax  in zip(explanatory.columns,axes.flatten()):
    ax.scatter(explanatory[column],shapvalues[column],c=shapvalues[column],**scatter_kws)
    ax.set_title(column)
    ax.set_facecolor("gray")
#fig.savefig("dependency_plots.pdf")

... see also vis/dependency_plot.pdf

For quickly browsing variable importance only, the built-in summary_plot function can be used.

In [15]:
shap.summary_plot(shap_values, explanatory,cmap="bwr")
No data for colormapping provided via 'c'. Parameters 'vmin', 'vmax' will be ignored

Spatial visualisation¶

Dependence plots provide insight into the relationships between variables, but with spatial data the spatial distribution of the impact is equally important. In practice, the spatial pattern of a variable's impact should be analyzed alongside its dependence plot.

In [18]:
fig,ax = plt.subplots(figsize=(10,7))
ct = counties.join(shapvalues["race_hispanic"])
ct.plot(column='race_hispanic',ax = ax, cmap="bwr",legend=True,norm=clt.CenteredNorm(halfrange=0.2))

ax.set_xlim(-2500000,2300000)
ax.set_ylim(-1500000,1800000)
Out[18]:
(-1500000.0, 1800000.0)

...see vis/impact.pdf for all impacts.

More about explainers¶

Up to this point, we have not assumed relationships between features. Game theory assumes that the variables (players) are completely independent, but in practice there are relationships between the variables, causing the importance of some features to be obscured by others. In order to minimize the interaction of features, we use maskers, which define groups within which features exchange importance among themselves.

Comments:

what is masker, max_evals

Shap provides utilities for the fast clustering of variables.

Note:

For this clustering, each variable becomes a "case" described by thousands of observations (counties). This is one reason why we use a cosine metric.

In [19]:
from sklearn.preprocessing import scale
import scipy.cluster.hierarchy as sch
clustering = shap.utils.hclust(scale(explanatory.values), metric="cosine",linkage="average")
fig, ax = plt.subplots(1,1,figsize=(12,4))
dendr = sch.dendrogram(clustering,labels=explanatory.columns,ax=ax)
_ = ax.set_xticklabels(ax.get_xticklabels(), rotation = 90)
In [21]:
masker = shap.maskers.Partition(explanatory, clustering=clustering)
explainer_partition = shap.explainers.Partition(model.predict, masker)
explainer_permutation = shap.explainers.Permutation(model.predict, explanatory)
explainer_correlation = shap.explainers.Permutation(model.predict,shap.maskers.Partition(explanatory)) #correlation is default
In [22]:
n=23
shap_values_partition = explainer_partition(explanatory[n:n+1],max_evals=51,error_bounds=True) 
shap_values_permutation = explainer_permutation(explanatory[n:n+1],max_evals=51,error_bounds=True) 
shap_values_correlation = explainer_correlation(explanatory[n:n+1],max_evals=51,error_bounds=True) 

shap.plots.waterfall(shap_values_permutation[0])
shap.plots.waterfall(shap_values_partition[0])
shap.plots.waterfall(shap_values_correlation[0])
X does not have valid feature names, but ExtraTreesRegressor was fitted with feature names (warning repeated for each call)

Using partition with hierarchical clustering as a masker reveals the effect of gini_index, while the default partition masker (which uses correlation) does not (more in the classroom).

Towards an interpretation¶

In order to interpret a large geospatial dataset, it is necessary to reduce its dimensionality, for example through clustering or PCA. With clustering, we obtain groups of counties that can be analyzed in terms of:

  • spatial arrangement
  • distribution of the dependent variable
  • information about the explanatory variables - in our case, their impact.

We will use Affinity Propagation as the clustering method. PCA code is available in the supplementary materials (supplementary_pca.ipynb).

[From Scikit-learn]: AffinityPropagation creates clusters by sending messages between pairs of samples until convergence. A dataset is then described using a small number of exemplars, which are identified as those most representative of other samples. The messages sent between pairs represent the suitability for one sample to be the exemplar of the other, which is updated in response to the values from other pairs. This updating happens iteratively until convergence, at which point the final exemplars are chosen, and hence the final clustering is given.

Affinity Propagation can be interesting as it chooses the number of clusters based on the data provided. For this purpose, the two important parameters are the preference, which controls how many exemplars are used, and the damping factor which damps the responsibility and availability messages to avoid numerical oscillations when updating these messages.

Affinity Propagation returns exemplars and labels for all samples passed to .fit(). An exemplar is more of a "natural leader" than an "everyman" such as a medoid or centroid.

Note:

Code for the presentation-level plots described below can be found in supplementary_pca.ipynb

In [23]:
from sklearn.cluster import AffinityPropagation
In [24]:
ap = AffinityPropagation(preference=-1.5).fit(shapvalues.values)
ap.cluster_centers_indices_
ap.labels_ # cluster labels, starting from 0
clusters = pd.Series(ap.labels_,index=explanatory.index,name="clusters")
In [20]:
clusters = pd.read_pickle("data/clusters.p")

The optimal number of clusters is always a challenge. Here we found that with a "pretty" preference of -1.5 we get 10 clusters. As we will see later, there are 3 pro-Clinton clusters and 7 pro-Trump ones. A small increase in preference results in 11 clusters with 8 pro-Trump; a small decrease results in 9 clusters with 2 pro-Clinton. It looks like 10 is a local optimum.
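
A small sketch for probing nearby preference values (it reruns the clustering, so it takes a moment; exact cluster counts depend on the data and on the damping):

for pref in (-1.0, -1.5, -2.0):
    ap_probe = AffinityPropagation(preference=pref, random_state=0).fit(shapvalues.values)
    print(pref, "->", len(ap_probe.cluster_centers_indices_), "clusters")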

Spatial distribution¶

In [26]:
fig,ax = plt.subplots(figsize=(10,7))
ct = counties.join(clusters)
ct.plot(column='clusters',ax = ax, cmap="rainbow",legend=True)
ax.set_xlim(-2500000,2300000)
ax.set_ylim(-1500000,1800000)
Out[26]:
(-1500000.0, 1800000.0)

Most important note: we see a strong spatial pattern.

Distribution of dependent variable inside clusters¶

The distribution of the dependent variable in each cluster can be represented using a typical boxplot. To make the boxplot we will use the Seaborn package; before that we need to combine the two series into a DataFrame.

An analysis of the distributions shows that clusters 4, 5 and 6 contain counties where Clinton clearly won, cluster 3 marks a rough balance between the candidates, and the remaining clusters are counties where Trump won.

In [27]:
explained_clusters = pd.DataFrame({"explained":explained,"clusters":clusters})
fig,ax = plt.subplots(figsize=(10,7))
sns.boxplot(data=explained_clusters,x="clusters",y="explained",palette="rainbow",ax=ax)
Out[27]:
<Axes: xlabel='clusters', ylabel='explained'>
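
A numeric counterpart to the boxplot (a sketch using the explained_clusters frame built above): the median margin per cluster.

explained_clusters.groupby("clusters")["explained"].median().round(3)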

How are the clusters shaped?¶

To analyze the effect of individual variables on the outcome, we will use a heatmap from the Seaborn package. Unlike the shap library's built-in tool, it allows adding a column of colors that links the heatmap with the map and the boxplots.

The third element shows how the election result was shaped in each cluster. Leaving aside a detailed interpretation, which is beyond the scope of the course, it is possible to show how the method works in practice:

Cluster 4 - Clinton victory, a 15 to 50% lead over Trump. Both race_white and race_black show a positive impact in favor of Trump. Clinton's victory is due to the following features: marriage_never_married, housevalue_median, edu_high (big-city Democrats).

Cluster 6 - Clinton victory of 20 to 55%. The main features behind the victory were race_white and race_black (black farmers).

In [28]:
from matplotlib import cm
colors = cm.rainbow(np.arange(0,1,0.1))
expl = shapvalues.iloc[ap.cluster_centers_indices_,]
g = sns.clustermap(expl,row_cluster=False,col_cluster=False,cmap="bwr",vmin=-.4,vmax=.2,center=0,row_colors=colors,figsize=(15,15))
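
As a hedged alternative to plotting only the exemplars, the mean SHAP impact per cluster can be tabulated; shown here for a few of the features mentioned in the interpretation above:

cluster_means = shapvalues.groupby(clusters).mean()
cluster_means[["race_white", "race_black", "edu_high", "marriage_never_married", "housevalue_median"]].round(3)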

Note:

Waterfall diagrams from the presentation can be found in the file supplementary_waterfalls.ipynb