Note:
Install all of the libraries below and check their versions; version conflicts are possible.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as clt
import geopandas as gpd
import seaborn as sns
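To spot the version conflicts mentioned in the note above, you can print the installed versions. A quick optional check (the extra matplotlib import is needed because pyplot itself does not expose __version__):

import matplotlib
# print the name and version of every library used in this notebook
for lib in (np, pd, matplotlib, gpd, sns):
    print(lib.__name__, lib.__version__)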
Download all materials from Dropbox, or create a directory called data with the dataset in the working directory. Make sure all data files are there, as checked below.
Also create a directory called vis for the visualisations; it will be helpful.
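A minimal sanity check (the file names are taken from the read_pickle calls used throughout this notebook):

from pathlib import Path
Path("vis").mkdir(exist_ok=True)  # create the visualisation directory if missing
# verify that every pickle this notebook reads is in place
for f in ["explanatory.p", "explained.p", "counties.p", "shapvalues.p", "clusters.p"]:
    print(f, (Path("data") / f).exists())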
explanatory = pd.read_pickle("data/explanatory.p")
explained = pd.read_pickle("data/explained.p")
counties = pd.read_pickle("data/counties.p")
The data are too large to review in full, but we can take a quick preview:
explanatory.head()
| GISJOIN | race_white | race_black | race_hispanic | sex_female | age_0_17 | age_over65 | ind_divorce | marriage_never_married | lang_second_eng | empl_civilian_unemployed | ... | transp_public | pov_under_1 | insured_under65 | house_vacant | housevalue_median | POP16dens | MEDIAN_INCOME08_16 | POP08_16 | edu_high | housevalue_IQR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| G0100010 | 0.756835 | 0.183709 | 0.025723 | 0.511762 | 0.251649 | 0.139785 | 0.196471 | 0.248933 | 0.041779 | 0.033644 | ... | 0.000782 | 0.122660 | 0.896521 | 0.084265 | 141000.0 | 35.167013 | -0.012534 | 0.008408 | 0.245928 | 136300.0 |
| G0100030 | 0.831788 | 0.092256 | 0.043667 | 0.511949 | 0.221894 | 0.187149 | 0.231375 | 0.246486 | 0.069489 | 0.036725 | ... | 0.002143 | 0.129938 | 0.856273 | 0.301453 | 173400.0 | 47.226343 | 0.012996 | 0.088874 | 0.295471 | 157800.0 |
| G0100050 | 0.458856 | 0.478883 | 0.043098 | 0.464981 | 0.215488 | 0.165289 | 0.281972 | 0.345559 | 0.060732 | 0.061604 | ... | 0.003871 | 0.263737 | 0.842386 | 0.227080 | 90300.0 | 11.360474 | 0.064819 | -0.031126 | 0.128678 | 116300.0 |
| G0100070 | 0.747652 | 0.212121 | 0.022240 | 0.464646 | 0.210704 | 0.148857 | 0.231938 | 0.296369 | 0.023695 | 0.034764 | ... | 0.004831 | 0.164539 | 0.892983 | 0.214445 | 97200.0 | 13.918126 | 0.080165 | -0.008652 | 0.120000 | 112300.0 |
| G0100090 | 0.876577 | 0.015580 | 0.087273 | 0.504852 | 0.235703 | 0.171929 | 0.182036 | 0.198090 | 0.068917 | 0.029713 | ... | 0.001733 | 0.165344 | 0.867668 | 0.135472 | 124200.0 | 34.243279 | 0.022570 | 0.004142 | 0.130498 | 119800.0 |

5 rows × 25 columns
We will use geopandas to browse the data.
# this may take some time
ct = counties.join(explained.rename("exp"))
fig,ax = plt.subplots(figsize=(12,7))
ct.plot(column='exp',ax = ax, cmap="bwr",legend=True,norm=clt.CenteredNorm(halfrange=0.7))
ax.set_xlim(-2500000,2300000)
ax.set_ylim(-1500000,1800000)
The explained variable is the difference between the percentages of votes for Trump and for Clinton. Positive values mean Trump got more votes, negative values mean Clinton got more votes, and zero means a perfect balance.
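In symbols, for each county:

$$ y = p_{\mathrm{Trump}} - p_{\mathrm{Clinton}}, \qquad y \in [-1, 1] $$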
For the explanatory variables, it is much better to browse the file vis/vars.pdf.
Note:
vars.pdf was created with an external Python script (available as supplementary_visual_map_series.ipynb). In general, visualisation in Python is tedious, and thus it is not the topic of this course; supplementary_visual_map_series.ipynb shows how the map series was created. Finally, the list of *.png files is converted to a PDF using ImageMagick.
Now we will select three ensemble methods: random forest, extremely randomised trees, and gradient boosting (XGBoost).
We create the models in a slightly weird way (this will be discussed): the bias-variance trade-off is shifted towards variance, and each model will be fitted to the entire data set.
from sklearn.ensemble import RandomForestRegressor as RFR
from sklearn.ensemble import ExtraTreesRegressor as ETR
from xgboost import XGBRegressor as XGB
modelRFR = RFR(max_features=.2)  # each split considers 20% of the features
modelETR = ETR(max_features=.2)  # same setting for extremely randomised trees
modelXGB = XGB(objective='reg:squarederror')
modelRFR.fit(explanatory,explained)
pred_RFR = modelRFR.predict(explanatory)
modelETR.fit(explanatory,explained)
pred_ETR = modelETR.predict(explanatory)
modelXGB.fit(explanatory,explained)
pred_XGB = modelXGB.predict(explanatory)
Note:
When the dependent variable is categorical, we replace the .predict() method with .predict_proba() and interpret the resulting probabilities.
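For illustration, a minimal classification sketch (the binary labels derived from the sign of the explained variable are hypothetical, not part of the course workflow):

# hypothetical binary target: 1 = Trump lead, 0 = Clinton lead
from sklearn.ensemble import RandomForestClassifier
labels = (explained > 0).astype(int)
clf = RandomForestClassifier(max_features=.2).fit(explanatory, labels)
proba = clf.predict_proba(explanatory)[:, 1]  # probability of class 1 (Trump lead)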
Now we can compare the models. We can see that extremely randomised trees (middle panel) tend to give the most consistent results:
fig,axes = plt.subplots(ncols=3, figsize=(14,6))
axes[0].scatter(explained,pred_RFR,s=2)
axes[1].scatter(explained,pred_ETR,s=2)
axes[2].scatter(explained,pred_XGB,s=2)
Partial dependence shows the expected outcome when the feature of interest is fixed at given values while the other features are marginalized out; the mean outcome here is ~0.317.
Black-box models are difficult to interpret due to their complex structure. Partial dependence plots (a.k.a. PDP) can be used to visualize and analyze the interaction between the target response [1] and one or more input features of interest. PDP show the dependence between the target response and a feature of interest, while all other features (the 'complement' features) are marginalized out, i.e. averaged over. A PDP can thus be described as a function of the input feature of interest. We can also analyse interactions between features (up to 2 at a time, due to human limitations).
We can see whether relations are linear or non-linear, monotonic or not, and in which range of values a feature is important and where it is not.
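The marginalization can also be written out by hand; a minimal sketch of a one-feature partial-dependence curve (the feature name and grid size are arbitrary choices):

# hand-rolled PDP: fix one feature on a grid, average predictions over all counties
def manual_pdp(model, X, feature, n_grid=30):
    grid = np.linspace(X[feature].min(), X[feature].max(), n_grid)
    averaged = []
    for value in grid:
        X_mod = X.copy()
        X_mod[feature] = value                        # force the feature to one value
        averaged.append(model.predict(X_mod).mean())  # marginalize the remaining features
    return grid, np.array(averaged)

grid, pdp = manual_pdp(modelETR, explanatory, "edu_high")
plt.plot(grid, pdp)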
Note:
There are a lot of issues with PDP. PDP assumes that the features are independent, which almost never happens, so it can lead to absurd results. ALE plots (Accumulated Local Effects) solve that problem.
from sklearn.inspection import PartialDependenceDisplay
# list column names just to select the variables of interest
explanatory.columns
Index(['race_white', 'race_black', 'race_hispanic', 'sex_female', 'age_0_17', 'age_over65', 'ind_divorce', 'marriage_never_married', 'lang_second_eng', 'empl_civilian_unemployed', 'ind_agriculture', 'income_median', 'public_asist', 'gini_index', 'transp_car', 'transp_public', 'pov_under_1', 'insured_under65', 'house_vacant', 'housevalue_median', 'POP16dens', 'MEDIAN_INCOME08_16', 'POP08_16', 'edu_high', 'housevalue_IQR'], dtype='object')
# long calculation
fig, axes = plt.subplots(ncols=3,figsize=(16,6),sharey=True)
PartialDependenceDisplay.from_estimator(modelETR, explanatory, features=["edu_high","ind_agriculture","race_black"],ax=axes)
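The two-feature interaction mentioned above can be requested by passing a pair of features as a tuple (a sketch; the pair is chosen arbitrarily):

# two-way partial dependence: joint effect of a pair of features
fig, ax = plt.subplots(figsize=(7,6))
PartialDependenceDisplay.from_estimator(modelETR, explanatory, features=[("edu_high","race_black")], ax=ax)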
What we can say:
SHAP (an abbreviation of SHapley Additive exPlanations) values allow us to explain the output of any machine learning model. It is a model-agnostic method and does not require any assumptions about the model or the data used for its training. It uses a game-theoretic approach that measures each variable's contribution to the model outcome for a single case. The range of SHAP values for each feature determines its importance: the greater the range, the greater the significance of the variable. Features with positive SHAP values have a positive impact (they increase the outcome or probability), while those with negative values have a negative impact (they decrease the outcome or probability).
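A key property of SHAP values is local additivity: for every single case the contributions sum exactly to the model's prediction,

$$ \hat{y} = \phi_0 + \sum_{i=1}^{M} \phi_i $$

where $\phi_0$ is the expected model output over the background data and $\phi_i$ is the SHAP value of feature $i$.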
Explanation is the process of transforming the explanatory data into a new data set that contains the impact of each feature across all the cases:
$$ X \sim y \rightarrow X' \sim \hat{y} $$

import shap
model = modelETR
n=23
explainer = shap.explainers.Permutation(model.predict, explanatory) # X masker
shap_values = explainer(explanatory[n:n+1],max_evals=51,error_bounds=True)
There are a few types of explainers; we will briefly explore the permutation and partition explainers.
Waterfall plots display explanations for individual cases. The plot's starting point is the expected value of the model output over the background dataset. Each row displays the positive (red) or negative (blue) contribution of a feature, moving the value from the expected model output to the actual output for this prediction.
# test a few examples
shap.plots.waterfall(shap_values[0])
Only the SHAP values are shown above. We can compare the waterfall with the original values of the features.
explanatory.iloc[n]
race_white                      0.280693
race_black                      0.694974
race_hispanic                   0.009583
sex_female                      0.538261
age_0_17                        0.251871
age_over65                      0.156761
ind_divorce                     0.294962
marriage_never_married          0.414861
lang_second_eng                 0.026648
empl_civilian_unemployed        0.078192
ind_agriculture                 0.033730
income_median               28136.000000
public_asist                    0.327536
gini_index                      0.517000
transp_car                      0.945817
transp_public                   0.002810
pov_under_1                     0.343000
insured_under65                 0.852488
house_vacant                    0.204673
housevalue_median           81000.000000
POP16dens                      16.094373
MEDIAN_INCOME08_16              0.074796
POP08_16                       -0.049252
edu_high                        0.137843
housevalue_IQR             109200.000000
Name: G0100470, dtype: float64
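We can also verify the additivity property numerically; a quick check using the Explanation object's .base_values and .values attributes:

# base value plus the sum of SHAP values should reproduce the model prediction
reconstructed = shap_values.base_values[0] + shap_values.values[0].sum()
print(reconstructed, model.predict(explanatory[n:n+1])[0])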
# very long calculation, don't run...
model = modelETR
model.fit(explanatory,explained)
shap_values = explainer(explanatory,max_evals=51,error_bounds=True)
shapvalues = pd.DataFrame(shap_values.values,columns=explanatory.columns,index=explanatory.index)
#... simply load precalculated data
shapvalues = pd.read_pickle("data/shapvalues.p")
Dependence plots (not partial dependence) show the effect of a single feature across the entire dataset: each dot plots a feature's value against the SHAP value of that feature for one sample. Unlike PDP, SHAP dependence plots take into account the interaction effects present in the features, and thus are defined only in regions of the data domain covered by actual samples. The vertical dispersion of SHAP values at a single feature value is the result of interactions between variables.
Comments:
- cmap
- norm
Note:
The original SHAP scatter plot has bad defaults and is useless with a higher number of features.
scatter_kws = dict(cmap="bwr",norm=clt.CenteredNorm(halfrange=0.4),edgecolor="none",s=30)
fig,axes = plt.subplots(ncols=3,figsize=(14,6),sharey=True)
axes[0].scatter(explanatory.edu_high,shapvalues.edu_high,c=shapvalues.edu_high,**scatter_kws)
axes[1].scatter(explanatory.race_white,shapvalues.race_white,c=shapvalues.race_white,**scatter_kws)
axes[2].scatter(explanatory.race_black,shapvalues.race_black,c=shapvalues.race_black,**scatter_kws)
# do not run...
scatter_kws = dict(cmap="bwr",norm=clt.CenteredNorm(halfrange=0.4),edgecolor="none",s=20)
fig,axes = plt.subplots(ncols=5,nrows=5,figsize=(25,25),sharey=True)
for column,ax in zip(explanatory.columns,axes.flatten()):
    ax.scatter(explanatory[column],shapvalues[column],c=shapvalues[column],**scatter_kws)
    ax.set_title(column)
    ax.set_facecolor("gray")
#fig.savefig("dependency_plots.pdf")
... see also vis/dependency_plot.pdf
For quick browsing of variable importance, the built-in summary_plot function can be used.
shap.summary_plot(shap_values, explanatory,cmap="bwr")
Dependence plots provide insight into the relationships between variables, but with spatial data the spatial distribution of a feature's impact is equally important. In practice, the spatial pattern of a variable's impact should be analyzed alongside its dependence plot.
fig,ax = plt.subplots(figsize=(10,7))
ct = counties.join(shapvalues["race_hispanic"])
ct.plot(column='race_hispanic',ax = ax, cmap="bwr",legend=True,norm=clt.CenteredNorm(halfrange=0.2))
ax.set_xlim(-2500000,2300000)
ax.set_ylim(-1500000,1800000)
...see vis/impact.pdf for all impacts.
Up to this point, we have not assumed any relationships between features. Game theory assumes that the variables (players) are completely independent, but in practice there are relationships between the variables, causing the importance of some features to be obscured by others. In order to minimize these interactions, we use maskers, which define groups of features within which importance is exchanged.
Comments:
- what is a masker
- max_evals
SHAP provides utilities for fast clustering of variables.
Note:
Each variable is a "case" represented by thousands of "observations" (one per county). This is one reason why we use the cosine metric.
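To make the note concrete, the cosine distance between two scaled variable columns can be computed directly; a small sketch (the pair of columns is arbitrary):

# each column is one "case": a vector with one entry per county
from scipy.spatial.distance import cosine
from sklearn.preprocessing import scale
X = scale(explanatory.values)
print(cosine(X[:,0], X[:,1]))  # distance between the first two variables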
from sklearn.preprocessing import scale
import scipy.cluster.hierarchy as sch
clustering = shap.utils.hclust(scale(explanatory.values), metric="cosine",linkage="average")
fig, ax = plt.subplots(1,1,figsize=(12,4))
dendr = sch.dendrogram(clustering,labels=explanatory.columns,ax=ax)
_ = ax.set_xticklabels(ax.get_xticklabels(), rotation = 90)
masker = shap.maskers.Partition(explanatory, clustering=clustering)
explainer_partition = shap.explainers.Partition(model.predict, masker)
explainer_permutation = shap.explainers.Permutation(model.predict, explanatory)
explainer_correlation = shap.explainers.Permutation(model.predict,shap.maskers.Partition(explanatory)) #correlation is default
n=23
shap_values_partition = explainer_partition(explanatory[n:n+1],max_evals=51,error_bounds=True)
shap_values_permutation = explainer_permutation(explanatory[n:n+1],max_evals=51,error_bounds=True)
shap_values_correlation = explainer_correlation(explanatory[n:n+1],max_evals=51,error_bounds=True)
shap.plots.waterfall(shap_values_permutation[0])
shap.plots.waterfall(shap_values_partition[0])
shap.plots.waterfall(shap_values_correlation[0])
Using a partition masker with hierarchical clustering reveals the effect of gini_index, while the default partition masker (which uses correlation-based clustering) does not. (More in the classroom.)
In order to interpret a large geospatial dataset, it is necessary to reduce its dimensionality, for example through clustering or PCA. With clustering, we obtain a group of clusters that can be analyzed in terms of:
We will use Affinity Propagation as the clustering method. The PCA code is available in the supplementary materials (supplementary_pca.ipynb).
[From Scikit-learn]: AffinityPropagation creates clusters by sending messages between pairs of samples until convergence. A dataset is then described using a small number of exemplars, which are identified as those most representative of other samples. The messages sent between pairs represent the suitability for one sample to be the exemplar of the other, which is updated in response to the values from other pairs. This updating happens iteratively until convergence, at which point the final exemplars are chosen, and hence the final clustering is given.
Affinity Propagation can be interesting as it chooses the number of clusters based on the data provided. For this purpose, the two important parameters are the preference, which controls how many exemplars are used, and the damping factor which damps the responsibility and availability messages to avoid numerical oscillations when updating these messages.
Affinity Propagation returns exemplars and labels for all samples passed to .fit(). An exemplar is a "natural leader" rather than an "everyman", similar to a medoid or centroid.
Note:
The code for the presentation-level plots described below can be found in supplementary_pca.ipynb.
from sklearn.cluster import AffinityPropagation
ap = AffinityPropagation(preference=-1.5).fit(shapvalues.values)
ap.cluster_centers_indices_
ap.labels_ # cluster labels, starting from 0
clusters = pd.Series(ap.labels_,index=explanatory.index,name="clusters")
clusters = pd.read_pickle("data/clusters.p")  # ...or simply load the precalculated labels
The optimal number of clusters is always a challenge. Here we found that the "pretty" preference of -1.5 yields 10 clusters. As we will notice later, there are 3 pro-Clinton clusters and 7 pro-Trump ones. A small increase of the preference results in 11 clusters with 8 pro-Trump; a small decrease results in 9 clusters with only 2 pro-Clinton. It looks like 10 is a local optimum.
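The sensitivity described above can be checked with a small loop (a sketch; the exact cluster counts depend on the data and may not reproduce exactly):

# how does the number of clusters react to the preference parameter?
for pref in (-2.0, -1.5, -1.0):
    labels = AffinityPropagation(preference=pref).fit(shapvalues.values).labels_
    print(pref, len(np.unique(labels)))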
fig,ax = plt.subplots(figsize=(10,7))
ct = counties.join(clusters)
ct.plot(column='clusters',ax = ax, cmap="rainbow",legend=True)
ax.set_xlim(-2500000,2300000)
ax.set_ylim(-1500000,1800000)
Most important note: we see a strong spatial pattern.
The distribution of the dependent variable in each cluster can be represented using a typical boxplot. To make the boxplot we will use the Seaborn package; before that, we need to combine the two series into a DataFrame.
An analysis of the distributions shows that clusters 4, 5 and 6 contain counties where Clinton significantly won, cluster 3 marks a rough balance between the candidates, and the remaining clusters are counties where Trump won.
explained_clusters = pd.DataFrame({"explained":explained,"clusters":clusters})
fig,ax = plt.subplots(figsize=(10,7))
sns.boxplot(data=explained_clusters,x="clusters",y="explained",palette="rainbow",ax=ax)
To analyze the effect of individual variables on the outcome, we will use a heatmap from the Seaborn package. Unlike the shap library's built-in tool, it allows you to add a column of colors, making it possible to relate the map and the boxplots.
The third element shows how the election result was shaped in each cluster. Leaving aside the detailed interpretation, which is beyond the scope of the course, it is possible to show how the method works in practice:
Cluster 4: Clinton victory with a 15 to 50% lead over Trump. Both race_white and race_black show a positive impact in favor of Trump. Clinton's victory is due to the following characteristics: marriage_never_married, housevalue_median, edu_high (big-city Democrats).
Cluster 6: Clinton victory of 20 to 55%. The main traits that gave the victory were race_white and race_black (black farmers).
from matplotlib import cm
colors = cm.rainbow(np.arange(0,1,0.1))
expl = shapvalues.iloc[ap.cluster_centers_indices_,]
g = sns.clustermap(expl,row_cluster=False,col_cluster=False,cmap="bwr",vmin=-.4,vmax=.2,center=0,row_colors=colors,figsize=(15,15))
Note:
Waterfall diagrams from the presentation can be found in the file supplementary_waterfalls.ipynb.