Titanic data set example

Note: This example focuses less on finding anomalies than on illustrating model explainability for categorical and continuous features.

[1]:

import numpy as np
from sklearn.datasets import fetch_openml
from bhad.model import BHAD

[2]:

X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

X.head(2)

[2]:

	pclass	name	sex	age	sibsp	parch	ticket	fare	cabin	embarked	boat	body	home.dest
0	1	Allen, Miss. Elisabeth Walton	female	29.0000	0	0	24160	211.3375	B5	S	2	NaN	St Louis, MO
1	1	Allison, Master. Hudson Trevor	male	0.9167	1	2	113781	151.5500	C22 C26	S	11	NaN	Montreal, PQ / Chesterville, ON

[3]:

# Drop columns that are not needed for this example, then drop rows with missing values
X_cleaned = X.drop(['body', 'cabin', 'name', 'ticket', 'boat'], axis=1).dropna()
y_cleaned = y[X_cleaned.index]

X_cleaned.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
Index: 684 entries, 0 to 1281
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   pclass     684 non-null    int64
 1   sex        684 non-null    category
 2   age        684 non-null    float64
 3   sibsp      684 non-null    int64
 4   parch      684 non-null    int64
 5   fare       684 non-null    float64
 6   embarked   684 non-null    category
 7   home.dest  684 non-null    object
dtypes: category(2), float64(2), int64(3), object(1)
memory usage: 39.0+ KB

Partition the dataset into training and test sets:

[4]:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_cleaned, y_cleaned, test_size=0.33, random_state=42)

print(X_train.shape)
print(X_test.shape)

print(np.unique(y_train, return_counts=True))
print(np.unique(y_test, return_counts=True))

(458, 8)
(226, 8)
(array(['0', '1'], dtype=object), array([242, 216]))
(array(['0', '1'], dtype=object), array([122, 104]))
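The shapes and class counts printed above can be checked by hand. The following is a minimal sketch of that arithmetic, assuming scikit-learn rounds a fractional `test_size` up to the next whole observation; the helper `share` is a hypothetical name introduced only for this illustration.

```python
import math

# Sizes taken from the outputs above: 684 cleaned rows, test_size=0.33
n_samples = 684
test_size = 0.33

# Assumption: the test share is rounded up, the rest goes to the train set
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)

# Class counts as printed by np.unique(..., return_counts=True)
train_counts = {"0": 242, "1": 216}
test_counts = {"0": 122, "1": 104}


def share(counts, label):
    """Fraction of observations carrying the given label (hypothetical helper)."""
    return counts[label] / sum(counts.values())


# Share of class '1' in each partition
print(round(share(train_counts, "1"), 3), round(share(test_counts, "1"), 3))
```

The share of class '1' is about 0.47 in the training set and 0.46 in the test set, so the two partitions are similarly balanced even though the split was not stratified.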

Train the model and create local and global model explanations:

First, specify all numeric and categorical columns explicitly:

[5]:

num_cols = list(X_train.select_dtypes(include=['float', 'int']).columns)
cat_cols = list(X_train.select_dtypes(include=['object', 'category']).columns)

Fit the model and score the training set:

[6]:

model = BHAD(
    contamination=0.01,
    num_features=num_cols,
    cat_features=cat_cols,
    nbins=None,
    verbose=False
)

y_pred_train_new = model.fit_predict(X_train)
scores_train_new = model.decision_function(X_train)

print("Training predictions:", np.unique(y_pred_train_new, return_counts=True))

Training predictions: (array([-1,  1]), array([  5, 453]))
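With `contamination=0.01` and 458 training observations, 5 observations are flagged as outliers (label -1). The sketch below illustrates how a contamination rate can be turned into labels in general. It is not BHAD's internal code: the function `flag_outliers`, the toy scores, and the conventions that lower scores are more anomalous and that the outlier count is rounded up are all assumptions made for this illustration.

```python
import math
import random


def flag_outliers(scores, contamination):
    """Label the lowest-scoring fraction as -1 (outlier) and the rest as 1.

    Hypothetical helper: assumes lower scores mean 'more anomalous' and
    rounds the number of outliers up to the next integer.
    """
    n_outliers = math.ceil(contamination * len(scores))
    # Indices ordered by ascending score, most anomalous first
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    labels = [1] * len(scores)
    for i in order[:n_outliers]:
        labels[i] = -1
    return labels


random.seed(0)
# Toy scores standing in for model.decision_function(X_train)
toy_scores = [random.gauss(0.0, 1.0) for _ in range(458)]

labels = flag_outliers(toy_scores, contamination=0.01)
print(labels.count(-1), labels.count(1))
```

Under these assumptions 1% of 458 is 4.58, which rounds up to 5 flagged observations, consistent with the counts printed above.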

[7]:

from bhad import explainer

local_expl = explainer.Explainer(bhad_obj=model, discretize_obj=model._discretizer).fit()

--- BHAD Model Explainer ---

Using fitted BHAD and discretizer.
Marginal distributions estimated using train set of shape (458, 8)

[8]:

df_train = local_expl.get_explanation(nof_feat_expl=5)   # at most 5 features per explanation

Create local explanations for 458 observations.

[9]:

global_feat_imp = local_expl.global_feat_imp         # based on X_train
global_feat_imp

[9]:

	avg ranks
embarked	0.152058
sex	0.257304
parch	0.279548
sibsp	0.444223
age	0.491700
fare	0.546813
pclass	0.634462
home.dest	1.000000

Plot the global model explanation (feature importances, in decreasing order from top to bottom):

[10]:

from matplotlib import pyplot as plt

plt.barh(global_feat_imp.index, global_feat_imp.values.flatten())
plt.xlabel("Feature importances");

[Figure: horizontal bar chart of the global feature importances per feature]

Print the local explanations, i.e. the most important features per observation (in decreasing order), for every 100th training observation:

[11]:

for obs, ex in enumerate(df_train.explanation.values):
    if (obs % 100) == 0:
        print(f'\nObs. {obs}:\n', ex)


Obs. 0:
 parch (Cumul.perc.: 0.996): 5.0
home.dest (Perc.: 0.011): Sweden Winnipeg, MN
sex (Perc.: 0.4): female

Obs. 100:
 home.dest (Perc.: 0.002): Tofta, Sweden Joliet, IL
fare (Cumul.perc.: 0.07): 7.78

Obs. 200:
 home.dest (Perc.: 0.013): Brooklyn, NY

Obs. 300:
 home.dest (Perc.: 0.007): Bournmouth, England
age (Cumul.perc.: 0.05): 5.0
sex (Perc.: 0.4): female

Obs. 400:
 home.dest (Perc.: 0.002): Taalintehdas, Finland Hoboken, NJ
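Judging from the numbers above, "Perc." for a categorical feature appears to be the relative frequency of the observed category in the training set (for example, 0.002 is roughly 1 in 458), and "Cumul.perc." for a numeric feature appears to be the empirical cumulative distribution evaluated at the observed value. The following minimal sketch illustrates both quantities on toy data under that reading. The helper names and the toy values are assumptions and not part of the bhad API.

```python
from collections import Counter


def relative_frequency(values, value):
    """Share of observations equal to `value` (0.0 if it never occurs)."""
    return Counter(values)[value] / len(values)


def ecdf(values, x):
    """Empirical CDF: share of observations less than or equal to `x`."""
    return sum(v <= x for v in values) / len(values)


# Toy columns standing in for a categorical and a numeric training feature
toy_sex = ["female"] * 4 + ["male"] * 6
toy_fare = [7.25, 7.75, 7.78, 8.05, 13.0, 26.0, 27.75, 52.0, 106.42, 211.34]

print(relative_frequency(toy_sex, "female"))   # rare categories give small values
print(relative_frequency(toy_sex, "unknown"))  # unseen category gives 0.0
print(ecdf(toy_fare, 7.78))                    # values near 0 or 1 lie in the tails
```

Read this way, a category with a small relative frequency, or a numeric value whose cumulative share is close to 0 or 1, is unusual with respect to the training data, which is why such features show up in the explanations.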

[12]:

# Score the test set; the explainer then refers to the most recently scored data
y_pred_test = model.predict(X_test)

[13]:

df_test = local_expl.get_explanation(nof_feat_expl=4)   # at most 4 features per explanation
df_test.head(2)

Create local explanations for 226 observations.

[13]:

	pclass	sex	age	sibsp	parch	fare	embarked	home.dest	explanation
0	2.0	male	36.0	1.0	2.0	27.7500	S	Bournmouth, England	home.dest (Perc.: 0.007): Bournmouth, England
1	1.0	male	49.0	1.0	1.0	110.8833	C	Haverford, PA	home.dest (Perc.: 0.007): Haverford, PA\nfare ...

[14]:

for obs, ex in enumerate(df_test.explanation.values):
    if (obs % 50) == 0:
        print(f'\nObs. {obs}:\n', ex)


Obs. 0:
 home.dest (Perc.: 0.007): Bournmouth, England

Obs. 50:
 home.dest (Perc.: 0.002): Deephaven, MN / Cedar Rapids, IA
fare (Cumul.perc.: 0.91): 106.42

Obs. 100:
 home.dest (Perc.: 0.002): Hudson, NY
sex (Perc.: 0.4): female

Obs. 150:
 home.dest (Perc.: 0.0): ?Havana, Cuba

Obs. 200:
 embarked (Perc.: 0.048): Q
home.dest (Perc.: 0.0): Co Sligo, Ireland Hartford, CT
sex (Perc.: 0.4): female
fare (Cumul.perc.: 0.061): 7.75

[15]:

local_expl.global_feat_imp   # based on X_test

[15]:

	avg ranks
embarked	0.157711
parch	0.245639
sex	0.256804
sibsp	0.441731
age	0.480112
fare	0.575715
pclass	0.638521
home.dest	1.000000
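The two "avg ranks" tables can be compared directly to see how stable the global explanation is between the training and the test set. The sketch below copies the printed values into plain dictionaries; the helper `ranking` and the variable names are introduced only for this illustration.

```python
# "avg ranks" values copied from the two outputs above
train_imp = {
    "embarked": 0.152058, "sex": 0.257304, "parch": 0.279548,
    "sibsp": 0.444223, "age": 0.491700, "fare": 0.546813,
    "pclass": 0.634462, "home.dest": 1.000000,
}
test_imp = {
    "embarked": 0.157711, "parch": 0.245639, "sex": 0.256804,
    "sibsp": 0.441731, "age": 0.480112, "fare": 0.575715,
    "pclass": 0.638521, "home.dest": 1.000000,
}


def ranking(importances):
    """Feature names ordered by ascending 'avg ranks' value (hypothetical helper)."""
    return sorted(importances, key=importances.get)


train_rank = ranking(train_imp)
test_rank = ranking(test_imp)

# Features whose position differs between the two orderings
moved = [f for f in train_rank if train_rank.index(f) != test_rank.index(f)]
print(moved)

# Feature with the largest absolute change in value
largest_shift = max(train_imp, key=lambda f: abs(train_imp[f] - test_imp[f]))
print(largest_shift, round(abs(train_imp[largest_shift] - test_imp[largest_shift]), 6))
```

Only `sex` and `parch` swap places, and `home.dest` has the largest value in both tables, so the global picture is essentially the same on both partitions.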