BHAD package

Submodules

bhad.explainer module

class bhad.explainer.Explainer(bhad_obj: Type[BHAD], discretize_obj: Type[Discretize], verbose: bool = True)

Bases: object

fit() Explainer
get_explanation(thresholds: List[float] = None, nof_feat_expl: int = 5, append: bool = True) DataFrame

Find the most infrequent feature realizations based on the BHAD output. Motivation: the BHAD anomaly score is simply the unweighted average of the log probabilities per feature level/bin (categorical + discretized numerical). Therefore, the levels that make an observation an outlier are those with (relatively) infrequent counts.

Parameters

nof_feat_expl: max. number of features to report (e.g. the 5 most infrequent features per observation)
thresholds: list of threshold values per feature in [0, 1], referring to relative frequencies

Returns:

df_original: original dataset + additional column with explanations
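
A minimal usage sketch based only on the signatures documented above; the toy DataFrame, the choice to fit a Discretize transformer explicitly, and all variable names are illustrative assumptions rather than prescribed usage:

    import pandas as pd
    from bhad.model import BHAD
    from bhad.utils import Discretize
    from bhad.explainer import Explainer

    # Illustrative data: one numeric and one categorical feature
    df = pd.DataFrame({
        "amount":  [10.5, 12.1, 11.8, 250.0, 11.2, 10.9],
        "channel": ["web", "web", "app", "app", "web", "web"],
    })

    # Fit the discretizer explicitly so it can be passed to the Explainer
    disc = Discretize(columns=["amount"], nbins=None, verbose=False).fit(df)

    model = BHAD(contamination=0.2, num_features=["amount"],
                 cat_features=["channel"], verbose=False).fit(df)

    expl = Explainer(model, disc, verbose=False).fit()
    df_expl = expl.get_explanation(nof_feat_expl=2)   # original data + explanation column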

bhad.model module

class bhad.model.BHAD(contamination: float = 0.01, alpha: float = 0.5, exclude_col: List[str] = [], num_features: List[str] = [], cat_features: List[str] = [], append_score: bool = False, verbose: bool = True, nbins: int | None = None, discretize: bool = True, lower: float | None = None, k: int = 1, round_intervals: int = 5, eps: float = 0.001, make_labels: bool = False, prior_gamma: float = 0.9, prior_max_M: int | None = None)

Bases: BaseEstimator, OutlierMixin

decision_function(X: DataFrame) array

Outlier score centered around the threshold value. Outliers are scored negatively (<= 0) and inliers are scored positively (> 0).

Parameters

X : pandas.DataFrame, shape (n_samples, n_features)

The input samples. X values should be of type str, or easily castable to str (e.g. categorical). If discretize=True, continuous features will be automatically discretized using the fitted discretizer.

Returns

scores : numpy.array, shape (n_samples,)

The outlier score of the input samples, centered around the threshold value.

fit(X: DataFrame, y: array | Series = None) BHAD

Apply the BHAD and calculate the outlier threshold value.

Parameters

X : pandas.DataFrame, shape (n_samples, n_features)

The input samples. X values should be of type str, or easily castable to str (e.g. categorical). If discretize=True (default), continuous features will be automatically discretized.

y : Ignored

Not used, present for API consistency by convention.

Returns

self : BHAD object

predict(X: DataFrame) array

Returns labels for X.

Returns -1 for outliers and 1 for inliers.

Parameters

X : pandas.DataFrame, shape (n_samples, n_features)

The input samples. X values should be of type str, or easily castable to str (e.g. categorical). If discretize=True, continuous features will be automatically discretized using the fitted discretizer.

Returns

scores : array, shape (n_samples,)

The outlier labels of the input samples. -1 means an outlier, 1 means an inlier.

score_samples(X: DataFrame) DataFrame

Outlier score calculated by summing the counts of each feature level in the dataset.

Parameters

X : pandas.DataFrame, shape (n_samples, n_features)

The input samples. X values should be of type str, or easily castable to str (e.g. categorical). If discretize=True, continuous features will be automatically discretized using the fitted discretizer.

Returns

scores : numpy.array, shape (n_samples,)

The outlier score of the input samples, centered around the threshold value.
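
A short end-to-end sketch of the estimator API described above (fit, predict, decision_function, score_samples); the synthetic data and parameter choices are purely illustrative:

    import pandas as pd
    from bhad.model import BHAD

    df = pd.DataFrame({
        "amount":  [10.5, 12.1, 11.8, 250.0, 11.2, 10.9, 12.4, 11.5],
        "channel": ["web", "web", "app", "app", "web", "web", "app", "web"],
    })

    model = BHAD(
        contamination=0.125,          # expected share of outliers
        num_features=["amount"],      # discretized automatically (discretize=True by default)
        cat_features=["channel"],
        verbose=False,
    ).fit(df)

    labels = model.predict(df)            # -1 = outlier, 1 = inlier
    scores = model.decision_function(df)  # <= 0 for outliers, > 0 for inliers
    raw = model.score_samples(df)         # uncentered outlier scores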

bhad.utils module

class bhad.utils.Discretize(columns: List[str] = [], nbins: int = None, lower: float = None, k: int = 1, round_intervals: int = 5, eps: float = 0.001, make_labels: bool = False, verbose: bool = True, prior_gamma: float = 0.9, prior_max_M: int = None, **kwargs)

Bases: BaseEstimator, TransformerMixin

Discretize continuous features by binning. Computes the posterior over the number of bins and its MAP estimate. Used as input for the Bayesian histogram anomaly detector (BHAD).

Input:

columns: list of feature names
nbins: number of bins to discretize numeric features into; if None, MAP estimates will be computed for each feature
lower: optional lower value for the first bin, very often 0, e.g. for amounts
k: number of standard deviations to be used for the intervals, see k*np.std(v)
round_intervals: number of digits to round the intervals to
eps: minimum value of the variance of a numeric feature (check for 'zero-variance features')
make_labels: assign integer labels to bins instead of technical intervals

fit(X: DataFrame) Discretize
log_post_pmf_nof_bins(feature_values: array) Dict[int, float]

Evaluate the log posterior probability of the number of bins over a grid of supported values; see the 'posteriors' section of the paper.

Args:

feature_values (np.array): univariate variable values

Returns:

Dict[int, float]: grid of candidate bin counts mapped to their log-pmf values

transform(X: DataFrame) DataFrame
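
A brief sketch of the transformer interface above; the synthetic data and column choices are illustrative:

    import numpy as np
    import pandas as pd
    from bhad.utils import Discretize

    rng = np.random.default_rng(0)
    df = pd.DataFrame({"x": rng.normal(size=500), "y": rng.exponential(size=500)})

    # nbins=None -> the MAP number of bins is estimated per feature
    disc = Discretize(columns=["x", "y"], nbins=None, verbose=False)
    df_binned = disc.fit(df).transform(df)   # numeric columns mapped to interval bins

    # Log-posterior over the number of bins for a single feature
    log_pmf = disc.log_post_pmf_nof_bins(df["x"].values)
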
bhad.utils.accratio(x: array) array

Acceptance ratio under a Gaussian proposal density.

x: grid of values on the support of x, e.g. np.linspace(1e-8, 1-1e-8, 100)

bhad.utils.bart_simpson_density(x: array, m: int = 4) array

Calculate the density of the Bart Simpson distribution, a.k.a. 'The Claw' (see Larry Wasserman, All of Nonparametric Statistics, Section 6).

Args:

x (np.array): discrete grid over the support of the distribution
m (int, optional): number of mixture components. Defaults to 4.

Returns:

np.array: density values
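
For reference, Wasserman's claw density is an equal-weights normal mixture: one broad N(0, 1) component plus several narrow "teeth". A hedged sketch of that formula; reading m as the upper index of the teeth (so that m=4 gives the textbook claw) is an assumption, not necessarily the package's convention:

    import numpy as np
    from scipy.stats import norm

    def claw_density_sketch(x: np.ndarray, m: int = 4) -> np.ndarray:
        # 0.5 * N(0, 1) plus (m + 1) narrow components carrying total weight 0.5
        dens = 0.5 * norm.pdf(x, loc=0.0, scale=1.0)
        for j in range(m + 1):
            dens += norm.pdf(x, loc=j / 2.0 - 1.0, scale=0.1) / (2 * (m + 1))
        return dens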

bhad.utils.exp_normalize(x: array) array

Exp-normalize trick https://timvieira.github.io/blog/post/2014/02/11/exp-normalize-trick/

Args:

x (np.array): sample data points

Returns:

np.array: Normalized input array
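
The exp-normalize trick evaluates softmax-style weights without overflow by subtracting the maximum before exponentiating. A minimal sketch of the idea, not necessarily the package's exact implementation:

    import numpy as np

    def exp_normalize_sketch(x: np.ndarray) -> np.ndarray:
        # Subtracting max(x) leaves the normalized weights unchanged
        # but keeps np.exp from overflowing for large inputs.
        y = np.exp(x - x.max())
        return y / y.sum()

    weights = exp_normalize_sketch(np.array([1000.0, 1001.0, 1002.0]))  # no overflow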

bhad.utils.freedman_diaconis(data: array, return_width: bool = False) int

Use the Freedman–Diaconis rule to compute the optimal histogram bin width or number of bins.

Parameters

data: np.ndarray

One-dimensional array.

return_width: Boolean

If True, return the optimal bin width instead of the number of bins.
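
The Freedman–Diaconis rule sets the bin width to 2 * IQR * n^(-1/3); the number of bins then follows from the data range. A hedged sketch of the rule (the package's rounding and return conventions may differ):

    import numpy as np
    from scipy.stats import iqr

    def fd_sketch(data: np.ndarray, return_width: bool = False):
        width = 2.0 * iqr(data) * len(data) ** (-1.0 / 3.0)
        if return_width:
            return width
        return int(np.ceil((data.max() - data.min()) / width))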

bhad.utils.geometric_prior(M: int, gamma: float = 0.7, max_M: int = 100, log: bool = False) float

Geometric (power series) prior

Args:

M (int): number of bins
gamma (float, optional): prior hyperparameter. Defaults to 0.7.
max_M (int, optional): maximum value of M. Defaults to 100.
log (bool, optional): return the log prior. Defaults to False.

Returns:

float: (log) prior density
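
A geometric (power-series) prior puts mass proportional to gamma**M on the number of bins M, truncated at max_M. A minimal sketch; normalizing over the grid 1..max_M is an assumption about the implementation:

    import numpy as np

    def geometric_prior_sketch(M: int, gamma: float = 0.7, max_M: int = 100,
                               log: bool = False) -> float:
        support = np.arange(1, max_M + 1)
        probs = gamma ** support
        probs /= probs.sum()          # truncated power-series prior
        p = probs[M - 1]
        return float(np.log(p)) if log else float(p)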

bhad.utils.jitter(M: int, noise_scale: float = 100000.0, seed: int = None) array

Generates jitter that can be added to any float, e.g. helps when used with pd.qcut to produce unique class edges.

M: number of random draws, i.e. the output size
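
A plausible sketch of such a helper: tiny uniform noise scaled down by noise_scale, which breaks ties so that pd.qcut can form unique bin edges. The exact noise distribution used by the package is not documented here:

    import numpy as np
    import pandas as pd

    def jitter_sketch(M: int, noise_scale: float = 1e5, seed: int = None) -> np.ndarray:
        rng = np.random.default_rng(seed)
        return rng.uniform(size=M) / noise_scale

    x = pd.Series([1.0] * 50 + [2.0] * 50)                   # heavily tied values
    bins = pd.qcut(x + jitter_sketch(len(x), seed=0), q=4)   # unique edges despite ties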

bhad.utils.log_marglike_nbins(M: int, y: array) float

Log-posterior of number of bins M using conjugate Jeffreys’ prior for the bin probabilities and a flat improper prior for the number of bins. This is therefore equivalent to the marginal log-likelihood of the number of bins.

Args:

M (int): number of bins parameter
y (np.array): univariate sample data points

Returns:

float: log posterior probability value
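
Under a Jeffreys prior (Dirichlet with all concentrations equal to 1/2), the marginal likelihood of the bin counts has a closed Dirichlet-multinomial form. A hedged sketch; the equal-width binning step and any additive constants are assumptions about the implementation:

    import numpy as np
    from scipy.special import gammaln

    def log_marglike_nbins_sketch(M: int, y: np.ndarray) -> float:
        counts, _ = np.histogram(y, bins=M)   # assumed binning rule
        n = counts.sum()
        # Dirichlet-multinomial marginal likelihood with alpha_j = 1/2
        return float(gammaln(M / 2.0) - gammaln(n + M / 2.0)
                     + np.sum(gammaln(counts + 0.5)) - M * gammaln(0.5))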

class bhad.utils.mvt2mixture(thetas: dict = {'Sigma1': None, 'Sigma2': None, 'mean1': None, 'mean2': None, 'nu1': None, 'nu2': None}, seed: int = None, gaussian: bool = False, **figure_param)

Bases: object

draw(n_samples: int = 100, k: int = 2, p: float = 0.5) Tuple

Random number generator.

Input:

n_samples: number of realizations to generate
k: number of features (dimension of the t-distribution)
p: success probability of the Bernoulli(p) p.m.f.

show2D(save_plot: bool = False, legend_on: bool = True, **kwargs)

Make scatter plot for first two dimensions of the random draws

show3D(save_plot: bool = False, legend_on: bool = True, **kwargs)

Make scatter plot for first three dimensions of the random draws
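
A usage sketch that draws from a two-component mixture and plots the first two dimensions; the parameter values and the assumed shapes of the means and covariance matrices are illustrative:

    import numpy as np
    from bhad.utils import mvt2mixture

    thetas = {
        "mean1": np.zeros(2),   "mean2": np.full(2, 3.0),
        "Sigma1": np.eye(2),    "Sigma2": 0.5 * np.eye(2),
        "nu1": 5,               "nu2": 5,
    }
    mix = mvt2mixture(thetas=thetas, seed=42, gaussian=False)
    sample = mix.draw(n_samples=500, k=2, p=0.5)   # tuple of draws, per the signature above
    mix.show2D()                                   # scatter plot of the first two dimensions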

class bhad.utils.onehot_encoder(exclude_columns: List[str] = [], prefix_sep: str = '_', oos_token: str = 'OTHERS', verbose: bool = True, **kwargs)

Bases: TransformerMixin, BaseEstimator

fit(X: DataFrame) onehot_encoder
get_feature_names_out(input_features: list = None) array

Get feature names as used in the one-hot encoder, i.e. after binning/discretizing.

Returns:

[numpy array]: feature names as used in discretizer, e.g. intervals

transform(X: DataFrame) csr_matrix

Map X values to respective bins and encode as one-hot

Args:

X (pd.DataFrame): Discretized/Binned input dataframe

Returns:

csr_matrix: Dummy/One-hot matrix
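
A minimal sketch of the encoder interface above, applied to an already binned/categorical DataFrame; the data are illustrative:

    import pandas as pd
    from bhad.utils import onehot_encoder

    df_binned = pd.DataFrame({
        "amount_bin": ["(0, 10]", "(10, 20]", "(10, 20]", "(20, 30]"],
        "channel":    ["web", "app", "web", "web"],
    })

    enc = onehot_encoder(verbose=False)
    X_sparse = enc.fit(df_binned).transform(df_binned)   # sparse CSR dummy matrix
    feature_names = enc.get_feature_names_out()          # e.g. interval-valued dummy names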

bhad.utils.paste(*lists: List[Any], sep: str = ' ', collapse: str | None = None) List[str] | str

Concatenates elements from multiple lists element-wise into strings, similar to R’s paste function.

Args:

lists: One or more lists of elements to be concatenated element-wise.
sep (str, optional): Separator to use between elements from each list. Defaults to " ".
collapse (str, optional): If provided, concatenates all resulting strings into a single string using this separator. Defaults to None.

Returns:

List[str] or str: A list of concatenated strings if collapse is None, otherwise a single concatenated string.

Examples:
>>> paste(['a', 'b'], [1, 2])
['a 1', 'b 2']
>>> paste(['a', 'b'], [1, 2], sep='-')
['a-1', 'b-2']
>>> paste(['a', 'b'], [1, 2], collapse=', ')
'a 1, b 2'
bhad.utils.rbartsim(MCsim: int = 10000, seed: int = None, verbose: bool = True) array

Sample from Bart Simpson density via Accept-Reject algorithm with a normal proposal distribution.

Parameters:

MCsim (int): Number of Monte Carlo simulations (random draws) to perform. Default is 10**4.
seed (int, optional): Random seed for reproducibility. Default is None.
verbose (bool): If True, prints the acceptance rate. Default is True.

Returns:

np.array: Array of accepted random draws, i.e. samples from the Bart Simpson density.

Notes:
  • The function maximizes the acceptance ratio function accratio to determine the scaling constant.

  • Uses a normal distribution (mean=0, std=1) as the proposal distribution.

  • The acceptance ratio function accratio must be defined elsewhere in the code.
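
A usage sketch that draws from the claw density and compares the sample histogram with the analytic density; the plotting code is illustrative:

    import numpy as np
    import matplotlib.pyplot as plt
    from bhad.utils import rbartsim, bart_simpson_density

    draws = rbartsim(MCsim=10_000, seed=7, verbose=True)

    grid = np.linspace(-3, 3, 400)
    plt.hist(draws, bins=60, density=True, alpha=0.5, label="accept-reject draws")
    plt.plot(grid, bart_simpson_density(grid), label="claw density")
    plt.legend()
    plt.show()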

bhad.utils.timer(func)

Print the runtime of the decorated function
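
A usage sketch of the decorator; the decorated function is illustrative:

    from bhad.utils import timer

    @timer
    def slow_sum(n: int) -> int:
        return sum(range(n))

    slow_sum(10_000_000)   # prints the runtime of the call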

Module contents