BHAD package
Submodules
bhad.explainer module
- class bhad.explainer.Explainer(bhad_obj: Type[BHAD], discretize_obj: Type[Discretize], verbose: bool = True)
Bases: object
- get_explanation(thresholds: List[float] = None, nof_feat_expl: int = 5, append: bool = True) → DataFrame
Find the most infrequent feature realizations based on the BHAD output. Motivation: the BHAD anomaly score is simply the unweighted average of the log probabilities per feature level/bin (categorical and discretized numerical). The levels that make an observation an outlier are therefore those with (relatively) infrequent counts.
Parameters
- nof_feat_expl: maximum number of features to report (e.g. the 5 most infrequent features per observation)
- thresholds: list of threshold values per feature in [0, 1], referring to relative frequencies
Returns:
df_original: the original dataset plus an additional column containing the explanations
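A minimal end-to-end sketch (the toy data, the column split, and the use of an explicit Discretize step with discretize=False are illustrative assumptions, not prescribed by the API):

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from bhad.model import BHAD
from bhad.utils import Discretize
from bhad.explainer import Explainer

# Toy data: one numeric and one categorical feature (values are made up).
df = pd.DataFrame({
    "amount": [10.0, 12.0, 11.5, 9.8, 500.0, 10.7, 11.1, 9.9],
    "channel": ["web", "web", "app", "web", "fax", "app", "web", "web"],
})
num_cols, cat_cols = ["amount"], ["channel"]

# Discretize numeric features first, then fit the detector on the binned data.
pipe = Pipeline(steps=[
    ("discrete", Discretize(columns=num_cols, nbins=None)),   # nbins=None -> MAP estimate
    ("model", BHAD(contamination=0.125, num_features=num_cols,
                   cat_features=cat_cols, discretize=False)),
])
y_pred = pipe.fit_predict(df)       # -1 = outlier, 1 = inlier

# Wire the fitted model/discretizer pair into the explainer.
expl = Explainer(pipe.named_steps["model"], pipe.named_steps["discrete"])
df_expl = expl.get_explanation(nof_feat_expl=2, append=True)
```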
bhad.model module
- class bhad.model.BHAD(contamination: float = 0.01, alpha: float = 0.5, exclude_col: List[str] = [], num_features: List[str] = [], cat_features: List[str] = [], append_score: bool = False, verbose: bool = True, nbins: int | None = None, discretize: bool = True, lower: float | None = None, k: int = 1, round_intervals: int = 5, eps: float = 0.001, make_labels: bool = False, prior_gamma: float = 0.9, prior_max_M: int | None = None)
Bases: BaseEstimator, OutlierMixin
- decision_function(X: DataFrame) → array
Outlier score centered around the threshold value. Outliers are scored negatively (<= 0) and inliers are scored positively (> 0).
Parameters
- X : pandas.DataFrame, shape (n_samples, n_features)
The input samples. X values should be of type str, or easily castable to str (e.g. categorical). If discretize=True, continuous features will be automatically discretized using the fitted discretizer.
Returns
- scores : numpy.array, shape (n_samples,)
The outlier scores of the input samples, centered around the threshold value.
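A minimal sketch of the sign convention on toy data (the data and parameter choices are assumptions):

```python
import pandas as pd
from bhad.model import BHAD

# Toy categorical data: the single ('z', 'v') row has rare levels.
X = pd.DataFrame({"f": ["a"] * 9 + ["z"], "g": ["u"] * 8 + ["v"] * 2})
model = BHAD(contamination=0.1, cat_features=["f", "g"], discretize=False).fit(X)

scores = model.decision_function(X)
print(scores)    # <= 0 flags outliers, > 0 inliers
```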
- fit(X: DataFrame, y: array | Series = None) → BHAD
Fit the BHAD model and calculate the outlier threshold value.
Parameters
- X : pandas.DataFrame, shape (n_samples, n_features)
The input samples. X values should be of type str, or easily castable to str (e.g. categorical). If discretize=True (default), continuous features will be automatically discretized.
- y : Ignored
Not used, present for API consistency by convention.
Returns
self : BHAD object
- predict(X: DataFrame) → array
Predict labels for X: -1 for outliers and 1 for inliers.
Parameters
- X : pandas.DataFrame, shape (n_samples, n_features)
The input samples. X values should be of type str, or easily castable to str (e.g. categorical). If discretize=True, continuous features will be automatically discretized using the fitted discretizer.
Returns
- labels : array, shape (n_samples,)
The outlier labels of the input samples. -1 means an outlier, 1 means an inlier.
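Continuing the previous sketch, the labels agree with the decision_function sign convention:

```python
import numpy as np

labels = model.predict(X)   # model and X from the decision_function sketch above
assert np.all((labels == -1) == (model.decision_function(X) <= 0))
```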
- score_samples(X: DataFrame) → DataFrame
Outlier score calculated by summing the counts of each feature level in the dataset.
Parameters
- X : pandas.DataFrame, shape (n_samples, n_features)
The input samples. X values should be of type str, or easily castable to str (e.g. categorical). If discretize=True, continuous features will be automatically discretized using the fitted discretizer.
Returns
- scores : numpy.array, shape (n_samples,)
The outlier scores of the input samples, centered around the threshold value.
bhad.utils module
- class bhad.utils.Discretize(columns: List[str] = [], nbins: int = None, lower: float = None, k: int = 1, round_intervals: int = 5, eps: float = 0.001, make_labels: bool = False, verbose: bool = True, prior_gamma: float = 0.9, prior_max_M: int = None, **kwargs)
Bases: BaseEstimator, TransformerMixin
Discretize continuous features by binning. Compute the posterior of the number of bins and its MAP estimate. Used as input for the Bayesian histogram anomaly detector (BHAD).
Input:
- columns: list of feature names
- nbins: number of bins to discretize numeric features into; if None, MAP estimates will be computed per feature
- lower: optional lower value for the first bin, very often 0 (e.g. for amounts)
- k: number of standard deviations to be used for the intervals, see k*np.std(v)
- round_intervals: number of digits to round the interval endpoints to
- eps: minimum variance of a numeric feature (check for 'zero-variance features')
- make_labels: assign integer labels to bins instead of technical intervals
- fit(X: DataFrame) → Discretize
- log_post_pmf_nof_bins(feature_values: array) → Dict[int, float]
Evaluate the log posterior probability of the number of bins over a grid of supported values; see the 'Posteriors' section of the paper.
- Args:
feature_values (np.array): univariate variable values
- Returns:
Dict[int, float]: grid of candidate bin counts with their log-pmf values
- transform(X: DataFrame) → DataFrame
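A short sketch of typical usage (the synthetic data is an assumption):

```python
import numpy as np
import pandas as pd
from bhad.utils import Discretize

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(loc=5.0, scale=2.0, size=300)})

disc = Discretize(columns=["x"], nbins=None, verbose=False)  # nbins=None -> MAP estimate
df_binned = disc.fit(df).transform(df)      # 'x' replaced by its bins
print(df_binned["x"].head())

# Inspect the log posterior over the number of bins for a single feature:
log_pmf = disc.log_post_pmf_nof_bins(df["x"].values)
best_M = max(log_pmf, key=log_pmf.get)      # MAP number of bins
print(best_M)
```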
- bhad.utils.accratio(x: array) → array
Acceptance ratio for the accept-reject sampler with a Gaussian proposal density (see rbartsim below).
x : grid of values on the support of x, e.g. np.linspace(1e-8, 1-1e-8, 100)
- bhad.utils.bart_simpson_density(x: array, m: int = 4) → array
Calculate the density of the Bart Simpson distribution, a.k.a. 'The Claw' (see Larry Wasserman, All of Nonparametric Statistics, section 6).
- Args:
x (np.array): discrete grid over the support of the distribution
m (int, optional): number of mixture components. Defaults to 4.
- Returns:
np.array: density values
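For example, evaluating the density on a grid (the grid limits are an arbitrary choice):

```python
import numpy as np
from bhad.utils import bart_simpson_density

x = np.linspace(-3.0, 3.0, 601)        # grid over the support
dens = bart_simpson_density(x, m=4)    # m = number of mixture components
print(dens.shape, float(dens.max()))
```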
- bhad.utils.exp_normalize(x: array) → array
Exp-normalize trick https://timvieira.github.io/blog/post/2014/02/11/exp-normalize-trick/
- Args:
x (np.array): sample data points
- Returns:
np.array: Normalized input array
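A small sketch, assuming the function implements the linked trick (shift by the maximum before exponentiating, so large inputs do not overflow):

```python
import numpy as np
from bhad.utils import exp_normalize

x = np.array([1000.0, 1001.0, 1002.0])   # naive np.exp(x) overflows to inf
p = exp_normalize(x)
print(p, p.sum())                        # a stable, normalized weight vector
```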
- bhad.utils.freedman_diaconis(data: array, return_width: bool = False) → int
Use the Freedman-Diaconis rule to compute the optimal histogram bin width or number of bins.
Parameters
- data: np.ndarray
One-dimensional array.
- return_width: bool
If True, return the optimal bin width instead of the number of bins.
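A quick sketch on synthetic data:

```python
import numpy as np
from bhad.utils import freedman_diaconis

data = np.random.default_rng(42).normal(size=1000)
print(freedman_diaconis(data))                      # optimal number of bins
print(freedman_diaconis(data, return_width=True))   # optimal bin width instead
```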
- bhad.utils.geometric_prior(M: int, gamma: float = 0.7, max_M: int = 100, log: bool = False) → float
Geometric (power series) prior
- Args:
M (int): number of bins
gamma (float, optional): prior hyperparameter. Defaults to 0.7.
max_M (int, optional): maximum value of M. Defaults to 100.
log (bool, optional): return the log prior. Defaults to False.
- Returns:
float: (log) prior density
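For example (parameter values are arbitrary):

```python
from bhad.utils import geometric_prior

# The prior decays geometrically in M: larger bin counts get less mass.
for M in (1, 5, 20):
    print(M, geometric_prior(M, gamma=0.9, max_M=40))
```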
- bhad.utils.jitter(M: int, noise_scale: float = 100000.0, seed: int = None) → array
Generate jitter that can be added to any float; e.g. helps when used with pd.qcut to produce unique bin edges.
M: number of random draws, i.e. the output size
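A sketch of the pd.qcut use case mentioned above (the toy values are an assumption):

```python
import numpy as np
import pandas as pd
from bhad.utils import jitter

v = np.array([1.0, 1.0, 1.0, 2.0, 2.0, 3.0])       # heavy ties break pd.qcut
bins = pd.qcut(v + jitter(len(v), seed=42), q=3)   # jitter makes the edges unique
print(bins.categories)
```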
- bhad.utils.log_marglike_nbins(M: int, y: array) → float
Log-posterior of number of bins M using conjugate Jeffreys’ prior for the bin probabilities and a flat improper prior for the number of bins. This is therefore equivalent to the marginal log-likelihood of the number of bins.
- Args:
M (int): number-of-bins parameter
y (np.array): univariate sample data points
- Returns:
float: log posterior probability value
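A sketch of a grid search over M on synthetic data:

```python
import numpy as np
from bhad.utils import log_marglike_nbins

y = np.random.default_rng(1).normal(size=500)

# Score a grid of candidate bin counts; under the flat prior the argmax
# of the marginal log-likelihood is also the MAP number of bins.
grid = list(range(2, 31))
logp = [log_marglike_nbins(M, y) for M in grid]
print(grid[int(np.argmax(logp))])
```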
- class bhad.utils.mvt2mixture(thetas: dict = {'Sigma1': None, 'Sigma2': None, 'mean1': None, 'mean2': None, 'nu1': None, 'nu2': None}, seed: int = None, gaussian: bool = False, **figure_param)
Bases: object
- draw(n_samples: int = 100, k: int = 2, p: float = 0.5) → Tuple
Random number generator.
Input:
- n_samples: number of realizations to generate
- k: number of features (dimension of the t-distribution)
- p: success probability of the Bernoulli(p) p.m.f. (mixture weight)
- show2D(save_plot: bool = False, legend_on: bool = True, **kwargs)
Make a scatter plot of the first two dimensions of the random draws.
- show3D(save_plot: bool = False, legend_on: bool = True, **kwargs)
Make a scatter plot of the first three dimensions of the random draws.
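A minimal sketch (the parameter values are assumptions, and the exact contents of the tuple returned by draw are not documented here):

```python
import numpy as np
from bhad.utils import mvt2mixture

thetas = {
    "mean1": np.zeros(2), "mean2": 4.0 * np.ones(2),
    "Sigma1": np.eye(2),  "Sigma2": 0.5 * np.eye(2),
    "nu1": 4, "nu2": 6,
}
mix = mvt2mixture(thetas=thetas, seed=7)
sample = mix.draw(n_samples=500, k=2, p=0.5)  # tuple; contents (e.g. labels, draws) assumed
mix.show2D()                                  # scatter plot of the first two dimensions
```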
- class bhad.utils.onehot_encoder(exclude_columns: List[str] = [], prefix_sep: str = '_', oos_token: str = 'OTHERS', verbose: bool = True, **kwargs)
Bases: TransformerMixin, BaseEstimator
- fit(X: DataFrame) → onehot_encoder
- get_feature_names_out(input_features: list = None) → array
Get the feature names as used in the one-hot encoder, i.e. after binning/discretizing.
- Returns:
np.array: feature names as used in the discretizer, e.g. intervals
- transform(X: DataFrame) → csr_matrix
Map X values to their respective bins and encode them as one-hot vectors.
- Args:
X (pd.DataFrame): Discretized/Binned input dataframe
- Returns:
csr_matrix: Dummy/One-hot matrix
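A short sketch of typical usage (toy data; the exact feature-name format shown in the comment is illustrative):

```python
import pandas as pd
from bhad.utils import onehot_encoder

X = pd.DataFrame({"color": ["red", "blue", "red"], "size": ["S", "M", "S"]})

enc = onehot_encoder(prefix_sep="_", verbose=False)
X_oh = enc.fit(X).transform(X)        # scipy.sparse CSR matrix
print(enc.get_feature_names_out())    # e.g. ['color_blue', 'color_red', ...]
print(X_oh.toarray())
```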
- bhad.utils.paste(*lists: List[Any], sep: str = ' ', collapse: str | None = None) → List[str] | str
Concatenates elements from multiple lists element-wise into strings, similar to R’s paste function.
- Args:
lists: one or more lists of elements to be concatenated element-wise
sep (str, optional): separator to use between elements from each list. Defaults to ' '.
collapse (str, optional): if provided, concatenates all resulting strings into a single string using this separator. Defaults to None.
- Returns:
List[str] or str: A list of concatenated strings if collapse is None, otherwise a single concatenated string.
- Examples:
>>> paste(['a', 'b'], [1, 2])
['a 1', 'b 2']
>>> paste(['a', 'b'], [1, 2], sep='-')
['a-1', 'b-2']
>>> paste(['a', 'b'], [1, 2], collapse=', ')
'a 1, b 2'
- bhad.utils.rbartsim(MCsim: int = 10000, seed: int = None, verbose: bool = True) → array
Sample from Bart Simpson density via Accept-Reject algorithm with a normal proposal distribution.
- Parameters:
MCsim (int): number of Monte Carlo simulations (random draws) to perform. Default is 10**4.
seed (int, optional): random seed for reproducibility. Default is None.
verbose (bool): if True, prints the acceptance rate. Default is True.
- Returns:
np.array: accepted draws, i.e. samples from the Bart Simpson density.
- Notes:
The function maximizes the acceptance ratio function accratio to determine the scaling constant.
Uses a normal distribution (mean=0, std=1) as the proposal distribution.
The acceptance ratio function accratio is defined in this module (see above).
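For example:

```python
from bhad.utils import rbartsim

draws = rbartsim(MCsim=5000, seed=123, verbose=True)  # prints the acceptance rate
print(draws.shape, float(draws.mean()))
```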
- bhad.utils.timer(func)
Print the runtime of the decorated function
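A minimal usage sketch (the decorated function is hypothetical):

```python
import time
from bhad.utils import timer

@timer
def slow_add(a: int, b: int) -> int:
    time.sleep(0.2)
    return a + b

slow_add(1, 2)   # prints the measured runtime, then returns 3
```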