Skip to content

BFDA Utilities

Bayes Factor Design Analysis for sample-size planning — simulation engine, power curves, and plotting utilities.

utils

simulate_nonpaired_scores(N=200, theta_A=0.75, theta_B=0.6, seed=0, rng=None)

Simulate independent binary outcomes for a non-paired A/B test.

Each group is sampled independently from a Bernoulli distribution with the specified success probability.

Parameters:

Name Type Description Default
N int

Number of observations per group.

200
theta_A float

True success probability for model A.

0.75
theta_B float

True success probability for model B.

0.6
seed int

Random seed for reproducibility.

0
rng Generator | None

Optional pre-seeded RNG; if provided, seed is ignored.

None

Returns:

Type Description
NonPairedSimResult

class:NonPairedSimResult with fields y_A, y_B,

NonPairedSimResult

theta_A, theta_B, and true_params.

Source code in bayesprop/utils/utils.py
def simulate_nonpaired_scores(
    N: int = 200,
    theta_A: float = 0.75,
    theta_B: float = 0.60,
    seed: int = 0,
    rng: np.random.Generator | None = None,
) -> NonPairedSimResult:
    """Simulate independent binary outcomes for a non-paired A/B test.

    Each group is sampled independently from a Bernoulli distribution
    with the specified success probability.

    Args:
        N: Number of observations per group.
        theta_A: True success probability for model A.
        theta_B: True success probability for model B.
        seed: Random seed for reproducibility.
        rng: Optional pre-seeded RNG; if provided, *seed* is ignored.

    Returns:
        :class:`NonPairedSimResult` with fields ``y_A``, ``y_B``,
        ``theta_A``, ``theta_B``, and ``true_params``.
    """
    if rng is None:
        rng = np.random.default_rng(seed)
    y_A = rng.binomial(1, theta_A, size=N).astype(float)
    y_B = rng.binomial(1, theta_B, size=N).astype(float)
    return NonPairedSimResult(
        y_A=y_A,
        y_B=y_B,
        theta_A=theta_A,
        theta_B=theta_B,
        true_params=NonPairedTrueParams(N=N, theta_A=theta_A, theta_B=theta_B),
    )

simulate_paired_scores(N=200, mu=0.0, delta_A=0.5, delta_B=0.0, sigma_theta=0.0, seed=0, rng=None)

Simulate paired binary outcomes from a logistic DGP.

Matches the paired model: y_A ~ Bern(σ(μ + δ_A)), y_B ~ Bern(σ(μ)).

When sigma_theta > 0 each item i additionally receives a random effect ε_i ~ N(0, sigma_theta) so that θ_i = μ + ε_i (useful for more realistic BFDA simulations).

Parameters:

Name Type Description Default
N int

Number of paired observations.

200
mu float

Shared logit-scale intercept.

0.0
delta_A float

Logit-scale treatment effect for model A.

0.5
delta_B float

Logit-scale offset for model B (0 by default).

0.0
sigma_theta float

SD of optional per-item random effects (0 = fixed effects matching the model).

0.0
seed int

Random seed for reproducibility.

0
rng Generator | None

Optional pre-seeded RNG; if provided, seed is ignored.

None

Returns:

Type Description
PairedSimResult

class:PairedSimResult with fields y_A, y_B,

PairedSimResult

p_A_true, p_B_true, theta_true, and true_params.

Source code in bayesprop/utils/utils.py
def simulate_paired_scores(
    N: int = 200,
    mu: float = 0.0,
    delta_A: float = 0.5,
    delta_B: float = 0.0,
    sigma_theta: float = 0.0,
    seed: int = 0,
    rng: np.random.Generator | None = None,
) -> PairedSimResult:
    """Simulate paired binary outcomes from a logistic DGP.

    Matches the paired model: ``y_A ~ Bern(σ(μ + δ_A))``,
    ``y_B ~ Bern(σ(μ))``.

    When ``sigma_theta > 0`` each item *i* additionally receives a
    random effect ``ε_i ~ N(0, sigma_theta)`` so that
    ``θ_i = μ + ε_i`` (useful for more realistic BFDA simulations).

    Args:
        N: Number of paired observations.
        mu: Shared logit-scale intercept.
        delta_A: Logit-scale treatment effect for model A.
        delta_B: Logit-scale offset for model B (0 by default).
        sigma_theta: SD of optional per-item random effects
            (``0`` = fixed effects matching the model).
        seed: Random seed for reproducibility.
        rng: Optional pre-seeded RNG; if provided, *seed* is ignored.

    Returns:
        :class:`PairedSimResult` with fields ``y_A``, ``y_B``,
        ``p_A_true``, ``p_B_true``, ``theta_true``, and ``true_params``.
    """
    if rng is None:
        rng = np.random.default_rng(seed)
    theta_true = mu + rng.normal(0.0, sigma_theta, size=N)
    p_A = _sigmoid(theta_true + delta_A)
    p_B = _sigmoid(theta_true + delta_B)
    y_A = rng.binomial(1, p_A)
    y_B = rng.binomial(1, p_B)
    return PairedSimResult(
        y_A=y_A,
        y_B=y_B,
        p_A_true=p_A,
        p_B_true=p_B,
        theta_true=theta_true,
        true_params=PairedTrueParams(N=N, mu=mu, sigma_theta=sigma_theta, delta_A=delta_A, delta_B=delta_B),
    )

bfda_simulate(data_generator, decision_fn, sample_sizes, n_sim=500, seed=42)

Generic BFDA engine -- works with any data-generating process and decision rule.

Parameters:

Name Type Description Default
data_generator Callable[[Generator, int], tuple[ndarray, ndarray]]

Callable(rng, n) -> (y_A, y_B). Generates one simulated dataset of size n per group using the provided RNG.

required
decision_fn Callable[[ndarray, ndarray], bool]

Callable(y_A, y_B) -> bool. Returns True when the simulated dataset produces a "decisive" result (i.e. rejects H0).

required
sample_sizes list[int]

List of per-group sample sizes to evaluate.

required
n_sim int

Number of simulated datasets per sample size.

500
seed int

Random seed for reproducibility.

42

Returns:

Type Description
dict[int, float]

Dictionary mapping sample size -> P(decisive outcome).

Source code in bayesprop/utils/utils.py
def bfda_simulate(
    data_generator: Callable[[np.random.Generator, int], tuple[np.ndarray, np.ndarray]],
    decision_fn: Callable[[np.ndarray, np.ndarray], bool],
    sample_sizes: list[int],
    n_sim: int = 500,
    seed: int = 42,
) -> dict[int, float]:
    """Generic BFDA engine -- works with any data-generating process and decision rule.

    Args:
        data_generator: Callable(rng, n) -> (y_A, y_B). Generates one simulated
            dataset of size *n* per group using the provided RNG.
        decision_fn: Callable(y_A, y_B) -> bool. Returns ``True`` when the
            simulated dataset produces a "decisive" result (i.e. rejects H0).
        sample_sizes: List of per-group sample sizes to evaluate.
        n_sim: Number of simulated datasets per sample size.
        seed: Random seed for reproducibility.

    Returns:
        Dictionary mapping sample size -> P(decisive outcome).
    """
    rng = np.random.default_rng(seed)

    power: dict[int, float] = {}
    for n in sample_sizes:
        decisive_count = sum(decision_fn(*data_generator(rng, n)) for _ in range(n_sim))
        power[n] = decisive_count / n_sim

    return power

bf10_to_ph0(bf_10, prior_H0=0.5)

Convert BF_10 to posterior probability of H0.

Parameters:

Name Type Description Default
bf_10 float

Bayes factor in favour of H1.

required
prior_H0 float

Prior probability of H0.

0.5

Returns:

Type Description
float

P(H0 | data).

Source code in bayesprop/utils/utils.py
def bf10_to_ph0(bf_10: float, prior_H0: float = 0.5) -> float:
    """Convert BF_10 to posterior probability of H0.

    Args:
        bf_10: Bayes factor in favour of H1.
        prior_H0: Prior probability of H0.

    Returns:
        P(H0 | data).
    """
    bf_01 = 1.0 / max(bf_10, 1e-300)
    return (bf_01 * prior_H0) / (bf_01 * prior_H0 + (1 - prior_H0))

bfda_power_curve(theta_A_true, theta_B_true, sample_sizes, design='nonpaired', decision_rule='bayes_factor', *, bf_threshold=3.0, ph0_threshold=0.05, prior_H0=0.5, rope=(-0.02, 0.02), ci_mass=0.95, n_sim=500, seed=42, alpha0=1.0, beta0=1.0, sigma_theta=2.0, prior_sigma_delta=1.0, prior_sigma_mu=2.0, n_iter=1000, burn_in=300, n_chains=2)

Unified Bayes Factor Design Analysis for any design × decision-rule.

Simulates datasets under a known effect and estimates the probability that a given Bayesian decision rule will reject H0 as a function of sample size (i.e. Bayesian "power").

Supported combinations:

+----------------+----------------+--------------------+--------+ | design | decision_rule | key threshold | fast? | +================+================+====================+========+ | nonpaired | bayes_factor | bf_threshold | yes | | nonpaired | posterior_null | ph0_threshold | yes | | nonpaired | rope | rope | medium | | paired | bayes_factor | bf_threshold | slow | | paired | posterior_null | ph0_threshold | slow | | paired | rope | rope | slow | +----------------+----------------+--------------------+--------+

Parameters:

Name Type Description Default
theta_A_true float

Assumed true success rate for model A.

required
theta_B_true float

Assumed true success rate for model B.

required
sample_sizes list[int]

List of per-group sample sizes to evaluate.

required
design str

"nonpaired" or "paired".

'nonpaired'
decision_rule DecisionRuleType

"bayes_factor", "posterior_null", or "rope".

'bayes_factor'
bf_threshold float

BF_10 threshold for decisive evidence (bayes_factor).

3.0
ph0_threshold float

Reject H0 when P(H0|data) < this (posterior_null).

0.05
prior_H0 float

Prior probability of H0 (posterior_null).

0.5
rope tuple[float, float]

(lower, upper) bounds of the ROPE (rope).

(-0.02, 0.02)
ci_mass float

Credible interval mass for ROPE analysis (rope).

0.95
n_sim int

Number of simulated datasets per sample size.

500
seed int

Random seed for reproducibility.

42
alpha0 float

Prior Beta alpha parameter (non-paired only).

1.0
beta0 float

Prior Beta beta parameter (non-paired only).

1.0
sigma_theta float

SD of the shared latent item effect (paired DGP).

2.0
prior_sigma_delta float

SD of N(0, σ) prior on delta_A (paired only).

1.0
prior_sigma_mu float

SD of N(0, σ) prior on mu (paired only).

2.0
n_iter int

Total Gibbs iterations per chain (paired only).

1000
burn_in int

Warm-up iterations per chain (paired only).

300
n_chains int

Number of MCMC chains per dataset (paired only).

2

Returns:

Type Description
dict[int, float]

Dictionary mapping sample size -> P(decisive outcome).

Source code in bayesprop/utils/utils.py
def bfda_power_curve(
    theta_A_true: float,
    theta_B_true: float,
    sample_sizes: list[int],
    design: str = "nonpaired",
    decision_rule: DecisionRuleType = "bayes_factor",
    *,
    # BF thresholds
    bf_threshold: float = 3.0,
    # P(H0) thresholds
    ph0_threshold: float = 0.05,
    prior_H0: float = 0.5,
    # ROPE thresholds
    rope: tuple[float, float] = (-0.02, 0.02),
    ci_mass: float = 0.95,
    # Simulation
    n_sim: int = 500,
    seed: int = 42,
    # Non-paired model priors
    alpha0: float = 1.0,
    beta0: float = 1.0,
    # Paired DGP
    sigma_theta: float = 2.0,
    # Paired model priors / MCMC
    prior_sigma_delta: float = 1.0,
    prior_sigma_mu: float = 2.0,
    n_iter: int = 1000,
    burn_in: int = 300,
    n_chains: int = 2,
) -> dict[int, float]:
    """Unified Bayes Factor Design Analysis for any design × decision-rule.

    Simulates datasets under a known effect and estimates the probability
    that a given Bayesian decision rule will reject H0 as a function of
    sample size (i.e. Bayesian "power").

    Supported combinations:

    +----------------+----------------+--------------------+--------+
    | design         | decision_rule  | key threshold      | fast?  |
    +================+================+====================+========+
    | ``nonpaired``  | bayes_factor   | ``bf_threshold``   | yes    |
    | ``nonpaired``  | posterior_null | ``ph0_threshold``  | yes    |
    | ``nonpaired``  | rope           | ``rope``           | medium |
    | ``paired``     | bayes_factor   | ``bf_threshold``   | slow   |
    | ``paired``     | posterior_null | ``ph0_threshold``  | slow   |
    | ``paired``     | rope           | ``rope``           | slow   |
    +----------------+----------------+--------------------+--------+

    Args:
        theta_A_true: Assumed true success rate for model A.
        theta_B_true: Assumed true success rate for model B.
        sample_sizes: List of per-group sample sizes to evaluate.
        design: ``"nonpaired"`` or ``"paired"``.
        decision_rule: ``"bayes_factor"``, ``"posterior_null"``, or ``"rope"``.
        bf_threshold: BF_10 threshold for decisive evidence (``bayes_factor``).
        ph0_threshold: Reject H0 when P(H0|data) < this (``posterior_null``).
        prior_H0: Prior probability of H0 (``posterior_null``).
        rope: (lower, upper) bounds of the ROPE (``rope``).
        ci_mass: Credible interval mass for ROPE analysis (``rope``).
        n_sim: Number of simulated datasets per sample size.
        seed: Random seed for reproducibility.
        alpha0: Prior Beta alpha parameter (non-paired only).
        beta0: Prior Beta beta parameter (non-paired only).
        sigma_theta: SD of the shared latent item effect (paired DGP).
        prior_sigma_delta: SD of N(0, σ) prior on delta_A (paired only).
        prior_sigma_mu: SD of N(0, σ) prior on mu (paired only).
        n_iter: Total Gibbs iterations per chain (paired only).
        burn_in: Warm-up iterations per chain (paired only).
        n_chains: Number of MCMC chains per dataset (paired only).

    Returns:
        Dictionary mapping sample size -> P(decisive outcome).
    """
    # Build data generator
    if design == "nonpaired":
        data_gen = _make_nonpaired_generator(theta_A_true, theta_B_true)
    elif design == "paired":
        data_gen = _make_paired_generator(theta_A_true, theta_B_true, sigma_theta=sigma_theta)
    else:
        raise ValueError(f"Unknown design: {design!r}. Use 'nonpaired' or 'paired'.")

    # Build decision function
    decide = _make_decision_fn(
        design=design,
        decision_rule=decision_rule,
        bf_threshold=bf_threshold,
        ph0_threshold=ph0_threshold,
        prior_H0=prior_H0,
        rope=rope,
        ci_mass=ci_mass,
        alpha0=alpha0,
        beta0=beta0,
        prior_sigma_delta=prior_sigma_delta,
        prior_sigma_mu=prior_sigma_mu,
        n_iter=n_iter,
        burn_in=burn_in,
        n_chains=n_chains,
        seed=seed,
    )

    return bfda_simulate(
        data_generator=data_gen,
        decision_fn=decide,
        sample_sizes=sample_sizes,
        n_sim=n_sim,
        seed=seed,
    )

find_n_for_power(power_curve, target_power=0.8)

Interpolate the sample size needed to achieve a target power level.

Parameters:

Name Type Description Default
power_curve dict[int, float]

Dictionary mapping sample size -> power (from BFDA).

required
target_power float

Desired power level (default 0.80).

0.8

Returns:

Type Description
float | None

Interpolated sample size, or None if target is outside the range.

Source code in bayesprop/utils/utils.py
def find_n_for_power(
    power_curve: dict[int, float],
    target_power: float = 0.80,
) -> float | None:
    """Interpolate the sample size needed to achieve a target power level.

    Args:
        power_curve: Dictionary mapping sample size -> power (from BFDA).
        target_power: Desired power level (default 0.80).

    Returns:
        Interpolated sample size, or ``None`` if target is outside the range.
    """
    ns = list(power_curve.keys())
    ps = list(power_curve.values())

    if min(ps) >= target_power:
        return float(ns[0])
    if max(ps) < target_power:
        return None

    f_interp = interp1d(ps, ns, kind="linear")
    return float(f_interp(target_power))

plot_bfda_power(power_curve, theta_A_true, theta_B_true, bf_threshold=3.0, target_power=0.8, title=None, ax=None)

Plot a BFDA power curve with 80%/95% reference lines.

Parameters:

Name Type Description Default
power_curve dict[int, float]

Dictionary mapping sample size -> power.

required
theta_A_true float

Assumed true rate for model A (for title).

required
theta_B_true float

Assumed true rate for model B (for title).

required
bf_threshold float

BF_10 threshold used (for y-axis label).

3.0
target_power float

Power level to highlight via interpolation.

0.8
title str | None

Optional custom title.

None
ax Axes | None

Optional matplotlib Axes to plot on.

None

Returns:

Type Description
Figure

The matplotlib Figure.

Source code in bayesprop/utils/utils.py
def plot_bfda_power(
    power_curve: dict[int, float],
    theta_A_true: float,
    theta_B_true: float,
    bf_threshold: float = 3.0,
    target_power: float = 0.80,
    title: str | None = None,
    ax: plt.Axes | None = None,
) -> plt.Figure:
    """Plot a BFDA power curve with 80%/95% reference lines.

    Args:
        power_curve: Dictionary mapping sample size -> power.
        theta_A_true: Assumed true rate for model A (for title).
        theta_B_true: Assumed true rate for model B (for title).
        bf_threshold: BF_10 threshold used (for y-axis label).
        target_power: Power level to highlight via interpolation.
        title: Optional custom title.
        ax: Optional matplotlib Axes to plot on.

    Returns:
        The matplotlib Figure.
    """
    if ax is None:
        fig, ax = plt.subplots(figsize=(9, 5))
    else:
        fig = ax.get_figure()

    ns = list(power_curve.keys())
    ps = list(power_curve.values())

    ax.plot(ns, ps, "o-", color="#2196F3", linewidth=2, markersize=7)
    ax.axhline(0.80, color="gray", linestyle="--", linewidth=1, alpha=0.7, label="80% power")
    ax.axhline(0.95, color="gray", linestyle=":", linewidth=1, alpha=0.7, label="95% power")

    n_target = find_n_for_power(power_curve, target_power)
    if n_target is not None:
        ax.axvline(
            n_target,
            color="#E91E63",
            linestyle="-.",
            linewidth=1.5,
            alpha=0.7,
            label=f"n ~ {n_target:.0f} for {target_power:.0%} power",
        )

    delta = theta_A_true - theta_B_true
    if title is None:
        title = f"BFDA Power Curve -- Delta = {delta:.3f} (theta_A={theta_A_true:.2f}, theta_B={theta_B_true:.2f})"
    ax.set_xlabel("Sample size per group (n)")
    ax.set_ylabel(f"P(BF_10 > {bf_threshold:.0f})")
    ax.set_title(title, fontsize=12, fontweight="bold")
    ax.set_ylim(-0.02, 1.02)
    ax.legend(fontsize=9)
    ax.grid(alpha=0.3)
    fig.tight_layout()

    return fig

plot_bfda_sensitivity(theta_A_true, theta_B_true, sample_sizes, thresholds=None, n_sim=500, seed=42, design='nonpaired', title=None, ax=None, **kwargs)

Plot BFDA power curves for multiple BF_10 thresholds.

Works for both paired and non-paired designs.

Parameters:

Name Type Description Default
theta_A_true float

Assumed true success rate for model A.

required
theta_B_true float

Assumed true success rate for model B.

required
sample_sizes list[int]

List of per-group sample sizes to evaluate.

required
thresholds list[float] | None

BF_10 thresholds to compare (default: [3, 6, 10]).

None
n_sim int

Number of simulated datasets per sample size.

500
seed int

Random seed for reproducibility.

42
design str

"nonpaired" or "paired".

'nonpaired'
title str | None

Optional custom title.

None
ax Axes | None

Optional matplotlib Axes to plot on.

None
**kwargs Any

Extra arguments forwarded to :func:bfda_power_curve (e.g. alpha0, beta0, sigma_theta, prior_sigma_delta, n_iter, etc.).

{}

Returns:

Type Description
Figure

The matplotlib Figure.

Source code in bayesprop/utils/utils.py
def plot_bfda_sensitivity(
    theta_A_true: float,
    theta_B_true: float,
    sample_sizes: list[int],
    thresholds: list[float] | None = None,
    n_sim: int = 500,
    seed: int = 42,
    design: str = "nonpaired",
    title: str | None = None,
    ax: plt.Axes | None = None,
    **kwargs: Any,
) -> plt.Figure:
    """Plot BFDA power curves for multiple BF_10 thresholds.

    Works for both paired and non-paired designs.

    Args:
        theta_A_true: Assumed true success rate for model A.
        theta_B_true: Assumed true success rate for model B.
        sample_sizes: List of per-group sample sizes to evaluate.
        thresholds: BF_10 thresholds to compare (default: [3, 6, 10]).
        n_sim: Number of simulated datasets per sample size.
        seed: Random seed for reproducibility.
        design: ``"nonpaired"`` or ``"paired"``.
        title: Optional custom title.
        ax: Optional matplotlib Axes to plot on.
        **kwargs: Extra arguments forwarded to :func:`bfda_power_curve`
            (e.g. ``alpha0``, ``beta0``, ``sigma_theta``,
            ``prior_sigma_delta``, ``n_iter``, etc.).

    Returns:
        The matplotlib Figure.
    """
    if thresholds is None:
        thresholds = [3.0, 6.0, 10.0]

    if ax is None:
        fig, ax = plt.subplots(figsize=(9, 5))
    else:
        fig = ax.get_figure()

    colors = ["#4CAF50", "#FF9800", "#E91E63", "#9C27B0", "#2196F3"]

    for thresh, col in zip(thresholds, colors, strict=False):
        curve = bfda_power_curve(
            theta_A_true=theta_A_true,
            theta_B_true=theta_B_true,
            sample_sizes=sample_sizes,
            design=design,
            decision_rule="bayes_factor",
            bf_threshold=thresh,
            n_sim=n_sim,
            seed=seed,
            **kwargs,
        )

        ax.plot(
            list(curve.keys()),
            list(curve.values()),
            "o-",
            color=col,
            linewidth=2,
            markersize=6,
            label=f"BF_10 > {thresh:.0f}",
        )

    ax.axhline(0.80, color="gray", linestyle="--", linewidth=1, alpha=0.6)
    ax.set_xlabel("Sample size per group (n)")
    ax.set_ylabel("P(BF_10 > threshold)")
    if title is None:
        title = f"BFDA Sensitivity -- {design.title()} Design"
    ax.set_title(title, fontsize=12, fontweight="bold")
    ax.set_ylim(-0.02, 1.02)
    ax.legend(fontsize=9)
    ax.grid(alpha=0.3)
    fig.tight_layout()

    return fig