Skip to content

Non-Paired Model

Independent Beta-Bernoulli A/B test with exact Savage-Dickey Bayes factor.

The sequential variant SequentialNonPairedBayesPropTest is documented on the Sequential designs page.

bayes_nonpaired

Non-paired Bayesian A/B test using conjugate Beta-Bernoulli model.

This module provides :class:NonPairedBayesPropTest for comparing two independent groups via binarized pass/fail counts with a Beta-Bernoulli conjugate model, Savage-Dickey Bayes factor on the difference of proportions, posterior predictive checks, and publication-ready plots.

Also provides :func:descriptive_summary for building a combined descriptive statistics + threshold-sweep summary table.

Typical workflow::

from ai_eval.resources.bayes_nonpaired import NonPairedBayesPropTest, descriptive_summary

bb = NonPairedBayesPropTest(threshold=0.7)
result = bb.test(scores_A, scores_B)

df = descriptive_summary(scores_dict, thresholds=[0.5, 0.7, 0.8, 0.9, 0.95])

NonPairedBayesPropTest(alpha0=1.0, beta0=1.0, threshold=0.7, n_quad=100, seed=0, n_samples=20000, verbose=False, decision_rule='all', rope_epsilon=0.02)

Non-paired Bayesian A/B test using conjugate Beta-Bernoulli model.

Workflow
  1. Binarize continuous scores at a threshold.
  2. Update Beta(alpha0, beta0) prior with observed pass/fail counts.
  3. Compute posterior probability of superiority P(theta_B > theta_A) via Gauss-Legendre quadrature.

Initialise the Beta-Bernoulli proportion test.

Parameters:

Name Type Description Default
alpha0 float

Prior alpha parameter for the Beta distribution.

1.0
beta0 float

Prior beta parameter for the Beta distribution.

1.0
threshold float

Binarization threshold for continuous scores.

0.7
n_quad int

Number of Gauss-Legendre quadrature nodes.

100
seed int

Random seed for Monte Carlo sampling.

0
n_samples int

Number of Monte Carlo draws for difference posterior.

20000
verbose bool

If True, print diagnostic messages.

False
decision_rule DecisionRuleType

Default decision framework — one of "bayes_factor", "posterior_null", "rope", or "all".

'all'
rope_epsilon float

Half-width of the ROPE interval (default 0.02 = 2 pp).

0.02
Source code in bayesprop/resources/bayes_nonpaired.py
def __init__(
    self,
    alpha0: float = 1.0,
    beta0: float = 1.0,
    threshold: float = 0.7,
    n_quad: int = 100,
    seed: int = 0,
    n_samples: int = 20_000,
    verbose: bool = False,
    decision_rule: DecisionRuleType = "all",
    rope_epsilon: float = 0.02,
) -> None:
    """Initialise the Beta-Bernoulli proportion test.

    Args:
        alpha0: Prior alpha parameter for the Beta distribution.
        beta0: Prior beta parameter for the Beta distribution.
        threshold: Binarization threshold for continuous scores.
        n_quad: Number of Gauss-Legendre quadrature nodes.
        seed: Random seed for Monte Carlo sampling.
        n_samples: Number of Monte Carlo draws for difference posterior.
        verbose: If True, print diagnostic messages.
        decision_rule: Default decision framework — one of
            ``"bayes_factor"``, ``"posterior_null"``, ``"rope"``, or ``"all"``.
        rope_epsilon: Half-width of the ROPE interval (default 0.02 = 2 pp).
    """
    self.alpha0 = alpha0
    self.beta0 = beta0
    self.threshold = threshold
    self.n_quad = n_quad
    self.seed = seed
    self.n_samples = n_samples
    self.verbose = verbose
    self.decision_rule = decision_rule
    self.rope_epsilon = rope_epsilon

    if self.verbose:
        print(
            f"Initialized NonPairedBayesPropTest with "
            f"alpha0={alpha0}, beta0={beta0}, threshold={threshold}, n_quad={n_quad}"
        )

    # precompute quadrature nodes/weights (reused across calls)
    nodes, weights = np.polynomial.legendre.leggauss(n_quad)
    self._x = 0.5 * (nodes + 1)
    self._w = 0.5 * weights
_binarize(y)

Return y unchanged if already binary, else binarize at self.threshold.

Parameters:

Name Type Description Default
y ndarray

Array of scores. If all values are 0.0 or 1.0 the array is returned as-is; otherwise values ≥ self.threshold become 1.0 and values below become 0.0.

required

Returns:

Type Description
ndarray

Binary array of the same length as y.

Source code in bayesprop/resources/bayes_nonpaired.py
def _binarize(self, y: np.ndarray) -> np.ndarray:
    """Return y unchanged if already binary, else binarize at self.threshold.

    Args:
        y: Array of scores. If all values are 0.0 or 1.0 the array
            is returned as-is; otherwise values ≥ ``self.threshold``
            become 1.0 and values below become 0.0.

    Returns:
        Binary array of the same length as *y*.
    """
    unique = np.unique(y)
    if np.all(np.isin(unique, [0.0, 1.0])):
        return y

    if self.verbose:
        print(
            f"Warning: non-binary scores detected, binarizing at threshold {self.threshold}"
        )
    return (y >= self.threshold).astype(float)
prob_greater(a1, b1, a2, b2)

Posterior probability of superiority P(theta1 > theta2) via Gauss-Legendre quadrature.

Parameters:

Name Type Description Default
a1 float

Alpha parameter of the first Beta distribution.

required
b1 float

Beta parameter of the first Beta distribution.

required
a2 float

Alpha parameter of the second Beta distribution.

required
b2 float

Beta parameter of the second Beta distribution.

required

Returns:

Type Description
float

Posterior probability of superiority, i.e. the probability

float

that a draw from Beta(a1, b1) exceeds a draw from Beta(a2, b2).

Source code in bayesprop/resources/bayes_nonpaired.py
def prob_greater(self, a1: float, b1: float, a2: float, b2: float) -> float:
    """Posterior probability of superiority P(theta1 > theta2) via Gauss-Legendre quadrature.

    Args:
        a1: Alpha parameter of the first Beta distribution.
        b1: Beta parameter of the first Beta distribution.
        a2: Alpha parameter of the second Beta distribution.
        b2: Beta parameter of the second Beta distribution.

    Returns:
        Posterior probability of superiority, i.e. the probability
        that a draw from Beta(a1, b1) exceeds a draw from Beta(a2, b2).
    """
    x, w = self._x, self._w
    log_pdf1 = (a1 - 1) * np.log(x) + (b1 - 1) * np.log(1 - x) - betaln(a1, b1)
    cdf2 = betainc(a2, b2, x)
    return float(np.dot(w, np.exp(log_pdf1) * cdf2))
test(y_a, y_b)

Run the Beta-Bernoulli test. Auto-binarizes if scores are not 0/1.

Parameters:

Name Type Description Default
y_a ArrayLike

Scores for group A (continuous or binary).

required
y_b ArrayLike

Scores for group B (continuous or binary).

required

Returns:

Type Description
NonPairedTestResult

class:NonPairedTestResult with posterior Beta parameters

NonPairedTestResult

and posterior probability of superiority P(theta_B > theta_A).

Source code in bayesprop/resources/bayes_nonpaired.py
def test(self, y_a: npt.ArrayLike, y_b: npt.ArrayLike) -> NonPairedTestResult:
    """Run the Beta-Bernoulli test. Auto-binarizes if scores are not 0/1.

    Args:
        y_a: Scores for group A (continuous or binary).
        y_b: Scores for group B (continuous or binary).

    Returns:
        :class:`NonPairedTestResult` with posterior Beta parameters
        and posterior probability of superiority P(theta_B > theta_A).
    """
    y_a = self._binarize(np.asarray(y_a, dtype=float))
    y_b = self._binarize(np.asarray(y_b, dtype=float))

    N_a, N_b = len(y_a), len(y_b)
    sA, sB = y_a.sum(), y_b.sum()

    a1 = self.alpha0 + sA
    b1 = self.beta0 + (N_a - sA)
    a2 = self.alpha0 + sB
    b2 = self.beta0 + (N_b - sB)

    p = self.prob_greater(a2, b2, a1, b1)  # P(theta_B > theta_A)

    return NonPairedTestResult(
        thetaA_post=BetaParams(alpha=float(a1), beta=float(b1)),
        thetaB_post=BetaParams(alpha=float(a2), beta=float(b2)),
        P_B_greater_A=float(p),
    )
fit(y_a, y_b)

Fit Beta posteriors and sample difference posterior via Monte Carlo.

This extends :meth:test by also drawing from the individual Beta posteriors to build the posterior of Δ = p_A − p_B, which enables Savage-Dickey Bayes factors, posterior predictive checks, and richer summaries.

Parameters:

Name Type Description Default
y_a ArrayLike

Scores for group A (continuous or binary).

required
y_b ArrayLike

Scores for group B (continuous or binary).

required

Returns:

Type Description
NonPairedBayesPropTest

self (for method chaining).

Source code in bayesprop/resources/bayes_nonpaired.py
def fit(self, y_a: npt.ArrayLike, y_b: npt.ArrayLike) -> NonPairedBayesPropTest:
    """Fit Beta posteriors and sample difference posterior via Monte Carlo.

    This extends :meth:`test` by also drawing from the individual
    Beta posteriors to build the posterior of Δ = p_A − p_B, which
    enables Savage-Dickey Bayes factors, posterior predictive checks,
    and richer summaries.

    Args:
        y_a: Scores for group A (continuous or binary).
        y_b: Scores for group B (continuous or binary).

    Returns:
        self (for method chaining).
    """
    self.y_A_obs = self._binarize(np.asarray(y_a, dtype=float))
    self.y_B_obs = self._binarize(np.asarray(y_b, dtype=float))

    N_a, N_b = len(self.y_A_obs), len(self.y_B_obs)
    sA, sB = self.y_A_obs.sum(), self.y_B_obs.sum()

    self.a_A = self.alpha0 + sA
    self.b_A = self.beta0 + (N_a - sA)
    self.a_B = self.alpha0 + sB
    self.b_B = self.beta0 + (N_b - sB)

    # P(B > A) via quadrature (exact)
    self.p_B_greater_A = self.prob_greater(self.a_B, self.b_B, self.a_A, self.b_A)

    # Monte Carlo samples from Beta posteriors
    rng = np.random.default_rng(self.seed)
    self.theta_A_samples = rng.beta(self.a_A, self.b_A, size=self.n_samples)
    self.theta_B_samples = rng.beta(self.a_B, self.b_B, size=self.n_samples)
    self.delta_samples = self.theta_A_samples - self.theta_B_samples

    self.summary = NonPairedSummary(
        mean_delta=float(self.delta_samples.mean()),
        ci_95=CredibleInterval(
            lower=float(np.quantile(self.delta_samples, 0.025)),
            upper=float(np.quantile(self.delta_samples, 0.975)),
        ),
        **{"P(A > B)": float((self.delta_samples > 0).mean())},
        theta_A_mean=float(self.theta_A_samples.mean()),
        theta_B_mean=float(self.theta_B_samples.mean()),
    )

    self.trace_summary = pd.DataFrame(
        {
            "mean": [
                self.theta_A_samples.mean(),
                self.theta_B_samples.mean(),
                self.delta_samples.mean(),
            ],
            "sd": [
                self.theta_A_samples.std(),
                self.theta_B_samples.std(),
                self.delta_samples.std(),
            ],
            "hdi_3%": [
                np.quantile(self.theta_A_samples, 0.03),
                np.quantile(self.theta_B_samples, 0.03),
                np.quantile(self.delta_samples, 0.03),
            ],
            "hdi_97%": [
                np.quantile(self.theta_A_samples, 0.97),
                np.quantile(self.theta_B_samples, 0.97),
                np.quantile(self.delta_samples, 0.97),
            ],
        },
        index=["theta_A", "theta_B", "delta"],
    )

    return self
_check_fitted()

Raise RuntimeError if .fit() has not been called.

Source code in bayesprop/resources/bayes_nonpaired.py
def _check_fitted(self) -> None:
    """Raise RuntimeError if .fit() has not been called."""
    if not hasattr(self, "delta_samples") or self.delta_samples is None:
        raise RuntimeError("Model has not been fitted yet. Call .fit() first.")
savage_dickey_test(null_value=0.0)

Savage-Dickey density-ratio Bayes factor for H0: Δ = null_value.

The prior on Δ = p_A − p_B is induced by the independent Beta(α₀, β₀) priors. Densities are computed via exact log-space convolution (:func:beta_diff_pdf).

Parameters:

Name Type Description Default
null_value float

The point null hypothesis value for Δ.

0.0

Returns:

Type Description
SavageDickeyResult

class:SavageDickeyResult with BF_01, BF_10, densities,

SavageDickeyResult

interpretation, and decision.

Source code in bayesprop/resources/bayes_nonpaired.py
def savage_dickey_test(self, null_value: float = 0.0) -> SavageDickeyResult:
    """Savage-Dickey density-ratio Bayes factor for H0: Δ = null_value.

    The prior on Δ = p_A − p_B is induced by the independent
    Beta(α₀, β₀) priors. Densities are computed via exact
    log-space convolution (:func:`beta_diff_pdf`).

    Args:
        null_value: The point null hypothesis value for Δ.

    Returns:
        :class:`SavageDickeyResult` with BF_01, BF_10, densities,
        interpretation, and decision.
    """
    self._check_fitted()

    # Posterior density at null via analytic convolution
    posterior_at_null = beta_diff_pdf(
        null_value, self.a_A, self.b_A, self.a_B, self.b_B
    )

    # Prior density at null via analytic convolution
    prior_at_null = beta_diff_pdf(
        null_value, self.alpha0, self.beta0, self.alpha0, self.beta0
    )

    BF_01 = posterior_at_null / prior_at_null
    BF_10 = 1.0 / BF_01

    if BF_10 > 100:
        interpretation = "Decisive evidence against H0"
    elif BF_10 > 30:
        interpretation = "Very strong evidence against H0"
    elif BF_10 > 10:
        interpretation = "Strong evidence against H0"
    elif BF_10 > 3:
        interpretation = "Moderate evidence against H0"
    elif BF_10 > 1:
        interpretation = "Anecdotal evidence against H0"
    elif BF_10 == 1:
        interpretation = "No evidence either way"
    elif BF_01 > 100:
        interpretation = "Decisive evidence for H0"
    elif BF_01 > 30:
        interpretation = "Very strong evidence for H0"
    elif BF_01 > 10:
        interpretation = "Strong evidence for H0"
    elif BF_01 > 3:
        interpretation = "Moderate evidence for H0"
    else:
        interpretation = "Anecdotal evidence for H0"

    decision = "Reject H0" if BF_10 > 3 else "Fail to reject H0"

    return SavageDickeyResult(
        BF_01=BF_01,
        BF_10=BF_10,
        posterior_density_at_0=posterior_at_null,
        prior_density_at_0=prior_at_null,
        interpretation=interpretation,
        decision=decision,
    )
posterior_probability_H0(BF_01, prior_H0=0.5) staticmethod

Convert BF_01 to posterior probability of H0 (spike-and-slab).

Parameters:

Name Type Description Default
BF_01 float

Bayes factor in favour of H0.

required
prior_H0 float

Prior probability of H0 (default 0.5).

0.5

Returns:

Type Description
PosteriorProbH0Result

class:PosteriorProbH0Result with posterior and prior odds

PosteriorProbH0Result

and model probabilities.

Source code in bayesprop/resources/bayes_nonpaired.py
@staticmethod
def posterior_probability_H0(
    BF_01: float, prior_H0: float = 0.5
) -> PosteriorProbH0Result:
    """Convert BF_01 to posterior probability of H0 (spike-and-slab).

    Args:
        BF_01: Bayes factor in favour of H0.
        prior_H0: Prior probability of H0 (default 0.5).

    Returns:
        :class:`PosteriorProbH0Result` with posterior and prior odds
        and model probabilities.
    """
    prior_odds = prior_H0 / (1 - prior_H0)
    posterior_odds = BF_01 * prior_odds
    P_H0 = posterior_odds / (1 + posterior_odds)
    P_H1 = 1 - P_H0

    if P_H1 > 0.95:
        decision = "Reject H0"
    elif P_H0 > 0.95:
        decision = "Fail to reject H0"
    else:
        decision = "Undecided"

    return PosteriorProbH0Result(
        **{"P(H0|data)": P_H0, "P(H1|data)": P_H1},
        prior_odds=prior_odds,
        posterior_odds=posterior_odds,
        decision=decision,
    )
rope_test(rope=None, ci_mass=0.95)

ROPE analysis on the posterior of Δ = θ_A − θ_B.

Parameters:

Name Type Description Default
rope tuple[float, float] | None

(lower, upper) ROPE bounds. Defaults to (-self.rope_epsilon, +self.rope_epsilon).

None
ci_mass float

Credible interval mass (default 95%).

0.95

Returns:

Type Description
ROPEResult

class:ROPEResult with CI, ROPE overlap fractions, and

ROPEResult

decision.

Source code in bayesprop/resources/bayes_nonpaired.py
def rope_test(
    self,
    rope: tuple[float, float] | None = None,
    ci_mass: float = 0.95,
) -> ROPEResult:
    """ROPE analysis on the posterior of Δ = θ_A − θ_B.

    Args:
        rope: (lower, upper) ROPE bounds. Defaults to
            ``(-self.rope_epsilon, +self.rope_epsilon)``.
        ci_mass: Credible interval mass (default 95%).

    Returns:
        :class:`ROPEResult` with CI, ROPE overlap fractions, and
        decision.
    """
    self._check_fitted()
    if rope is None:
        rope = (-self.rope_epsilon, self.rope_epsilon)
    return ROPEResult.from_samples(self.delta_samples, rope=rope, ci_mass=ci_mass)
decide(rule=None)

Run the chosen decision framework(s) and return a composite result.

Parameters:

Name Type Description Default
rule DecisionRuleType | None

Override the default decision_rule. One of "bayes_factor", "posterior_null", "rope", or "all".

None

Returns:

Type Description
HypothesisDecision

class:HypothesisDecision with the requested sub-results

HypothesisDecision

populated.

Source code in bayesprop/resources/bayes_nonpaired.py
def decide(self, rule: DecisionRuleType | None = None) -> HypothesisDecision:
    """Run the chosen decision framework(s) and return a composite result.

    Args:
        rule: Override the default ``decision_rule``. One of
            ``"bayes_factor"``, ``"posterior_null"``, ``"rope"``,
            or ``"all"``.

    Returns:
        :class:`HypothesisDecision` with the requested sub-results
        populated.
    """
    self._check_fitted()
    rule = rule or self.decision_rule

    bf: SavageDickeyResult | None = None
    pn: PosteriorProbH0Result | None = None
    rp: ROPEResult | None = None

    if rule in ("bayes_factor", "posterior_null", "all"):
        bf = self.savage_dickey_test()
    if rule in ("posterior_null", "all"):
        assert bf is not None  # guaranteed by the branch above
        pn = self.posterior_probability_H0(bf.BF_01)
    if rule in ("rope", "all"):
        rp = self.rope_test()

    return HypothesisDecision(
        bayes_factor=bf, posterior_null=pn, rope=rp, rule=rule
    )
ppc_pvalues(seed=None)

Posterior predictive p-values for summary statistics.

For each summary statistic (group means and mean difference), replicated datasets are drawn from the posterior predictive distribution and a two-sided mid-p value is computed as

.. math::

p = 2\,\min\!\bigl(P(T^{\text{rep}} > T^{\text{obs}})
    + \tfrac{1}{2} P(T^{\text{rep}} = T^{\text{obs}}),\;
    P(T^{\text{rep}} < T^{\text{obs}})
    + \tfrac{1}{2} P(T^{\text{rep}} = T^{\text{obs}})\bigr).

The mid-p correction splits the probability mass at exact ties evenly between the two tails, which prevents the p-value from clipping at 1.0 when T^rep and T^obs coincide often (a common occurrence for binary data, where the sample mean lives on the coarse grid k/n).

Note

For the conjugate Beta-Bernoulli model the sample mean is the sufficient statistic for each group, so PPC p-values for the means are expected to be close to 1.0 by construction. They are reported here only as a sanity check against gross misspecification (e.g. wrong likelihood family) and should not be interpreted as a strong test of fit for this saturated model.

Parameters:

Name Type Description Default
seed int | None

Random seed for reproducibility. Falls back to self.seed if not provided.

None

Returns:

Type Description
dict[str, PPCStatistic]

Dict mapping statistic name to :class:PPCStatistic

dict[str, PPCStatistic]

with observed value, p-value, and status ("OK" / "WARN").

Raises:

Type Description
RuntimeError

If :meth:fit has not been called.

Source code in bayesprop/resources/bayes_nonpaired.py
def ppc_pvalues(self, seed: int | None = None) -> dict[str, PPCStatistic]:
    """Posterior predictive p-values for summary statistics.

    For each summary statistic (group means and mean difference),
    replicated datasets are drawn from the posterior predictive
    distribution and a two-sided ``mid-p`` value is computed as

    .. math::

        p = 2\\,\\min\\!\\bigl(P(T^{\\text{rep}} > T^{\\text{obs}})
            + \\tfrac{1}{2} P(T^{\\text{rep}} = T^{\\text{obs}}),\\;
            P(T^{\\text{rep}} < T^{\\text{obs}})
            + \\tfrac{1}{2} P(T^{\\text{rep}} = T^{\\text{obs}})\\bigr).

    The mid-p correction splits the probability mass at exact ties
    evenly between the two tails, which prevents the p-value from
    clipping at 1.0 when ``T^rep`` and ``T^obs`` coincide often
    (a common occurrence for binary data, where the sample mean
    lives on the coarse grid ``k/n``).

    Note:
        For the conjugate Beta-Bernoulli model the sample mean is
        the sufficient statistic for each group, so PPC p-values
        for the means are expected to be close to 1.0 by
        construction.  They are reported here only as a sanity
        check against gross misspecification (e.g. wrong
        likelihood family) and should not be interpreted as a
        strong test of fit for this saturated model.

    Args:
        seed: Random seed for reproducibility.  Falls back to
            ``self.seed`` if not provided.

    Returns:
        Dict mapping statistic name to :class:`PPCStatistic`
        with observed value, p-value, and status ("OK" / "WARN").

    Raises:
        RuntimeError: If :meth:`fit` has not been called.
    """
    self._check_fitted()

    rng = np.random.default_rng(seed if seed is not None else self.seed)

    n_A = len(self.y_A_obs)
    n_B = len(self.y_B_obs)

    # Replicate datasets from posterior draws
    y_A_rep = rng.binomial(
        1, self.theta_A_samples[:, None], size=(self.n_samples, n_A)
    )
    y_B_rep = rng.binomial(
        1, self.theta_B_samples[:, None], size=(self.n_samples, n_B)
    )

    checks = {
        "mean(y_A)": (self.y_A_obs.mean(), y_A_rep.mean(axis=1)),
        "mean(y_B)": (self.y_B_obs.mean(), y_B_rep.mean(axis=1)),
        "mean(y_A)-mean(y_B)": (
            self.y_A_obs.mean() - self.y_B_obs.mean(),
            y_A_rep.mean(axis=1) - y_B_rep.mean(axis=1),
        ),
    }

    # Tolerance for floating-point ties (statistics on binary data
    # live on the coarse grid k/n, so exact equality is common).
    atol = 1e-12

    results: dict[str, PPCStatistic] = {}
    for stat_name, (obs_val, rep_vals) in checks.items():
        rep_arr = np.asarray(rep_vals, dtype=float)
        # Decompose the replicated distribution into three disjoint
        # masses relative to the observed statistic: strictly less,
        # equal (within tolerance), and strictly greater.
        p_eq = float(np.mean(np.abs(rep_arr - obs_val) <= atol))
        p_gt = float(np.mean(rep_arr > obs_val + atol))
        p_lt = float(np.mean(rep_arr < obs_val - atol))
        # Mid-p one-sided tail probabilities: assign half of the
        # tie mass to each tail so that the p-value remains
        # uniform under the null even when the test statistic
        # is discrete (Lancaster, 1961).
        p_ge_mid = p_gt + 0.5 * p_eq
        p_le_mid = p_lt + 0.5 * p_eq
        # Two-sided mid-p value, clipped at 1.0 to guard against
        # tiny numerical overshoot when both tails are ~0.5.
        p_val = min(2.0 * min(p_ge_mid, p_le_mid), 1.0)
        results[stat_name] = PPCStatistic(
            observed=float(obs_val),
            p_value=p_val,
            # WARN flags p < 0.05 (observed statistic in the
            # extreme 5% of the posterior predictive distribution).
            status="OK" if p_val > 0.05 else "WARN",
        )
    return results
plot_posteriors(**kwargs)

Two-panel plot: overlaid θ_A / θ_B posteriors and Δ = θ_A − θ_B.

The left panel shows the analytic Beta posterior densities for θ_A and θ_B overlaid in a single axes. The right panel shows the Monte Carlo difference posterior Δ = θ_A − θ_B.

Parameters:

Name Type Description Default
**kwargs dict

Accepts figsize (default (14, 5)) and title (default "Beta-Bernoulli Posteriors (Non-Paired)").

{}
Source code in bayesprop/resources/bayes_nonpaired.py
def plot_posteriors(self, **kwargs: dict) -> None:
    """Two-panel plot: overlaid θ_A / θ_B posteriors and Δ = θ_A − θ_B.

    The left panel shows the analytic Beta posterior densities for
    θ_A and θ_B overlaid in a single axes. The right panel shows the
    Monte Carlo difference posterior Δ = θ_A − θ_B.

    Args:
        **kwargs: Accepts ``figsize`` (default ``(14, 5)``) and
            ``title`` (default ``"Beta-Bernoulli Posteriors (Non-Paired)"``).
    """
    import matplotlib.pyplot as plt
    from scipy.stats import beta as beta_dist

    self._check_fitted()

    figsize = kwargs.pop("figsize", (14, 5))
    fig, axes = plt.subplots(1, 2, figsize=figsize)

    # Panel 1: θ_A and θ_B overlaid (analytic Beta densities)
    ax = axes[0]
    x = np.linspace(0, 1, 500)
    pdf_A = beta_dist.pdf(x, self.a_A, self.b_A)
    pdf_B = beta_dist.pdf(x, self.a_B, self.b_B)

    ax.plot(
        x,
        pdf_A,
        color="#2196F3",
        linewidth=2,
        label=f"θ_A ~ Beta({self.a_A:.0f}, {self.b_A:.0f})",
    )
    ax.fill_between(x, pdf_A, alpha=0.15, color="#2196F3")
    ax.plot(
        x,
        pdf_B,
        color="#4CAF50",
        linewidth=2,
        label=f"θ_B ~ Beta({self.a_B:.0f}, {self.b_B:.0f})",
    )
    ax.fill_between(x, pdf_B, alpha=0.15, color="#4CAF50")

    ax.axvline(
        self.a_A / (self.a_A + self.b_A),
        color="#2196F3",
        linestyle="--",
        linewidth=1,
        alpha=0.6,
    )
    ax.axvline(
        self.a_B / (self.a_B + self.b_B),
        color="#4CAF50",
        linestyle="--",
        linewidth=1,
        alpha=0.6,
    )
    ax.set_xlabel("θ")
    ax.set_ylabel("Density")
    ax.set_title("Beta Posteriors", fontsize=11, fontweight="bold")
    ax.legend(fontsize=9)
    ax.grid(alpha=0.3)

    # Panel 2: Δ = θ_A − θ_B
    ax = axes[1]
    samples = self.delta_samples
    ax.hist(
        samples,
        bins=60,
        density=True,
        alpha=0.6,
        color="#9C27B0",
        edgecolor="white",
    )
    ax.axvline(0, color="gray", linestyle="--", linewidth=1, alpha=0.6)
    ax.axvline(
        samples.mean(),
        color="#9C27B0",
        linewidth=1.5,
        label=f"Mean = {samples.mean():.4f}",
    )
    ax.set_xlabel("Δ = θ_A − θ_B")
    ax.set_ylabel("Density")
    ax.set_title("Difference Posterior", fontsize=11, fontweight="bold")
    ax.legend(fontsize=9)
    ax.grid(alpha=0.3)

    fig.suptitle(
        kwargs.pop("title", "Beta-Bernoulli Posteriors (Non-Paired)"),
        fontsize=13,
        fontweight="bold",
        y=1.02,
    )
    plt.tight_layout()
    plt.show()
plot_posterior_delta(color='#9C27B0', **kwargs)

KDE posterior density of Δ = θ_A − θ_B with 95% CI.

Plots a smooth kernel density estimate of the Monte Carlo difference posterior with the 95% credible interval shaded and the posterior mean marked.

Parameters:

Name Type Description Default
color str

Colour for the density curve and fill.

'#9C27B0'
**kwargs dict

Accepts figsize (default (7, 5)) and title (default "Posterior of Δ (Beta-Bernoulli)").

{}

Raises:

Type Description
RuntimeError

If :meth:fit has not been called.

Source code in bayesprop/resources/bayes_nonpaired.py
def plot_posterior_delta(self, color: str = "#9C27B0", **kwargs: dict) -> None:
    """KDE posterior density of Δ = θ_A − θ_B with 95% CI.

    Plots a smooth kernel density estimate of the Monte Carlo
    difference posterior with the 95% credible interval shaded
    and the posterior mean marked.

    Args:
        color: Colour for the density curve and fill.
        **kwargs: Accepts ``figsize`` (default ``(7, 5)``) and
            ``title`` (default ``"Posterior of Δ (Beta-Bernoulli)"``).

    Raises:
        RuntimeError: If :meth:`fit` has not been called.
    """
    import matplotlib.pyplot as plt

    self._check_fitted()
    samples = self.delta_samples
    ci_low, ci_high = np.quantile(samples, [0.025, 0.975])
    mean_val = samples.mean()

    kde = gaussian_kde(samples)
    x_grid = np.linspace(samples.min() - 0.05, samples.max() + 0.05, 500)
    density = kde(x_grid)

    figsize = kwargs.pop("figsize", (7, 5))
    _, ax = plt.subplots(figsize=figsize)
    ax.plot(x_grid, density, color=color, linewidth=2)
    ax.fill_between(x_grid, density, alpha=0.15, color=color)
    mask = (x_grid >= ci_low) & (x_grid <= ci_high)
    ax.fill_between(
        x_grid[mask],
        density[mask],
        alpha=0.35,
        color=color,
        label="95% CI",
    )
    ax.axvline(
        mean_val,
        color=color,
        linestyle="-",
        linewidth=1.5,
        alpha=0.8,
        label=f"Mean = {mean_val:.4f}",
    )
    ax.axvline(
        0,
        color="gray",
        linestyle="--",
        linewidth=1,
        alpha=0.6,
        label="Δ = 0 (no difference)",
    )
    ax.set_xlabel("Δ = θ_A − θ_B")
    ax.set_ylabel("Density")
    ax.set_title(
        kwargs.pop("title", "Posterior of Δ (Beta-Bernoulli)"),
        fontsize=12,
        fontweight="bold",
    )
    ax.legend(fontsize=9, loc="upper right")
    ax.grid(axis="y", alpha=0.3)
    plt.tight_layout()
    plt.show()
plot_savage_dickey(color='#9C27B0', **kwargs)

Posterior vs prior density of Δ with Savage-Dickey BF annotation.

Overlays the exact convolution densities of Δ under the posterior and prior, marks the density values at Δ = 0, and annotates the plot with BF₁₀, log₁₀ BF₁₀, and the decision.

Parameters:

Name Type Description Default
color str

Colour for the posterior density curve and fill.

'#9C27B0'
**kwargs dict

Accepts figsize (default (7, 5)) and title (default "Savage-Dickey Test (Beta-Bernoulli)").

{}

Raises:

Type Description
RuntimeError

If :meth:fit has not been called.

Source code in bayesprop/resources/bayes_nonpaired.py
def plot_savage_dickey(self, color: str = "#9C27B0", **kwargs: dict) -> None:
    """Posterior vs prior density of Δ with Savage-Dickey BF annotation.

    Overlays the exact convolution densities of Δ under the
    posterior and prior, marks the density values at Δ = 0, and
    annotates the plot with BF₁₀, log₁₀ BF₁₀, and the decision.

    Args:
        color: Colour for the posterior density curve and fill.
        **kwargs: Accepts ``figsize`` (default ``(7, 5)``) and
            ``title`` (default
            ``"Savage-Dickey Test (Beta-Bernoulli)"``).

    Raises:
        RuntimeError: If :meth:`fit` has not been called.
    """
    import matplotlib.pyplot as plt

    self._check_fitted()
    bf = self.savage_dickey_test()
    samples = self.delta_samples

    x_grid = np.linspace(samples.min() - 0.1, samples.max() + 0.1, 500)

    # Posterior and prior density via analytic convolution
    density_post = np.array(
        [beta_diff_pdf(z, self.a_A, self.b_A, self.a_B, self.b_B) for z in x_grid]
    )
    density_prior = np.array(
        [
            beta_diff_pdf(z, self.alpha0, self.beta0, self.alpha0, self.beta0)
            for z in x_grid
        ]
    )

    figsize = kwargs.pop("figsize", (7, 5))
    fig, ax = plt.subplots(figsize=figsize)
    ax.plot(x_grid, density_post, color=color, linewidth=2, label="Posterior")
    ax.fill_between(x_grid, density_post, alpha=0.15, color=color)
    ax.plot(
        x_grid,
        density_prior,
        color="gray",
        linewidth=1.5,
        linestyle="--",
        alpha=0.7,
        label=f"Prior Beta({self.alpha0},{self.beta0})",
    )
    ax.plot(
        0,
        bf.posterior_density_at_0,
        "o",
        color="red",
        markersize=10,
        zorder=5,
        label=f"Post. at Δ=0: {bf.posterior_density_at_0:.2f}",
    )
    ax.plot(
        0,
        bf.prior_density_at_0,
        "s",
        color="gray",
        markersize=8,
        zorder=5,
        label=f"Prior at Δ=0: {bf.prior_density_at_0:.2f}",
    )

    bf10_label = _format_bf(bf.BF_10)
    log10_bf = np.log10(bf.BF_10)
    ax.text(
        0.02,
        0.97,
        f"$BF_{{10}}$ = {bf10_label}\n$\\log_{{10}}BF_{{10}}$ = {log10_bf:.1f}\n{bf.decision}",
        fontsize=10,
        fontweight="bold",
        color="darkred",
        transform=ax.transAxes,
        verticalalignment="top",
        bbox=dict(
            boxstyle="round,pad=0.3",
            facecolor="lightyellow",
            edgecolor="darkred",
            alpha=0.9,
        ),
    )
    ax.set_xlabel("Δ = θ_A − θ_B")
    ax.set_ylabel("Density")
    ax.set_title(
        kwargs.pop("title", "Savage-Dickey Test (Beta-Bernoulli)"),
        fontsize=12,
        fontweight="bold",
    )
    ax.legend(fontsize=9, loc="upper right")
    ax.grid(axis="y", alpha=0.3)
    plt.tight_layout()
    plt.show()
print_summary()

Print posterior summary, Savage-Dickey test, and PPC p-values.

Outputs a formatted report to stdout containing:

  • Beta posterior parameters and moments for θ_A, θ_B
  • Posterior mean Δ, 95% CI, and P(A > B)
  • Savage-Dickey Bayes factor with interpretation
  • Posterior model probabilities P(H₀ | D) and P(H₁ | D)
  • Posterior predictive p-values for key summary statistics
  • Trace summary table (mean, sd, HDI)

Raises:

Type Description
RuntimeError

If :meth:fit has not been called.

Source code in bayesprop/resources/bayes_nonpaired.py
def print_summary(self) -> None:
    """Print posterior summary, Savage-Dickey test, and PPC p-values.

    Outputs a formatted report to stdout containing:

    - Beta posterior parameters and moments for θ_A, θ_B
    - Posterior mean Δ, 95% CI, and P(A > B)
    - Savage-Dickey Bayes factor with interpretation
    - Posterior model probabilities P(H₀ | D) and P(H₁ | D)
    - Posterior predictive p-values for key summary statistics
    - Trace summary table (mean, sd, HDI)

    Raises:
        RuntimeError: If :meth:`fit` has not been called.
    """
    self._check_fitted()

    s = self.summary
    n_A, n_B = len(self.y_A_obs), len(self.y_B_obs)
    verdict = (
        "A wins"
        if s.p_A_greater_B > 0.95
        else ("Tied" if s.p_A_greater_B > 0.5 else "B wins")
    )

    print("Beta-Bernoulli posterior summary (Non-Paired)")
    print("=" * 60)
    print(
        f"  θ_A ~ Beta({self.a_A:.0f}, {self.b_A:.0f})  "
        f"mean={s.theta_A_mean:.4f}  "
        f"[n_A={n_A}, k_A={int(self.y_A_obs.sum())}]"
    )
    print(
        f"  θ_B ~ Beta({self.a_B:.0f}, {self.b_B:.0f})  "
        f"mean={s.theta_B_mean:.4f}  "
        f"[n_B={n_B}, k_B={int(self.y_B_obs.sum())}]"
    )
    print(f"  Mean Δ (θ_A − θ_B):  {s.mean_delta:.4f}")
    print(f"  95% CI:              [{s.ci_95.lower:.4f}, {s.ci_95.upper:.4f}]")
    print(f"  P(A > B):            {s.p_A_greater_B:.4f}  (MC)")
    print(f"  P(B > A):            {self.p_B_greater_A:.4f}  (quadrature)")
    print(f"  Verdict:             {verdict}")

    bf = self.savage_dickey_test()
    print()
    print("Savage-Dickey Bayes Factor: H0 (Δ = 0) vs H1 (Δ ≠ 0)")
    print("=" * 60)
    print(f"  Prior  density at Δ=0: {bf.prior_density_at_0:.4f}")
    print(f"  Post.  density at Δ=0: {bf.posterior_density_at_0:.4f}")
    print(f"  BF_01 (for H0):        {_format_bf(bf.BF_01)}")
    print(f"  BF_10 (against H0):    {_format_bf(bf.BF_10)}")
    print(f"  log₁₀(BF_10):          {np.log10(bf.BF_10):.1f}")
    print(f"  → {bf.interpretation}")
    print(f"  → Decision: {bf.decision}")

    post = self.posterior_probability_H0(bf.BF_01)
    print()
    print("Posterior model probabilities (prior P(H0) = 0.5)")
    print("=" * 60)
    print(f"  P(H0|data): {post.p_H0:.2e}")
    print(f"  P(H1|data): {post.p_H1:.6f}")

    ppc = self.ppc_pvalues()
    print()
    print("Posterior Predictive p-values")
    print("=" * 60)
    print(f"  {'Statistic':<25} {'Observed':>10} {'p-value':>10} {'Status':>8}")
    print("  " + "-" * 55)
    for stat, vals in ppc.items():
        print(
            f"  {stat:<25} {vals.observed:>10.4f} {vals.p_value:>10.3f} {vals.status:>8}"
        )

    print()
    print("Trace summary")
    print("=" * 60)
    print(self.trace_summary.to_string())
plot_forest(results, label_A='Model A', label_B='Model B', **kwargs) staticmethod

Forest plot with P(A > B) bar chart for multiple metrics.

The left panel shows posterior mean differences with 95% credible intervals; the right panel shows horizontal bars for the posterior probability of superiority.

Parameters:

Name Type Description Default
results dict[str, 'NonPairedBayesPropTest']

Mapping from metric name to a fitted :class:NonPairedBayesPropTest instance.

required
label_A str

Display label for group A.

'Model A'
label_B str

Display label for group B.

'Model B'
**kwargs Any

Accepts figsize and title.

{}
Source code in bayesprop/resources/bayes_nonpaired.py
@staticmethod
def plot_forest(
    results: dict[str, "NonPairedBayesPropTest"],
    label_A: str = "Model A",
    label_B: str = "Model B",
    **kwargs: Any,
) -> None:
    """Forest plot with P(A > B) bar chart for multiple metrics.

    The left panel shows posterior mean differences with 95%
    credible intervals; the right panel shows horizontal bars
    for the posterior probability of superiority.

    Args:
        results: Mapping from metric name to a fitted
            :class:`NonPairedBayesPropTest` instance.
        label_A: Display label for group A.
        label_B: Display label for group B.
        **kwargs: Accepts ``figsize`` and ``title``.
    """
    import matplotlib.patches as mpatches
    import matplotlib.pyplot as plt

    metrics = list(results.keys())
    means = [results[m].summary.mean_delta for m in metrics]
    ci_lows = [results[m].summary.ci_95.lower for m in metrics]
    ci_highs = [results[m].summary.ci_95.upper for m in metrics]
    probs = [results[m].summary.p_A_greater_B for m in metrics]

    colors = [
        "#2196F3" if p > 0.95 else "#FF9800" if p > 0.5 else "#F44336"
        for p in probs
    ]
    y_pos = np.arange(len(metrics))

    figsize = kwargs.pop("figsize", (14, max(4, 2 * len(metrics))))
    fig, axes = plt.subplots(1, 2, figsize=figsize)

    ax = axes[0]
    for i, (m, ci_l, ci_h, col) in enumerate(
        zip(means, ci_lows, ci_highs, colors, strict=False)
    ):
        ax.plot(
            [ci_l, ci_h],
            [i, i],
            color=col,
            linewidth=2.5,
            solid_capstyle="round",
        )
        ax.plot(m, i, "o", color=col, markersize=8, zorder=5)
    ax.axvline(0, color="gray", linestyle="--", linewidth=1, alpha=0.7)
    ax.set_yticks(y_pos)
    ax.set_yticklabels(metrics, fontsize=11)
    ax.set_xlabel(f"Mean Δ (θ_A − θ_B)\n{label_B} better | {label_A} better →")
    ax.set_title("Posterior Mean Difference with 95% CI", fontweight="bold")
    ax.invert_yaxis()
    ax.grid(axis="x", alpha=0.3)

    ax2 = axes[1]
    ax2.barh(y_pos, probs, color=colors, height=0.5, alpha=0.85)
    ax2.axvline(0.5, color="gray", linestyle="--", linewidth=1, alpha=0.7)
    ax2.set_yticks(y_pos)
    ax2.set_yticklabels(metrics, fontsize=11)
    ax2.set_xlabel(f"P({label_A} > {label_B})")
    ax2.set_title("Posterior Probability of Superiority", fontweight="bold")
    ax2.set_xlim(0, 1.05)
    ax2.invert_yaxis()
    ax2.grid(axis="x", alpha=0.3)
    for i, p in enumerate(probs):
        ax2.text(
            p + 0.02,
            i,
            f"{p:.2f}",
            va="center",
            fontsize=10,
            fontweight="bold",
        )

    legend_elements = [
        mpatches.Patch(color="#2196F3", label="Strong (P > 0.95)"),
        mpatches.Patch(color="#FF9800", label="Moderate (0.5 < P ≤ 0.95)"),
        mpatches.Patch(color="#F44336", label="Reversed (P ≤ 0.5)"),
    ]
    fig.legend(
        handles=legend_elements,
        loc="lower center",
        ncol=3,
        fontsize=9,
        bbox_to_anchor=(0.5, -0.02),
    )
    fig.suptitle(
        kwargs.pop(
            "title",
            f"{label_A} vs {label_B} — Beta-Bernoulli Comparison (Non-Paired)",
        ),
        fontsize=14,
        fontweight="bold",
        y=1.02,
    )
    plt.tight_layout()
    plt.show()
print_comparison_table(results) staticmethod

Print a formatted comparison table across metrics.

Displays the posterior mean difference, 95% credible interval, posterior probability of superiority P(A > B), and a verdict for each metric in a fixed-width table.

The verdict is determined as:

  • A wins if P(A > B) > 0.95
  • B wins if P(A > B) ≤ 0.5
  • Tied otherwise

Parameters:

Name Type Description Default
results dict[str, 'NonPairedBayesPropTest']

Mapping from metric name to a fitted :class:NonPairedBayesPropTest instance.

required
Source code in bayesprop/resources/bayes_nonpaired.py
@staticmethod
def print_comparison_table(
    results: dict[str, "NonPairedBayesPropTest"],
) -> None:
    """Print a formatted comparison table across metrics.

    Displays the posterior mean difference, 95% credible interval,
    posterior probability of superiority P(A > B), and a verdict
    for each metric in a fixed-width table.

    The verdict is determined as:

    - **A wins** if P(A > B) > 0.95
    - **B wins** if P(A > B) ≤ 0.5
    - **Tied** otherwise

    Args:
        results: Mapping from metric name to a fitted
            :class:`NonPairedBayesPropTest` instance.
    """
    print("=" * 80)
    print(
        f"{'Metric':<25} {'Mean Δ':>8} {'95% CI':>20} {'P(A>B)':>8} {'Verdict':>12}"
    )
    print("=" * 80)
    for m, model in results.items():
        s = model.summary
        verdict = (
            "A wins"
            if s.p_A_greater_B > 0.95
            else ("Tied" if s.p_A_greater_B > 0.5 else "B wins")
        )
        print(
            f"{m:<25} {s.mean_delta:>8.4f} "
            f"[{s.ci_95.lower:>7.4f}, {s.ci_95.upper:>7.4f}] "
            f"{s.p_A_greater_B:>8.4f} {verdict:>12}"
        )
    print("=" * 80)

_format_bf(value)

Format a Bayes Factor for human-readable display.

Source code in bayesprop/resources/bayes_nonpaired.py
def _format_bf(value: float) -> str:
    """Format a Bayes Factor for human-readable display."""
    if value > 1e4:
        return f"10^{np.log10(value):.0f}"
    elif value < 1e-4 and value > 0:
        return f"10^{np.log10(value):.0f}"
    else:
        return f"{value:.2f}"

beta_diff_pdf(z, a1, b1, a2, b2, n_grid=2000)

Evaluate the PDF of Δ = θ_A − θ_B at z via log-space convolution.

Where θ_A ~ Beta(a1, b1) and θ_B ~ Beta(a2, b2) are independent.

Parameters:

Name Type Description Default
z float

Point at which to evaluate the density (−1 < z < 1).

required
a1 float

Alpha parameter for the Beta distribution of θ_A.

required
b1 float

Beta parameter for the Beta distribution of θ_A.

required
a2 float

Alpha parameter for the Beta distribution of θ_B.

required
b2 float

Beta parameter for the Beta distribution of θ_B.

required
n_grid int

Number of quadrature nodes for trapezoidal integration.

2000

Returns:

Type Description
float

f_Δ(z), the density of the difference at z.

Source code in bayesprop/resources/bayes_nonpaired.py
def beta_diff_pdf(
    z: float,
    a1: float,
    b1: float,
    a2: float,
    b2: float,
    n_grid: int = 2000,
) -> float:
    """Evaluate the PDF of Δ = θ_A − θ_B at *z* via log-space convolution.

    Where θ_A ~ Beta(a1, b1) and θ_B ~ Beta(a2, b2) are independent.

    Args:
        z: Point at which to evaluate the density (−1 < z < 1).
        a1: Alpha parameter for the Beta distribution of θ_A.
        b1: Beta parameter for the Beta distribution of θ_A.
        a2: Alpha parameter for the Beta distribution of θ_B.
        b2: Beta parameter for the Beta distribution of θ_B.
        n_grid: Number of quadrature nodes for trapezoidal integration.

    Returns:
        f_Δ(z), the density of the difference at *z*.
    """
    if z <= -1 or z >= 1:
        return 0.0

    # Δ = θ_A − θ_B  ⟹  θ_A = θ_B + z
    # f_Δ(z) = ∫ f_A(x) · f_B(x − z) dx   for x in [max(0,z), min(1,1+z)]
    lower = max(0.0, z)
    upper = min(1.0, 1.0 + z)

    x = np.linspace(lower + 1e-12, upper - 1e-12, n_grid)

    # log Beta PDFs for θ_A at x
    log_fA = (a1 - 1) * np.log(x) + (b1 - 1) * np.log(1 - x) - betaln(a1, b1)
    # θ_B = x − z
    xb = x - z
    log_fB = (a2 - 1) * np.log(xb) + (b2 - 1) * np.log(1 - xb) - betaln(a2, b2)

    integrand = np.exp(log_fA + log_fB)
    return float(np.trapezoid(integrand, x))

descriptive_summary(scores_data, thresholds=None)

Build a combined descriptive-stats + Beta-Bernoulli threshold-sweep table.

Parameters

scores_data : dict Must contain keys "model_A", "model_B", and "metrics" where metrics[name] has "s_A_raw" and "s_B_raw" arrays. thresholds : list of float, optional Binarization thresholds for the Beta-Bernoulli sweep. Default: [0.5, 0.7, 0.8, 0.9, 0.95].

Returns:

pd.DataFrame Multi-indexed by (Metric, Model) with descriptive stats and BB results.

Source code in bayesprop/resources/bayes_nonpaired.py
def descriptive_summary(
    scores_data: dict,
    thresholds: list[float] | None = None,
) -> pd.DataFrame:
    """Build a combined descriptive-stats + Beta-Bernoulli threshold-sweep table.

    Parameters
    ----------
    scores_data : dict
        Must contain keys ``"model_A"``, ``"model_B"``, and ``"metrics"``
        where ``metrics[name]`` has ``"s_A_raw"`` and ``"s_B_raw"`` arrays.
    thresholds : list of float, optional
        Binarization thresholds for the Beta-Bernoulli sweep.
        Default: ``[0.5, 0.7, 0.8, 0.9, 0.95]``.

    Returns:
    -------
    pd.DataFrame
        Multi-indexed by (Metric, Model) with descriptive stats and BB results.
    """
    if thresholds is None:
        thresholds = [0.5, 0.7, 0.8, 0.9, 0.95]

    model_A = scores_data["model_A"]
    model_B = scores_data["model_B"]
    metric_names = list(scores_data["metrics"].keys())

    rows: list[dict] = []
    for metric in metric_names:
        s_A = np.array(scores_data["metrics"][metric]["s_A_raw"])
        s_B = np.array(scores_data["metrics"][metric]["s_B_raw"])
        diff = s_A - s_B

        rows.append(
            {
                "Metric": metric,
                "Model": model_A,
                "n": len(s_A),
                "Mean": s_A.mean(),
                "Std": s_A.std(),
                "Min": s_A.min(),
                "Q25": np.quantile(s_A, 0.25),
                "Median": np.median(s_A),
                "Q75": np.quantile(s_A, 0.75),
                "Max": s_A.max(),
            }
        )
        rows.append(
            {
                "Metric": metric,
                "Model": model_B,
                "n": len(s_B),
                "Mean": s_B.mean(),
                "Std": s_B.std(),
                "Min": s_B.min(),
                "Q25": np.quantile(s_B, 0.25),
                "Median": np.median(s_B),
                "Q75": np.quantile(s_B, 0.75),
                "Max": s_B.max(),
            }
        )
        rows.append(
            {
                "Metric": metric,
                "Model": "Δ (A − B)",
                "n": len(diff),
                "Mean": diff.mean(),
                "Std": diff.std(),
                "Min": diff.min(),
                "Q25": np.quantile(diff, 0.25),
                "Median": np.median(diff),
                "Q75": np.quantile(diff, 0.75),
                "Max": diff.max(),
            }
        )

        for tau in thresholds:
            bb_tau = NonPairedBayesPropTest(threshold=tau)
            res = bb_tau.test(s_A, s_B)
            rows.append(
                {
                    "Metric": metric,
                    "Model": f"BB (τ={tau})",
                    "n": len(s_A),
                    "P(B>A)": res.P_B_greater_A,
                    "θ_A posterior": f"Beta({res.thetaA_post.alpha}, {res.thetaA_post.beta})",
                    "θ_B posterior": f"Beta({res.thetaB_post.alpha}, {res.thetaB_post.beta})",
                }
            )

    df = pd.DataFrame(rows).set_index(["Metric", "Model"])

    n_sample = len(np.array(scores_data["metrics"][metric_names[0]]["s_A_raw"]))
    print(f"Paired LLM scores: {model_A} vs {model_B}  (n={n_sample} per model)")
    print(f"  Model A = {model_A}")
    print(f"  Model B = {model_B}")
    print("  Δ = A − B")
    print("  BB (τ=x) = Beta-Bernoulli test binarized at threshold τ")

    return df