OenoBench methodology

OenoBench is built by a four-stage, AI-driven pipeline that turns heterogeneous wine-domain sources into a balanced, low-bias evaluation set. The set covers six domains (viticulture, winemaking, wine business, wine regions, grape varieties, and producers) and four difficulty tiers (L1, easiest, through L4, hardest, calibrated to WSET / CMS levels). Each stage is automated where automation is reliable and gated by human reviewers where judgement is required. The summary below covers the v1.2 design; full prompts, model versions, and inter-rater statistics appear in the v1.0 paper.

1. Data collection. Curated source material is gathered across all six domains, with explicit coverage targets so that no single sub-domain dominates and every fact traces to an external authoritative source.
2. Multi-model question generation. Five frontier and open-source model families (Claude, GPT, Gemini, Llama, Qwen) plus deterministic templates propose candidate questions and answers from the source material. Multi-model generation is the central bias-mitigation step: no single family disproportionately shapes the questions it will later be scored on.
3. AI validation. A nine-agent audit checks each candidate for factual accuracy, ambiguity, leakage from the generation prompt, distractor quality, country-representation balance, and verbatim copying. Items that fail are dropped or rewritten.
4. Human review. A final human pass spot-checks a sample from each stratum, resolves disagreements between validators, and signs off on each release.
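
The per-stratum spot check in the human-review stage can be pictured with a minimal sketch. It assumes a stratum is a (domain, difficulty tier) pair and that each item is a dictionary with `domain` and `level` fields; the summary above does not define either, so the field names, the `spot_check_sample` function, and the `per_stratum` parameter are all hypothetical.

```python
import random
from collections import defaultdict

def spot_check_sample(items, per_stratum=2, seed=0):
    """Draw up to `per_stratum` items from every (domain, level) stratum.

    Assumption: a stratum is a (domain, difficulty tier) pair. The actual
    stratification used for OenoBench review is not specified here.
    """
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    strata = defaultdict(list)
    for item in items:
        strata[(item["domain"], item["level"])].append(item)

    sample = {}
    for key in sorted(strata):
        pool = strata[key]
        # Small strata are reviewed in full rather than skipped.
        k = min(per_stratum, len(pool))
        sample[key] = rng.sample(pool, k)
    return sample

# Illustrative items only; these are not real OenoBench questions.
items = [
    {"id": 1, "domain": "viticulture", "level": "L1"},
    {"id": 2, "domain": "viticulture", "level": "L1"},
    {"id": 3, "domain": "viticulture", "level": "L1"},
    {"id": 4, "domain": "producers", "level": "L4"},
]
review_queue = spot_check_sample(items, per_stratum=2)
```

Sampling within each stratum, rather than across the whole pool, ensures that rare combinations such as a small L4 stratum still reach a reviewer.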

In addition to headline accuracy, the leaderboard reports four orthogonal slices: a per-domain breakdown, a per-difficulty breakdown, a closed-book vs contextual split (is the question answerable from parametric memory alone, or does it require contextual reasoning?), and reasoning-mode lift (the same base model scored with and without extended reasoning). Self-preference bias, defined as each evaluator’s accuracy on questions generated by its own family versus questions generated by other families, is reported separately. Each release is published as a versioned JSON file alongside the leaderboard, so historical runs remain reproducible as the question set evolves.
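
As a rough illustration of how these figures could be derived from per-question results, the sketch below computes a slice breakdown, a self-preference gap, and a reasoning-mode lift. The row layout (`domain`, `level`, `generator_family`, `correct`) and all function names are assumptions made for this example, not the schema of the published JSON file, and the gap is shown as a simple difference of accuracies, which may not match the statistic the leaderboard reports.

```python
from collections import defaultdict

def accuracy(rows):
    """Fraction of rows answered correctly, or None for an empty slice."""
    return sum(r["correct"] for r in rows) / len(rows) if rows else None

def slice_accuracy(rows, key):
    """Accuracy grouped by one field, e.g. 'domain' or 'level'."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[key]].append(r)
    return {k: accuracy(v) for k, v in groups.items()}

def self_preference_gap(rows, evaluator_family):
    """Own-family accuracy minus other-family accuracy for one evaluator.

    Assumption: template-generated questions carry their own
    generator_family value and therefore count as 'other'.
    """
    own = [r for r in rows if r["generator_family"] == evaluator_family]
    other = [r for r in rows if r["generator_family"] != evaluator_family]
    a_own, a_other = accuracy(own), accuracy(other)
    if a_own is None or a_other is None:
        return None
    return a_own - a_other

def reasoning_lift(rows_with, rows_without):
    """Accuracy with extended reasoning minus accuracy without it."""
    return accuracy(rows_with) - accuracy(rows_without)

# Illustrative results for a hypothetical evaluator from the "claude" family.
rows = [
    {"domain": "viticulture", "level": "L1", "generator_family": "claude", "correct": True},
    {"domain": "viticulture", "level": "L2", "generator_family": "gpt", "correct": False},
    {"domain": "producers", "level": "L1", "generator_family": "claude", "correct": True},
    {"domain": "producers", "level": "L3", "generator_family": "qwen", "correct": True},
]
by_domain = slice_accuracy(rows, "domain")
gap = self_preference_gap(rows, "claude")
```

A positive gap means the evaluator scores higher on questions written by its own family; with only a handful of questions per slice the value is noisy, so any real analysis would need per-slice sample sizes alongside it.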