OenoBench methodology
OenoBench is constructed by a four-stage, AI-driven pipeline that turns heterogeneous wine-domain sources into a balanced, low-bias evaluation set. Each stage is automated where automation is reliable and gated by humans where judgement is required. The summary below covers the MVP design; a fuller treatment, including prompts, model versions, and inter-rater statistics, will appear in the v1.0 paper.
- Data collection. Curated source material is gathered across the four pillars (viticulture, winemaking, wine business, regions), with explicit coverage targets so that no single sub-domain dominates.
- Question generation. Multiple frontier models propose candidate questions and answers from the source material. Using more than one model family is a deliberate bias-mitigation step.
- AI validation. A separate validator pool checks each candidate for factual accuracy, ambiguity, and leakage from the generation prompt; failing items are dropped or rewritten.
- Human review. A final human pass spot-checks samples per stratum, resolves disagreements between validators, and signs off on each release.
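The four stages above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the OenoBench implementation: the `Candidate` fields, the function names, the 0.4 coverage cap, the toy questions, and the two rule-based checks are all hypothetical stand-ins. In particular, a real validator pool would call models rather than simple string checks.

```python
import random
from collections import Counter
from dataclasses import dataclass

# The four pillars named above; every other name in this sketch is hypothetical.
PILLARS = ("viticulture", "winemaking", "wine business", "regions")


@dataclass(frozen=True)
class Candidate:
    question: str
    answer: str
    pillar: str
    generator: str  # label for the model family that proposed the item


def coverage_shares(items):
    """Return each pillar's share of the items (0.0 if the list is empty)."""
    counts = Counter(item.pillar for item in items)
    total = len(items)
    return {p: (counts[p] / total if total else 0.0) for p in PILLARS}


def over_target(items, max_share):
    """List pillars whose share exceeds max_share (an assumed coverage cap)."""
    shares = coverage_shares(items)
    return [p for p in PILLARS if shares[p] > max_share]


def run_validators(item, validators):
    """Apply every named check; return the names of the checks that failed."""
    return [name for name, check in validators.items() if not check(item)]


def sample_for_review(items, per_stratum, seed=0):
    """Draw up to per_stratum items from each pillar for human spot-checks."""
    rng = random.Random(seed)  # seeded so the review sample is repeatable
    strata = {}
    for item in items:
        strata.setdefault(item.pillar, []).append(item)
    return {
        pillar: rng.sample(group, min(per_stratum, len(group)))
        for pillar, group in strata.items()
    }


# Toy stand-ins for the validator pool.
VALIDATORS = {
    "non_empty": lambda c: bool(c.question.strip()) and bool(c.answer.strip()),
    "no_answer_leak": lambda c: c.answer.lower() not in c.question.lower(),
}

# Invented example items, not taken from any OenoBench release.
candidates = [
    Candidate("Which grape is Barolo made from?", "Nebbiolo",
              "regions", "family_a"),
    Candidate("What does malolactic fermentation convert?",
              "Malic acid to lactic acid", "winemaking", "family_b"),
    Candidate("Is Nebbiolo the grape of Barolo?", "Nebbiolo",
              "regions", "family_b"),  # the answer appears in the question
]

# Keep only candidates that pass every check.
kept = [c for c in candidates if not run_validators(c, VALIDATORS)]
```

In this toy run the third candidate fails `no_answer_leak` and is dropped, and `over_target(kept, 0.4)` flags the pillars that would need rebalancing before release. Seeding the review sample is one way to let a second reviewer reproduce exactly which items were spot-checked.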
Each release is published as a versioned JSON file alongside the leaderboard, so historical runs remain reproducible as the question set evolves.
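As a rough sketch of how a versioned release can keep historical runs reproducible, the snippet below pins a leaderboard run to a release version and a content hash. The field names (`version`, `questions`, `release_sha256`), the version string, and the sample item are assumptions for illustration; the actual OenoBench release schema is not specified here.

```python
import hashlib
import json


def canonical_json(release):
    """Serialize with sorted keys and fixed separators so the text is stable."""
    return json.dumps(release, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)


def fingerprint(release):
    """SHA-256 of the canonical serialization, independent of key order."""
    return hashlib.sha256(canonical_json(release).encode("utf-8")).hexdigest()


def load_release(text, expected_version):
    """Parse a release and refuse to continue if the version does not match."""
    release = json.loads(text)
    found = release.get("version")
    if found != expected_version:
        raise ValueError(f"expected release {expected_version!r}, found {found!r}")
    return release


# Hypothetical release content; the schema is an assumption of this sketch.
release = {
    "version": "0.1.0",
    "questions": [
        {
            "id": "q0001",
            "pillar": "winemaking",
            "question": "What does malolactic fermentation convert?",
            "answer": "Malic acid to lactic acid",
        },
    ],
}

# A run record stores which release it was scored against.
run_record = {
    "model": "example-model",  # placeholder name
    "release": release["version"],
    "release_sha256": fingerprint(release),
}
```

Storing the hash next to the version number means a later reader can confirm that a re-downloaded file is the same question set the run was scored on, even if a file was republished under the same version by mistake.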