OenoBench methodology

OenoBench is built by a four-stage, AI-driven pipeline that turns heterogeneous wine-domain sources into a balanced, low-bias evaluation set. Each stage is automated where automation is reliable and gated by humans where judgement is required. The summary below covers the MVP design; a fuller treatment (with prompts, model versions, and inter-rater statistics) will appear in the v1.0 paper.

  • Data collection. Curated source material is gathered across the four pillars (viticulture, winemaking, wine business, regions), with explicit coverage targets so that no single sub-domain dominates.
  • Question generation. Multiple frontier models propose candidate questions and answers from the source material. Using more than one model family is a deliberate bias-mitigation step.
  • AI validation. A separate validator pool checks each candidate for factual accuracy, ambiguity, and leakage from the generation prompt; failing items are dropped or rewritten.
  • Human review. A final human pass spot-checks samples per stratum, resolves disagreements between validators, and signs off on each release.
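
The automated part of the stages above can be sketched roughly as follows. This is a minimal illustration, not the OenoBench implementation: the `Candidate` fields, the `select_balanced` function, the per-pillar cap, and both validator checks are hypothetical stand-ins, and the real validators are model-based rather than string checks.

```python
from collections import Counter
from dataclasses import dataclass

# The four pillars named in the methodology.
PILLARS = ("viticulture", "winemaking", "wine business", "regions")


@dataclass(frozen=True)
class Candidate:
    question: str
    answer: str
    pillar: str
    generator: str  # hypothetical label for the model family that proposed it


def has_answer(c: Candidate) -> bool:
    """Toy validator: reject candidates with an empty answer."""
    return bool(c.answer.strip())


def no_answer_in_question(c: Candidate) -> bool:
    """Toy validator: reject questions that contain their own answer."""
    return c.answer.strip().lower() not in c.question.lower()


def select_balanced(candidates, validators, per_pillar):
    """Drop candidates that fail any validator, capping each pillar at per_pillar."""
    kept, counts = [], Counter()
    for c in candidates:
        if c.pillar not in PILLARS or counts[c.pillar] >= per_pillar:
            continue
        if all(check(c) for check in validators):
            kept.append(c)
            counts[c.pillar] += 1
    return kept
```

Under these assumptions, the per-pillar cap plays the role of the coverage targets, and the list returned by `select_balanced` is what would be handed to human reviewers for spot-checking.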

Each release is published as a versioned JSON file alongside the leaderboard, so historical runs remain reproducible as the question set evolves.
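
One way a versioned release file could support reproducibility is to store a content hash next to the version string, so a leaderboard run can confirm it used an unmodified question set. The sketch below is an assumption for illustration only: the field names `version`, `sha256`, and `questions`, and the functions `build_release` and `verify_release`, are hypothetical and do not describe the published file format.

```python
import hashlib
import json


def _digest(questions):
    """Hash a canonical JSON encoding of the question list."""
    payload = json.dumps(questions, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_release(version, questions):
    """Bundle questions with a version string and a content hash."""
    return {
        "version": version,
        "sha256": _digest(questions),
        "questions": questions,
    }


def verify_release(release):
    """Return True if the questions still match the recorded hash."""
    return _digest(release["questions"]) == release["sha256"]
```

A run would then record the `version` and `sha256` values it was evaluated against, and `verify_release` would fail if the file had been edited after publication.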