OenoBench

OenoBench is a wine-knowledge benchmark for large language models. The release_v1.2 corpus contains 3,266 multiple-choice questions spread across six domains — viticulture, winemaking, wine business, wine regions, grape varieties, and producers — and four difficulty tiers calibrated to the WSET / Court of Master Sommeliers ladder. Sixteen frontier and open-source model configurations have been evaluated against it, and every release ships as a versioned JSON dataset committed to this repository so historical runs remain reproducible.
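As a rough illustration of how a versioned JSON question set like this might be inspected, the sketch below tallies questions by domain and by difficulty tier. The record layout is an assumption made for this example: the field names `id`, `domain`, and `tier`, and the three inline sample records, are hypothetical and do not come from the published dataset.

```python
import json
from collections import Counter

# Hypothetical records: the field names "id", "domain" and "tier" are
# assumptions for illustration, not the published OenoBench schema.
SAMPLE = json.loads("""
[
  {"id": "q1", "domain": "viticulture", "tier": 1},
  {"id": "q2", "domain": "wine regions", "tier": 3},
  {"id": "q3", "domain": "viticulture", "tier": 3}
]
""")


def tally(questions, key):
    """Count questions by the value of a single field."""
    return Counter(q[key] for q in questions)


by_domain = tally(SAMPLE, "domain")
by_tier = tally(SAMPLE, "tier")

print(dict(by_domain))
print(dict(by_tier))
```

Run against a real release file, the same tally would be a quick sanity check that the per-domain and per-tier counts sum to the documented total for that version.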

The questions are produced by a multi-model generation pipeline with explicit bias-mitigation steps: no single model family disproportionately shapes the questions it will later be scored on, every fact must trace to an external source, and a 9-agent quality audit gates each release. Two companion pages document the system in detail:

Methodology — the four-stage pipeline used to build and validate the question set, and the bias-mitigation framework around multi-model generation.
Leaderboard — interactive model scores with filters by domain, difficulty, and closed-book vs contextual slice, plus dedicated views for reasoning lift, self-preference bias, and cost efficiency.
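The leaderboard filters can be thought of as slicing per-question results before averaging. The following is a minimal sketch of that idea, not the leaderboard's actual implementation: the result rows, the field names (`domain`, `tier`, `mode`, `correct`), and the `accuracy` helper are all hypothetical.

```python
# Hypothetical per-question results for one model; every field name and
# value here is an assumption made for illustration.
RESULTS = [
    {"domain": "winemaking", "tier": 2, "mode": "closed-book", "correct": True},
    {"domain": "winemaking", "tier": 2, "mode": "contextual", "correct": True},
    {"domain": "producers", "tier": 4, "mode": "closed-book", "correct": False},
    {"domain": "producers", "tier": 4, "mode": "closed-book", "correct": True},
]


def accuracy(results, **filters):
    """Fraction correct among rows matching every filter, or None if no rows match."""
    rows = [r for r in results if all(r[k] == v for k, v in filters.items())]
    if not rows:
        return None
    return sum(r["correct"] for r in rows) / len(rows)


closed_book = accuracy(RESULTS, mode="closed-book")
hard_producers = accuracy(RESULTS, domain="producers", tier=4)
no_match = accuracy(RESULTS, domain="viticulture")
```

Returning `None` for an empty slice, rather than 0, keeps "no questions matched" distinct from "every answer was wrong".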