OenoBench Leaderboard
OenoBench evaluates 16 model configurations across 3,266 multiple-choice questions covering viticulture, winemaking, wine business, the world’s wine regions, grape varieties, and producers. Use the filters to slice the leaderboard by domain, difficulty tier, or whether the question is answerable from parametric knowledge alone (closed-book) or requires contextual reasoning. The other tabs surface dedicated analyses from the paper: reasoning-mode lift, self-preference bias, and cost efficiency.
Filters: Wine domain, Question difficulty, Closed-book vs contextual
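As a rough illustration of what slicing by these filters amounts to, the sketch below computes one model's percent-correct score over a filtered subset of per-question results. The `Result` record, its field names, and the toy rows are all hypothetical; the actual OenoBench data format is not described on this page.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """One model's answer to one question (hypothetical schema)."""
    model: str
    domain: str        # e.g. "viticulture"
    difficulty: str    # hypothetical tier label
    closed_book: bool  # True if answerable from parametric knowledge alone
    correct: bool


def accuracy(results, model, *, domain=None, difficulty=None, closed_book=None):
    """Percent correct for `model` over questions matching the filters.

    A filter left as None is not applied. Returns None if nothing matches.
    """
    subset = [
        r for r in results
        if r.model == model
        and (domain is None or r.domain == domain)
        and (difficulty is None or r.difficulty == difficulty)
        and (closed_book is None or r.closed_book == closed_book)
    ]
    if not subset:
        return None
    return 100.0 * sum(r.correct for r in subset) / len(subset)


# Toy rows invented purely for illustration (not OenoBench data).
rows = [
    Result("model-a", "viticulture", "easy", True, True),
    Result("model-a", "viticulture", "hard", False, False),
    Result("model-a", "winemaking", "easy", True, True),
]

print(accuracy(rows, "model-a", domain="viticulture"))
```

Leaving a filter unset reproduces the unfiltered score, which is presumably how the default leaderboard view behaves.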
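| Rank | Model                      | Provider   | Reasoning mode | Score |
|------|----------------------------|------------|----------------|-------|
| 1    | o3                         | OpenAI     | effort         | 83.6% |
| 2    | GPT-5                      | OpenAI     |                | 82.8% |
| 3    | Gemini 2.5 Pro (thinking)  | Google     | thinking       | 82.6% |
| 4    | Gemini 2.5 Pro             | Google     |                | 81.7% |
| 5    | Claude Opus 4.7            | Anthropic  |                | 81.0% |
| 6    | Claude Opus 4.7 (thinking) | Anthropic  | thinking       | 81.0% |
| 7    | GPT-5 mini                 | OpenAI     |                | 78.4% |
| 8    | DeepSeek-R1                | DeepSeek   | thinking       | 77.1% |
| 9    | Gemini 2.5 Flash           | Google     |                | 75.1% |
| 10   | DeepSeek-V3                | DeepSeek   |                | 70.3% |
| 11   | Mistral Large 2411         | Mistral AI |                | 69.1% |
| 12   | Qwen 2.5 72B               | Alibaba    |                | 67.4% |
| 13   | Llama 3.3 70B              | Meta       |                | 67.1% |
| 14   | Llama 3.1 8B               | Meta       |                | 60.5% |
| 15   | Qwen 2.5 7B                | Alibaba    |                | 57.0% |
| 16   | Claude Haiku 4.5           | Anthropic  |                | 53.3% |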
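Two models appear in the overall ranking both with and without a thinking configuration, so a simple reasoning-mode lift can be read off directly. The sketch below assumes "lift" means the thinking score minus the base score in percentage points; the paper's own definition may differ, and the function name and the `" (thinking)"` suffix convention are illustrative choices.

```python
# Overall scores copied from the leaderboard (percent).
SCORES = {
    "Gemini 2.5 Pro": 81.7,
    "Gemini 2.5 Pro (thinking)": 82.6,
    "Claude Opus 4.7": 81.0,
    "Claude Opus 4.7 (thinking)": 81.0,
}


def thinking_lift(scores, suffix=" (thinking)"):
    """Map each base model to (thinking score - base score), in points.

    Assumption: lift is a plain score difference; only models that have
    both a base entry and a suffixed entry are paired.
    """
    lifts = {}
    for name, score in scores.items():
        if name.endswith(suffix):
            base = name[: -len(suffix)]
            if base in scores:
                # Round to one decimal to avoid floating-point noise.
                lifts[base] = round(score - scores[base], 1)
    return lifts


print(thinking_lift(SCORES))
# {'Gemini 2.5 Pro': 0.9, 'Claude Opus 4.7': 0.0}
```

On these overall numbers the difference is small for both pairs; the dedicated reasoning-mode tab is the place to look for per-domain or per-difficulty breakdowns.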
For the methodology behind these scores — corpus construction, multi-model generation, AI validation, and the bias-mitigation framework — see the methodology page.