OenoBench Leaderboard

OenoBench evaluates 16 model configurations across 3,266 multiple-choice questions covering viticulture, winemaking, wine business, the world’s wine regions, grape varieties, and producers. Use the filters to slice the leaderboard by domain, difficulty tier, or whether the question is answerable from parametric knowledge alone (closed-book) or requires contextual reasoning. The other tabs surface dedicated analyses from the paper: reasoning-mode lift, self-preference bias, and cost efficiency.

OenoBench · v2026-05-04 · 16 model configurations · 3,266 questions across six wine domains

Filters: Wine domain · Question difficulty · Closed-book vs contextual
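As a rough illustration of what slicing by these filters amounts to, the sketch below recomputes per-model accuracy over only the questions that pass a filter. The record layout (`Result` and its field names), the tier labels, and the toy rows are all hypothetical; they are not the OenoBench data format or real results.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Result:
    """One graded answer. All field names here are hypothetical."""
    model: str
    domain: str        # e.g. "viticulture"
    difficulty: str    # tier label; "easy"/"hard" are assumed names
    closed_book: bool  # True if answerable from parametric knowledge alone
    correct: bool


def accuracy(results: Iterable[Result],
             domain: Optional[str] = None,
             difficulty: Optional[str] = None,
             closed_book: Optional[bool] = None) -> dict:
    """Per-model accuracy (percent) over the rows that pass every filter."""
    totals = {}  # model -> [number correct, number seen]
    for r in results:
        if domain is not None and r.domain != domain:
            continue
        if difficulty is not None and r.difficulty != difficulty:
            continue
        if closed_book is not None and r.closed_book != closed_book:
            continue
        counts = totals.setdefault(r.model, [0, 0])
        counts[0] += int(r.correct)
        counts[1] += 1
    return {m: 100.0 * c / n for m, (c, n) in totals.items()}


# Toy rows for illustration only; not real benchmark results.
rows = [
    Result("model-a", "viticulture", "easy", True, True),
    Result("model-a", "viticulture", "hard", False, False),
    Result("model-a", "winemaking", "easy", True, True),
    Result("model-b", "viticulture", "easy", True, False),
    Result("model-b", "viticulture", "hard", False, True),
]

print(accuracy(rows, domain="viticulture"))
print(accuracy(rows, closed_book=True))
```

Leaving a filter as `None` means "no restriction", so combining filters simply narrows the set of questions each score is computed over.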

Rank  Model                       Provider    Tag       Score
1     o3                          OpenAI      effort    83.6%
2     GPT-5                       OpenAI      -         82.8%
3     Gemini 2.5 Pro (thinking)   Google      thinking  82.6%
4     Gemini 2.5 Pro              Google      -         81.7%
5     Claude Opus 4.7             Anthropic   -         81.0%
6     Claude Opus 4.7 (thinking)  Anthropic   thinking  81.0%
7     GPT-5 mini                  OpenAI      -         78.4%
8     DeepSeek-R1                 DeepSeek    thinking  77.1%
9     Gemini 2.5 Flash            Google      -         75.1%
10    DeepSeek-V3                 DeepSeek    -         70.3%
11    Mistral Large 2411          Mistral AI  -         69.1%
12    Qwen 2.5 72B                Alibaba     -         67.4%
13    Llama 3.3 70B               Meta        -         67.1%
14    Llama 3.1 8B                Meta        -         60.5%
15    Qwen 2.5 7B                 Alibaba     -         57.0%
16    Claude Haiku 4.5            Anthropic   -         53.3%
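For the two models listed both with and without a thinking configuration, a simple overall "lift" can be read straight off the scores above. The sketch below does that subtraction. It assumes lift means the difference in overall score between the thinking and non-thinking configuration of the same model; the paper's reasoning-mode analysis may define it differently (for example, per domain or per question). The function name `thinking_lift` is illustrative.

```python
# Overall scores copied from the leaderboard above (percent).
overall = {
    "Gemini 2.5 Pro": 81.7,
    "Gemini 2.5 Pro (thinking)": 82.6,
    "Claude Opus 4.7": 81.0,
    "Claude Opus 4.7 (thinking)": 81.0,
}

SUFFIX = " (thinking)"


def thinking_lift(scores):
    """Map each base model to (thinking score - base score), in points.

    Only models that appear in both configurations are included.
    """
    lifts = {}
    for name, score in scores.items():
        if name.endswith(SUFFIX):
            base = name[: -len(SUFFIX)]
            if base in scores:
                # Round to one decimal, matching the leaderboard precision.
                lifts[base] = round(score - scores[base], 1)
    return lifts


for model, delta in thinking_lift(overall).items():
    print(f"{model}: {delta:+.1f} points")
```

At the leaderboard's one-decimal precision this gives +0.9 points for Gemini 2.5 Pro and no change for Claude Opus 4.7. DeepSeek-R1 and o3 are left out because the table lists no non-thinking configuration of the same model to compare against.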

For the methodology behind these scores — corpus construction, multi-model generation, AI validation, and the bias-mitigation framework — see the methodology page.