OenoBench Leaderboard

OenoBench evaluates 16 model configurations across 3,266 multiple-choice questions covering viticulture, winemaking, wine business, the world’s wine regions, grape varieties, and producers. Use the filters to slice the leaderboard by domain, difficulty tier, or whether the question is answerable from parametric knowledge alone (closed-book) or requires contextual reasoning. The other tabs surface dedicated analyses from the paper: reasoning-mode lift, self-preference bias, and cost efficiency.
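The slicing that the filters perform can be sketched as a per-model accuracy computation over per-question records. This is a minimal illustration, not the leaderboard's actual code: the `Result` record, its field names, and the `accuracy` helper are hypothetical, and the toy rows below are made up for the example rather than taken from OenoBench.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One model's answer to one question (hypothetical schema)."""
    model: str
    domain: str        # e.g. "viticulture", "winemaking"
    difficulty: str    # difficulty tier label
    closed_book: bool  # True if answerable from parametric knowledge alone
    correct: bool


def accuracy(results, *, domain=None, difficulty=None, closed_book=None):
    """Return per-model accuracy over the records that match every given filter.

    A filter left as None is not applied.
    """
    totals = {}  # model -> (correct answers, questions seen)
    for r in results:
        if domain is not None and r.domain != domain:
            continue
        if difficulty is not None and r.difficulty != difficulty:
            continue
        if closed_book is not None and r.closed_book != closed_book:
            continue
        hits, seen = totals.get(r.model, (0, 0))
        totals[r.model] = (hits + int(r.correct), seen + 1)
    return {model: hits / seen for model, (hits, seen) in totals.items()}


# Toy records for illustration only; these are NOT OenoBench results.
toy = [
    Result("model-a", "viticulture", "easy", True, True),
    Result("model-a", "viticulture", "hard", False, False),
    Result("model-a", "winemaking", "easy", True, True),
    Result("model-b", "viticulture", "easy", True, False),
    Result("model-b", "viticulture", "hard", False, True),
]

overall = accuracy(toy)
viticulture_only = accuracy(toy, domain="viticulture")
closed_book_only = accuracy(toy, closed_book=True)
```

Combining filters (for example `accuracy(toy, domain="viticulture", difficulty="hard")`) narrows the slice further, which is why a model's rank can change from one filtered view to another.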

OenoBench · v2026-05-04 · 16 model configurations · 3,266 questions across six wine domains
Filters: wine domain · question difficulty · closed-book vs contextual
  1. o3 · OpenAI · effort · 83.6%
  2. GPT-5 · OpenAI · 82.8%
  3. Gemini 2.5 Pro (thinking) · Google · thinking · 82.6%
  4. Gemini 2.5 Pro · Google · 81.7%
  5. Claude Opus 4.7 · Anthropic · 81.0%
  6. Claude Opus 4.7 (thinking) · Anthropic · thinking · 81.0%
  7. GPT-5 mini · OpenAI · 78.4%
  8. DeepSeek-R1 · DeepSeek · thinking · 77.1%
  9. Gemini 2.5 Flash · Google · 75.1%
  10. DeepSeek-V3 · DeepSeek · 70.3%
  11. Mistral Large 2411 · Mistral AI · 69.1%
  12. Qwen 2.5 72B · Alibaba · 67.4%
  13. Llama 3.3 70B · Meta · 67.1%
  14. Llama 3.1 8B · Meta · 60.5%
  15. Qwen 2.5 7B · Alibaba · 57.0%
  16. Claude Haiku 4.5 · Anthropic · 53.3%
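As a rough illustration of the reasoning-mode comparison, the sketch below computes the lift for the two models that appear on the leaderboard in both a standard and a thinking configuration. It assumes that lift means the thinking score minus the standard score, in percentage points; the paper's exact definition is not given on this page, and the `SCORES` layout and `reasoning_lift` helper are hypothetical. The scores themselves are copied from the leaderboard above.

```python
# Overall scores (percent) copied from the leaderboard.
# Keys are (model name, thinking enabled); this layout is an assumption.
SCORES = {
    ("Gemini 2.5 Pro", False): 81.7,
    ("Gemini 2.5 Pro", True): 82.6,
    ("Claude Opus 4.7", False): 81.0,
    ("Claude Opus 4.7", True): 81.0,
}


def reasoning_lift(scores):
    """Return thinking-minus-standard score per model, in percentage points.

    Only models that have both configurations are included.
    """
    lifts = {}
    for (model, thinking), score in scores.items():
        if thinking and (model, False) in scores:
            # Round to one decimal to match the leaderboard's precision.
            lifts[model] = round(score - scores[(model, False)], 1)
    return lifts


lifts = reasoning_lift(SCORES)
for model, lift in lifts.items():
    print(f"{model}: {lift:+.1f} points")
```

DeepSeek-R1 and DeepSeek-V3 are left out on purpose: they are listed as separate models rather than two modes of one model, so their gap is not a like-for-like lift under this assumption.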

For the methodology behind these scores — corpus construction, multi-model generation, AI validation, and the bias-mitigation framework — see the methodology page.