Generated: 2026-02-27 00:27
| # | Model | HumanEval | HumanEval+ | vs published |
|---|---|---|---|---|
| 1 | Claude Opus 4.6 | 98.2% | 95.1% | +4.0pp |
| 2 | Qwen3.5-122B-A10B UD-Q4_K_XL | 97.6% | 94.5% | — |
| 3 | Qwen3.5-27B UD-Q6_K_XL (dense) | 98.2% | 94.5% | — |
| 4 | Claude Opus 4.6 (thinking) | 99.4% | 93.9% | +5.2pp |
| 5 | Qwen3-Next-80B-A3B UD-Q5_K_XL | 98.2% | 93.9% | — |
| 6 | Qwen3-Coder-Next UD-Q5_K_XL | 93.9% | 90.9% | -0.2pp |
| 7 | Qwen3.5-35B-A3B UD-Q6_K_XL | 95.1% | 90.9% | — |
| 8 | bench-qwen3-coder-ud-q6 | 92.1% | 89.0% | — |
| 9 | GLM-4.7 Flash Q8_0 * | 89.0% | 87.2% | +2.0pp |
| 10 | GPT-OSS 120B F16 | 93.3% | 87.2% | +5.0pp |
| 11 | GLM-4.7 Flash Q4_K_M * | 87.8% | 83.5% | +0.8pp |
| # | Model | HumanEval | Source |
|---|---|---|---|
| 1 | Claude Opus 4.6 (thinking) | 99.4% | Local benchmark |
| 2 | Claude Opus 4.6 | 98.2% | Local benchmark |
| 3 | Qwen3-Next-80B-A3B UD-Q5_K_XL | 98.2% | Local benchmark |
| 4 | Qwen3.5-27B UD-Q6_K_XL (dense) | 98.2% | Local benchmark |
| 5 | Qwen3.5-122B-A10B UD-Q4_K_XL | 97.6% | Local benchmark |
| 6 | OpenAI O1 Preview | 96.3% | EvalPlus leaderboard |
| 7 | Qwen3.5-35B-A3B UD-Q6_K_XL | 95.1% | Local benchmark |
| 8 | Claude Opus 4.5 | 94.2% | zoer.ai Jan 2026 benchmark |
| 9 | Qwen3-Coder-Next (FP16, official) | 94.1% | Model card |
| 10 | Qwen3-Coder-Next UD-Q5_K_XL | 93.9% | Local benchmark |
| 11 | GPT-OSS 120B F16 | 93.3% | Local benchmark |
| 12 | bench-qwen3-coder-ud-q6 | 92.1% | Local benchmark |
| 13 | Claude 3.5 Sonnet | 92.0% | EvalPlus leaderboard |
| 14 | GPT-5.2 Codex | 91.7% | zoer.ai Jan 2026 benchmark |
| 15 | GPT-4o | 90.2% | EvalPlus leaderboard |
| 16 | Llama 3.1 405B | 89.0% | Meta official eval (0-shot, pass@1) |
| 17 | GLM-4.7 Flash Q8_0 | 89.0% | Local benchmark |
| 18 | GPT-OSS 120B (official) | 88.3% | Model card |
| 19 | GLM-4.7 Flash Q4_K_M | 87.8% | Local benchmark |
| 20 | GLM-4.7 (full, not Flash) | 87.0% | zoer.ai |
| 21 | Codestral 25.01 | 86.6% | Mistral AI |
| 22 | Gemini 1.5 Pro | 84.1% | Gemini 1.5 technical report (arxiv:2403.05530) |
| # | Model | HumanEval+ | Source |
|---|---|---|---|
| 1 | Claude Opus 4.6 | 95.1% | Local benchmark |
| 2 | Qwen3.5-122B-A10B UD-Q4_K_XL | 94.5% | Local benchmark |
| 3 | Qwen3.5-27B UD-Q6_K_XL (dense) | 94.5% | Local benchmark |
| 4 | Claude Opus 4.6 (thinking) | 93.9% | Local benchmark |
| 5 | Qwen3-Next-80B-A3B UD-Q5_K_XL | 93.9% | Local benchmark |
| 6 | Qwen3-Coder-Next UD-Q5_K_XL | 90.9% | Local benchmark |
| 7 | Qwen3.5-35B-A3B UD-Q6_K_XL | 90.9% | Local benchmark |
| 8 | OpenAI O1 Preview | 89.0% | EvalPlus leaderboard |
| 9 | bench-qwen3-coder-ud-q6 | 89.0% | Local benchmark |
| 10 | GPT-4o | 87.2% | EvalPlus leaderboard |
| 11 | Qwen2.5-Coder-32B-Instruct | 87.2% | EvalPlus leaderboard |
| 12 | GLM-4.7 Flash Q8_0 | 87.2% | Local benchmark |
| 13 | GPT-OSS 120B F16 | 87.2% | Local benchmark |
| 14 | DeepSeek-V3 / GPT-4-Turbo | 86.6% | EvalPlus leaderboard |
| 15 | GLM-4.7 Flash Q4_K_M | 83.5% | Local benchmark |
| 16 | Claude 3.5 Sonnet | 81.7% | EvalPlus leaderboard |
- All local results use greedy decoding (temperature=0, max_tokens=4096)
- HumanEval+ uses 80x more tests than standard HumanEval (stricter, scores are typically 3-8% lower)
- Local benchmarks produce both HumanEval and HumanEval+ scores
- "vs published" shows difference in HumanEval base score vs the closest official published score
- Many reference models only have HumanEval (not HumanEval+) published — direct comparison on HumanEval+ is limited
- Local scores may differ from published due to quantization, prompt template, and max_tokens differences
- GLM-4.7 Flash is a reasoning model — benchmarked with
--reasoning-format noneto include thinking in output. Scores may be less reliable: the model spends tokens on chain-of-thought reasoning before producing code, and the code extractor must parse it from mixed reasoning+code output. The Q4 > Q8 score inversion is likely caused by this - Claude Opus 4.6 was tested via Claude Code (Max subscription) using a custom agent that solves each problem from the prompt alone — no code execution, no internet, no tools. "vs published" compares against the published Opus 4.5 score (no Opus 4.6 reference available yet)