SWE-bench Alternatives

SWE-bench is a widely used AI coding benchmark built on real GitHub issues. Here are 12 other model benchmarks worth considering as SWE-bench alternatives.

LMArena — Blind-vote LLM battle arena behind the community leaderboard
OpenCompass — Model Benchmark
SuperCLUE — Model Benchmark
C-Eval — Model Benchmark
CMMLU — Model Benchmark
FlagEval — Model Benchmark
AGI-Eval — Model Benchmark
MMLU — Model Benchmark
Open LLM Leaderboard — Model Benchmark
HELM — Stanford's holistic evaluation framework for language models
MMBench — Model Benchmark
MagicArena — Model Benchmark