Accuracy Evaluations
lm-evaluation-harness results ingested from CI runs. Click a leaderboard row to drill into per-sample answers (correct vs incorrect).