Accuracy Evaluations

Results from lm-evaluation-harness, ingested from CI runs. Click a leaderboard row to drill into per-sample answers and see which were correct or incorrect.
