An agentic benchmark for evaluating how well LLMs generate and refine heuristics for real-world combinatorial optimization tasks.
Yield and QYI values are weighted across the 9 problem sets by test case count. Quality is back-calculated from weighted QYI and weighted Yield via QYI = 2QY/(Q+Y).
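The back-calculation described above amounts to solving QYI = 2QY/(Q+Y) for Q, which gives Q = QYI·Y / (2Y − QYI). Below is a minimal Python sketch of the weighting and the inversion. The function names and the sample numbers are illustrative assumptions, not taken from the benchmark's code or results.

```python
def weighted_mean(values, weights):
    """Average `values`, weighting each entry by its test case count."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total


def quality_from_qyi(qyi, yield_):
    """Invert QYI = 2*Q*Y / (Q + Y) to recover Q from QYI and Y."""
    denom = 2 * yield_ - qyi
    if denom <= 0:
        raise ValueError("Q is undefined unless QYI < 2 * Yield")
    return qyi * yield_ / denom


# Hypothetical per-problem scores and test case counts (illustration only).
per_problem_yield = [1.0, 0.8, 0.5]
per_problem_qyi = [0.6, 0.7, 0.4]
test_case_counts = [10, 20, 30]

overall_yield = weighted_mean(per_problem_yield, test_case_counts)
overall_qyi = weighted_mean(per_problem_qyi, test_case_counts)
overall_quality = quality_from_qyi(overall_qyi, overall_yield)
print(overall_yield, overall_qyi, overall_quality)
```

Because Quality is derived from the weighted aggregates rather than averaged directly, it will generally differ from a test-case-weighted mean of the per-problem Quality values.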
Total API cost estimated from official list pricing × token counts across all 9 problems (5 iterations, T=0). Cost axis is on a logarithmic scale.
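As a rough illustration of the pricing × token count estimate, the sketch below sums the cost over several problems. It assumes prices are quoted per million tokens with separate input and output rates, which the page does not state. The function name, prices, and token counts are hypothetical.

```python
def api_cost_usd(input_tokens, output_tokens,
                 input_price_per_mtok, output_price_per_mtok):
    """Cost in USD, assuming prices are quoted per one million tokens."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000


# Hypothetical (input_tokens, output_tokens) usage for three problems.
usage_per_problem = [(1_200_000, 300_000), (800_000, 250_000), (500_000, 100_000)]

# Hypothetical list prices in USD per million tokens.
INPUT_PRICE = 2.0
OUTPUT_PRICE = 8.0

total_cost = sum(
    api_cost_usd(inp, out, INPUT_PRICE, OUTPUT_PRICE)
    for inp, out in usage_per_problem
)
print(f"Estimated total cost: ${total_cost:.2f}")
```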
| # | Model | QYI | Yield | Quality | Cost ($) | Released |
|---|---|---|---|---|---|---|
* GPT-o4-mini:high is measured at T=1 (o-series only supports T=1); its values appear in all temperature views unchanged.
| Model | QYI | Yield | Quality | Solver |
|---|---|---|---|---|
Per-problem QYI, Yield, and Quality for all 21 evaluated models. The original 9 models ran 10 iterations (T=0); the 12 more recent models ran 5 iterations (T=0).
| Model | Stage III (Validity) @10 | Stage III @5 | Stage III @1 | Stage II (Output Gen.) @10 | Stage II @5 | Stage II @1 | Stage I (Execution) @10 | Stage I @5 | Stage I @1 |
|---|---|---|---|---|---|---|---|---|---|
solves@i: fraction of test cases solved at Stage s within i attempts. Stage I = no crash, Stage II = parseable output, Stage III = valid solution (≡ Yield). A dash in a cell indicates that the model was evaluated with 5 iterations only (no @10 data); the first 9 models ran 10 iterations.
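The solves@i definition above can be sketched in Python as follows. The sketch assumes that, for a given stage, each test case is recorded by the 1-based iteration at which it first passed that stage, or `None` if it never passed. That data layout and the function name are illustrative assumptions, not the benchmark's actual format.

```python
def solves_at(first_pass_iterations, i):
    """Fraction of test cases that passed the stage within the first i attempts.

    first_pass_iterations: one entry per test case, holding the 1-based
    iteration of the first pass, or None if the stage was never passed.
    """
    if not first_pass_iterations:
        raise ValueError("need at least one test case")
    solved = sum(
        1 for it in first_pass_iterations if it is not None and it <= i
    )
    return solved / len(first_pass_iterations)


# Hypothetical Stage III record for four test cases.
stage3_first_pass = [1, 3, None, 7]

for attempts in (1, 5, 10):
    print(f"solves@{attempts} = {solves_at(stage3_first_pass, attempts)}")
```

Under this reading, solves@i cannot decrease as i grows, and the Stage III value at the final iteration is the Yield.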
@article{chen-heurigym-iclr2026,
title={HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization},
author={Hongzheng Chen and Yingheng Wang and Yaohui Cai and Hins Hu and Jiajie Li
and Shirley Huang and Chenhui Deng and Rongjian Liang and Shufeng Kong
and Haoxing Ren and Samitha Samaranayake and Carla P. Gomes and Zhiru Zhang},
journal={International Conference on Learning Representations (ICLR)},
year={2026}
}