HeuriGym — LLM Heuristics Benchmark

ICLR 2026 Benchmark

An agentic benchmark for evaluating how well LLMs generate and refine heuristics for real-world combinatorial optimization tasks.

Hongzheng Chen^*, Yingheng Wang^*, Yaohui Cai^*, Hins Hu^*, Jiajie Li^*, Shirley Huang, Chenhui Deng, Rongjian Liang, Shufeng Kong, Haoxing Ren, Samitha Samaranayake, Carla P. Gomes, Zhiru Zhang

^* Core contributor

30 Models Evaluated

9 Problems

4 Domains

218 Test Cases


Overall Performance

Yield and QYI are averaged across the 9 problem sets, weighted by each set's test case count. Quality is then back-calculated from the weighted QYI and weighted Yield by inverting QYI = 2QY/(Q+Y), where Q is Quality and Y is Yield, which gives Q = QYI·Y/(2Y − QYI).
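The weighting and back-calculation described above can be sketched as follows. This is a minimal illustration, not the benchmark's own code: the function names are hypothetical, and the per-problem numbers are made-up placeholders rather than reported results.

```python
def weighted_mean(values, weights):
    """Average `values`, weighting each by the matching entry of `weights`."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total


def qyi_from(quality, yield_):
    """Harmonic mean of Quality and Yield: QYI = 2QY / (Q + Y)."""
    if quality + yield_ == 0:
        return 0.0
    return 2 * quality * yield_ / (quality + yield_)


def quality_from_qyi(qyi, yield_):
    """Invert QYI = 2QY / (Q + Y) for Q: Q = QYI * Y / (2Y - QYI)."""
    denom = 2 * yield_ - qyi
    if denom <= 0:
        raise ValueError("Quality is undefined unless QYI < 2 * Yield")
    return qyi * yield_ / denom


# Placeholder data: (test case count, Yield, QYI) for three hypothetical problems.
problems = [(20, 0.90, 0.70), (30, 0.60, 0.50), (10, 1.00, 0.80)]
counts = [p[0] for p in problems]

w_yield = weighted_mean([p[1] for p in problems], counts)
w_qyi = weighted_mean([p[2] for p in problems], counts)
w_quality = quality_from_qyi(w_qyi, w_yield)
print(f"Yield={w_yield:.3f}  QYI={w_qyi:.3f}  Quality={w_quality:.3f}")
```

Because Quality is derived from the two weighted aggregates, it is generally not equal to a test-case-weighted average of per-problem Quality values.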

QYI vs. API Cost

Total API cost is estimated as official list pricing × token counts, summed across all 9 problems (5 iterations, T=0). The cost axis uses a logarithmic scale.
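A cost estimate of this kind can be sketched as below. The split into separate input and output prices per million tokens is an assumption about how list pricing is applied, and all prices and token counts here are hypothetical placeholders, not figures from the benchmark.

```python
def estimate_cost_usd(input_tokens, output_tokens,
                      price_in_per_mtok, price_out_per_mtok):
    """List price (USD per million tokens) times token counts."""
    return (input_tokens * price_in_per_mtok
            + output_tokens * price_out_per_mtok) / 1_000_000


# Hypothetical per-problem token usage for one model: (input, output).
usage = [(2_000_000, 500_000), (1_000_000, 250_000)]

# Hypothetical list prices in USD per million tokens.
PRICE_IN, PRICE_OUT = 1.0, 4.0

total = sum(estimate_cost_usd(i, o, PRICE_IN, PRICE_OUT) for i, o in usage)
print(f"Estimated total cost: ${total:.2f}")
```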

Leaderboard (T = 0)

#	Model	QYI	Yield	Quality	Cost ($)	Released

* GPT-o4-mini:high is measured at T=1 (o-series only supports T=1); its values appear in all temperature views unchanged.

Per-Problem Results

EDA

Compilers

Computational Biology

Logistics

QYI by Model — Operator Scheduling

Per-problem QYI for all 30 evaluated models, sorted by QYI. The original 9 models ran 10 iterations (T=0); the 21 more recent models ran 5 iterations (T=0).

Solve Metrics (T = 0)

Model	Stage III (Validity)			Stage II (Output Gen.)			Stage I (Execution)
	@10	@5	@1	@10	@5	@1	@10	@5	@1

solve_s@i is the fraction of test cases solved at Stage s within i attempts. Stage I = no crash, Stage II = parseable output, Stage III = valid solution (≡ Yield). A dash in an @10 column indicates that the model was evaluated with 5 iterations only (no @10 data); the first 9 models ran 10 iterations.
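The solve_s@i definition can be illustrated with a short sketch. It rests on two assumptions not stated on this page: that the stages are cumulative (passing Stage III implies passing Stages II and I), and that "within i attempts" means any of the first i iterations. The function name and the data layout are hypothetical.

```python
def solve_at(stage_reached, stage, attempts):
    """Fraction of test cases that reach `stage` in any of the first `attempts` tries.

    stage_reached[c][k] is the highest stage (0-3) that test case c
    passed on attempt k; 0 means the program crashed.
    """
    if not stage_reached:
        return 0.0
    solved = sum(
        1 for per_case in stage_reached
        if any(reached >= stage for reached in per_case[:attempts])
    )
    return solved / len(stage_reached)


# Made-up results for four test cases (not benchmark data).
runs = [
    [0, 1, 3],  # crashes, then runs, then produces a valid solution
    [2, 2, 2],  # parseable output every time, never valid
    [0, 0, 0],  # always crashes
    [3],        # valid on the first attempt
]

print(solve_at(runs, stage=3, attempts=1))  # Stage III within 1 attempt
print(solve_at(runs, stage=3, attempts=3))  # Stage III within 3 attempts
print(solve_at(runs, stage=1, attempts=3))  # Stage I within 3 attempts
```

Under these assumptions, solve_s@i can only stay level or grow as i increases, and it can only stay level or shrink as s moves from Stage I to Stage III.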

Citation

@article{chen-heurigym-iclr2026,
  title={HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization},
  author={Hongzheng Chen and Yingheng Wang and Yaohui Cai and Hins Hu and Jiajie Li
          and Shirley Huang and Chenhui Deng and Rongjian Liang and Shufeng Kong
          and Haoxing Ren and Samitha Samaranayake and Carla P. Gomes and Zhiru Zhang},
  journal={International Conference on Learning Representations (ICLR)},
  year={2026}
}