HeuriGym — LLM Heuristics Benchmark

ICLR 2026 Benchmark

An agentic benchmark for evaluating how well LLMs generate and refine heuristics for real-world combinatorial optimization tasks.

Hongzheng Chen^*, Yingheng Wang^*, Yaohui Cai^*, Hins Hu^*, Jiajie Li^*, Shirley Huang, Chenhui Deng, Rongjian Liang, Shufeng Kong, Haoxing Ren, Samitha Samaranayake, Carla P. Gomes, Zhiru Zhang

^* Core contributor

23 models evaluated, 9 problems, 218 test cases.


Overall Performance

Yield and QYI values are weighted across the 9 problem sets by test case count. Quality is back-calculated from weighted QYI and weighted Yield via QYI = 2QY/(Q+Y).
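The aggregation described above can be sketched in a few lines. This is a minimal illustration rather than the benchmark's own code: the function names and the sample numbers are hypothetical, and only the weighting by test case count and the relation QYI = 2QY/(Q+Y) come from the text. Solving that relation for Q gives Q = QYI·Y / (2Y − QYI).

```python
def weighted_mean(values, weights):
    """Average `values`, weighting each entry by its problem's test case count."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total


def quality_from_qyi(qyi, yield_):
    """Invert QYI = 2*Q*Y / (Q + Y) for Q, i.e. Q = QYI*Y / (2*Y - QYI)."""
    denom = 2 * yield_ - qyi
    if denom <= 0:
        # Covers Yield = 0, where Quality is undefined.
        raise ValueError("QYI must be smaller than 2 * Yield")
    return qyi * yield_ / denom


# Hypothetical per-problem numbers for one model (not benchmark data).
test_cases = [10, 30]
yields = [1.0, 0.5]
qyis = [0.8, 0.4]

w_yield = weighted_mean(yields, test_cases)  # weighted Yield
w_qyi = weighted_mean(qyis, test_cases)      # weighted QYI
quality = quality_from_qyi(w_qyi, w_yield)   # back-calculated Quality
```

Because Quality is recovered from the two weighted aggregates, it is generally not equal to a test-case-weighted mean of the per-problem Quality values.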

QYI vs. API Cost

Total API cost estimated from official list pricing × token counts across all 9 problems (5 iterations, T=0). Cost axis is on a logarithmic scale.
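A rough sketch of such a cost estimate follows. It assumes, beyond what the text states, that input and output tokens are billed at separate per-million-token rates; the prices, token counts, and the `api_cost` helper are all hypothetical placeholders, not values used by the benchmark.

```python
def api_cost(input_tokens, output_tokens, price_in_per_m, price_out_per_m):
    """Dollar cost from token counts and per-million-token list prices."""
    return (input_tokens * price_in_per_m + output_tokens * price_out_per_m) / 1_000_000


# Hypothetical usage log: (input_tokens, output_tokens) summed over all iterations of each problem.
usage = [(120_000, 40_000), (80_000, 20_000)]

# Hypothetical list prices in dollars per million tokens.
total_cost = sum(
    api_cost(tokens_in, tokens_out, price_in_per_m=2.0, price_out_per_m=8.0)
    for tokens_in, tokens_out in usage
)
```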

Leaderboard (T = 0)

#	Model	QYI	Yield	Quality	Cost ($)	Released

* GPT-o4-mini:high is measured at T=1 (o-series only supports T=1); its values appear in all temperature views unchanged.

Per-Problem Results

Problem domains: EDA, Compilers, Computational Biology, Logistics

QYI by Model — Operator Scheduling

Model	QYI	Yield	Quality	Solver

Per-problem QYI, Yield, and Quality for all 23 evaluated models. The original 9 models ran 10 iterations (T=0); the 14 more recent models ran 5 iterations (T=0).

Solve Metrics (T = 0)

Model	Stage III (Validity) @10	Stage III @5	Stage III @1	Stage II (Output Gen.) @10	Stage II @5	Stage II @1	Stage I (Execution) @10	Stage I @5	Stage I @1

solve_s@i: fraction of test cases solved at Stage s within i attempts. Stage I = no crash, Stage II = parseable output, Stage III = valid solution (≡ Yield). A "—" entry indicates that the model was evaluated with 5 iterations only, so no @10 data exists; the first 9 models ran 10 iterations.
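One way to compute solve_s@i from raw attempt logs is sketched below. It rests on two assumptions not confirmed by the text: that "within i attempts" means any of the first i iterations counts, and that the stages are cumulative, so a valid solution (Stage III) also passes Stages I and II. The `solve_at` helper and the sample data are hypothetical.

```python
def solve_at(stages_by_case, s, i):
    """Fraction of test cases whose first i attempts include one reaching stage >= s."""
    solved = sum(
        1
        for attempts in stages_by_case.values()
        if any(stage >= s for stage in attempts[:i])
    )
    return solved / len(stages_by_case)


# Hypothetical log: highest stage passed on each attempt
# (0 = crashed, 1 = ran without crashing, 2 = parseable output, 3 = valid solution).
stages_by_case = {
    "case_a": [1, 3, 3],
    "case_b": [0, 2, 2],
    "case_c": [0, 0, 1],
    "case_d": [3, 3, 3],
}

solve_3_at_1 = solve_at(stages_by_case, s=3, i=1)  # only case_d is valid on the first attempt
solve_3_at_3 = solve_at(stages_by_case, s=3, i=3)  # case_a and case_d are valid within 3 attempts
solve_1_at_3 = solve_at(stages_by_case, s=1, i=3)  # every case runs without crashing within 3 attempts
```

Under these assumptions solve_s@i never decreases as i grows and never increases as s grows, which matches the column ordering of the table.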

Citation

@article{chen-heurigym-iclr2026,
  title={HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization},
  author={Hongzheng Chen and Yingheng Wang and Yaohui Cai and Hins Hu and Jiajie Li
          and Shirley Huang and Chenhui Deng and Rongjian Liang and Shufeng Kong
          and Haoxing Ren and Samitha Samaranayake and Carla P. Gomes and Zhiru Zhang},
  journal={International Conference on Learning Representations (ICLR)},
  year={2026}
}