HeuriGym: ICLR 2026 Benchmark

An agentic benchmark for evaluating how well LLMs generate and refine heuristics for real-world combinatorial optimization tasks.

Hongzheng Chen*, Yingheng Wang*, Yaohui Cai*, Hins Hu*, Jiajie Li*, Shirley Huang, Chenhui Deng, Rongjian Liang, Shufeng Kong, Haoxing Ren, Samitha Samaranayake, Carla P. Gomes, Zhiru Zhang
* Core contributor
21 Models Evaluated · 9 Problems · 4 Domains · 218 Test Cases

Overall Performance

Yield and QYI values are weighted across the 9 problem sets by test-case count. Quality is back-calculated from the weighted QYI and weighted Yield via the harmonic-mean relation QYI = 2QY/(Q+Y).
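The aggregation above can be sketched as follows. The per-problem numbers below are hypothetical placeholders, not benchmark results; only the weighting scheme and the inversion of QYI = 2QY/(Q+Y) follow the description.

```python
# Sketch of the aggregation described above (hypothetical per-problem values).
# Yield and QYI are weighted by test-case count; Quality is then recovered
# by inverting the harmonic-mean relation QYI = 2*Q*Y / (Q + Y).

def weighted(values, weights):
    """Test-case-count-weighted average."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def quality_from(qyi, y):
    """Solve QYI = 2*Q*Y/(Q+Y) for Q (requires 2*y != qyi)."""
    return qyi * y / (2 * y - qyi)

# Hypothetical per-problem metrics for one model:
yields = [0.8, 0.6, 0.9]
qyis = [0.5, 0.4, 0.7]
cases = [30, 20, 25]  # test cases per problem

Y = weighted(yields, cases)    # weighted Yield
QYI = weighted(qyis, cases)    # weighted QYI
Q = quality_from(QYI, Y)       # back-calculated Quality
print(round(Y, 4), round(QYI, 4), round(Q, 4))
```

Back-calculating Quality this way keeps the reported triple self-consistent: plugging Q and Y back into 2QY/(Q+Y) reproduces the weighted QYI exactly.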

QYI vs. API Cost

Total API cost is estimated as official list pricing × token counts across all 9 problems (5 iterations, T = 0); the cost axis is on a logarithmic scale.
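The cost estimate amounts to simple per-token arithmetic. The price table and model name below are placeholders, not real vendor pricing:

```python
# Back-of-envelope cost estimate as described above: list price x token counts.
# PRICES holds placeholder (input, output) USD prices per 1M tokens,
# NOT real vendor pricing.

PRICES = {"model-a": (1.00, 4.00)}

def run_cost(model, prompt_tokens, completion_tokens):
    """Estimated USD cost of one full benchmark run for `model`."""
    inp, out = PRICES[model]
    return (prompt_tokens * inp + completion_tokens * out) / 1_000_000

# e.g. 2M prompt tokens + 0.5M completion tokens across all problems:
print(run_cost("model-a", 2_000_000, 500_000))
```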

Leaderboard (T = 0)

Columns: #, Model, QYI, Yield, Quality, Cost ($), Released (all sortable)

* GPT-o4-mini:high is measured at T=1 (the o-series only supports T=1); its values appear unchanged in all temperature views.

Per-Problem Results

EDA
Compilers
Computational Biology
Logistics
QYI by Model — Operator Scheduling
Columns: Model, QYI, Yield, Quality, Solver

Per-problem QYI, Yield, and Quality for all 21 evaluated models. The original 9 models ran 10 iterations (T = 0); the 12 more recent models ran 5 iterations (T = 0).

Solve Metrics (T = 0)

Columns: Model; Stage III — Validity (@10, @5, @1); Stage II — Output Gen. (@10, @5, @1); Stage I — Execution (@10, @5, @1)

solves@i: fraction of test cases solved at stage s within i attempts. Stage I = no crash, Stage II = parseable output, Stage III = valid solution (≡ Yield). Models evaluated with only 5 iterations have no @10 data; the first 9 models ran 10 iterations.
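The solves@i metric can be sketched as below. The bookkeeping (recording the furthest stage reached per attempt) is an assumption for illustration, not the official harness:

```python
# Minimal sketch of solves@i (assumed bookkeeping, not the official harness).
# For each test case we record, per attempt, the furthest stage reached:
# 0 = crash, 1 = executed (Stage I), 2 = parseable output (Stage II),
# 3 = valid solution (Stage III).

def solves_at(stage_per_attempt, stage, i):
    """Fraction of test cases reaching `stage` within the first i attempts."""
    solved = sum(
        1 for attempts in stage_per_attempt
        if any(s >= stage for s in attempts[:i])
    )
    return solved / len(stage_per_attempt)

# Hypothetical log: 4 test cases, up to 5 attempts each.
log = [
    [1, 2, 3],        # valid on attempt 3
    [0, 0, 1, 2, 2],  # parseable, never valid
    [3],              # valid on the first attempt
    [0, 1, 1, 1, 2],  # parseable by attempt 5
]

print(solves_at(log, 1, 1))  # Stage I  @1
print(solves_at(log, 3, 5))  # Stage III @5 (= Yield under 5 attempts)
```

Since Stage III subsumes the earlier stages, solves@i is monotone non-increasing from Stage I to Stage III for any fixed i, and non-decreasing in i for any fixed stage.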

Citation

@inproceedings{chen-heurigym-iclr2026,
  title={HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization},
  author={Hongzheng Chen and Yingheng Wang and Yaohui Cai and Hins Hu and Jiajie Li
          and Shirley Huang and Chenhui Deng and Rongjian Liang and Shufeng Kong
          and Haoxing Ren and Samitha Samaranayake and Carla P. Gomes and Zhiru Zhang},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}