InteractBench
Benchmarking LLMs on Competitive Programming under Unrevealed Information
Jiaze Li1,2,3 Aocheng Shen1 Bing Liu1 Boyu Zhang1 Xiaoxuan Fan1 Qiankun Zhang1,2,3 Xianjun Deng1,3
- 1 School of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan, China
- 2 Key Laboratory of Cyberspace Security, Ministry of Education, Zhengzhou, China
- 3 Hubei Key Laboratory of Distributed System Security, Wuhan, China
Reasoning when information is not revealed upfront
Most existing competitive-programming benchmarks primarily focus on batch-style tasks where all problem inputs are provided upfront. This overlooks a critical dimension of algorithmic reasoning: the ability of generated programs to operate when key information is not revealed upfront.
Interactive problems embody this challenge. These problems require programs to engage in multi-round interaction with an interactor (a judge program) under strict protocol constraints and limited query budgets, with new information revealed only in response to queries.
What an interactive problem looks like
Find the smallest index x such that A[1..x] contains k zeros, with n = 8 and k = 3. In the batch setting the whole array is given upfront; in the interactive setting it is hidden, and the solver must recover the answer through range-sum queries under a fixed budget.
The interactive half, played out as a solver–interactor exchange:
What InteractBench provides
A benchmark of interactive problems
We curate 322 interactive problems from Codeforces, AtCoder, IOI, and ICPC. Each problem is annotated with consolidated categories and difficulty tiers.
A self-contained evaluation harness
Each problem is packaged with an executable local interactor, enabling fully offline evaluation without external judge submission.
Interaction-aware diagnostics
Beyond pass@k, our harness checks protocol compliance, enforces query budgets, and reports a failure taxonomy that distinguishes algorithmic errors from interaction-specific failures.
Where interactive evaluation is missing
Existing benchmarks focus on batch-style problems. Some rely on external online judges, limiting fully offline evaluation. Even when interactive problems are included, they typically constitute a small subset.
How each problem is built and verified
Select
We select interactive tasks with multi-round protocols and explicit query budgets from Codeforces, AtCoder, IOI, and ICPC.
Propose–Validate–Adjudicate
Proposer models draft candidate generators and interactors. Candidates are then validated against a per-task verification pool of submissions with known official verdicts. Failed candidates receive trace-grounded feedback from an adjudicator model and re-enter the loop; remaining cases are escalated to human experts after a fixed iteration cap.
Audit
As a supplementary sanity check, we conduct a one-time post-hoc online agreement audit.
Interactive tasks expose a persistent gap
- Hard interactive tasks remain unsolved and strongly discriminative.
- Gaps between pass@1 and pass@5 quantify the benefit of resampling.
- Resampling yields uneven gains across models and difficulty tiers.
- Explicit reasoning improves pass rates, but does not remove the Hard barrier.
Separating algorithmic errors from interaction failures
Algorithmic logic
- WA protocol-compliant but incorrect answer
Interaction-specific
- PE protocol / format violations
- QLE query-budget overrun
- IDLE blocked I/O or deadlock terminated by the wall-clock cap
Execution
- CE compile error
- RE runtime error
- TLE time limit exceeded
- MLE memory limit exceeded
What goes wrong, and how often
- Algorithmic logic errors (WA) dominate, but interaction-specific failures occur at different stages.
- Query efficiency limits frontier models as well.
- Runtime and resource-limit failures are present but secondary.
Takeaways
- 01
A significant interaction gap: even the most advanced reasoning models achieve limited success, with consistent degradation as difficulty increases.
- 02
Failures are dominated by algorithmic logic errors, yet protocol violations and query-budget overruns remain frequent even for frontier reasoning models.
- 03
Looking forward, we plan to develop prompting and training strategies that directly target interactive querying, state maintenance, and protocol adherence.
Cite this work
@inproceedings{interactbench2026,
title = {InteractBench: Benchmarking LLMs on Competitive
Programming under Unrevealed Information},
author = {Li, Jiaze and Shen, Aocheng and Liu, Bing and
Zhang, Boyu and Fan, Xiaoxuan and Zhang, Qiankun and
Deng, Xianjun},
booktitle = {Proceedings of the 43rd International Conference on
Machine Learning (ICML)},
year = {2026}
}