InteractBench

Benchmarking LLMs on Competitive Programming under Unrevealed Information

Jiaze Li1,2,3 Aocheng Shen1 Bing Liu1 Boyu Zhang1 Xiaoxuan Fan1 Qiankun Zhang1,2,3 Xianjun Deng1,3

  1. 1 School of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan, China
  2. 2 Key Laboratory of Cyberspace Security, Ministry of Education, Zhengzhou, China
  3. 3 Hubei Key Laboratory of Distributed System Security, Wuhan, China
Huazhong University of Science and Technology ICML
Motivation

Reasoning when information is not revealed upfront

Most existing competitive-programming benchmarks primarily focus on batch-style tasks where all problem inputs are provided upfront. This overlooks a critical dimension of algorithmic reasoning: the ability of generated programs to operate when key information is not revealed upfront.

Interactive problems embody this challenge. These problems require programs to engage in multi-round interaction with an interactor (a judge program) under strict protocol constraints and limited query budgets, with new information revealed only in response to queries.

Interaction

What an interactive problem looks like

Find the smallest index x such that A[1..x] contains k zeros, with n = 8 and k = 3. In the batch setting the whole array is given upfront; in the interactive setting it is hidden, and the solver must recover the answer through range-sum queries under a fixed budget.

An illustration of the difference between batch-style and interactive problems.
Figure 1. An illustration of the difference between batch-style and interactive problems.

The interactive half, played out as a solver–interactor exchange:

solver ⇄ interactor
Hidden array · A = [ ?, ?, ?, ?, ?, ?, ?, ? ] Query budget 3 / 7
solver › Sum(A[1..4])
interactor → 3
solver › Sum(A[5..6])
interactor → 1
solver › Sum(A[7..7])
interactor → 0
solver › Answer: Index 7
Correct
Contributions

What InteractBench provides

A benchmark of interactive problems

We curate 322 interactive problems from Codeforces, AtCoder, IOI, and ICPC. Each problem is annotated with consolidated categories and difficulty tiers.

A self-contained evaluation harness

Each problem is packaged with an executable local interactor, enabling fully offline evaluation without external judge submission.

Interaction-aware diagnostics

Beyond pass@k, our harness checks protocol compliance, enforces query budgets, and reports a failure taxonomy that distinguishes algorithmic errors from interaction-specific failures.

Benchmark Comparison

Where interactive evaluation is missing

Existing benchmarks focus on batch-style problems. Some rely on external online judges, limiting fully offline evaluation. Even when interactive problems are included, they typically constitute a small subset.

Comparison of representative code and competitive programming benchmarks.
Table 1. Comparison of representative code and competitive programming benchmarks.
Construction Pipeline

How each problem is built and verified

An overview of the InteractBench construction pipeline.
Figure 2. An overview of the InteractBench construction pipeline.
01

Select

We select interactive tasks with multi-round protocols and explicit query budgets from Codeforces, AtCoder, IOI, and ICPC.

02

Propose–Validate–Adjudicate

Proposer models draft candidate generators and interactors. Candidates are then validated against a per-task verification pool of submissions with known official verdicts. Failed candidates receive trace-grounded feedback from an adjudicator model and re-enter the loop; remaining cases are escalated to human experts after a fixed iteration cap.

03

Audit

As a supplementary sanity check, we conduct a one-time post-hoc online agreement audit.

Results

Interactive tasks expose a persistent gap

pass@1 and pass@5 stratified by difficulty on InteractBench.
Table 2. pass@1 and pass@5 (higher is better) stratified by difficulty on InteractBench.
  • Hard interactive tasks remain unsolved and strongly discriminative.
  • Gaps between pass@1 and pass@5 quantify the benefit of resampling.
  • Resampling yields uneven gains across models and difficulty tiers.
  • Explicit reasoning improves pass rates, but does not remove the Hard barrier.
Failure Taxonomy

Separating algorithmic errors from interaction failures

Algorithmic logic

  • WA protocol-compliant but incorrect answer

Interaction-specific

  • PE protocol / format violations
  • QLE query-budget overrun
  • IDLE blocked I/O or deadlock terminated by the wall-clock cap

Execution

  • CE compile error
  • RE runtime error
  • TLE time limit exceeded
  • MLE memory limit exceeded
Failure Analysis

What goes wrong, and how often

Failure composition among unsuccessful executions.
Table 3. Failure composition among unsuccessful executions (higher means more common).
  • Algorithmic logic errors (WA) dominate, but interaction-specific failures occur at different stages.
  • Query efficiency limits frontier models as well.
  • Runtime and resource-limit failures are present but secondary.
Conclusion

Takeaways

  • 01

    A significant interaction gap: even the most advanced reasoning models achieve limited success, with consistent degradation as difficulty increases.

  • 02

    Failures are dominated by algorithmic logic errors, yet protocol violations and query-budget overruns remain frequent even for frontier reasoning models.

  • 03

    Looking forward, we plan to develop prompting and training strategies that directly target interactive querying, state maintenance, and protocol adherence.

BibTeX

Cite this work

@inproceedings{interactbench2026,
  title     = {InteractBench: Benchmarking LLMs on Competitive
               Programming under Unrevealed Information},
  author    = {Li, Jiaze and Shen, Aocheng and Liu, Bing and
               Zhang, Boyu and Fan, Xiaoxuan and Zhang, Qiankun and
               Deng, Xianjun},
  booktitle = {Proceedings of the 43rd International Conference on
               Machine Learning (ICML)},
  year      = {2026}
}