Overview
Can AI models generate correct Python code for a specific pinned library version — not just the latest API?
Real codebases are stuck on specific versions due to technical debt, deployment constraints, and compatibility risk. Models trained on the latest docs fail silently on older APIs.
GitChameleon 2.0 offers 328 problems across 26 libraries (up from 116 in v1), execution-based evaluation with visible and hidden tests, and systematic results across LLMs, agents, coding assistants, and RAG systems.
GitChameleon v1:
- 116 problems
- Introduced the version-aware benchmark format
- Narrower library and evaluation scope

GitChameleon 2.0:
- 328 problems across 26 libraries
- Visible + hidden test sets
- Evaluation of LLMs, agents, coding assistants, and RAG systems
- 3 Python versions (3.7, 3.9, 3.10)
Abstract
GitChameleon 2.0 is an AI coding benchmark comprising 328 Python-based problems conditioned on specific versions of popular libraries for scientific computing and web development. It evaluates whether AI code generation models can correctly use library APIs as they existed at a particular version — a challenging test of version-specific knowledge that existing benchmarks largely ignore.
Unlike prior code benchmarks, GitChameleon 2.0 uses execution-based evaluation: each problem is validated against unit tests that run in a pinned environment. Our evaluation of large language models, LLM-powered agents, code assistants, and RAG systems reveals that even the best enterprise models achieve only 48–51% success under greedy decoding, and up to 59% with retrieval augmentation, highlighting a significant gap in practical library-version awareness.
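Execution-based evaluation reduces to a small harness: execute the candidate code, then execute its unit tests, inside an environment where the pinned library version is installed. A minimal sketch of the idea, with illustrative names (the real harness isolates each problem in its own pinned environment rather than using a bare `exec`):

```python
def evaluate_solution(solution_code: str, test_code: str) -> bool:
    """Execute a candidate solution, then its unit tests, in one namespace.

    Illustrative only: the actual harness runs each problem against the
    exact pinned library version, not the ambient interpreter.
    """
    namespace: dict = {}
    try:
        exec(solution_code, namespace)  # define the candidate function
        exec(test_code, namespace)      # assertions raise on failure
        return True
    except Exception:
        return False

# Toy problem standing in for a real benchmark entry.
solution = "def add(a, b):\n    return a + b"
tests = "assert add(2, 3) == 5"
print(evaluate_solution(solution, tests))  # True
```

The pass/fail signal comes purely from whether the tests raise, which is what makes the metric objective across models and paradigms.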
Benchmark at a Glance
Example Task
"Calculate the logarithm of the cumulative distribution function of the standard normal distribution using available functions. If not available in PyTorch, use another library."
import torch

def log_ndtr(input_tensor: torch.Tensor) -> torch.Tensor:
    # your solution here
    ...
Must pass the visible test:
from scipy.stats import norm
input_tensor = torch.linspace(-10, 10, steps=20)
expected = torch.tensor([-5.3231e+01, ..., -7.6199e-24], dtype=torch.float64)
assert torch.allclose(log_ndtr(input_tensor), expected, rtol=1e-3, atol=1e-3)
torch.special.log_ndtr was not added until PyTorch 1.11. A model that ignores the pinned version will call a non-existent function.
The correct solution falls back to scipy.stats.norm.logcdf.
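A version-aware completion takes exactly that fallback route. A minimal sketch, assuming SciPy is available in the pinned environment as the task statement allows:

```python
import torch
from scipy.stats import norm

def log_ndtr(input_tensor: torch.Tensor) -> torch.Tensor:
    # torch.special.log_ndtr does not exist before PyTorch 1.11,
    # so fall back to SciPy's log-CDF of the standard normal.
    return torch.from_numpy(norm.logcdf(input_tensor.numpy()))
```

Because `norm.logcdf` returns a float64 NumPy array, the result round-trips through `torch.from_numpy` as a float64 tensor, matching the dtype of the expected values in the visible test.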
Results
Headline results:
- o1 (single pass)
- Goose CLI (GPT-4.1)
- Claude 4 Sonnet + RAG
Detailed results by evaluation paradigm — hidden test success rates:
Models receive the problem statement and starting code stub and must complete the function in a single forward pass with greedy decoding. No tool use, web search, or iterative refinement — a direct measure of version-specific knowledge baked into model weights.
| # | Model | Success Rate |
|---|---|---|
| 1 | | 51.2% |
| 2 | | 50.0% |
| 3 | | 49.1% |
| 4 | | 48.8% |
| 5 | | 48.5% |
| 6 | | 48.2% |
| | Open-weights models | |
| 7 | | 48.2% |
| 8 | | 40.8% |
| 9 | | 36.3% |
| 10 | | 30.2% |
Self-debugging (iteratively re-prompting the model with the failing test output) improves all models by roughly 10–20 percentage points — see the paper for full results.
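The self-debugging protocol is a simple feedback loop. In the sketch below, `generate` and `run_tests` are illustrative stand-ins for the model call and the execution harness, not names from the paper:

```python
def self_debug(generate, run_tests, max_attempts=3):
    """Re-prompt the model with failing test output until the tests pass."""
    feedback = None
    code = None
    for _ in range(max_attempts):
        code = generate(feedback)  # feedback is None on the first attempt
        ok, error = run_tests(code)
        if ok:
            return code
        feedback = error           # the error message goes into the next prompt
    return code                    # best effort after the attempt budget

# Stub model that "fixes" its answer once it sees an error message.
def stub_generate(feedback):
    return "correct" if feedback else "buggy"

def stub_run_tests(code):
    return (code == "correct", "AssertionError: wrong result")

print(self_debug(stub_generate, stub_run_tests))  # correct
```

The key design point is that only the failure signal (not the hidden tests) is fed back, so the loop measures a model's ability to act on error messages rather than to memorize test cases.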
Multi-step agentic loop with live web search (DuckDuckGo or Perplexity) and sandboxed code execution. Agents can iteratively retrieve version-specific documentation, run their code, and refine their solution before final submission.
| # | Model + Search Tool | Success Rate |
|---|---|---|
| 1 | | 55.3% |
| 2 | | 51.4% |
| 3 | | 46.5% |
| 4 | | 46.0% |
See the paper for full results including no-sandbox ablations.
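In outline, the agentic paradigm is a tool-use loop: the model alternates between searching, executing code in a sandbox, and submitting. The action schema and function names below are illustrative, not the harness's actual interface:

```python
def agent_loop(llm_step, web_search, sandbox_run, max_steps=8):
    """Let a model search docs and run code before committing to an answer."""
    transcript = []                   # accumulated observations shown to the model
    for _ in range(max_steps):
        action = llm_step(transcript)  # model picks: search, run, or submit
        if action["type"] == "search":
            transcript.append(web_search(action["query"]))
        elif action["type"] == "run":
            transcript.append(sandbox_run(action["code"]))
        else:                          # "submit": final answer
            return action["code"]
    return None                        # step budget exhausted

# Scripted stand-in for a model: search once, test once, then submit.
script = iter([
    {"type": "search", "query": "torch 1.10 log_ndtr"},
    {"type": "run", "code": "print('smoke test')"},
    {"type": "submit", "code": "def solve(): ..."},
])
result = agent_loop(lambda transcript: next(script),
                    web_search=lambda q: f"results for {q}",
                    sandbox_run=lambda c: "ok")
print(result)  # def solve(): ...
```

The sandbox matters: without it, the agent cannot discover at generation time that a call like torch.special.log_ndtr does not exist in the pinned version.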
Off-the-shelf AI coding assistants evaluated end-to-end with no additional scaffolding. Each tool is given the problem statement and uses its default agentic loop — file editing, terminal access, and iterative self-correction — to produce a solution.
| # | Tool + Model | Success Rate |
|---|---|---|
| 1 | | 55.5% |
| 2 | | 54.6% |
| 2 | | 54.6% |
| 4 | | 48.8% |
See the paper for full results including without-problem-statement ablations.
Relevant library documentation for the pinned version is retrieved and prepended to the prompt before generation. This tests whether explicit grounding in version-correct docs can compensate for gaps in model knowledge.
| # | Model | Success Rate |
|---|---|---|
| 1 | | 59.4% |
| 2 | | 58.5% |
| 3 | | 56.7% |
| 4 | | 56.1% |
RAG boosts top models by roughly 10 percentage points over greedy decoding. See the paper for full results.
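The RAG setup amounts to prompt assembly: retrieve documentation snippets for the pinned library version and prepend them to the generation prompt. A sketch with an illustrative interface (the function name and prompt template are assumptions, not the paper's exact format):

```python
def build_rag_prompt(problem: str, starter_code: str, docs: list) -> str:
    """Prepend version-pinned documentation snippets to the generation prompt."""
    context = "\n\n".join(docs)
    return (
        f"Relevant documentation for the pinned version:\n{context}\n\n"
        f"Task: {problem}\n\n"
        f"Complete this code:\n{starter_code}"
    )

prompt = build_rag_prompt(
    problem="Compute the log of the standard normal CDF.",
    starter_code="def log_ndtr(input_tensor): ...",
    docs=["scipy.stats.norm.logcdf(x): log of the cumulative distribution function."],
)
print(prompt.splitlines()[0])  # Relevant documentation for the pinned version:
```

This isolates what the paradigm tests: whether explicit, version-correct grounding in the context window can substitute for knowledge missing from the weights.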
Get Started
🤗 Use the dataset
Browse problems, inspect solutions, or build downstream applications using the Hugging Face dataset.
from datasets import load_dataset
ds = load_dataset(
"cabbage972/GitChameleon-2.0",
"problems"
)
⚙️ Run official evaluation
Reproduce paper-faithful results using the official harness. Requires Python 3.9+, Poetry, and Docker.
git clone https://github.com/mrcabbage972/GitChameleonBenchmark.git
cd GitChameleonBenchmark
make evals-setup
evaluate --solution-path YOUR_SOLUTIONS.jsonl
Your solution file should be JSONL: one record per problem, each with example_id and answer fields, covering all 328 problems.
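Producing that file is a one-loop job with the standard library. The field names come from the harness requirements above; the id value and answer text here are illustrative placeholders:

```python
import json

# One record per problem: the problem id and the model's completed function.
solutions = [
    {"example_id": "0", "answer": "def log_ndtr(input_tensor):\n    ..."},
    # ... one entry for each of the 328 problems
]

with open("my_solutions.jsonl", "w") as f:
    for record in solutions:
        f.write(json.dumps(record) + "\n")
```

Each line is an independent JSON object, so the file can be streamed and validated record by record.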
Citation
Please use this entry when referencing GitChameleon 2.0 in your work.
Cite the earlier GitChameleon paper only when specifically referring to the original 116-problem release.