
GitChameleon 2.0

Evaluating AI Code Generation Against Python Library Version Incompatibilities

328 problems · 26 libraries · execution-based evaluation · ACL 2026 Main

ACL 2026 — Main Conference

Diganta Misra · Nizar Islah · Victor May · Brice Rauby · Zihan Wang
Justine Gehring · Antonio Orvieto · Muawiz Chaudhary · Eilif B. Muller
Irina Rish · Samira Ebrahimi Kahou · Massimo Caccia

Overview

What it measures

Can AI models generate correct Python code for a specific pinned library version — not just the latest API?

Why it matters

Real codebases are stuck on specific versions due to technical debt, deployment constraints, and compatibility risk. Models trained on the latest docs fail silently on older APIs.

What’s new in 2.0

328 problems across 26 libraries (vs. 116 in v1), execution-based evaluation with visible + hidden tests, and systematic results across LLMs, agents, coding assistants, and RAG.

Predecessor
GitChameleon (2024)
  • 116 problems
  • Introduced the version-aware benchmark format
  • Narrower library and evaluation scope
Current release — cite this
GitChameleon 2.0 (ACL 2026)
  • 328 problems across 26 libraries
  • Visible + hidden test sets
  • LLMs, agents, coding assistants, and RAG evaluation
  • 3 Python versions (3.7, 3.9, 3.10)
Read the full abstract

GitChameleon 2.0 is an AI coding benchmark comprising 328 Python-based problems conditioned on specific versions of popular libraries for scientific computing and web development. It evaluates whether AI code generation models can correctly use library APIs as they existed at a particular version — a challenging test of version-specific knowledge that existing benchmarks largely ignore.

Unlike prior code benchmarks, GitChameleon 2.0 uses execution-based evaluation: each problem is validated against unit tests that run in a pinned environment. Our evaluation of large language models, LLM-powered agents, code assistants, and RAG systems reveals that even the best enterprise models achieve only 48–51% success under greedy decoding, and up to 59% with retrieval augmentation, highlighting a significant gap in practical library-version awareness.
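The execution-based flow can be sketched as follows — a minimal illustration only, not the official harness (which additionally pins library versions inside Docker): the candidate solution and its visible test are written to a file and run in a fresh interpreter, and a zero exit code counts as a pass.

```python
import os
import subprocess
import sys
import tempfile
import textwrap

# A toy candidate solution and its visible test.
candidate = textwrap.dedent("""\
    def add(a, b):
        return a + b
""")
visible_test = "assert add(2, 3) == 5"

# Write solution + test to a temporary script.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(candidate + "\n" + visible_test + "\n")
    path = f.name

# Execute in a separate interpreter; exit code 0 means the test passed.
result = subprocess.run([sys.executable, path], capture_output=True)
passed = result.returncode == 0
os.unlink(path)
```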

Benchmark at a Glance

328
Problems
26
Python Libraries
3
Python Versions
~50%
Best Greedy Score
~59%
Best RAG Score

Example Task

Environment: torch==1.9.0 · Python 3.7 · tagged "other library" (solution may use a library other than PyTorch)

"Calculate the logarithm of the cumulative distribution function of the standard normal distribution using available functions. If not available in PyTorch, use another library."

import torch

def log_ndtr(input_tensor: torch.Tensor) -> torch.Tensor:
    # your solution here
    ...

Must pass the visible test:

from scipy.stats import norm
input_tensor = torch.linspace(-10, 10, steps=20)
expected = torch.tensor([-5.3231e+01, ..., -7.6199e-24], dtype=torch.float64)
assert torch.allclose(log_ndtr(input_tensor), expected, rtol=1e-3, atol=1e-3)
Why it's hard: torch.special.log_ndtr was not added until PyTorch 1.11. A model that ignores the pinned version will call a non-existent function. The correct solution falls back to scipy.stats.norm.logcdf.
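A solution along those lines can be sketched as follows — a sketch of the SciPy fallback described above, not necessarily the benchmark's canonical reference solution:

```python
import torch
from scipy.stats import norm


def log_ndtr(input_tensor: torch.Tensor) -> torch.Tensor:
    # torch.special.log_ndtr does not exist in torch==1.9.0, so compute
    # the log-CDF of the standard normal distribution with SciPy instead.
    return torch.from_numpy(norm.logcdf(input_tensor.numpy()))
```

SciPy's `logcdf` returns a float64 NumPy array, which matches the float64 expectation in the visible test.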

Results

51.2%
Best greedy LLM
o1, single pass
55.5%
Best agent / coding assistant
Goose CLI (GPT-4.1)
59.4%
Best retrieval-grounded
Claude 4 Sonnet + RAG

Detailed results by evaluation paradigm — hidden test success rates:

Greedy decoding (single pass)

Models receive the problem statement and starting code stub and must complete the function in a single forward pass with greedy decoding. No tool use, web search, or iterative refinement — a direct measure of version-specific knowledge baked into model weights.

#    Model                           Provider    Success Rate
1    o1                              OpenAI      51.2%
2    Gemini 2.5 Pro                  Gemini      50.0%
3    GPT-4o                          OpenAI      49.1%
4    Claude 3.7 Sonnet               Anthropic   48.8%
5    GPT-4.1                         OpenAI      48.5%
6    Grok 3                          xAI         48.2%

Open-weights models
7    Qwen 2.5-VL Instruct 72B        Alibaba     48.2%
8    Llama 4 Maverick 400B           Meta        40.8%
9    Llama 3.3 Instruct Turbo 70B    Meta        36.3%
10   Llama 3.1 Instruct Turbo        Meta        30.2%

Self-debugging (iterative re-prompting on test failure) improves all models by ~10–20 pp — see the paper for full results.
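That self-debugging protocol can be sketched as a simple loop, where `generate` and `run_tests` are hypothetical stand-ins for the model call and the sandboxed test runner:

```python
def self_debug(problem, generate, run_tests, max_rounds=3):
    """Re-prompt the model with the failure log until the visible
    tests pass or the round budget is exhausted."""
    solution = generate(problem)
    for _ in range(max_rounds):
        passed, log = run_tests(solution)
        if passed:
            break
        # Feed the failure log back so the next attempt can fix it.
        solution = generate(problem + "\n\nPrevious attempt failed with:\n" + log)
    return solution
```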

Agents with web search

Multi-step agentic loop with live web search (DuckDuckGo or Perplexity) and sandboxed code execution. Agents can iteratively retrieve version-specific documentation, run their code, and refine their solution before final submission.

#   Model + Search Tool               Success Rate
1   Claude Sonnet 3.5 + DuckDuckGo    55.3%
2   Claude Sonnet 3.5 + Perplexity    51.4%
3   Gemini 1.5 Pro + Perplexity       46.5%
4   Gemini 1.5 Pro + DuckDuckGo       46.0%

See the paper for full results including no-sandbox ablations.

Coding assistants

Off-the-shelf AI coding assistants evaluated end-to-end with no additional scaffolding. Each tool is given the problem statement and uses its default agentic loop — file editing, terminal access, and iterative self-correction — to produce a solution.

#   Tool + Model                           Success Rate
1   Goose CLI (GPT-4.1)                    55.5%
2   Cline IDE (GPT-4.1)                    54.6%
2   Cline IDE (GPT-4.1-nano)               54.6%
4   Claude Code CLI (Claude 3.7 Sonnet)    48.8%

See the paper for full results including without-problem-statement ablations.

Retrieval-augmented generation (RAG)

Relevant library documentation for the pinned version is retrieved and prepended to the prompt before generation. This tests whether explicit grounding in version-correct docs can compensate for gaps in model knowledge.

#   Model                Provider    Success Rate
1   Claude 4 Sonnet      Anthropic   59.4%
2   GPT-4.1              OpenAI      58.5%
3   Gemini 2.5 Pro       Gemini      56.7%
4   Claude 3.7 Sonnet    Anthropic   56.1%

RAG boosts top models by ~10 pp over greedy decoding. See the paper for full results.
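Prompt assembly for this paradigm can be sketched as follows, where `retrieve_docs` is a hypothetical retriever over version-pinned documentation (the paper's actual retrieval setup may differ):

```python
def build_rag_prompt(problem, library, version, retrieve_docs, k=3):
    """Prepend the top-k documentation snippets for the pinned
    library version to the problem statement."""
    snippets = retrieve_docs(library, version, query=problem, k=k)
    docs = "\n\n".join(snippets)
    return (
        f"Documentation for {library}=={version}:\n{docs}\n\n"
        f"Task:\n{problem}"
    )
```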

Get Started

🤗 Use the dataset

Browse problems, inspect solutions, or build downstream applications using the Hugging Face dataset.

from datasets import load_dataset

ds = load_dataset(
    "cabbage972/GitChameleon-2.0",
    "problems",
)

⚙️ Run official evaluation

Reproduce paper-faithful results using the official harness. Requires Python 3.9+, Poetry, and Docker.

git clone https://github.com/mrcabbage972/\
  GitChameleonBenchmark.git
cd GitChameleonBenchmark
make evals-setup
evaluate \
  --solution-path YOUR_SOLUTIONS.jsonl

Your solution file should be a JSONL with example_id and answer fields for each of the 328 problems.
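A solutions file in that format can be assembled like this — the IDs and answer strings below are placeholders, and a real submission needs one record per problem:

```python
import json

# Hypothetical solutions keyed by problem ID; a real file must cover
# all 328 problems with your model's generated code as `answer`.
solutions = [
    {"example_id": "0", "answer": "import torch\n\ndef log_ndtr(x): ..."},
    {"example_id": "1", "answer": "..."},
]

# Write one JSON object per line (JSONL).
with open("YOUR_SOLUTIONS.jsonl", "w") as f:
    for record in solutions:
        f.write(json.dumps(record) + "\n")
```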

Citation

Recommended citation for benchmark use — please use this entry when referencing GitChameleon 2.0 in your work.

@misc{misra2025gitchameleon20evaluatingai,
  title={GitChameleon 2.0: Evaluating AI Code Generation Against Python Library Version Incompatibilities},
  author={Diganta Misra and Nizar Islah and Victor May and Brice Rauby and Zihan Wang and Justine Gehring and Antonio Orvieto and Muawiz Chaudhary and Eilif B. Muller and Irina Rish and Samira Ebrahimi Kahou and Massimo Caccia},
  year={2025},
  eprint={2507.12367},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2507.12367}
}

Cite the earlier GitChameleon paper only when specifically referring to the original 116-problem release.

FAQ

How does GitChameleon 2.0 relate to the original GitChameleon?
GitChameleon 2.0 is the current expanded benchmark release, built on the earlier GitChameleon benchmark. The original GitChameleon (2024) introduced the version-aware format with 116 problems. GitChameleon 2.0 expands this to 328 problems across 26 libraries and adds systematic evaluation across LLMs, agents, coding assistants, and RAG — making it the primary reference for new work.

Can I use the dataset without the official harness?
Yes — the dataset is freely available on Hugging Face and suitable for inspection, downstream analysis, and building custom evaluation pipelines. Use the official harness only when you need paper-faithful evaluation with pinned Docker environments.

How are official results reproduced?
All official evaluation runs use the GitChameleonBenchmark repository. Docker-based pinned environments ensure reproducibility across different machines.