Overview
Can AI models generate correct Python code for a specific pinned library version — not just the latest API?
Real codebases are stuck on specific versions due to technical debt, deployment constraints, and compatibility risk. Models trained on the latest docs fail silently on older APIs.
GitChameleon 2.0 offers 328 problems across 26 libraries (up from 116 in v1), execution-based evaluation with visible and hidden tests, and systematic results across LLMs, agents, coding assistants, and RAG systems.
GitChameleon v1:
- 116 problems
- Introduced the version-aware benchmark format
- Narrower library and evaluation scope

GitChameleon 2.0:
- 328 problems across 26 libraries
- Visible + hidden test sets
- Evaluation of LLMs, agents, coding assistants, and RAG systems
- 3 Python versions (3.7, 3.9, 3.10)
Abstract
GitChameleon 2.0 is an AI coding benchmark comprising 328 Python-based problems conditioned on specific versions of popular libraries for scientific computing and web development. It evaluates whether AI code generation models can correctly use library APIs as they existed at a particular version — a challenging test of version-specific knowledge that existing benchmarks largely ignore.
Unlike prior code benchmarks, GitChameleon 2.0 uses execution-based evaluation: each problem is validated against unit tests that run in a pinned environment. Our evaluation of large language models, LLM-powered agents, code assistants, and RAG systems reveals that even the best enterprise models achieve only 48–51% success under greedy decoding, and up to 59% with retrieval augmentation, highlighting a significant gap in practical library-version awareness.
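Execution-based evaluation reduces to a small harness: execute the candidate code, then execute its unit tests, inside an environment where the pinned library version is installed. A minimal sketch of the idea, with illustrative names (the real harness isolates each problem in its own pinned environment rather than using a bare `exec`):

```python
def evaluate_solution(solution_code: str, test_code: str) -> bool:
    """Execute a candidate solution, then its unit tests, in one namespace.

    Illustrative only: the actual harness runs each problem against the
    exact pinned library version, not the ambient interpreter.
    """
    namespace: dict = {}
    try:
        exec(solution_code, namespace)  # define the candidate function
        exec(test_code, namespace)      # assertions raise on failure
        return True
    except Exception:
        return False

# Toy problem standing in for a real benchmark entry.
solution = "def add(a, b):\n    return a + b"
tests = "assert add(2, 3) == 5"
print(evaluate_solution(solution, tests))  # True
```

The pass/fail signal comes purely from whether the tests raise, which is what makes the metric objective across models and paradigms.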
Benchmark at a Glance
Example Task
"Calculate the logarithm of the cumulative distribution function of the standard normal distribution using available functions. If not available in PyTorch, use another library."
import torch

def log_ndtr(input_tensor: torch.Tensor) -> torch.Tensor:
    # your solution here
    ...
Must pass the visible test:
from scipy.stats import norm
input_tensor = torch.linspace(-10, 10, steps=20)
expected = torch.tensor([-5.3231e+01, ..., -7.6199e-24], dtype=torch.float64)
assert torch.allclose(log_ndtr(input_tensor), expected, rtol=1e-3, atol=1e-3)
torch.special.log_ndtr was not added until PyTorch 1.11. A model that ignores the pinned version will call a non-existent function.
The correct solution falls back to scipy.stats.norm.logcdf.
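A version-aware completion takes exactly that fallback route. A minimal sketch, assuming SciPy is available in the pinned environment as the task statement allows:

```python
import torch
from scipy.stats import norm

def log_ndtr(input_tensor: torch.Tensor) -> torch.Tensor:
    # torch.special.log_ndtr does not exist before PyTorch 1.11,
    # so fall back to SciPy's log-CDF of the standard normal.
    return torch.from_numpy(norm.logcdf(input_tensor.numpy()))
```

Because `norm.logcdf` returns a float64 NumPy array, the result round-trips through `torch.from_numpy` as a float64 tensor, matching the dtype of the expected values in the visible test.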
Results
Headline results:
- o1 (single pass)
- Goose CLI (GPT-4.1)
- Claude 4 Sonnet + RAG
Detailed results by evaluation paradigm — hidden test success rates:
Models receive the problem statement and starting code stub and must complete the function in a single forward pass with greedy decoding. No tool use, web search, or iterative refinement — a direct measure of version-specific knowledge baked into model weights.
| # | Model | Success Rate |
|---|---|---|
| 1 | | 51.2% |
| 2 | | 50.0% |
| 3 | | 49.1% |
| 4 | | 48.8% |
| 5 | | 48.5% |
| 6 | | 48.2% |
| | Open-weights models | |
| 7 | | 48.2% |
| 8 | | 40.8% |
| 9 | | 36.3% |
| 10 | | 30.2% |
Self-debugging (iteratively re-prompting the model with the failing test output) improves all models by roughly 10–20 percentage points — see the paper for full results.
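The self-debugging protocol is a simple feedback loop. In the sketch below, `generate` and `run_tests` are illustrative stand-ins for the model call and the execution harness, not names from the paper:

```python
def self_debug(generate, run_tests, max_attempts=3):
    """Re-prompt the model with failing test output until the tests pass."""
    feedback = None
    code = None
    for _ in range(max_attempts):
        code = generate(feedback)  # feedback is None on the first attempt
        ok, error = run_tests(code)
        if ok:
            return code
        feedback = error           # the error message goes into the next prompt
    return code                    # best effort after the attempt budget

# Stub model that "fixes" its answer once it sees an error message.
def stub_generate(feedback):
    return "correct" if feedback else "buggy"

def stub_run_tests(code):
    return (code == "correct", "AssertionError: wrong result")

print(self_debug(stub_generate, stub_run_tests))  # correct
```

The key design point is that only the failure signal (not the hidden tests) is fed back, so the loop measures a model's ability to act on error messages rather than to memorize test cases.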
Multi-step agentic loop with live web search (DuckDuckGo or Perplexity) and sandboxed code execution. Agents can iteratively retrieve version-specific documentation, run their code, and refine their solution before final submission.
| # | Model + Search Tool | Success Rate |
|---|---|---|
| 1 | | 55.3% |
| 2 | | 51.4% |
| 3 | | 46.5% |
| 4 | | 46.0% |
See the paper for full results including no-sandbox ablations.
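In outline, the agentic paradigm is a tool-use loop: the model alternates between searching, executing code in a sandbox, and submitting. The action schema and function names below are illustrative, not the harness's actual interface:

```python
def agent_loop(llm_step, web_search, sandbox_run, max_steps=8):
    """Let a model search docs and run code before committing to an answer."""
    transcript = []                   # accumulated observations shown to the model
    for _ in range(max_steps):
        action = llm_step(transcript)  # model picks: search, run, or submit
        if action["type"] == "search":
            transcript.append(web_search(action["query"]))
        elif action["type"] == "run":
            transcript.append(sandbox_run(action["code"]))
        else:                          # "submit": final answer
            return action["code"]
    return None                        # step budget exhausted

# Scripted stand-in for a model: search once, test once, then submit.
script = iter([
    {"type": "search", "query": "torch 1.10 log_ndtr"},
    {"type": "run", "code": "print('smoke test')"},
    {"type": "submit", "code": "def solve(): ..."},
])
result = agent_loop(lambda transcript: next(script),
                    web_search=lambda q: f"results for {q}",
                    sandbox_run=lambda c: "ok")
print(result)  # def solve(): ...
```

The sandbox matters: without it, the agent cannot discover at generation time that a call like torch.special.log_ndtr does not exist in the pinned version.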
Off-the-shelf AI coding assistants evaluated end-to-end with no additional scaffolding. Each tool is given the problem statement and uses its default agentic loop — file editing, terminal access, and iterative self-correction — to produce a solution.
| # | Tool + Model | Success Rate |
|---|---|---|
| 1 | | 55.5% |
| 2 | | 54.6% |
| 2 | | 54.6% |
| 4 | | 48.8% |
See the paper for full results including without-problem-statement ablations.
Relevant library documentation for the pinned version is retrieved and prepended to the prompt before generation. This tests whether explicit grounding in version-correct docs can compensate for gaps in model knowledge.
| # | Model | Success Rate |
|---|---|---|
| 1 | | 59.4% |
| 2 | | 58.5% |
| 3 | | 56.7% |
| 4 | | 56.1% |
RAG boosts top models by roughly 10 percentage points over greedy decoding. See the paper for full results.
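The RAG setup amounts to prompt assembly: retrieve documentation snippets for the pinned library version and prepend them to the generation prompt. A sketch with an illustrative interface (the function name and prompt template are assumptions, not the paper's exact format):

```python
def build_rag_prompt(problem: str, starter_code: str, docs: list) -> str:
    """Prepend version-pinned documentation snippets to the generation prompt."""
    context = "\n\n".join(docs)
    return (
        f"Relevant documentation for the pinned version:\n{context}\n\n"
        f"Task: {problem}\n\n"
        f"Complete this code:\n{starter_code}"
    )

prompt = build_rag_prompt(
    problem="Compute the log of the standard normal CDF.",
    starter_code="def log_ndtr(input_tensor): ...",
    docs=["scipy.stats.norm.logcdf(x): log of the cumulative distribution function."],
)
print(prompt.splitlines()[0])  # Relevant documentation for the pinned version:
```

This isolates what the paradigm tests: whether explicit, version-correct grounding in the context window can substitute for knowledge missing from the weights.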
Get Started
🤗 Use the dataset
Browse problems, inspect solutions, or build downstream applications using the Hugging Face dataset.
from datasets import load_dataset
ds = load_dataset(
"cabbage972/GitChameleon-2.0",
"problems"
)
⚙️ Run official evaluation
Reproduce paper-faithful results using the official harness. Requires Python 3.9+, Poetry, and Docker.
git clone https://github.com/mrcabbage972/GitChameleonBenchmark.git
cd GitChameleonBenchmark
make evals-setup
evaluate --solution-path YOUR_SOLUTIONS.jsonl
Your solution file should be JSONL: one record per problem, each with example_id and answer fields, covering all 328 problems.
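Producing that file is a one-loop job with the standard library. The field names come from the harness requirements above; the id value and answer text here are illustrative placeholders:

```python
import json

# One record per problem: the problem id and the model's completed function.
solutions = [
    {"example_id": "0", "answer": "def log_ndtr(input_tensor):\n    ..."},
    # ... one entry for each of the 328 problems
]

with open("my_solutions.jsonl", "w") as f:
    for record in solutions:
        f.write(json.dumps(record) + "\n")
```

Each line is an independent JSON object, so the file can be streamed and validated record by record.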
Citation
Please use this entry when referencing GitChameleon 2.0 in your work.
Cite the earlier GitChameleon paper only when specifically referring to the original 116-problem release.