What Coding Language Do AI Agents Prefer?

Suppose I’m about to hand a real chunk of work to a coding agent. Pick a language for the project. Which one minimizes my regret? Time spent troubleshooting is time wasted. I could code it myself, but that means spending less time on the projects I’m actually interested in. So what does a “most likely just works” setup look like?

Digging into the “AI prefers X language” rankings online turned out to be more complicated than I expected. I found the research papers on McEval, AutoCodeBench, and TerminalBench. The trouble is that the benchmarks underneath them answer three different questions, and the question closest to what an agent actually does turns out to be the one we have the least public data on. At least, I couldn’t find a decent-sized dataset. If someone has one, please send it my way.

McEval — pass@1 across forty languages

McEval was one of the first serious attempts to grade coding models across many languages: forty of them, with around 16K test samples covering code generation, completion, and explanation.

Headline scores: GPT-4o leads at 65.2% average pass@1, GPT-4 Turbo at 63.4%, GPT-3.5 Turbo at 52.6%, across 23 tested models. The languages where GPT-4o clears 80% pass@1 cluster around statically typed and multi-paradigm targets. Haskell, Kotlin, Swift, Rust, CoffeeScript, Java, and Groovy all sit in that band. Markup and lower-resource languages — HTML, Markdown, Fortran — sit at the bottom of the table. Models tilt toward object-oriented and multi-paradigm languages over functional and procedural ones.

This is where most “AI prefers X” claims come from. The trouble: pass@1 is single-shot, single greedy-decode, with no retries, no harness (in today’s sense of the word), and no tool use. A coding agent in 2026 doesn’t get one shot. It reads tests, fails, fixes, retries. McEval doesn’t measure that. The paper itself doesn’t flag this (it predates the workflow it would have needed to flag), but reading McEval scores today as a guide to “what an agent prefers” is reading the wrong benchmark for the workflow.
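To make the difference concrete, here is a minimal sketch of the two scoring styles side by side. The attempt logs are invented for illustration and the function names are mine, not McEval’s; each problem maps to the pass/fail result of each attempt, in order.

```python
# Hypothetical attempt logs: problem id -> pass/fail result of each attempt, in order.
# These values are invented for illustration; they are not McEval data.
attempts = {
    "p1": [True],
    "p2": [False, True],
    "p3": [False, False, True],
    "p4": [False, False, False],
}


def pass_at_1(logs):
    """Single-shot score: only the first attempt on each problem counts."""
    return sum(1 for tries in logs.values() if tries[0]) / len(logs)


def solved_within(logs, budget):
    """Agent-style score: a problem counts if any of the first `budget` attempts passes."""
    return sum(1 for tries in logs.values() if any(tries[:budget])) / len(logs)


print(pass_at_1(attempts))         # 0.25
print(solved_within(attempts, 3))  # 0.75
```

Same model, same problems, very different headline number once retries are allowed. That gap is what a single-shot leaderboard can’t show.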

AutoCodeBench — pass@1 grows up, reasoning variants sweep the leaderboard

AutoCodeBench is McEval’s successor in spirit. Auto-generated multilingual problems, a sandbox-executed grader, 20 languages, and 3,920 problems on the full set, with a smaller AutoCodeBench-Lite split published alongside it for cheaper runs.

The original paper’s leaderboard (August 2025) had Claude Opus 4 (Think) on top at 52.4% average pass@1, followed by Sonnet 4 (Think) at 51.1%, o3 (high) at 51.1%, and Grok-4 at 50.9%.

The current V2 leaderboard reframes that picture, but don’t compare V2 numbers directly to the original ACB numbers. V2 is a curated, higher-quality problem set, so part of the climb is benchmark change, not just model improvement.

Top of the chart: Claude Opus 4.5 thinking at 91.3%, Gemini 3 Pro at 87.1%, GPT-5 high at 84.3%. The top five entries are all reasoning-mode variants, and that holds for places six through nine on the live leaderboard too (Hunyuan 2.0 thinking, Kimi K2 thinking, DeepSeek V3.2 thinking, Seed-1.6 thinking). The phase shift in six months isn’t “the models got better.” It’s that reasoning variants now get benchmarked separately and dominate the standard pass@1 frame.

The gap that matters here: the V2 leaderboard does not surface per-language rankings. All three charts on the site (ACB, ACB-Lite, V2) display overall-average bars. To answer “which language does Opus 4.5 thinking do best on” against AutoCodeBench, you’d have to run the eval yourself against the public HuggingFace dataset.
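If you do run it yourself, the per-language step is just a group-by over the grader’s results. A rough sketch, assuming each result has already been flattened into a record with a `language` and a `passed` field (those field names are my assumption, not the dataset’s schema, and the records below are made up):

```python
from collections import defaultdict


def per_language_pass_at_1(records):
    """Group single-shot results by language and return {language: pass rate}."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["language"]] += 1
        passed[rec["language"]] += 1 if rec["passed"] else 0
    return {lang: passed[lang] / total[lang] for lang in total}


# Invented results for illustration only; not AutoCodeBench output.
records = [
    {"language": "python", "passed": True},
    {"language": "python", "passed": False},
    {"language": "rust", "passed": True},
    {"language": "rust", "passed": True},
]

for lang, rate in sorted(per_language_pass_at_1(records).items()):
    print(f"{lang}: {rate:.0%}")
```

The aggregation is the easy part. The expensive part is generating and sandbox-executing the model outputs for each problem in the first place.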

Same paradigm. Still pass@1. Still single-shotting without accounting for the harness. Better problems and a reasoning-aware leaderboard, but the question being asked hasn’t changed.

TerminalBench 2.1 — workflow-accurate, different leader

TerminalBench 2.1, released May 6, 2026, is the closest thing the field has to a benchmark shaped like an agentic workflow. Model plus harness, 89 real terminal tasks, end-to-end completion grading.

It’s a revision of TerminalBench 2.0 — same 89 tasks, 28 of which were fixed, with continuous validation introduced for agentic benchmarks. The authors note that “no task is unsolved in Terminal-Bench 2.1.” Three fix categories: external dependencies (9 tasks), resource mismatches (8), task misspecification.

Top of the chart: GPT-5.3-Codex with Codex CLI at 79.1%, GPT-5.4 with Codex CLI at 77.3%, Gemini 3.1 Pro with Terminus 2 at 70.7%, Opus 4.6 with Claude Code at 70.1%. This is the reverse of the AutoCodeBench V2 picture. The model that wins the pass@1 leaderboard isn’t the model that wins the agentic-workflow leaderboard.

Harness matters too. The same GPT-5.3-Codex on Terminus 2 drops to 68.5%, about a 10-point penalty from changing the agent harness, no model change at all.
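For reference, the arithmetic behind that penalty, using only the two scores quoted above (the variable names are mine):

```python
# Scores quoted above for GPT-5.3-Codex on TerminalBench 2.1, by harness.
codex_cli = 79.1
terminus_2 = 68.5

point_drop = codex_cli - terminus_2     # absolute percentage points lost
relative_drop = point_drop / codex_cli  # share of the Codex CLI score lost

print(f"{point_drop:.1f} points")  # 10.6 points
print(f"{relative_drop:.1%}")      # 13.4%
```

Put differently, swapping the harness gives up roughly an eighth of the tasks the same model completes under Codex CLI.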

And what matters for this post: there’s no per-language breakdown. A Python-tooling task and a Rust-tooling task count the same. The benchmark that best matches how 2026 agents actually work tells us nothing about which language those agents prefer.

lang-comp — the indie gap-filler measuring tokens-to-green

A small, single-author benchmark — jovaneyck/lang-comp — is the closest the public record currently gets to agentic and per-language at the same time.

The methodology is a TDD loop. The agent implements, runs the tests, fails, fixes, retries until green. The metric isn’t pass-rate; it’s tokens-to-green and iterations-to-green. That’s agentic by construction.
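The shape of that loop can be sketched in a few lines. This is my own illustration of the idea, not code from the lang-comp repo: `agent_step` and `run_tests` are hypothetical stand-ins for “ask the agent to edit the code” and “run the test suite”, and the iteration cap is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class RunResult:
    green: bool      # did the test suite end up passing?
    iterations: int  # number of implement-then-test rounds used
    tokens: int      # total tokens spent across all rounds


def run_to_green(
    agent_step: Callable[[str], int],
    run_tests: Callable[[], Tuple[bool, str]],
    max_iterations: int = 10,
) -> RunResult:
    """Loop implement -> test until the suite is green or the budget runs out.

    agent_step(feedback) edits the code and returns the tokens it spent;
    run_tests() returns (passed, output). Both are hypothetical stand-ins.
    """
    tokens = 0
    feedback = ""
    for iteration in range(1, max_iterations + 1):
        tokens += agent_step(feedback)
        passed, feedback = run_tests()
        if passed:
            return RunResult(True, iteration, tokens)
    return RunResult(False, max_iterations, tokens)


# Toy stand-ins: every step costs 1,000 tokens and the tests pass on the third run.
test_runs = iter([(False, "1 failed"), (False, "1 failed"), (True, "ok")])
result = run_to_green(lambda feedback: 1000, lambda: next(test_runs))
print(result)  # RunResult(green=True, iterations=3, tokens=3000)
```

Both metrics fall out of the same loop: `iterations` is iterations-to-green and `tokens` is tokens-to-green. Averaging them over repeated runs gives figures comparable in kind to the ones reported below.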

Two tasks: game-of-life (add a “zombie” cell state) and gilded-rose (the classic refactor, plus Conjured items). Seven languages: TypeScript, JavaScript, C#, F#, Python, Rust, Elixir. Harness: GitHub Copilot default, with some runs on Claude Code or Codex. Five attempts per (model × language × task) cell.

Per-language token cost under Claude Opus 4.6 high, n=10:

Rank	Language	Avg tokens	Avg iterations
1	TypeScript	26,910	1.70
2	JavaScript	28,520	2.00
3	C#	33,520	2.50
4	F#	33,800	2.10
5	Elixir	34,260	2.30
6	Python	34,480	2.40
7	Rust	35,040	2.20

TypeScript runs about 23% cheaper than Rust under the same model on the same two tasks. It also has the lowest iteration count, 1.70. A plausible reading is that the type system catches errors before tests run, so the agent loops less. Python ends up more expensive than TypeScript despite its training-data dominance.
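The percentages can be reproduced directly from the table. A small sketch, with helper names of my own choosing:

```python
# Average tokens per language, copied from the table above (Opus 4.6 high, n=10).
avg_tokens = {
    "TypeScript": 26_910,
    "JavaScript": 28_520,
    "C#": 33_520,
    "F#": 33_800,
    "Elixir": 34_260,
    "Python": 34_480,
    "Rust": 35_040,
}


def savings_vs(baseline, candidate):
    """Fraction of the baseline's tokens saved by using the candidate instead."""
    return (avg_tokens[baseline] - avg_tokens[candidate]) / avg_tokens[baseline]


def overhead_vs(baseline, candidate):
    """Extra tokens the candidate costs, as a fraction of the baseline."""
    return (avg_tokens[candidate] - avg_tokens[baseline]) / avg_tokens[baseline]


print(f"{savings_vs('Rust', 'TypeScript'):.1%}")     # 23.2%
print(f"{overhead_vs('TypeScript', 'Python'):.1%}")  # 28.1%
```

Note that the direction of the comparison matters: the 23% figure uses Rust as the baseline, while measuring from TypeScript upward gives a larger-looking overhead for the same token gap.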

The harness-vs-model comparison the repo runs on F# shows a +44% token overhead for the Claude Code harness with Opus 4.7 versus the Copilot harness with Opus 4.6 on the same language and same task. GPT-5.4 with Codex needs 3.2–3.4 iterations to Claude’s 2.0–2.3, more tries with comparable final token totals.

I want to be careful about the caveats, because the post collapses without them. Two tasks, both well-known katas almost certainly in every model’s training set. This measures familiar-territory efficiency, not novel-problem capability. Five attempts per cell is a small sample. Only Opus 4.6 ran the full seven-language matrix; cross-language comparisons across other models aren’t supported by the data. No Gemini, Kimi, DeepSeek, or Qwen runs. Single author, single machine, manually triggered, no CI.

It still matters. It’s directional. And until somebody runs TerminalBench 2.x with language tagging, it’s the best answer I’ve found to the agentic-and-per-language question.

Verdict — different benchmarks, different winners

McEval’s pass@1 frame says GPT-4o clears 80% on Haskell, Kotlin, Swift, Rust, CoffeeScript, Java, and Groovy. That cluster is what most “AI prefers X” claims actually mean. Wrong question for delegating real work.

AutoCodeBench V2’s reasoning-variant pass@1 frame says Claude Opus 4.5 thinking leads at 91.3%, and every top-nine entry is a reasoning variant. No per-language breakdown surfaced.

TerminalBench 2.1’s agentic end-to-end frame says GPT-5.3-Codex with Codex CLI leads at 79.1% and Opus 4.6 with Claude Code lands fourth at 70.1%. Harness matters as much as model. No language breakdown.

And lang-comp’s agentic token-cost frame says TypeScript is ~23% cheaper than Rust under Opus 4.6 plus Copilot. The type system pays for itself in agent token bills.

What the four benchmarks actually say, taken together:

Bias toward TypeScript when you have the choice. It’s the best per-language agentic signal the public record currently provides, and the second-place finisher (JavaScript) gives you a fallback if TS is overkill.

Don’t pick “the best model” from a single benchmark. AutoCodeBench V2 crowns Claude; TerminalBench 2.1 crowns GPT-Codex. Pick the model whose benchmark profile matches the workflow shape you’re actually running.

Treat the harness as a load-bearing variable. The same Codex model swaps harnesses and loses ten points on TBench 2.1. The Claude Code harness costs +44% on F# versus Copilot on the same task, although that comparison also moves from Opus 4.6 to Opus 4.7. Harness choice belongs in the decision alongside model choice, not as a footnote.

What we still can’t answer: large-sample agentic per-language pass rates across the frontier model field. The benchmark that would settle this doesn’t exist publicly yet.

Closing

Different evaluations crown different languages and different models because they’re measuring different things.

Benchmarks provide us with relative performance and can signal direction. LLMs don’t prefer one language; the quality and quantity of their training data give them an inherent bias, which the harness you run the model on can then amplify or offset.

If you’re picking a language for an agentic project in 2026, pick TypeScript, and test your harness: it can provide meaningful performance gains.

Treat the rest of the rankings as a leaderboard to watch. Companies will continue to develop better systems.

If you’ve run your own agent + language + token-cost tests, I want to see them. Drop them in the repo or send them my way. The harness story is under-discussed; the workflow story is over-discussed; the language story is under-measured. That last gap is the one I’d like the next benchmark to close.

Sources

Benchmarks (papers):

Leaderboards and code:

Releases:

TerminalBench 2.1 release announcement — tbench.ai (May 6, 2026)

— Ender
