Pass@1 leaderboards say one thing, agentic end-to-end benchmarks say another, and the only reliable per-language agentic signal I found is a tiny indie repo measuring tokens-to-done.
LLMs lean hard on code comments when they reason about code, but nobody has actually tested whether beginner-friendly comments help an AI agent modify a real, multi-file codebase. Here’s what six papers and the developer community had to say.