Another way to be wrong about AI-assisted coding: basing an entire article on mostly outdated empirical findings, some of them wildly so?
I assume you are implying you disagree with the summary given, and not making a general observation about how some studies are flawed.
Given the time needed to gather data, write it up, and go through the peer review process, this is about what I would expect for up-to-date empirical findings combined with some background base understanding.
Can you suggest some better and more recent empirical findings?
Which ones are wildly outdated?
“Outdated” was perhaps an inaccurate term to use here. The issue isn’t when these studies were published, it’s which models they evaluated. This field is moving faster than a speeding bullet. To start with the most recent studies cited:
- Stray2026 covers a “two-year period” of commits. The paper was submitted in September 2025 and revised in January 2026. “Vibe coding” at that time (so from 2023-2025) was still mostly pasting code from chat windows into your IDE or accepting autocomplete suggestions.
- He2026 covers a similar timeframe: it was submitted in November 2025 and revised in January 2026. It focused entirely on Cursor, which at that time had a very different focus (code completion/inline chat prompts vs agentic back-and-forth with extensive tool use and autonomy). Again, very different from what reality looks like currently.
- Becker2025 explicitly evaluated Claude 3.5/3.7 Sonnet, an entire generation removed from the current state-of-the-art.
- Xu2025 and Bakal2025 say they evaluated “GitHub Copilot”, which isn’t an AI model but an AI router. I couldn’t find any indication that they tracked which models the developers’ requests actually ended up going to.
The point is that there is no recent empirical data because by the time a rigorous study is ready to publish, the industry and its capabilities have already moved on dramatically. The truth is that, as of right now, anyone claiming to have empirical proof of either slowdowns or efficiency gains is wrong. There is no way to tell.
This isn't about "empirical proof of either slowdowns or efficiency gains" but ways in which the process of evaluating its capabilities can be flawed. The first paragraph is:
"Suppose your manager asks you next week to demonstrate that the AI coding tools your company signed up for are worth the subscription cost. Would you measure lines of code generated, or tickets closed? Or would you send out a survey asking whether developers feel more productive? Each of those approaches is flawed in a different way; the sections below explain why."