I had some thoughts about AI coding that I shared with my coworkers, and I decided to publish them here as well so I can easily link people to them.

The context is that how to actually use AI tools to drive up software quality is a difficult and open problem.

At present there can be no assurance of quality without human oversight. After all, it would be highly concerning (in a Manhattan Project kind of way) for our species and way of life if an autonomous system already existed that could arbitrarily drive up software quality based only on high-level hints and instructions.

One thing I want to emphasize here is that the particular LLM being driven by a tool like cursor or aider is critically important to every aspect, and in particular the quality, of the output you're going to get. In very practical terms the LLM is the brain and cursor/aider are the body. Both are important for productivity, but the value comes from the LLM's ability to figure out what to do and to give responses adhering to strict formats (I'll generally refer to this as function calling).

In terms of state-of-the-art capability leaps, I'll point out three recent ones. (1) Nearly a year ago, Claude 3.5 Sonnet came out with AI coding capabilities significantly beyond what GPT-4 could do at the time. Then (2) we got "thinking" models, first from OpenAI with o1 and then o3-mini etc., which demonstrated a large leap in high-level reasoning ability; with these models, LLMs overnight gained the ability to do things like arithmetic. It became clear that the frontier "thinking" models were more capable than 3.5 Sonnet at high-level reasoning and project planning, but 3.5 Sonnet still remained the most effective overall model for turning changes into constrained "function calling" outputs. That didn't change with the recent 3.7 Sonnet release either, which is better ("smarter") at planning but less good at directly performing edits.

(3) Last week's Gemini 2.5 Pro is the latest big step change in capability.

We all care about quality because we know ensuring quality is the only way to provide lasting value in software. The reasoning is then clear: staying on top of the most capable models and pursuing the best ways to get the best results out of them is the name of the game. In April 2025, if you aren't taking advantage of Gemini 2.5 Pro for AI coding, you're nowhere close to doing things the most effective way possible.

So we can look more closely (as I have been over the past week) at how Gemini 2.5 Pro changed things. I include a screenshot in this comment: the aider polyglot benchmark leaderboard as of Apr 1, 2025. We see 2.5 Pro top this chart with an 8 percentage point lead in problem-solving correctness. This particular benchmark didn't exist back when 3.5 Sonnet came along, but I'd be very surprised if it had an 8 point lead over GPT-4 at the time!

Its correct edit format score is under 90%, however, which is lower than all of the other runners-up in the standings. 2.5 Pro, like any other LLM, has its own quirks, tendencies, and behavior when it comes to following instructions. The result in this column perhaps isn't the best possible way to evaluate how well a model follows instructions for AI coding tasks, but it's one of the better metrics we have on that today.

One thing that will help you understand this table: when an entry lists two LLM models, the first is used as the "architect" model and the second as the "editor" model. This is a semi-agentic approach that aider established a while back, around when o1 came out. It always does a two-step process when you send a prompt: the prompt is presented to the architect model for a response, and that response is then passed to the editor model to produce the actual edits. It's very simple.
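To make that flow concrete, here's a minimal sketch of the two-step process. This is not aider's actual code; the `complete()` helper and the model names are stand-ins for whatever API and models the tool is configured with.

```python
# Illustrative sketch of an architect/editor two-step flow (not aider's internals).

def complete(model: str, system: str, user: str) -> str:
    """Stand-in for a chat-completion call to your LLM provider."""
    raise NotImplementedError

def architect_then_edit(prompt: str, repo_context: str) -> str:
    # Step 1: the architect model decides WHAT should change and why.
    plan = complete(
        model="architect-model",  # e.g. a strong reasoning model
        system="Describe the code changes needed. Do not write the edits yet.",
        user=f"{repo_context}\n\nRequest: {prompt}",
    )
    # Step 2: the editor model turns that plan into strictly formatted edits.
    return complete(
        model="editor-model",  # e.g. a model that follows edit formats reliably
        system="Apply the described changes using the required edit format only.",
        user=f"{repo_context}\n\nPlan:\n{plan}",
    )
```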

Aider has a particular diff-based edit format that looks somewhat like git conflict markers: <<<<<<< SEARCH , ======= , >>>>>>> REPLACE. LLMs must conform to the specified format to issue changes. If the LLM produces a diff that doesn't match the existing code or fails to conform to the required syntax, it's unsafe or impossible to apply the edit. Other tools like cursor, vscode, etc. employ different formats and instructions.
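For illustration, an edit in that style looks roughly like this (the file and code are made up; only the SEARCH/REPLACE markers come from the format described above):

```
greeting.py
<<<<<<< SEARCH
def greet(name):
    print("Hello " + name)
=======
def greet(name: str) -> None:
    print(f"Hello, {name}!")
>>>>>>> REPLACE
```

The tool applies the edit by finding the SEARCH text in the file and replacing it, which is why output that doesn't match the existing code can't be applied safely.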

Today, if you want the best quality, you have Gemini 2.5 Pro do your planning and the old reliable Claude 3.5 Sonnet do your edits based on those plans. Which tool you use matters less than understanding that different models are better at different things.

One area where I can clearly see opportunity for improvement, at least from the perspective of using 2.5 Pro with aider, is that 2.5 Pro isn't reliably adhering to aider's diff format, but it appears to be adept at presenting its changes in standard unified diff format. Aider just doesn't handle that particular diff format. I figure someone's working on that, and we'll probably find out soon whether 2.5 Pro on its own, using a standard udiff edit format, could produce great benchmark numbers. It would certainly reduce the cost and latency overhead of employing a second model for this menial "pass the butter" diff translation task.
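For contrast, the same made-up change from the earlier example would look something like this as a standard unified diff:

```
--- a/greeting.py
+++ b/greeting.py
@@ -1,2 +1,2 @@
-def greet(name):
-    print("Hello " + name)
+def greet(name: str) -> None:
+    print(f"Hello, {name}!")
```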

More generally:

The issue with quality is that unless you have project-specific safeguards in place to help keep your thumb on it, you can't get around having to become familiar with what's going on, at least to some degree. That means spending the time to put eyeballs on the code and at least skim it, so you start to understand which parts make sense and which are misguided in various ways.

Because of this general fact of life, I think it's imperative in the short and medium term that software architecture be maintained primarily for human consumption. Not only is the human where the buck stops (and this is not going to change any time soon), but as far as we know there is as yet no alternative representation, human-readable or not, that communicates more effectively to LLMs in a portable way. Some of the details above are a good example of why even the cross-LLM portability of basic function calling like "take THIS code and change it to THIS" is problematic!

I started my own agentic framework and I'm exploring different things to use for guiding LLMs to write good software. But I think it's also a fact of life that we won't know what sort of high-level instructions are *the best* for software *in general* for a while, because it takes a lot of experimentation for clear patterns to emerge, each project is its own beast, and we clearly cannot be overly prescriptive about processes or even guidance.

I do think there should be a lot of value in defining high-level concepts and goals that generally apply: things like robustness, efficiency, testability, simplicity. I'm hoping it's possible to define a vocabulary along these lines that lets us efficiently communicate the nuances of what we're trying to do with software, instead of having to write out huge paragraphs each time to get our intent across properly to the LLM.

Something that might have legs is a higher-level extension to "software design patterns". If you know your design patterns, you may have tried using that language to discuss a project with LLMs; from what I've read, it's highly effective. You can ask for some implementation to be rewritten in XYZ pattern, and a good LLM will often do a really good job of it in one shot; the patterns are a shortcut for what would typically require a ton of prompting. I'm not really a design pattern guy, because I believe promiscuously reaching for abstractions leads to characteristic enterprise bloat, so I don't really know that particular vocabulary. But the concept of a vocabulary that assigns clear names to nuanced concepts could turn out to be extra powerful for high-level planning and communicating intent. What's interesting is that you can define entire ontologies like this at many levels of abstraction: in general, specific to some agentic AI coding tool or paradigm, or specific to a given project. There doesn't seem to be a limit to how deep you may want to go here; getting these conceptual frameworks loaded in for the LLM to take into account is as simple as including the relevant markdown files in its context.
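Purely as an illustration of what such a vocabulary file dropped into the context might look like (the terms and rules here are invented, not a recommendation):

```markdown
# Project vocabulary (excerpt)

- **robust**: handle errors explicitly at module boundaries; never swallow
  exceptions silently.
- **simple**: introduce no new abstraction until at least two concrete call
  sites need it.
- **testable**: every public function must be exercisable without network or
  filesystem access.
```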

This just occurred to me, but there seems to be a simple metric that could be helpful for evaluating progress toward more autonomy: the ratio of prompt size to working code output (with lines of code, say, as a temporary placeholder for the latter). The five-word instruction "apply factory pattern to makeWidget" leading to a successful 400 LoC change yields 80 LoC per prompt word.
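A trivial sketch of that calculation, using the example above:

```python
def loc_per_prompt_word(prompt: str, loc_changed: int) -> float:
    """Crude autonomy metric: lines of working code produced per prompt word."""
    return loc_changed / len(prompt.split())

# The example above: a five-word instruction producing a successful 400 LoC change.
print(loc_per_prompt_word("apply factory pattern to makeWidget", 400))  # 80.0
```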

As for where the rubber meets the road on quality (and I might be overemphasizing this aspect, I'm not sure): I think testing (and test automation) is an even bigger force multiplier for AI coding than it is for human dev teams. It's interesting because the importance and impact of testing on software quality were established well before LLMs. With LLMs the cost of testing keeps dropping, because LLMs regularly excel at maintaining test code and punch above their weight there relative to their effectiveness in general editing. This makes the ROI of establishing test coverage in your software that much higher. You would have to make a really strong argument to justify not establishing testing in your project, and it's never too late to start (although sometimes the best way to improve a software project can indeed be to rebuild it from scratch, with the existing one serving as a reference on what not to do).

If you had a mythical test suite that ran quickly and somehow provided 100% confidence that a green result means no defects were introduced, then you could let the AI tools run wild and ensure quality only ever goes up! In practice it would take an infinite amount of work to reach 100%, but even if you only have, say, 85 or 95% functional test coverage, that's still a completely game-changing leg up over not having any tests.
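In that spirit, here's a rough sketch of using a test suite as the gate on AI-proposed edits. It assumes a git repo and a pytest suite, and the `apply_edit` callable is hypothetical:

```python
# Sketch: accept an AI-proposed edit only if the test suite stays green.
import subprocess

def tests_pass() -> bool:
    return subprocess.run(["pytest", "-q"]).returncode == 0

def apply_if_green(apply_edit) -> bool:
    apply_edit()  # e.g. write the model's proposed changes to the working tree
    if tests_pass():
        subprocess.run(["git", "commit", "-am", "AI-proposed change"], check=True)
        return True
    subprocess.run(["git", "checkout", "--", "."], check=True)  # discard the edit
    return False
```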

One vision I have of the future is spending 70% of our time on defining and refining tests, with the tools generally able to work out autonomously how to satisfy them. The remaining time would go to the intervention required to fix architectural blunders.