Django: fixing a memory “leak” from Python 3.14’s incremental garbage collection

Back in February, I encountered an out-of-memory error while migrating a client project to Python 3.14. The issue occurred when running Django’s database migration command (migrate) on a limited-resource server, and seemed to be caused by the new incremental garbage collection algorithm in Python 3.14.

At the time, I wrote a workaround and started on this blog post, but other tasks took priority and I never got around to finishing it. But four days ago, Hugo van Kemenade, the Python 3.14 release manager, announced that the new garbage collection algorithm will be reverted in Python 3.14.5, and the next Python 3.15 alpha release, due to reports of increased memory usage.

Here’s the story of my workaround, as extra evidence that reverting incremental garbage collection is a good call.

Python 3.14’s incremental garbage collection

Python (well, CPython) has a garbage collector that runs regularly to clean up unreferenced objects. Most objects are cleaned up immediately when their reference count drops to zero, but some objects can be part of reference cycles, where some set of objects reference each other and thus never reach a reference count of zero. The garbage collector sweeps through all objects to find and clean up these cycles.
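As a minimal sketch of the difference between reference counting and cycle collection, the snippet below builds two objects that point at each other and checks when they are actually freed. The `Node` class and `make_cycle()` helper are made up for illustration.

```python
import gc
import weakref


class Node:
    """A plain object that can hold a reference to another Node."""

    def __init__(self):
        self.other = None


def make_cycle():
    """Create two objects that reference each other; return a weak reference to one."""
    a, b = Node(), Node()
    a.other = b
    b.other = a
    return weakref.ref(a)


gc.disable()  # keep the automatic collector out of the way for the demo
try:
    ref = make_cycle()
    # The local names are gone, but each object still holds the other,
    # so neither reference count reaches zero.
    alive_before = ref() is not None
    found = gc.collect()  # a full collection finds and frees the cycle
    alive_after = ref() is not None
finally:
    gc.enable()

print(alive_before, alive_after)  # True False
```

The weak reference lets the demo observe the object without keeping it alive.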

Python 3.14 changed garbage collection to operate incrementally. Previously, a garbage collection run would sweep through all objects in one go, but this could lead to “stop the world” stalls where your program’s real work could pause for seconds while the garbage collector did its job. The incremental garbage collection algorithm instead does a fraction of the work at a time, spreading out the cost of garbage collection.

Here’s the full release note (historical source):

Incremental garbage collection

The cycle garbage collector is now incremental. This means that maximum pause times are reduced by an order of magnitude or more for larger heaps.

There are now only two generations: young and old. When gc.collect() is not called directly, the GC is invoked a little less frequently. When invoked, it collects the young generation and an increment of the old generation, instead of collecting one or more generations.

The behavior of gc.collect() changes slightly:

  • gc.collect(1): Performs an increment of garbage collection, rather than collecting generation 1.
  • Other calls to gc.collect() are unchanged.

(Contributed by Mark Shannon in 108362.)
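For reference, the two call styles mentioned in the release note look like this. This is only a sketch of the public `gc` API; the numbers returned depend on the interpreter version and on whatever garbage happens to exist when it runs.

```python
import gc

# A call with no argument requests a full collection and returns the number
# of unreachable objects it found.
full = gc.collect()

# Passing a generation number is also accepted. Per the release note above,
# on Python 3.14 gc.collect(1) performs an increment of collection rather
# than collecting generation 1.
partial = gc.collect(1)

print(full, partial, gc.get_threshold())
```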

The problem

I’d been helping one of my clients upgrade to Python 3.14 for a few months, chipping away at compatibility work like upgrading dependencies and fixing deprecations. Tests were finally all passing and everything was working on the local development server. The next stop was to launch a temporary deployment using Python 3.14 via Heroku’s review apps feature.

At the basic tier, Heroku review apps use fairly resource-constrained servers, including just 512MB of RAM, with the ability to temporarily burst up to nearly 1GB (200%). Paying for larger servers is an option, but unfortunately the next step up is pretty expensive.

When I launched a review app for my Python 3.14 branch, I found its release phase failed while running migrate. Inspecting the logs, I found the migrations started fine:

$ heroku logs --app example-python-314-wsgk3w --num 1000 | less
...
app[release.6634]: System check identified no issues (26 silenced).
app[release.6634]: Operations to perform:
app[release.6634]: Apply all migrations: admin, auth, contenttypes, ...
app[release.6634]: Running migrations:

…but partway through, these messages started appearing:

heroku[release.6634]: Process running mem=527M(101.5%)
heroku[release.6634]: Error R14 (Memory quota exceeded)

…ramping up until the 200% mark:

heroku[release.9599]: Process running mem=977M(190.3%)
heroku[release.9599]: Error R14 (Memory quota exceeded)

…and finally the termination of the release process:

heroku[release.9599]: Process running mem=1033M(201.7%)
heroku[release.9599]: Error R15 (Memory quota vastly exceeded)
heroku[release.9599]: Stopping process with SIGKILL

These messages came from Heroku’s process management layer, which terminated the memory-hungry release process with SIGKILL after the hard threshold of 1GB memory usage was breached. Repeat attempts hit the same issue.

I was confused: migrations should not consume much memory. While they create a lot of temporary objects (Django model classes and fields) in order to calculate the SQL to send to the database, such objects are all short-lived and should be garbage-collected fairly swiftly. Additionally, migrations worked fine on the local and CI environments, and they’d never had memory issues on previous Python versions.

It looked like there was a memory leak, and it was time to dig in.

Initial investigation

I first profiled memory usage of migrate locally using Memray, the memory profiler that I covered in my previous post, using:

$ memray run manage.py migrate

The profiles revealed that memory usage had slightly increased on Python 3.14 compared to 3.13, but did not find a memory leak (a pattern of continual growth). Still, I made some optimizations to defer some imports, saving about 30% of startup memory usage, and tried again, to no avail.

I then had the idea to profile on a Heroku dyno directly. After hacking the release process to not run migrations, I built a review app and SSH’d into its web server:

$ heroku ps:exec -a example-python-314-rspwtc --dyno web.1 bash
Establishing credentials... done
Connecting to web.1 on ⬢ example-python-314-rspwtc...
~ $

Initially, I tried using Memray’s live mode to profile the migrations as they ran:

$ memray run --live manage.py migrate

While this tool looks great for some situations, it didn’t really work here, especially since it seized up after Heroku terminated the server.

I then tried running the default memray run command:

$ memray run manage.py migrate
Writing profile results into memray-manage.py.724.bin

…then, on my local computer, I repeatedly ran this command to copy down the results file:

$ trash memray-manage.py.724.bin && heroku ps:copy -a example-python-314-rspwtc --dyno web.1 memray-manage.py.724.bin

I was a bit worried here that the Memray binary file might be corrupted due to copying it while memray run was generating it. But with a final truncated copy left over after the server crashed, I asked Memray to generate a flamegraph for it:

$ memray flamegraph memray-manage.py.724.bin

…and it worked! Kudos to the Memray team for making their output format usable even when incomplete.

This more detailed flamegraph revealed more than 50% of the memory usage was allocated in ModelState.render(), which creates temporary model classes:

class ModelState:
    ...

    def render(self, apps):
        """Create a Model object from our current state into the given apps."""
        ...
        return type(self.name, bases, body)

This information hinted that these temporary model classes were hanging around beyond their expected short lifetime, leading to the memory leak. For example, every model class might also end up in a list intended for debugging, accidentally extending the lifetime of these temporary classes.

I decided to dig a bit deeper using machete-mode debugging, with the below snippet that captures the temporary model classes and logs details about them. I wrote this within the Django settings file, where it was guaranteed to run at Django startup time, before the migrate management command.

import atexit
import gc
import tracemalloc
import weakref
from itertools import islice

from django.db.migrations.state import ModelState

tracemalloc.start(2)

orig_render = ModelState.render

rendered_classes = weakref.WeakSet()


def wrapped_render(*args, **kwargs):
    cls = orig_render(*args, **kwargs)
    rendered_classes.add(cls)
    return cls


ModelState.render = wrapped_render


@atexit.register
def show_referrers():
    print(f"🎯 {len(rendered_classes)} classes referred to.\n")

    for cls in islice(rendered_classes, 2):
        print(f"🎁🎁🎁 {cls!r} 🎁🎁🎁")
        for i, referrer in enumerate(gc.get_referrers(cls), start=1):
            print(f"🍌 Referrer #{i}: {referrer!r}")
            if tb := tracemalloc.get_object_traceback(referrer):
                print("\n".join(tb.format(most_recent_first=True)))
            print()
        print()
        print()

Note:

  1. tracemalloc.start() starts Python’s built-in memory allocation tracking.
  2. The ModelState.render() method was monkeypatched with a wrapper that stores every temporary model class in a WeakSet.
  3. The @atexit.register-decorated function runs at the end of the program, and logs two things.
  4. The first piece of logging is the number of temporary model classes still alive at the end of the program, which should be close to zero. (Some may stick around from the final migration state.)
  5. The second piece of logging iterates over the first two live temporary model classes and logs their name and their referring objects, discovered via gc.get_referrers(). For each referring object, it also logs the traceback of where that object was allocated, using tracemalloc.get_object_traceback() (which is why tracemalloc.start() was needed at the beginning).
  6. The emojis are a bit of fun to make the log messages easier to skim through. I have no idea why I picked 🎁 and 🍌!!
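The same technique can be tried without Django. The sketch below is a toy version under stated assumptions: `render()` and `debug_log` are hypothetical stand-ins, with `debug_log` playing the part of an accidental long-lived reference. It tracks throwaway classes in a `WeakSet`, then asks `gc.get_referrers()` who is keeping a survivor alive.

```python
import gc
import weakref

rendered_classes = weakref.WeakSet()
debug_log = []  # hypothetical stand-in for an accidental long-lived reference


def render(name, keep=False):
    """Create a throwaway class, track it weakly, and optionally leak it."""
    cls = type(name, (), {})
    rendered_classes.add(cls)
    if keep:
        debug_log.append(cls)
    return cls


for i in range(5):
    render(f"Temp{i}", keep=(i == 3))

gc.collect()  # classes sit in reference cycles, so a collection is needed

survivors = list(rendered_classes)
print(len(survivors), survivors[0].__name__)  # 1 Temp3

# Ask the garbage collector which objects still point at the survivor.
referrers = gc.get_referrers(survivors[0])
leak_found = any(r is debug_log for r in referrers)
print(leak_found)  # True
```

Note that the `survivors` list shows up among the referrers too, much like the `WeakSet.__iter__` generator does in the real output: the inspection code itself holds references while it runs.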

The output from this hook was voluminous, even with the limit to the first two live classes. For example, here’s the output for a temporary ContentType model class:

🎁🎁🎁 <class '__fake__.ContentType'> 🎁🎁🎁
🍌 Referrer #1: <generator object WeakSet.__iter__ at 0x1234ef300>
  File "/.../example/core/apps.py", line 45
    for cls in islice(rendered_classes, 2):

...

🍌 Referrer #11: {'name': 'model', ..., 'model': <class '__fake__.ContentType'>}
  File "/.../.venv/lib/python3.14/site-packages/django/utils/functional.py", line 47
    res = instance.__dict__[self.name] = self.func(instance)
  File "/.../.venv/lib/python3.14/site-packages/django/db/models/fields/__init__.py", line 1210
    self.validators.append(validators.MaxLengthValidator(self.max_length))

I checked the live referrers for a few classes, and they all seemed to be expected. However, it did reveal just how many cycles exist between ORM objects. For example, model classes refer to their field objects, which in turn refer back to their model classes, thanks to Django’s Field.contribute_to_class() creating this reference:

def contribute_to_class(self, cls, name, private_only=False):
    ...
    self.model = cls
    ...

Anyway, from comparing the output between Python 3.13 and 3.14, I could see that no new references were being created on Python 3.14. It seemed likely that the incremental garbage collection algorithm was the culprit.

The workaround

Given the investigation, I wanted to work around the issue by forcing a full garbage collection sweep with gc.collect() after each migration file ran. I came up with the below code, saved as management/commands/migrate.py in one of the project’s Django apps. It extends the default migrate command to run gc.collect() after each successful migration (where “apply” is forwards and “unapply” is backwards).

import gc

from django.core.management.commands.migrate import Command as BaseCommand


class Command(BaseCommand):
    """Extended 'migrate' command."""

    def migration_progress_callback(self, action, migration=None, fake=False):
        """
        Extend Django’s migration progress reporting to force garbage
        collection after each migration. This is a workaround to keep memory
        usage low, especially because we have a low limit on Heroku. It seems
        the incremental garbage collector introduced in Python 3.14 cannot
        keep up with the migration process’s tendency to create many cyclical
        objects, so our best fallback is to force collection of everything
        after each migration is applied or unapplied.

        https://adamj.eu/tech/2026/04/20/django-python-3.14-incremental-gc/
        """
        super().migration_progress_callback(action, migration=migration, fake=fake)
        if action in ("apply_success", "unapply_success"):
            gc.collect()

It felt a bit hacky, but it did the trick! The review app launched successfully, showing a flat memory profile as before.
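One possible refinement, sketched here as an assumption rather than something the project actually did, is to gate the forced collection on the interpreter version so that it switches itself off once the revert ships. The helper names are hypothetical, and the version range assumes the revert lands in 3.14.5 as announced.

```python
import gc
import sys


def needs_forced_gc(version=sys.version_info):
    """Return True for Python 3.14.0 through 3.14.4 (the assumed affected range)."""
    return (3, 14, 0) <= tuple(version[:3]) < (3, 14, 5)


def collect_if_needed(version=sys.version_info):
    """Run a full collection only on affected versions; report whether it ran."""
    if needs_forced_gc(version):
        gc.collect()
        return True
    return False


print(needs_forced_gc((3, 14, 2)))  # True
print(needs_forced_gc((3, 14, 5)))  # False
print(needs_forced_gc((3, 13, 9)))  # False
```

In the command above, `collect_if_needed()` would take the place of the bare `gc.collect()` call.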

We then continued to deploy to staging and production without any issues, and the team have been happily using Python 3.14 for over a month now.

Fin

Well, that’s where the tale ends right now. After the incremental garbage collection algorithm is reverted in Python 3.14.5, I guess I’ll be able to remove this workaround.

While it would be nice to have incremental garbage collection work well, it’s clear that the current implementation has some issues. I think the core team is making the right call reverting it, but hopefully there will be energy to improve the feature for the future.

May your garbage be collected efficiently and without fuss,

—Adam

jgbishop
16 hours ago
What a niche bug! Good to know that this exists, however...
Raleigh, NC

Updating Gun Rocket through 10 years of Unity Engine

About 10 years ago I made Gun Rocket.

It was early in my game development journey. I had released 5 prototype games on Game Jolt, and it was time to sit down and make something worth paying for. I started with the idea "What if n++...but with the Asteroids ship?"

Development took about a month. The result was a game with 100 levels, multiple ships with different stats to pilot, and even a LAN multiplayer combat mode. Gun Rocket also stands out as my most lucrative personal project. After a successful Steam Greenlight process I was approached and licensed the Steam distribution rights for the game for a few years.

Recently I was reflecting on my game development journey. I tried to boot up Gun Rocket to play it. But it refused. No matter how hard I clicked the game would not open. The log is empty. I guess some driver or Windows API just doesn't work anymore.

So it is time to roll up my sleeves and bring Gun Rocket into 2026. Come along won't you? I could use the company.

Let's start by opening the game in Unity Editor. We'll test the game in its current editor version and re-acquaint ourselves here before moving on. The version of a Unity project is stored in /ProjectSettings/ProjectVersion.txt. It's a simple file with a simple purpose. Here's what I see:

m_EditorVersion: 5.5.0f3
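If you want to check this from a script, the file is easy to read. Here is a small Python sketch; the helper name is made up, and it assumes the `key: value` layout shown above.

```python
def read_editor_version(text):
    """Return the value of the m_EditorVersion line, or None if it is absent."""
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "m_EditorVersion":
            return value.strip()
    return None


sample = "m_EditorVersion: 5.5.0f3\n"
print(read_editor_version(sample))  # 5.5.0f3
```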

Looking back at the git history of this file, I can see that I actually developed the game in 4.6.0p1 in 2015. The ProjectVersion file was created when migrating from 4.6 to 5.5 in 2018 hoping it would fix a bug (it didn’t). So there's our first interesting factoid about how Unity has changed. Crazy how time flies.

Anyway! Looks like Gun Rocket was most recently developed in Unity 5.5.0f3. The current Unity tech stream is 6.5 beta. That doesn't seem so bad! Just one major version bump, right?

WRONG!

Some time around 2017, Unity decided that its numbering was not corporate-friendly. At that time they were trying to expand from gaming into more verticals. I guess corporations love versioning their software by year, so that's what Unity did. It makes the messaging about long-term support easier. Let's say Unity supports a release for 3 years. When does that end? It's much easier to talk about that for Unity 2017 (2017 + 3 = 2020) than for Unity 5.5 (???).

Nowadays Unity is back to simple numbers. Today’s major version number is 6. At least...that's what the website says. Unity version numbers now look something like 6000.4.1f1. I find this hilarious. It reminds me of Looney Tunes technology naming. Roadrunner Catcher 3000 anyone? Again, there is a good reason for this. 6000 > 2023. 2023 is Unity's last year-named version. So all of the version sorting code will continue to Just Work™. A Good Reason. But I still find it funny.
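Here's a rough sketch of the kind of sorting code that keeps working. This is my guess at a typical implementation, not Unity's own: the key function is hypothetical, the 2017, 2023 and plain "6" version strings are only illustrative, and comparing the stage letter alphabetically is a simplification.

```python
import re


def version_key(version):
    """Split a Unity-style version such as '6000.4.1f1' into a sortable tuple."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)([a-z])(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version: {version!r}")
    major, minor, patch, stage, build = match.groups()
    # Compare numeric parts as integers so that 6000 > 2017 > 5.
    return (int(major), int(minor), int(patch), stage, int(build))


versions = ["6000.4.1f1", "5.5.0f3", "2017.1.0f1", "4.6.0p1", "5.6.7f1"]
ordered = sorted(versions, key=version_key)
print(ordered)
# ['4.6.0p1', '5.5.0f3', '5.6.7f1', '2017.1.0f1', '6000.4.1f1']

# A plain "6" major version would have sorted before the year-named releases.
print(version_key("6.0.0f1") < version_key("2023.1.0f1"))  # True
```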

So I open Unity Hub and look for 5.5.0f3. It's not one of the readily available options. Unity presents Official Releases (long-term support and the latest supported minor release 6000.4.1f1), Pre-releases (currently just the 6000.5.0b1 beta), and ArChIvE. We'll be spending a lot of time in the archive. I like to think of it as the back room in the basement where folks store things they just can't bear to part with yet. It's super excellent that all of these versions are kept around. It means my ambition to bring Gun Rocket into 2026 has legs - if only barely. The archive only goes back to Unity 5. Good thing I upgraded from 4.6 in 2018!

Wow, all this history and we haven't even opened the editor yet. Let's try that now.

It does the same thing as the game build on Steam: just closes with no information in the log. Shoot.

Some Google research tells me this might be related to the license check. Unity 5 pre-dates Unity Hub. So sure, it makes sense that it could be a license check issue. I try to open from the Unity.exe rather than through Hub as suggested. No luck.

Ok then, let's try a newer version. I wanted to verify the game in 5.5, but I guess I am out of luck. I nab the most recent Unity 5: version 5.6.7f1. Again, it doesn't launch from Unity Hub, but that's what I expect at this point. What about launching from the Unity.exe?

jgbishop
2 days ago
Good read
Raleigh, NC

Let’s talk about LLMs

Everybody seems to agree we’re in the middle of something, though what, exactly, seems to be up for debate. It might be an unprecedented revolution in productivity and capabilities, perhaps even the precursor to a technological “singularity” beyond which it’s impossible to guess what the world might look like. It might be just another vaporware hype cycle that will blow over. It might be a dot-com-style bubble that will lead to a big crash but still leave us with something useful (the way the dot-com bubble drove mass adoption of the web). It might be none of those things.

Many thousands of words have already been spent arguing variations of these positions. So of course today I’m going to throw a few thousand more words at it, because that’s what blogs are for. At least all the ones you’ll read here were written by me (and you can pry my em-dashes from my cold, dead hands).

Terminology, and picking a lane

But first, a couple quick notes:

I’m going to be using the terms “LLM” and “LLMs” almost exclusively in this post, because I think the precision is useful. “AI” is a vague and overloaded term, and it’s too easy to get bogged down in equivocations and debates about what exactly someone means by “AI”. And virtually everything that’s contentious right now about programming and “AI” is really traceable specifically to the advent of large language models. I suppose a slightly higher level of precision might come from saying “GPT” instead, but OpenAI keeps trying to claim that one as their own exclusive term, which is a different sort of unwelcome baggage. So “LLMs” it is.

And when I talk about “LLM coding”, I mean use of an LLM to generate code in some programming language. I use this as an umbrella term for all such usage, whether done under human supervision or not, whether used as the sole producer of code (with no human-generated code at all) or not, etc.

I’m also going to try to limit my comments here to things directly related to technology and to programming as a profession, because that’s what I know (I have a degree in philosophy, so I’m qualified to comment on some other aspects of LLMs, but I’m deliberately staying away from them in this post because I find a lot of those debates tedious and literally sophomoric, as in reminding me of things I was reading and discussing when I was a sophomore).

If you’re using an LLM in some other field, well, I probably don’t know that field well enough to usefully comment on it. Having seen some truly hot takes from people who didn’t follow this principle, I’ve thought several times that we really need some sort of cute portmanteau of “LLM” and “Gell-Mann Amnesia” for the way a lot of LLM-related discourse seems to be people expecting LLMs to take over every job and field except their own.

No silver bullet

A few years ago I wrote about Fred Brooks’ No Silver Bullet, and said I think it may have been the best thing Brooks ever wrote. If you’ve never read No Silver Bullet, I strongly recommend you do so, and I recommend you read the whole thing for yourself (rather than just a summary of it).

No Silver Bullet was published at a time when computing hardware was advancing at an incredible rate, but our ability to build software was not even close to keeping up. And so Brooks made a bold prediction about software:

There is no single development, in either technology or management technique, which by itself promises even a single order-of-magnitude improvement within a decade in productivity, in reliability, in simplicity.

To support this he looked at sources of difficulty in software development, and assigned them to two broad categories (emphasis as in the original):

Following Aristotle, I divide them into essence—the difficulties inherent in the nature of the software—and accidents—those difficulties that today attend its production but that are not inherent.

A classic example is memory management: some programming languages require the programmer to manually allocate, keep track of, and free memory, which is a source of difficulty. And this is accidental difficulty, because there’s nothing which inherently requires it; plenty of other programming languages have automatic memory management.

But other sources of difficulty are different, and seem to be inherent to software development itself. Here’s one of the ways Brooks summarizes it (emphasis matches what’s in my copy of No Silver Bullet):

The essence of a software entity is a construct of interlocking concepts: data sets, relationships among data items, algorithms, and invocations of functions. This essence is abstract, in that the conceptual construct is the same under many different representations. It is nonetheless highly precise and richly detailed.

I believe the hard part of building software to be the specification, design, and testing of this conceptual construct, not the labor of representing it and testing the fidelity of the representation. We still make syntax errors, to be sure; but they are fuzz compared to the conceptual errors in most systems.

If this is true, building software will always be hard. There is inherently no silver bullet.

And to drive the point home, he also explains the diminishing returns of only addressing accidental difficulty:

How much of what software engineers now do is still devoted to the accidental, as opposed to the essential? Unless it is more than 9/10 of all effort, shrinking all the accidental activities to zero time will not give an order of magnitude improvement.

This is a straightforward mathematical argument. If its two empirical premises—that the accidental/essential distinction is real and that the accidental difficulty remaining today does not represent 90%+ of total—are true, then the conclusion which rules out an order-of-magnitude gain from reducing accidental difficulty follows automatically.

I think most programmers believe the first premise, at least implicitly, and once the first premise is accepted it becomes very difficult to argue against the second. In fact, I’d personally go further than the minimum required for Brooks’ argument. His math holds up as long as accidental difficulty doesn’t reach that 90%+ mark, since anything lower makes a 10x improvement from eliminating accidental difficulty impossible. But I suspect accidental difficulty, today, is a vastly smaller proportion of the total than that. In a lot of mature domains of programming I’d be surprised if there’s even a doubling of productivity still available from a complete elimination of remaining accidental difficulty.
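To make the arithmetic concrete, here is a small sketch of the bound (it has the same shape as Amdahl's law). The function name and the sample fractions are mine, not Brooks'.

```python
def max_speedup(accidental_fraction, remaining=0.0):
    """Overall speedup when accidental work shrinks to `remaining` of its old size.

    `accidental_fraction` is the share of total effort that is accidental;
    the essential share (1 - accidental_fraction) is left untouched.
    """
    essential = 1.0 - accidental_fraction
    new_total = essential + accidental_fraction * remaining
    if new_total <= 0:
        raise ValueError("some essential work must remain")
    return 1.0 / new_total


# Accidental difficulty must be 90% of all effort for a 10x gain to be possible.
print(round(max_speedup(0.9), 2))  # 10.0
# If it is only half of all effort, eliminating it entirely yields a doubling.
print(round(max_speedup(0.5), 2))  # 2.0
```

Anything below the 90% mark caps the gain under 10x, which is all Brooks' argument needs.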

There’s also a section in No Silver Bullet about potential “hopes for the silver” which addresses “AI”, though what Brooks considered to be “AI” (and there is a tangent about clarifying exactly what the term means) was significantly different from what’s promoted today as “AI”. The most apt comparison to LLMs in No Silver Bullet is actually not the discussion of “AI”, it’s the discussion of automatic programming, which has meant a lot of different things over the years, but was defined by Brooks at the time as “the generation of a program for solving a problem from a statement of the problem specifications”. That’s pretty much the task for which LLMs are currently promoted to programmers.

But Brooks quotes David Parnas on the topic: “automatic programming always has been a euphemism for programming with a higher-level language than was presently available to the programmer.” And Brooks did not believe higher-level languages on their own could be a silver bullet. As he put it in a discussion of the Ada language:

It is, after all, just another high-level language, and the biggest payoff from such languages came from the first transition, up from the accidental complexities of the machine into the more abstract statement of step-by-step solutions. Once those accidents have been removed, the remaining ones are smaller, and the payoff from their removal will surely be less.

Many people are currently promoting LLMs as a revolutionary step forward for software development, but are doing so based almost exclusively on claims about LLMs’ ability to generate code at high speed. The No Silver Bullet argument poses a problem for these claims, since it sets a limit on how much we can gain from merely generating code more quickly.

In chapter 2 of The Mythical Man-Month, Brooks suggested as a scheduling guideline that five-sixths (83%) of time on a “software task” would be spent on things other than coding, which puts a pretty low cap on productivity gains from speeding up just the coding. And even if we assume LLMs reduce coding time to zero, and go with the more generous No Silver Bullet formulation which merely predicts no order-of-magnitude gain from a single development, that’s still less than the gain Brooks himself believed could come from hiring good human programmers. From chapter 3 of The Mythical Man-Month:

Programming managers have long recognized wide productivity variations between good programmers and poor ones. But the actual measured magnitudes have astounded all of us. In one of their studies, Sackman, Erikson, and Grant were measuring performances of a group of experienced programmers. Within just this group the ratios between best and worst performances averaged about 10:1 on productivity measurements and an amazing 5:1 on program speed and space measurements!

(although I’m personally skeptical of the “10x programmer” concept, the software industry overall does seem to accept it as true)

Anecdote time: much of what I’ve done over my career as a professional programmer is building database-backed web applications and services, and I don’t see much of a gain from LLMs. I suppose it looks impressive, if you’re not familiar with this field of programming, to auto-generate the skeleton of an entire application and the basic create/retrieve/update/delete HTTP handlers from no more than a description of the data you want to work with. But that capability predates LLMs: Rails’ scaffolding, for example, could do it twenty years ago.

And not just raw code generation, but also the abstractions available to work with, have progressed to the point where I basically never feel like the raw speed of production of code is holding me back. Just as Fred Brooks would have predicted, the majority of my time is spent elsewhere: talking to people who want new software (or who want existing software to be changed); finding out what it is they want and need; coming up with an initial specification; breaking it down into appropriately-sized pieces for programmers (maybe me, maybe someone else) to work on; testing the first prototype and getting feedback; preparing the next iteration; reviewing or asking for review, etc. I haven’t personally tracked whether it matches Brooks’ five-sixths estimate, but I wouldn’t be at all surprised if it did.

Given all that, just having an LLM churn out code faster than I would have myself is not going to offer me an order of magnitude improvement, or anything like it. Or as a recent popular blog post by the CEO of Tailscale put it:

AI’s direct impact on this problem is minimal. Okay, so Claude can code it in 3 minutes instead of 30? That’s super, Claude, great work.

Now you either get to spend 27 minutes reviewing the code yourself in a back-and-forth loop with the AI (this is actually kinda fun); or you save 27 minutes and submit unverified code to the code reviewer, who will still take 5 hours like before, but who will now be mad that you’re making them read the slop that you were too lazy to read yourself. Little of value was gained.

More simply: throwing more patches into the review queue, when the review queue still drains at the same rate as before, is not a recipe for increased velocity. Real software development involves not just a review queue but all the other steps and processes I outlined above, and more, and having an LLM generate code more quickly does not increase the speed or capacity of all those other things.
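A toy simulation makes the point. The rates below are invented for illustration, and the model deliberately ignores everything except a single review queue.

```python
def simulate(days, patches_per_day, reviews_per_day):
    """Return (merged, backlog) after `days` of work with a fixed review rate."""
    backlog = 0
    merged = 0
    for _ in range(days):
        backlog += patches_per_day  # new patches enter the review queue
        done = min(backlog, reviews_per_day)  # reviewers can only do so much
        merged += done
        backlog -= done
    return merged, backlog


before = simulate(days=10, patches_per_day=2, reviews_per_day=2)
after = simulate(days=10, patches_per_day=20, reviews_per_day=2)
print(before)  # (20, 0)
print(after)  # (20, 180)
```

Ten times the patches, the same number merged, and a backlog of 180 waiting for review.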

So as someone who accepts Brooks’ argument in No Silver Bullet, I am committed to believe on theoretical grounds that LLMs cannot offer “even a single order-of-magnitude improvement … in productivity, in reliability, in simplicity”. And my own experience matches up with that prediction.

Practice makes (im)perfect

But enough theory. What about the actual, empirical reality of LLM coding?

Every fan of LLMs for coding has an anecdote about their revolutionary qualities, but the non-anecdotal data points we have are a lot more mixed. For example, several times now I’ve been linked to and asked to read the DORA report on the “State of AI-assisted Software Development”. And initially it certainly seems like it’s declaring the effects of LLMs are settled, in favor of the LLMs. From its executive summary (page 3):

[T]he central question for technology leaders is no longer if they should adopt AI, but how to realize its value.

And elsewhere it makes claims like (page 34) “AI is the new normal in software development”.

But then, going back to the executive summary, things start sounding less uniformly positive:

The research reveals a critical truth: AI’s primary role in software development is that of an amplifier. It magnifies the strengths of high-performing organizations and the dysfunctions of struggling ones.

And then (still on page 3):

The greatest returns on AI investment come not from the tools themselves, but from a strategic focus on the underlying organizational system: the quality of the internal platform, the clarity of workflows, and the alignment of teams. Without this foundation, AI creates localized pockets of productivity that are often lost to downstream chaos.

Continuing on to page 4:

AI adoption now improves software delivery throughput, a key shift from last year. However, it still increases delivery instability. This suggests that while teams are adapting for speed, their underlying systems have not yet evolved to safely manage AI-accelerated development.

“Delivery instability” is defined (page 13) in terms of two factors:

  • Change fail rate: “The ratio of deployments that require immediate intervention following a deployment.”
  • Rework rate: “The ratio of deployments that are unplanned but happen as a result of an incident in production.”
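As a sketch of how those two ratios could be computed from deployment records: the record shape and sample data here are hypothetical, and this is not a description of how DORA's survey gathers its numbers.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    needed_intervention: bool  # required immediate intervention afterwards
    unplanned_fix: bool  # unplanned, made in response to a production incident


def ratio(deployments, predicate):
    """Share of deployments matching `predicate`; 0.0 for an empty list."""
    if not deployments:
        return 0.0
    return sum(1 for d in deployments if predicate(d)) / len(deployments)


deployments = [
    Deployment(needed_intervention=False, unplanned_fix=False),
    Deployment(needed_intervention=True, unplanned_fix=False),
    Deployment(needed_intervention=False, unplanned_fix=True),
    Deployment(needed_intervention=False, unplanned_fix=False),
]

change_fail_rate = ratio(deployments, lambda d: d.needed_intervention)
rework_rate = ratio(deployments, lambda d: d.unplanned_fix)
print(change_fail_rate, rework_rate)  # 0.25 0.25
```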

Later parts of the report get into more detail on this. Page 38 charts the increase in delivery instability, for example. And elsewhere in the section containing that chart, there’s a discussion of whether increases in throughput (defined by DORA as a combination of lead time for changes, deployment frequency, and failed deployment recovery time) are enough to offset or otherwise make up for this increase in instability (page 41, emphasis added by me):

Some might argue that instability is an acceptable trade-off for the gains in development throughput that AI-assisted development enables.

The reasoning is that the volume and speed of AI-assisted delivery could blunt the detrimental effects of instability, perhaps by enabling such rapid bug fixes and updates that the negative impact on the end-user is minimized.

However, when we look beyond pure software delivery metrics, this argument does not hold up. To assess this claim, we checked whether AI adoption weakens the harms of instability on our outcomes which have been hurt historically by instability.

We found no evidence of such a moderating effect. On the contrary, instability still has significant detrimental effects on crucial outcomes like product performance and burnout, which can ultimately negate any perceived gains in throughput.

And the chart on page 38 appears to show the increase in instability as quite a bit larger than the increase in throughput, in any case.

Curiously, that chart also claims a significant increase in “code quality”, and other parts of the report (page 30, for example) claim a significant increase in “productivity”, alongside the significant increase in delivery instability, which seems like it ought to be a contradiction. As far as I can tell, DORA’s source for both “productivity” and “code quality” is perceived impact as self-reported by survey respondents. Other studies and reports have designed less subjective and more quantitative ways to measure these things. For example, this much-discussed study on adoption of the Cursor LLM coding tool used the results of static analysis of the code to measure quality and complexity. And self-reported productivity impacts, in particular, ought to be a deeply suspect measure. From (to pick one relevant example) the METR early-2025 study (emphasis added by me):

This gap between perception and reality is striking: developers expected AI to speed them up by 24%, and even after experiencing the slowdown, they still believed AI had sped them up by 20%.

LLM coding advocates have often criticized this particular study’s finding of slower development for being based on older generations of LLMs (more on that argument in a bit), but as far as I’m aware nobody’s been able to seriously rebut the finding that developers are not very effective at self-estimating their productivity. So to see DORA relying on self-estimated productivity is disappointing.

The DORA report goes on to provide a seven-part “AI capabilities model” for organizations (begins on page 49), which consists of recommendations like: strong version control practices, working in small batches, quality internal platforms, user-centric focus… all of which feel like they should be table stakes for any successful organization regardless of whether they also happen to be using LLMs.

Suppose, for the sake of a silly example, that someone told you a new technology is revolutionizing surgery, but the gains are not uniformly distributed, and the best overall outcomes are seen in surgical teams where, in addition to using the new thing, team members also wash their hands prior to operating. That’s not as extreme a comparison as it might sound: the sorts of practices recommended for maximizing LLM-related gains in the DORA report, and in many other similar whitepapers and reports and studies, are or ought to be as fundamental to software development as hand-washing is to surgery. The Joel Test was recommending quite a few of these practices a quarter-century ago, the Agile Manifesto implied several of them, and even back then they weren’t really new; if you dig into the literature on effective software development you can find variations of much of the DORA advice going all the way back to the 1970s and even earlier.

For a more recent data point, I’ve seen a lot of people talking about and linking me to CircleCI’s 2026 “State of Software Delivery” which, like the DORA report, claims an uneven distribution of benefits from LLM adoption, and even says (page 8) “the majority of teams saw little to no increase in overall throughput”. The CircleCI report also raises a worrying point that echoes the increase in “delivery instability” seen in the DORA report (CircleCI executive summary, page 3):

Key stability indicators show that AI-driven changes are breaking more often and taking teams longer to fix, making validation and integration the primary bottleneck.

CircleCI further reports (page 11) that, year-over-year, they see a 13% increase in recovery time for a broken main branch, and a 25% increase for broken feature branches. And (page 12) they also say failures are increasing:

[S]uccess rates on the main branch fell to their lowest level in over 5 years, to 70.8%. In other words, attempts at merging changes into production code bases now fail 30% of the time.

For comparison, their own recommended benchmark of success for main branches is 90%.

The cost of these increasing failures and the increasing time to resolve them is quantified (emphasis matches the report, page 14):

For a team pushing 5 changes to the main branch per day, going from a 90% success rate to 70% is the difference between one showstopping breakage every two days to 1.5 every single day (a 3x increase).

At just 60 minutes recovery time per failure, you’re looking at an additional 250 hours in debugging and blocked deployments every year. And that’s at a relatively modest scale. Teams pushing 500 changes per day would lose the equivalent of 12 full-time engineers.
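The arithmetic behind those figures can be reproduced with a short Python sketch. The report's excerpt does not state how many working days or hours it assumed, so the 250 working days per year and 2,080 hours per full-time engineer used here are assumptions; they happen to land on the quoted numbers.

```python
# Assumptions (not stated in the quoted passage): 250 working days per year
# and 2,080 working hours per full-time engineer per year.
WORKING_DAYS_PER_YEAR = 250
FTE_HOURS_PER_YEAR = 2080


def failures_per_day(changes_per_day, success_rate):
    """Expected number of failed main-branch changes per day."""
    return changes_per_day * (1.0 - success_rate)


def extra_hours_per_year(changes_per_day, old_success, new_success,
                         recovery_hours=1.0,
                         working_days=WORKING_DAYS_PER_YEAR):
    """Additional recovery hours per year caused by the drop in success rate."""
    extra_failures = (failures_per_day(changes_per_day, new_success)
                      - failures_per_day(changes_per_day, old_success))
    return extra_failures * recovery_hours * working_days


# Team pushing 5 changes per day: 90% versus 70% success.
print(f"{failures_per_day(5, 0.90):.1f}")            # 0.5, one every two days
print(f"{failures_per_day(5, 0.70):.1f}")            # 1.5 per day
print(f"{extra_hours_per_year(5, 0.90, 0.70):.0f}")  # 250 extra hours per year

# Team pushing 500 changes per day, expressed in full-time engineers.
fte = extra_hours_per_year(500, 0.90, 0.70) / FTE_HOURS_PER_YEAR
print(f"{fte:.1f}")                                  # 12.0
```

Under these assumptions the cost scales linearly with both change volume and recovery time, so the 13% and 25% increases in recovery time mentioned earlier would push the totals higher still.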

The usual response to reports like these is to claim they’re based on people using older LLMs, and the models coming out now are the truly revolutionary ones, which won’t have any of those problems. For example, this is the main argument that’s been leveled against the METR study I mentioned above. But that argument was flimsy to begin with (since it’s rarely accompanied by the kind of evidence needed to back up the claim), and its repeated usage is self-discrediting: if the people claiming “this time is the world-changing revolutionary leap, for sure” were wrong all the prior times they said that (as they have to have been, since if any prior time had actually been the revolutionary leap they wouldn’t need to say this time will be), why should anyone believe them this time?

Also, I’ve read a lot of studies and reports on LLM coding, and these sorts of findings—uneven or inconsistent impact, quality/stability declines, etc.—seem to be remarkably stable, across large numbers of teams using a variety of different models and different versions of those models, over an extended period of time (DORA does have a bit of a messy situation with contradictory claims that “code quality” is increasing while “delivery instability” is increasing even more, but as noted above that seems to be a methodological problem). The two I’ve quoted most extensively in this post (the DORA and CircleCI reports) were chosen specifically because they’re often recommended to me by advocates of LLM coding, and seem to be reasonably pro-LLM in their stances.

The other expected response to these findings is a claim that it’s not necessarily older models but older workflows which have been obsoleted, that the state of the art is no longer to just prompt an LLM and accept its output directly, but rather involves one LLM (or LLM-powered agent) generating code while one or more layers of “adversarial” ones review and fix up the code and also review each other’s reviews and responses and fixes, thus introducing a mechanism by which the LLM(s) will automatically improve the quality of the output.

I’m unaware of rigorous studies on these approaches (yet), but several well-publicized early examples do not inspire confidence. I’ll pick on Cloudflare here since they’ve been prominent advocates for using LLMs in this fashion. In their LLM rebuild of Next.js:

We wired up AI agents for code review too. When a PR was opened, an agent reviewed it. When review comments came back, another agent addressed them. The feedback loop was mostly automated.

But their public release of it, vetted through this process and, apparently, some amount of human review on top, was initially unable to run even the basic default Next.js application, and also was apparently riddled with security issues. From one disclosure post (emphasis added by me):

AI is now very good at getting a system to the point where it looks complete.

One specific problem cited was that the LLM rebuild simply did not pull in all the original tests, and therefore could miss security-critical cases those tests were checking. From the same disclosure post:

The process was feature-first: decide which viNext features existed, then port the corresponding Next.js tests. That is a sensible way to move quickly. It gives you broad happy-path coverage.

But it does not guarantee that you bring over the ugly regression tests, missing-export cases, and fail-open behavior checks that mature frameworks accumulate over years.

So middleware could look “covered” while the one test that proves it fails safely never made it over.

For example, Next.js has a dedicated test directory (test/e2e/app-dir/proxy-missing-export/) that validates what happens when middleware files lack required exports. That test was never ported because middleware was already considered “covered” by other tests.

On the whole, that post is somewhat optimistic, but considering that the Next.js rebuild was carried out by presumably knowledgeable people who presumably were following good modern practices and prompting good modern LLMs to perform a type of task those LLMs are supposed to be extremely good at—a language and framework well-represented in training data, well-documented, with a large existing test suite written in the target language to assist automated verification—I have a hard time being that optimistic.

And though I haven’t personally read through the recent alleged leak of the Claude Code source, I’ve read some commentary and analysis from people who have, and again it seems like a team that should be as well-positioned as anyone to take maximum advantage of the allegedly revolutionary capabilities of LLM coding isn’t managing to do so.

So the consistent theme here, in the studies and reports and in more recent public examples, is that being able to generate code much more quickly than before, even in 2026 with modern LLMs and modern practices, is still no guarantee of being able to deliver software much more quickly than before. As the CircleCI report puts it (page 3):

The data points to a clear conclusion: success in the AI era is no longer determined by how fast code can be written. The decisive factor is the ability to validate, integrate, and recover at scale.

And if that sounds like the kind of thing Fred Brooks used to say, that’s because it is the kind of thing Fred Brooks used to say. Raw speed of generating code is not and was not the bottleneck in software development, and speeding that up or even reducing the time to generate code to effectively zero does not have the effect of making all the other parts of software development go away or go faster.

So at this point it seems clear to me that in practice as well as in theory LLM coding does not represent a silver bullet, and it seems highly unlikely to transform into one at any point in the near future.

On being left behind

When expressing skepticism about LLM coding, a common response is that not adopting it, or even just delaying slightly in adopting it, will inevitably result in being “left behind”, or even stronger effects (for example, words like “obliterated” have been used, more than once, by acquaintances of mine who really ought to know better). LLMs are the future, it’s going to happen whether you like it or not, so get with the program before it’s too late!

I said I’d stick to the technical mode here, but I’ll just mention in passing that the “it’s going to happen whether you like it or not” framing is something I’ve encountered a lot and found to be pretty disturbing and off-putting, and not at all conducive to changing my mind. And milder forms like “It’s undeniable that…” are rhetorically suspect. The burden of proof ought to be on the person making the claim that LLMs truly are revolutionary, but framing like this tries to implicitly shift that burden and is a rare example of literally begging the question: it assumes as given the conclusion (LLMs are in fact revolutionary) that it needs to prove.

Meanwhile, I see two possible outcomes:

  1. The skeptical position wins. LLM coding tools do not achieve revolutionary silver-bullet status. Perhaps they become another tool in the toolbox, like TDD or pair programming, where some people and companies are really into them. Perhaps they become just another feature of IDEs, providing functionality like boilerplate generators to bootstrap a new project (if your favorite library/framework doesn’t provide its own bootstrap anyway).
  2. The skeptical position loses. LLM coding tools do achieve true revolutionary silver-bullet status or beyond (consistently delivering one or more orders of magnitude improvement in software development productivity), and truly become a mandatory part of every working programmer’s tools and workflows, taking over all or nearly all generation of code.

In the first case, delayed adoption has no downside unless someone happens to be working at one of the companies that decide to mandate LLM use. And they can always pick it up at that point, if they don’t mind or if they don’t feel like looking for a new job.

As to the second case: based on what I’ve argued above about the status and prospects of LLMs up to now, I obviously think that continuing the type of progress in models and practices that’s been seen to date does not offer any viable path to a silver bullet. Which means a truly revolutionary breakthrough will have to be something sufficiently different from the current state of the art that it will necessarily invalidate many (or perhaps even all) prior LLM-based workflows in addition to invalidating non-LLM-based workflows.

And even if that doesn’t result in a completely clean-slate starting point with everyone equal—even if experience with older LLM workflows is still an advantage in the post-silver-bullet world—I don’t think it can ever be the sort of insurmountable advantage it’s often assumed to be. For one thing, even with vastly higher average productivity, there likely would not be sufficient people with sufficient pre-existing LLM experience to fill the vastly expanded demand for software that would result (this is why a lot of LLM advocates, across many fields, spend so much time talking about the Jevons paradox). For another, any true silver-bullet breakthrough would have to attack and reduce the essential difficulty of building software, rather than the accidental difficulty. Let us return once again to Brooks:

I believe the hard part of building software to be the specification, design, and testing of this conceptual construct, not the labor of representing it and testing the fidelity of the representation.

Much of the skill required of human LLM users today consists of exactly this: specifying and designing the software as a “conceptual construct”, albeit in specific ways that can be placed into an LLM’s context window in order to have it generate code. In any true silver-bullet world, much or all of that skillset would have to be rendered obsolete, which significantly reduces the penalty for late adoption if and when the silver bullet is finally achieved.

Power to the people?

Aside from impact on professional programmers and professional software-development teams, another claim often made in favor of LLM coding is that it will democratize access to software development. With LLM coding tools, people who aren’t experienced professional programmers can produce software that solves problems they face in their day-to-day jobs and lives. Surely that’s a huge societal benefit, right? And it’s tons of fun, too!

Setting aside that the New York Times piece linked above was written by someone who is an experienced professional, I’m not convinced of this use case either.

Mostly I think this is a situation where you can’t have it both ways. It seems to be widely agreed among advocates of LLM coding that it’s a skill which requires significant understanding, practice, and experience before one is able to produce consistent useful results (this is the basis of the “adopt now or be left behind” claim dealt with in the previous section); strong prior knowledge of how to design and build good software is also generally recommended or assumed. But that’s very much at odds with the democratized-software claim: that someone with no prior programming knowledge or experience will simply pick up an LLM, ask it in plain non-technical natural language to build something, and receive a sufficiently functional result.

I think the most likely result is that a non-technical user will receive something that’s obviously not fit for purpose, since they won’t have the necessary knowledge to prompt the LLM effectively. They won’t know how to set up directories of Markdown files containing instructions and skill definitions and architectural information for their problem. They won’t have practice at writing technical specifications (whether for other humans or for LLMs) to describe what they want in sufficient detail. They won’t know how to design and architect good software. They won’t know how to orchestrate multiple LLMs or LLM-powered agents to adversarially review each other. In short, they won’t have any of the skills that are supposed to be vital for successful LLM coding use.

There’s also the possibility that “natural” human language alone will never be sufficient to specify programs, even to much more advanced LLMs or other future “AI” systems, due to inherent ambiguity and lack of precision. In that case, some type of specialized formal language for specifying programs would always be necessary. Edsger W. Dijkstra, for example, took this position and famously derided what he called “the foolishness of ‘natural language programming’”, which is worth reading for some classic Dijkstra-isms like:

When all is said and told, the “naturalness” with which we use our native tongues boils down to the ease with which we can use them for making statements the nonsense of which is not obvious.

Another possible outcome for LLM coding by non-programmers is the often-mentioned analogy to 3D printing, which also was hyped up as a great democratizer that would let anyone design and make anything, but never delivered on that promise and, at the individual level, became a niche hobby for the small number of enthusiasts who were willing and able to put in the time, money, and effort to get moderately good at it.

But the nightmare result is that non-programmer LLM users will receive something that seems to work, and only reveals its shortcomings much later on. Given how often I see it argued that LLMs will democratize coding and write utility programs for people working in fields where privacy and confidentiality are both vital and legally mandated, I’m terrified by that potential failure mode. And I think one of the worst possible things that could happen for advocates of LLM adoption is to have the news full of stories of well-meaning non-technical people who had their lives ruined by, say, accidentally enabling a data breach with their LLM-coded helper programs, or even “just” turning loose a subtly-incorrect financial model on their business. So even if I were an advocate of LLM coding, I’d be very wary of pushing it to non-programmers.

But ultimately, the only situation in which LLMs could meaningfully democratize access to software development is one where they achieve a true silver bullet, by significantly reducing or removing essential difficulty from the software development process. And as noted above, LLM advocates seem to believe that even in the silver-bullet situation there would still be such a gap between those with pre-existing LLM usage skills and those without, that those without could never meaningfully catch up. Although I happen to disagree with that belief, it remains the case that advocates can’t have it both ways: either LLM coding will be an exclusive club for those who built up the necessary skills, XOR it will be a great democratizer and do away with the need for those skills.

Takeaways

I’m already over 6,000 words into this post, and though I could easily write many more, I should probably wrap it up.

If I had to summarize my position on LLM coding in one sentence, it would be “Please go read No Silver Bullet”. I think Brooks’ argument there is both theoretically correct and validated by empirical results, and sets some pretty strong limits on the impact LLM coding, or any other tool or technique which solely or primarily attacks accidental difficulty, can have.

Of course, limits on what we can do or gain aren’t necessarily the end of the world. Many of the foundations of computer science, from On Computable Numbers to Rice’s theorem and beyond, place inflexible limits on what we can do, but we still write software nonetheless, and we still work to advance the state of our art. So the No Silver Bullet argument is not the same as arguing that LLMs are necessarily useless, or that no gains can possibly be realized from them. But it is an argument that any gains we do realize are likely going to be incremental and evolutionary, rather than the world-changing revolution many people seem to be expecting.

Correspondingly, I think there is not a huge downside, right now, to slow or delayed adoption of LLM coding. Very few organizations have the strong fundamentals needed to absorb even a relatively moderate, incremental increase in the amount of code they generate, which I suspect is why so many studies and reports find mixed results and lots of broken CI pipelines. Not only is there no silver bullet, there especially is no quick or magical gain to be had from rushing to adopt LLM coding without first working on those fundamentals. In fact, the evidence we have says you’re more likely to hurt than help your productivity by doing so.

I also don’t think LLMs are going to meaningfully democratize coding any time soon; even if they become indispensable tools for programmers, they are likely to continue requiring users to “think like a programmer” when specifying and prompting. We would be much better served by teaching many more people how to think rigorously and reason about abstractions (and they would be much better served, too) than we would by just plopping them as-is in front of LLMs.

As for what you should be doing instead of rushing to adopt LLM coding out of fear that you’ll be left behind: I think you should be listening to what all those whitepapers and reports and studies are actually telling you, and working on fundamentals. You should be adopting and perfecting solid foundational software development practices like version control, comprehensive test suites, continuous integration, meaningful documentation, fast feedback cycles, iterative development, focus on users, small batches of work… things that have been known and proven for decades, but are still far too rare in actual real-world software shops.

If the skeptical position is wrong and it turns out LLMs truly become indispensable coding tools in the long term, well, the available literature says you’ll be set up to take the greatest possible advantage of them. And if it turns out they don’t, you’ll still be in much better shape than you were, and you’ll have an advantage over everyone who chased after wild promises of huge productivity gains by ordering their teams to just chew through tokens and generate code without working on fundamentals, and who likely wrecked their development processes by doing so.

Or as Fred Brooks put it:

The first step toward the management of disease was replacement of demon theories and humours theories by the germ theory. That very step, the beginning of hope, in itself dashed all hopes of magical solutions. It told workers that progress would be made stepwise, at great effort, and that a persistent, unremitting care would have to be paid to a discipline of cleanliness. So it is with software engineering today.

jgbishop
11 days ago
What a refreshing take on all of this!
Raleigh, NC

Trump Warns Iran To Accept His Ultimatum Or Face Wrath Of Next Ultimatum

1 Comment

WASHINGTON—Threatening to continue issuing threats if the Islamic Republic did not quickly agree to his demands, President Donald Trump warned Iran on Monday to accept his ultimatum or face the wrath of his next ultimatum. “Lay down your weapons now or I will have no choice but to ask you to lay down your weapons later,” the commander in chief wrote on Truth Social, adding that the Iranian regime only had two more days to consider his terms before he would give them eight more days to consider his terms. “Mark my words, this is your last chance before your next last chance. If you do not act immediately, I won’t hesitate to wait even longer. You may think I’m bluffing, but believe me when I say you will feel the full weight of my social media posts.” At press time, Trump urged Iran not to try his patience because they would find it much, much greater than they expected.

The post Trump Warns Iran To Accept His Ultimatum Or Face Wrath Of Next Ultimatum appeared first on The Onion.

jgbishop
14 days ago
Isn't The Onion supposed to be satire, and not actual fact?
Raleigh, NC

The Argyle Sweater - 2026-04-05

1 Comment
jgbishop
16 days ago
Hahaha!
Raleigh, NC

Vulnerability Research Is Cooked

1 Comment

Thomas Ptacek's take on the sudden and enormous impact the latest frontier models are having on the field of vulnerability research.

Within the next few months, coding agents will drastically alter both the practice and the economics of exploit development. Frontier model improvement won’t be a slow burn, but rather a step function. Substantial amounts of high-impact vulnerability research (maybe even most of it) will happen simply by pointing an agent at a source tree and typing “find me zero days”.

Why are agents so good at this? A combination of baked-in knowledge, pattern matching ability and brute force:

You can't design a better problem for an LLM agent than exploitation research.

Before you feed it a single token of context, a frontier LLM already encodes supernatural amounts of correlation across vast bodies of source code. Is the Linux KVM hypervisor connected to the hrtimer subsystem, workqueue, or perf_event? The model knows.

Also baked into those model weights: the complete library of documented "bug classes" on which all exploit development builds: stale pointers, integer mishandling, type confusion, allocator grooming, and all the known ways of promoting a wild write to a controlled 64-bit read/write in Firefox.

Vulnerabilities are found by pattern-matching bug classes and constraint-solving for reachability and exploitability. Precisely the implicit search problems that LLMs are most gifted at solving. Exploit outcomes are straightforwardly testable success/failure trials. An agent never gets bored and will search forever if you tell it to.

The article was partly inspired by this episode of the Security Cryptography Whatever podcast, where David Adrian, Deirdre Connolly, and Thomas interviewed Anthropic's Nicholas Carlini for 1 hour 16 minutes.

I just started a new tag here for ai-security-research - it's up to 11 posts already.

Tags: security, thomas-ptacek, careers, ai, generative-ai, llms, nicholas-carlini, ai-ethics, ai-security-research

jgbishop
17 days ago
This is one wild ride; it will be interesting to see what happens in the next few months.
Raleigh, NC
GaryBIshop
17 days ago
or terrifying. I'm very glad I'm no longer responsible for production systems.