Why AI Coding Agents Fail When Software Gets Real

A new paper names a problem every engineering team will recognize: AI coding agents look much better when they can freestyle than when they have to follow the rules of a real backend.

Source note: Francesco Dente (EURECOM), Dario Satriani (University of Basilicata), and Paolo Papotti (EURECOM). “Constraint Decay: The Fragility of LLM Agents in Backend Code Generation.” arXiv:2605.06445v1, submitted May 7, 2026.

Why This Paper Matters

Most AI coding benchmarks are still fairly loose. They ask an agent to fix a bug, write a function, or build something from a prompt. In those settings, the agent has a lot of freedom. It can organize the code however it wants, use whatever library seems easiest, and ignore most professional software patterns as long as the output appears to work.

Production software is not like that. A real backend comes with constraints. The team may require PostgreSQL. It may require a specific architecture so the code stays maintainable. It may require an Object-Relational Mapper (ORM) instead of hand-written database calls. The agent is not just being asked to make something work. It is being asked to make something work inside an existing way of building software.

This paper studies what happens when coding agents stop freelancing and have to obey ordinary engineering rules. The short version: as the constraints stack up, performance falls. That matters because most business value is not in demos. It is in software that fits an existing codebase, database, and operating model.

The Idea in Plain English

Imagine a brilliant but erratic junior developer. If the task is “build a simple store,” they do well. But if the task is “build a simple store, put the database logic in this exact layer, use this exact query library, and make sure no route handler talks directly to the database,” the quality suddenly drops.

The code still looks plausible. But now the developer forgets basic database logic, misuses the library, or puts behavior in the wrong layer. Coding agents show the same pattern: the extra rules do not merely add work. They seem to interfere with the agent’s ability to solve the core problem.

That is what the paper calls constraint decay: the performance drop that happens when an LLM agent is forced to satisfy real software constraints at the same time as the functional task.

What the Researchers Tested

The researchers used a testbed called the “Conduit” API, based on the RealWorld project. It is a medium-sized social blogging app with 19 endpoints for articles, comments, users, and profiles.

They tested two agent setups:

  1. Mini-SWE-Agent: A simple script that lets the AI use a terminal and bash commands.
  2. OpenHands: A fuller agent platform that can search code, edit files, and track tasks.

They paired those agents with strong coding models, including GPT-5.2, MiniMax-M2.5, and Qwen3-Coder-Next.

Then they created four levels of constraint:

  • Level 0: Build the API in a specified web framework.
  • Level 1: Add one structural rule: Clean Architecture, SQLite, or PostgreSQL.
  • Level 2: Combine two rules, such as Clean Architecture plus PostgreSQL, or PostgreSQL plus an ORM.
  • Level 3: Use the full professional stack: Clean Architecture, a specified database, and a specified ORM. The ORM is SQLAlchemy for Python and Sequelize for Node.js.

They tested this across eight web frameworks, including Flask, FastAPI, Django, and Express, to see how much the framework changed the outcome.
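To make the layering constraint concrete, here is a minimal sketch of the kind of separation a Clean Architecture rule asks for. It is not code from the paper. The names (Article, ArticleRepository, ArticleService, create_article_handler) are hypothetical, and an in-memory dictionary stands in for the ORM and database so the example runs with the standard library alone.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


@dataclass
class Article:
    slug: str
    title: str


class ArticleRepository(Protocol):
    """Data-access boundary: the only layer allowed to touch storage."""

    def get_by_slug(self, slug: str) -> Article | None: ...

    def add(self, article: Article) -> None: ...


class InMemoryArticleRepository:
    """Hypothetical stand-in for a repository backed by an ORM and a database."""

    def __init__(self) -> None:
        self._rows: dict[str, Article] = {}

    def get_by_slug(self, slug: str) -> Article | None:
        return self._rows.get(slug)

    def add(self, article: Article) -> None:
        self._rows[article.slug] = article


class ArticleService:
    """Business rules live here and depend only on the repository interface."""

    def __init__(self, repo: ArticleRepository) -> None:
        self._repo = repo

    def create(self, title: str) -> Article:
        slug = "-".join(title.lower().split())
        if self._repo.get_by_slug(slug) is not None:
            raise ValueError(f"article already exists: {slug}")
        article = Article(slug=slug, title=title)
        self._repo.add(article)
        return article


def create_article_handler(service: ArticleService, payload: dict) -> tuple[int, dict]:
    """Route-handler stand-in: maps input and output, never touches storage."""
    try:
        article = service.create(payload["title"])
    except ValueError as exc:
        return 422, {"error": str(exc)}
    return 201, {"article": {"slug": article.slug, "title": article.title}}


service = ArticleService(InMemoryArticleRepository())
print(create_article_handler(service, {"title": "Hello World"}))
```

The point of the shape is that the handler knows only about the service, and the service knows only about the repository interface. Swapping the in-memory class for a database-backed one should not require touching the other two layers.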

What They Found

The results were not subtle. Performance fell as constraints were added.

1. The 30-Point Drop

Across the strongest configurations, success dropped by roughly 30 percentage points when moving from Level 0 to Level 3. In other words, an agent that looked strong when given freedom became much less reliable when it had to follow a full professional stack.

2. The Database Caused the Most Trouble

The biggest single source of failure was the data layer. Requiring PostgreSQL, instead of letting the agent choose a simpler path, produced the steepest decline in quality.

3. The “Hard” Pass Failure

Agents often passed many individual assertions but failed to pass the entire suite. For the most complex greenfield tasks, the strongest configuration reached a 78.6% assertion pass rate but only 8.3% pass@1, the share of first attempts that pass every test. That is the practical warning: partial correctness can look impressive while still being too brittle for production.
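The gap between the two numbers is easier to see with a small calculation. The sketch below uses made-up data, not the paper’s results, and assumes a simple reading of the metrics: the assertion pass rate pools every individual assertion, while the strict rate counts a run only if all of its assertions pass. The paper’s exact definitions may differ.

```python
def assertion_pass_rate(runs: list[list[bool]]) -> float:
    """Fraction of individual assertions that passed, pooled over all runs."""
    total = sum(len(run) for run in runs)
    passed = sum(sum(run) for run in runs)
    return passed / total


def strict_pass_rate(runs: list[list[bool]]) -> float:
    """Fraction of runs in which every assertion passed."""
    return sum(all(run) for run in runs) / len(runs)


# Hypothetical data: 4 runs with 10 assertions each.
runs = [
    [True] * 10,
    [True] * 9 + [False],
    [True] * 8 + [False] * 2,
    [True] * 7 + [False] * 3,
]

print(assertion_pass_rate(runs))  # 34 of 40 assertions -> 0.85
print(strict_pass_rate(runs))     # 1 of 4 runs -> 0.25
```

A single failing assertion is enough to fail a run, so a high pooled rate says little about how often the whole backend actually works.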

4. Frameworks Matter

Framework choice changed the results a lot. Agents performed best in minimalist frameworks like Flask or Express. These frameworks have fewer hidden rules and force the developer to be explicit.

Agents struggled more with convention-heavy frameworks like Django or FastAPI. These frameworks ask the developer to know where files belong, how type hints behave, and which defaults matter. The agents kept tripping over those details.

Why Agents Fail Here

The researchers also looked at the logs to see why the agents failed. A few patterns showed up repeatedly.

Data-Layer Defects

The largest source of failure was the database. Agents wrote SQL queries that looked right but were conceptually wrong. For Qwen3-Coder-Next, incorrect query logic made up 25.5% of logic errors, while database and ORM runtime errors added another 21.2%. MiniMax-M2.5 showed the same two categories at 15% each.

Using an ORM (a library designed to make database work easier) did not prevent these failures. Agents misused the library’s API, causing the server to crash the moment it tried to save a comment or fetch a user profile.
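As an illustration of a query that looks right but is conceptually wrong, consider counting favorites per article. This example is not taken from the paper’s logs. The table names and rows are hypothetical, and it uses Python’s built-in sqlite3 module so it runs without a database server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE articles (id INTEGER PRIMARY KEY, slug TEXT NOT NULL);
    CREATE TABLE favorites (article_id INTEGER NOT NULL, user_id INTEGER NOT NULL);
    INSERT INTO articles (id, slug) VALUES (1, 'popular'), (2, 'unloved');
    INSERT INTO favorites (article_id, user_id) VALUES (1, 10), (1, 11);
    """
)

# Looks right, but the inner join silently drops articles with no favorites.
plausible = conn.execute(
    """
    SELECT a.slug, COUNT(f.user_id)
    FROM articles a
    JOIN favorites f ON f.article_id = a.id
    GROUP BY a.id
    ORDER BY a.id
    """
).fetchall()

# A left join keeps every article and reports zero where there are no favorites.
correct = conn.execute(
    """
    SELECT a.slug, COUNT(f.user_id)
    FROM articles a
    LEFT JOIN favorites f ON f.article_id = a.id
    GROUP BY a.id
    ORDER BY a.id
    """
).fetchall()

print(plausible)  # [('popular', 2)]
print(correct)    # [('popular', 2), ('unloved', 0)]
```

Both queries run without error and return sensible-looking rows, which is why this class of bug tends to surface only when a test checks an article nobody has favorited.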

Framework Idiosyncrasies

Agents often missed framework defaults. One model failed repeatedly because it did not realize that Fastify, a Node.js framework, rejects empty-body POST requests by default. The code made sense in the abstract, but failed because the agent did not know that specific framework behavior.

The Constraint-Logic Trade-off

The most useful finding is about something resembling cognitive load. When an agent has to track whether a file belongs in the repository layer or the service layer, it seems to have less capacity left for the actual feature logic. The constraints are not just extra work. They appear to interfere with the agent’s reasoning about the core problem.

What This Means for Builders

For anyone building AI agents or using them in a real workflow, the durable lesson is simple: functionality is easier than structure.

1. Move Beyond Freestyle Benchmarks

If a test suite only checks whether a function returns the right value, it is not testing the real world. Teams also need to test structural compliance. Does the agent follow the team’s folder structure? Does it use the expected database patterns? Does it respect the dependency boundaries?

This research shows that an agent that looks smart in a sandbox can still be weak inside a real repo.
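One way to test a dependency boundary is to inspect imports rather than behavior. The sketch below is an assumption about how a team might do this, not a method from the paper. The forbidden package list and the function name forbidden_imports are hypothetical, and the check uses only Python’s standard ast module.

```python
import ast

# Hypothetical rule: route handlers may not import the data layer directly.
FORBIDDEN_IN_ROUTES = {"sqlalchemy", "sqlite3", "psycopg2"}


def forbidden_imports(source: str, forbidden: set[str]) -> list[str]:
    """Return the forbidden top-level packages imported anywhere in `source`."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root in forbidden:
                found.add(root)
    return sorted(found)


# Hypothetical route module that reaches past the service layer.
route_source = "import sqlite3\nfrom services.articles import ArticleService\n"
print(forbidden_imports(route_source, FORBIDDEN_IN_ROUTES))  # ['sqlite3']
```

A check like this can run next to the functional tests, so an agent’s output fails the build when it works but breaks the layering.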

2. Provide Structural Scaffolding

Since agents struggle with convention-heavy work, they should not be asked to build from a blank slate unless the goal is a prototype. A skeleton repository with folders, database connection, and example patterns already in place gives the agent less room to drift. The less structural thinking it has to do, the more room it has for feature logic.

3. Better Constraint-Aware Planning

Agents need to treat the “how” as part of the task, not as a style preference. One practical workflow is to make the agent produce an implementation plan first, check that plan against the repo’s structure, and only then let it write code.
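The plan check in that workflow can be mechanical. The following is a minimal sketch under stated assumptions: the agent emits its plan as a list of file paths tagged with a layer, and the repo maps each layer to one top-level folder. The folder names, the plan format, and the function plan_violations are all hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical layout: the top-level folder each kind of change belongs in.
ALLOWED_DIRS = {
    "route": "routes",
    "service": "services",
    "repository": "repositories",
}


def plan_violations(plan: list[dict[str, str]]) -> list[str]:
    """Return one message per planned file that sits in the wrong layer folder."""
    problems = []
    for step in plan:
        expected = ALLOWED_DIRS.get(step["layer"])
        parts = PurePosixPath(step["path"]).parts
        actual = parts[0] if parts else ""
        if expected is None:
            problems.append(f"{step['path']}: unknown layer {step['layer']!r}")
        elif actual != expected:
            problems.append(f"{step['path']}: {step['layer']} code belongs in {expected}/")
    return problems


# Hypothetical plan: the second step puts data access in the routes folder.
plan = [
    {"path": "routes/articles.py", "layer": "route"},
    {"path": "routes/article_queries.py", "layer": "repository"},
]
print(plan_violations(plan))
```

If the list comes back non-empty, the plan goes back to the agent with the messages attached before any code is written.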

What This Means for Buyers and Operators

For the business side, this paper is a reality check on the “AI-native company” story.

1. Agents Are Stronger at Prototypes Than Production

Right now, agents are much stronger at prototypes than production-grade backend work. They can spin up a working demo quickly if given freedom. Once that demo has to fit security rules, database schemas, and architectural patterns, the risk changes.

2. The Human Tax Remains High

Because the strongest complex greenfield run still had only 8.3% pass@1, senior engineering judgment still matters. The time saved by having the AI write code can disappear into fixing structural misunderstandings.

3. Choose Explicit Over Magical

If a team is designing software that AI agents will help maintain, explicitness matters. Hidden behavior, file-name conventions, and implicit configuration all make the agent’s job harder. The more visible the rules are, the easier they are to follow.

What to Watch Next

The next thing to watch is agent-native software design. Software has spent years optimizing for what makes sense to experienced humans. Django’s conventions are a good example: efficient once a developer knows them, confusing before that context exists. The industry may now need to learn what software looks like when agents are expected to read, modify, and maintain it too.

That could mean frameworks designed to be easier for LLMs to read. It could mean documentation that explains structural constraints, not just API calls. It could also mean repo-level agent guides that spell out how code should be organized before the agent starts editing.

The other thing to watch is “constraint-oriented pre-training.” The next generation of useful coding models may need more than exposure to a lot of code. They may need specific training on the relationships between architectural layers and database constraints.

Limitations and Caveats

This study focused specifically on backend web development in Python and Node.js. Those are important domains, but the results might be different for frontend development, mobile apps, data pipelines, or systems engineering.

The agent field also moves quickly. The tested models include strong 2026 systems, but better planning, retrieval over framework docs, and constraint-aware training could change the numbers.

Source

Dente, F., Satriani, D., & Papotti, P. (2026). Constraint Decay: The Fragility of LLM Agents in Backend Code Generation. arXiv preprint arXiv:2605.06445. Available at: https://arxiv.org/html/2605.06445