Let’s start with a scenario many developers can relate to.
You approved a pull request last week. Everything seemed in order—clean code, tests passed, no glaring issues. Yet just two days later, something unexpectedly went awry in production.
If you haven't experienced this yet, consider yourself fortunate. I’m seeing this pattern much too frequently, and it reveals a wider problem.
AI tools have undeniably sped up the coding process for many teams. However, they haven’t improved the ability of a codebase to safely incorporate that influx of new code. This discrepancy—a rapid production rate without an equivalent capacity to manage risk—is where many issues arise.
Spotting the Hidden Bugs
This problem is now known as the "illusion of correctness."
AI-generated code is polished on the surface. It adheres to syntax rules, compiles correctly, and passes tests. During code reviews, it could easily be mistaken for work done by a skilled engineer. Yet the underlying assumptions—those critical, often invisible nuances—can lead to problems.
Here are four common pitfalls that I've observed leading to production failures:
Boundary assumptions. The code assumes a field is always available. It isn’t. A downstream service might have suffered an outage months ago and now omits that field under specific circumstances. Everything looks fine in staging, yet a real-world transaction crashes the system at 2 AM.
Concurrency assumptions. You might trust that a particular API call is idempotent. What happens when it isn't? This can lead to double charges for customers—all because the code appeared flawless during the review process. The AI might have recognized a retry pattern without understanding the implications of calling that endpoint multiple times.
Domain assumptions. Assuming two order statuses are identical can be disastrous. Your fulfillment and billing teams could treat them differently, but that nuance remains unrecorded as a formal rule. The AI’s inability to grasp this stems from a lack of explicit guidance in the code.
Security assumptions. It's a risky proposition to assume a request from an internal service is inherently safe. Internal networks aren't infallible security barriers. This assumption often slips through the cracks during code reviews, gaining acceptance due to its "clean" presentation.
The irony is that the code compiles flawlessly, and it takes real-world traffic to expose the assumptions it was built on.
The danger lies in the fact that these weaknesses often do not surface during code reviews—they only become apparent through incidents.
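To make the first two pitfalls concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the `shipping_address` field, the `shipping_region` helper, and the in-memory `PaymentGateway` are hypothetical and stand in for whatever your own services expose. It shows a read that tolerates a missing field, and a charge call that is made safe to retry with an idempotency key.

```python
from dataclasses import dataclass, field
from typing import Optional


def shipping_region(order: dict) -> Optional[str]:
    """Read a nested field without assuming the upstream service always sends it."""
    address = order.get("shipping_address") or {}
    return address.get("region")  # None when absent, instead of a KeyError at 2 AM


@dataclass
class PaymentGateway:
    """Toy in-memory gateway that deduplicates charges by idempotency key."""
    charges: dict = field(default_factory=dict)

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # A retry with the same key returns the original result instead of charging again.
        if idempotency_key in self.charges:
            return self.charges[idempotency_key]
        result = {"key": idempotency_key, "amount_cents": amount_cents, "status": "captured"}
        self.charges[idempotency_key] = result
        return result


gateway = PaymentGateway()
first = gateway.charge("order-1001", 2500)
retry = gateway.charge("order-1001", 2500)  # simulated retry after a timeout
```

Neither guard is sophisticated. The point is that each one turns an invisible assumption into something a reviewer can see and a test can exercise.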
The Paradox of Speed
Here's a framework I've adopted with my teams.
Every system has a finite capacity for absorbing changes. This capacity encompasses aspects such as contracts, invariant test coverage, observability, and component coupling. When the rate of incoming changes exceeds a system's ability to absorb them, instability inevitably follows. It may take time to manifest, but it does happen reliably.
What many find surprising is that pushing harder on an overwhelmed system can slow down overall delivery speed. The gains made from hastily generated code can be eclipsed by the time lost in debugging, rolling back changes, and redoing work.
AI can amplify how quickly changes are generated, but refactoring is essential for ensuring a system can absorb those changes. The disparity between these two aspects represents your actual risk exposure.
Teams that excel in AI-enhanced development aren’t necessarily leveraging superior models. Instead, they've crafted an engineering environment capable of seamlessly integrating AI-generated changes without accruing hidden liabilities.
Refactoring: A Strategic Advantage
Many teams misunderstand refactoring. Typically, they view it through one of three lenses: as cleanup, as payment of technical debt, or as a recurring item on the roadmap.
These perceptions do little to support the goal of maintaining speed while ensuring safety.
The correct perspective is this: refactoring is instrumental in reducing the costs associated with change, enabling systems to handle more frequent and larger updates while minimizing the risk of hidden fragility. In an AI-driven context, refactoring becomes a catalyst for acceleration rather than a burden.
What continuous refactoring achieves is clear boundaries, enabling reliable change propagation; reduced coupling, preventing unexpected side effects; well-defined ownership to eliminate ambiguity; testable invariants, ensuring code reviews are more effective; and enhanced observability, allowing discrepancies to be identified before they reach customers.
The antithesis of this approach is to accelerate AI-powered delivery on a foundation of unresolved technical debt. This leads to quicker accumulation of inconsistencies, increased regressions in production, and overall velocity that can plummet as teams struggle to catch up during rework.
Four Essential Practices: CATS
I've implemented a framework called CATS within my teams and present it in my talks. This framework outlines four crucial practices that help maintain pace without sacrificing stability.
C — Contracts
Defining clear boundaries is essential. This includes API specifications, event schemas, data contracts, and statements of ownership.
Consider a situation I've witnessed multiple times. Three teams rely on a shared pricing service. With no official contract in place, only vague agreements exist—and those are seldom updated. An engineer leverages AI to modify the response structure, and while it looks tidy and passes tests, chaos erupts two days later as dependent teams experience failures due to altered or missing fields.
This is not a flaw of AI; it’s a matter of lacking a proper contract.
With a solid contract—a versioned API spec, an event schema detailing field definitions and ownership—the internal components can evolve without jeopardizing external interactions. AI-generated code is much more reliable when aligned with explicit contracts than when attempting to infer implicit norms.
Every time your team has said, "I thought that field was always supposed to be there," that’s a signal for a potential contract. Document it thoroughly—not just the structure but also its intended meaning, acceptable values, and the point of contact in case of issues.
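One way to document a contract so that it is more than a wiki page is to express it as data. The sketch below is an illustration under assumptions, not a prescribed format: the `FieldSpec` and `Contract` classes, the `pricing.quote` name, the field names, and the owner address are all hypothetical. It records structure, meaning, acceptable values, and a point of contact in one place, and it can check a payload against them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldSpec:
    type_: type
    meaning: str                          # what the field signifies, not just its shape
    required: bool = True
    allowed: Optional[frozenset] = None   # acceptable values, if the set is closed


@dataclass(frozen=True)
class Contract:
    name: str
    version: str
    owner: str    # who to contact when something breaks
    fields: dict  # field name -> FieldSpec

    def violations(self, payload: dict) -> list:
        """Return human-readable problems; an empty list means the payload conforms."""
        problems = []
        for name, spec in self.fields.items():
            if name not in payload:
                if spec.required:
                    problems.append(f"missing required field: {name}")
                continue
            value = payload[name]
            if not isinstance(value, spec.type_):
                problems.append(
                    f"{name}: expected {spec.type_.__name__}, got {type(value).__name__}"
                )
            elif spec.allowed is not None and value not in spec.allowed:
                problems.append(f"{name}: value {value!r} not in allowed set")
        return problems


# Hypothetical contract for the shared pricing service described above.
PRICE_CONTRACT = Contract(
    name="pricing.quote",
    version="1.2.0",
    owner="pricing-team@example.com",
    fields={
        "sku": FieldSpec(str, "Catalog identifier of the priced item"),
        "amount_cents": FieldSpec(int, "Final price in minor currency units"),
        "currency": FieldSpec(str, "Currency code", allowed=frozenset({"USD", "EUR"})),
    },
)
```

In a real system you would more likely reach for an existing schema format, but even this much gives AI-generated changes something explicit to be checked against.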
A — Automated Verification
Implement tests that enforce domain invariants instead of just providing happy-path coverage. Incorporate schema validation within continuous integration processes, along with security checks in the deployment pipeline.
While AI is adept at generating test code, it struggles with determining which domain rules to validate since these nuances often reside within post-mortem analyses and team knowledge—not the codebase itself.
A common issue arises when teams generate a test suite using AI, achieving seemingly impressive coverage stats. However, this coverage often reflects only the scenarios that the AI can logically infer from the code rather than covering the edge cases that could lead to production failures.
Your role is to clearly articulate the invariants. AI can then focus on ensuring these are thoroughly tested. Schema validation in CI catches discrepancies during merges rather than allowing them to manifest in production, while automated security checks reveal weaknesses hidden behind assumed safety when requests come from internal sources.
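As an illustration of the difference between happy-path coverage and an invariant test, consider a rule a team might articulate as "refunds never exceed the captured amount." The rule, the `apply_refund` function, and the `RefundError` type are hypothetical examples, not taken from any particular system. The test below does not check one scenario; it throws randomized sequences of requests at the code and asserts that the rule holds after every step.

```python
import random


class RefundError(ValueError):
    """Raised when a refund request would violate a domain rule."""


def apply_refund(captured_cents: int, refunded_cents: int, request_cents: int) -> int:
    """Return the new refunded total, rejecting anything that breaks the invariant."""
    if request_cents <= 0:
        raise RefundError("refund must be positive")
    if refunded_cents + request_cents > captured_cents:
        raise RefundError("refund would exceed captured amount")
    return refunded_cents + request_cents


def test_refunds_never_exceed_capture(trials: int = 1000, seed: int = 7) -> None:
    """Invariant: 0 <= refunded <= captured after any sequence of requests."""
    rng = random.Random(seed)  # fixed seed keeps CI runs reproducible
    for _ in range(trials):
        captured = rng.randint(0, 10_000)
        refunded = 0
        for _ in range(rng.randint(1, 10)):
            request = rng.randint(-100, 6_000)
            try:
                refunded = apply_refund(captured, refunded, request)
            except RefundError:
                pass  # a rejected request leaves the running total unchanged
            assert 0 <= refunded <= captured
```

The human contribution here is the one-line statement of the rule. Generating the surrounding test scaffolding is exactly the kind of work AI handles well once that rule is written down.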
T — Telemetry
Utilize logs, metrics, and traces that reveal what's genuinely occurring, rather than merely reflecting assumptions based on the code.
Code reviews elucidate what the code is meant to do; telemetry uncovers what it actually does. These two perspectives are often misaligned, especially when merging numerous PRs at a fast pace.
For instance, a team may deploy a revised order processing workflow. Reviews check out, and load tests appear fine. Yet a minor adjustment in how null values are handled causes specific edge-case orders to fail silently: no error messages, just incorrect states. Without proper alerts on order state transitions, the team discovers the issue only after receiving customer complaints.
With effective drift detection, that problem could be caught when it is still a 0.3% error rate, rather than when someone asks, "why did revenue drop on Thursday?"
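A rough sketch of what drift detection on order state transitions might look like follows. The state names, the set of allowed transitions, and the `TransitionDriftMonitor` class are assumptions made for illustration; the 0.3% threshold simply echoes the figure above. The monitor keeps a sliding window of recent transitions and flags when the share of unexpected ones reaches the threshold.

```python
from collections import deque

# Hypothetical state machine; replace with the transitions your domain allows.
ALLOWED_TRANSITIONS = {
    ("created", "paid"),
    ("paid", "shipped"),
    ("shipped", "delivered"),
    ("created", "cancelled"),
    ("paid", "cancelled"),
}


class TransitionDriftMonitor:
    """Tracks the rate of unexpected state transitions over a sliding window."""

    def __init__(self, window: int = 1000, threshold: float = 0.003):
        self.threshold = threshold
        self.recent = deque(maxlen=window)  # True marks an unexpected transition

    def record(self, old_state, new_state) -> None:
        self.recent.append((old_state, new_state) not in ALLOWED_TRANSITIONS)

    def error_rate(self) -> float:
        return sum(self.recent) / len(self.recent) if self.recent else 0.0

    def should_alert(self) -> bool:
        return self.error_rate() >= self.threshold
```

A silent failure such as an order moving from "paid" to a null state raises no exception, but it does show up here as a transition nobody declared legal.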
Beyond simply identifying issues, implement feature flags, canary thresholds, and a practical rollback checklist that one person can follow late at night without convening a group discussion. If a rollback requires a conference call, your operations won’t withstand the pressures of AI-speed deployments.
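A canary threshold is easiest to follow at a late hour when it is written down as a decision rule rather than left to judgment. The following is a sketch under assumptions: the `CanaryPolicy` fields and their default values are hypothetical and would need tuning for a real service. It returns one of three outcomes so that the person on call does not have to debate what the numbers mean.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryPolicy:
    max_error_rate: float = 0.003        # absolute ceiling for the canary
    max_relative_increase: float = 2.0   # canary may be at most 2x the baseline rate
    min_requests: int = 500              # do not decide on too little traffic


def canary_decision(
    policy: CanaryPolicy,
    baseline_errors: int,
    baseline_requests: int,
    canary_errors: int,
    canary_requests: int,
) -> str:
    """Return 'wait', 'rollback', or 'promote' for the current canary window."""
    if canary_requests < policy.min_requests:
        return "wait"
    canary_rate = canary_errors / canary_requests
    baseline_rate = baseline_errors / baseline_requests if baseline_requests else 0.0
    if canary_rate > policy.max_error_rate:
        return "rollback"
    if baseline_rate > 0 and canary_rate > baseline_rate * policy.max_relative_increase:
        return "rollback"
    return "promote"
```

Paired with a feature flag that can be switched off by one person, a rule like this keeps the rollback path short enough to use under pressure.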
S — Simplification
Commit to the continuous reduction of hidden coupling and vague ownership—not as a separate project but as an integrated aspect of feature development.
If refactoring requires a discussion and scheduling, it likely won’t happen. Teams that succeed incorporate refactoring into their feature work, improving the code in the same workflow where changes occur, minimizing extra coordination costs.
AI can assist here by identifying redundancies and suggesting potential contracts, but validation against domain knowledge is essential. AI can reveal patterns, but you know if its suggestions genuinely align with your system’s needs.
And it’s crucial to measure relevant metrics—not merely the aesthetic cleanliness of the code or the number of lines revised. Focus on whether it’s becoming less costly to implement changes over time, as this will indicate if simplification efforts are genuinely effective.
What This Looks Like in Practice
Let’s look at two contrasting scenarios.
Without CATS: Imagine AI generates a service. The pull request looks excellent, but there’s no formal API contract. Downstream teams proceed based on assumptions. Months later, when someone alters the response structure, two teams receive alerts around the same time on a Friday night. The post-mortem reveals "communication breakdown," yet the real culprit is the lack of a contract.
With CATS: AI produces the same service. Prior to merging, a contract is documented—including the API spec, field meanings, ownership, and versioning. Schema validation is included in CI. When a refactor occurs months later, the contract version is updated, and any discrepancies are caught in CI failures before reaching production.
The latter approach doesn’t slow down the process. With a contract in place, every subsequent change happens more quickly because the blast radius of alterations is clearly defined and documented. That investment pays dividends for all future pull requests.
Fast without CATS: speed breeds fragility. Fast with CATS: speed multiplies.
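The CI check in the second scenario can be sketched in a few lines. This is an illustration only: the schemas are simplified to a mapping of field name to type name, the field names are hypothetical, and the rule that a breaking change requires a major version bump is an assumption borrowed from semantic versioning rather than something the scenario prescribes.

```python
def breaking_changes(old: dict, new: dict) -> list:
    """Compare two {field_name: type_name} schemas from a consumer's point of view."""
    problems = []
    for name, type_name in old.items():
        if name not in new:
            problems.append(f"removed field: {name}")
        elif new[name] != type_name:
            problems.append(f"changed type of {name}: {type_name} -> {new[name]}")
    return problems  # added fields are not listed because existing consumers ignore them


def check_contract(old: dict, new: dict, old_version: str, new_version: str) -> bool:
    """Pass if nothing broke, or if the major version was bumped to signal the break."""
    old_major = int(old_version.split(".")[0])
    new_major = int(new_version.split(".")[0])
    return not breaking_changes(old, new) or new_major > old_major


# Hypothetical schema revisions for the same service.
V1 = {"sku": "string", "amount_cents": "integer"}
V2_ADDITIVE = {"sku": "string", "amount_cents": "integer", "currency": "string"}
V2_BREAKING = {"sku": "string", "amount": "number"}
```

Run as a merge gate, a check like this turns the Friday-night alert in the first scenario into a failed build on the pull request that caused it.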
A Two-Week Sprint to Get Started (Without Pausing Features)
This initiative needn’t be a drawn-out effort. Here’s how a focused, two-week sprint looks—without halting ongoing development.
Week 1: Contracts and Safety
Identify two or three weak points. Where are failures most frequent? Where have team members expressed, "I thought that was always the case?" These are your candidates for formal contracts.
Document the contracts. Go beyond the structure; detail what each field signifies. What values are acceptable? Who is responsible for it? Who do you contact if an error occurs?
Incorporate contract verification in CI. Implement schema validation as a merging prerequisite.
Create one invariant test. Select the domain rule that would cause the most harm if violated. Write just that one test rather than tackling the entire backlog.
Week 2: Observability and Simplification
Implement drift detection dashboards and alerts. Monitor the failure modes identified in week one. Be proactive—know when issues arise at the 0.3% error rate, rather than reacting to customer complaints.
Eliminate one significant coupling point. Identify a shared dependency that causes widespread disruptions upon changes. Refactor it and define ownership clearly.
Add safe rollout protocols. Develop templates for feature flags, canary thresholds, and a practical rollback checklist that can be followed without assembling a team conference call.
Measure performance. Track PR sizes, incidents per change, and coordination overhead. Establish baseline measurements now and review at week four.
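For the measurement step, a baseline does not need a dashboard on day one. The sketch below assumes you can export one record per merged change; the `Change` fields, including the use of teams involved as a rough proxy for coordination overhead, are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Change:
    lines_changed: int
    caused_incident: bool
    teams_involved: int  # rough proxy for coordination overhead


def baseline(changes: list) -> dict:
    """Summarize PR size, incidents per change, and coordination cost for a period."""
    if not changes:
        return {"median_pr_size": 0, "incidents_per_change": 0.0, "avg_teams_per_change": 0.0}
    return {
        "median_pr_size": median(c.lines_changed for c in changes),
        "incidents_per_change": sum(c.caused_incident for c in changes) / len(changes),
        "avg_teams_per_change": sum(c.teams_involved for c in changes) / len(changes),
    }
```

Capturing these numbers now and again at week four gives you a before-and-after comparison instead of an impression.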
This approach won’t resolve every issue but will yield measurable risk reductions in just two weeks. Moreover, it initiates the development of habits that ensure fast delivery remains sustainable.
The Shift That Matters
The platform engineering sector has long been working on better resources—internal developer platforms, golden paths, service meshes, standardized stacks for observability. All of these tools presume that teams can operate at high speeds without the system becoming more fragile over time.
The acceleration introduced by AI has dramatically enhanced the speed component of this equation. However, the safety aspect hasn’t evolved in tandem.
Organizations effectively managing the AI landscape share several characteristics. Contracts are prioritized as essential artifacts, not treated as an afterthought. Testing domain invariants is recognized as a distinct practice, not simply a metric. Observability genuinely reflects operational realities rather than just code expectations. Lastly, refactoring is an ongoing process integrated into feature cycles rather than relegated to a backlog project.
None of these practices are new, but their importance is now amplified. Without them, high-speed AI-assisted development risks becoming unsustainable: moving fast for a brief period before stumbling under the weight of hidden debt.
Closing Thoughts
The AI era rewards speed, but it also exposes fragility faster than we have learned to handle, because that fragility can now accumulate more rapidly than ever before.
The teams that will thrive won't be the ones simply generating the most code but those that have established systems capable of integrating AI-generated changes safely—through solid contracts, automated verification, effective telemetry, and ongoing simplification efforts.
If your team is currently operating at high speeds, the real question is whether you have already built up enough invisible debt that your velocity is starting to decline.
As you prepare for Monday, take a moment to identify one brittle boundary, document its contract, and draft one invariant test.
That's your starting point.