A team of engineers used vibe coding tools to build a production platform. Eighty percent of the application was functionally complete within the first week (screens built, flows wired, enough to demo). The remaining twenty percent (production readiness for an MVP) took the rest of the month.
Over time, security failures, a lack of structural integrity, and difficulty maintaining the system, among other things, led to compounding technical debt and a long-term total cost of ownership that climbed well past the cost of the initial build.
The gap between a working demo and a production-ready system did not surface at launch. That gap is the whole argument.
The question "Will AI replace software engineers?" has generated an enormous amount of noise and very little signal. The answer, "No, not entirely", gets buried under either breathless enthusiasm or defensive dismissal. Neither framing is useful.
The more productive question is: "What does AI actually contribute to software development, and where does that contribution run dry?"
The Framing Problem
The confusion stems partly from treating "AI" as a monolith. There is a meaningful difference between AI coding assistants like GitHub Copilot, Claude Code or Cursor autocompleting a function and a vibe coding platform like Lovable generating entire frontend screens from a natural language prompt. Both are AI tools in software development, but they operate at different levels of abstraction and carry different failure modes.
The other source of confusion is conflating what AI can do in a narrow benchmark with what it should do in a production system. A model that generates clean, functional code in isolation is not the same as a system that produces maintainable, secure, architecturally coherent software at scale. The gap between these two things is where engineering judgment lives, and it is a big gap.
What is happening in the industry is not replacement. It is compression. AI tools are compressing certain parts of the development workflow, particularly the coding labor, while expanding the surface area of what engineers are expected to oversee and orchestrate. According to Mubashir Ali, Director of Engineering at Levitate Data, this represents the third major shift in what software engineering actually means: from coding as the primary skill, to data and model training in the neural network era, to orchestration now (managing which models to use, how to chain them, and how to verify their outputs).
What AI Is Actually Good At
To be concrete: AI tools are genuinely effective at generating boilerplate, completing repetitive patterns, translating well-defined requirements into working code, and accelerating time-to-prototype significantly.
A 2025 controlled study comparing copilot-style tools against agentic coding tools like Claude Code found that agents improved task correctness by 35% and reduced hands-on user effort by roughly 50% compared to traditional coding assistants. The gains were sharpest on isolated, well-scoped tasks where the input and expected output were already clear.
The 80 percent in that opening example is real. For many applications, these tools can scaffold a working demo in a fraction of the time it would take a human engineer working alone. For founders validating ideas, internal tooling with low stakes, or rapid iteration on a concept, that speed is genuinely useful and should not be dismissed.
Where the Seams Show
"The 20 percent is where things get complicated, and it is a disproportionately expensive 20 percent."
Structural integrity
Vibe coding tools tend to generate isolated pages and components with no shared state, no reusable patterns, and no consistent data architecture. Each section gets built as if the rest of the application does not exist. That works for a demo, but it creates significant maintenance debt in a production system.
This has emerged as a pattern in independent research. A 2025 difference-in-differences study of Cursor's impact across GitHub projects found a significant, large, but transient increase in development velocity, followed by a significant and persistent increase in static analysis warnings and code complexity. The researchers found that the rise in code complexity acted as a major factor causing long-term velocity slowdown. AI tools optimize for generating code that looks correct at the point of creation, not code that holds its shape when integrated with the surrounding system.
A separate 2025 randomized controlled trial by METR added further nuance: when experienced developers worked on complex, large-scale codebases using frontier AI tools (primarily Cursor Pro with Claude Sonnet), they actually took 19% longer to complete tasks than without AI assistance. The productivity gains that dominate the headlines are real, but they are concentrated in well-defined, lower-complexity work. The more ambiguous and architecturally involved the task, the more the gains compress.
Confidence without correctness
LLMs hallucinate, confidently and convincingly. A model that returns wrong output with high confidence is worse than one that signals uncertainty, because the failure is invisible until it surfaces in production. This is a structural characteristic of how these models work, not a deficiency that better prompting resolves. It is also a vulnerability that attackers have started taking advantage of.
A comprehensive academic study published in May 2025, titled "We Have a Package for You! A Comprehensive Analysis of Package Hallucinations by Code Generating LLMs," analyzed 16 popular code-generating AI models across Python and JavaScript ecosystems. The researchers found that 19.7% of the packages recommended by these LLMs did not actually exist, providing an enormous attack surface for malicious actors.
That attack surface was demonstrably exploited in May 2026. The Mini Shai-Hulud supply chain campaign, attributed to threat actor TeamPCP, showed what exploitation of supply chain weaknesses looks like at scale:
- 323 or more npm packages compromised in a single 22-minute automated wave, with malicious payloads injected directly into trusted source files rather than isolated install scripts, making detection substantially harder.
- The campaign reached directly into AI development tooling. The MistralAI Python package on PyPI (version 2.4.6) was found to contain malicious code that downloaded and executed malware on Linux systems, disguised to resemble Hugging Face tooling. An AI coding assistant recommending that SDK for an ML project would have done so with complete confidence and no signal that the package had been tampered with.
- Once installed, the malware specifically harvested VS Code configuration files and stored GitHub tokens. Those tokens were then used to push malicious commits to the developer's own repositories and publish poisoned package updates under their credentials, propagating the infection laterally through their entire GitHub footprint without any further action required.
- OpenAI publicly acknowledged exposure to the related TanStack wave of the same campaign.
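One way to blunt both failure modes above (hallucinated package names and tampered releases) is to refuse any dependency that a human has not reviewed at an exact version. The following is a minimal sketch of that idea, not a description of any particular tool: the `REVIEWED` allowlist, the `Verdict` type, and `check_requirement` are hypothetical names, and the packages and versions listed are illustrative only.

```python
from dataclasses import dataclass

# Hypothetical allowlist: package name -> versions a human has reviewed.
# The entries below are illustrative, not a recommendation.
REVIEWED = {
    "requests": {"2.32.3"},
    "numpy": {"2.1.0"},
}


@dataclass(frozen=True)
class Verdict:
    requirement: str
    allowed: bool
    reason: str


def check_requirement(line: str, reviewed=REVIEWED) -> Verdict:
    """Accept only exact pins (name==version) present in the reviewed allowlist."""
    name, sep, version = line.strip().partition("==")
    name = name.strip().lower()
    version = version.strip()
    if not sep or not version:
        # An unpinned requirement could resolve to a poisoned release.
        return Verdict(line, False, "not pinned to an exact version")
    if name not in reviewed:
        # Covers package names an assistant invented or that nobody vetted.
        return Verdict(line, False, "package not in reviewed allowlist")
    if version not in reviewed[name]:
        # Covers a known package at a release nobody has looked at.
        return Verdict(line, False, "version not reviewed")
    return Verdict(line, True, "ok")
```

A check like this does not detect malware. It only ensures that a package name or version suggested by an assistant cannot reach an install step without someone having looked at it first.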
Application security
Experienced engineers are not necessarily security specialists, but they know what patterns to flag: common injection vectors, authentication edge cases, data exposure risks. Vibe coding tools skip this reasoning almost entirely. The output looks correct; the attack surface often is not visible in the generated code.
In October 2025, Veracode tested over 100 LLMs across Java, JavaScript, Python, and C# on tasks where the requested functionality could be implemented securely or insecurely.
Result: AI-generated code introduced an OWASP Top 10 security vulnerability in 45% of tests.
The more pointed finding is that security performance is largely flat across model size and generation. Newer, larger models did not produce meaningfully more secure code than smaller, older ones. The code ran correctly. The vulnerabilities were invisible to the tool that wrote the code.
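To make "implemented securely or insecurely" concrete, here is a small illustration using Python's standard `sqlite3` module. It is not drawn from Veracode's test suite; the table, the function names, and the data are assumptions made for the example. Both functions return the same result for ordinary input, which is why the insecure one passes a casual check.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database with two illustrative users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn


def find_user_unsafe(conn: sqlite3.Connection, name: str) -> list:
    # Vulnerable: user input is spliced directly into the SQL text,
    # so a crafted name can change the meaning of the query.
    query = f"SELECT name, secret FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn: sqlite3.Connection, name: str) -> list:
    # Parameterized: the driver passes the value separately from the SQL,
    # so the input is always treated as data.
    query = "SELECT name, secret FROM users WHERE name = ?"
    return conn.execute(query, (name,)).fetchall()
```

With a name such as `x' OR '1'='1`, the unsafe version returns every row in the table, while the parameterized version returns nothing. For the name `alice`, the two behave identically.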
Model selection
It is worth noting that tools like GitHub Copilot are model pickers, not model creators. They give engineers access to a range of frontier models (Claude, GPT, Gemini, and others), each with different strengths. Choosing which model to use for which task, and knowing when to distrust its output, is itself an engineering judgment call that requires understanding the problem domain, not just the tool.
Pooya Golchian wrote about this in a blog post earlier in May. A founder's team had spent three months on an AI feature that was producing wrong answers two times out of ten. The assumption was that the technology was the problem. When asked whether anyone on the team understood the difference between Sonnet and Opus for the kind of reasoning the feature required, the answer was: "They had picked Sonnet because it was cheaper." Within two weeks of implementing a model-routing layer (Haiku for classification and extraction, Sonnet for general reasoning, Opus for architectural decisions), accuracy on the same test set went from 80% to 97%. The technology never changed. The routing did.
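The routing in that anecdote can be sketched as a simple lookup from task type to model tier. This is an assumption about the general shape of such a layer, not the team's actual implementation: the `Task` enum and `route` function are hypothetical, and the strings are tier labels taken from the anecdote rather than real API model identifiers.

```python
from enum import Enum


class Task(Enum):
    CLASSIFICATION = "classification"
    EXTRACTION = "extraction"
    GENERAL_REASONING = "general_reasoning"
    ARCHITECTURE = "architecture"


# Tier labels from the anecdote; a real system would map these
# to whatever model identifiers its provider exposes.
ROUTES = {
    Task.CLASSIFICATION: "haiku",
    Task.EXTRACTION: "haiku",
    Task.GENERAL_REASONING: "sonnet",
    Task.ARCHITECTURE: "opus",
}


def route(task: Task) -> str:
    """Return the model tier assigned to a task type."""
    return ROUTES[task]
```

The table is trivial. The engineering work lies in deciding which task types exist, which tier each one needs, and how to verify that the assignment improves accuracy on a fixed test set.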
What Engineers Actually Own
The decisions that determine whether a system is built correctly (architecture, technology selection, deployment strategy, requirement analysis, and product design) are still engineering responsibilities that no current AI tool can own. These are not tasks that require writing code. They require judgment about constraints, tradeoffs, organizational context, and what is going to hold up when real users show up with real load.
Talha Turab, a Senior ML Engineer at Levitate Data, points to something instructive about the nature of engineering work that AI cannot replicate: the information flow inside a development team is non-linear. A client requirement passes through a PM, gets routed to the right engineer, escalates when it hits a constraint, loops back with feedback. That process is stateful across time and deeply contextual. LLMs operate within a fixed context window with no persistent state across that kind of multi-step organizational chain. They can assist with pieces of it. They cannot run it.
Talha also observes the reduction in team size that AI enables.
"Where a project might have required four engineers, one or two strong engineers with AI assistance can cover the same ground. That is real compression. But compression is not elimination. It is a change in the leverage ratio, and it makes the quality of each remaining engineer more consequential, not less."
The Hybrid Model
What does effective human-AI collaboration actually look like at the production level?
The approach that worked for Levitate on a recent frontend project offers a practical model. Lovable was used to generate high-level UI components quickly. Those components were then integrated into a properly structured application by engineers who handled data communication, state management, and architectural coherence. The AI handled the generation labor. Engineers handled the connective tissue, arguably the most important part of software development and where most production systems fail.
AI tools can produce code and increasingly augment the act of writing software. What they cannot be trusted with is quality engineering at a macro scale. A quality engineer operating as an orchestrator brings something no model currently carries: domain and subject-matter expertise, knowledge of product and business roadmaps, and the accumulated context from conversations with users, investors, and stakeholders across months of a project's life. Rewriting large chunks of a codebase at each AI-assisted step without that context in view is a maintenance nightmare. Production incidents climb. Debugging becomes harder. And the consequences extend well beyond the specific service being built.
Engineer vs. AI roles by phase
| SW Engineering Phase | The Engineer's Role (The Architect & Decision Maker) | The AI Assistant's Role (The Accelerator) |
|---|---|---|
| Requirements & Design | | |
| Coding & Implementation | | |
| Testing & QA | | |
| Debugging & Maintenance | | |
| Security & Compliance | | |
The Cost Does Not Disappear. It Shifts
A pattern worth tracking as AI becomes embedded in engineering workflows is that the cost reduction is real, but the cost itself does not vanish. It relocates.
Salesforce CEO Marc Benioff said on the All-In podcast that Salesforce expects to spend roughly $300 million on Anthropic tokens in 2026, almost entirely on coding. The company paused engineering hiring after improving engineering productivity by more than 30% through AI systems. The headcount savings are real. So is the token bill that replaced them.
This shift matters for how engineering teams think about AI integration. The question is not just "How many engineers can we do without?" It is "What does AI spend replace, and what does it unlock?" The teams that treat this as pure cost reduction will miss the compounding return: smaller, more focused engineering teams using AI to operate and compete at a scale previously reserved for organizations ten times their size.
A Practical Framework
Audit your workflow by layer
The places where AI tools work well (isolated generation, pattern completion, component scaffolding) and the places where they fail (integration, security review, architectural decisions) are predictable. Map your development workflow against these categories and identify where AI acceleration actually fits, versus where human judgment is required regardless of what the tool produces.
Budget the 20 percent explicitly
If you are using vibe coding tools to reach a demo or prototype, plan for the engineering work that follows before committing to a production timeline. The structural debt is consistent and substantial. Teams that treat the 80 percent as "basically done" routinely get burned by the 20 percent that remains.
Treat model selection as an infrastructure decision
The choice of which AI model or tool to apply to which task, including the ongoing judgment about when to trust its output, is not a default setting. It requires understanding the domain, the model's known failure modes, and the cost of a wrong answer in that context. Apply the same rigor you would to any other infrastructure decision.
Redefine what engineering means on your team
As coding labor gets absorbed by AI tools, the premium on engineering judgment moves upstream. Orchestration, verification, architectural thinking, and the ability to distinguish good output from plausible-looking output are the skills that compound. Teams that adapt their engineering roles to reflect this will be more effective than those still optimizing around raw code production.
What the Transition Actually Requires
No engineer who has worked with these tools at the production level believes AI is close to owning the full development lifecycle. What is happening is more interesting than replacement: a compression of the labor-intensive layer of development, and a corresponding expansion of what engineering judgment is responsible for.
AI amplifies skills. It does not replace them.
Put high-quality engineering judgment behind the tooling, and the leverage is real. Remove it, and what looks like speed becomes fragility on a delayed fuse.