Why 95% of Enterprise AI Delivers Nothing - and How the 5% Measure Differently

Enterprise AI fails because companies measure the wrong thing. They count pilots launched, demos shipped, and "AI users" activated. The 5% that win count something different: did the tool make the humans using it measurably faster at real work? Get that question right and everything else - what to buy, what to build, what to kill - follows from it.

The 95% problem is a measurement problem

MIT's Project NANDA published "The GenAI Divide: State of AI in Business 2025" in July 2025, drawing on 150 executive interviews, a survey of 350 employees, and an analysis of 300 public AI deployments. The headline finding: roughly 95% of enterprise generative-AI pilots deliver no measurable P&L impact. After an estimated $30-40 billion in enterprise spend, nineteen out of twenty programs cannot point to a dollar.

That is worth sitting with. Not because AI is overhyped across the board - the 5% that work are genuinely compounding - but because the 95% figure is not random noise. MIT NANDA identified the core barrier as learning, not infrastructure, regulation, or talent. In the report's words: "Most GenAI systems do not retain feedback, adapt to context, or improve over time." In other words, the tools are stateless, and the organizations adopting them are measuring inputs (seats, integrations, experiments launched) rather than outputs (time recaptured, mistakes caught, decisions made faster). You get what you measure.

The organizations inside that 5% share a habit: they instrument the work itself. They ask whether the person who used the tool finished in less time, went through fewer re-work loops, or caught a blocker before it cost a sprint. Vanity metrics do not get them to that question. Productivity metrics do.

Vanity metric                     | Usefulness metric
Pilots launched                   | Pilots with measured P&L delta
AI seats activated                | Tasks completed faster with AI vs. without
Features shipped with AI assist   | Re-work loops eliminated
Demos run for leadership          | Time-to-onboard for new hire or new agent
"Users" (opened the tool)         | Users who would slow down if you removed it

The column on the right is harder to fill in. That is the point.

Feeling fast is not the same as being fast

In July 2025, METR - an AI safety and evaluation organization - published a randomized controlled trial of AI coding tools on experienced open-source developers (arXiv:2507.09089). Sixteen developers with an average of five years of experience on their own repositories completed 246 tasks, each randomly assigned to either allow or disallow AI use. When AI tools were allowed (primarily Cursor Pro with Claude 3.5/3.7 Sonnet), the developers took 19% longer than without them.

Here is the part that matters for enterprise AI strategy: before starting, those same developers predicted AI would make them 24% faster. And after completing the tasks - after experiencing the slowdown firsthand - they still estimated the AI had sped them up by around 20%.

The perception was wrong in the optimistic direction even after direct experience. This is not a story about bad developers or bad tools. It is a story about how hard it is to measure your own productivity while doing the work. The developers were doing more things - more context switching, more AI-generated code to review, more friction in the verification loop - and interpreting the activity as speed. The wall clock told a different story.

METR is careful about scope: the study covered a specific cohort of tasks and tools at a specific point in early 2025, and the authors explicitly note the findings may not generalize to all settings. But the finding's implication for anyone buying or building enterprise AI is hard to dismiss: subjective confidence is not a measurement. It is a starting point for getting a measurement.
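The gap between the stopwatch and the self-report can be written down as plain arithmetic. The sketch below is illustrative only: the function names are invented for this example, it is not METR's estimation method, and it simply takes the three figures quoted above (24% predicted speedup, roughly 20% perceived speedup, 19% measured slowdown) expressed as changes in completion time, where negative means faster.

```python
def perception_gap(measured_change_pct: float, perceived_change_pct: float) -> float:
    """Percentage points between measured and self-reported change in completion time.

    Both inputs are changes in completion time: negative = faster, positive = slower.
    """
    return measured_change_pct - perceived_change_pct


def wrong_direction(measured_change_pct: float, perceived_change_pct: float) -> bool:
    """True when people believe they got faster but the clock says slower (or vice versa)."""
    return measured_change_pct * perceived_change_pct < 0


# Figures quoted in the text above, as changes in completion time.
MEASURED = 19.0      # tasks took 19% longer with AI allowed
PREDICTED = -24.0    # developers expected to be 24% faster beforehand
PERCEIVED = -20.0    # developers believed they were about 20% faster afterwards

print(perception_gap(MEASURED, PERCEIVED))   # 39.0
print(wrong_direction(MEASURED, PERCEIVED))  # True
```

A gap of that size in the wrong direction is why a survey alone cannot stand in for a timed comparison.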

How serious teams are starting to define usefulness

The teams building AI-native products understand this acutely, because their livelihood depends on the tool actually working in the real world.

Cognition - the company behind Devin, the AI software engineer - published a 2025 annual performance review based on real customer data. The framing is instructive: not SWE-bench scores or lines of code generated, but concrete time deltas per task. One large organization saved 5-10% of total developer time by running Devin on security fixes. Another saw a 20x efficiency gain on vulnerability work - 30 minutes per fix for human developers, 1.5 minutes for Devin. A Java migration ran 14x faster. Cognition also tracks the merge rate of Devin's pull requests as a primary signal: 67% of Devin's PRs were merged in late 2025, up from 34% a year earlier. Merged PRs are usefulness. Opened PRs that get closed are activity.

None of those metrics are easy to manipulate with good demos. You either shaved the time or you did not. The merge rate either went up or it did not.
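Both signals reduce to simple ratios, which is part of why they are hard to inflate. A minimal sketch, assuming hypothetical PR counts (the source gives only percentages, not raw counts) and illustrative function names that are not Cognition's:

```python
def merge_rate(merged_prs: int, opened_prs: int) -> float:
    """Share of opened pull requests that were actually merged."""
    if opened_prs <= 0:
        raise ValueError("need at least one opened PR to compute a merge rate")
    return merged_prs / opened_prs


def speedup_factor(baseline_minutes: float, tool_minutes: float) -> float:
    """How many times faster the task is with the tool than without it."""
    if tool_minutes <= 0:
        raise ValueError("tool_minutes must be positive")
    return baseline_minutes / tool_minutes


# Hypothetical counts chosen only to reproduce the quoted 67% merge rate.
print(merge_rate(merged_prs=67, opened_prs=100))  # 0.67

# Per-fix times quoted above: 30 minutes by hand, 1.5 minutes with the tool.
print(speedup_factor(baseline_minutes=30, tool_minutes=1.5))  # 20.0
```

The denominator matters: counting opened PRs measures activity, while dividing merged by opened measures whether the work survived review.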

That is the pattern across teams doing this right. They translate "AI is helping" into a measurable proxy that cannot be gamed by enthusiasm - time on task, re-work rate, cycle time per workstream, decisions that did not get re-litigated. The measurement varies by use case. The discipline of measuring does not.

Why context is where enterprise AI loses the most time

MIT NANDA's core finding about learning is more specific than it sounds. The systems that fail are the ones that start from zero every session. A developer asks the AI to help with a pull request; the AI has no idea what the team decided in March, why the current caching layer exists, or which approach was tried and abandoned last quarter. So it produces competent-looking code that violates a constraint nobody thought to paste into the prompt. The developer reviews it, catches what they catch, and ships something that is subtly wrong in a way a test will not find.

That is not a model problem. That is a context problem. The model is doing what it can with what it has. What it has is not the team's knowledge - it is a snapshot of one conversation.

The re-work from this failure mode is hard to see in aggregate because it is distributed across hundreds of small moments: the decision that got quietly reversed, the pattern re-introduced after the team spent two weeks pulling it out, the blocker a new hire hit on day three that the team solved eighteen months ago and never wrote down anywhere callable. Each one is a small tax. Across a team of twenty running agents alongside humans, the tax is constant and invisible - not a pilot failure, just a team that is inexplicably slower than expected.

The solution is not a better wiki or a smarter search bar. Search returns documents a human reads. An AI-native company needs context that is structured, callable, and current - the decision that is actually in force today, not the one that was written down eighteen months ago and has quietly drifted.
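One way to picture "structured, callable, and current" is a decision log in which every entry records what replaced it, so a caller can ask for the decision in force rather than search for documents. The sketch below is a hypothetical illustration: the `Decision` fields, the `decision_in_force` helper, and the sample entries are all assumptions made for this example, not any product's actual schema or API.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class Decision:
    decision_id: str
    topic: str
    statement: str
    decided_on: date
    superseded_by: Optional[str] = None  # id of the decision that replaced this one


def decision_in_force(decisions: Iterable[Decision], topic: str) -> Optional[Decision]:
    """Return the most recent decision on a topic that has not been superseded."""
    live = [d for d in decisions if d.topic == topic and d.superseded_by is None]
    return max(live, key=lambda d: d.decided_on, default=None)


# Hypothetical log: the older caching decision has quietly been replaced.
log = [
    Decision("D-1", "caching", "Cache at the API gateway", date(2024, 1, 10), superseded_by="D-2"),
    Decision("D-2", "caching", "Cache per service with a 60-second TTL", date(2025, 3, 4)),
]

current = decision_in_force(log, "caching")
print(current.statement)  # Cache per service with a 60-second TTL
```

A keyword search over the same log would return both entries and leave the reader, or the agent, to guess which one still applies.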

The measurement Ody bets on

Ody is the team operating system for AI-native companies. It compiles the decisions, context, and runbooks scattered across a team's tools - Slack, Linear, GitHub, docs, and coding agents like Claude Code and Cursor - into one living, callable team knowledge graph. People and agents work from the same source of truth.

Ody's own bet on usefulness is exactly the one this article is arguing for: it earns its place only if it makes the team measurably faster at real work. That means surfacing the decision a developer would otherwise re-derive from three Slack threads. It means catching the blocker before it costs two days. It means ramping a new hire - or a new coding agent - in days instead of weeks, because the context they need is callable rather than locked in someone's head.

The ceiling of Ody's autonomy is a nudge. It senses continuously and automatically, but it acts only when a human says so. No silent overwrites, no surprise commits. That restraint is not accidental. A tool that acts on your behalf without your sign-off is exactly the kind of tool that produces confidently wrong output - which is, again, a measurement and trust problem before it is anything else.

Ody is callable over MCP (so Claude Code and Cursor read the team graph directly), from the CLI, from Slack, and from the web. The data stays in the EU, your LLM is your own, and nothing becomes training data. Security details are at /security.

What Ody does not do is make claims about P&L impact that it cannot yet back with data from your team. That is the honest answer. If you run it and the people using it would slow down without it - measured with the same rigor METR brought to its study, not by asking people how they feel - then it is earning its place. If not, that feedback is worth more than a successful demo.

What to do if you are inside the 95%

Most enterprise AI programs are not failing because they picked the wrong vendor. They are failing because they are measuring whether AI exists in the organization rather than whether it is making the organization faster.

A practical way to check: pick one workflow where AI is "in use." Now ask - if you removed the AI tomorrow, would the team slow down in a way that shows up in a metric? If the honest answer is "no," or "we're not sure," the tool is present but not useful. It is the same failure as a decision that gets lost: the signal exists, but nobody is reading it.

Getting into the 5% is not about buying better tools or running more pilots. It is about holding any tool - AI or otherwise - to the same standard: does removing it make measurable work slower? If the answer is yes, keep it. If no, cut it. If you do not know, get a measurement before adding more tools on top of the ambiguity.

The teams that win on enterprise AI are the ones that get bored with demos quickly and go looking for the wall clock.

If you are building an AI-native team and want to see what a week looks like when everyone - humans and agents - works from the same context, join the waitlist at useody.com.

Common questions

Why do most enterprise AI pilots fail to deliver business impact?

MIT's Project NANDA found the core barrier is learning, not infrastructure or talent: most GenAI systems do not retain feedback, adapt to context, or improve over time. Organizations also tend to measure inputs (seats activated, pilots launched, demos run) rather than outputs like time saved or re-work eliminated. Tools that do not compound on what the team already knows produce impressive demos and invisible P&L impact.

What did the METR 2025 study actually find about AI coding tools?

METR ran a randomized controlled trial with 16 experienced open-source developers completing 246 tasks with and without AI tools. Developers with AI access took 19% longer on average, not less time. Before starting, those same developers predicted AI would make them 24% faster, and even after experiencing the slowdown they still estimated AI had sped them up by around 20%. The study covers a specific cohort and the authors caution against over-generalizing, but the lesson is that confidence in a tool is not a measurement of its effect.

How should a team measure whether an AI tool is actually useful?

The removal test is the simplest version: if you switched the tool off tomorrow, would a metric move? If the honest answer is "no" or "we don't know," the tool is present but not useful. More precisely, measure time-on-task with and without the tool for the same category of work, track re-work loops, and compare cycle times before and after adoption. Subjective surveys are a starting point, not an answer.
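The removal test can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a statistical procedure: the function names, the 5% threshold, the minimum sample size, and the sample timings are all hypothetical, and a real comparison would need many more tasks and a proper significance check.

```python
import random
from statistics import median


def assign_conditions(task_ids, seed=0):
    """Randomly assign each task to be done with or without the tool."""
    rng = random.Random(seed)
    return {task_id: rng.choice(["with_tool", "without_tool"]) for task_id in task_ids}


def removal_test(minutes_with, minutes_without, min_slowdown_pct=5.0, min_tasks=2):
    """Return (slowdown_pct, verdict) for one category of work.

    slowdown_pct is how much longer the median task takes without the tool.
    """
    if len(minutes_with) < min_tasks or len(minutes_without) < min_tasks:
        return None, "unknown: measure before adding more tools"
    m_with, m_without = median(minutes_with), median(minutes_without)
    slowdown_pct = 100.0 * (m_without - m_with) / m_with
    verdict = "keep" if slowdown_pct >= min_slowdown_pct else "cut"
    return slowdown_pct, verdict


# Hypothetical wall-clock minutes for the same category of task.
print(removal_test([30, 40, 50], [45, 60, 75]))  # (50.0, 'keep')
print(removal_test([40, 40], [40, 41]))          # (1.25, 'cut')
print(removal_test([], [10, 20]))                # (None, 'unknown: measure before adding more tools')
```

Random assignment matters because people tend to reach for the tool on tasks they already expect to go well, which flatters the tool in any before-and-after comparison.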