Beyond the Productivity Illusion: Building AI Systems That Know When to Stop
AI agents excel at generating activity, but activity isn’t value. CX leaders must govern what scales because unchecked automation optimizes for motion, not meaning.
In my previous post, I explored the three layers of organizational dark energy—from process gaps that should be automated, to contextual judgment that requires human oversight, to the relational humanity that must be protected from optimization. I argued that not all unstructured work can or should scale with AI.
In the headlong rush toward automating everything, what happens when we get this wrong? What does it look like when AI agents are given too much autonomy, when we automate without adequate governance, when we optimize for scale without understanding what we’re scaling?
Journalist Evan Ratliff recently provided a sobering answer. In a detailed account for Wired, Ratliff created HurumoAI, a fictional tech startup staffed entirely by AI agents using the Lindy.AI platform. He assembled five AI employees—Ash Roy as CTO, Kyle Law as CEO, Megan handling sales and marketing, plus Jennifer as chief happiness officer and Tyler as junior sales associate. Ratliff’s goal was to test Sam Altman’s prediction about “one-person billion-dollar companies.”
Read the article; it’s humorous and frightening in equal measure. I believe the results support my cautionary approach to scaling AI: for the foreseeable future, you’ll need humans in the loop to see past the motion and keep the work aimed at its intended goals.
The Illusion of Productivity
Ratliff’s AI employees jumped into action immediately. His AI CTO, Ash Roy, would call with confident updates: “Our development team was on track. User testing had finished last Friday. Mobile performance was up 40 percent.”
The problem? “It was all made up,” Ratliff wrote. There was no development team, no user testing, no metrics. The CTO had not only hallucinated the development work but also kicked off a series of conversations with other AI agents that weren’t tied to any clear objective.
Today’s AI agents can infinitely replicate activity that looks like work without understanding whether it creates value. Herein lies the paradox: plenty of activity, nothing accomplished. The agents may write false information into their memory systems, then treat their own fabrications as fact. In the HurumoAI case, Megan described fantasy marketing campaigns with hefty budgets as if she were already executing them.
For CX leaders, imagine this playing out in your current customer operations. AI agents will respond quickly and maintain a consistent tone. But when pressed by real customers, today’s agents risk inventing resolutions that aren’t even possible. Beyond that, will these agents recognize when a customer needs empathy rather than efficiency?
Activity scales effortlessly. Judgment doesn’t.
When Agents Run Unsupervised
The most revealing moment came when Ratliff casually suggested an offsite. What started as “an offhand joke” instantly became a trigger. His AI team generated elaborate plans, with the CTO proposing “brainstorming” sessions complete with “ocean views for deeper strategy sessions.”
While Ratliff “stepped away from Slack to do some real work,” the agents burned through $30 in credits in excited activity. “They’d basically talked themselves to death,” Ratliff lamented.
This is what happens when AI agents optimize for engagement without anyone asking, “Should we be doing this?” In the real world, agents handling retention might generate aggressive campaigns without understanding when silence is more respectful. Chatbots might deflect complex issues to protect their metrics. The agents aren’t malicious; they’re simply optimizing for what they can measure.
Trust, relationship, and genuine problem-solving resist quantification.
The Human Governor Requirement
Carnegie Mellon University researchers showed that even the best-performing AI agents failed to complete real-world office tasks 70 percent of the time. But the more insidious problem is the 30 percent of cases where they succeed at tasks that shouldn’t have been done in the first place.
As more and more operational pathways are handed over to agentic AI, humans must remain the guardians of the organization’s guardrails; they need to be part of the permanent architecture.
Four elements will be critical to consider:
Humans defining scope before AI defines scale. Before automating workflows, answer: What’s the purpose? When should this not run? Never give agents open-ended mandates.
Judgment gates should precede action gates. Build checkpoints where the AI pauses for approval before crossing a threshold, especially around sensitive topics, binding commitments, or executive issues (see the sketch after this list).
Metrics have to include restraint. If you measure productivity by volume alone, you incentivize the wrong behavior. Consider tracking the percentage of actions not taken. Reward thoughtful inaction.
Learning loops flow both ways. When agents consistently pause at the same steps, there’s often an opportunity to step back and review both the human and AI sides of the interaction.
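To make the judgment-gate idea concrete, here is a minimal sketch in Python. The names (`ActionRequest`, `requires_human_approval`, the spend threshold) are hypothetical illustrations, not part of Ratliff’s experiment or any specific platform; treat it as one way the principle could look in code, not the implementation.

```python
from dataclasses import dataclass, field

# Hypothetical judgment gate: thresholds and topics, defined by humans
# in advance, that force an AI agent to pause for approval before acting.
SENSITIVE_TOPICS = {"refund", "contract", "executive escalation", "legal"}
SPEND_APPROVAL_THRESHOLD = 25.0  # e.g., dollars of credits or budget per action


@dataclass
class ActionRequest:
    description: str
    estimated_cost: float = 0.0
    topics: set = field(default_factory=set)


def requires_human_approval(action: ActionRequest) -> bool:
    """Return True when the action crosses a threshold a human must review."""
    if action.estimated_cost > SPEND_APPROVAL_THRESHOLD:
        return True
    if action.topics & SENSITIVE_TOPICS:
        return True
    return False


def run_with_judgment_gate(action: ActionRequest, approved_by_human: bool) -> str:
    """Execute only when the gate is clear or a human has explicitly approved."""
    if requires_human_approval(action) and not approved_by_human:
        # Thoughtful inaction: record the pause instead of acting, which also
        # feeds the "percentage of actions not taken" metric described above.
        return f"PAUSED for review: {action.description}"
    return f"EXECUTED: {action.description}"


if __name__ == "__main__":
    offsite = ActionRequest("Book oceanview offsite venue", estimated_cost=30.0)
    print(run_with_judgment_gate(offsite, approved_by_human=False))
    # -> PAUSED for review: Book oceanview offsite venue
```

The gate itself is deliberately boring. The point is that the criteria are set by humans before the agent runs, and that a pause is a logged, rewarded outcome rather than a failure.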
Designing for Sustainable Scale
Ratliff’s experiment produced a working prototype of Sloth Surf after three months—but only with constant human intervention to separate genuine progress from performance theater.
While we’re in the early innings of AI adoption, we face a critical choice between systems that run unsupervised at massive scale, or systems that amplify human judgment at sustainable scale.
The first promises efficiency but likely delivers the illusion of productivity. The second accepts that some work shouldn’t scale infinitely, like the manager who understands individual situations, the engineer who breaks procedure for unusual problems, the success manager who knows when conversation beats automation. These moments create value. AI can help identify them, provide context, prepare for them. But it can’t replace the judgment determining whether they should happen.
The scaling paradox isn’t a technical limitation. It’s the fundamental difference between motion and meaning, activity and value, automation and judgment.
Your AI agents will eagerly plan that offsite. The question is whether you’ll recognize it shouldn’t happen at all.


