Working with Robots, Volume 1: Designing Agents That Stop and Ask for Guidance

Harrison Strowd

The Honeymoon and the Hangover

If you've spent any time with Claude Code or a similar agentic coding tool, this scene will feel familiar. You open a fresh session, type something vague like "add a feature that lets users export their data," and let the agent run. Five minutes later there's a working-looking prototype on your screen. The first time it happens, it's genuinely thrilling.

Then you start digging in. There's a bug, then a second bug, then an authentication assumption the agent just made up. By the fifth issue you've found, the awe has turned into suspicion, and somewhere around the tenth you're wondering whether this whole technology is overhyped.

We've watched this honeymoon-to-hangover arc play out enough times now to be confident the problem isn't the agent. The reality is that the engineer asked the agent to do something they would never let a human teammate do unsupervised, and then judged it for doing exactly, and only, what they asked.

The numbers back this up. METR's 2025 randomized study (1) put 16 experienced open-source developers on tasks in their own mature repositories. Going in, they expected AI tools to speed them up by 24%. After the study, they believed AI had sped them up by 20%. The measured result was that they were 19% slower.

A Methodology Problem, Not a Tool Problem

In this series, Working with Robots, we share some of the knowledge and insights we've gained while working with AI agents in practice. We've built and refined our approach across multiple clients running large, established software platforms, along with our own internal tooling. Across all of them, the same conclusion keeps surfacing: teams getting frustrated with agentic coding are usually running into a methodology problem, not a model problem.

This first post is about one specific lever: where to make the agent stop. Two extremes don't work. Letting the agent run unsupervised on a vague prompt produces the disillusionment above. Hovering over every diff gives up the productivity gains people came for in the first place. The useful position is structural: well-placed checkpoints designed into the skill itself.

Checkpoints aren't anti-autonomy. They're what makes responsible autonomy possible.

The Less the Agent Knows, the More Decisive It Sounds

A contrarian claim worth stating directly, because it changes how you design with the agent rather than against it: agent confidence and agent understanding are inversely correlated. The agent that sounds most certain about your codebase is usually the one that has looked at it least.

This isn't a calibration bug to fix. It's a structural property of how today's instruction-tuned models were trained. Tian et al. (5) showed that RLHF-tuned models (ChatGPT, GPT-4, Claude) are "very poorly calibrated"; their stated confidence does not track whether they're right. Kalai et al. (3) traced the cause to the training itself, where standard procedures reward confident guessing over expressing uncertainty. Saying "I don't know" earns the model nothing, so it learns to bluff.

The downstream effect on people is what should worry you. In Perry et al. (4), participants given an AI coding assistant wrote less secure code than the unassisted control group, and were more likely to believe their code was secure. Wrong code, paired with higher confidence in it. That combination is what makes this failure mode so dangerous.

Why End-of-Run Review Doesn't Save You

The natural instinct, when this becomes clear, is to lean harder on review at the end. Let the agent run; catch the problems in code review. In our experience, this rarely works.

Most engineers have had this moment: the agent produces a wall of changes, and somewhere around file fifteen your eyes start to glaze over. You stop reviewing in any meaningful sense and start skimming with the goal of reaching the end. This is not a discipline problem. If you weren't in the loop while the work was being built, you don't carry the mental model required to review it efficiently. You'd have to reconstruct the agent's reasoning from scratch, and at that point, writing the code yourself would have been cheaper.

The cost asymmetry is the deeper issue. Catching a misread requirement at the planning stage costs a sentence of redirection. Catching the same misread at validation costs the artifact, the review cycle, and the sunk reasoning that has to be unwound. Boehm and Basili (2) put the cost difference at roughly 100x for a defect caught in maintenance versus design. The multiplier is smaller in iterative workflows, but the directional point holds.

Agent Checkpoints

The fix is structural. We design our skills to enforce specific stopping points we call Agent Checkpoints. A quick disambiguation, because the term already means several things in this space. We don't mean the Cursor or Claude Code feature that lets you rewind to a previous snapshot. We don't mean LangGraph's state-persistence layer. We mean something different.

An Agent Checkpoint is a deliberate, skill-mandated stop where the agent surfaces three things before being allowed to continue: its current artifact, the open questions it could not resolve on its own, and the assumptions it had to make to get this far.

The three-part disclosure is what does the work. Without it, a stop is just a pause; with it, the human gets the minimum context required to redirect cheaply. We deliberately put this construct inside the skill, not the framework or the IDE. That keeps it portable across harnesses, and it survives the next round of tool churn.
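
To make that concrete, here is a rough sketch of what a skill-embedded checkpoint mandate might look like. The wording, and the `CHECKPOINT_MANDATE` name, are ours and purely illustrative; the only part we're claiming is the structure: a mandatory stop plus the three-part disclosure, carried in the skill text rather than in any particular tool.

```python
# A hypothetical excerpt of skill text, kept in the skill itself so the
# construct survives a change of IDE, framework, or harness.
CHECKPOINT_MANDATE = """
When you reach a checkpoint named in this skill, STOP. Before continuing, present:

1. The current artifact (research notes, plan, diff, or validation report).
2. Every open question you could not resolve on your own.
3. Every assumption you made in order to get this far.

Do not proceed past a checkpoint until the human has responded.
"""
```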

Five Common Checkpoints

In practice, most software engineering workflows decompose into the same five steps, and each transition between steps is a natural checkpoint. Each one corresponds to a specific failure mode that shows up when the skill skips the stop.

  1. Gather context and create a research plan. Without the checkpoint, the agent anchors on the first plausible signal it finds and builds the rest of its understanding on top of a misread. It searches confidently in the wrong corner of the codebase.

  2. Execute the research plan. Without the checkpoint, ambiguous findings get quietly resolved into facts. The open questions that would have changed the design never reach you.

  3. Create an execution plan. Without the checkpoint, the agent commits to an approach that looks right at the file level but cuts across architectural boundaries you'd never have endorsed. Plans that are 95% reasonable smuggle in 5% that wrecks the design.

  4. Execute the plan to generate the output artifact. Without the checkpoint, mid-implementation deviations don't get surfaced. The agent silently revises the plan to match what it actually built, and by the end, the plan you approved no longer matches the artifact.

  5. Validate the output artifact. Without the checkpoint, the agent self-validates against its own (possibly flawed) understanding of the requirement. It reports success because, by its own standards, it succeeded.

This is far from an exhaustive list of where checkpoints can help, but in our experience it's a practical starting point. Between checkpoints, the agent runs efficiently through the grunt work, where it genuinely shines. At the boundaries, the human directs.
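
For readers who think better in code, here is a minimal sketch of that five-step loop with a checkpoint at every transition. It is illustrative, not a real harness: the stage names come from the list above, while the `Disclosure` type and the `run_stage` stub are placeholders we invented, and the checkpoint is just stdin. The shape is the point: the agent produces, the human redirects, and nothing crosses a boundary silently.

```python
from dataclasses import dataclass, field


@dataclass
class Disclosure:
    """What the agent must surface at each checkpoint."""
    artifact: str
    open_questions: list[str] = field(default_factory=list)
    assumptions: list[str] = field(default_factory=list)


STAGES = [
    "Gather context and create a research plan",
    "Execute the research plan",
    "Create an execution plan",
    "Execute the plan to generate the output artifact",
    "Validate the output artifact",
]


def checkpoint(d: Disclosure) -> str:
    """Surface the three-part disclosure and block until the human responds."""
    print(f"\n=== Checkpoint: {d.artifact} ===")
    print("Open questions:", d.open_questions or "(none)")
    print("Assumptions:   ", d.assumptions or "(none)")
    return input("Approve, redirect, or answer the questions above: ")


def run_stage(name: str, guidance: str) -> Disclosure:
    # Placeholder for the actual agent run. In practice this is where the
    # agent consumes the guidance from the previous checkpoint and does the
    # grunt work for this stage.
    return Disclosure(artifact=f"{name} (stub output)")


def run_workflow() -> None:
    guidance = ""
    for name in STAGES:
        disclosure = run_stage(name, guidance)  # the agent runs efficiently between stops
        guidance = checkpoint(disclosure)       # the human directs at the boundary


if __name__ == "__main__":
    run_workflow()
```

Nothing here is specific to any agent framework; in our own setups the equivalent of `checkpoint` lives in the skill instructions, and the stop is enforced by the skill rather than by driver code like this.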

A skill without checkpoints produces confident output that masks numerous uncertainties and assumptions.

In Summary

The most important thing to internalize about today's coding agents is that they have no awareness of their own blind spots. Rather than surfacing uncertainty when they hit it, they do what they were trained to do: infer an answer and treat the inference as fact. The checkpoint is the only place you reliably catch them before that inference compounds into the artifact.

We'll continue this series with posts on the other levers that have shaped how we work with agents in practice: managing context across large codebases, decomposing work into agent-sized pieces, encoding team standards into reusable skills, building traceability into agent runs, and the meta-layer of building tools that build tools. We hope it proves helpful.

If you've designed checkpoints into your own workflows, or run into failure modes we didn't name, we'd genuinely like to hear about them. Drop us a line in the comments below.

Sources

  1. Becker, J., Rush, N., Barnes, B., & Rein, D. (2025). Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity. METR. arXiv:2507.09089. Retrieved from: https://arxiv.org/abs/2507.09089
  2. Boehm, B. W., & Basili, V. R. (2001). "Software Defect Reduction Top 10 List." IEEE Computer, 34(1), 135–137. Retrieved from: https://www.cs.cmu.edu/afs/cs/academic/class/17654-f01/www/refs/BB.pdf
  3. Kalai, A. T., Nachum, O., Vempala, S. S., & Zhang, E. (2025). Why Language Models Hallucinate. OpenAI. arXiv:2509.04664. Retrieved from: https://arxiv.org/abs/2509.04664
  4. Perry, N., Srivastava, M., Kumar, D., & Boneh, D. (2023). "Do Users Write More Insecure Code with AI Assistants?" Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (CCS '23). Retrieved from: https://arxiv.org/abs/2211.03622
  5. Tian, K., Mitchell, E., Zhou, A., Sharma, A., Rafailov, R., Yao, H., Finn, C., & Manning, C. D. (2023). "Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback." Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP). Retrieved from: https://aclanthology.org/2023.emnlp-main.330/
