
Agents: When Claude Works Autonomously

#claude-code#ai#agents#automation#developer-tools

The last post covered MCP servers. Giving Claude direct access to your infrastructure. Your NAS, your databases, your running services. Skills tell Claude how to do things. MCP servers give Claude access to the systems where things happen.

But in both cases, you're still driving. You ask a question, Claude answers. You give an instruction, Claude executes. Every step goes through you.

Agents change that. You define a task and a set of boundaries. Claude figures out the steps, delegates work, runs things in parallel, and comes back with results. The shift isn't from manual to automatic. It's from directing every action to defining the scope and letting execution happen within it.

This is the final post in the series. It's also where the other pieces come together. Skills define methodology. MCP servers provide access. Agents use both to work independently.

What an Agent Actually Is

In Claude Code, an agent is a scoped instance of Claude that handles one part of a larger task. You'll see them called subagents. The idea is straightforward: instead of one Claude doing everything in sequence, you spin up focused instances that each handle a specific job.

Each subagent gets its own context window, its own tool access, its own area of focus. This matters more than it sounds. Context windows are finite. A subagent that only thinks about schema validation doesn't waste tokens on performance metrics or content analysis. It does one thing, and it does it with full attention.

An agent definition is a markdown file with frontmatter. A name, a description, the tools it's allowed to use. Below that, its instructions. Same format as a skill, but the intent is different. A skill tells Claude how to do something. An agent tells Claude to go do it.

```yaml
---
name: seo-technical
description: Technical SEO specialist. Analyzes crawlability,
  indexability, security, URL structure, mobile optimization,
  Core Web Vitals, and JavaScript rendering.
tools: [Read, Bash, Write, Glob, Grep]
---
```

The description serves the same purpose as a skill trigger. It tells the orchestrating agent what this subagent is good at, so it knows when to delegate.

Subagents in Practice

The clearest example I have is the SEO audit from the plugins post. I mentioned it there but didn't go deep. Here's what actually happens.

When I run /seo audit on a URL, the orchestrator skill spawns 6 subagents in parallel:

  • Technical: analyses crawlability, indexability, security headers, URL structure, mobile optimisation, Core Web Vitals
  • Content: evaluates E-E-A-T signals, readability, content depth, thin content detection
  • Schema: detects and validates structured data, generates missing markup
  • Sitemap: validates XML sitemaps, checks URL coverage, identifies gaps
  • Performance: measures Core Web Vitals, analyses page load waterfall
  • Visual: takes screenshots at desktop and mobile breakpoints, checks above-the-fold content

Each runs independently. They don't share context. They don't wait for each other. The orchestrator waits for all 6 to complete, then synthesises the results into a scored report with a prioritised action plan.
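The fan-out pattern is easy to sketch in plain Python. This is not the actual orchestrator, just a minimal illustration of the shape: independent checks launched together, the orchestrator waiting on all of them, then merging the results. The check function and report structure are hypothetical stand-ins.

```python
import asyncio

# Hypothetical stand-in for spawning a scoped subagent.
# Each check runs independently and returns its own partial report.
async def run_check(name: str, url: str) -> dict:
    await asyncio.sleep(0)  # real work: a focused Claude instance with its own context
    return {"agent": name, "url": url, "findings": []}

async def audit(url: str) -> dict:
    agents = ["technical", "content", "schema", "sitemap", "performance", "visual"]
    # Launch every check at once; none waits on another.
    results = await asyncio.gather(*(run_check(a, url) for a in agents))
    # The orchestrator synthesises the independent reports into one.
    return {"url": url, "sections": {r["agent"]: r["findings"] for r in results}}

report = asyncio.run(audit("https://example.com"))
```

The design point is in `gather`: the checks share no state, so there is nothing to coordinate until the end. That is what makes this the easy case of multi-agent work.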

What would take an hour of sequential analysis finishes in minutes. Not because any individual check is faster, but because they all run at the same time. The parallelism is the point.

The other benefit is less obvious. Each subagent is a specialist. The schema agent knows schema types, validation rules, and Google's current requirements. It doesn't need to know about robots.txt parsing or content readability scores. Narrower focus means better results on each individual check.

Agent Libraries: Context Without Repetition

Subagents solve the parallelism problem. Agent libraries solve the knowledge problem.

I have a homelab. TrueNAS server, n8n for workflow automation, Docker containers, Nginx Proxy Manager for routing. Multiple projects deploy to it. This site, a price comparison tool, a YouTube automation pipeline, a file converter app. Every project needs the same infrastructure knowledge. IP addresses, deployment procedures, Docker conventions, n8n API patterns.

The obvious approach is to put all of that in each project's CLAUDE.md. It works, but it duplicates everything. Update the n8n API endpoint? Change it in 5 files. Add a new deployment convention? Same story. And every project loads context it doesn't need for the current task.

So I built a homelab agent library. It's a layered context system with four levels:

Global loads every time. The infrastructure map. IP addresses, service endpoints, network layout, conventions that apply everywhere.

Technology loads when working with specific tools. There's an n8n layer with API patterns, workflow design rules, and known gotchas. A Docker layer with container management patterns. Each one only loads when relevant.

Purpose loads for specific activities. The deployment layer knows how to get a service from a local Docker Compose file to a running container on TrueNAS. It doesn't load when you're just editing code.

Project loads for specific codebases. The iammattl layer knows this site runs on Cloudflare Workers. The techpartprices layer knows its deployment target is different. Project-specific context without polluting the global scope.

```yaml
layers:
  - path: layers/global
    scope: always
  - path: layers/n8n
    scope: technology
  - path: layers/docker
    scope: technology
  - path: layers/deployment
    scope: purpose
  - path: layers/projects/iammattl
    scope: project
```
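Resolution is simple once the config is parsed. A minimal sketch, assuming the layer list above is already loaded as dicts; the function name and matching rules are illustrative, not the library's actual API:

```python
# Hypothetical resolver: given what you're working on, return only the
# layers that should load. Matching by substring is an illustration.
def resolve_layers(layers, technologies=(), purposes=(), project=None):
    active = []
    for layer in layers:
        scope, path = layer["scope"], layer["path"]
        if scope == "always":
            active.append(path)  # the global infrastructure map always loads
        elif scope == "technology" and any(t in path for t in technologies):
            active.append(path)
        elif scope == "purpose" and any(p in path for p in purposes):
            active.append(path)
        elif scope == "project" and project and path.endswith(project):
            active.append(path)
    return active

layers = [
    {"path": "layers/global", "scope": "always"},
    {"path": "layers/n8n", "scope": "technology"},
    {"path": "layers/docker", "scope": "technology"},
    {"path": "layers/deployment", "scope": "purpose"},
    {"path": "layers/projects/iammattl", "scope": "project"},
]

# Editing code in the iammattl project with Docker: the n8n and
# deployment layers stay out of context.
active = resolve_layers(layers, technologies=["docker"], project="iammattl")
```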

Alongside the layers, there are skills and rules. Skills are reusable procedures. deploy-container knows the exact steps: validate the compose file, transfer to TrueNAS, build and start, verify the health check, optionally set up the reverse proxy. Rules are hard constraints. deployment-safety.md defines what's not allowed regardless of which agent runs. docker-wsl.md captures a specific gotcha about Docker credential helpers in WSL2.

The compound effect is that new projects get deployment knowledge without duplicating anything. I add a project layer with the specifics, and the existing infrastructure knowledge is already there.

Multi-Agent Orchestration

The first two examples are practical and approachable. This one is the far end of the spectrum. Most people won't need it. But it shows where the model goes when you push it.

I forked and extended a multi-agent orchestration system called autonomous-coder. It coordinates 7 specialised agents: frontend, backend, design, QA, DevOps, documentation, and research. Given a set of tasks with dependencies, it figures out what can run in parallel and executes them simultaneously.

The process works like this:

  1. Plan. Analyse task dependencies. Build a dependency graph. Group tasks into levels where everything in a level can run concurrently.
  2. Spawn. Each task gets assigned to a specialised agent based on its type. Agents launch as separate OS processes. True parallelism, not async.
  3. Coordinate. A state manager handles inter-agent communication through file-based IPC with proper locking. Each agent sends heartbeats every 10 seconds. If a heartbeat stops for 60 seconds, the coordinator flags it as crashed.
  4. Verify. Design tasks hit a quality gate. The system checks for screenshots at desktop and mobile breakpoints. No screenshots, no pass. It auto-creates blocker tasks with Playwright instructions if the verification is missing.
  5. Recover. Checkpoints save progress. If an agent crashes mid-task, work resumes from the last checkpoint instead of restarting from scratch.

The result is 2-3x speedup on multi-component tasks. 13 Python modules, 2,757 lines of coordination code. The agents themselves are the simple part. The coordination is where the complexity lives.

I'm being specific about the numbers because they tell the real story. The orchestrator, the state manager, the heartbeat monitor, the recovery system: that machinery is more code than the actual work the agents do. If you're thinking about building something like this, know that the hard problem isn't giving Claude tasks. It's managing what happens when multiple Claude instances work on the same codebase at the same time.

The Trust Question

This is the part people actually want to talk about. How much can you trust an agent to work unsupervised?

The honest answer: it depends entirely on the task.

Where agents work well:

  • Well-scoped tasks with clear success criteria. "Analyse this page for SEO issues and score it" has a defined output.
  • Repetitive work across a known pattern. Deploying containers, running audits, generating boilerplate.
  • Parallel analysis where each piece is independent. The SEO audit works because the 6 subagents don't need to coordinate.
  • Anything where the methodology is fully defined and the cost of being wrong is low.

Where they don't:

  • Ambiguous requirements. If you can't define the success criteria, an agent can't either.
  • Novel architecture decisions. Agents are good at following established patterns, not inventing new ones.
  • High-stakes operations with slow feedback loops. An agent that deploys to production needs more guardrails than one that reads logs.
  • Tasks that require cross-agent coordination on shared state. This is technically solvable (autonomous-coder does it) but the overhead is significant.

The practical rule I use: if I'd hand the task to a competent developer with clear written instructions, an agent can probably handle it. If the task needs judgement that comes from experience and context I can't easily write down, I stay in the loop.

Guardrails matter more than capability. The TrueNAS MCP server from the last post blocks privileged containers and dangerous mounts by default. That's a guardrail baked into the infrastructure, not into a prompt. When an agent has deployment access, the constraints need to live in the system, not in the instructions. Instructions get ignored under edge cases. System-level constraints don't.

Trust builds incrementally. Start with read-only agents. Things that analyse, report, and suggest but don't modify anything. Once you're confident in the analysis, graduate to write access. Then to autonomous execution. Same way you'd onboard a new team member. You don't hand someone production access on day one.

What I Got Wrong

Too many concurrent agents hit limits faster than expected. Six subagents running simultaneously means six context windows, six sets of tool calls, six streams of output. The resource consumption scales linearly but the coordination overhead scales worse than that. I've learned to be deliberate about how many agents run at once rather than parallelising everything because I can.

Overly broad agent definitions produce mediocre work. Same lesson as skills. An agent defined as "handle all frontend tasks" makes worse decisions than one defined as "analyse CSS specificity issues and propose fixes." Narrower scope, better results.

Autonomous doesn't mean unchecked. The visual verification gate in autonomous-coder exists because I shipped broken UI without it. The agent finished the task, reported success, and the layout was wrong. Now design tasks don't pass without screenshots proving the output looks right. Every quality gate I've added was in response to something going wrong.

Coordination is harder than execution. The state management, heartbeats, and recovery system in autonomous-coder account for more code than the agents themselves. If your agents need to share state or depend on each other's output, expect the coordination layer to be the bulk of the work.

Where to Start

If you've followed this series and have skills and MCP servers set up, adding agents is the natural next step.

Start with a subagent in an existing skill. Take something that runs sequentially and parallelise one piece. If your deployment skill checks three things in sequence and they're independent, make them three subagents.

Start read-only. An agent that analyses but doesn't modify is low risk and immediately useful. Let it prove itself before you give it write access.

Define boundaries before capabilities. What the agent can't do matters more than what it can. Blocked operations, restricted file paths, required verification steps. Set these first.

Build a context library when you see duplication. If you're copying the same infrastructure context into multiple CLAUDE.md files, extract it into a shared library. The layered loading means agents only get the context they need.

The Series Arc

Five posts. One progression.

Getting started. Claude as a conversation partner. Give it good inputs, get better outputs.

Building projects. Claude as a daily tool. CLAUDE.md as institutional memory. The workflow that makes it reliable.

Skills and plugins. Claude remembers how to do things. Package expertise so it runs the same way every time.

MCP servers. Claude connects to real infrastructure. Your databases, your servers, your running services. Access without tab switching.

Agents. Claude works autonomously within your boundaries. Delegation, parallelism, and knowing when to step back.

Each layer compounds on the last. Skills are more useful with MCP access. Agents are more useful with skills and MCP servers combined. The whole stack works because each piece does one thing and they compose naturally.

The goal was never full autonomy. It's the right amount of autonomy for the task at hand. Sometimes that's a chat message. Sometimes it's 6 agents running in parallel across your infrastructure. The skill isn't in building the most autonomous system possible. It's in knowing which level of autonomy the current task actually needs.

Agents: When Claude Works Autonomously | Matt Lambert