June 5, 2026 · 14 min read

Scaffolding Over Horsepower: Subagents, Skills, Hooks, and MCP

A working developer's guide to subagents, skills, hooks, and MCP — and the discipline of adding a block only when you can name the failure it prevents.

By Pallav

I spun up a fleet of agents on a refactor I was too lazy to do by hand. One lead, five workers, a swarm pattern I'd read about that morning. I watched the token counter climb, made coffee, came back. The diff was worse than what a single agent had produced the day before — three workers had edited the same file, two had solved a problem that didn't exist, and the lead had stitched their work together without noticing the contradictions.

The autopsy was the interesting part. The swarm didn't fail because the model was weak or because five agents is too many. It failed because nothing structured the work. I'd handed each worker a vague brief — "clean up the auth module" — and they'd each interpreted it differently, gathered no shared context, and validated nothing. The horsepower was never the bottleneck. The scaffolding around it was.

That lines up with the data. A study of 9,374 agent trajectories found that what separates a successful coding run from a failed one is the structure of the trajectory — how the agent gathers context and validates its work — not its length. Trajectory length is a confound: it looks predictive until you control for task difficulty, then the correlation reverses (arXiv 2604.02547). More steps don't make an agent smarter. Better-shaped steps do.

So this post is about the scaffolding — the four building blocks that turn a chatty model into something that ships: subagents, skills, hooks, and MCP. What each one is, when to reach for it, and the ways each one quietly goes wrong.

What "agentic" actually means

Strip away the marketing and the distinction is simple. A chat assistant answers. You ask, it replies, the turn ends. An agent is given a goal, decomposes it into steps, acts on the world through tools, observes the results, and loops until the goal is met or it gives up. The two capabilities that make it "agentic" are autonomous task decomposition and tool-driven action — everything else is detail.
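
The loop itself is small enough to sketch. This is a toy under stated assumptions: `plan_next_step` is a hypothetical, scripted stand-in for the model call, and the `TOOLS` registry and its two tools are invented for illustration, not taken from any real framework. It only shows the control flow: decide, act, observe, repeat, with a step cap so the loop can give up.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical tool registry: a tool is just a named function the agent can call.
TOOLS: dict[str, Callable[[str], str]] = {
    "read_file": lambda arg: f"contents of {arg}",
    "run_tests": lambda arg: "3 passed, 0 failed",
}


@dataclass
class Step:
    tool: str
    arg: str


@dataclass
class AgentState:
    goal: str
    observations: list[str] = field(default_factory=list)


def plan_next_step(state: AgentState) -> Optional[Step]:
    """Stand-in for the model: pick the next action, or None when the goal is met.

    A real agent would call an LLM here; this scripted version only
    illustrates the shape of the decision.
    """
    if not state.observations:
        return Step("read_file", "auth/session.ts")
    if len(state.observations) == 1:
        return Step("run_tests", "auth")
    return None  # nothing left to do


def run_agent(goal: str, max_steps: int = 10) -> AgentState:
    state = AgentState(goal)
    for _ in range(max_steps):                 # hard cap: the loop can give up
        step = plan_next_step(state)           # decompose: choose the next step
        if step is None:
            break
        result = TOOLS[step.tool](step.arg)    # act on the world through a tool
        state.observations.append(f"{step.tool}({step.arg}) -> {result}")  # observe
    return state
```

Everything else in this post is scaffolding around those few lines: what goes into the state, which tools exist, and what is allowed to interrupt the loop.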

Once you have that loop, the building blocks are just answers to four recurring questions: how do I split a big job across agents without them colliding (subagents), how do I stop re-teaching the agent the same workflow (skills), how do I guarantee a step happens regardless of what the model decides (hooks), and how do I give the agent reach into my tools and data without writing a bespoke integration each time (MCP).

FIG_01: THE AGENT LOOP AND ITS FOUR SCAFFOLDS

Block	What it is	Reach for it when
Subagent	A child agent with its own separate context window, run by a lead	A job splits into independent chunks that would each pollute one shared context
Skill	A reusable, packaged capability — instructions plus resources, loaded on demand	You catch yourself re-explaining the same workflow every session
Hook	A deterministic shell command fired on a lifecycle event	A step must happen every time — a format, a check, a block — no matter what the model decides
MCP	An open protocol that plugs external tools and data into any agent	You want the agent to reach a system, and you don't want to hand-roll the integration

Subagents: parallelism with a clean context each

The default multi-agent shape is orchestrator-worker: a lead agent decomposes the task, spins up workers to handle the pieces in parallel, and synthesizes their results. The detail that makes it work is easy to miss — each worker runs in its own separate context window. That's the actual value, more than the parallelism. A subagent investigating one corner of a problem fills its context with that corner's mess and hands back only a clean summary. The lead's context stays uncluttered.

Anthropic reported an orchestrator-worker setup beating a single agent by 90.2% on their internal research-task evaluation (Anthropic engineering). Read that number with both eyes open: it's a self-reported vendor eval; it held only for parallelizable tasks; and the same write-up puts multi-agent token use at roughly fifteen times that of a chat interaction, against about four times for a single agent. Multi-agent is not free leverage. It's a trade: you spend tokens and coordination overhead to buy parallelism and context isolation, and it only pays off when the work genuinely splits.

Isolation has a hidden tax of its own: the sync phase. When three subagents independently modify different layers of a monorepo, the lead is forced into a git-merge role it's bad at — and rather than flag a conflict, a model will often hallucinate a plausible resolution for two changes it can't actually reconcile. That's the failure from my opening, viewed from the lead's side: not too many agents, but no plan for stitching their isolated work back into one coherent state.
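
One way to keep the lead out of the git-merge role is to check for overlap mechanically before it synthesizes anything. A minimal sketch, assuming each worker reports the set of files it touched; `find_conflicts` and `merge_or_escalate` are hypothetical helpers, not part of any agent framework.

```python
from collections import defaultdict


def find_conflicts(touched: dict[str, set[str]]) -> dict[str, list[str]]:
    """Map each file edited by more than one worker to the workers that edited it.

    `touched` maps a worker name to the set of file paths it modified.
    """
    editors: dict[str, list[str]] = defaultdict(list)
    for worker, files in sorted(touched.items()):
        for path in sorted(files):
            editors[path].append(worker)
    return {path: workers for path, workers in editors.items() if len(workers) > 1}


def merge_or_escalate(touched: dict[str, set[str]]) -> tuple[str, dict[str, list[str]]]:
    """Decide whether the lead may merge, or must hand the overlap to a human."""
    conflicts = find_conflicts(touched)
    if conflicts:
        # Surface the overlap instead of letting the lead improvise a resolution.
        return "escalate", conflicts
    return "merge", {}
```

File-level overlap is a coarse signal. Two workers can also break each other through different files, so this check catches the cheap cases and nothing more.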

FIG_02: ORCHESTRATOR-WORKER FAN-OUT WITH ISOLATED CONTEXT

The framing is consistent across tools. LangChain's multi-agent docs describe a main agent that routes to subagents as if they were tools, with "handoffs" that transfer control via a tool call, and they name parallelization as a primary reason to go multi-agent at all (LangChain docs). Different vocabulary, same skeleton: one coordinator, many isolated workers, control passed through structured calls.

Here is the failure I hit, stated as a rule: vague delegation is where multi-agent dies. Hand a subagent "clean up the auth module" and it will misinterpret the scope, duplicate work another subagent is already doing, and gather context you didn't want. Effective delegation reads like a task ticket — objective, boundaries, expected output format, and what not to touch. The lead's real job isn't orchestration plumbing; it's writing specs tight enough that isolated workers can't drift.

The difference is concrete. "Clean up the auth module and make sure it uses the new tokens" is the brief that sank my swarm. The version that works reads like a ticket a stranger could pick up without asking a single question:

subagent-brief.json

json

{
  "task": "Migrate auth/session.ts to use the TokenManager class.",
  "boundaries": "Do not modify token-validation.ts or add types to global.d.ts.",
  "verification": "pnpm test auth passes after the migration.",
  "output": "Return a git diff snippet and a list of updated function signatures."
}

Every field is closing a door the vague brief left open: boundaries stops two workers from fighting over the same file, verification gives the subagent a way to know it's done that isn't "the lead seems happy," and output keeps what comes back small enough that the lead's context survives the merge.
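
A lead can refuse to dispatch a brief that leaves any of those doors open. A minimal sketch, assuming the four field names from the brief above; `REQUIRED_FIELDS` and `missing_fields` are hypothetical names for illustration.

```python
import json

REQUIRED_FIELDS = ("task", "boundaries", "verification", "output")


def missing_fields(brief: dict) -> list[str]:
    """Return the required fields that are absent, non-string, or blank."""
    missing = []
    for name in REQUIRED_FIELDS:
        value = brief.get(name)
        if not isinstance(value, str) or not value.strip():
            missing.append(name)
    return missing


# The brief that sank the swarm: a task and nothing else.
vague = {"task": "Clean up the auth module and make sure it uses the new tokens"}

# The ticket version, parsed from the same JSON shown above.
ticket = json.loads("""
{
  "task": "Migrate auth/session.ts to use the TokenManager class.",
  "boundaries": "Do not modify token-validation.ts or add types to global.d.ts.",
  "verification": "pnpm test auth passes after the migration.",
  "output": "Return a git diff snippet and a list of updated function signatures."
}
""")
```

A presence check can't tell a good boundary from a bad one, but it does make "I forgot to say what not to touch" impossible to ship.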

Pick the simplest coordination pattern that works

Anthropic's write-up on coordination names five patterns (coordination patterns). They sit on a complexity ladder, and the entire game is climbing it as slowly as possible.

Pattern	Shape	Use when
Generator-Verifier	One agent produces, another checks	Output quality matters more than raw speed
Orchestrator-Subagent	Lead delegates short, focused subtasks	The default — start here
Agent Teams	Peer agents with distinct roles collaborate	Roles are genuinely specialized and long-lived
Message Bus	Agents communicate through a shared channel	Many agents, loosely coupled, event-driven
Shared State	Agents read and write a common store	Work product must be co-edited, not just passed

PREMATURE SOPHISTICATION IS THE COMMON FAILURE

The recurring multi-agent mistake is starting with a message bus and shared state when an orchestrator with two subagents would have done the job. Begin with the simplest pattern, and escalate only when you can point at the specific thing the simpler pattern can't do. My swarm failure was exactly this — a complex topology solving a problem that wanted one well-briefed agent.

Skills: stop re-teaching the same workflow

A skill is a reusable, packaged capability — a folder of instructions plus any resources (scripts, templates, reference docs) that an agent loads on demand when the task calls for it. The unit of reuse isn't a prompt you paste; it's a named workflow the agent can pull in by itself.

The mechanism that makes skills scale is progressive disclosure. The agent doesn't carry every skill's full text in context. It sees a short description of each available skill, and only when one is relevant does it load the full instructions and resources. You can have fifty skills installed and pay context for only the one in use. The signal that you need a skill is boredom: the third time you explain your deploy checklist, your changelog format, or how your test fixtures work, that explanation should be a skill instead.

Concretely, a skill is usually just a folder: a markdown file with a little frontmatter the agent indexes, a body it loads on demand, and whatever resources the workflow needs sitting alongside it.

skills/cut-release/

text

cut-release/
  SKILL.md          # the workflow itself
  changelog.tmpl    # a bundled resource the body refers to
  bump-version.sh   # a script the agent can run

skills/cut-release/SKILL.md

markdown

---
name: cut-release
description: Tag a release, update the changelog, and open the PR. Use when asked to cut, ship, or publish a release.
---

1. Run `bump-version.sh <major|minor|patch>` and read back the new version.
2. Prepend a dated section to CHANGELOG.md using `changelog.tmpl`.
3. Commit as `release: vX.Y.Z`, tag it, and open a PR titled the same.
4. Stop and report the PR URL — do not merge.

Only the description line lives in the agent's context at rest; the numbered steps and the two bundled files load when the skill fires. That frontmatter description is the progressive-disclosure index — write it like a trigger, not a title, or the agent never reaches for it.
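
The at-rest versus on-demand split can be sketched in a few functions. This assumes the folder layout above, a folder name that matches the skill's `name`, and frontmatter made of plain `key: value` lines. Real loaders typically parse the frontmatter as YAML, and `read_frontmatter`, `build_index`, and `load_body` are hypothetical names, not any product's API.

```python
from pathlib import Path


def read_frontmatter(skill_md: Path) -> dict[str, str]:
    """Parse the simple `key: value` lines between the two `---` markers."""
    lines = skill_md.read_text(encoding="utf-8").splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            meta[key.strip()] = value.strip()
    return meta


def build_index(skills_dir: Path) -> dict[str, str]:
    """At rest: keep only name -> description for every installed skill."""
    index: dict[str, str] = {}
    for skill_md in sorted(skills_dir.glob("*/SKILL.md")):
        meta = read_frontmatter(skill_md)
        if "name" in meta and "description" in meta:
            index[meta["name"]] = meta["description"]
    return index


def load_body(skills_dir: Path, name: str) -> str:
    """On demand: read the full instructions for the one skill that fired."""
    text = (skills_dir / name / "SKILL.md").read_text(encoding="utf-8")
    # Everything after the second `---` is the body.
    parts = text.split("---", 2)
    return parts[2].strip() if len(parts) == 3 else text.strip()
```

The index is what sits in context for all fifty skills. `load_body` runs once, for the skill whose description matched.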

The honest caveat: the richest skill tooling today lives in specific products, and the cross-tool story is thinner than the vendor docs imply. The concept — a packaged, on-demand workflow — is portable. The exact format and loader are not yet a settled standard the way the agent loop is. Build skills around the workflow, not around one vendor's folder layout, and you'll port them with less pain when the standard catches up.

Hooks: the step the model can't skip

Everything above runs at the discretion of a probabilistic model. Most of the time that's fine. Sometimes it absolutely isn't — you cannot have "run the formatter" or "never touch the production config" be a thing the agent does usually. That's what hooks are for: deterministic shell commands wired to lifecycle events that always run, regardless of what the model decided (hooks guide).

This is the key mental shift. A prompt is a request; a hook is a guarantee. You can instruct a model nine ways to lint before committing and it will still occasionally forget. A PostToolUse hook that runs the linter forgets nothing, because it isn't reasoning — it's a shell command on a trigger.

Event	Fires	Typical use
`PreToolUse`	Before a tool runs — and can block it	Deny writes to protected paths; gate dangerous commands
`PostToolUse`	After a tool succeeds	Auto-format, lint, run the affected tests
`UserPromptSubmit`	When you send a message	Inject standing context; tag the session
`SessionStart`	At session boot	Load environment facts, print a checklist
`SubagentStart` / `SubagentStop`	Around a subagent's life	Scope a worker's permissions; collect its output
`Stop`	When the agent finishes	Final validation; notify; archive the trace

PreToolUse is the load-bearing one, because it's the event that can stop a tool call before it runs. It's your guardrail: the agent proposes an action, the hook inspects it, and a blocking exit code stops the action before it happens. In Claude Code, whose .claude/hooks/ layout these examples follow, that code is 2; any other non-zero exit is reported as a non-blocking error and the tool call proceeds. A small example: refuse any write under a protected directory, and format everything else after the fact:

.claude/hooks/guard.sh

bash

#!/usr/bin/env bash
# PreToolUse: block writes to protected paths.
# Reads the proposed tool call as JSON on stdin; exit 2 to deny.
set -euo pipefail

payload="$(cat)"
tool="$(jq -r '.tool_name' <<<"$payload")"
path="$(jq -r '.tool_input.file_path // empty' <<<"$payload")"

if [[ "$tool" == "Write" || "$tool" == "Edit" ]]; then
  case "$path" in
    *infra/prod/*|*.env.production)
      # stderr is the message handed back to the agent when the call is denied.
      echo "Blocked: $path is protected. Edit via the deploy pipeline." >&2
      exit 2 ;;  # 2 = deny the tool call; exit 1 would only log an error
  esac
fi

exit 0  # zero = allow

.claude/hooks/format.sh

bash

#!/usr/bin/env bash
# PostToolUse: format whatever the agent just wrote. Advisory, never blocks.
set -euo pipefail

path="$(jq -r '.tool_input.file_path // empty' <<<"$(cat)")"
[[ -z "$path" ]] && exit 0

case "$path" in
  *.ts|*.tsx|*.js|*.json) pnpm exec prettier --write "$path" >/dev/null 2>&1 || true ;;
  *.py)                   ruff format "$path"           >/dev/null 2>&1 || true ;;
esac
exit 0
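
From the host's side, the guarantee is nothing more exotic than a subprocess call and an exit-code check. A sketch of that dispatch, assuming a host that reads exit code 2 from a PreToolUse hook as "deny" and treats every other result as advisory. `fire_hooks`, `REGISTRY`, and the inline `GUARD` command (a stand-in for guard.sh) are hypothetical, not any product's internals.

```python
import json
import subprocess
import sys

BLOCK_EXIT_CODE = 2  # assumption: the host reads exit code 2 as "deny"


def fire_hooks(event: str, payload: dict, registry: dict[str, list[list[str]]]) -> tuple[bool, str]:
    """Run every command registered for `event`, passing the payload as JSON on stdin.

    Returns (allowed, reason). Only a PreToolUse hook exiting with
    BLOCK_EXIT_CODE denies the action; any other result is advisory.
    """
    for command in registry.get(event, []):
        proc = subprocess.run(
            command,
            input=json.dumps(payload),
            capture_output=True,
            text=True,
        )
        if event == "PreToolUse" and proc.returncode == BLOCK_EXIT_CODE:
            return False, proc.stderr.strip()
    return True, ""


# Hypothetical registry: one inline guard standing in for guard.sh.
GUARD = [
    sys.executable, "-c",
    "import json, sys\n"
    "path = json.load(sys.stdin)['tool_input'].get('file_path', '')\n"
    "if 'infra/prod/' in path:\n"
    "    sys.stderr.write('Blocked: ' + path + ' is protected.')\n"
    "    sys.exit(2)\n",
]
REGISTRY = {"PreToolUse": [GUARD]}
```

The model never appears in `fire_hooks`. That absence is the whole point: nothing in this path can be talked out of running.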

HOOKS ENCODE POLICY, PROMPTS ENCODE INTENT

If a rule is a preference — "prefer functional components" — it belongs in a prompt or a skill. If a rule is a policy that must hold every single time — "never write to prod config," "always format on save" — it belongs in a hook. The test is whether you'd accept the model getting it right 95% of the time. If not, it's a hook.

MCP: the glue, and its sharp edge

An agent is only as useful as what it can reach. Before MCP, every "connect the agent to system X" was a bespoke integration — N agents times M tools, a combinatorial mess. The Model Context Protocol is an open client-server protocol, built on JSON-RPC 2.0, that standardizes capability discovery, context exchange, and action execution so any compliant agent can talk to any compliant tool (Anthropic, spec, GitHub). It was open-sourced in November 2024, and support across agent tools followed over the year after.

The architecture is three roles. Hosts are the agent applications. Clients live inside a host and hold one connection each. Servers expose capabilities. A server offers three kinds of thing: Resources (read-only context, like a file or a record), Prompts (reusable templates), and Tools (actions the agent can invoke). Learn those three feature types and you can read any MCP server's surface at a glance.
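
On the wire, reading a server's surface is a pair of JSON-RPC 2.0 requests. A sketch that builds them by hand: `tools/list` and `tools/call` are method names from the MCP spec, while the `search_issues` tool and its arguments are hypothetical. The initialization handshake and the transport are left out.

```python
import itertools
import json
from typing import Optional

_ids = itertools.count(1)  # JSON-RPC requests carry an id the response echoes back


def request(method: str, params: Optional[dict] = None) -> str:
    """Serialize one JSON-RPC 2.0 request."""
    message = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        message["params"] = params
    return json.dumps(message)


# Discover what the server offers, then invoke one tool by name.
list_tools = request("tools/list")
call_tool = request(
    "tools/call",
    {"name": "search_issues", "arguments": {"query": "auth regression"}},  # hypothetical tool
)
```

Resources and Prompts follow the same shape with their own methods (`resources/list`, `resources/read`, `prompts/list`, `prompts/get`). In practice a client library handles this for you; the point is that there's no magic underneath.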

FIG_03: MCP HOST-CLIENT-SERVER, AND THE TRUST BOUNDARY

Now the sharp edge, because this is where MCP bites people. An MCP server is executable code from a third party, and you should treat it like one. The most prevalent client-side vulnerability is tool poisoning — malicious instructions embedded in a tool's metadata (its name, description, or schema) that the agent reads as trusted text and acts on (arXiv 2603.22489). The agent doesn't reliably distinguish "this is a tool description" from "this is an instruction." A poisoned description can quietly redirect what the agent does with the tools it already has.

HARDEN MCP LIKE A DEPENDENCY, BECAUSE IT IS ONE

Install servers only from sources you'd trust to run code on your machine — ideally signed. Pin versions; don't float to latest. Run SAST and SCA over server code the way you would any dependency, and watch tool descriptions for injected instructions, not just the code paths (Red Hat). And check your host's execution primitives: whether it runs servers in an isolated container or just spawns local processes with your full shell privileges is the difference between a sandbox and a foothold. "It's just an MCP server" is the same category error as "it's just an npm package."
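
Watching tool descriptions can start as a crude scan over the metadata a server returns. A heuristic sketch only: the `SUSPICIOUS` patterns are illustrative guesses at what injected instructions tend to look like, `flag_tool` is a hypothetical helper, and a clean result proves nothing. It's a tripwire for the obvious cases, not a substitute for reading the metadata.

```python
import re

# Illustrative patterns only; tune them to your own threat model.
SUSPICIOUS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I),
    re.compile(r"do not (tell|mention|reveal)", re.I),
    re.compile(r"<\s*(important|system|secret)\s*>", re.I),
    re.compile(r"~/\.ssh|\.env\b|id_rsa", re.I),
]


def flatten_strings(value) -> list[str]:
    """Collect every string in a nested tool definition: names, descriptions, schema."""
    if isinstance(value, str):
        return [value]
    if isinstance(value, dict):
        return [s for k, v in value.items() for s in [str(k), *flatten_strings(v)]]
    if isinstance(value, list):
        return [s for item in value for s in flatten_strings(item)]
    return []


def flag_tool(tool: dict) -> list[str]:
    """Return the metadata strings that match any suspicious pattern."""
    return [
        text for text in flatten_strings(tool)
        if any(pattern.search(text) for pattern in SUSPICIOUS)
    ]
```

Run it over every tool a server advertises, at install time and again after each version bump, since a description can change without the code paths you reviewed changing at all.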

When to reach for each

The blocks aren't a stack you adopt all at once. They're answers to problems you'll feel in order. A rough decision sequence:

1. Start with one agent and good tools. Most tasks never need more. If a single well-briefed agent with the right tools does the job, stop: you're done.
2. Reach for a skill the moment you re-explain a workflow for the third time. It's the cheapest block and the one with the best return.
3. Reach for a hook when a step must be guaranteed, not requested: a guardrail on writes, a format, a test gate. Anything you refuse to leave to a 95%-reliable model.
4. Reach for MCP when the agent needs to reach a system you don't want to hand-integrate, and budget time to vet the server like a dependency (tool poisoning is the common bite).
5. Reach for subagents last, and only when the work genuinely parallelizes or when one context window can't hold the job without drifting. Expect to pay in tokens and coordination for what you get back in isolation and speed.

What holds together

The thing my failed swarm taught me is the same thing the trajectory study found at scale: the leverage isn't a bigger model or more agents. It's the structure around the loop. A tight observe-plan-act cycle, a clean context per subagent, skills for what repeats, hooks for what must never be skipped, and MCP for reach — vetted like the dependency it is. Each block earns its place by removing a specific, nameable failure, and the discipline is refusing to add one until you can name the failure it prevents.

What doesn't hold together yet: cross-tool parity. The agent loop is a settled idea, but skills, hooks, and the finer points of orchestration still differ enough between products that "agentic AI" describes a family of dialects, not one language. Build around the concepts — isolated context, packaged workflows, deterministic gates, standardized reach — and you'll carry your habits across whichever tool wins. Build around one vendor's folder names, and you'll port it twice.