Developing an AI agent comes down to a number of intertwined yet distinct challenges:
- Create the runtime infrastructure for the agent itself
- Add tools for the task
- Create a knowledge base the agent can use to inform its execution steps
- Plan tasks, steering the agent to tackle the specific problem you’re aiming to solve
- Set up an evaluation infrastructure to track progress and catch agent mistakes
- Integrate the agent into the larger application
Whether the AI agent is fully autonomous or works with human-in-the-loop oversight, development includes roughly the steps outlined above. Each step calls for its own set of techniques and tools.
Examples: Windsurf, Cursor and Claude Code
Let’s start with some examples the reader may be familiar with:
- AI editors like Windsurf, Cursor, and GitHub Copilot
- Command-line tools like Claude Code
The latter, Claude Code, also has a VSCode extension by the same name.
In each case, the underlying AI agents are integrated with the text editor (our Step 6).
While Windsurf, Cursor, and Claude Code each have their own agent infrastructure (our Step 1), Claude Code is different: it is also available as a standalone command-line tool, which can be run in interactive or non-interactive mode.
Anthropic has also developed the Claude Agent SDK, with TypeScript, Python, and shell-command flavors. The Claude Agent SDK can be used both for coding tasks and for more general tasks - for example, document processing, support-message handling, or marketing research and outreach.
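To make this concrete, here is a minimal sketch of driving the SDK’s Python flavor for one of those general tasks. It assumes the package exposes an async query() entry point and a ClaudeAgentOptions object, as in the SDK’s published examples - verify the exact names and option fields against the current documentation.

```python
# Minimal sketch: using the Claude Agent SDK (Python flavor) for a non-coding task.
# Assumes the `claude-agent-sdk` package exposes query() and ClaudeAgentOptions;
# check the current SDK docs for the exact names and fields.
import asyncio

from claude_agent_sdk import ClaudeAgentOptions, query


async def summarize_ticket(ticket_text: str) -> None:
    options = ClaudeAgentOptions(
        system_prompt="You are a support triage assistant.",  # assumed field name
        max_turns=1,
    )
    async for message in query(
        prompt=f"Classify and summarize this support ticket:\n\n{ticket_text}",
        options=options,
    ):
        print(message)  # stream of assistant / tool-use / result messages


asyncio.run(summarize_ticket("My lab form fails to save sample IDs longer than 12 characters."))
```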
A comparable agent SDK is not available for Windsurf or Cursor.
The knowledge base (Step 3) for these agents is embedded in the language models themselves. These coding agents can code in Python, TypeScript, C, C++, or Java, because the underlying language models are trained on these programming languages.
While these agents do not technically require an external knowledge base for pure coding skills, they use a web search tool to look up product and API documentation. The web search tool comes in handy when implementing coding tasks against specific libraries and APIs.
Windsurf, Cursor, and GitHub Copilot use language models from multiple vendors, in addition to models developed in-house. Claude Code, on the other hand, only uses Anthropic language models.
Custom AI agents, though, don’t have all the necessary knowledge baked into their LLM - and often need an external knowledge base adapted to their task (Step 3).
For example, one of our customer companies develops a lab information management system that uses a scripting language to define web forms. The forms are used to manage lab processes in biotech, food manufacturing, or forensics. The AI agent we developed employs an external knowledge base to provide full documentation and examples for the scripting language. Generalist LLMs, while trained on Python, TypeScript, and other programming languages, have no baked-in knowledge of this custom scripting language for web forms.
Custom AI agents need to implement their own external knowledge base and problem-solving tools. Both are integrated into the agent through an MCP Server.
MCP support is already available in all of these AI editors. The editors can act as MCP clients and can, for example, read external GitHub repositories through a GitHub MCP Server. Any MCP Server is supported.
An MCP Server can thus be used to make your AI agent act on your specific tasks, and to give it access to your external knowledge base.
But having the tools and the knowledge base is not sufficient. The agent must also be steered through task planning (our Step 4).
The approach each AI editor takes to steer its coding tasks is proprietary, and each differs slightly. Much of the steering is done through proprietary system prompts. Some have reverse-engineered these system prompts by intercepting API calls - but, for the most part, they remain hidden.
Each AI editor, however, supports extending the system prompt through a configuration file: CLAUDE.md for Claude Code, Cursor Rules for Cursor, Windsurf Rules for Windsurf. It is these configuration files - CLAUDE.md in particular - that will be used to steer our custom AI agent.
Windsurf, Cursor, and GitHub Copilot also use a tab-completion model - a smaller, faster LLM acting directly in the editor buffer. Claude Code does not have that feature.
Evaluation for these AI editors is also proprietary (Step 5). Custom AI Agents would need their own evaluation infrastructure developed from scratch.
Here is a Comparison Table for the AI editors we described:
| Step | Windsurf, Cursor, GitHub Copilot | Claude Code | Custom Agent |
|---|---|---|---|
| 1. Runtime Infrastructure | Proprietary, integrated with the text editor | Proprietary, available as a command-line tool + VSCode extension | Claude Code + Claude Agent SDK (TypeScript/Python/Shell) |
| 3. Knowledge Base | Embedded in LLM training + web search for API docs. Extensible through MCP. | Embedded in LLM training + web search for API docs. Extensible through MCP. | External vector DB (Pinecone/Weaviate/Qdrant) + custom MCP server tools |
| 4. Task Planning | Proprietary system prompt. Windsurf Rules file for Windsurf, Cursor Rules file for Cursor. | Proprietary system prompt. CLAUDE.md for custom planning. | Custom via CLAUDE.md + editable system prompt (Claude Agent SDK) |
| 5. Evaluation Infrastructure | Proprietary, internal evaluation systems | Proprietary, internal evaluation systems | Custom evaluation scripts + LLM Judge + web visualization dashboard |
| 6. Integration | Native text editor integration | Command-line + VSCode extension | Custom VSCode extension |
Separation of concerns: Infrastructure vs. Task-Specific Planning
When it comes to building the agent, we separate out the Infrastructure design from the Task Planning, for a couple of reasons:
- Often, the task solved by the agent needs to be flexible and can change over the product lifecycle. This calls for the infrastructure to be separated from the task customization. Sometimes, the task or its subtasks are not even fully known or understood at the outset of the AI agent project.
- The infrastructure is now available ready-made, and may not have to be built from scratch.
Adopting a ready-made, battle-tested AI agent infrastructure that is steerable for your task will accelerate your AI agent design.
The AI Agent Infrastructure
As of this writing, our go-to stack for agent creation is the Claude Code shell command, complemented by the Claude Agent SDK. Its advantages:
- A suite of powerful features, like CLAUDE.md customization, subcommands for modular execution, and subagents for hierarchical delegation
- MCP server support, which lets you seamlessly extend the agent and steer it to solve the task at hand (see the sketch after this list)
- Built-in prompt caching, and discounted packaged pricing when using a Claude subscription
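Putting those pieces together, the sketch below shows how a custom agent run might be configured: pointing the agent at the project directory whose CLAUDE.md steers it, registering a custom MCP server, and nudging the system prompt. It continues the hedged assumptions from the earlier SDK sketch - option fields such as cwd, system_prompt, mcp_servers, and allowed_tools, and the mcp__server__tool naming convention, should all be checked against the current SDK documentation.

```python
# Sketch only: wiring CLAUDE.md-driven steering and a custom MCP server into an
# agent run via the Claude Agent SDK. Option field names (cwd, system_prompt,
# mcp_servers, allowed_tools) and the tool naming convention are assumptions.
import asyncio

from claude_agent_sdk import ClaudeAgentOptions, query


async def run_agent(task: str) -> None:
    options = ClaudeAgentOptions(
        cwd="/path/to/project",  # directory containing the CLAUDE.md that steers the agent
        system_prompt="Prefer the knowledge base over guessing scripting-language syntax.",
        mcp_servers={
            # Launch our custom MCP server (see the FastMCP sketch below) over stdio.
            "kb": {"command": "python", "args": ["kb_mcp_server.py"]},
        },
        allowed_tools=["mcp__kb__vector_db_lookup", "mcp__kb__validate_form_script"],
        max_turns=20,
    )
    async for message in query(prompt=task, options=options):
        print(message)


asyncio.run(run_agent("Create a sample-intake web form with barcode validation."))
```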
No agent is complete without a solid knowledge base, and without good evaluation infrastructure.
The Claude Agent SDK doesn’t ship with either out of the box. You’ll need to craft them yourself.
For the vector database powering the knowledge base, we’ve used Pinecone, prized for its straightforward SaaS model that gets you up and running quickly. Pinecone’s documentation is also very accessible.
That said, there are multiple Vector DB alternatives: Weaviate, Qdrant, or ChromaDB. Even traditional players like Postgres or MongoDB have jumped on the vector bandwagon, now offering built-in support for vector tables or collections.
In our architecture, we centralize all custom interfaces for the agent within the MCP server. This ensures portability and testability across setups. The MCP server includes a vector_db_lookup() interface to let the agent look up the knowledge base.
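Here is a minimal sketch of such an MCP server, using the FastMCP helper from the official MCP Python SDK. The vector_db_lookup tool matches the interface described above; validate_form_script is a hypothetical domain tool, and search_knowledge_base() is a stub standing in for the real retrieval logic.

```python
# kb_mcp_server.py - minimal sketch of the custom MCP server (stdio transport).
# Uses the FastMCP helper from the official MCP Python SDK. The knowledge base
# lookup and the validation logic are placeholders for your real implementations.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("kb")


@mcp.tool()
def vector_db_lookup(query: str, top_k: int = 5) -> list[str]:
    """Return the knowledge base chunks most similar to the query."""
    # Placeholder: call your vector DB here (see the indexing/lookup sketch below).
    return search_knowledge_base(query, top_k)


@mcp.tool()
def validate_form_script(script: str) -> str:
    """Validate a web-form script and report errors (hypothetical domain tool)."""
    # Placeholder: run your real validator and unit tests here.
    return "OK" if script.strip() else "ERROR: empty script"


def search_knowledge_base(query: str, top_k: int) -> list[str]:
    # Stub so the sketch runs; replace with a real vector DB query.
    return [f"(stub) knowledge base not wired up yet for: {query!r}"]


if __name__ == "__main__":
    mcp.run()  # stdio by default, which is how the agent launches the server
```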
Indexing Your Knowledge Base
To build the knowledge base, you need:
- A collection of knowledge base files - we usually convert them all to Markdown, and store them in an S3 bucket, or in a MongoDB collection
- A vector DB index in your DB of choice
- An indexing script that runs periodically and uploads the knowledge base Markdown files into the vector DB index, mapped through an LLM embedding
The lookup operation is what you’d expect: the searched text is mapped through the same LLM embedding and compared for similarity against the existing chunks in the vector DB index, returning the most similar chunks from the knowledge base.
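Below is a condensed sketch of the indexing and lookup flow, assuming Pinecone as the vector DB and an OpenAI embedding model; the same shape applies to Weaviate, Qdrant, or ChromaDB with their respective clients. The index name, one-vector-per-file chunking, and metadata layout are illustrative choices, not prescriptions.

```python
# Sketch: index Markdown knowledge base files into a vector DB, then look them up.
# Assumes the Pinecone and OpenAI Python clients; adapt to your vector DB of choice.
import os
from pathlib import Path

from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()
index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index("kb-index")


def embed(text: str) -> list[float]:
    resp = openai_client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding


def index_knowledge_base(kb_dir: str) -> None:
    """Periodic indexing job: one vector per Markdown file (chunk further as needed)."""
    vectors = []
    for path in Path(kb_dir).glob("*.md"):
        text = path.read_text()
        vectors.append({
            "id": path.stem,
            "values": embed(text),
            "metadata": {"source": path.name, "text": text[:2000]},
        })
    index.upsert(vectors=vectors)


def vector_db_lookup(query: str, top_k: int = 5) -> list[str]:
    """Lookup: embed the query with the same model and return the closest chunks."""
    result = index.query(vector=embed(query), top_k=top_k, include_metadata=True)
    # Adjust result parsing to your client version.
    return [match["metadata"]["text"] for match in result["matches"]]
```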
Tuning AI Agent Actions via the MCP Server
With the knowledge base available and searchable, we now let the AI agent use it through MCP Server tools.
These tools can:
- Look up knowledge base articles for similarity
- Perform file reads and edits that are specialized to your file format
- Validate file edits, and run unit tests
The tools take parameters and are self-describing. The LLM can figure out, to a good extent, which tools to call and which parameters to pass - however, that alone is not sufficient for good agent design, because:
- The same operation may be solved by different tools.
- Or, the tools need to be called in a certain order.
This is why the next step is crucial: Task Planning, which steers the agent in how it uses the MCP Server tools.
Task Planning
Task planning is designed partly through editing CLAUDE.md, and partly through changing the system prompt.
- With the command-line Claude Code tool, its internal system prompt is not editable.
- But with Claude Agent SDK, it is.
In CLAUDE.md, we specify how to approach the various custom problems the agent might encounter. Claude Code has built-in support for the TodoWrite tool, so for each type of problem the agent might face, CLAUDE.md spells out the ToDo tasks it should create.
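For illustration, here is a hedged excerpt of what such a CLAUDE.md might look like for the web-form agent described earlier. The problem categories and tool names are examples, not a prescribed format.

```markdown
# CLAUDE.md (excerpt)

## Creating a new web form
Add these items to your ToDo list, in order:
1. Look up the relevant scripting-language docs with `vector_db_lookup`.
2. Draft the form script, following the retrieved examples.
3. Run `validate_form_script` and fix any reported errors.
4. Summarize the changes for the reviewer.

## Modifying an existing form
1. Read the current form script before editing.
2. Look up docs only for the fields you are changing.
3. Re-validate after every edit.
```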
The Claude Code Event Loop
The Claude Code agent itself has a baked-in event loop: it takes user input, file attachments, and screenshots; puts CLAUDE.md in context; determines its ToDo tasks; and iterates through MCP Server tools and other built-in tools to solve the task.
The user can interrupt the flow at any moment, and continue it with altered instructions.
From there, the agent marches forward, step by step. It can go solo or pause for human nods, calling on more MCP and built-in tools as needed to create new files, edit existing ones, or query the knowledge base.
Tuning the Agent through Evaluation
Once the infrastructure is stood up, the knowledge base is available, the MCP server is working, and the agent begins to be integrated into the larger application, you need to tune the agent and adapt it to the task at hand.
Tuning can mean:
- Adding to, or updating, the knowledge base
- Adding or updating the MCP server tools
- Improving the CLAUDE.md design, to steer the running agent through its ToDos
Evaluation is an essential step here.
Kick things off by:
- Assembling a ground-truth dataset - a curated set of benchmark problems mirroring real-world scenarios.
- Creating an evaluation script that, for each item, spins up a fresh Claude Code instance to tackle the problem in fully automated mode, with no user intervention (sketched after this list).
- Having a second, forked Claude Code instance run once the first completes a dataset item, to evaluate and judge the output on a number of metrics as an LLM judge.
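A condensed sketch of that evaluation loop is below, driving the Claude Code CLI non-interactively from Python. It assumes the CLI’s -p/--print flag for non-interactive runs; the dataset fields, judge prompt, and output parsing are illustrative and would need hardening for real use.

```python
# Sketch: run each ground-truth item through a fresh, non-interactive Claude Code
# instance, then have a second instance act as the LLM judge. Assumes the CLI's
# -p/--print flag; the dataset format and judge prompt are illustrative.
import json
import subprocess


def run_claude(prompt: str, cwd: str) -> str:
    proc = subprocess.run(
        ["claude", "-p", prompt],
        cwd=cwd, capture_output=True, text=True, check=True,
    )
    return proc.stdout


def evaluate(dataset_path: str, project_dir: str) -> list[dict]:
    with open(dataset_path) as f:
        dataset = json.load(f)
    results = []
    for item in dataset:
        # 1. Fresh agent run, fully automated, no user intervention.
        answer = run_claude(item["question"], cwd=project_dir)
        # 2. A second, forked instance acts as the LLM judge.
        judge_prompt = (
            "You are an evaluation judge. Score the answer against the expected "
            "result for accuracy and completeness (0 to 1 each). Reply with JSON "
            "containing the keys 'accuracy' and 'completeness'.\n\n"
            f"Question: {item['question']}\n"
            f"Expected: {item['expected']}\n"
            f"Answer: {answer}"
        )
        # In practice, parse the judge output more defensively.
        scores = json.loads(run_claude(judge_prompt, cwd=project_dir))
        results.append({"id": item["id"], "answer": answer, **scores})
    return results
```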
The metrics tracked by the LLM Judge are:
- Accuracy: How spot-on are the results?
- Completeness: Does it cover all requirements without omissions?
- For MCP interfaces leaning on vector DB retrieval: Accuracy and completeness of the chunks pulled back.
- Completion rate of the ToDos
- Drilling deeper, for select MCP tools: Per-tool breakdowns of response accuracy and completeness.
- Number of failed validations during the actual run
- Validation tests run by the LLM judge at eval time, using the same MCP validation tools you added
- Other statistics: number of steps to solve the question, LLM models used, input and output tokens, cached tokens, and cost
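To keep the judge’s output machine-readable, it helps to pin these metrics to a fixed record shape. The dataclass below is one possible shape for a per-item result, not a standard schema.

```python
# One possible record shape for per-item eval results (illustrative, not a standard).
from dataclasses import dataclass, field


@dataclass
class EvalRecord:
    item_id: str
    accuracy: float                  # 0-1, scored by the LLM judge
    completeness: float              # 0-1, scored by the LLM judge
    retrieval_accuracy: float        # quality of the vector DB chunks pulled back
    retrieval_completeness: float
    todo_completion_rate: float      # share of planned ToDos actually completed
    failed_validations_runtime: int  # validation failures during the agent run
    failed_validations_judge: int    # validation failures re-run at eval time
    num_steps: int
    models_used: list[str] = field(default_factory=list)
    tokens_in: int = 0
    tokens_out: int = 0
    tokens_cached: int = 0
    cost_usd: float = 0.0
```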
Bringing Eval to Life: Visualization
Raw eval logs are hard to monitor without an at-a-glance view.
In our approach, we build a custom web visualizer of eval results, with:
- A quick-scan table, one row per question, for instant overviews.
- A radar chart aggregating metrics into a visual story of strengths and gaps.
- Distributions to reveal patterns across the board.
- Per-question deep dives on tool usage: at-a-glance summaries of which tools fired, what inputs they took, what outputs they returned, and any ToDo progress where relevant.
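Even before a full web dashboard exists, a static radar chart already makes the aggregate picture readable. Here is a minimal matplotlib sketch - different from the web visualizer described above, and using made-up placeholder scores purely for illustration.

```python
# Sketch: radar chart of aggregate eval metrics (placeholder scores, matplotlib only).
import numpy as np
import matplotlib.pyplot as plt

# Average judge scores across an eval run (illustrative placeholder numbers).
metrics = {
    "Accuracy": 0.82,
    "Completeness": 0.74,
    "Retrieval accuracy": 0.68,
    "Retrieval completeness": 0.71,
    "ToDo completion": 0.90,
    "Validation pass rate": 0.65,
}

labels = list(metrics)
values = list(metrics.values())
angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False).tolist()
values += values[:1]   # close the polygon by repeating the first point
angles += angles[:1]

fig, ax = plt.subplots(subplot_kw={"polar": True})
ax.plot(angles, values, linewidth=2)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(labels, fontsize=8)
ax.set_ylim(0, 1)
ax.set_title("Aggregate eval metrics")
plt.savefig("eval_radar.png", dpi=150)
```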
Iteration and Improvement
Here’s how the development loop flows in practice:
- Bootstrap your ground-truth dataset with questions to get baseline momentum.
- Run the evaluation suite and scrutinize the metrics.
- For each metric shortfall, triage the root cause:
- Does the knowledge base require new or updated data?
- Is the problem caused by Vector DB chunking, indexing, or retrieval?
- Is an MCP tool not working properly? Or is a new MCP tool needed?
- Or perhaps CLAUDE.md itself needs a refresh to better guide the agent?
As problems are discovered and fixed, the ground-truth dataset is expanded. Re-running previously working dataset items ensures the new fixes don’t break old functionality.