Sam Breed

Product Developer, Investor


(wip)

In 2021, when GPT-3 was the only game in town, context windows were small, RLHF hadn’t yet refined responses, prompting was difficult, and returning structured data was rough. Tool calling was possible if you were clever enough to coax the model into returning XML.

In 2025 those constraints have evaporated, revealing a new set of challenges for developers: managing long-running tasks, pruning and compacting much larger context windows, and keeping pace with ever-increasing model capabilities all make building AI products non-obvious.

These are the patterns for building AI applications you should know.

This is not a piece about the user experience of AI products. There are plenty of good posts about that on LukeW’s blog. Rather, this is about the right patterns for shipping better AI apps.

I’ve split this into 3 sections:

  1. Agent patterns discusses the client/server architectures that emerge in mature AI products.
  2. Model strategies covers picking the right model for the right job and how to optimize for cost and control.
  3. Power tools is all about how good tool design can solve upstream context management problems before they begin.


Agent patterns

From my vantage point, we got AGI in June 2024 when Claude 3.5 Sonnet with tool calling was released. OpenAI had been the first lab to popularize tool use, but their 2023-2025 models were not on par with Claude for tool calls. So why are tool calls so important?

Boundaries. Tool calls are an important boundary between systems. The control model is in the driver’s seat. It knows about the user and the system, has a larger system prompt that includes important context about the user, and it has the ability to discover and call tools. Tools open the door to having sub-systems that generate and manage context with different goals than the main thread.

Flexibility. Building for AI integrations means balancing trade-offs between provider-specific features and the desire to have swappable parts. Any good system will leave ample room for swapping models at every layer; a model switch might someday unlock new capabilities and should not require a rewrite.

Background everything

When a user submits a prompt it should always finish. Even if the client disconnects because of a weak network connection or a closed tab, a good system will return the same result as if the user had been watching the stream in real time. Furthermore, as models become increasingly capable of handling long-running tasks, background execution has become table stakes for AI agents.

There are two paths: websockets or event streams.

If you choose websockets, it should look something like this:
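
A minimal sketch of the shape, assuming a Node ws server, Redis for buffering chunks, and the AI SDK’s streamText; the channel scheme, runChat, and the model id are illustrative rather than any particular implementation:

import { WebSocketServer, WebSocket } from 'ws';
import Redis from 'ioredis';
import { streamText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';

const wss = new WebSocketServer({ port: 8080 });
const redis = new Redis();

// Live sockets per chat, so reconnecting clients can re-attach mid-generation.
const channels = new Map<string, Set<WebSocket>>();

// Generation is detached from any single socket; it always runs to completion.
async function runChat(chatId: string, prompt: string) {
  const result = streamText({
    model: anthropic('claude-sonnet-4-20250514'), // illustrative model id
    prompt,
  });

  for await (const delta of result.textStream) {
    // Persist every chunk so a client that reconnects can catch up...
    await redis.rpush(`chat:${chatId}:chunks`, delta);
    // ...and fan out to whoever is connected right now.
    for (const ws of channels.get(chatId) ?? []) {
      ws.send(JSON.stringify({ type: 'delta', delta }));
    }
  }
  await redis.rpush(`chat:${chatId}:chunks`, '__DONE__');
}

wss.on('connection', (ws) => {
  ws.on('message', async (raw) => {
    const msg = JSON.parse(raw.toString());

    if (msg.type === 'subscribe') {
      // Replay anything generated while the client was away, then attach for live deltas.
      const buffered = await redis.lrange(`chat:${msg.chatId}:chunks`, 0, -1);
      for (const delta of buffered) ws.send(JSON.stringify({ type: 'delta', delta }));
      let sockets = channels.get(msg.chatId);
      if (!sockets) channels.set(msg.chatId, (sockets = new Set()));
      sockets.add(ws);
    }

    if (msg.type === 'prompt') {
      void runChat(msg.chatId, msg.prompt); // fire and forget; survives disconnects
    }
  });
});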

Using event streams, you need to employ a similar trick. The AI SDK team released their own clever solution a few weeks after our team had shipped the websockets solution described above.

Other considerations:

Stream everything

The AI SDK provides an excellent set of patterns to follow for delivering LLM responses as an HTTP event stream.

However, there are several areas where we can improve on the AI SDK defaults:

LLM -> structured data -> create view model
LLM -> view update -> JSON patch -> view model
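
For the second flow, for example, you can have the model emit RFC 6902 operations and apply them to the existing view model instead of regenerating the whole object each turn. A rough sketch, assuming fast-json-patch and the AI SDK’s generateObject; the schema and model id are illustrative:

import { z } from 'zod';
import { generateObject } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';
import { applyPatch, type Operation } from 'fast-json-patch';

// Constrain what the model may emit: a small set of RFC 6902 operations.
const patchSchema = z.object({
  operations: z.array(
    z.object({
      op: z.enum(['add', 'replace', 'remove']),
      path: z.string(), // JSON Pointer into the view model
      value: z.unknown().optional(),
    })
  ),
});

// First turn: LLM -> structured data -> create view model (not shown).
// Later turns: ask for a patch against the current view model instead of a full rewrite.
async function updateViewModel(viewModel: object, instruction: string) {
  const { object } = await generateObject({
    model: anthropic('claude-sonnet-4-20250514'), // illustrative model id
    schema: patchSchema,
    prompt: [
      `Current view model: ${JSON.stringify(viewModel)}`,
      `Describe this change as a JSON Patch: ${instruction}`,
    ].join('\n'),
  });

  // Apply locally; the model never re-emits the whole document.
  return applyPatch(viewModel, object.operations as Operation[], true, false).newDocument;
}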

Files and RAG

RAG only works well if you have good retrieval. A good retrieval system starts with a file processing pipeline.

Different files need different strategies:

It’s a lot!

Once you have all (or part) of that done, you need to implement the retriever itself. This will look like a combination of vector (cosine similarity) and keyword (BM25) search paired with a re-ranker to prune sources by relevance. Then you need to augment the user’s prompt with relevant chunks and some instructions on how to cite them.
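
Here’s a sketch of that retriever, with the embedder, BM25 index, and re-ranker left as assumed helpers; a production version would use an ANN index instead of scanning every chunk:

interface Chunk { id: string; text: string; embedding: number[] }

// Assumed helpers: an embedder, a BM25 index, and a cross-encoder re-ranker.
declare function embed(text: string): Promise<number[]>;
declare function bm25Search(query: string, limit: number): Promise<Array<{ id: string; score: number }>>;
declare function rerank(query: string, chunks: Chunk[]): Promise<Chunk[]>;
declare const allChunks: Chunk[];

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] ** 2; nb += b[i] ** 2; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function retrieve(query: string, limit = 8): Promise<Chunk[]> {
  const queryEmbedding = await embed(query);

  // Vector leg: cosine similarity over the corpus.
  const vectorHits = allChunks
    .map((c) => ({ id: c.id, score: cosine(queryEmbedding, c.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit * 4);

  // Keyword leg: BM25.
  const keywordHits = await bm25Search(query, limit * 4);

  // Reciprocal rank fusion so the two score scales don't fight each other.
  const fused = new Map<string, number>();
  for (const hits of [vectorHits, keywordHits]) {
    hits.forEach((hit, rank) => fused.set(hit.id, (fused.get(hit.id) ?? 0) + 1 / (60 + rank)));
  }

  const candidates = [...fused.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit * 2)
    .map(([id]) => allChunks.find((c) => c.id === id)!);

  // The re-ranker prunes to the chunks that actually answer the query.
  return (await rerank(query, candidates)).slice(0, limit);
}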

Citations are an area of great interest, because they often make or break the credibility of a response. The ability to provide a corpus of sources up front and receive accurate, well-formed citations is an area to watch as model providers build out native support for it.
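
In the meantime, the baseline is prompt plumbing: number the chunks, pass them ahead of the question, and tell the model how to cite. A minimal sketch, with an illustrative chunk shape:

// Hypothetical shape for retrieved chunks.
interface SourceChunk { id: string; title: string; url: string; text: string }

function buildPromptWithCitations(userPrompt: string, chunks: SourceChunk[]): string {
  const sources = chunks
    .map((c, i) => `[${i + 1}] ${c.title} (${c.url})\n${c.text}`)
    .join('\n\n');

  return [
    'Answer using only the sources below.',
    'Cite sources inline with bracketed numbers, e.g. [2], and do not invent citations.',
    '',
    sources,
    '',
    `Question: ${userPrompt}`,
  ].join('\n');
}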

Model strategies

It all starts with picking the right model for the job. Every model has a strike zone, and each one’s is different.

Usually it comes down to picking where you want to be on this axis:

Fast model <--------------> Smart model

Or you can just pick Claude.

Multi-Provider fallbacks

This is a must. Providers can go down with outages, and you’ll be rudely reminded of this the first time your sole provider has an incident that tanks your entire app.

The easiest way out of this is to use OpenRouter or something like it.

If you decide to implement your own fallback logic, you’ll need to consider:

Provider-Specific Error Handling:
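
Whatever that list ends up covering, the core loop is roughly the following. This sketch assumes the AI SDK’s generateText and the official provider packages; the model ids and retry policy are illustrative:

import { generateText, APICallError } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';
import { openai } from '@ai-sdk/openai';

// Ordered by preference; each entry is a drop-in language model.
const candidates = [
  anthropic('claude-sonnet-4-20250514'), // illustrative model ids
  openai('gpt-4o'),
];

async function generateWithFallback(prompt: string) {
  let lastError: unknown;

  for (const model of candidates) {
    try {
      return await generateText({ model, prompt });
    } catch (error) {
      lastError = error;

      // Provider outages and rate limits are worth failing over;
      // bad requests (e.g. context too long) will fail everywhere, so surface them.
      if (APICallError.isInstance(error) && error.isRetryable) {
        continue;
      }
      throw error;
    }
  }

  throw lastError;
}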

Error handling

Building products around AI models requires sophisticated error handling, with plenty of room for fallbacks, retries, and failovers. Different types of failures need their own unique recovery strategies to enable graceful degradation and intelligent recovery.

Importantly, pushing a decision back to the user is a strategy we can build into these types of apps.

AuthRequiredError
isAISDKErrorToRetry
isContextLengthError
isImageDimensionError
isShouldStopToolError
isUseClientSideToolError
ShouldStopTool
SystemAbortedError
UseClientSideTool
UserAbortedError
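
Here’s a sketch of how those map to recovery strategies; the error classes and predicates above are the app’s own, so they’re declared here as assumptions:

// Illustrative: route each failure class from the list above to its own recovery.
declare class AuthRequiredError extends Error { service: string }
declare class UserAbortedError extends Error {}
declare function isAISDKErrorToRetry(e: unknown): boolean;
declare function isContextLengthError(e: unknown): boolean;

type Recovery =
  | { kind: 'retry' }
  | { kind: 'retry-with-truncation' }
  | { kind: 'ask-user'; message: string }
  | { kind: 'done' };

function recoverFrom(error: unknown): Recovery {
  if (isAISDKErrorToRetry(error)) return { kind: 'retry' };                  // transient provider failure
  if (isContextLengthError(error)) return { kind: 'retry-with-truncation' }; // compact context, then retry
  if (error instanceof AuthRequiredError) {
    // Push the decision back to the user instead of guessing.
    return { kind: 'ask-user', message: `Reconnect ${error.service} to continue.` };
  }
  if (error instanceof UserAbortedError) return { kind: 'done' };            // stop button, not a failure
  throw error;                                                               // unknown: fail loudly
}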

Tool Error Recovery:

// Convert tool execution errors to tool results for model visibility
if (ToolExecutionError.isInstance(part.error)) {
    logger.debug('Converting tool execution error to tool result', {
        error: part.error,
        toolCallId: part.error.toolCallId,
        toolName: part.error.toolName
    });

    // Create error result that the model can see and respond to
    const errorResult = {
        type: 'tool-result' as const,
        toolCallId: part.error.toolCallId,
        toolName: part.error.toolName,
        result: {
            error: part.error.message || 'Tool execution failed',
            artifactsCreated: [],
            annotations: []
        }
    };

    // Write error result to stream for model processing
    streamWriter.write(messageConverter.transformChunk(errorResult));
    continue; // Continue processing instead of failing
}

Error Context and Logging:

Context management

Even as context windows increase over time, there’s no free lunch. Large contexts slow down inference, balloon costs, and degrade performance.

The system implements sophisticated context length management to ensure conversations stay within AI model token limits while preserving the most important information. This involves intelligent message truncation, token tracking, and optimization strategies.

Intelligent Truncation Strategy:
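
A sketch of the budget-based core of it: keep the system prompt, walk the history backwards, and stop when the budget is spent. The token counter is an assumed helper, and a real version would also keep tool-call/result pairs intact:

import type { CoreMessage } from 'ai';

// Assumed helper: tokenizer-backed counting for your model family.
declare function countTokens(message: CoreMessage): number;

function truncateToBudget(messages: CoreMessage[], budget: number): CoreMessage[] {
  const [system, ...rest] = messages; // assume messages[0] is the system prompt
  let used = countTokens(system);

  // Walk backwards so the most recent turns always survive.
  const kept: CoreMessage[] = [];
  for (let i = rest.length - 1; i >= 0; i--) {
    const cost = countTokens(rest[i]);
    if (used + cost > budget) break;
    used += cost;
    kept.unshift(rest[i]);
  }

  return [system, ...kept];
}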

Prompt caching

Big system prompts lead to slow responses.

Claude models currently have the best support for fine-grained caching, which can cache portions of the conversation context to reduce token processing costs and improve response times for subsequent requests.
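
With the AI SDK, the usual move is to mark the large, stable prefix of the conversation as cacheable via Anthropic’s provider options. A sketch, assuming a recent AI SDK version with per-message providerOptions support; the model id is illustrative:

import { streamText, type CoreMessage } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';

declare const LONG_SYSTEM_PROMPT: string;          // big, stable: instructions, tool docs, reference material
declare const conversationMessages: CoreMessage[]; // small, changing: the actual chat turns

const result = streamText({
  model: anthropic('claude-sonnet-4-20250514'), // illustrative model id
  messages: [
    {
      role: 'system',
      content: LONG_SYSTEM_PROMPT,
      // Cache everything up to and including this message on Anthropic's side.
      providerOptions: {
        anthropic: { cacheControl: { type: 'ephemeral' } },
      },
    },
    ...conversationMessages,
  ],
});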

Provider-Specific Cache Headers:


let headerString = '';
if (isClaude4(modelId)) {
    headerString = 'interleaved-thinking-2025-05-14';

    // Extended cache TTL only available on Anthropic provider
    if (modelDef.provider === 'anthropic') {
        headerString += ',extended-cache-ttl-2025-04-11';
    }
}

Cache Performance Monitoring:

Cache Invalidation Strategy:

Power tools

Tool calls are king. Tools ain’t free; sooner or later you run out of space in the system prompt. Tool definitions can be inserted into chat history at any time, creating an opportunity to select tools via another tool call.

Tools allow you to combine the strengths of different models and offset their weaknesses. Gemini has blazing-fast large context models that are weaker at writing prose or complex code but excellent at searching and summarizing large documents; OpenAI has great image generation and unique voice, but is prone to straying from factual grounding; Anthropic has the best instruction following, tool calling, and coding models. There are benefits to mixing your models and evaluating them in different scenarios.

Tool design

Two things are held in tension: how much a tool does on the model’s behalf and how much context its results consume.

Take a tool to crawl a webpage. The simplest version might look like this:

{
  name: "crawl_webpage",
  description: "fetch the contents of a webpage"
  arguments: {
    url: z.string()
  }
}

When you first use this, it will probably work ok! But it has some problems. For one, webpages can get pretty big, and if the tool result contains the entire response you’re likely to run out of context relatively quickly. You might try extracting the text with something like mozilla/readability, but you’re still subject to the same problem.

There’s a larger problem with this naive pattern: the burden of dealing with the response is placed back on the control model. The more times you call the tool, the more context gets dedicated to attending to previous responses, regardless of whether they’re relevant to the session. A better shape accepts multiple URLs and moves extraction into the tool itself:

{
  name: "browse",
  description: "Web page browsing and content extraction",
  arguments: {
    urls: z.array(z.string())
  }
}
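
Here’s a sketch of how the batched version can keep raw page content out of the main thread, assuming the AI SDK’s tool() helper (with parameters/execute), @mozilla/readability, and a fast Gemini model for in-tool summarization; the objective argument is an illustrative addition:

import { tool, generateText } from 'ai';
import { google } from '@ai-sdk/google';
import { z } from 'zod';
import { Readability } from '@mozilla/readability';
import { JSDOM } from 'jsdom';

// Extraction and summarization happen inside the tool, so only a compact
// digest of each page lands back in the control model's context.
export const browse = tool({
  description: 'Web page browsing and content extraction',
  parameters: z.object({
    urls: z.array(z.string()),
    objective: z.string().describe('What the caller wants to learn from these pages'),
  }),
  execute: async ({ urls, objective }) => {
    const digests = await Promise.all(
      urls.map(async (url) => {
        const html = await fetch(url).then((r) => r.text());
        const article = new Readability(new JSDOM(html, { url }).window.document).parse();

        // Use a fast, large-context model to compress the page before it hits the main thread.
        const { text } = await generateText({
          model: google('gemini-2.0-flash'), // illustrative model id
          prompt: `Summarize the parts of this page relevant to: ${objective}\n\n${article?.textContent ?? html}`,
        });

        return { url, title: article?.title ?? url, summary: text };
      })
    );
    return { digests };
  },
});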

Dynamic loading

The dynamic tool loading system is one of the most sophisticated aspects of the orchestrator, implementing a multi-layered approach to tool management that adapts to user context, permissions, and system state.

Tool Categories & Classification:

  1. Core Tools: Essential tools available to all users (ask-user, complete, create-document)
  2. MCP Server Tools: Tools provided by Model Context Protocol servers
  3. Connected Non-Core Tools: External service tools with active connections
  4. Non-Connected Tools: Tools requiring setup or authentication
  5. Browser Extension Tools: Tools requiring browser extension capabilities

Context-Aware Loading Strategy:

Load-Time vs Runtime Tool Selection: The system implements a two-tier tool selection mechanism (a rough sketch follows the list below):

  1. Load-Time Selection (loadToolsEnabled flag):

    • When enabled, only core tools are initially loaded to the AI model
    • Reduces token usage and improves response time
    • Non-core tools are loaded on-demand when needed
  2. Runtime Tool Override (overrideToolSelection):

    • Allows dynamic tool selection for specific subtasks
    • Enables focused tool sets for specialized operations
    • Supports tool restriction for security or performance reasons
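
A sketch of those two tiers; the flags come from the list above, while the tool maps and types are assumptions:

type ToolSet = Record<string, unknown>; // AI SDK tool objects; shape elided here

declare const coreTools: ToolSet;    // ask-user, complete, create-document, ...
declare const nonCoreTools: ToolSet; // MCP servers, connected services, browser extension, ...

interface ToolSelectionOptions {
  loadToolsEnabled: boolean;          // load-time: start with core tools only
  overrideToolSelection?: string[];   // runtime: pin an explicit tool set for a subtask
}

function selectTools({ loadToolsEnabled, overrideToolSelection }: ToolSelectionOptions): ToolSet {
  const all = { ...coreTools, ...nonCoreTools };

  // Runtime override wins: a subtask gets exactly the tools it asked for.
  if (overrideToolSelection) {
    return Object.fromEntries(
      Object.entries(all).filter(([name]) => overrideToolSelection.includes(name))
    );
  }

  // Load-time selection keeps the prompt small; non-core tools load on demand later.
  return loadToolsEnabled ? coreTools : all;
}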

Permission-Based Tool Filtering:

Tool Metadata Integration: The system uses toolHelpers.metadata to:

Performance Optimizations:

Error Handling in Tool Loading:

This dynamic approach ensures optimal performance while maintaining flexibility and security, allowing the system to adapt to different user contexts and requirements without compromising functionality.

Client-side tools

Tool call repairs

Description: The system implements an advanced tool call repair mechanism that automatically detects and fixes malformed tool arguments using AI-powered schema validation and repair, ensuring robust tool execution even when the primary AI model makes formatting errors.

Justification:

Implementation Details:


experimental_repairToolCall: async ({ toolCall, tools, error, parameterSchema }) => {
    const startTime = Date.now();
    try {
        // Create repair prompt with context
        const promptContent = [
            `Your task is to fix the arguments for a tool call.`,
            `The model tried to call the tool "${toolCall.toolName}" with the following arguments: ${toolCall.args}.`,
            `The tool accepts the following schema: ${JSON.stringify(parameterSchema({ toolName: toolCall.toolName }))}.`,
            `Error: ${error.message}`,
            'Please fix the arguments to match the required schema.'
        ].join('\n');

        // Use AI model to repair the arguments
        const result = await generateObject({
            model,
            schema: tools[toolCall.toolName].parameters,
            prompt: promptContent
        });

        // Return repaired tool call
        return {
            toolCallType: 'function',
            toolCallId: toolCall.toolCallId,
            toolName: toolCall.toolName,
            args: JSON.stringify(result.object)
        };

    } catch (error) {
        logger.error('Failed to repair tool call', error);

        return null; // Repair failed, let original error propagate
    }
}

Repair Strategy:

  1. Error Analysis: Analyzes the specific validation error to understand the issue
  2. Schema Context: Provides the correct schema to the repair AI model
  3. Contextual Repair: Uses the original arguments as context for intelligent repair
  4. Validation: Ensures repaired arguments match the required schema
  5. Fallback: Returns null if repair is impossible, allowing graceful error handling

Error Types Handled:

Repair Quality Assurance:

MCP

While I was initially skeptical about MCP, the level of adoption across the industry over the past 8 months is hard to ignore.

Categories