(wip)
In 2021, when GPT-3 was the only game in town, context windows were small, RLHF hadn’t yet refined responses, prompting was difficult, and returning structured data was rough. Tool calling was possible if you were clever enough to coax the model into returning XML.
In 2025 those constraints have evaporated, revealing a new set of challenges for developers: managing long-running tasks, pruning and compacting ever-larger context windows, and keeping pace with ever-increasing model capabilities all make building AI products non-obvious.
These are the patterns for building AI applications you should know.
This is not a piece about the user experience of AI products. There are plenty of good posts about that on LukeW’s blog. Rather, this is about the right patterns for shipping better AI apps.
I’ve split this into 3 sections:
- Agent patterns discusses the client/server architectures that emerge in mature AI products.
- Model strategies covers picking the right model for the right job and how to optimize for cost and control.
- Power tools is all about how good tool design can solve upstream context management problems before they begin.
Agent patterns
From my vantage point, we got AGI in June 2024 when Claude 3.5 Sonnet with tool calling was released. OpenAI had been the first lab to popularize tool use, but their 2023–2025 models were not on par with Claude for tool calls. But why are tool calls so important?
Boundaries. Tool calls are an important boundary between systems. The control model is in the driver’s seat: it knows about the user and the system, has a larger system prompt that includes important context about the user, and has the ability to discover and call tools. Tools open the door to sub-systems that generate and manage context with different goals than the main thread.
Flexibility. Building for AI integrations means balancing trade-offs between provider-specific features and the desire to have swappable parts. Any good system will leave ample room for switching between models at all layers; switching between models might someday unlock new capabilities and should not require a re-write.
Background everything
When a user submits a prompt, the resulting run should always finish. Even if the client disconnects because of a weak network connection or a closed tab, a good system will return a result as if the user had been watching the stream in real time. Furthermore, as models become increasingly capable of handling long-running tasks, background execution has become table stakes for AI agents.
There are two paths: websockets or event streams.
If you choose websockets, it should look something like this (a minimal server sketch follows the list):
- Client:
- On load, connect users to a websocket channel for the session
- Submit requests via POST request
- Listen to websocket channel for responses
- If the session is in progress,
- hydrate UI from the cache
- Server:
- POST /api/chat
- Create a record in the database
- Connect to the websocket for the session
- Call the upstream model
- Pipe the stream to the websocket
- Pipe the stream to a temporary cache (Redis). This avoids excessive db writes and allows clients to quickly “catch up” to in-progress streams.
- Listen for user interrupts over the websocket
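Here is roughly what that server looks like, as a minimal sketch assuming a long-lived Express process, ioredis, and the AI SDK. `saveMessage` and `loadHistory` are hypothetical persistence helpers, the model id is illustrative, and interrupt handling is omitted for brevity.

```ts
import express from 'express';
import Redis from 'ioredis';
import { streamText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';

const app = express();
app.use(express.json());
const cache = new Redis(); // catch-up cache for reconnecting clients
const pub = new Redis();   // pub/sub backing the websocket channel

// Hypothetical persistence helpers, stubbed for illustration
async function saveMessage(sessionId: string, message: { role: string; content: string }) {}
async function loadHistory(sessionId: string): Promise<{ role: 'user' | 'assistant'; content: string }[]> {
  return [];
}

app.post('/api/chat', async (req, res) => {
  const { sessionId, message } = req.body;
  await saveMessage(sessionId, message); // create a record in the database
  res.status(202).json({ sessionId });   // ack immediately; the work continues below

  const result = streamText({
    model: anthropic('claude-sonnet-4-20250514'), // illustrative model id
    messages: await loadHistory(sessionId),
  });

  for await (const chunk of result.textStream) {
    await pub.publish(`chat:${sessionId}`, chunk);         // pipe to the websocket channel
    await cache.append(`chat:${sessionId}:buffer`, chunk); // let reconnecting clients catch up
  }

  await saveMessage(sessionId, { role: 'assistant', content: await result.text });
  await cache.expire(`chat:${sessionId}:buffer`, 3600); // buffer is disposable; the db is the source of truth
});

app.listen(3000);
```

The important property is that nothing after the `202` depends on the client staying connected.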
Using event streams, you need to employ a similar trick. The AI SDK team released their own clever solution a few weeks after our team had shipped the websockets approach described above.
Other considerations:
- Only send the most recent message from the client
- Commit to the database at every step
- Track usage and cost for every call; cost is a major component of trading off between models (a sketch follows)
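For the usage point, the AI SDK exposes token counts on the stream result once it completes. A rough sketch follows; the pricing table, `recordUsage`, and the model id are illustrative, and usage field names differ between SDK majors.

```ts
import { streamText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';

const PRICE_PER_MTOK = { input: 3, output: 15 }; // example numbers; use your provider's price sheet

async function recordUsage(row: { inputTokens: number; outputTokens: number; cost: number }) {
  // write to your db alongside the message record
}

export async function trackedCall(messages: { role: 'user' | 'assistant'; content: string }[]) {
  const result = streamText({ model: anthropic('claude-sonnet-4-20250514'), messages });

  for await (const chunk of result.textStream) {
    // pipe to the client / cache as usual
  }

  const usage: any = await result.usage; // resolves after the stream completes
  const inputTokens = usage.inputTokens ?? usage.promptTokens ?? 0;     // field names vary by AI SDK version
  const outputTokens = usage.outputTokens ?? usage.completionTokens ?? 0;
  const cost =
    (inputTokens / 1_000_000) * PRICE_PER_MTOK.input +
    (outputTokens / 1_000_000) * PRICE_PER_MTOK.output;

  await recordUsage({ inputTokens, outputTokens, cost });
  return result.text;
}
```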
Stream everything
The AI SDK provides an excellent set of patterns to follow for delivering LLM responses as an HTTP event stream.
However, there are several areas where we can improve on the AI SDK defaults:
- Interruption Support: Allows users to interrupt long responses when they have enough information
- Error Visibility: Errors can be detected and handled before complete response generation
- Stream Transforms:
  - Smooth Streaming: Implements controlled delays (20ms) to prevent overwhelming the client
  - Buffer Management: Splits large buffers into manageable chunks (max 1000 chars)
  - Markdown Filtering: Prevents rendering of incomplete markdown structures
  - Tool Call Streaming: Enables real-time tool call visibility
- Streamable JSON patches for view model updates (sketched after the flow below)
LLM -> structured data -> create view model
LLM -> view update -> JSON patch -> view model
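As a sketch of the second flow, the model can be asked for RFC 6902 operations against the current view model instead of a full replacement. The `ViewModel` shape, schema, and model id here are illustrative, and fast-json-patch is just one patch implementation.

```ts
import { z } from 'zod';
import { generateObject } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';
import { applyPatch, type Operation } from 'fast-json-patch';

type ViewModel = { title: string; sections: { heading: string; body: string }[] };

// Ask the model for patch operations, not a whole new view model
const patchSchema = z.object({
  ops: z.array(
    z.object({
      op: z.enum(['add', 'replace', 'remove']),
      path: z.string(), // e.g. "/sections/0/body"
      value: z.unknown().optional(),
    })
  ),
});

export async function updateViewModel(current: ViewModel, instruction: string): Promise<ViewModel> {
  const { object } = await generateObject({
    model: anthropic('claude-sonnet-4-20250514'),
    schema: patchSchema,
    prompt: [
      `Current view model: ${JSON.stringify(current)}`,
      `Requested update: ${instruction}`,
      'Respond with JSON patch operations that apply the update.',
    ].join('\n'),
  });

  // Apply the patch to a copy; the ops can also be streamed to the client as they arrive
  return applyPatch(structuredClone(current), object.ops as Operation[]).newDocument;
}
```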
Files and RAG
RAG only works well if you have good retrieval. A good retrieval system starts with a file processing pipeline.
Different files need different strategies:
- text files and PDFs need to be broken down, indexed, and summarized
- images need their metadata extracted and then sent to a model
- audio should be transcribed, diarized, and summarized
- videos should go to a dedicated video model
- zip and tar archives need to be unpacked
- everything needs to be private and secure
It’s a lot!
Once you have all (or part) of that done, then you need to implement the retriever itself. This will look like a combination of vector (cosine similarity) and keyword (BM25) search, paired with a re-ranker to prune sources by relevance. Then you need to append relevant chunks to the user’s prompt, along with instructions on how to cite them.
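A sketch of that retriever, where `vectorSearch`, `keywordSearch`, and `rerank` stand in for whichever vector store, BM25 index, and re-ranker you use; the fusion constant and limits are just reasonable defaults.

```ts
type Chunk = { id: string; text: string; source: string };

export async function retrieve(query: string, k = 8): Promise<Chunk[]> {
  // Run both searches in parallel over the indexed chunks
  const [semantic, lexical] = await Promise.all([
    vectorSearch(query, 50),  // cosine similarity over embeddings
    keywordSearch(query, 50), // BM25 over raw text
  ]);

  // Reciprocal rank fusion: reward chunks that rank well in either list
  const scores = new Map<string, { chunk: Chunk; score: number }>();
  for (const list of [semantic, lexical]) {
    list.forEach((chunk, rank) => {
      const entry = scores.get(chunk.id) ?? { chunk, score: 0 };
      entry.score += 1 / (60 + rank); // 60 is the usual RRF smoothing constant
      scores.set(chunk.id, entry);
    });
  }
  const fused = [...scores.values()].sort((a, b) => b.score - a.score).map((e) => e.chunk);

  // Let a cross-encoder re-ranker make the final relevance cut
  return rerank(query, fused.slice(0, 25), k);
}

// Hypothetical backends, sketched as stubs
declare function vectorSearch(query: string, limit: number): Promise<Chunk[]>;
declare function keywordSearch(query: string, limit: number): Promise<Chunk[]>;
declare function rerank(query: string, candidates: Chunk[], k: number): Promise<Chunk[]>;
```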
Citations are an area of great interest, because they often make or break the credibility of a response. The ability to provide a corpus of sources up-front and receive accurate, well-formed citations is an area to watch among model providers.
Model strategies
It all starts with picking the right model for the job. Models have a strike zone, and each model’s strike zone is different.
Usually it comes down to picking where you want to be on this axis:
Fast model <--------------> Smart model
Or you can just pick Claude.
Multi-Provider fallbacks
This is a must. Providers can go down with outages, and you’ll be rudely reminded of this the first time your sole provider has an incident that tanks your entire app.
The easiest way out of this is to use OpenRouter or something like it.
- High Availability: Prevents single points of failure in AI model access
- Load Distribution: Distributes load across multiple providers to avoid rate limits
- Cost Optimization: Allows switching to more cost-effective providers when available
- Performance Optimization: Routes requests to the fastest available provider
- Vendor Independence: Reduces dependency on any single AI provider
If you decide to implement your own fallback logic, you’ll need to consider the following (a minimal sketch follows these lists):
Provider-Specific Error Handling:
- tracking different error states between providers
- distinguishing errors and behavior differences between different flavors of the same model, e.g., why does gemini-flash-2.5 return an error if you pass a frequency_penalty when hosted on Google AI Studio but not when hosted on Vertex?

Provider State Management:
- Color Tracking: Each provider is assigned a color for UI identification
- Performance Metrics: Track response times and success rates per provider
- Health Monitoring: Monitor provider availability and automatically exclude unhealthy providers
- Load Balancing: Distribute requests based on provider capacity and performance
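If you do roll your own, the core loop is small. Here is a sketch assuming the AI SDK, with illustrative model ids and a deliberately rough retryable-error check.

```ts
import { generateText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';
import { openai } from '@ai-sdk/openai';
import { google } from '@ai-sdk/google';

// Ordered by preference; each entry is the same "slot" filled by a different provider
const candidates = [
  { name: 'anthropic', model: anthropic('claude-sonnet-4-20250514') },
  { name: 'openai', model: openai('gpt-4.1') },
  { name: 'google', model: google('gemini-2.5-flash') },
];

// Rough classification: fail over on rate limits and provider outages,
// surface everything else (a bad request won't get better on another provider)
function isWorthFallingBack(error: unknown): boolean {
  const status = (error as { statusCode?: number })?.statusCode;
  return status === undefined || status === 429 || status >= 500;
}

export async function generateWithFallback(prompt: string) {
  let lastError: unknown;
  for (const { name, model } of candidates) {
    try {
      const { text, usage } = await generateText({ model, prompt });
      return { provider: name, text, usage };
    } catch (error) {
      lastError = error;
      if (!isWorthFallingBack(error)) throw error;
      console.warn(`provider ${name} failed, trying next`, error);
    }
  }
  throw lastError;
}
```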
Error handling
Building products around AI models requires sophisticated error handling, with plenty of room for fallbacks, retries, and failovers. Different types of failures need their own unique recovery strategies to enable graceful degradation and intelligent recovery.
Importantly, pushing a decision back to the user is a strategy we can build into these types of apps.
- Precise Error Handling: Different error types require different recovery strategies
- User Experience: Provides meaningful error messages and recovery options to users
- System Stability: Prevents cascading failures by containing errors appropriately
- Debugging: Facilitates troubleshooting by providing detailed error context
- Monitoring: Enables proper alerting and metrics collection for different error types
`AuthRequiredError`, `isAISDKErrorToRetry`, `isContextLengthError`, `isImageDimensionError`, `isShouldStopToolError`, `isUseClientSideToolError`, `ShouldStopTool`, `SystemAbortedError`, `UseClientSideTool`, `UserAbortedError`
Tool Error Recovery:
// Convert tool execution errors to tool results for model visibility
if (ToolExecutionError.isInstance(part.error)) {
  logger.debug('Converting tool execution error to tool result', {
    error: part.error,
    toolCallId: part.error.toolCallId,
    toolName: part.error.toolName
  });

  // Create error result that the model can see and respond to
  const errorResult = {
    type: 'tool-result' as const,
    toolCallId: part.error.toolCallId,
    toolName: part.error.toolName,
    result: {
      error: part.error.message || 'Tool execution failed',
      artifactsCreated: [],
      annotations: []
    }
  };

  // Write error result to stream for model processing
  streamWriter.write(messageConverter.transformChunk(errorResult));
  continue; // Continue processing instead of failing
}
Error Context and Logging:
- Structured Logging: All errors include contextual information (jobId, sessionId, etc.)
- Severity Classification: Different error types have appropriate log levels
- Error Aggregation: Similar errors are grouped for monitoring and alerting
- Recovery Tracking: Success/failure rates of different recovery strategies are monitored
Context management
Even as context windows increase over time, there’s no free lunch. Large contexts slow down inference, balloon costs, and degrade performance.
The system implements sophisticated context length management to ensure conversations stay within AI model token limits while preserving the most important information. This involves intelligent message truncation, token tracking, and optimization strategies.
- Model Compatibility: Different AI models have varying context length limits that must be respected
- Performance: Shorter contexts lead to faster response times and lower costs
- Information Preservation: Critical information must be retained while less important content is truncated
- User Experience: Prevents context length errors that would interrupt conversations
- Cost Control: Token usage directly impacts operational costs
Intelligent Truncation Strategy (a rough sketch follows this list):
- Priority-Based: System messages and recent messages have higher priority
- Tool Context: Tool calls and results are preserved when possible
- User Intent: User messages are prioritized over assistant responses
- Conversation Flow: Maintains logical conversation flow even after truncation
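A rough sketch of that priority order, with `countTokens` standing in for your tokenizer of choice; a real implementation also needs to keep tool call/result pairs together.

```ts
type Message = { role: 'system' | 'user' | 'assistant' | 'tool'; content: string };

declare function countTokens(message: Message): number; // e.g. a tokenizer or provider count API

export function truncateToBudget(messages: Message[], budget: number): Message[] {
  const system = messages.filter((m) => m.role === 'system');
  const rest = messages.filter((m) => m.role !== 'system');

  let used = system.reduce((sum, m) => sum + countTokens(m), 0);
  const kept: Message[] = [];

  // Walk backwards so the most recent turns are kept first
  for (const message of [...rest].reverse()) {
    const cost = countTokens(message);
    if (used + cost > budget) {
      // Prefer dropping old assistant output over old user intent
      if (message.role === 'assistant') continue;
      break; // stop once even user/tool context no longer fits
    }
    used += cost;
    kept.push(message);
  }

  // System messages first, then the surviving turns in original order
  return [...system, ...kept.reverse()];
}
```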
Prompt caching
Big system prompts lead to slow responses.
Claude models currently have the best support for fine-grained caching, which can cache portions of the conversation context to reduce token processing costs and improve response times for subsequent requests.
- Cost Reduction: Cached tokens are significantly cheaper than processing new tokens
- Performance: Cached content doesn’t need to be reprocessed, reducing latency
- Efficiency: Reduces computational load on AI providers
- Scalability: Enables handling longer conversations without proportional cost increases
- User Experience: Faster response times for continued conversations
Provider-Specific Cache Headers:
let headerString = '';
if (isClaude4(modelId)) {
  headerString = 'interleaved-thinking-2025-05-14';

  // Extended cache TTL only available on Anthropic provider
  if (modelDef.provider === 'anthropic') {
    headerString += ',extended-cache-ttl-2025-04-11';
  }
}
Cache Performance Monitoring:
- as part of the overall usage monitoring system, cache performance is included with usage annotations. This gives you request-level views on your cache rate and effective token cost.
Cache Invalidation Strategy:
- Content Changes: Cache is invalidated when system prompts or tools change
- Time-Based: Extended TTL headers for longer cache retention
- Context Boundaries: Cache markers respect conversation boundaries
- Processing State: Active processing areas are excluded from caching
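Putting the pieces together with the AI SDK’s Anthropic provider: the stable prefix gets a cache-control marker and the beta headers ride along on the request. Property names follow recent AI SDK versions (older ones used `experimental_providerMetadata`), and the prompt and model id are illustrative.

```ts
import { streamText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';

declare const bigSystemPrompt: string; // the large, rarely-changing prefix
declare const userPrompt: string;

const result = streamText({
  model: anthropic('claude-sonnet-4-20250514'),
  headers: { 'anthropic-beta': 'extended-cache-ttl-2025-04-11' }, // e.g. the header assembled above
  messages: [
    {
      role: 'system',
      content: bigSystemPrompt,
      // Mark everything up to here as cacheable
      providerOptions: { anthropic: { cacheControl: { type: 'ephemeral' } } },
    },
    { role: 'user', content: userPrompt },
  ],
});

// Cache effectiveness surfaces in the provider metadata once the stream finishes
const metadata = await result.providerMetadata;
console.log(metadata?.anthropic); // cacheCreationInputTokens / cacheReadInputTokens
```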
Power tools
Tool calls are king. Tools ain’t free; sooner or later you run out of space in the system prompt. Tool definitions can be inserted into chat history at any time, creating an opportunity to select tools via another tool call.
Tools allow you to combine the strengths of different models and offset their weaknesses. Gemini has blazing-fast large context models that are weaker at writing prose or complex code but excellent at searching and summarizing large documents; OpenAI has great image generation and unique voice, but is prone to straying from factual grounding; Anthropic has the best instruction following, tool calling, and coding models. There are benefits to mixing your models and evaluating them in different scenarios.
Tool design
Two things are held in tension:
- tools should have a narrow scope
- tools should be open-ended
Take a tool to crawl a webpage. The simplest version might look like this:
{
  name: "crawl_webpage",
  description: "fetch the contents of a webpage",
  arguments: {
    url: z.string()
  }
}
When you first use this, it will probably work ok! But it has some problems. For one, webpages can get pretty big, and if the tool result contains the entire response you’re likely to run out of context relatively quickly. You might try extracting the text with something like mozilla/readability, but you’re still subject to the same problem.
There’s a larger problem with this naive pattern: the burden of dealing with the response is placed back on the control model. The more times you call the tool, the more context gets dedicated to attending to previous responses, regardless of whether they’re relevant to the session. A better version pushes that burden down into the tool itself:
{
  name: "browse",
  description: "Web page browsing and content extraction",
  arguments: {
    urls: z.array(z.string())
  }
}
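What changes is where the work happens. A sketch of what `browse` might do internally: fetch each page, strip it down with Readability, and have a cheap large-context model condense it against an objective. The `objective` argument, model id, and library choices here are assumptions for illustration, not part of the tool definition above.

```ts
import { JSDOM } from 'jsdom';
import { Readability } from '@mozilla/readability';
import { generateText } from 'ai';
import { google } from '@ai-sdk/google';

export async function browse(urls: string[], objective: string): Promise<string> {
  const summaries = await Promise.all(
    urls.map(async (url) => {
      const html = await (await fetch(url)).text();

      // Strip navigation, ads, and markup down to the article text
      const article = new Readability(new JSDOM(html, { url }).window.document).parse();
      const text = article?.textContent?.slice(0, 200_000) ?? html.slice(0, 50_000);

      // Offload summarization to a fast large-context model; only the summary
      // ever lands in the control model's context.
      const { text: summary } = await generateText({
        model: google('gemini-2.5-flash'), // illustrative model id
        prompt: `Summarize the following page for this objective: "${objective}".\n\n${text}`,
      });

      return `## ${url}\n${summary}`;
    })
  );
  return summaries.join('\n\n');
}
```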
Dynamic loading
- Dynamic tool loading: Prepares and loads tools based on user context and permissions
- Tool state management: Tracks tool invocations, results, and error states
- Manual tool execution: Supports executing tools outside of the AI model’s direct control
The dynamic tool loading system is one of the most sophisticated aspects of the orchestrator, implementing a multi-layered approach to tool management that adapts to user context, permissions, and system state.
Tool Categories & Classification:
- Core Tools: Essential tools available to all users (ask-user, complete, create-document)
- MCP Server Tools: Tools provided by Model Context Protocol servers
- Connected Non-Core Tools: External service tools with active connections
- Non-Connected Tools: Tools requiring setup or authentication
- Browser Extension Tools: Tools requiring browser extension capabilities
Context-Aware Loading Strategy:
- User-based filtering: Tools are filtered based on user permissions and subscription level
- Developer mode: Additional tools become available when developer flag is enabled
- Model compatibility: Tools are filtered based on the target AI model’s capabilities
- Feature flags: Tool availability controlled by feature flags in `toolHelpers.flags`
Load-Time vs Runtime Tool Selection: The system implements a two-tier tool selection mechanism (a rough sketch follows this list):
- Load-Time Selection (`loadToolsEnabled` flag):
  - When enabled, only core tools are initially loaded to the AI model
  - Reduces token usage and improves response time
  - Non-core tools are loaded on-demand when needed
- Runtime Tool Override (`overrideToolSelection`):
  - Allows dynamic tool selection for specific subtasks
  - Enables focused tool sets for specialized operations
  - Supports tool restriction for security or performance reasons
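A rough sketch of how those two tiers can compose, with `coreTools`, `allTools`, and the option names mirroring the flags above rather than any real API.

```ts
import { streamText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';

declare const coreTools: Record<string, any>; // ask-user, complete, create-document, ...
declare const allTools: Record<string, any>;  // everything this user is allowed to see

interface RunOptions {
  loadToolsEnabled?: boolean;       // load-time: start with core tools only
  overrideToolSelection?: string[]; // runtime: pin an explicit tool set for a subtask
}

function selectTools({ loadToolsEnabled, overrideToolSelection }: RunOptions) {
  if (overrideToolSelection) {
    // Runtime override wins: a focused tool set for a specialized operation
    return Object.fromEntries(
      Object.entries(allTools).filter(([name]) => overrideToolSelection.includes(name))
    );
  }
  // Load-time selection: keep the initial prompt small, load the rest on demand
  return loadToolsEnabled ? coreTools : allTools;
}

export function runStep(messages: any[], options: RunOptions) {
  return streamText({
    model: anthropic('claude-sonnet-4-20250514'),
    messages,
    tools: selectTools(options),
  });
}
```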
Permission-Based Tool Filtering:
- Tools are filtered based on user’s organizational unit (ouId)
- Project-specific tools are loaded based on user’s current project context
- External service tools require proper authentication tokens
- Rate limiting and quota management per tool type
Tool Metadata Integration:
The system uses `toolHelpers.metadata` to:
- Track tool usage statistics
- Manage tool-specific configuration
- Handle tool deprecation and versioning
- Provide tool-specific help and documentation
Performance Optimizations:
- Lazy loading: Tools are loaded only when needed
- Caching: Tool definitions are cached to avoid repeated preparation
- Batching: Multiple tool operations are batched for efficiency
- Memory management: Unused tools are garbage collected
Error Handling in Tool Loading:
- Graceful degradation: System continues with available tools if some fail to load
- Fallback mechanisms: Core tools always available as fallback
- Tool validation: Each tool is validated before being made available
- Connection testing: External tools are tested for connectivity
This dynamic approach ensures optimal performance while maintaining flexibility and security, allowing the system to adapt to different user contexts and requirements without compromising functionality.
Client-side tools
- Client-side detection: Identifies tools that need client-side execution (a sketch follows this list)
- Auth flow handling: Manages OAuth and authentication flows for external services
- Tool state preservation: Maintains tool state across client-side interactions
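One way to wire this up, mirroring the `UseClientSideTool` marker from the error-handling section; the message shape and the `/api/chat/tool-result` endpoint are illustrative.

```ts
type ClientToolRequest = {
  type: 'use-client-side-tool';
  toolCallId: string;
  toolName: 'open_oauth_popup' | 'read_browser_tab'; // examples of client-only capabilities
  args: Record<string, unknown>;
};

export async function handleStreamPart(part: unknown, sessionId: string) {
  const request = part as ClientToolRequest;
  if (request?.type !== 'use-client-side-tool') return;

  // Execute in the browser, where the auth session / extension actually lives
  const result = await executeClientTool(request.toolName, request.args);

  // Hand the result back so the server can resume the model with a tool result
  await fetch('/api/chat/tool-result', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ sessionId, toolCallId: request.toolCallId, result }),
  });
}

// Hypothetical dispatcher over the client-side tool implementations
declare function executeClientTool(name: string, args: Record<string, unknown>): Promise<unknown>;
```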
Tool call repairs
Description: The system implements an advanced tool call repair mechanism that automatically detects and fixes malformed tool arguments using AI-powered schema validation and repair, ensuring robust tool execution even when the primary AI model makes formatting errors.
Justification:
- Reliability: Prevents tool execution failures due to minor formatting issues
- User Experience: Reduces interruptions from tool argument validation errors
- Robustness: Makes the system more resilient to AI model inconsistencies
- Efficiency: Automatically fixes issues without requiring user intervention
- Quality: Maintains high success rates for tool execution
Implementation Details:
experimental_repairToolCall: async ({ toolCall, tools, error, parameterSchema }) => {
  const startTime = Date.now();

  try {
    // Create repair prompt with context
    const promptContent = [
      `Your task is to fix the arguments for a tool call.`,
      `The model tried to call the tool "${toolCall.toolName}" with the following arguments: ${toolCall.args}.`,
      `The tool accepts the following schema: ${JSON.stringify(parameterSchema({ toolName: toolCall.toolName }))}.`,
      `Error: ${error.message}`,
      'Please fix the arguments to match the required schema.'
    ].join('\n');

    // Use AI model to repair the arguments
    const result = await generateObject({
      model,
      schema: tools[toolCall.toolName].parameters,
      prompt: promptContent
    });

    // Return repaired tool call
    return {
      toolCallType: 'function',
      toolCallId: toolCall.toolCallId,
      toolName: toolCall.toolName,
      args: JSON.stringify(result.object)
    };
  } catch (error) {
    logger.error('Failed to repair tool call', error);
    return null; // Repair failed, let original error propagate
  }
}
Repair Strategy:
- Error Analysis: Analyzes the specific validation error to understand the issue
- Schema Context: Provides the correct schema to the repair AI model
- Contextual Repair: Uses the original arguments as context for intelligent repair
- Validation: Ensures repaired arguments match the required schema
- Fallback: Returns null if repair is impossible, allowing graceful error handling
Error Types Handled:
- Type Mismatches: Converting strings to numbers, booleans, etc.
- Missing Required Fields: Adding required parameters with sensible defaults
- Format Issues: Fixing date formats, URL structures, etc.
- Validation Errors: Correcting values that don’t match schema constraints
Repair Quality Assurance:
- Schema Validation: All repairs are validated against the original schema
- Semantic Preservation: Repairs maintain the original intent of the tool call
- Performance Tracking: Monitor repair success rates and latency
- Fallback Handling: Graceful degradation when repair is not possible
MCP
While I was initially skeptical about MCP, the level of adoption across the industry over the past 8 months is hard to ignore.