
Building Production-Ready Streaming LLM Agents: Lessons from the Trenches

November 17, 2025

Over the past year of building and deploying LLM agents in production, we've learned that success isn't about using the most sophisticated frameworks or the newest models. It's about building systems that are composable, observable, and resilient to failure.

In this post, we share the architectural patterns and practical lessons from building a production agent system that handles thousands of streaming conversations daily. These patterns emerged from real-world constraints: the need for immediate user feedback, graceful error recovery, and the ability to compose complex workflows from simple primitives.

What We Mean by "Streaming Agents"

When we talk about streaming agents, we mean systems where:

  • User feedback is immediate: Users see progress as the agent works, not just final results

  • Failures are recoverable: Errors don't lose context or force users to restart

  • Execution is transparent: Users understand what the agent is doing at each step

  • Composition is explicit: Complex behaviors emerge from chaining simple components

This differs from batch-oriented agents that run to completion before returning results, and from framework-heavy approaches that obscure the underlying execution model.

The key insight is that streaming isn't just about showing tokens as they're generated—it's about creating a continuous feedback loop between the agent and the user, where both progress and problems are visible in real-time.

The Four-Layer Architecture

After numerous iterations, we settled on a four-layer architecture that balances simplicity with production requirements. Each layer has a single, well-defined responsibility, making the system easier to test, debug, and evolve.

Layer 1: The Event Generator (Stream Lifecycle)

The outermost layer manages the HTTP streaming connection and ensures reliability. This is where the agent meets the real world—handling network connections, client disconnections, and connection-level errors.

Core responsibilities:

  • Establish the streaming connection immediately (send a ping before any computation)

  • Monitor for client disconnection throughout execution

  • Route different event types to appropriate handlers

  • Ensure centralized error handling so all failures produce consistent error responses

  • Preserve completed work even when errors occur

Why this matters: Without proper connection management, you waste server resources processing results for disconnected clients, and users see inconsistent error messages that make debugging harder.

The key insight here is that the HTTP streaming layer should be completely separate from agent logic. The event generator doesn't know anything about todos, orchestration, or pipelines—it just manages the lifecycle of a stream.
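
Here is a minimal sketch of this layer, assuming a FastAPI server streaming Server-Sent Events; `run_executor` and `format_sse` are illustrative placeholders standing in for Layer 2 and the serialization helper, not the actual implementation.

```python
import json

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()


def format_sse(event: dict) -> str:
    """Serialize one structured event as a Server-Sent Events frame."""
    return f"data: {json.dumps(event)}\n\n"


async def run_executor(prompt: str):
    """Stand-in for Layer 2: yields structured event dicts."""
    yield {"type": "status", "stage": "planning"}
    yield {"type": "token", "text": "Hello"}


@app.post("/chat")
async def chat(request: Request, prompt: str):
    async def event_generator():
        # Open the stream immediately, before any expensive computation.
        yield format_sse({"type": "ping"})
        try:
            async for event in run_executor(prompt):
                # Stop early if the client has gone away.
                if await request.is_disconnected():
                    break
                # This layer only routes and serializes events; it knows
                # nothing about todos, orchestration, or pipelines.
                yield format_sse(event)
        except Exception as exc:
            # Centralized error formatting: every failure looks the same to
            # the client, and completed events were already streamed.
            yield format_sse({"type": "error", "message": str(exc)})

    return StreamingResponse(event_generator(), media_type="text/event-stream")
```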

Layer 2: The Executor (Orchestration)

The executor is the conductor of the orchestra: it coordinates planning, execution, transformation, and enrichment, but doesn't do any of these tasks itself. It knows about the high-level workflow but delegates all specific work to specialized components.

Core responsibilities:

  • Coordinate the planning phase (delegate to orchestrator)

  • Check guardrails before starting expensive work

  • Build the execution pipeline dynamically based on the plan

  • Implement automatic recovery when agents hit rate limits or tool call maximums

  • Chain together transformation and enrichment of results

The execution flow:

  1. Ask the orchestrator to generate a plan (what needs to be done)

  2. Check if guardrails triggered (e.g., inappropriate request detection)

  3. Build a pipeline based on the plan

  4. Execute the pipeline with automatic recovery for expected failures

  5. Transform and enrich results before streaming to users

Why this matters: By separating orchestration logic from HTTP streaming, you can test the execution flow without mock HTTP requests. By using specialized components (orchestrator, pipeline builder, transformer, processor), each piece can evolve independently.

The critical insight: plan first, then execute. Don't try to do both simultaneously. The orchestrator figures out what to do, then the pipeline does it. This separation makes the system predictable and testable.
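
In code, the control flow looks roughly like the sketch below. `Orchestrator`, `PipelineBuilder`, `Transformer`, and `Processor` are assumed interfaces for the specialized components, not the actual classes; automatic recovery is shown separately under Pattern 3.

```python
class Executor:
    """Coordinates planning and execution without doing either itself."""

    def __init__(self, orchestrator, pipeline_builder, transformer, processor):
        self.orchestrator = orchestrator
        self.pipeline_builder = pipeline_builder
        self.transformer = transformer
        self.processor = processor

    async def run(self, conversation):
        # 1. Plan first: delegate to the orchestrator and forward its events.
        async for event in self.orchestrator.plan(conversation):
            yield event

        # 2. Check guardrails before starting any expensive work.
        if self.orchestrator.stop_reason is not None:
            yield {"type": "status", "stage": "stopped",
                   "reason": self.orchestrator.stop_reason}
            return

        # 3. Build the pipeline dynamically from the plan, then execute it.
        pipeline = self.pipeline_builder.build(self.orchestrator.plan_steps)
        async for event in pipeline.execute(conversation):
            # 4. Transform and enrich each event before it reaches the user.
            event = self.transformer.transform(event)
            event = await self.processor.process(event)
            yield event
```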

Layer 3: The Orchestrator (Planning)

The orchestrator's job is to figure out what work needs to be done. It runs multiple specialized agents in parallel to generate a comprehensive plan efficiently.

Core responsibilities:

  • Generate a step-by-step execution plan (the "todos")

  • Create a conversation title for new conversations

  • Run guardrail checks to ensure the request is appropriate

  • Validate all results before returning them

  • Store the plan for use by downstream layers

Why parallel execution: Planning, title generation, and guardrail checks are independent—none depends on the others' output. Running them in parallel cuts latency by ~60% compared to sequential execution.

The planning workflow:

  1. Simultaneously ask three agents: "what steps are needed?", "what should this conversation be titled?", and "is this request appropriate?"

  2. Wait for all three to complete

  3. Validate that results are well-formed (correct types, non-empty, etc.)

  4. If guardrails triggered, store the stop reason and halt execution

  5. Otherwise, emit the plan and topic as structured events

  6. Store the plan for the executor to use

Key insight: The orchestrator maintains state (the current plan, stop reasons) but also streams events. This dual responsibility—stateful and streaming—is what enables the rest of the system to be stateless. The executor can ask "what's the current plan?" without parsing events.
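
A sketch of the parallel planning phase is below. It assumes each agent exposes an awaitable `run()` method and that the guardrail agent returns a verdict with `allowed` and `reason` fields; these are illustrative assumptions, not the actual interfaces.

```python
import asyncio


class Orchestrator:
    """Plans work, checks guardrails, and stores state for the executor."""

    def __init__(self, plan_agent, title_agent, guardrail_agent):
        self.plan_agent = plan_agent
        self.title_agent = title_agent
        self.guardrail_agent = guardrail_agent
        self.plan_steps: list[str] = []
        self.stop_reason: str | None = None

    async def plan(self, conversation):
        # Planning, title generation, and guardrail checks are independent,
        # so run them concurrently instead of back to back.
        todos, title, verdict = await asyncio.gather(
            self.plan_agent.run(conversation),
            self.title_agent.run(conversation),
            self.guardrail_agent.run(conversation),
        )

        # Validate before storing: a plan must be a non-empty list of steps.
        if not isinstance(todos, list) or not todos:
            raise ValueError("orchestrator produced an empty or malformed plan")

        if not verdict.allowed:
            # Record why we stopped so the executor can halt cleanly.
            self.stop_reason = verdict.reason
            return

        # Stateful *and* streaming: store the plan and emit it as events.
        self.plan_steps = todos
        yield {"type": "object", "name": "plan", "data": todos}
        yield {"type": "object", "name": "title", "data": title}
```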

Layer 4: The Pipeline (Execution)

The pipeline is where actual work happens. It takes the orchestrator's plan and executes it step-by-step, chaining specialized agents together with transformation logic between stages.

Core responsibilities:

  • Build a dynamic execution pipeline from the orchestrator's plan

  • Select appropriate specialized agents for each step (research agent, writing agent, analysis agent, etc.)

  • Manage state transitions (mark todos as complete, start the next one)

  • Inject context between stages (tell each agent what it should work on)

  • Stream all intermediate results so users see progress

The execution pattern:

  1. Get the plan from the orchestrator (a list of todos)

  2. For each todo, create a pipeline stage with the appropriate agent

  3. Add a "post-hook" that runs after each stage completes

  4. Post-hooks update status, check if more work remains, and prepare context for the next stage

  5. Execute the full pipeline, streaming results as each stage progresses

Why dynamic pipelines: The structure isn't hardcoded—it's built at runtime based on what the orchestrator planned. A simple question might need one research step and one writing step. A complex question might need five research steps, two analysis steps, and three writing steps. The same pipeline infrastructure handles both.

Key insight: Post-hooks are the glue. They handle the transition from one stage to the next: marking work complete, determining whether to continue, and injecting context. Without post-hooks, you'd need either a monolithic agent that does everything (hard to test, hard to specialize) or complex coordination logic (fragile, hard to understand).
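
The sketch below shows dynamic pipeline construction. It assumes an agent registry that maps todos to specialized agents and agents that stream events; all names are illustrative.

```python
from dataclasses import dataclass
from typing import Any, AsyncIterator, Awaitable, Callable


@dataclass
class Stage:
    agent: Any
    post_hook: Callable[[dict], Awaitable[dict]]


class Pipeline:
    def __init__(self, stages: list[Stage]):
        self.stages = stages

    async def execute(self, conversation) -> AsyncIterator[dict]:
        context: dict = {}
        for stage in self.stages:
            # Stream every intermediate event so users see progress.
            async for event in stage.agent.run(conversation, context):
                yield event
            # The post-hook is the glue: it marks work complete and
            # prepares context for the next stage.
            context = await stage.post_hook(context)


class PipelineBuilder:
    def __init__(self, agent_registry):
        self.agent_registry = agent_registry

    def build(self, todos: list[str]) -> Pipeline:
        stages = []
        for index, todo in enumerate(todos):
            agent = self.agent_registry.select(todo)  # research, writing, ...
            # Use a factory so each hook captures its own index (see Pitfall 4).
            hook = self._make_post_hook(index, todos)
            stages.append(Stage(agent=agent, post_hook=hook))
        return Pipeline(stages)

    def _make_post_hook(self, index: int, todos: list[str]):
        async def post_hook(context: dict) -> dict:
            # Mark this todo complete and point the next stage at its work.
            done = context.get("completed", []) + [todos[index]]
            next_todo = todos[index + 1] if index + 1 < len(todos) else None
            return {"completed": done, "current_todo": next_todo}
        return post_hook
```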

Core Patterns for Production Agents

Beyond the four-layer architecture, several specific patterns emerged as critical for production reliability.

Pattern 1: Type-Safe Event Streams

Instead of streaming raw text or generic JSON objects, define explicit types for every event that flows through the system. At minimum, you need:

  • Token events: Individual LLM output tokens (for the typing effect)

  • Object events: Structured data like plans, citations, or metadata

  • Status events: Stage transitions and progress indicators

  • Error events: Failures with enough context to recover or report meaningfully

Why this matters:

  • Type checking catches entire classes of bugs before they reach production

  • Different event types can be processed differently (enrichment for objects, logging for status, alerting for errors)

  • Clients can render appropriate UI for each event type (progress bars for status, error dialogs for errors, structured displays for objects)

  • Adding new event types doesn't break existing handlers

The alternative—streaming unstructured JSON or raw text—means every consumer needs custom parsing logic, and refactoring becomes dangerous.
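
A minimal sketch of these event types as a Pydantic discriminated union follows; the exact fields are assumptions chosen to illustrate the pattern, not a fixed schema.

```python
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field


class TokenEvent(BaseModel):
    type: Literal["token"] = "token"
    text: str


class ObjectEvent(BaseModel):
    type: Literal["object"] = "object"
    name: str    # e.g. "plan", "citation"
    data: dict


class StatusEvent(BaseModel):
    type: Literal["status"] = "status"
    stage: str   # e.g. "planning", "research"


class ErrorEvent(BaseModel):
    type: Literal["error"] = "error"
    message: str
    recoverable: bool = False


# One union type for everything that flows through the stream. Consumers that
# switch on `type` get exhaustiveness checking from a static type checker.
StreamEvent = Annotated[
    Union[TokenEvent, ObjectEvent, StatusEvent, ErrorEvent],
    Field(discriminator="type"),
]
```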

Pattern 2: Transform-Enrich-Process Pipeline

Stream processing happens in three distinct phases, each with its own responsibility:

  • Transform: Change event structure or format (pure functions, no I/O)

  • Enrich: Add data from external sources (database lookups, API calls)

  • Process: Apply business logic (stateful operations, filtering, aggregation)

Example flow:

  1. Raw stream contains a "tool call started" event with just a tool ID

  2. Transform phase converts it to a standardized format

  3. Enrich phase looks up tool metadata from the database (name, description, parameters)

  4. Process phase applies user-specific logic (hide certain tools from certain users)

  5. Final event includes all metadata, properly filtered for the current user

Why separate these?

  • Each phase can be tested independently (unit tests for transforms, integration tests for enrichment)

  • Adding new transformations doesn't require understanding enrichment or processing logic

  • Error handling boundaries are clear: transform errors are bugs, enrichment errors are I/O failures, processing errors are business logic issues

  • Performance optimization is easier (you can cache enrichment, parallelize transforms)
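
A sketch of the example flow above for a single event is shown below; `lookup_tool_metadata`, the event fields, and the `hidden_tools` check are hypothetical stand-ins.

```python
async def handle_tool_call_event(raw: dict, user, db) -> dict | None:
    # Transform: pure reshaping of the raw event, no I/O.
    event = {"type": "object", "name": "tool_call", "tool_id": raw["tool_id"]}

    # Enrich: add data from external sources (all I/O lives here).
    metadata = await db.lookup_tool_metadata(event["tool_id"])
    event["tool_name"] = metadata["name"]
    event["description"] = metadata["description"]

    # Process: business logic, e.g. hide certain tools from certain users.
    if event["tool_name"] in user.hidden_tools:
        return None
    return event
```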

Pattern 3: Automatic Recovery from Tool Call Limits

LLM providers limit tool calls per request (often 50-100). Agents working on complex tasks often hit these limits. Rather than failing and losing all progress, implement automatic recovery.

The recovery pattern:

  1. Wrap pipeline execution in a retry loop

  2. When the tool call limit is hit, catch the exception

  3. Extract the full conversation state from the exception (all messages, tool calls, and responses up to the limit)

  4. Inject a system message: "You hit the tool call limit. Continue from where you left off."

  5. Resume execution with the recovered state

Key aspects:

  • Preserve all context: Nothing is lost—the agent sees everything that happened before the limit

  • Inform the agent: The system message prevents the agent from starting over or getting confused

  • Limit retry attempts: Prevent infinite loops (3 attempts is reasonable)

  • Log for observability: Track recovery frequency to tune limits

What this prevents: Without recovery, hitting the tool call limit means the user sees an error after waiting 30+ seconds, and the agent's work is lost. With recovery, the user might not even notice—the agent just keeps working.

Roughly 2-3% of our requests end up going through this recovery path. These recovered requests tend to be the most valuable ones (complex questions that require many tool calls).
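
A sketch of the retry loop follows. It assumes the pipeline raises something like a `ToolCallLimitError` carrying the partial transcript; the exception type and its `partial_messages` attribute are assumptions, not a real SDK API.

```python
import logging

logger = logging.getLogger(__name__)


class ToolCallLimitError(Exception):
    """Hypothetical stand-in for a provider's 'too many tool calls' error."""

    def __init__(self, partial_messages: list[dict]):
        super().__init__("tool call limit reached")
        self.partial_messages = partial_messages


MAX_RECOVERY_ATTEMPTS = 3


async def execute_with_recovery(pipeline, conversation):
    messages = list(conversation.messages)
    for attempt in range(MAX_RECOVERY_ATTEMPTS + 1):
        try:
            async for event in pipeline.execute(messages):
                yield event
            return  # finished without hitting the limit
        except ToolCallLimitError as exc:
            if attempt == MAX_RECOVERY_ATTEMPTS:
                raise  # bounded retries: don't loop forever
            # Log for observability: track how often recovery happens.
            logger.warning("tool call limit hit, recovery attempt %d", attempt + 1)
            # Preserve everything that happened before the limit and tell the
            # agent what happened so it doesn't start over.
            messages = exc.partial_messages + [{
                "role": "system",
                "content": "You hit the tool call limit. "
                           "Continue from where you left off.",
            }]
```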

Pattern 4: Parallel Agent Execution

When agents don't depend on each other's outputs, run them concurrently to reduce latency. The orchestrator's planning phase is a perfect example: planning, topic generation, and guardrail checks are independent, so running them in parallel cuts response time by ~60%.

When to use parallel execution:

  • Independent operations: Operations that don't need each other's outputs

  • Latency optimization: When waiting for sequential execution is painful for users

  • Error isolation: When one operation's failure shouldn't block others

When NOT to use parallel execution:

  • Sequential dependencies: When one agent needs another's output as input

  • Rate limiting concerns: When parallel requests would exceed API quotas

  • Cost sensitivity: When API call cost matters more than latency

Implementation considerations:

  • Emit progress events as each parallel agent completes (users see partial progress)

  • Decide how to handle partial failures (continue with available results, or fail completely?)

  • Use a post-hook to aggregate results once all parallel stages finish

  • Consider timeouts for individual parallel stages (one slow agent shouldn't block everything)

The key tradeoff: parallel execution adds complexity (more states to handle, partial failure modes) but can dramatically improve user experience for multi-step workflows.
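
The sketch below illustrates these considerations. It assumes each agent exposes an awaitable `run()` method and a `name` attribute; the per-stage timeout and event shapes are also assumptions.

```python
import asyncio


async def _run_one(name, agent, conversation, timeout_s):
    """Run a single agent with a timeout; return (name, result, error)."""
    try:
        result = await asyncio.wait_for(agent.run(conversation), timeout_s)
        return name, result, None
    except Exception as exc:  # includes asyncio.TimeoutError
        return name, None, exc


async def run_parallel(agents, conversation, timeout_s: float = 30.0):
    tasks = [
        asyncio.create_task(_run_one(agent.name, agent, conversation, timeout_s))
        for agent in agents
    ]
    results = {}
    # Emit progress as each agent finishes rather than waiting for all of them.
    for finished in asyncio.as_completed(tasks):
        name, result, error = await finished
        if error is not None:
            # Partial failure: surface it and keep going with the rest.
            yield {"type": "error", "message": f"{name} failed: {error}",
                   "recoverable": True}
        else:
            results[name] = result
            yield {"type": "status", "stage": f"{name}:done"}
    # A post-hook (or the caller) can aggregate once everything has finished.
    yield {"type": "object", "name": "parallel_results", "data": results}
```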

Pattern 5: Post-Hooks for Stage Transitions

Post-hooks are functions that run after each pipeline stage completes, before the next stage starts. They're the "glue code" that makes pipelines flexible and powerful.

What post-hooks do:

  • State management: Update todo statuses, mark steps complete, track progress

  • Context injection: Add information needed by the next stage (like "you're working on todo #3")

  • Validation: Check that outputs are well-formed before proceeding

  • Early termination: Decide whether to continue or stop the pipeline

  • Error handling: Convert failures into recoverable states

Example use case: In a multi-step research pipeline:

  1. Stage 1 (research) completes

  2. Post-hook marks the research todo as "complete"

  3. Post-hook checks if there's a next todo (writing)

  4. Post-hook injects context: "You're writing step 2 of 3, based on the research above"

  5. Stage 2 (writing) starts with full context

Why this matters: Without post-hooks, you'd need either:

  • A monolithic agent that handles all steps (hard to test, no specialization)

  • Complex coordinator logic between stages (fragile, hard to understand)

  • Agents that somehow "know" what to do next (breaks abstraction boundaries)

Post-hooks keep stages independent while enabling sophisticated coordination.
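
As a sketch, the post-hook that runs after the research stage in the example above might look like this; the state dictionary and its field names are assumptions.

```python
async def after_research_stage(state: dict) -> dict:
    # State management: mark the research todo complete.
    state["todos"][state["current_index"]]["status"] = "complete"

    # Early termination: stop the pipeline if nothing is left to do.
    next_index = state["current_index"] + 1
    if next_index >= len(state["todos"]):
        state["continue"] = False
        return state

    # Context injection: tell the next (writing) stage what to work on.
    state["current_index"] = next_index
    state["next_instruction"] = (
        f"You're writing step {next_index + 1} of {len(state['todos'])}, "
        "based on the research above."
    )
    state["continue"] = True
    return state
```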

Pattern 6: Graceful Error Handling at Every Layer

Different layers handle different kinds of errors. Don't catch all exceptions at the top level—handle each error at the layer where it's most meaningful.

Layer 1 (HTTP Stream): Connection errors, client disconnections, timeouts

  • Check for disconnection throughout execution

  • Use centralized error formatting to ensure consistent error responses

  • Always yield what's been completed before the error

Layer 2 (Executor): Orchestration errors, recovery from rate limits, pipeline failures

  • Catch and retry on expected failures (tool call limits, temporary model errors)

  • Stop execution if planning failed (no point executing a bad plan)

  • Preserve conversation state for potential manual recovery

Layer 3 (Orchestrator): Planning errors, validation failures, guardrail violations

  • Validate all agent outputs before storing them

  • Emit structured error events (not just log messages)

  • Store stop reasons when guardrails trigger

Layer 4 (Pipeline): Stage-specific errors, tool failures, agent errors

  • Let post-hooks decide how to handle stage errors (continue vs. abort)

  • Stream partial results before the error

  • Include enough context to resume or retry

Error handling principles:

  • Fail at the right level: Don't catch everything at the top and lose context

  • Preserve partial success: Users should see what worked before the failure

  • Provide recovery paths: Include state snapshots, retry tokens, or clear next steps

  • Log exhaustively: Production debugging requires full context (user ID, conversation ID, stage, inputs)

Design Principles That Emerged

After building and refining this system, several key principles became clear. These aren't rules we started with—they're patterns that emerged from solving real production problems.

1. Composition Over Configuration

Instead of configuring a monolithic framework, explicitly compose small, independent components. Each component should have a clear interface and single responsibility.

What this looks like:

  • The executor depends on an orchestrator, pipeline builder, transformer, and processor

  • Each component can be instantiated and tested independently

  • Swapping implementations is straightforward (provide a different orchestrator)

  • No hidden behavior—all dependencies are explicit

Benefits:

  • Easy to test individual components in isolation

  • Clear dependency injection makes the system understandable

  • Simple to swap implementations for different use cases

  • No framework magic—you see exactly what's happening

The alternative—large configuration objects that control framework behavior—tends to hide complexity rather than managing it. When something goes wrong, you're debugging the framework instead of your code.

2. Make Streams Observable

Every layer should yield structured events that can be logged, monitored, and traced. Don't just stream final output—stream progress, state transitions, and metadata.

Event types to emit:

  • Status events (stage starts/completions)

  • System events (internal transitions)

  • Object events (structured data like plans, citations)

  • Error events (with recovery information)

This enables:

  • Real-time monitoring: Track agent progress in dashboards, measure stage durations

  • Debugging: Replay event streams to understand failures without reproducing them

  • Analytics: Measure completion rates, error rates, bottlenecks

  • User experience: Show users exactly what's happening (progress bars, stage names)
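
For example, a thin pass-through consumer can measure stage durations from status events without touching any agent logic. The sketch below assumes stages are named with `:start`/`:done` suffixes, which is an invented convention.

```python
import time


async def measure_stages(events):
    """Pass events through unchanged while timing each stage."""
    started: dict[str, float] = {}
    async for event in events:
        if event.get("type") == "status":
            stage = event["stage"]
            if stage.endswith(":start"):
                started[stage.removesuffix(":start")] = time.monotonic()
            elif stage.endswith(":done"):
                name = stage.removesuffix(":done")
                if name in started:
                    duration = time.monotonic() - started.pop(name)
                    # Emit a metric here (printed for brevity).
                    print(f"stage {name} took {duration:.2f}s")
        yield event
```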

3. Separate Data Flow from Control Flow

The pipeline manages control flow (what runs when, in what order). Data transformations happen in post-hooks. This separation makes both easier to understand and modify.

Control flow: The pipeline structure defines order

  • Stage 1: Planning agent

  • Stage 2: Execution agent

  • Stage 3: Review agent

Data flow: Post-hooks define transformations

  • Post-hook 1: Extract tasks from plan

  • Post-hook 2: Aggregate execution results

  • Post-hook 3: Format for presentation

This separation enables:

  • Visualizing execution flow (just look at the pipeline structure)

  • Testing data transformations in isolation (unit test post-hooks)

  • Modifying execution order without touching data processing logic

4. Optimize for Recovery, Not Just Success

Production agents hit errors constantly: rate limits, timeouts, model errors, tool failures. Design for recovery from the start, not as an afterthought.

Recovery strategies by error type:

  • Rate limits / timeouts: Retry with exponential backoff

  • Tool call limits: Resume from saved state

  • Tool failures: Continue with partial results or skip that tool

  • Model errors: Fall back to simpler prompt or different model

  • Validation failures: Ask the agent to fix its output

The mindset shift: Don't ask "will this work?" Ask "when this fails, can the user recover without starting over?"

5. Type Safety Prevents Runtime Errors

Strong typing catches entire classes of bugs before they reach production. Define explicit types for all events, messages, and data structures.

Benefits:

  • Impossible to forget to handle an event type (compiler catches it)

  • Refactoring is safe (type errors surface immediately)

  • IDE autocomplete works perfectly

  • Runtime type errors become compile-time errors

  • Self-documenting code (types show what's possible)

Where to use strong typing:

  • Event stream types (union of all possible events)

  • Agent messages (user messages, system messages, tool calls, tool results)

  • Configuration objects (validated at startup, not at runtime)

  • API responses (use Pydantic or similar for validation)

Common Pitfalls and How to Avoid Them

Pitfall 1: Not Handling Client Disconnection

Problem: Continuing to process after the client disconnects wastes server resources and can leave orphaned database transactions.

Solution: Check for disconnection in the event loop before processing each event. If the client is gone, stop immediately and clean up resources.

Pitfall 2: Losing Context on Errors

Problem: Throwing away partial results when an error occurs frustrates users who have to start over completely.

Solution: Stream results immediately as they're produced, so users see progress even if later stages fail. When errors occur, include what was completed and provide a way to resume (resume tokens, state snapshots).

Pitfall 3: Tight Coupling Between Stages

Problem: Stages that directly depend on each other's internal structure are hard to test and modify. Changing one stage breaks others.

Solution: Use post-hooks to define explicit interfaces between stages. Stage 2 shouldn't reach into Stage 1's internal structure—instead, Stage 1's post-hook should explicitly provide what Stage 2 needs.

Pitfall 4: Hidden State in Closures

Problem: Closures can capture mutable state that changes unexpectedly. This is especially common when creating post-hooks in loops.

Solution: Use closure factories that capture immutable values. Instead of creating closures directly in a loop (where the loop variable changes), create a factory function that returns a closure with the correct captured value.
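
A quick illustration of the Python closure pitfall and the factory fix; the hook shape here is purely illustrative.

```python
todos = ["research", "write", "review"]

# Buggy: each lambda captures the *variable* `todo`, and by the time the
# hooks run, that variable holds the last value ("review").
buggy_hooks = [lambda: f"completed: {todo}" for todo in todos]


# Fixed: a factory captures the *value* for each iteration.
def make_hook(todo_value: str):
    return lambda: f"completed: {todo_value}"


fixed_hooks = [make_hook(todo) for todo in todos]

print([hook() for hook in buggy_hooks])  # 'completed: review' three times
print([hook() for hook in fixed_hooks])  # research, write, review
```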

Pitfall 5: Over-Engineering Early

Problem: Adding abstractions before understanding the problem leads to wrong abstractions that are hard to remove later.

Solution: Start with the simplest thing that works—even just a single agent with basic error handling. Add complexity incrementally as you encounter real problems:

  1. Multiple users? Add authentication and context isolation

  2. Slow responses? Add streaming

  3. Tool call limits? Add recovery

  4. Complex workflows? Add pipelines

  5. Parallel operations? Add parallel execution

Resist the temptation to build the "perfect" system upfront. The right abstractions emerge from solving real problems.

When to Use These Patterns

These patterns are most valuable when:

  • Streaming is essential: Users need to see progress in real-time

  • Workflows are complex: Multiple agents need to coordinate

  • Reliability matters: Errors should be recoverable, not fatal

  • Scale is important: The system needs to handle many concurrent requests

  • Observability is required: You need to monitor and debug production issues

These patterns may be overkill when:

  • Batch processing is acceptable: Users can wait for complete results

  • Single-step workflows: One agent call is sufficient

  • Prototyping: You're still figuring out the problem space

  • Low-stakes applications: Errors aren't costly to users

Summary

Building production-ready streaming agents requires more than just calling LLM APIs. The key lessons we've learned:

  1. Layer your architecture: Separate concerns (HTTP streaming, orchestration, execution, processing)

  2. Make streams type-safe: Use structured events, not raw text or JSON

  3. Design for failure: Recovery should be automatic, not manual

  4. Compose from small pieces: Avoid monolithic frameworks

  5. Parallelize independent work: But only when it actually improves latency

  6. Make execution observable: Every layer should emit structured events

  7. Test each layer independently: Composition is easier to test than integration

Most importantly: start simple and add complexity only when you have a clear reason. Many production agent systems are just a few hundred lines of well-structured code wrapping LLM API calls. The patterns described here emerged from real production requirements, not from architectural astronautics.
