State Management Patterns for AI Agents
State management is what separates agents that crash mid-task from agents that recover gracefully. Learn to track conversation context, task progress, and environmental knowledge — then persist it strategically so agents can resume from failures without losing work or context.
The Three Layers of Agent State
LLMs are stateless. Every API call starts fresh with no memory of what came before. This is fine for single-turn completions, but agents running multi-step workflows need to remember where they are, what they've learned, and what they've already tried.
Agent state isn't monolithic — it breaks into three distinct categories, each with different persistence requirements:
| Layer | What It Tracks | Persistence Needs | Example |
|---|---|---|---|
| Conversation State | Message history, current thread context | Per-session, short-lived | Chat history in a support agent |
| Task Progress | Current step, completed actions, pending work | Must survive crashes | Code review agent tracking which files reviewed |
| Environmental Knowledge | User preferences, learned patterns, entity relationships | Cross-session, long-term | User's preferred coding style, past decisions |
Checkpointing: The Core Pattern
Checkpointing saves a snapshot of agent state at defined points during execution. When something fails, you reload the last checkpoint and resume — rather than starting over. LangGraph's checkpointer architecture makes this elegant:
Python · LangGraph
```python
from typing import Annotated, TypedDict

from langchain_core.messages import HumanMessage
from langgraph.checkpoint.postgres import PostgresSaver
from langgraph.graph import StateGraph
from langgraph.graph.message import add_messages

# Define state schema with TypedDict
class AgentState(TypedDict):
    messages: Annotated[list, add_messages]
    current_step: str
    completed_files: list[str]
    user_preferences: dict

# Compile with checkpointer for persistence
# (DATABASE_URL and `workflow`, a StateGraph, are assumed to be defined elsewhere)
checkpointer = PostgresSaver.from_conn_string(DATABASE_URL)
graph = workflow.compile(checkpointer=checkpointer)

# Every invocation auto-saves state to Postgres
result = graph.invoke(
    {"messages": [HumanMessage(content="Review the auth module")]},
    config={"configurable": {"thread_id": "review-123"}},
)
```
The thread_id is critical — it's the key that links related checkpoints together. Think of threads as separate conversations or workflow runs, each maintaining independent state.
Event Sourcing vs. Snapshot Checkpointing
📸 Snapshot Checkpointing
- Saves full state at each step
- Simple recovery — just load latest
- Storage grows with state size
- LangGraph default; right for most agents
📜 Event Sourcing
- Stores sequence of state changes
- Full audit trail, can replay
- Recovery requires replaying events
- Best for audit-critical workflows
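The difference can be sketched in a few lines: a snapshot store keeps full state copies, while an event-sourced store persists changes and rebuilds state by replay. The event names and state shape below are illustrative, not from any library:

```python
from functools import reduce

# Event sourcing: persist state *changes*, rebuild current state by replay
events = [
    {"type": "file_completed", "file": "auth.py"},
    {"type": "file_completed", "file": "api.py"},
    {"type": "step_changed", "step": "summarize"},
]

def apply(state: dict, event: dict) -> dict:
    """Pure reducer: fold one event into the state."""
    if event["type"] == "file_completed":
        return {**state, "completed_files": state["completed_files"] + [event["file"]]}
    if event["type"] == "step_changed":
        return {**state, "current_step": event["step"]}
    return state

initial = {"current_step": "review", "completed_files": []}
recovered = reduce(apply, events, initial)

# Replay yields the same state a snapshot checkpoint would have saved
assert recovered == {"current_step": "summarize",
                     "completed_files": ["auth.py", "api.py"]}
```

Recovery cost is the trade: a snapshot loads in one read, while event sourcing replays every event but keeps the full audit trail for free.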
Thread vs. Store: Two Persistence Concepts
LangGraph separates these clearly:
- Thread persistence (checkpointers) — state within a single conversation. Tied to thread_id.
- Cross-thread memory (stores) — state spanning multiple conversations: preferences, learned patterns, entity knowledge.
Python · Cross-thread Store
```python
from langgraph.store.memory import InMemoryStore

store = InMemoryStore()

# Save user preference — applies to ALL threads
store.put(("user", user_id, "preferences"), "style", {"tone": "concise"})

# Any thread for this user can retrieve it
prefs = store.get(("user", user_id, "preferences"), "style")
```
- Agent state has three layers: conversation (per-session), task progress (crash-safe), environmental (cross-session)
- Checkpointing saves full state snapshots, enabling crash recovery without replaying entire workflows
- Thread persistence isolates conversations; cross-thread stores share user-level knowledge across sessions
Architecture Decisions You'll Face
State management decisions directly impact three things: reliability (does the agent recover from failures?), user experience (does it remember context?), and operational cost (how much state are you storing?).
The Redis vs. Postgres Decision
This is the most common architectural debate you'll face in production:
| Requirement | Redis | Postgres |
|---|---|---|
| Sub-millisecond reads | ✓ | |
| Complex cross-session queries | | ✓ |
| ACID durability guarantees | | ✓ |
| Simple key-value lookups | ✓ | |
| Semantic search with pgvector | | ✓ |
| Ephemeral session cache | ✓ | |
Hybrid State Manager Pattern
Python · Hybrid Pattern
```python
from langgraph.checkpoint.postgres import PostgresSaver

class HybridStateManager:
    def __init__(self, redis_client, pg_connection):
        self.redis = redis_client
        self.pg_saver = PostgresSaver.from_conn_string(pg_connection)
        self.checkpoint_interval = 5  # Persist to Postgres every 5 steps

    async def get_state(self, thread_id: str) -> AgentState:
        # Try Redis first (fast path)
        cached = await self.redis.get(f"state:{thread_id}")
        if cached:
            # Assumes AgentState is a Pydantic model here (parse_raw / .json)
            return AgentState.parse_raw(cached)
        # Fall back to Postgres (durable path)
        return await self.pg_saver.get_tuple(thread_id)

    async def save_state(self, thread_id: str, state: AgentState, step: int):
        # Always update Redis
        await self.redis.set(f"state:{thread_id}", state.json())
        # Checkpoint to Postgres periodically
        if step % self.checkpoint_interval == 0:
            await self.pg_saver.put(thread_id, state)
```
- State architecture decisions directly impact reliability, UX, and operational cost
- Production systems like Stripe's Agentic Commerce use layered state: fast caches for sessions, durable stores for transactions
- The hybrid Redis + Postgres pattern balances sub-millisecond reads with ACID durability
Why Engineers Get This Wrong
Mistake 1: Treating All State the Same
Putting everything in one store forces a bad trade: all-Postgres makes every hot-path read slow, while all-Redis risks losing task progress on eviction.
Mistake 2: Checkpointing Too Infrequently
Checkpointing only at task boundaries throws away every intermediate step when the agent crashes; checkpoint after each significant action instead.
Mistake 3: Ignoring Cross-Thread Memory Until It's Too Late
Retrofitting user-level memory onto a thread-only design is painful; define all the scopes upfront, even if some stay empty at first.
Python · Design Upfront
```python
from langchain_core.vectorstores import VectorStore
from langgraph.checkpoint.base import BaseCheckpointSaver
from langgraph.store.base import BaseStore

class AgentConfig:
    # Thread-scoped (isolated per conversation)
    thread_checkpointer: BaseCheckpointSaver
    # User-scoped (shared across threads)
    user_store: BaseStore  # Even if empty initially
    # Org-scoped (shared across users)
    org_knowledge_base: VectorStore  # For RAG
```
- Treating all state the same leads to either slow reads (all Postgres) or lost progress (all Redis)
- Checkpoint after every significant action — task-boundary-only checkpointing loses intermediate progress
- Design cross-thread memory architecture upfront, even if you don't populate it initially
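The checkpoint-frequency point is cheap to act on. A stdlib-only sketch, where save_checkpoint is a hypothetical stand-in for whatever persistence call your stack provides:

```python
def run_task(files: list[str], save_checkpoint) -> list[str]:
    """Process files, checkpointing after every significant action."""
    completed: list[str] = []
    for f in files:
        completed.append(f)               # the "significant action"
        save_checkpoint(list(completed))  # per-action, not task-boundary-only
    return completed

checkpoints: list[list[str]] = []
run_task(["auth.py", "api.py", "db.py"], checkpoints.append)

# A crash after the second file still leaves a usable checkpoint behind
assert checkpoints[1] == ["auth.py", "api.py"]
# Task-boundary-only checkpointing would have saved nothing until the very end
assert len(checkpoints) == 3
```

The extra writes cost a few milliseconds per step; re-running a half-finished workflow costs whole LLM calls.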
Building a Fault-Tolerant Code Review Agent
Let's implement state management for an agent that reviews pull requests file-by-file. Requirements: resume from last reviewed file if crashed, remember user's style preferences across PRs, and track review decisions for audit trail.
Step 1: Define the State Schema
Python · State Schema
```python
from typing import Annotated, TypedDict

from langgraph.graph.message import add_messages

class ReviewState(TypedDict):
    # Conversation state (thread-scoped)
    messages: Annotated[list, add_messages]
    # Task progress (must survive crashes)
    pr_url: str
    files_to_review: list[str]
    files_completed: list[str]
    current_file: str | None
    review_comments: dict[str, list[str]]
    # Environmental knowledge (cross-thread)
    user_style_prefs: dict
```
Step 2: Recovery Logic
Python · Crash Recovery
```python
async def resume_or_start_review(infra, pr_url, user_id):
    thread_id = f"review-{hash(pr_url)}"

    # Check for an existing checkpoint
    checkpoint = await infra.checkpointer.get_tuple(
        {"configurable": {"thread_id": thread_id}}
    )
    if checkpoint and checkpoint.checkpoint:
        state = checkpoint.checkpoint
        completed = len(state.get("files_completed", []))
        total = len(state.get("files_to_review", []))
        print(f"Resuming review: {completed}/{total} files done")
        return state

    # Fresh start
    files = await fetch_pr_files(pr_url)
    user_prefs = await infra.load_user_prefs(user_id)
    return ReviewState(
        messages=[], pr_url=pr_url,
        files_to_review=files, files_completed=[],
        current_file=files[0] if files else None,
        review_comments={}, user_style_prefs=user_prefs,
    )
```
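The review loop itself then just has to checkpoint after each file so the recovery logic above has something to resume from. A stdlib-only sketch, where review_file and the checkpoint callable are hypothetical stand-ins:

```python
def review_loop(state: dict, review_file, checkpoint) -> dict:
    """Review remaining files, checkpointing after each one."""
    remaining = [f for f in state["files_to_review"]
                 if f not in state["files_completed"]]
    for f in remaining:
        state["current_file"] = f
        state["review_comments"][f] = review_file(f)  # the "significant action"
        state["files_completed"].append(f)
        checkpoint(state)  # a crash here loses at most the current file
    state["current_file"] = None
    return state

# Simulate resuming a crashed run: auth.py was already reviewed
state = {
    "files_to_review": ["auth.py", "api.py"],
    "files_completed": ["auth.py"],
    "current_file": None,
    "review_comments": {"auth.py": ["LGTM"]},
}
saved = []
final = review_loop(state, lambda f: [f"nit in {f}"],
                    lambda s: saved.append(dict(s)))
assert final["files_completed"] == ["auth.py", "api.py"]  # only api.py re-reviewed
```

Because completed files are skipped, replaying the loop after a crash never duplicates work.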
Before vs. After
| Scenario | ❌ No State Management | ✅ Layered State |
|---|---|---|
| Agent crashes at file 8/10 | Restart from file 1, duplicate work | Resume from file 8, no waste |
| User returns next week | "Who are you? What's your style?" | Loads saved preferences seamlessly |
| Debug production issue | No visibility into what happened | Full checkpoint history, replay state |
| Redis eviction during peak | All session state lost | Fall back to Postgres checkpoints |
Key Takeaways
- Define state schema with explicit categories: conversation (thread-scoped), task progress (crash-safe), environmental (cross-thread)
- Layer persistence: Redis for hot session cache, Postgres for durable checkpoints, user store for cross-session prefs
- Recovery logic should check for existing checkpoints before starting fresh — always