AI Observability and Monitoring: Your APM Is Lying to You
Your dashboards say everything's green. Users say the chatbot is hallucinating. Here's why traditional monitoring is blind to AI failures — and how to build the observability stack that actually catches them.
A user reports your chatbot gave a dangerously wrong answer. You pull up Datadog. Response times: normal. Error rates: zero. Status codes: all 200s. Everything is green. Everything is fine. Except it isn't — and your monitoring has no idea, because it was never designed to watch what an LLM actually says.
The Gap Your APM Can't See
Traditional application monitoring tracks requests, latencies, and error rates. That's necessary but wildly insufficient for AI systems. When your summarization feature suddenly costs 3x more or your chatbot invents facts, standard APM dashboards stay perfectly green.
LLM applications introduce entirely new dimensions that don't exist in traditional software. The request body matters — not just whether it succeeded, but what it said. The cost isn't predictable compute time — it's token-by-token pricing that varies by model. And the hardest dimension of all: output quality has no HTTP status code.
What your APM sees: HTTP status 200. Latency 340ms. No errors. Service healthy. Ship it.
What actually happened: Prompt truncated context. Response hallucinated a date. Cost $0.12 for one call. Relevance score dropped 20% since Tuesday.
The Three Pillars of AI Observability
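Modern AI observability platforms handle three core functions. Think of them as the "what happened," "what did it cost," and "did it actually work" of every LLM call. Ownership splits into three phases: you set up the instrumentation, the platform captures and scores every call, and you decide what to do with the results.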
The human parts are non-negotiable. You decide what to instrument and how to tag traces for attribution. You act on alerts and decide whether to roll back, optimize, or investigate further. The platform handles the middle — capturing traces, computing costs, running automated evaluations — but never replaces the bookends of setup and decision-making.
What Traditional APM Cannot See
AI introduces entire categories of failure that traditional monitoring tools are structurally blind to. They span prompt content, token costs, model behavior, and output quality.
Gartner predicts that by 2028, 60% of software engineering teams will use AI evaluation and observability platforms — up from just 18% in 2025. The teams adopting now are building a significant operational advantage.
Three Misconceptions That Cost Teams Weeks
Trap 1: "Our Existing APM Handles It"
Engineers comfortable with Datadog or New Relic assume adding LLM calls is just another service to monitor. They instrument latency and error rates, call it done. Then a user reports the chatbot "sounds different." The APM shows perfect health. The actual issue? A prompt was modified and quality degraded — something traditional APM doesn't track at all.
The fix: Keep traditional APM for infrastructure health. Layer purpose-built AI observability on top for prompt content, token costs, and output quality.
Trap 2: "We'll Add Observability Later"
The classic trap: ship the MVP fast, instrument later. But "later" never comes because retrofitting tracing into existing code is painful, you lack baseline data to detect regressions, and cost spikes hit before you have attribution.
One startup shipped an AI feature that silently cost $2,000/day because they deferred monitoring. They discovered it three weeks later when the invoice arrived. Adding observability from day one costs hours. Discovering problems through billing costs thousands.
Trap 3: "Token Counting Equals Cost Tracking"
Naive cost tracking sums tokens and multiplies by a single price. But real cost attribution needs per-model pricing (GPT-4 vs. Claude vs. GPT-4o-mini differ dramatically), separate input vs. output token counts, tracking cached vs. fresh responses, multi-step workflow attribution, and accounting for failed requests that still consume tokens. Teams that implement simple token counting often find their estimates off by 30–50% from actual invoices.
The fix: Always capture the model name in trace metadata. Track input and output tokens separately. Attribute costs across the full multi-step chain, not just individual API calls.
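As a minimal sketch of that fix, the code below prices a single call from its model name and separate fresh-input, cached-input, and output token counts, then sums a multi-step chain. The price table, the model names in it, and the `CallUsage`, `call_cost`, and `chain_cost` names are illustrative assumptions, not any vendor's real rates or API; substitute the per-million-token prices from your provider's current price sheet.

```python
from dataclasses import dataclass

# Hypothetical USD prices per 1M tokens; replace with your provider's real rates.
PRICES = {
    "model-large": {"input": 5.00, "cached_input": 2.50, "output": 15.00},
    "model-small": {"input": 0.20, "cached_input": 0.10, "output": 0.80},
}


@dataclass
class CallUsage:
    model: str
    input_tokens: int            # fresh (uncached) prompt tokens
    output_tokens: int
    cached_input_tokens: int = 0


def call_cost(usage: CallUsage) -> float:
    """Cost of one LLM call, pricing each token class separately."""
    price = PRICES[usage.model]  # KeyError on unknown models instead of a silent guess
    return (
        usage.input_tokens * price["input"]
        + usage.cached_input_tokens * price["cached_input"]
        + usage.output_tokens * price["output"]
    ) / 1_000_000


def chain_cost(calls: list[CallUsage]) -> float:
    """Total cost of a multi-step workflow; include retried or failed calls too."""
    return sum(call_cost(c) for c in calls)


chain = [
    CallUsage("model-small", input_tokens=1_000, output_tokens=200),
    CallUsage("model-large", input_tokens=4_000, output_tokens=1_000,
              cached_input_tokens=2_000),
]
total = chain_cost(chain)
print(f"chain cost: ${total:.5f}")
```

Keeping the model name on every `CallUsage` record is what lets the same chain be re-priced when rates change or a step is moved to a cheaper model.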
Instrumenting an AI Service from Scratch with Langfuse
Let's add full observability — tracing, cost tracking, and automated quality evaluation — to a production AI service, step by step.
Step 1 — Basic Tracing
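Install the SDK and wrap your LLM calls. The @observe() decorator captures each function's inputs, outputs, and latency; importing openai through Langfuse's drop-in wrapper adds the model name, token counts, and cost to the same trace:

```python
# Import paths below are for the Langfuse Python SDK v2; newer versions may differ.
from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context
from langfuse.openai import openai  # drop-in wrapper that logs model, tokens, and cost

langfuse = Langfuse()  # reads LANGFUSE_* credentials from the environment


@observe()
def answer_question(user_id: str, question: str) -> str:
    # Tag with user and feature for attribution
    langfuse_context.update_current_trace(
        user_id=user_id,
        metadata={"feature": "qa-bot"},
    )
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content
```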
Step 2 — Multi-Step Agent Tracing
For agentic workflows, nest observations to see the full reasoning chain as a tree of spans, each with its own timing and cost.
In the Langfuse UI, nested @observe() decorators render as a parent span containing child spans. You see research_agent containing search_web → summarize_results → generate_answer, each with independent latency and token costs. This is what makes AI debugging take minutes instead of hours.
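To make the span-tree idea concrete, here is a minimal stand-alone sketch of how nested decorators can build a parent/child trace. It is not Langfuse's implementation: the `traced` decorator and `Span` class are hypothetical stand-ins, and the step functions reuse the names from the description above with placeholder bodies instead of real search or LLM calls.

```python
import contextvars
import functools
import time
from dataclasses import dataclass, field

# The span currently being recorded, if any (None at the top level).
_current_span = contextvars.ContextVar("current_span", default=None)


@dataclass
class Span:
    name: str
    children: list = field(default_factory=list)
    duration_s: float = 0.0


def traced(func):
    """Record each call as a Span nested under whichever span is active."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        span = Span(func.__name__)
        parent = _current_span.get()
        if parent is not None:
            parent.children.append(span)
        token = _current_span.set(span)
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            span.duration_s = time.perf_counter() - start
            _current_span.reset(token)
            wrapper.last_span = span  # kept only so this demo can inspect the tree
    return wrapper


@traced
def search_web(query):
    return [f"result for {query}"]  # placeholder for a real search call


@traced
def summarize_results(results):
    return " ".join(results)  # placeholder for an LLM call


@traced
def generate_answer(summary):
    return f"Answer based on: {summary}"  # placeholder for an LLM call


@traced
def research_agent(question):
    return generate_answer(summarize_results(search_web(question)))


answer = research_agent("what is tracing?")
root = research_agent.last_span
child_names = [child.name for child in root.children]
print(root.name, "->", child_names)
```

A real SDK would attach token counts and cost to each span as well; the point here is only that the decorator on the outer function becomes the parent, and every decorated call made inside it becomes a child.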
Step 3 — Cost Alerting
Query your trace data to build attribution dashboards. Alert when any feature exceeds its daily budget:
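```python
from datetime import datetime, timedelta, timezone

# Assumes the Langfuse Python SDK v2 query API. `langfuse` is the client created
# in Step 1; DAILY_BUDGET and send_alert come from your own setup.
yesterday = datetime.now(timezone.utc) - timedelta(days=1)

# Daily cost by feature (results are paginated; only the first page is read here)
traces = langfuse.fetch_traces(from_timestamp=yesterday).data

cost_by_feature = {}
for trace in traces:
    feature = (trace.metadata or {}).get("feature", "unknown")
    cost_by_feature[feature] = cost_by_feature.get(feature, 0) + (trace.total_cost or 0)

# Alert if any feature exceeds daily budget (features without a budget never alert)
for feature, cost in cost_by_feature.items():
    if cost > DAILY_BUDGET.get(feature, float("inf")):
        send_alert(f"{feature} exceeded budget: ${cost:.2f}")
```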
The Results
With full observability in place, debugging user complaints goes from "check logs, see 200 OK, add print statements, redeploy, wait" to "search by user ID, see exact prompt and response, diagnose in 5 minutes." Cost spikes surface in real-time dashboards instead of monthly invoices. Quality regressions trigger automated alerts within hours — not after weeks of user complaints.
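One way to turn evaluation scores into such an alert is to compare a recent window against a baseline window. The sketch below assumes you already export per-trace relevance scores between 0 and 1; the `relative_drop` and `regression_alert` helpers, the sample scores, and the 10% threshold are illustrative choices, not a Langfuse feature.

```python
from statistics import mean


def relative_drop(baseline: list[float], recent: list[float]) -> float:
    """Fractional drop of the recent mean score relative to the baseline mean."""
    base = mean(baseline)
    if base == 0:
        return 0.0
    return (base - mean(recent)) / base


def regression_alert(baseline, recent, threshold=0.10):
    """Return an alert message when quality fell by more than `threshold`, else None."""
    drop = relative_drop(baseline, recent)
    if drop > threshold:
        return f"Relevance dropped {drop:.0%} versus baseline"
    return None


baseline_scores = [0.80, 0.85, 0.75, 0.80]  # e.g. last week's evaluation scores
recent_scores = [0.60, 0.65, 0.70, 0.61]    # e.g. today's evaluation scores

message = regression_alert(baseline_scores, recent_scores)
if message:
    print(message)
```

In practice the windows would hold many more scores, and the threshold should be tuned so that normal day-to-day variation does not page anyone.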
Real Teams, Real Results
Stripe's engineering teams built granular cost attribution into their AI stack and discovered that 3% of users generated 40% of their AI costs. That single insight led to per-user rate limiting that saved thousands monthly — and it was completely invisible without per-user cost tagging in their traces.
LinkedIn's AI team runs continuous evaluation on their recommendation systems. Their monitoring detected a 15% relevance drop after a model update — before it affected engagement metrics. Without automated quality evaluation, that regression would have surfaced weeks later through declining user engagement, long after trust was damaged.
The common thread: AI observability didn't replace their existing infrastructure. Both teams still ran traditional APM for latency and errors. They layered AI-specific tracing, cost attribution, and quality evaluation on top — and it changed what they could see and how fast they could act.
Choosing the Right Platform
The market has consolidated around a few major players. The right choice depends on your stack, team size, and how much control you need over your data:
| Platform | Strengths | Pricing (2026) | Best For |
|---|---|---|---|
| Langfuse | Open-source, self-hostable, framework-agnostic | Free self-hosted; Cloud from $50/mo | Teams wanting control and flexibility |
| LangSmith | Deep LangChain integration, visual agent debugging | Free 5K traces; $39/seat/mo | LangChain-native applications |
| Helicone | 15-minute proxy setup, built-in caching | Free 100K req/mo; Pro $79/mo | Quick visibility, cost reduction |
| Arize AI | Enterprise telemetry, drift detection | Free 25K spans; Pro $50/mo | Large-scale production ML systems |
| Datadog LLM | Unified with existing APM stack | Usage-based | Teams already on Datadog |
Proxy vs. SDK: Proxy-based tools like Helicone give instant visibility by routing LLM traffic through a monitoring layer — just change your base URL. SDK-based tools like Langfuse require explicit instrumentation but capture custom metadata and multi-step workflows. Most production systems use both: a proxy for baseline coverage, an SDK for detailed instrumentation where it matters.
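The "just change your base URL" step can be sketched with the standard library alone. The function below builds an OpenAI-style chat completions request against whatever base URL it is given; the proxy hostname and the `X-Proxy-Auth` header are hypothetical placeholders, so check your proxy vendor's documentation for the real URL and headers.

```python
import json
import urllib.request

DIRECT_BASE_URL = "https://api.openai.com/v1"
PROXY_BASE_URL = "https://llm-proxy.example.com/v1"  # hypothetical monitoring proxy


def build_chat_request(base_url, api_key, payload, extra_headers=None):
    """Build (but do not send) a chat completions request for the given base URL."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        **(extra_headers or {}),
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


payload = {"model": "gpt-4o", "messages": [{"role": "user", "content": "Hello"}]}

direct = build_chat_request(DIRECT_BASE_URL, "sk-placeholder", payload)
proxied = build_chat_request(
    PROXY_BASE_URL,
    "sk-placeholder",
    payload,
    extra_headers={"X-Proxy-Auth": "proxy-key-placeholder"},  # hypothetical header
)
# Sending is left out on purpose: urllib.request.urlopen(proxied) would make the call.
print(direct.full_url)
print(proxied.full_url)
```

The request body is identical in both cases, which is why a proxy gives baseline coverage without touching application logic; only the destination and an extra credential change.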
Five Things to Take Away
Traditional APM is blind to AI. Prompt content, token costs, model behavior, and output quality require purpose-built observability.
Instrument from day one. Retrofitting is painful and you lose the baseline data needed to detect regressions.
Cost attribution must be granular. Track by user, feature, and model — with per-model pricing, token splits, and multi-step workflow accounting.
LLM-as-a-judge scales quality evaluation. It's 500–5,000x cheaper than human review, but validate it against real user outcomes periodically.
Remember the three pillars. Tracing captures what happened. Cost tracking shows what it cost. Evaluation determines if it worked.
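For the periodic validation mentioned in the LLM-as-a-judge takeaway, a simple starting point is the agreement rate between judge verdicts and real user feedback on the same traces. This sketch assumes both signals are available as booleans keyed by trace ID; the function name and sample data are made up for illustration.

```python
def judge_agreement(judge_pass: dict[str, bool], user_thumbs_up: dict[str, bool]) -> float:
    """Fraction of traces, among those with both signals, where judge and user agree."""
    shared = judge_pass.keys() & user_thumbs_up.keys()
    if not shared:
        raise ValueError("no traces have both a judge verdict and user feedback")
    matches = sum(judge_pass[t] == user_thumbs_up[t] for t in shared)
    return matches / len(shared)


judge_pass = {"t1": True, "t2": True, "t3": False, "t4": True, "t5": False}
user_thumbs_up = {"t1": True, "t2": False, "t3": False, "t4": True}  # t5 has no feedback

agreement = judge_agreement(judge_pass, user_thumbs_up)
print(f"judge/user agreement: {agreement:.0%}")
```

A falling agreement rate suggests the judge prompt or rubric needs recalibration before its scores are trusted to drive alerts.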
Build Your AI Skills Systematically
This article is part of the AI Fluens advanced software engineering track.
