A user reports your chatbot gave a dangerously wrong answer. You pull up Datadog. Response times: normal. Error rates: zero. Status codes: all 200s. Everything is green. Everything is fine. Except it isn't — and your monitoring has no idea, because it was never designed to watch what an LLM actually says.

This is the AI observability gap. And nearly every team hits it.

The Gap Your APM Can't See

Traditional application monitoring tracks requests, latencies, and error rates. That's necessary but wildly insufficient for AI systems. When your summarization feature suddenly costs 3x more or your chatbot invents facts, standard APM dashboards stay perfectly green.

LLM applications introduce entirely new dimensions that don't exist in traditional software. The request body matters — not just whether it succeeded, but what it said. The cost isn't predictable compute time — it's token-by-token pricing that varies by model. And the hardest dimension of all: output quality has no HTTP status code.

✕ Traditional APM sees

HTTP status 200. Latency 340ms. No errors. Service healthy. Ship it.

✓ AI observability sees

Prompt truncated context. Response hallucinated a date. Cost $0.12 for one call. Relevance score dropped 20% since Tuesday.

The Three Pillars of AI Observability

Modern AI observability platforms handle three core functions. Think of them as the "what happened," "what did it cost," and "did it actually work" of every LLM call. Here's who owns each phase:

Instrument → Trace → Attribute Cost → Evaluate → Act

🔧 Instrument — Human
🔎 Trace — Platform
💰 Attribute Cost — Both
Evaluate — LLM Judge
🚨 Act — Human

The human parts are non-negotiable. You decide what to instrument and how to tag traces for attribution. You act on alerts and decide whether to roll back, optimize, or investigate further. The platform handles the middle — capturing traces, computing costs, running automated evaluations — but never replaces the bookends of setup and decision-making.

What Traditional APM Cannot See

AI introduces entire categories of failure that traditional monitoring tools are structurally blind to:

💬 Prompt Content & Quality — A prompt gets modified and quality degrades. APM shows 200 OK. Without capturing the full prompt and response, you can't even begin to investigate.

💰 Token-Level Cost Attribution — GPT-4 vs. GPT-4o-mini pricing differs dramatically. Input and output tokens are priced differently. Naive counting misses 30–50% of actual costs.

🎯 Output Relevance & Accuracy — The chatbot hallucinated a date. The summary omitted key facts. There's no HTTP status code for "wrong but confident." Quality needs its own evaluation layer.

🔀 Multi-Step Agent Traces — An agent chain calls three tools, retries twice, and hits a context window limit. Traditional APM sees one successful HTTP call. AI tracing sees the full reasoning tree.
📈 Market Signal

Gartner predicts that by 2028, 60% of software engineering teams will use AI evaluation and observability platforms — up from just 18% in 2025. The teams adopting now are building a significant operational advantage.

Three Misconceptions That Cost Teams Weeks

Trap 1: "Our Existing APM Handles It"

Engineers comfortable with Datadog or New Relic assume adding LLM calls is just another service to monitor. They instrument latency and error rates, call it done. Then a user reports the chatbot "sounds different." The APM shows perfect health. The actual issue? A prompt was modified and quality degraded — something traditional APM doesn't track at all.

The fix: Keep traditional APM for infrastructure health. Layer purpose-built AI observability on top for prompt content, token costs, and output quality.

Trap 2: "We'll Add Observability Later"

The classic trap: ship the MVP fast, instrument later. But "later" never comes because retrofitting tracing into existing code is painful, you lack baseline data to detect regressions, and cost spikes hit before you have attribution.

🚨 The $42,000 Lesson

One startup shipped an AI feature that silently cost $2,000/day because they deferred monitoring. They discovered it three weeks later when the invoice arrived. Adding observability from day one costs hours. Discovering problems through billing costs thousands.

Trap 3: "Token Counting Equals Cost Tracking"

Naive cost tracking sums tokens and multiplies by a single price. But real cost attribution needs per-model pricing (GPT-4 vs. Claude vs. GPT-4o-mini differ dramatically), separate input vs. output token counts, tracking cached vs. fresh responses, multi-step workflow attribution, and accounting for failed requests that still consume tokens. Teams that implement simple token counting often find their estimates off by 30–50% from actual invoices.

The fix: Always capture the model name in trace metadata. Track input and output tokens separately. Attribute costs across the full multi-step chain, not just individual API calls.
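To make that concrete, here's a minimal sketch of per-model, input/output-split cost accounting. The price table and trace records are illustrative placeholders, not real rates — substitute your provider's current price sheet:

```python
# Illustrative per-1M-token prices (placeholders -- check your provider's price sheet)
PRICES = {
    "gpt-4o":      {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one call: separate input/output rates, per model."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

def chain_cost(steps: list[dict]) -> float:
    """Attribute cost across a multi-step chain, including failed
    steps -- a request that errors mid-generation still bills its tokens."""
    return sum(
        call_cost(s["model"], s["input_tokens"], s["output_tokens"])
        for s in steps
    )

steps = [
    {"model": "gpt-4o",      "input_tokens": 1200, "output_tokens": 300},
    {"model": "gpt-4o-mini", "input_tokens": 4000, "output_tokens": 800},
    # Failed retry: tokens were consumed before the error
    {"model": "gpt-4o",      "input_tokens": 1200, "output_tokens": 50},
]
total = chain_cost(steps)
```

Note how collapsing this to one blended price per token would misstate every step — exactly the 30–50% drift teams see against invoices.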

Instrumenting an AI Service from Scratch with Langfuse

Let's add full observability — tracing, cost tracking, and automated quality evaluation — to a production AI service, step by step.

Step 1 — Basic Tracing

Install the SDK and wrap your LLM calls. The @observe() decorator captures inputs, outputs, and latency; importing OpenAI through Langfuse's drop-in wrapper adds token counts and cost automatically:

Python · Basic Tracing
from langfuse.decorators import observe, langfuse_context
from langfuse.openai import openai  # drop-in replacement: auto-captures tokens and cost

@observe()
def answer_question(user_id: str, question: str) -> str:
    # Tag with user and feature for attribution
    langfuse_context.update_current_trace(
        user_id=user_id,
        metadata={"feature": "qa-bot"}
    )

    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}]
    )

    return response.choices[0].message.content

Step 2 — Multi-Step Agent Tracing

For agentic workflows, nest observations to see the full reasoning chain as a tree of spans — each with its own timing and cost:

💡 Key Insight

In the Langfuse UI, nested @observe() decorators render as a parent span containing child spans. You see research_agent containing search_web → summarize_results → generate_answer, each with independent latency and token costs. This is what makes AI debugging take minutes instead of hours.
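The mechanics are easy to see with a toy tracer — this is not the Langfuse SDK, just a self-contained sketch of how nested decorated calls yield that parent/child span tree:

```python
import functools
import time

SPANS = []   # flat log of finished spans
_STACK = []  # names of currently open spans (the nesting path)

def observe(fn):
    """Toy stand-in for Langfuse's @observe(): records name, parent, latency."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        parent = _STACK[-1] if _STACK else None
        _STACK.append(fn.__name__)
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            _STACK.pop()
            SPANS.append({
                "name": fn.__name__,
                "parent": parent,
                "ms": (time.perf_counter() - start) * 1000,
            })
    return wrapper

@observe
def search_web(query):
    return f"results for {query}"

@observe
def summarize_results(results):
    return results[:20]

@observe
def research_agent(query):
    return summarize_results(search_web(query))

research_agent("observability")
# SPANS now holds the two child spans with parent "research_agent"
```

Child spans finish (and log) before their parent, which is why span stores are written leaf-first and reassembled into a tree by the UI.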

Step 3 — Cost Alerting

Query your trace data to build attribution dashboards. Alert when any feature exceeds its daily budget:

Python · Cost Alerts
from datetime import datetime, timedelta, timezone
from langfuse import Langfuse

langfuse = Langfuse()
yesterday = datetime.now(timezone.utc) - timedelta(days=1)

# Daily cost by feature (fetch_traces paginates; one page shown for brevity)
traces = langfuse.fetch_traces(from_timestamp=yesterday).data

cost_by_feature = {}
for trace in traces:
    feature = (trace.metadata or {}).get("feature", "unknown")
    cost_by_feature[feature] = (
        cost_by_feature.get(feature, 0) + (trace.total_cost or 0)
    )

# Alert if any feature exceeds its daily budget
# (DAILY_BUDGET and send_alert are your own config and notifier)
for feature, cost in cost_by_feature.items():
    if cost > DAILY_BUDGET[feature]:
        send_alert(f"{feature} exceeded budget: ${cost:.2f}")

The Results

5 min — debug time (vs. hours before)
Real-time — cost attribution by feature
Auto — quality regression alerts
500× — cheaper than human review

With full observability in place, debugging user complaints goes from "check logs, see 200 OK, add print statements, redeploy, wait" to "search by user ID, see exact prompt and response, diagnose in 5 minutes." Cost spikes surface in real-time dashboards instead of monthly invoices. Quality regressions trigger automated alerts within hours — not after weeks of user complaints.
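The quality-alert side can be sketched the same way. Here a trivial keyword-overlap scorer stands in for an LLM judge (in production you'd ask a model for a 0–1 score and log it back to your observability platform); the baseline and threshold values are illustrative:

```python
def relevance_score(question: str, answer: str) -> float:
    """Toy relevance metric: fraction of question keywords echoed in the
    answer. A real setup would ask an LLM judge for a 0-1 score instead."""
    q_words = {w.strip(".,!?").lower() for w in question.split()}
    q_words = {w for w in q_words if len(w) > 3}  # drop short stopwords
    if not q_words:
        return 1.0
    a_words = {w.strip(".,!?").lower() for w in answer.split()}
    return len(q_words & a_words) / len(q_words)

def check_regression(scores: list[float], baseline: float,
                     threshold: float = 0.8) -> bool:
    """Alert when the rolling average drops below threshold * baseline."""
    avg = sum(scores) / len(scores)
    return avg < baseline * threshold

score = relevance_score(
    "When does the museum open today?",
    "The museum opens at 9am today.",
)
alert = check_regression([0.9, 0.5, 0.4], baseline=0.85)
```

The structure is what matters: score every trace (or a sample), compare a rolling window against a baseline, and page a human when the window sags — the same shape regardless of how the score is produced.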

Real Teams, Real Results

Stripe's engineering teams built granular cost attribution into their AI stack and discovered that 3% of users generated 40% of their AI costs. That single insight led to per-user rate limiting that saved thousands monthly — and it was completely invisible without per-user cost tagging in their traces.

LinkedIn's AI team runs continuous evaluation on their recommendation systems. Their monitoring detected a 15% relevance drop after a model update — before it affected engagement metrics. Without automated quality evaluation, that regression would have surfaced weeks later through declining user engagement, long after trust was damaged.

The common thread: AI observability didn't replace their existing infrastructure. Both teams still ran traditional APM for latency and errors. They layered AI-specific tracing, cost attribution, and quality evaluation on top — and it changed what they could see and how fast they could act.

Choosing the Right Platform

The market has consolidated around a few major players. The right choice depends on your stack, team size, and how much control you need over your data:

Platform | Strengths | Pricing (2026) | Best For
Langfuse | Open-source, self-hostable, framework-agnostic | Free self-hosted; Cloud from $50/mo | Teams wanting control and flexibility
LangSmith | Deep LangChain integration, visual agent debugging | Free 5K traces; $39/seat/mo | LangChain-native applications
Helicone | 15-minute proxy setup, built-in caching | Free 100K req/mo; Pro $79/mo | Quick visibility, cost reduction
Arize AI | Enterprise telemetry, drift detection | Free 25K spans; Pro $50/mo | Large-scale production ML systems
Datadog LLM | Unified with existing APM stack | Usage-based | Teams already on Datadog

Proxy vs. SDK: Proxy-based tools like Helicone give instant visibility by routing LLM traffic through a monitoring layer — just change your base URL. SDK-based tools like Langfuse require explicit instrumentation but capture custom metadata and multi-step workflows. Most production systems use both: a proxy for baseline coverage, an SDK for detailed instrumentation where it matters.

Five Things to Take Away

1. Traditional APM is blind to AI. Prompt content, token costs, model behavior, and output quality require purpose-built observability.

2. Instrument from day one. Retrofitting is painful and you lose the baseline data needed to detect regressions.

3. Cost attribution must be granular. Track by user, feature, and model — with per-model pricing, token splits, and multi-step workflow accounting.

4. LLM-as-a-judge scales quality evaluation. It's 500–5,000x cheaper than human review, but validate it against real user outcomes periodically.

5. Remember the three pillars. Tracing captures what happened. Cost tracking shows what it cost. Evaluation determines if it worked.
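The 500–5,000x figure is easy to sanity-check with rough, assumed numbers: say a human reviewer costs $0.50 per item, while a judge call on a small model burns about 500 input and 100 output tokens at placeholder rates of $0.15/$0.60 per million tokens:

```python
# All numbers are assumptions for a back-of-envelope check, not vendor quotes
HUMAN_COST_PER_ITEM = 0.50        # dollars per human-reviewed response
IN_TOK, OUT_TOK = 500, 100        # typical judge prompt / verdict size
IN_PRICE, OUT_PRICE = 0.15, 0.60  # dollars per 1M tokens, small model

judge_cost = (IN_TOK * IN_PRICE + OUT_TOK * OUT_PRICE) / 1_000_000
ratio = HUMAN_COST_PER_ITEM / judge_cost  # how many judge calls per human review
```

Under these assumptions one human review buys a few thousand judge calls — squarely in the claimed range, and the reason sampling plus periodic human spot-checks is the standard pattern.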

Build Your AI Skills Systematically

This article is part of the AI Fluens advanced software engineering track.
Get a personalized week-by-week AI upskill plan tailored to your role.
