Trainly - AI Agent Observability

Nobody is watching your AI

You shipped an AI feature. It works. Users aren't complaining. So you move on.

Three months later your OpenAI bill is 3x what you budgeted and nobody can explain why. A customer mentions that answers got worse “sometime last week.” Your support team notices the same question sometimes returns 50 words and sometimes 500. Nobody has an answer because nobody was looking.

Most teams treat AI monitoring the way they treat any API: uptime, latency, error codes. That covers infrastructure. It doesn't cover the things that actually break in AI: costs that quietly balloon, quality that slowly erodes, models that are overkill for the job.

1. You don't know where your money is going

Your LLM provider gives you a monthly invoice with a dollar amount. Maybe it's broken down by model. That tells you what you spent, but not why.

Which feature is the most expensive? Which user segment drives the most token usage? Is there one endpoint that burns 40% of your budget while serving 5% of your traffic?

Without per-trace cost tracking broken down by tags, endpoints, or users, you can't answer these questions. You're optimizing in the dark.

What to look for: Break down your AI spend by tag, endpoint, user, and model. Look for concentration. If one tag or one model accounts for more than 40% of your total cost, that's where you should start optimizing.
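Here's a minimal sketch of that breakdown. It assumes you already have trace records with a cost attached; the field names (tag, model, cost_usd) and the sample numbers are made up for illustration, and the 40% threshold is the one suggested above.

```python
from collections import defaultdict

# Hypothetical trace records; field names and values are illustrative only.
traces = [
    {"tag": "search", "model": "gpt-4o", "cost_usd": 0.50},
    {"tag": "brief", "model": "gpt-4o", "cost_usd": 4.00},
    {"tag": "search", "model": "gpt-4o-mini", "cost_usd": 0.10},
    {"tag": "chat", "model": "gpt-4o-mini", "cost_usd": 0.40},
]

def cost_share(traces, key):
    """Return {value: share of total cost} for one dimension (tag, model, ...)."""
    totals = defaultdict(float)
    for t in traces:
        totals[t[key]] += t["cost_usd"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {k: v / grand_total for k, v in totals.items()}

def concentrated(traces, key, threshold=0.40):
    """Values whose share of total cost exceeds the threshold, largest first."""
    shares = cost_share(traces, key)
    hits = [(k, s) for k, s in shares.items() if s > threshold]
    return sorted(hits, key=lambda kv: kv[1], reverse=True)

for dim in ("tag", "model"):
    for value, share in concentrated(traces, dim):
        print(f"{dim}={value} accounts for {share:.0%} of spend")
```

The same function works for endpoint or user if your traces carry those fields. Anything it prints is a place to start digging, not a verdict.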

2. Your most expensive model is doing work a cheaper one could handle

Every model provider offers a range: a flagship model that's expensive and capable, and a smaller model that's fast and cheap. OpenAI has GPT-4o vs. GPT-4o-mini. Anthropic has Sonnet vs. Haiku. Google has Pro vs. Flash. The pattern is always the same.

Most teams pick the flagship during prototyping and ship it for every task. That means “what's the pricing for X?” runs on the same model as “generate an investment brief comparing 5 competitors.” The lookup doesn't need a flagship model. But it's using one anyway.

The cost gap is enormous. Take OpenAI as an example: GPT-4o charges $2.50 per million input tokens and GPT-4o-mini charges $0.15. For tasks where both produce the same answer, that's a 94% reduction in input cost. Anthropic's gap is narrower but still large: Haiku is roughly 4x cheaper than Sonnet on input. If you're not routing tasks to the right model tier, you're overpaying on every call the smaller model could have handled.
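To make the arithmetic concrete, here's a small sketch using the two OpenAI input prices quoted above. The monthly token volume is a hypothetical workload, and output tokens (which are priced separately) are left out.

```python
# Input prices in USD per million tokens, as quoted in the text above.
GPT_4O_INPUT = 2.50
GPT_4O_MINI_INPUT = 0.15

def input_cost_usd(tokens, price_per_million):
    """Cost of sending `tokens` input tokens at a given per-million price."""
    return tokens / 1_000_000 * price_per_million

def reduction(expensive, cheap):
    """Fractional saving from switching to the cheaper price."""
    return 1 - cheap / expensive

# Hypothetical workload: 200M input tokens per month.
monthly_tokens = 200_000_000
flagship = input_cost_usd(monthly_tokens, GPT_4O_INPUT)
small = input_cost_usd(monthly_tokens, GPT_4O_MINI_INPUT)

print(f"flagship: ${flagship:,.2f}  small: ${small:,.2f}")
print(f"reduction: {reduction(GPT_4O_INPUT, GPT_4O_MINI_INPUT):.0%}")
```

At that volume the input bill is $500 on the flagship versus $30 on the smaller model. Check current provider pricing before relying on these constants.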

What to look for: Group your traces by task type (tags work well for this). Within each group, check if both an expensive model and a cheap model are used. If the cheap model handles the same task with comparable error rates and output quality, the expensive model traces are candidates for downgrade.
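A rough sketch of that check is below. It assumes each trace has a tag, a model name, and an error flag; the CHEAPER mapping, the 2-point error gap, and the 20-trace minimum are illustrative choices, not fixed rules. It compares error rates only, so output quality still needs its own check before you switch anything.

```python
from collections import defaultdict

# Assumed tier pairs: expensive model -> cheaper sibling.
CHEAPER = {"gpt-4o": "gpt-4o-mini"}

def downgrade_candidates(traces, max_error_gap=0.02, min_traces=20):
    """Find (tag, expensive, cheap, n_expensive) where the cheap model keeps up."""
    # stats[tag][model] = [trace_count, error_count]
    stats = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for t in traces:
        s = stats[t["tag"]][t["model"]]
        s[0] += 1
        s[1] += 1 if t["error"] else 0

    candidates = []
    for tag, by_model in stats.items():
        for big, small in CHEAPER.items():
            # Only compare when both tiers were actually used for this task.
            if big not in by_model or small not in by_model:
                continue
            (n_big, e_big), (n_small, e_small) = by_model[big], by_model[small]
            if min(n_big, n_small) < min_traces:
                continue  # too little data to compare
            if e_small / n_small - e_big / n_big <= max_error_gap:
                candidates.append((tag, big, small, n_big))
    return candidates

def make(tag, model, n, errors):
    """Build n synthetic traces, the first `errors` of which failed."""
    return [{"tag": tag, "model": model, "error": i < errors} for i in range(n)]

traces = (
    make("lookup", "gpt-4o", 30, 1) + make("lookup", "gpt-4o-mini", 30, 1)
    + make("brief", "gpt-4o", 30, 0) + make("brief", "gpt-4o-mini", 30, 6)
)
print(downgrade_candidates(traces))
```

With this synthetic data only the lookup task is flagged: the smaller model matches the flagship there, while on briefs it fails far more often.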

3. Quality is getting worse and nobody noticed

AI doesn't break with a stack trace. It gets subtly worse. Outputs get a little shorter each week. Latency drifts up. The model starts hedging on questions it used to answer confidently.

Each individual response looks fine in isolation. But pull up a weekly comparison and the trend line is obvious. p95 latency up 20%. Average output length down 15%. Error rate ticking up 2% per week. Nobody noticed because nobody was comparing week-over-week.

Why does this happen? Provider model updates (which can ship with little or no warning), accumulated prompt changes, shifts in your user base. Pick almost any AI pipeline that's been running for 3+ months and you'll likely find at least one metric that drifted without anyone catching it.

What to look for: Track p95 latency, average output length, error rate, and average cost per trace over weekly buckets. Flag any metric that changes by 15% or more week-over-week, but only if both weeks have at least 20 traces (otherwise you get false positives from low-volume noise).
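The rule above is simple enough to sketch in a few lines. This version assumes you've already rolled each week up into a summary dict; the metric names and sample values are invented for the example, while the 15% threshold and 20-trace minimum come straight from the rule.

```python
METRICS = ("p95_latency_s", "avg_output_tokens", "error_rate", "avg_cost_usd")

def flag_drift(prev_week, curr_week, threshold=0.15, min_traces=20):
    """Return {metric: relative change} for metrics that moved >= threshold."""
    # Skip low-volume weeks entirely to avoid false positives from noise.
    if prev_week["traces"] < min_traces or curr_week["traces"] < min_traces:
        return {}
    flagged = {}
    for metric in METRICS:
        before, after = prev_week[metric], curr_week[metric]
        if before == 0:
            continue  # relative change is undefined from a zero baseline
        change = (after - before) / before
        if abs(change) >= threshold:
            flagged[metric] = change
    return flagged

# Hypothetical weekly summaries.
prev = {"traces": 120, "p95_latency_s": 1.0, "avg_output_tokens": 400,
        "error_rate": 0.020, "avg_cost_usd": 0.0100}
curr = {"traces": 110, "p95_latency_s": 1.2, "avg_output_tokens": 330,
        "error_rate": 0.021, "avg_cost_usd": 0.0105}

for metric, change in flag_drift(prev, curr).items():
    print(f"{metric}: {change:+.0%} week-over-week")
```

Here latency and output length get flagged, while the small moves in error rate and cost stay under the threshold. Flagging in both directions is deliberate: a sudden drop in cost or length can be as much a warning sign as a rise.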

4. Your AI is behaving differently and your logs won't tell you

Traditional monitoring catches crashes. It doesn't catch an AI that starts giving vague answers to questions it used to handle well, or a summarization pipeline that quietly drops key details, or an agent that begins looping on a tool call it never looped on before.

These are semantic anomalies. The HTTP request succeeds. The status code is 200. The latency looks normal. But the meaning of the output changed, and no metric in your dashboard reflects that.

Catching these requires comparing what your AI said to what it normally says for similar inputs. That means embedding your inputs and outputs, clustering them by topic, and flagging clusters where behavioral signals (output length, input/output similarity, error rate) deviate from the baseline. It's not something you can bolt on with a log query.
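To show the shape of the idea, here's a deliberately simplified sketch. It assumes embeddings are already computed by whatever embedding model you use, stands in nearest-centroid assignment for real clustering, and borrows a 15% relative-change threshold as an arbitrary cutoff. The field names and 2-D vectors are toy values.

```python
import math
from collections import defaultdict
from statistics import mean

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def nearest_topic(embedding, centroids):
    """Assign an embedding to the most similar topic centroid."""
    return max(centroids, key=lambda name: cosine(embedding, centroids[name]))

def cluster_signals(traces, centroids):
    """Mean output length and input/output similarity per topic cluster."""
    groups = defaultdict(list)
    for t in traces:
        groups[nearest_topic(t["input_emb"], centroids)].append(t)
    return {
        topic: {
            "output_len": mean(t["output_len"] for t in ts),
            "io_similarity": mean(cosine(t["input_emb"], t["output_emb"]) for t in ts),
        }
        for topic, ts in groups.items()
    }

def shifted_clusters(baseline, current, threshold=0.15):
    """Topics whose signals moved >= threshold relative to the baseline."""
    flagged = {}
    for topic, now in current.items():
        base = baseline.get(topic)
        if base is None:
            continue  # no baseline for a brand-new topic
        for signal, value in now.items():
            if base[signal] and abs(value - base[signal]) / abs(base[signal]) >= threshold:
                flagged.setdefault(topic, []).append(signal)
    return flagged

# Toy data: two topics, with summaries getting much shorter this week.
centroids = {"pricing": [1.0, 0.0], "summaries": [0.0, 1.0]}
last_week = [
    {"input_emb": [0.9, 0.1], "output_emb": [0.8, 0.2], "output_len": 100},
    {"input_emb": [0.1, 0.9], "output_emb": [0.2, 0.8], "output_len": 300},
]
this_week = [
    {"input_emb": [0.9, 0.1], "output_emb": [0.8, 0.2], "output_len": 100},
    {"input_emb": [0.1, 0.9], "output_emb": [0.2, 0.8], "output_len": 150},
]
print(shifted_clusters(cluster_signals(last_week, centroids),
                       cluster_signals(this_week, centroids)))
```

A production version would need a proper clustering step, far more traces per cluster, and a noise-aware threshold. The point is only that the comparison happens per topic, so a shift in one category isn't averaged away by the rest.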

This is also where agent gating comes in. If your pipeline has a quality threshold (e.g., a generated brief must be over 300 words, a code suggestion must pass a syntax check), you can block bad outputs before they reach users. But you can only gate what you can measure, and you can only measure what you're tracing.
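The two thresholds mentioned above can be sketched as a simple gate. The names here (GateRejected, gate) are hypothetical, and the syntax check assumes the code suggestion is Python; swap in whatever parser matches your language.

```python
import ast

class GateRejected(Exception):
    """Raised when an output fails its quality check."""

def brief_long_enough(text, min_words=300):
    """A generated brief must be over min_words words."""
    return len(text.split()) > min_words

def python_syntax_ok(code):
    """A Python code suggestion must at least parse."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def gate(output, check, reason):
    """Return the output if it passes the check, otherwise block it."""
    if not check(output):
        raise GateRejected(reason)
    return output

# Usage: block a broken suggestion before it reaches the user.
suggestion = "def add(a, b):\n    return a + b\n"
print(gate(suggestion, python_syntax_ok, "suggestion does not parse"))

try:
    gate("too short", brief_long_enough, "brief under 300 words")
except GateRejected as exc:
    print(f"blocked: {exc}")
```

What you do on rejection (retry, fall back to another model, return an error) is a product decision. The gate just makes the failure visible instead of silent.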

What to look for: If you're only tracking latency and error codes, you're missing the failures that matter most. Look at input/output semantic similarity over time (are responses becoming less relevant?), output length distributions (are they getting shorter?), and cluster-level behavioral shifts (did one category of queries start behaving differently?). If you have quality thresholds, enforce them at the trace level before returning results.

What to do about it

All of this is fixable. Most of it is fixable with data you already have, or data you can start collecting with a few lines of code. The hard part is knowing to look.

We built Trainly because we kept running into these problems ourselves. The scan runs for 72 hours: you add a decorator to your AI functions, run your app normally, and get a report at the end showing your cost breakdown, model routing inefficiencies, quality drift, and data gaps. It's specific to your pipeline, not generic advice.

If you want to check your own pipeline, the scan is free and takes about 2 minutes to set up.
