Observability & Evaluation in LLMs and Agentic Systems

Author(s): Murugeswari Muthurajan

Originally published on Towards AI.

Large Language Models and AI agents are no longer experimental tools. They power chatbots, customer support systems, and even decision-making pipelines. With just a prompt, these LLMs can understand input, reason over provided context, and generate human-like responses.

But once these systems move into production, a hard reality emerges. Imagine running a factory where machines work 24/7 without anyone checking the output. Sure, production will skyrocket — but so will the chances of chaos. LLMs and agentic systems are similar: They are fast, smart, and autonomous, but without observation and evaluation, they can produce results that look good on the surface yet fail in reality.

“They don’t fail like traditional software.” When a traditional application fails, it throws an exception, returns a 500 error, or crashes loudly. LLMs and agentic systems, however, often fail silently: they produce answers that sound correct and appear confident, yet may be incomplete, hallucinated, or subtly wrong. This is far more dangerous than a visible crash. A wrong answer delivered confidently can erode user trust, propagate misinformation, and go completely unnoticed in production.

This is where observability becomes critical. Observability is the live dashboard: it gives us visibility into what is happening behind the scenes, the metrics, traces, and behaviors that reveal the system's true state. But observability only tells us what happened; it doesn't tell us whether it was good. That is where evaluation comes in. Evaluation uses these observability insights to decide whether the output is accurate, safe, and aligned with goals. One without the other is incomplete.

In this blog, we will dive into how these two work hand in hand to make AI systems reliable and trustworthy. But first, let's understand how agentic evaluation differs from traditional LLM evaluation.

How Agentic Evaluation Differs from Traditional LLM Evaluation

AI agents are goal-driven systems that go far beyond simple text generation. Unlike traditional LLMs, which typically respond to a single prompt, agents are designed to reason, use tools, maintain memory, and adapt their behavior in order to complete tasks autonomously.

Key characteristics that distinguish agentic systems include:

  • Tool usage — Agents invoke external tools and APIs to gather information, perform actions, or execute tasks as part of their workflow.
  • Autonomy — Agents operate with minimal human intervention, dynamically deciding what to do next based on intermediate results and system state.
  • Reasoning frameworks — Agents often rely on planning or decision-making strategies to break down goals, sequence actions, and handle errors or retries.

Because of these capabilities, evaluating an agentic system is fundamentally different from evaluating a traditional LLM. Quality can no longer be judged only by the final answer; it must also account for how the agent reasoned, which tools it chose, and whether it followed an efficient and correct execution path.
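To make this concrete, agent-level evaluation usually starts from a per-run trace that records the reasoning steps and tool calls alongside the final answer. The sketch below is a minimal, hypothetical record structure; the field names are illustrative assumptions rather than part of any specific framework.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class ToolCall:
    tool_name: str                  # which tool the agent invoked
    arguments: dict[str, Any]       # inputs passed to the tool
    output: str                     # raw tool result (possibly truncated for logging)
    latency_ms: float               # time spent inside the tool

@dataclass
class AgentTrace:
    request_id: str                                            # unique id to correlate all logs for this run
    user_query: str                                            # the goal the agent was asked to achieve
    reasoning_steps: list[str] = field(default_factory=list)   # plan / intermediate decisions
    tool_calls: list[ToolCall] = field(default_factory=list)   # ordered tool usage
    final_answer: str = ""                                     # what the user ultimately sees
```

With a trace like this, an evaluator can judge not just the final answer but also tool selection and the efficiency of the execution path.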

📊 Observability in Practice: What We Actually Monitor

In practice, observability is not an abstract concept; it is implemented through deliberate logging and measurement at every stage of the LLM or agent workflow.

Most Common Metrics to Monitor:-

  1. Latency — Measures the time taken for the system to generate a response. Faster responses usually mean a better user experience.
  2. Cost — Tracks the monetary cost per request based on model usage and tokens consumed. In AI agent systems, this cost mainly comes from LLM calls that are billed per token, but it can also include external API calls and tool usage.
  3. Token Consumption — The number of tokens used in prompts and model responses. Excessive token usage often indicates overly large context, inefficient prompts, or unnecessary agent reasoning steps.
  4. User Feedback — What users think about the answer, typically collected through like or dislike buttons. This tells us whether the response was actually helpful.
  5. Knowledge Base (KB) Used — Which data source the system used to answer the question. This helps ensure the answer comes from the right information.
  6. Tool Selection — Which tools the agent used while answering the question. This shows whether the system chose the correct steps to solve the problem.
  7. User Question and Answer — This records the exact question asked by the user and the answer generated by the LLM or agent. Capturing both helps us understand what users are asking and how the system responds in real scenarios.
  8. Retrieved Documents (Knowledge Base Context) — This tracks the documents or text chunks retrieved from the knowledge base to help answer the user’s question. Logging retrieved documents separately makes it easy to verify whether the system pulled the right information and to diagnose retrieval issues when answers are incorrect or incomplete.
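To illustrate what capturing these signals can look like, here is a minimal sketch of a per-request log record covering the metrics above. The field names, structure, and per-token prices are illustrative assumptions, not a fixed schema.

```python
import time
import uuid

# Assumed placeholder prices; replace with your model's actual per-token pricing.
PRICE_PER_1K_INPUT_TOKENS = 0.003
PRICE_PER_1K_OUTPUT_TOKENS = 0.015

def build_interaction_record(question, answer, model_id, input_tokens, output_tokens,
                             latency_ms, tools_used, kb_id, retrieved_chunks, feedback=None):
    """Assemble one structured log record for a single LLM/agent interaction."""
    estimated_cost = (input_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS
                      + output_tokens / 1000 * PRICE_PER_1K_OUTPUT_TOKENS)
    return {
        "request_id": str(uuid.uuid4()),          # unique id for end-to-end tracing
        "timestamp": int(time.time() * 1000),
        "model_id": model_id,
        "user_question": question,                # metric 7: question and answer
        "answer": answer,
        "latency_ms": latency_ms,                 # metric 1: latency
        "input_tokens": input_tokens,             # metric 3: token consumption
        "output_tokens": output_tokens,
        "estimated_cost_usd": round(estimated_cost, 6),   # metric 2: cost
        "user_feedback": feedback,                # metric 4: like / dislike, if provided
        "knowledge_base_id": kb_id,               # metric 5: which KB was used
        "tools_used": tools_used,                 # metric 6: tool selection
        "retrieved_documents": retrieved_chunks,  # metric 8: retrieved KB context
    }
```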

How We Can Achieve This

Once we know what to observe, the next step is deciding how to capture and store these signals in a reliable way. This requires a logging system that is scalable, searchable, and tightly integrated with production workloads, without adding operational overhead. One AWS service that meets these requirements is Amazon CloudWatch. All user interactions are logged in CloudWatch, creating a single place to inspect what happened during each LLM or agent run. Every request is assigned a unique identifier so related logs can be traced end to end.
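As a minimal sketch of this setup, the snippet below writes a record like the one sketched earlier to CloudWatch Logs using boto3. The region, log group, and log stream names are assumptions and must already exist in your account (or be created beforehand).

```python
import json
import time

import boto3

logs_client = boto3.client("logs", region_name="us-east-1")  # region is an assumption

LOG_GROUP = "/llm-app/interactions"   # hypothetical log group name
LOG_STREAM = "production"             # hypothetical log stream name

def log_interaction(record: dict) -> None:
    """Push one interaction record to CloudWatch Logs as a single JSON event."""
    logs_client.put_log_events(
        logGroupName=LOG_GROUP,
        logStreamName=LOG_STREAM,
        logEvents=[{
            "timestamp": int(time.time() * 1000),  # CloudWatch expects epoch milliseconds
            "message": json.dumps(record),         # the structured record built above
        }],
    )
```

Because every record carries the same request_id, the full lifecycle of a request can later be reconstructed with CloudWatch Logs Insights queries.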

For the full code, refer to: https://github.com/MuMu2807/LLM_Evaluation/blob/main/src/main.py

Seeing Isn’t Knowing: The Case for Evaluation

As we saw earlier, observability gives us visibility into how an LLM or agent system behaves in production, but it does not tell us whether that behavior is actually good. This is where evaluation becomes essential.

Evaluation is the process of measuring the quality of model and agent outputs against defined expectations. In production LLM systems, evaluation is not a one-time activity. It must be continuous, systematic, and closely tied to real usage.

Online vs Offline Evaluation

Evaluation in LLM and agent systems generally falls into two categories: online evaluation and offline evaluation. Both serve different purposes and are most effective when used together.

Online Evaluation

Online evaluation happens in real time, as users interact with the system in production. It focuses on understanding how the system performs under real-world conditions.

Typical online evaluation signals include:

  • User feedback (like / dislike)
  • Latency and cost trends
  • Error rates and failed tool calls
  • Sudden drops in answer quality after a deployment

Online evaluation is valuable because it reflects actual user experience, but it is often noisy and subjective. A dislike does not always mean the answer was wrong — it may simply not match user expectations.

Offline Evaluation

Offline evaluation is performed outside of live traffic, using predefined test cases and controlled datasets. It allows teams to systematically measure quality without impacting users.

Offline evaluation typically includes:

  • Structured test cases with expected outputs
  • Regression testing across model or prompt versions
  • Functional / Performance Testing — latency testing under varying loads
  • Robustness Testing — how effectively the model handles imperfect inputs such as typos, slang, grammatical errors, and ambiguous queries

Offline evaluation is more reliable and repeatable, making it ideal for benchmarking, comparison, and validation before release.
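As an illustration of the robustness-testing item above, the sketch below perturbs a query with simple typos and checks whether each variant's answer still contains an expected keyword. The `ask_model` function is a hypothetical stand-in for whatever calls your LLM or agent.

```python
import random

def add_typos(text: str, n_typos: int = 2, seed: int = 0) -> str:
    """Return a copy of `text` with a few random adjacent-character swaps."""
    if len(text) < 2:
        return text
    rng = random.Random(seed)
    chars = list(text)
    for _ in range(n_typos):
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def robustness_check(ask_model, query: str, expected_keyword: str, n_variants: int = 3):
    """Run the clean query plus perturbed variants and report which ones still pass."""
    variants = [query] + [add_typos(query, seed=i) for i in range(n_variants)]
    results = []
    for variant in variants:
        answer = ask_model(variant)   # hypothetical model / agent call
        results.append((variant, expected_keyword.lower() in answer.lower()))
    return results
```

Because offline evaluation runs outside live traffic, checks like this can be repeated on every prompt or model change without affecting users.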

Now that we understand when evaluation happens, the next step is deciding how to evaluate — and which metrics truly capture answer quality.

Core Evaluation Metrics

Evaluating the quality of LLM and agent outputs requires more than checking whether a response exists. Because these systems generate natural language, quality must be measured across multiple dimensions. A response can be fluent but wrong, correct but incomplete, or grounded in data but poorly explained.

To capture these nuances, we rely on a set of core evaluation metrics. Each metric focuses on a specific aspect of answer quality, and together they provide a balanced view of how well the system is performing.

LLM Evaluation Metrics

I. QUALITY-BASED METRICS

1. Correctness

Is the answer factually correct based on given context?

Example:- Q: “What is the notice period policy?”
Answer states 30 days when the policy says 60 days → ❌ incorrect

2. Completeness

Does the answer cover all required parts?

Example:- User asks: “How do I apply for leave and who approves it?”
Answer explains only how to apply, not approval flow → ❌ incomplete

3. Coherence

Is the answer logically structured and readable?

Example:- Answer jumps between unrelated points with no clear order → ❌ incoherent

4. Relevance

Does the answer address the question directly?

Example:- Long explanation but misses core ask → ❌ irrelevant

5. Faithfulness

Is the answer grounded in provided context (RAG)?

Example:- Context mentions email approval required.
Answer adds manager and HR approval → ❌ unfaithful

6. Helpfulness

Would a real user find this useful?

Example:- Technically correct but confusing → ❌ not helpful

7. Professional Style and Tone

Is the response respectful, neutral, and appropriate for the context?

Example:- Answer says “This is obvious, you should know this” → ❌ unprofessional

8. Following Instructions

Does the answer follow the user’s explicit instructions, format, and constraints?

Example:- Correct content but wrong format → ❌ not following instructions

9. Readability

Is the answer easy to read and understand?

Example:- One long paragraph with no spacing or headings → ❌ poor readability

II. RESPONSIBLE AI METRICS

1. Harmfulness

Could the response cause physical, emotional, or social harm?

Example:
Providing medical or legal advice without disclaimers → ❌ harmful

2. Stereotyping

Does the answer avoid biased or unfair generalisations about people or groups?

Example:- Implying a group behaves a certain way → ❌ stereotyping

3. Refusal

Does the system appropriately refuse unsafe or disallowed requests?

Example:- User asks for illegal instructions and the model complies → ❌ failed to refuse

III. KEY RAG METRICS

1. Contextual Precision

How much of the retrieved context is actually relevant?

Too much irrelevant context → ❌ low precision

Example:- Question about expense policy. Retrieved context includes travel policy + cafeteria rules → ❌ low precision

2. Contextual Recall

Did retrieval include all necessary information?

Missing key details → ❌ low recall

Example:- Question about leave policy.
Retrieved context includes eligibility but not leave duration → ❌ low recall

3. Answer Relevancy

Is the answer relevant to the question?

Correct info but not answering the question → ❌ irrelevant

Example:- User asks “How do I change my address?”
Answer explains why address updates are important → ❌ irrelevant

4. Faithfulness

Is the answer strictly grounded in context?

Hallucinated details → ❌ unfaithful

Example:- Context mentions approval required.
Answer claims automatic approval → ❌ hallucination

LLM Model Evaluation

Selecting the right LLM is not about choosing the most powerful or expensive model; it is about choosing the model that performs best for your specific task, constraints, and scale. Model evaluation provides a structured way to compare different LLMs using the same prompts, test cases, and metrics. The best LLM is the one that consistently meets your quality, cost, latency, and safety requirements.

Model Evaluation Using Playgrounds (AWS and Beyond)

Cloud platforms provide built-in playgrounds that make model evaluation easier without writing custom code. In AWS, Amazon Bedrock offers a playground-style interface where teams can experiment with multiple foundation models, compare responses side by side, tune prompts, and evaluate output quality, latency, and cost.

Amazon Bedrock — Text Playground

Similar playground experiences exist in Azure OpenAI Studio and Google Vertex AI, enabling teams to test models on the same prompts, observe differences in reasoning and instruction-following, and make informed model choices. These playgrounds act as a practical starting point for model evaluation, which can later be extended with structured evaluation, observability, and dashboards in production.

Prompt Evaluation vs. Prompt Alignment Evaluation

Prompt Evaluation

Prompt evaluation focuses on measuring how well a prompt performs across different inputs and scenarios. It evaluates whether the prompt consistently produces correct, relevant, safe and helpful outputs.

How we can do it:

By running prompts against a fixed set of test cases and comparing results across versions, teams can identify weaknesses, reduce regressions, and improve overall reliability before deploying prompts to production.
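A minimal sketch of this idea follows, assuming a hypothetical `run_prompt(prompt_template, question)` helper that fills the template and calls your model: it runs two prompt versions over the same test cases and counts how many expected answers each version produces.

```python
def compare_prompt_versions(run_prompt, prompt_v1: str, prompt_v2: str,
                            test_cases: list[dict]) -> dict:
    """Run both prompt versions over the same test cases and tally keyword hits.

    Each test case is assumed to look like:
        {"question": "...", "expected_keywords": ["...", "..."]}
    """
    scores = {"v1": 0, "v2": 0}
    for case in test_cases:
        for name, template in (("v1", prompt_v1), ("v2", prompt_v2)):
            answer = run_prompt(template, case["question"]).lower()
            if all(kw.lower() in answer for kw in case["expected_keywords"]):
                scores[name] += 1
    scores["total"] = len(test_cases)
    return scores
```

Tracking these scores per prompt version makes regressions visible before a new prompt reaches production.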

Prompt Alignment Evaluation

Prompt alignment checks whether the model’s output matches the structure, format, and constraints specified in the prompt. Instead of judging semantic quality, it validates instruction adherence using deterministic rules.

If the prompt asks for structured output (JSON, YAML, a table), the response is validated against a schema, as sketched after the example checks below.

Example checks:-

  • Is the output valid JSON?
  • Are required fields present?
  • Are field types correct?
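Here is a minimal sketch of such deterministic checks; the required field names and types are assumptions chosen for illustration.

```python
import json

# Assumed schema for illustration: field name -> expected Python type.
REQUIRED_FIELDS = {"answer": str, "sources": list, "confidence": float}

def check_prompt_alignment(raw_output: str) -> list[str]:
    """Return a list of alignment violations; an empty list means the output passed."""
    try:
        data = json.loads(raw_output)                  # Is the output valid JSON?
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    violations = []
    for field_name, expected_type in REQUIRED_FIELDS.items():
        if field_name not in data:                     # Are required fields present?
            violations.append(f"missing field: {field_name}")
        elif not isinstance(data[field_name], expected_type):   # Are field types correct?
            violations.append(f"wrong type for {field_name}: "
                              f"expected {expected_type.__name__}")
    return violations
```

Because these checks are deterministic, they can run on every response in production at negligible cost.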

EVALUATION STRATEGIES

As we saw earlier, evaluating LLMs and agentic systems involves trade-offs between accuracy, scalability, and cost, and no single approach satisfies all three at once. As a result, most production systems adopt a combination of evaluation strategies based on their requirements. The three most commonly used strategies are:

  • Manual Evaluation
  • LLM-as-a-Judge
  • Programmatic Evaluation

Manual Evaluation

Human reviewers assess model outputs for accuracy, tone, and context, and check the answers for factual errors, contradictions, or twisted meaning. This is the gold standard for nuanced judgment, but it is costly and slow, making it hard to scale to large datasets or real-time systems.

LLM-as-a-Judge

In this approach, an LLM is used to evaluate the output of another model based on predefined metrics such as correctness, faithfulness, and helpfulness. It enables large-scale, fast evaluation of subjective qualities that are difficult to measure programmatically. However, it requires careful prompt design and periodic human validation to avoid bias.

LLM-as-a-Judge evaluations can take multiple forms depending on the goal; a minimal scoring sketch follows the list below.

  1. Scoring — Assign scores to responses across quality metrics
  2. Comparison — Compare multiple answers to select the better one
  3. Critique explanation — Generate qualitative critiques explaining strengths and weaknesses
  4. Hallucination detection — Identify statements in the response that are unsupported by the provided context or factual information
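The sketch below shows the scoring form using Amazon Bedrock's Converse API. The judge model ID, rubric, and 1-5 scale are assumptions for illustration; any sufficiently capable model and rubric could be substituted.

```python
import json

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")  # region is an assumption

JUDGE_MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # assumed judge model

JUDGE_PROMPT = """You are an impartial evaluator. Given a question, the retrieved context,
and an answer, rate the answer from 1 to 5 on correctness, faithfulness, and helpfulness.
Respond only with JSON: {{"correctness": n, "faithfulness": n, "helpfulness": n, "reason": "..."}}

Question: {question}
Context: {context}
Answer: {answer}"""

def judge_answer(question: str, context: str, answer: str) -> dict:
    """Ask the judge model to score one answer and parse its JSON verdict."""
    prompt = JUDGE_PROMPT.format(question=question, context=context, answer=answer)
    response = bedrock.converse(
        modelId=JUDGE_MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": 0.0},   # deterministic judging
    )
    verdict_text = response["output"]["message"]["content"][0]["text"]
    return json.loads(verdict_text)
```

Periodic human spot-checks of these verdicts help catch judge bias before it skews the metrics.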

Programmatic Evaluation

Programmatic evaluation relies on rules and automated checks rather than subjective judgment. This includes exact matches, keyword checks, semantic similarity scores, and format validation. Metrics such as ROUGE, BLEU, and METEOR are used for summarization, accuracy and F1 score for classification, and cosine similarity, n-gram matching, and word overlap for measuring similarity. These methods are fast and repeatable but limited in capturing deeper language quality.
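A minimal sketch of a few such deterministic checks, using plain whitespace tokenization and a token-overlap F1 as a rough stand-in for n-gram metrics like ROUGE-1:

```python
def exact_match(prediction: str, reference: str) -> bool:
    """Strict string equality after trimming and lowercasing."""
    return prediction.strip().lower() == reference.strip().lower()

def keyword_check(prediction: str, required_keywords: list[str]) -> bool:
    """True if every required keyword appears in the prediction."""
    text = prediction.lower()
    return all(kw.lower() in text for kw in required_keywords)

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 over unique tokens, a rough proxy for n-gram similarity."""
    pred_tokens = set(prediction.lower().split())
    ref_tokens = set(reference.lower().split())
    common = pred_tokens & ref_tokens
    if not common:
        return 0.0
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Dedicated libraries provide the standard implementations of ROUGE, BLEU, and embedding-based similarity; the point here is only that such checks are fast, cheap, and fully repeatable.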

Structured Evaluation — The Framework

Structured evaluation is a framework, not an evaluation method by itself. It relies on clearly defined test cases to specify what should be evaluated, along with the metrics and scoring format used to assess results. This structure remains the same whether the evaluation is performed by a human reviewer, an LLM-as-a-judge, or programmatic checks.

What Is a Structured Test Case

A structured test case represents a single, well-defined scenario used to evaluate an LLM or agent. Each test case focuses on one intent or capability and includes everything needed to judge the response objectively.

At a minimum, a structured test case contains:

  • A unique test identifier
  • A clear user query
  • The expected behavior or outcome
  • A category label

Example: Structured Evaluation Test Case

{
  "id": "id_1",
  "query": "How do I apply for parental leave?",
  "expected_contains": ["ABC", "XYZ"],
  "expected_tool": ["ABC_TOOL"],
  "category": "Leave_Queries"
}
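Assuming a hypothetical `ask_agent(query)` helper that returns the final answer together with the tools it invoked, a minimal runner for such test cases could look like this:

```python
def run_test_case(ask_agent, case: dict) -> dict:
    """Execute one structured test case and record pass/fail for each check."""
    answer, tools_used = ask_agent(case["query"])   # hypothetical agent call
    answer_lower = answer.lower()
    return {
        "id": case["id"],
        "category": case["category"],
        "contains_ok": all(s.lower() in answer_lower
                           for s in case.get("expected_contains", [])),
        "tool_ok": all(t in tools_used for t in case.get("expected_tool", [])),
        "answer": answer,
    }

def run_suite(ask_agent, test_cases: list[dict]) -> list[dict]:
    """Run every test case and return per-case results for reporting."""
    return [run_test_case(ask_agent, case) for case in test_cases]
```

The same test cases can also be handed to a human reviewer or an LLM-as-a-judge; only the scoring step changes, not the structure.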

Test Case Categories

When designing test cases, it’s important to create a well-rounded test suite that covers different types of behaviors and scenarios. This ensures the system is evaluated across a wide range of capabilities rather than a narrow set of examples.

Common categories to include are:

  1. Information Retrieval — Questions that test factual knowledge, definitions, or explanations
  2. Reasoning and Logic — Problems that require deduction, inference, or multi-step thinking
  3. Tool Invocation — Scenarios where the agent must choose and use the correct tool
  4. Conversational Flow — Multi-turn interactions that test context retention and coherence
  5. Edge and Boundary Cases — Uncommon, ambiguous, or extreme inputs
  6. Safety and Compliance — Requests involving sensitive, restricted, or potentially harmful topics

For the full code, refer to: https://github.com/MuMu2807/LLM_Evaluation/blob/main/src/evaluation.py

CUSTOM DASHBOARD USING STREAMLIT

To make observability and evaluation actionable, we can build a custom dashboard using Streamlit or similar tools that brings all of these signals into a single, easy-to-use interface. The dashboard retrieves logs and metrics from Amazon CloudWatch, including user queries, model responses, cost, latency, token usage, tool selection, and knowledge base access. On top of operational metrics, it also displays evaluation results from structured tests, manual reviews, and LLM-as-a-judge scoring.

By visualizing both system behavior and quality metrics side by side, the dashboard makes it easier to spot trends, detect regressions, and debug failures. Instead of digging through raw logs or spreadsheets, teams can quickly understand how the system is performing in production and how changes to prompts, models, or tools impact overall quality. This closes the loop between observability and evaluation, turning raw data into insights that drive continuous improvement.
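A minimal sketch of such a dashboard, assuming the interaction records were logged to CloudWatch as JSON with the field names used earlier (the log group name is a hypothetical placeholder):

```python
import json

import boto3
import pandas as pd
import streamlit as st

LOG_GROUP = "/llm-app/interactions"   # hypothetical log group written to earlier

@st.cache_data(ttl=300)
def load_records(limit: int = 500) -> pd.DataFrame:
    """Pull recent interaction logs from CloudWatch and flatten them into a DataFrame."""
    client = boto3.client("logs")
    events = client.filter_log_events(logGroupName=LOG_GROUP, limit=limit)["events"]
    return pd.DataFrame([json.loads(e["message"]) for e in events])

st.title("LLM Observability & Evaluation Dashboard")
df = load_records()

col1, col2, col3 = st.columns(3)
col1.metric("Requests", len(df))
col2.metric("Avg latency (ms)", round(df["latency_ms"].mean(), 1))
col3.metric("Total cost (USD)", round(df["estimated_cost_usd"].sum(), 4))

st.subheader("Recent interactions")
st.dataframe(df[["request_id", "user_question", "answer", "tools_used", "user_feedback"]])
```

Run it with `streamlit run dashboard.py`; evaluation results can be loaded and displayed alongside these operational metrics in the same way.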

For the full code, refer to: https://github.com/MuMu2807/LLM_Evaluation/blob/main/src/monitoring_dashboards.py

Other Notable Evaluation & Observability Tools

RAGAS

RAGAS is an open-source framework focused on evaluating Retrieval-Augmented Generation systems. It provides metrics such as faithfulness, answer relevance, contextual precision, and contextual recall using LLM-based scoring.
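A minimal usage sketch follows the 0.1-era RAGAS quickstart; exact column names and imports vary across versions, and a judge LLM (for example, configured via an OpenAI API key) is assumed to be available.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

# One evaluation sample; in practice this would come from your own logs or test set.
data = {
    "question": ["How do I apply for parental leave?"],
    "answer": ["Submit the leave form in the HR portal and wait for manager approval."],
    "contexts": [[
        "Parental leave requests are submitted through the HR portal and require manager approval."
    ]],
    "ground_truth": ["Apply through the HR portal; manager approval is required."],
}

scores = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(scores)
```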

OpenAI Evals

OpenAI Evals is a framework for evaluating LLM behavior using predefined tasks and metrics. It is commonly used for benchmarking, regression testing, and validating prompt or model changes against expected outcomes.

LangSmith

LangSmith provides tracing, debugging, and evaluation for LLM and agent workflows built with LangChain. It helps visualize agent steps, tool calls, and prompt versions, and supports dataset-based evaluation and regression testing.

Langfuse

Langfuse is an observability and evaluation platform for LLM applications that offers traces, prompt versioning, cost tracking, and evaluation workflows. It supports both online monitoring and offline evaluation using structured datasets.

Amazon Bedrock Evaluation

AWS provides evaluation capabilities through Amazon Bedrock, enabling model comparison, prompt testing, and human or automated evaluation workflows. These tools integrate well with AWS-native logging, monitoring, and governance systems.

Azure AI Evaluation

Azure AI Evaluation tools support assessing model quality, safety, and groundedness, particularly for enterprise use cases. They integrate with Azure OpenAI, AI Studio, and Responsible AI dashboards for end-to-end evaluation.

Google Vertex AI Evaluation

Google Vertex AI provides evaluation features for generative models, including prompt comparison, human feedback workflows, and safety assessments. It is designed to support production-scale ML and GenAI deployments.

Arize Phoenix

Arize Phoenix is an open-source observability tool for LLM applications, focusing on tracing, embeddings analysis, and evaluation. It is often used for debugging RAG pipelines and detecting hallucinations.

TruLens

TruLens provides feedback functions and evaluation primitives to measure groundedness, relevance, and correctness. It integrates well with RAG pipelines and supports both automated and human-in-the-loop evaluation.

Helicone

Helicone acts as a proxy for LLM APIs, capturing request/response data, latency, and cost. It is lightweight and useful for observability, though evaluation features are more limited.

