Structured Evaluations – One Building Block of AI Safety

TL;DR

Evaluation of AI systems is essential for the safety and satisfaction of our users. Evaluating stochastic AI systems is considerably more complex than testing a typical deterministic enterprise application. It requires an understanding not only of user needs, non-functional system requirements, and model capabilities, but also of their intersections, in order to pick the right evaluation approach. A sound approach rests on a clear blueprint, the right evaluation metrics, and automation for transparency. A structured evaluation approach is a must. Industry-grade LLM-application development frameworks facilitate implementation.

Introduction

(Image: a friendly robot in front of Cologne Cathedral)

AI assistants like chatbots excel in use cases where know-how needs to be mediated, whether that is a cooking recipe, the explanation of a complex physical topic, or a piece of interesting trivia.

In our current project we apply LLMs to disseminate travel information to people interested in the beautiful city of Cologne. If you want to know something about this unconventional German metropolis in the heart of North Rhine-Westphalia, look no further. Our chatbot can tell you something interesting about Cologne Cathedral, the Rhine river, or the local beer culture, which centers around Kölsch.

Surely, providing tourist guidance and answering questions about local sights is not as high-stakes as, for example, providing correct guidance to a surgeon, pilot, or lawyer. Still, an AI that is tasked with handing out knowledge should do so correctly.

Now, evaluating (language-based) AI systems is more complicated than evaluating a traditional application due to their stochastic nature (e.g., learning from data, execution within a large stochastic system) and their possibly “fuzzy” inputs. For example, we may ask a question with the same intent but use slightly different words, or phrase it in an entirely different way. And even when we use the same word, each of us may have a slightly different understanding of its meaning. Sounds ridiculous, right?

While testing typical enterprise applications rests on clear expectations from the large architectural level down to the small implementation level, such a clear 1:1 mapping does not exist in AI systems. For example, we can specify that our chatbot should answer the question “Where does the Cologne cathedral stand?” with “In Cologne”. Yet the LLM may generate not only this answer but many other correct (and incorrect) answers as well. And we cannot, with reasonable confidence and practicality, specify at the micro-level how the LLM should compute its answer.

Therefore, understanding how our in-silico companions fare when asked superficial or intricate questions is essential for instilling trust in the technology and, at the same time, serving our users. The nuts and bolts of serving us correctly are not so easily grasped. You may ask why: surely, there is only one cathedral. Yes, but what about the chatbot giving out answers that are superficially correct but possibly outdated? Or answers that are partially correct, or completely made up? A native Kölner (citizen of Cologne) may easily weed out incorrect answers. But for a tourist from Australia, a fact check may boil down to standing in front of a closed door.

Consequently, we need something better than just saying “Yes, that looks right”. A structured approach to evaluating LLM-powered systems is needed. Such an approach is best conceptualized as a pipeline that implements the traditional data-to-data-driven-model cycle: starting with the definition of an evaluation data set, the system processes the data to produce output; subsequently, the output is used to evaluate system performance, driving further adjustments.
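To make this cycle concrete, here is a minimal Python sketch of such an evaluation loop. The functions chatbot_answer and score_answer are hypothetical placeholders for your own system under test and your metric of choice; they are not tied to any particular framework.

# Minimal sketch of the evaluation cycle: evaluation data set -> system output -> score -> quality signal.
# chatbot_answer() and score_answer() are hypothetical placeholders.

eval_set = [
    {"question": "Where does the Cologne cathedral stand?", "reference": "In Cologne"},
    {"question": "What is Kölsch?", "reference": "A top-fermented beer brewed in Cologne"},
]

def run_evaluation(eval_set, chatbot_answer, score_answer):
    scores = []
    for sample in eval_set:
        answer = chatbot_answer(sample["question"])               # the system processes the data
        scores.append(score_answer(answer, sample["reference"]))  # the output is evaluated
    return sum(scores) / len(scores)                              # aggregate quality signal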

If your LLM application is deployed without an automated, structured evaluation gate, you don't have a safety policy—you have a time bomb.

Over the last years, many professionals have adopted the practice of continuous integration and continuous delivery (CI/CD). Consequently, this iterative evaluation cycle should be at the core of your delivery pipeline. This does not mean that you need to execute the full evaluation on every change. However, (i) having a quality signal in place is better than having none, and (ii) being able to measure quickly when the need arises is essential in high-pace environments. For that reason, in many of our projects we have implemented data-driven testing and model evaluation as part of the core CI/CD loop.
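One lightweight way to wire this quality signal into a CI/CD pipeline is a test that fails the build when the aggregate evaluation score drops below a threshold. The sketch below reuses the run_evaluation helper from above; the threshold of 0.8 is purely illustrative.

def test_quality_gate():
    # Runs the evaluation set through the system and fails the CI job if quality regresses.
    MIN_SCORE = 0.8  # illustrative gate; tune it to your use case and metric
    score = run_evaluation(eval_set, chatbot_answer, score_answer)
    assert score >= MIN_SCORE, f"Evaluation score {score:.2f} fell below the gate of {MIN_SCORE}"

Executed with a test runner such as pytest, this turns the evaluation into a gate that every change has to pass.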

I once oversaw a system where a single uncaught context hallucination caused a fair amount of reputational damage. That is why I am uncompromising about structured evaluation.

In the rest of this article, we look into key evaluation metrics. We provide a glimpse into their Python definition and show how they can be used from standard frameworks like LlamaIndex.

Key Metrics In LLM Evaluation

A variety of metrics are used to evaluate large language models (LLMs), covering aspects such as accuracy, relevance, robustness, and efficiency. These metrics are selected and combined depending on the use case, benchmark, and objective to ensure a comprehensive performance evaluation. The most important metrics are presented in a structured overview below.

Classic metrics

  • Accuracy: Measures the percentage of correct model responses in tasks such as classification or question answering.
  • Precision & Recall: Precision considers how many results classified as positive are actually positive; recall measures how many of the actual positive results were recognized.
  • F1 score: The harmonic mean of precision and recall – particularly important when the ratio between classes is unbalanced (a short computation sketch follows this list).
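As a quick illustration, these classic metrics can be computed with scikit-learn; the labels below are toy values for demonstration only.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy ground truth and predictions for a binary task (1 = positive class)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.2f}")
print(f"Precision: {precision_score(y_true, y_pred):.2f}")
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")
print(f"F1 score:  {f1_score(y_true, y_pred):.2f}")  # harmonic mean of precision and recall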

Token and retrieval metrics

  • Precision@k, Recall@k, tok@k/top@k: Provide information about how many relevant tokens/documents are among the top-k results, which is central to RAG systems, for example.
  • Exact Match (EM): Checks whether the model answer exactly matches the reference answer – e.g., in factoid QA.

Language model and text quality metrics

  • Perplexity: Evaluates how confident or “surprised” the model is when predicting the next tokens. Lower values indicate better language modeling.
  • BLEU, ROUGE, GLEU, METEOR: Measure similarities between reference and model text, which is important for translation or summarization, for example.
  • BERTScore: Compares embeddings of model and reference text to capture semantic similarity even in paraphrases.

Factual and semantic metrics

  • Factual consistency / hallucination rate: Measures how often an LLM provides correct facts or “hallucinates” (invents content).
  • Semantic Similarity: Evaluates how similar the generated response is to the target response in terms of meaning; often used for creative or open-ended tasks.

System and practical metrics

  • Latency: Time to generate a response, important for real-time requirements.
  • Throughput: Requests processed per unit of time.
  • Memory utilization: Resources required for the model application.
  • Batching efficiency: How many parallel requests can be processed efficiently.

Human-in-the-loop & special metrics

  • Human evaluation: Humans evaluate responses in terms of naturalness, coherence, creativity, etc.
  • Task completion: Checks whether the LLM actually completes a given task (e.g., producing valid JSON output).
  • Responsible metrics: Measure bias, fairness, and toxic behaviour in model output.

A Brief Guide to Choosing a Metric

In the following, we provide a mental shortcut for picking an evaluation metric. You will find the major metrics listed by category with their main application domains and use. Critically, we give guidance on when to avoid a metric because it would be irrelevant, misleading, or insufficient.

Classic

  • Accuracy. Main application domains: Classification tasks (e.g., intent detection, content moderation) and question answering with a single correct answer. When to avoid: When classes are imbalanced, since a high score can be misleading if one class dominates, and for open-ended tasks with multiple valid answers.
  • Precision & Recall. Main application domains: Crucial for tasks where error types have different costs. Precision matters when the cost of false positives is high (e.g., flagging safe content as unsafe); recall matters when missing a positive case is critical (e.g., failing to detect harmful content). When to avoid: When you need a single, balanced score to compare models; in such cases, the F1 score is more appropriate.
  • F1 Score. Main application domains: The go-to metric when you need to balance the trade-off between precision and recall, especially when the class distribution is uneven. When to avoid: When either precision or recall is clearly more important for your specific application.

Token & Retrieval

  • Precision@k / Recall@k. Main application domains: Core to Retrieval-Augmented Generation (RAG) systems and search engines; measures the quality of the top-k retrieved documents or tokens. When to avoid: When the ranking order within the top-k results matters more than mere presence.
  • Exact Match (EM). Main application domains: Ideal for tasks requiring exact string correctness, such as extracting codes, names, or factoid question answering (e.g., "What is the capital of France?"). When to avoid: For open-ended, generative, or conversational tasks where the same meaning can be expressed in different ways; it is overly strict and will penalize correct but paraphrased answers.

Language Model & Text Quality

  • Perplexity. Main application domains: Intrinsic evaluation of a language model's quality; used to compare models or monitor training progress on the same dataset. Low perplexity indicates better prediction of the next word. When to avoid: As a proxy for task-specific performance like factuality or helpfulness (a model can be fluent yet factually wrong), and for comparing models trained on different datasets.
  • BLEU / ROUGE / METEOR. Main application domains: Machine translation (BLEU) and text summarization (ROUGE); METEOR is an improved variant that considers synonyms and stemming. When to avoid: When evaluating for semantic meaning or when valid outputs involve significant paraphrasing; these are n-gram overlap metrics and do not capture meaning well.
  • BERTScore. Main application domains: Evaluating tasks where semantic correctness is key, such as paraphrase generation, summarization, and question answering; it captures meaning even when words differ. When to avoid: When computational resources are limited (it is more expensive to compute than BLEU/ROUGE), or when grammatical correctness or sentence structure is the primary concern.

Factual & Semantic

  • Factual Consistency / Hallucination Rate. Main application domains: Critical for applications where truthfulness is paramount, such as summarizing documents or generating answers based on a given context (e.g., in RAG); measures whether claims are supported by a source. When to avoid: For purely creative tasks where factual accuracy is not the goal (e.g., writing fiction or poetry).
  • Semantic Similarity. Main application domains: Evaluating open-ended and creative tasks (e.g., story generation, idea brainstorming) where there are multiple valid outputs and the goal is to assess meaning proximity to a reference or intent. When to avoid: When you require exact factual correctness or when evaluating structured outputs.

System & Practical

  • Latency. Main application domains: Essential for all real-time applications, such as live chatbots, voice assistants, and interactive AI features. When to avoid: During initial model research and development phases where speed is not the primary focus.
  • Throughput. Main application domains: Measuring system scalability and cost-efficiency, especially for batch processing of large volumes of requests. When to avoid: For evaluating the user experience of a single interaction.
  • Memory Utilization. Main application domains: Critical for deploying models on resource-constrained devices (e.g., edge devices, mobile phones) and for estimating infrastructure costs. When to avoid: When only evaluating the quality of the model output in a research setting.
  • Batching Efficiency. Main application domains: Optimizing server-side performance and cost for handling multiple concurrent requests. When to avoid: For low-traffic applications or during prototyping.

Human-in-the-Loop & Special

  • Human Evaluation. Main application domains: The gold standard for nuanced aspects like coherence, creativity, helpfulness, and safety; used to validate automated metrics and for high-stakes applications. When to avoid: When you need fast, cheap, and scalable evaluation; it is slow, expensive, and can suffer from low inter-annotator agreement.
  • Task Completion. Main application domains: Evaluating AI agents and functional applications (e.g., correctly generating JSON, calling the right API, completing a multi-step task). When to avoid: For evaluating purely conversational chatbots where the primary goal is information exchange, not taking an action.
  • Responsible Metrics (Bias, Toxicity). Main application domains: Essential for all public-facing applications to ensure outputs are fair, ethical, and non-harmful; used in content moderation and for adhering to safety policies. When to avoid: No application should completely avoid evaluating for responsibility, though the strictness of the criteria may vary.

Effective evaluation does not stop with running your model through one metric. Typically, it involves picking a fitting combination of several of these metrics to evaluate both your model and its embedding into a service. Since no single metric can capture the performance of an LLM system holistically, you need to make an informed choice. This choice should be guided by your application's specific goals, and these goals must in turn be rooted in your key user needs and constraints. For example:

RAG-powered Customer Service Chatbot:

  • Factual Consistency to avoid wrong answers,
  • Answer Relevancy to stay on-topic,
  • Recall@k for retrieval quality, Latency for user experience, and
  • Human Evaluation to spot those tiny needle-in-a-haystack errors and inconsistencies.

Creative Writing Assistant:

  • Semantic Similarity, and
  • Human Evaluation of creativity and coherence are more suitable for an open-ended task where there is no single correct answer and no easy-to-spot error.

Content Moderation System:

  • Precision, Recall, or their blend, the F1 Score, and
  • Responsible Metrics like toxicity and bias detection to identify biased, explicit, or harmful content.

We believe that our overview helps you build a robust evaluation framework for your needs.

A Closer Look at Key Metrics

In the previous chapter we took a broad look at evaluation metrics. In this chapter we narrow our focus to those metrics that are most frequently used in evaluating LLM systems.

Definition of Precision@k, Recall@k and tok@k

Precision@k and Recall@k are widely used metrics for evaluating retrieval and recommendation systems (such as RAG or LLMs with retrieval). tok@k/top@k is an LLM evaluation metric that measures whether a searched token appears among the k most probable predictions (top-k prediction). In the discussion below, you will see the word item; this can denote a token (the unit of embedding) or a chunk (the unit of indexing). Which is relevant depends on the stage and level of your evaluation. In any case, each metric lies between 0 and 1 and can be regarded as a percentage value.

Precision@k

  • Measures the proportion of actually relevant items among the top-k results returned by the model. It shows how many recommended or retrieved items are genuinely relevant. Here, k is an arbitrarily chosen cut-off.
  • Definition: Precision@k = (number of relevant items among the top-k results) / k


Recall@k

  • Measures the proportion of relevant items in the entire data set that were found by the model within the top-k results.
  • Definition: Recall@k = (number of relevant items among the top-k results) / (total number of relevant items in the data set)


tok@k/top@k (Token@k)

  • Mostly used for language model outputs: for a token-prediction problem, top@k measures whether the target token is among the k most likely candidates for the next step. It is usually computed per prediction as a top-k lookup among the logits.
  • Definition: top@k = 1 if the target token is among the k tokens with the highest predicted logits, otherwise 0; averaged over many predictions, this yields a top-k hit rate.


A Pythonic View

Equipped with their definitions, let’s have a look at how we can implement precision@k, recall@k and top@k ourselves.

import pandas as pd

def precision_at_k(df, k, y_test, y_pred):
    """Precision@k: share of retrieved items within the top k that are relevant.

    df is expected to be sorted by retrieval score (best first) and to contain
    boolean columns y_test (ground-truth relevance) and y_pred (retrieved or not).
    """
    dfK = df.head(k)
    denominator = dfK[y_pred].sum()                       # retrieved items in the top k
    numerator = dfK[dfK[y_pred] & dfK[y_test]].shape[0]   # retrieved AND relevant in the top k
    if denominator > 0:
        return numerator / denominator
    else:
        return None  # undefined if nothing was retrieved in the top k

def recall_at_k(df, k, y_test, y_pred):
    """Recall@k: share of all relevant items that were retrieved within the top k."""
    dfK = df.head(k)
    denominator = df[y_test].sum()                        # all relevant items in the data set
    numerator = dfK[dfK[y_pred] & dfK[y_test]].shape[0]   # retrieved AND relevant in the top k
    if denominator > 0:
        return numerator / denominator
    else:
        return None  # undefined if the data set contains no relevant items

Typically, you would call:

k = 10
# df is assumed to hold one row per candidate item, ranked best-first,
# with boolean columns "relevant" (ground truth) and "retrieved" (system output)
print(
    f'Precision@k: {precision_at_k(df, k, "relevant", "retrieved"):.2f}, '
    f'Recall@k: {recall_at_k(df, k, "relevant", "retrieved"):.2f}'
)

For tok@k, the token logits of the model are typically used. Example:

def tok_at_k(logits, target_token_id, k):
    """tok@k / top@k: 1 if the target token is among the k most likely next tokens, else 0."""
    top_k = logits.argsort()[-k:]   # indices of the k largest logits (logits is a 1-D numpy array)
    return int(target_token_id in top_k)
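A quick usage sketch with made-up logits for a five-token vocabulary:

import numpy as np

logits = np.array([0.1, 2.3, -0.5, 1.7, 0.9])    # fake next-token logits, one entry per vocabulary token
print(tok_at_k(logits, target_token_id=3, k=2))  # prints 1: token 3 has the second-highest logit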

Exposure in Industry Frameworks

Now, understanding how one can take the definition of a metric and implement it in a few lines of Python is important. After all, you may want to define a specific evaluation metric that does not even exist yet. However, most practitioners in industry turn to a few standard RAG or specialized evaluation frameworks. The two frameworks we commonly work with are LlamaIndex for our RAG needs and ragas for evaluation. Hugging Face offers excellent tooling support for many machine learning tasks, including NLP, and it often comes to our rescue when we need to bridge the gap from “works out of the box” to “must do it ourselves”.

ragas (Library focused on evaluation)

ragas provides precision- and recall-style metrics for retrieval components, most prominently context precision and context recall computed over the retrieved contexts.

The metrics are applied at the just-in-time response evaluation level, for example via metric classes from ragas.metrics such as LLMContextPrecisionWithoutReference when evaluating passage retrieval.

The following is an example from the ragas website:

from ragas import SingleTurnSample
from ragas.metrics import LLMContextPrecisionWithoutReference

# evaluator_llm is assumed to be a previously configured LLM wrapper acting as the judge
context_precision = LLMContextPrecisionWithoutReference(llm=evaluator_llm)

sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
    retrieved_contexts=["The Eiffel Tower is located in Paris."],
)

await context_precision.single_turn_ascore(sample)

Hugging Face (Transformers/Evaluate)

The Hugging Face library “evaluate” provides ready-made implementations for “precision,” “recall,” and, in some cases, “precision@k”/“recall@k” (typically for recommendation and ranking).
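A brief sketch of how this looks with the evaluate API; the toy labels below are for illustration only.

import evaluate

precision = evaluate.load("precision")
recall = evaluate.load("recall")

# Toy binary labels: 1 = relevant/positive, 0 = not relevant
predictions = [1, 0, 1, 1, 0]
references = [1, 0, 0, 1, 1]

print(precision.compute(predictions=predictions, references=references))  # e.g. {'precision': 0.666...}
print(recall.compute(predictions=predictions, references=references))     # e.g. {'recall': 0.666...}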

For next-token probabilities (tok@k), you can compare the token IDs after the forward pass (see above), as there is usually no dedicated metric for this in the HF ecosystem.

LlamaIndex

LlamaIndex supports the evaluation of retrieval systems by integrating Precision@k and Recall@k for different query pipelines, for example in the Retriever Evaluator module.

# In recent releases the evaluator lives under llama_index.core.evaluation;
# older versions exposed it as llama_index.evaluation
from llama_index.core.evaluation import RetrieverEvaluator

# Metric names are passed as strings; "precision" and "recall" are computed
# at the retriever's configured top-k (other options include "hit_rate" and "mrr")
evaluator = RetrieverEvaluator.from_metric_names(
    ["precision", "recall"], retriever=retriever
)
result = evaluator.evaluate(query=query, expected_ids=expected_ids)

# top-k token metrics (tok@k) are typically evaluated separately

Conclusion

You are not just building an LLM application; you are building a stochastic system that delivers intelligence on a probabilistic foundation. In a production environment, this inherent uncertainty means that traditional, deterministic testing methodologies are, quite simply, obsolete for ensuring AI safety. And this is not only about safety; it is also about compliance, reputation, and fiduciary duty.

Deploying an LLM without a robust, automated, and structured evaluation system is not just technically negligent; it is an architectural flaw. Implementing structured evaluation is the non-negotiable building block of your safety architecture. It is not an academic exercise; it is an operational necessity.

Let us know where you will start applying structured evaluation.

Further Reading

https://wandb.ai/onlineinference/genai-research/reports/LLM-evaluation-metrics-A-comprehensive-guide-for-large-language-models--VmlldzoxMjU5ODA4NA

https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/context_precision/

https://www.llamaindex.ai/blog/evaluating-rag-with-deepeval-and-llamaindex

https://huggingface.co/metrics
