Structured Evaluations – One Building Block of AI Safety
TL;DR
Evaluation of AI systems is essential for safety and for the satisfaction of our users. Evaluating stochastic AI systems is considerably more complex than evaluating a typical deterministic enterprise application. It requires an understanding not only of user needs, non-functional system requirements, and model capabilities, but also of their intersections in order to pick the right evaluation approach. A sound approach rests on a clear blueprint, the right evaluation metrics, and automation for transparency. A structured evaluation approach is a must, and industry-grade LLM-application development frameworks facilitate its implementation.
Introduction
AI assistants like chatbots excel in use cases where know-how needs to be communicated, whether that is a cooking recipe, the explanation of a complex physics topic, or a piece of interesting trivia.
In our current project we apply LLMs to disseminate travel information to people interested in the beautiful city of Cologne. If you want to know something about this unconventional German city of a million people in the heart of North Rhine-Westphalia, look no further. Our chatbot can tell you something interesting about Cologne Cathedral, the Rhine river, or the local beer culture, which centers around Kölsch.
Surely, providing tourist guidance and answering questions about local sites is not as high-stakes as, for example, providing correct guidance to a surgeon, pilot, or lawyer. Still, an AI that is tasked with handing out knowledge should get it right.
Now, evaluating (language-based) AI systems is more complicated than evaluating a traditional application due to their stochastic nature (e.g., learning from data, execution in a large stochastic system) and their possibly "fuzzy" inputs. For example, we may ask a question with the same intent but use slightly different words. And even when we use the same words, each of us may have a slightly different understanding of what a given word means. Sounds ridiculous, right?
While testing typical enterprise applications rests on clear expectations from the large architectural level down to the small implementation level, such a clear 1:1 mapping does not exist in AI systems. For example, we can specify that our chatbot should answer the question "Where does Cologne Cathedral stand?" with "In Cologne". Yet the LLM may generate not only this answer but many other (in)correct answers as well. And we cannot, with reasonable confidence and practicality, specify at the micro level how the LLM should compute its answer.
Therefore, understanding how our in-silico companions fare when being asked superficial or intricate questions is essential to instilling trust in the technology and, at the same time, to serving our users. Now, the nuts and bolts of serving users correctly and well are not so easily grasped. You may ask why. Surely, there is only one cathedral. Yes, but what about the chatbot giving out answers that are superficially correct but outdated? Or what if the answers are partially correct, or completely made up? A native Kölner (citizen of Cologne) may easily weed out incorrect answers, but for a tourist from Australia a fact check may boil down to standing in front of a closed door.
Consequently, we need something better than just saying "Yes, that looks right". A structured approach to evaluating LLM-powered systems is needed. Such an approach is best conceptualized as a pipeline that implements the traditional data-to-data-driven-model cycle: starting from the definition of an evaluation data set, the system processes the data to produce some output; this output is then used to evaluate system performance, driving further adjustments to the system.
Over the last years, many professionals have adopted the practice of continuous integration and delivery (CI/CD). Consequently, this iterative cycle should be at the core of your delivery pipeline. This does not mean that you need to execute the evaluation on every change. However, (i) having a quality signal in place is better than having none, and (ii) being able to measure quickly when the need arises is essential in high-pace environments. That is why, in many of our projects, we have implemented data-driven testing and model evaluation as part of the core CI/CD loop.
I once oversaw a system where a single, uncaught context hallucination caused a fair amount of reputational damage. That is why I am uncompromising about structured evaluation.
In the rest of this article, we look into key evaluation metrics. We provide a glimpse into their definition in Python and show how they can be used from standard frameworks such as LlamaIndex.
Key Metrics In LLM Evaluation
A variety of metrics are used to evaluate large language models (LLMs), covering aspects such as accuracy, relevance, robustness, and efficiency. These metrics are selected and combined depending on the use case, benchmark, and objective to ensure a comprehensive performance evaluation. The most important metrics are presented in a structured overview below.
Classic metrics
• Accuracy: Measures the percentage of correct model responses in tasks such as classification or question answering.
• Precision & Recall: Precision considers how many results classified as positive are actually positive; recall measures how many of the actual positive results were recognized.
• F1 score: The harmonic mean of precision and recall, particularly important when the class distribution is unbalanced (a short code sketch for these classic metrics follows below).
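To make the classic metrics concrete, here is a minimal sketch using scikit-learn on a toy binary classification task; the labels are invented purely for illustration:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy ground truth and model predictions (1 = positive class, 0 = negative class)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("accuracy :", accuracy_score(y_true, y_pred))   # share of correct predictions
print("precision:", precision_score(y_true, y_pred))  # correctness of positive predictions
print("recall   :", recall_score(y_true, y_pred))     # coverage of actual positives
print("f1       :", f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```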
Token and retrieval metrics
• Precision@k, Recall@k, tok@k/top@k: Provide information about how many relevant tokens/documents are among the top-k results, which is central to RAG systems, for example.
• Exact Match (EM): Checks whether the model answer exactly matches the reference answer, e.g., in factoid QA.
Language model and text quality metrics
• Perplexity: Evaluates how confident or "surprised" the model is when predicting the next tokens. Lower values indicate better language modeling.
• BLEU, ROUGE, GLEU, METEOR: Measure similarities between reference and model text, which is important for translation or summarization, for example.
• BERTScore: Compares embeddings of model and reference text to capture semantic similarity even in paraphrases (see the sketch after this list).
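As a quick illustration of these text-quality metrics, the following sketch computes ROUGE and BERTScore with the Hugging Face evaluate library; the sentences are our own toy examples, and it assumes the rouge/bertscore backends are installed:

```python
import evaluate

predictions = ["The cathedral stands in the centre of Cologne."]
references = ["Cologne Cathedral is located in the city centre of Cologne."]

# n-gram overlap between prediction and reference
rouge = evaluate.load("rouge")
print(rouge.compute(predictions=predictions, references=references))

# embedding-based semantic similarity between prediction and reference
bertscore = evaluate.load("bertscore")
print(bertscore.compute(predictions=predictions, references=references, lang="en"))
```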
Factual and semantic metrics
• Factual consistency / hallucination rate: Measures how often an LLM provides correct facts or "hallucinates" (invents content).
• Semantic Similarity: Evaluates how similar the generated response is to the target response in terms of meaning; often used for creative or open-ended tasks (illustrated in the sketch below).
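A common way to measure semantic similarity is to embed both texts and compare them via cosine similarity. A minimal sketch with the sentence-transformers library (the model name and sentences are our own choices):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

generated = "The cathedral is right next to Cologne's main station."
reference = "Cologne Cathedral stands beside the central railway station of Cologne."

# Encode both texts and compute their cosine similarity (higher = more similar in meaning)
embeddings = model.encode([generated, reference])
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"semantic similarity: {similarity:.2f}")
```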
System and practical metrics
• Latency: Time to generate a response, important for real-time requirements.
• Throughput: Requests processed per unit of time.
• Memory utilization: Resources required for the model application.
• Batching efficiency: How many parallel requests can be processed efficiently.
Human-in-the-loop & special metrics
• Human evaluation: Humans evaluate responses in terms of naturalness, coherence, creativity, etc.
• Task completion: Checks whether the LLM actually completes a given task (e.g., JSON output).
• Responsible metrics: Measure bias, fairness, and toxic behaviour in model output.
A Brief Guide to Choosing a Metric
In the following, we provide you with a
mental shortcut to picking an evaluation metric. You will find the major
metrics by category listed with their main application domains and use. Critically,
we give guidance when to avoid a metric due to it being irrelevant, misleading
or insufficient.
| Category | Metric | Main Application Domains | When to Avoid |
| --- | --- | --- | --- |
| Classic | Accuracy | Classification tasks (e.g., intent detection, content moderation); question answering with a single correct answer. | When classes are imbalanced; a high score can be misleading if one class dominates. Also for open-ended tasks with multiple valid answers. |
| | Precision & Recall | Crucial for tasks where error types have different costs. Precision: when the cost of false positives is high (e.g., flagging safe content as unsafe). Recall: when missing a positive case is critical (e.g., failing to detect harmful content). | When you need a single, balanced score to compare models; in such cases, the F1 score is more appropriate. |
| | F1 Score | The go-to metric when you need to balance the trade-off between precision and recall, especially when the class distribution is uneven. | When either precision or recall is clearly more important for your specific application. |
| Token & Retrieval | Precision@k / Recall@k | Core to Retrieval-Augmented Generation (RAG) systems and search engines. Measures the quality of the top-k retrieved documents or tokens. | When the ranking order within the top-k results matters more than mere presence. |
| | Exact Match (EM) | Ideal for tasks requiring exact string correctness, such as extracting codes, names, or factoid question answering (e.g., "What is the capital of France?"). | For open-ended, generative, or conversational tasks where the same meaning can be expressed in different ways; it is overly strict and will penalize correct but paraphrased answers. |
| Language Model & Text Quality | Perplexity | Intrinsic evaluation of a language model's quality. Used to compare models or monitor training progress on the same dataset. Low perplexity indicates better prediction of the next word. | As a proxy for task-specific performance like factuality or helpfulness; a model can be fluent (low perplexity) but factually wrong. Not for comparing models trained on different datasets. |
| | BLEU / ROUGE / METEOR | Machine translation (BLEU) and text summarization (ROUGE); METEOR is an improved variant that considers synonyms and stemming. | When evaluating for semantic meaning or when valid outputs involve significant paraphrasing; these are n-gram overlap metrics and do not capture meaning well. |
| | BERTScore | Evaluating tasks where semantic correctness is key, such as paraphrase generation, summarization, and question answering. It captures meaning even when words differ. | When computational resources are limited, as it is more expensive to compute than BLEU/ROUGE. When grammatical correctness or sentence structure is the primary concern. |
| Factual & Semantic | Factual Consistency / Hallucination Rate | Critical for applications where truthfulness is paramount, such as summarizing documents or generating answers based on a given context (e.g., in RAG). Measures whether claims are supported by a source. | For purely creative tasks where factual accuracy is not the goal (e.g., writing fiction or poetry). |
| | Semantic Similarity | Evaluating open-ended and creative tasks (e.g., story generation, idea brainstorming) where there are multiple valid outputs and the goal is to assess meaning proximity to a reference or intent. | When you require exact factual correctness or when evaluating structured outputs. |
| System & Practical | Latency | Essential for all real-time applications, such as live chatbots, voice assistants, and interactive AI features. | During initial model research and development phases where speed is not the primary focus. |
| | Throughput | Measuring system scalability and cost-efficiency, especially for batch processing of large volumes of requests. | For evaluating the user experience of a single interaction. |
| | Memory Utilization | Critical for deploying models on resource-constrained devices (e.g., edge devices, mobile phones) and for estimating infrastructure costs. | When only evaluating the quality of the model output in a research setting. |
| | Batching Efficiency | Optimizing server-side performance and cost for handling multiple concurrent requests. | For low-traffic applications or during prototyping. |
| Human-in-the-Loop & Special | Human Evaluation | The gold standard for nuanced aspects like coherence, creativity, helpfulness, and safety. Used to validate automated metrics and for high-stakes applications. | When you need fast, cheap, and scalable evaluation; it is slow, expensive, and can suffer from low inter-annotator agreement. |
| | Task Completion | Evaluating AI agents and functional applications (e.g., correctly generating JSON, calling the right API, completing a multi-step task). | For evaluating purely conversational chatbots where the primary goal is information exchange, not taking an action. |
| | Responsible Metrics (Bias, Toxicity) | Essential for all public-facing applications to ensure outputs are fair, ethical, and non-harmful. Used in content moderation and for adhering to safety policies. | No application should completely skip evaluating for responsibility, though the strictness of the criteria may vary. |
Let's make this concrete with three example applications and a sensible metric mix for each.

RAG-powered Customer Service Chatbot:
- Factual Consistency to avoid wrong answers
- Answer Relevancy to stay on-topic
- Recall@k for retrieval quality and Latency for user experience
- Human Evaluation to spot those tiny needle-in-the-haystack errors and inconsistencies

Creative Writing Assistant:
- Semantic Similarity
- Human Evaluation of creativity and coherence, which are better suited to an open-ended task with no single correct answer and no easy-to-spot errors

Content Moderation System:
- Precision, Recall, or their blend, the F1 Score
- Responsible Metrics such as toxicity and bias detection to identify biased, explicit, or harmful content
We believe that our overview helps you
build a robust evaluation framework for your needs.
A Closer Look at Key Metrics
In the previous chapter we took a broad look at evaluation metrics. In this chapter we narrow our focus to those metrics that are most frequently used in evaluating LLM systems.
Definition of Precision@k, Recall@k and tok@k
Precision@k and Recall@k are widely used metrics for evaluating retrieval and recommendation systems (such as RAG or LLMs with retrieval). tok@k/top@k is an LLM evaluation metric that measures whether a target token appears among the k most probable predictions (top-k prediction). In the discussion below you will see the word item; it can denote a token (the unit of embedding) or a chunk (the unit of indexing). Which one is relevant depends on the stage and level of your evaluation. In any case, each metric lies between 0 and 1 and can be regarded as a percentage value.
Precision@k
- Measures the proportion of actually relevant items among the top-k results returned by the model. It shows how many recommended or retrieved items are genuinely relevant. Here, k is an arbitrarily chosen cut-off.
- Definition: Precision@k = (number of relevant items in the top-k results) / k
Recall@k
- Measures the proportion of relevant items in the entire data set that were found by the model within the top-k results.
- Definition: Recall@k = (number of relevant items in the top-k results) / (total number of relevant items)
tok@k/top@k (Token@k)
- Mostly used for language model outputs: for a token prediction problem, top@k measures whether the target token is among the k tokens most likely to be selected in the next step. It is usually calculated per prediction step over the model's logits.
- Definition: top@k = 1 if the target token is among the k highest-probability tokens, otherwise 0; averaged over many predictions, this yields the top-k accuracy.
A Pythonic View
Equipped with these definitions, let's have a look at how we can implement precision@k, recall@k and top@k ourselves. Typically, you would define and call small helper functions along the following lines.
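Here is a minimal, dependency-free sketch; the function names and the example document IDs are ours and purely illustrative:

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Share of the top-k retrieved items that are actually relevant."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for item in top_k if item in relevant_ids)
    return hits / k

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Share of all relevant items that show up in the top-k retrieved items."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for item in top_k if item in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0

# Ranked retrieval result and ground-truth relevant items (illustrative IDs)
retrieved = ["doc3", "doc1", "doc7", "doc2", "doc9"]
relevant = {"doc1", "doc2", "doc5"}

print(precision_at_k(retrieved, relevant, k=5))  # 2 relevant hits out of 5 -> 0.4
print(recall_at_k(retrieved, relevant, k=5))     # 2 of 3 relevant items found -> ~0.67
```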
For tok@k, the token logits of the model are typically used. Example:
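A small sketch using PyTorch; the logits and vocabulary size are made up for illustration, whereas in practice they would come from the model's forward pass:

```python
import torch

def top_k_hit(logits, target_token_id, k):
    """Return True if the target token is among the k most probable next tokens."""
    top_k_ids = torch.topk(logits, k).indices
    return target_token_id in top_k_ids

# Fake next-token logits over a tiny vocabulary of 5 tokens
logits = torch.tensor([2.1, 0.3, 1.7, -0.5, 0.9])

print(top_k_hit(logits, target_token_id=2, k=2))  # True: token 2 has the 2nd-highest logit
print(top_k_hit(logits, target_token_id=3, k=2))  # False: token 3 is not in the top 2
```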
Exposure in Industry Frameworks
Now, understanding how to take the definition of a metric and implement it in a few lines of Python is important. After all, you may want to define a specific evaluation metric that does not even exist yet. However, most practitioners in industry turn to a few standard RAG or specialized evaluation frameworks. The two frameworks we commonly work with are LlamaIndex for our RAG needs and ragas for evaluation. Hugging Face offers excellent tooling support for many machine learning tasks, including NLP, and it often comes to our rescue when we need to bridge the gap from "works out of the box" to "must do it ourselves".
ragas (Library focused on evaluation)
ragas directly provides precision- and recall-style metrics for retrieval components, most prominently context precision and context recall. The metrics are applied at response-evaluation time, for example via ragas.metrics.context_precision and ragas.metrics.context_recall when evaluating passage retrieval tasks.
The following is a sketch along the lines of the examples in the ragas documentation (see the link under Further Reading).
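It assumes a ragas 0.1-style API and an LLM judge configured in the environment; the question, contexts, and reference answer are our own toy data:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_precision, context_recall

# One evaluation sample: question, retrieved contexts, and a reference answer.
# Column names follow the ragas 0.1-style schema and may differ between versions.
samples = {
    "question": ["Where does Cologne Cathedral stand?"],
    "contexts": [[
        "Cologne Cathedral is a Catholic cathedral in the city centre of Cologne.",
        "Koelsch is a pale beer brewed in Cologne.",
    ]],
    "ground_truth": ["Cologne Cathedral stands in the city centre of Cologne."],
}

dataset = Dataset.from_dict(samples)

# Both metrics are LLM-judged, so a judge LLM must be configured
# (by default via the OPENAI_API_KEY environment variable).
result = evaluate(dataset, metrics=[context_precision, context_recall])
print(result)
```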
Hugging Face (Transformers/Evaluate)
The Hugging Face library “evaluate”
provides ready-made implementations for “precision,” “recall,” and, in some
cases, “precision@k”/“recall@k” (typically for recommendation and ranking).
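For instance, classification-style precision and recall can be computed like this; the labels are invented, with 1 marking the positive class (e.g., "harmful content"):

```python
import evaluate

precision = evaluate.load("precision")
recall = evaluate.load("recall")

references  = [1, 0, 1, 1, 0, 0]   # ground truth
predictions = [1, 0, 0, 1, 0, 1]   # model output

print(precision.compute(predictions=predictions, references=references))  # {'precision': ...}
print(recall.compute(predictions=predictions, references=references))     # {'recall': ...}
```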
For next-token probabilities (tok@k), you
can compare the token IDs after the forward pass (see above), as there is
usually no dedicated metric for this in the HF ecosystem.
LlamaIndex
LlamaIndex supports the evaluation of retrieval systems by integrating Precision@k and Recall@k for different query pipelines, for example in the Retriever Evaluator module.
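A minimal sketch of how this could look with the RetrieverEvaluator; the document folder, metric selection, and expected node ID are illustrative, and the default setup assumes an embedding model (e.g., via an OpenAI API key) is configured:

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.evaluation import RetrieverEvaluator

# Build a retriever over our own documents (folder path is illustrative)
documents = SimpleDirectoryReader("./cologne_docs").load_data()
index = VectorStoreIndex.from_documents(documents)
retriever = index.as_retriever(similarity_top_k=5)

# Precision/recall-style retrieval metrics at k = similarity_top_k
# (metric names as registered in recent llama-index-core versions)
evaluator = RetrieverEvaluator.from_metric_names(
    ["precision", "recall", "hit_rate", "mrr"], retriever=retriever
)

result = evaluator.evaluate(
    query="Where does Cologne Cathedral stand?",
    expected_ids=["node-id-of-the-relevant-chunk"],  # ground-truth node IDs, illustrative
)
print(result.metric_vals_dict)
```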
Conclusion
You are not just building an LLM application; you are building a stochastic system that delivers intelligence on a probabilistic foundation. In a production environment, this inherent uncertainty means that traditional, deterministic testing methodologies are, quite simply, insufficient for ensuring AI safety. And this is not only about safety; it is also about compliance, reputation, and fiduciary duty.
Deploying an LLM without a robust, automated, and structured evaluation system is not just technically negligent; it is an architectural flaw. Implementing structured evaluation is a non-negotiable building block of your safety architecture. It is not an academic exercise; it is an operational necessity.
Let us know where you will start applying structured evaluation.
Further Reading
https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/context_precision/
https://www.llamaindex.ai/blog/evaluating-rag-with-deepeval-and-llamaindex