A lot of AI research, such as HELM and BigBench, has been devoted to building test suites that evaluate the accuracy of large language models. However, the tasks in those suites often don’t generalize to narrower, more specific real-world applications, so developers are usually left to build their own infrastructure for measuring the accuracy of their applications. This has resulted in a lot of hacked-together solutions involving custom scripts, Google Sheets and Slack channels for coordination.
As I mentioned in my previous post, measuring whether LLM agents and applications are getting better or worse as the algorithms, prompts, data or base LLM change is critical for iterating quickly and with confidence in the fast-paced LLM world. Traditional software engineering has mature practices around CI/CD, and a similarly streamlined workflow is needed to make LLM-based software development more reliable and deterministic.
We have been seeing LLM application developers set up four types of evaluations, which we describe in more detail below.
1. Metric based
The simplest evaluation is when one can write a well-defined mathematical or algorithmic function to compare the desired, ideal output with the LLM generated output. Examples include:
Semantic similarity: Generate vector embeddings of the ideal and generated output and take their dot product as a similarity measure (for embeddings normalized to unit length, this is the cosine similarity).
Exact match: True if the ideal output and model output match. Some flexibility can be included to allow for differences in whitespace and case.
Includes: True if the ideal output is contained within the model output
Fuzzy match: Allow for even more flexibility by making the includes comparison case insensitive and removing punctuation, articles and whitespace.
OpenAI’s evals command line interface (CLI) tool includes the above metrics and a comprehensive set of tests for evaluations. However, because its design is centered on the CLI, log files and OpenAI models, it is difficult to integrate programmatically with existing workflows or to use with non-OpenAI models.
Custom webhooks: LLM app developers often end up needing to define custom metric functions. These can be called via a webhook.
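To make the metrics above concrete, here is a minimal sketch in plain Python. The function names and the exact normalization rules are our own assumptions for illustration and are not the OpenAI evals implementation. In particular, this fuzzy match collapses runs of whitespace rather than removing whitespace entirely, and the similarity function expects embeddings that you have already computed with a model of your choice.

```python
import math
import re
import string
from typing import Sequence

# Hypothetical helpers for illustration; not taken from any library.
_ARTICLES = re.compile(r"\b(a|an|the)\b")
_PUNCTUATION = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation and articles, collapse whitespace."""
    text = text.lower().translate(_PUNCTUATION)
    text = _ARTICLES.sub(" ", text)
    return " ".join(text.split())


def exact_match(ideal: str, output: str) -> bool:
    """Equality that ignores case and differences in whitespace."""
    return " ".join(ideal.lower().split()) == " ".join(output.lower().split())


def includes(ideal: str, output: str) -> bool:
    """True if the ideal output appears verbatim inside the model output."""
    return ideal in output


def fuzzy_match(ideal: str, output: str) -> bool:
    """The includes check, applied after normalizing both strings."""
    return normalize(ideal) in normalize(output)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two embedding vectors divided by their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

A custom metric exposed through a webhook would typically wrap a function of the same shape: it takes the ideal and generated output and returns a boolean or a score.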
2. Tools based
In certain situations, specialized tools are needed for evaluation. For example, in code-generation applications, one may want to call a compiler to check that the generated code compiles, or run unit tests to check that they pass. There may also be specialized context, such as few-shot examples inserted dynamically, that needs to be fetched from code repositories, databases and internal tools. Autoevaluator is an open source tool integrated with LangChain to help with document QA evaluation.
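As a rough illustration of a tools-based check, the sketch below assumes the generated code is Python. It uses the built-in `compile` to test that the code parses without running it, and a subprocess to run assertion-style tests against it. The function names and the temporary file name are hypothetical. Running model-generated code is risky, so in practice this would be done inside a sandbox.

```python
import os
import subprocess
import sys
import tempfile


def compiles(source: str) -> bool:
    """Return True if `source` compiles as Python. The code is not executed."""
    try:
        compile(source, "<generated>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False


def passes_tests(source: str, test_code: str, timeout: float = 10.0) -> bool:
    """Run `source` followed by `test_code` in a separate Python process.

    Returns True only if the process exits with status 0 within `timeout`
    seconds. A failing assert in `test_code` gives a non-zero exit status.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "candidate_test.py")
        with open(path, "w", encoding="utf-8") as f:
            f.write(source + "\n\n" + test_code + "\n")
        try:
            result = subprocess.run(
                [sys.executable, path],
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0
```

For example, `passes_tests("def add(a, b):\n    return a + b", "assert add(2, 3) == 5")` runs the candidate and its test together in a fresh interpreter.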
3. Model based
A natural development has been the use of LLMs to evaluate their own output. While attractive because they are easy to get started with, model-based evaluations are not as straightforward as the community had initially hoped. LLMs are biased towards their own output, may be contaminated by the test data, or may lack the specialized, nuanced or subjective knowledge needed to evaluate the output. Teams are experimenting with building specialized models to detect specific attributes and anomalies such as hallucinations, harmful content and degradations in classification quality. Data privacy requirements also prevent third-party LLMs from being used for evaluation in certain sensitive financial and healthcare use cases.
Another way to leverage LLMs in the evaluation pipeline is to bootstrap example data for the evaluation dataset. This comes with the usual caveat that human supervision is typically needed to refine the LLM output before it’s adopted into the evaluation dataset. Overall, though, using LLMs in this way could save time.
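A model-based evaluation can be sketched as a function that formats a grading prompt and parses the verdict. In the sketch below, the prompt wording, the `model_graded` name and the `judge` callable are all assumptions for illustration. `judge` stands in for whatever client you use to call a grading model, which keeps the example independent of any particular provider.

```python
from typing import Callable, Optional

# Hypothetical grading prompt; real prompts usually need iteration.
GRADER_PROMPT = (
    "You are grading an answer.\n"
    "Question: {question}\n"
    "Reference answer: {ideal}\n"
    "Submitted answer: {output}\n"
    "Reply with a single word: CORRECT or INCORRECT."
)


def model_graded(
    question: str,
    ideal: str,
    output: str,
    judge: Callable[[str], str],
) -> Optional[bool]:
    """Ask a grading model for a verdict.

    `judge` takes a prompt string and returns the model's text reply.
    Returns True or False for a clear verdict, and None when the reply
    cannot be parsed, so the example can be sent for human review.
    """
    prompt = GRADER_PROMPT.format(question=question, ideal=ideal, output=output)
    reply = judge(prompt).strip().upper()
    if reply.startswith("CORRECT"):
        return True
    if reply.startswith("INCORRECT"):
        return False
    return None
```

Returning `None` for unparseable replies, rather than guessing, keeps grader failures visible instead of silently counting them as wrong answers.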
4. Human expert in the loop
In many situations, it is not feasible to describe or define the ideal response algorithmically, via tools or with models. In these situations, one needs a human in the loop to provide feedback on, or to annotate, the generated output. Beyond the point evaluation, this feedback, if logged and stored correctly, could serve as training data for future fine-tuning of models.
In a customer support context, for example, nuance and cultural context can get lost in LLM-based language translations. A human customer support agent who is a native speaker could annotate the right response in this scenario.
For long-form literary writing, or to assess output in highly technical domains, human experts still tend to perform better than LLMs.
The obvious downside to this type of evaluation is that inserting human experts in the loop makes the process slower and more expensive. Recent reports also indicate that human labelers might be using LLMs themselves and, in doing so, contaminating their responses with model-based evaluations!
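To illustrate how human feedback could be logged so that it is reusable later, here is a minimal sketch that writes one JSON record per line. The field names, the rating scale and the function names are assumptions for illustration, not a standard schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable, Iterator, Optional, TextIO, Tuple


@dataclass
class Feedback:
    """One human judgment on one generated output (hypothetical schema)."""

    prompt: str
    output: str
    rating: int  # assumed scale, e.g. 1 (bad) to 5 (good)
    corrected_output: Optional[str] = None  # the annotator's preferred answer
    annotator: Optional[str] = None


def log_feedback(record: Feedback, sink: TextIO) -> None:
    """Append the record to `sink` as a single JSON line."""
    sink.write(json.dumps(asdict(record), ensure_ascii=False) + "\n")


def to_finetuning_pairs(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (prompt, completion) pairs from records that carry a correction."""
    for line in lines:
        record = json.loads(line)
        if record.get("corrected_output"):
            yield record["prompt"], record["corrected_output"]
```

Here `sink` can be any writable text file object, such as a file opened in append mode. Storing the annotator's corrected output alongside the rating is what makes the log usable as fine-tuning data rather than only as a score.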
Why LLM application evaluations are important
Without a firm grasp on whether LLM applications are getting better, it is almost impossible to make progress and improve their quality. Beyond unleashing agentic capabilities in production, we believe that evaluations are fundamental to enabling features such as fine-tuning, distillation and model orchestration.
Agents at different levels of complexity require different considerations in evaluations. Some of the issues with even single-API-call agents are outlined above. As we get into role-playing agents, evaluating the quality of dialogs, or generating and comparing the quality of final summary output, becomes critical. As we advance into more autonomous agents connected to Internet-scale tools and internal DBs, safety evaluations start becoming a real issue.
Fine-tuning: Fine-tuned models could be more accurate, self-hosted, self-owned, faster, cheaper and more reliable to operate. However, for most production uses today, the LLMs from closed providers such as OpenAI or Anthropic tend to give far better accuracy than the open source alternatives, even though their latest models do not yet have public APIs available for fine-tuning. Developers are interested in evaluating whether and when the open source alternatives reach parity with the closed offerings.
Distillation: The instruction prompts in LLM logs provide rich analytics data on the distribution of tasks the LLM is being asked to perform. This data could be used to discover smaller models that are specialized for those tasks, or to distill those task capabilities into smaller versions of the original model.
Model Orchestration: As developers transition from using one monolithic LLM to using a combination of self-hosted models, there is a need to dynamically route each query depending on the sub-task it is aiming to perform.
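The routing idea can be sketched as a small dispatcher. In this sketch, `classify` and the entries of `models` are hypothetical callables that you would supply: a sub-task classifier (which could itself be a small model or a set of rules) and one client per model.

```python
from typing import Callable, Dict


def make_router(
    classify: Callable[[str], str],
    models: Dict[str, Callable[[str], str]],
    default: str,
) -> Callable[[str], str]:
    """Build a function that sends each query to the model for its sub-task.

    `classify` maps a query to a sub-task label. `models` maps labels to
    callables that take a query and return a response. Queries with a label
    that has no entry in `models` go to the `default` model.
    """
    if default not in models:
        raise ValueError(f"default model {default!r} is not in models")

    def route(query: str) -> str:
        task = classify(query)
        model = models.get(task, models[default])
        return model(query)

    return route
```

Evaluations fit in here as the way to decide which model earns each label: a route is only worth adding if the cheaper or specialized model scores acceptably on that sub-task.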
Currently, the above steps require a lot of manual effort from teams of data scientists, machine learning engineers and DevOps support, which limits their adoption to the few companies that can afford the budget and talent. In the future, however, we see these optimizations and orchestration being done by an automated process. We think the maturity of abstractions built on top of LLMs as a computing platform will follow a journey similar to how we went from manual assembly programming to building compilers and operating systems for computers.
If you are thinking about the above topics as you build LLM applications, do reach out - we’d love to hear from you at Log10.io!