Previously, we introduced llmeval, a versatile command-line tool for comprehensive LLM application evaluation, with features such as one-shot model evaluation, hyperparameter sweeping, and report generation. Despite these strengths, it had limitations: crafting custom evaluation functions was difficult, inlining Python code inside YAML configurations was cumbersome, and sweeps were restricted to LLM models and hyperparameters. Since then, we have learned that our users need to define and maintain custom evaluation metrics directly in Python (no more YAML files), especially for agent-based testing scenarios, improving both usability and flexibility. We are excited to share best practices for setting up LLM evaluations. These practices are designed to empower users to:
Programmatically evaluate and test LLM applications
Easily explore optimal settings, including prompts and LLM models
Customize evaluation and test metrics effortlessly
Collect and analyze data from all test runs efficiently
Support LLM agent testing, including multi-step static and dynamic trajectories
Evals are an essential part of building LLM apps, even if you are just getting started. In this blog post, I'll share a recipe for setting up LLM testing, specifically tailored for Python applications. We'll walk through everything from a simple test setup to comprehensive evaluation and report generation.
Contrary to the cumbersome setups often seen in LLM evaluations and benchmarks, we primarily rely on Pytest, a widely used testing framework for Python, to conduct our tests. One of Pytest's extensions, pytest-harvest, lets us capture test execution data for post-processing, analysis, and reporting.
We've created this example Python code repository, which mimics a typical Python library. It includes a src folder with the LLM functions and custom evaluation metrics, and a test folder for the tests. If you are exploring evaluation metrics beyond traditional NLP and model-based methods, we offer solutions such as our custom AutoFeedback models, which significantly amplify human feedback through Log10.
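For context, the summarization function exercised by the tests below could look roughly like this minimal sketch using the OpenAI client; the exact prompt, model, and module layout in the example repo may differ:

from openai import OpenAI

client = OpenAI()

def summarize_to_30_words(article: str) -> str:
    # Illustrative sketch: ask the model for a summary of at most 30 words
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        messages=[
            {"role": "system", "content": "Summarize the user's article in at most 30 words."},
            {"role": "user", "content": article},
        ],
    )
    return response.choices[0].message.content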
Example 1: Testing LLM Consistency Across Multiple Runs
Let's start with a simple example. In this initial demonstration, we’ll illustrate how to define test data, create custom evaluation methods, and compute statistics across multiple test runs.
To begin, we defined a test function called test_simple to evaluate summarize_to_30_words, a function that summarizes articles. Using pytest.fixture, we created the input data: an article and its expected_summary.
For our evaluation metric, we created a custom Python function to measure the similarity between the reference and the output summary, for example by computing the cosine distance between their embeddings, and validated that the value falls within a predefined threshold. Furthermore, we used pytest.mark.repeat(3) to execute the same test three times consecutively; these repeated runs help assess the consistency and robustness of the LLM.
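As an illustration, such a cosine_distance helper could be implemented along these lines; this is a hedged sketch using OpenAI embeddings, and the example repo may use a different embedding model or library:

import numpy as np
from openai import OpenAI

client = OpenAI()

def _embed(text: str) -> np.ndarray:
    # Fetch an embedding vector for the text (the model choice is an assumption)
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(response.data[0].embedding)

def cosine_distance(reference: str, output: str) -> float:
    # Cosine distance = 1 - cosine similarity; 0 means the texts point in the same direction
    a, b = _embed(reference), _embed(output)
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))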
Finally, we defined another test, test_mean_cosine_similarity, to compute the mean and standard deviation of the cosine distances obtained across the three test runs. This analysis is enabled by the results_bag and module_results_df fixtures provided by pytest-harvest, which make it easy to aggregate and analyze data across multiple test executions.
import pandas as pd
import pytest

# summarize_to_30_words and cosine_distance are imported from the example repo's src folder

@pytest.fixture
def article():
    return "<FULL-ARTICLE-HERE>"

@pytest.fixture
def expected_summary():
    return "<REFERENCE-SUMMARIZATION-HERE>"

@pytest.mark.repeat(3)
def test_simple(article, expected_summary, results_bag):
    output = summarize_to_30_words(article)
    metric = cosine_distance(expected_summary, output)
    # store the metric in pytest-harvest's results_bag for later aggregation
    results_bag.cos_sim = metric
    assert metric < 0.2

def test_mean_cosine_similarity(module_results_df: pd.DataFrame):
    print("Average cosine distance: ", module_results_df["cos_sim"].mean())
    print("Std dev of cosine distance: ", module_results_df["cos_sim"].std())
    assert module_results_df["cos_sim"].mean() < 0.2
Example 2: Loading a Dataset, Gating Deployment, and Generating a Report
The second example demonstrates how to load a dataset from a JSON Lines file, generate and save a report, and gate deployment to production, for example in CI/CD (more on that below).
We began by defining a fixture to load the dataset from a JSON Lines file and introduced a custom evaluation function to tally the words in the output. Next, we created two tests:
test_summarize_to_30_words measures the evaluation metrics and runs multiple times across the dataset.
test_pass_rate_of_30_words generates and saves a markdown report and checks the pass rate across all test runs.
This evaluation allows for a less-than-perfect score, giving you flexibility when deploying LLM applications. In addition, we added a RECORD mode to create or update reference outputs using LLMs such as GPT-4. For those using log10.io, the process is even easier: you can simply curate an evaluation dataset from your existing completions by downloading them with the log10 CLI. This feature plays a crucial role in ensuring the quality and accuracy of the responses.
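A RECORD mode can be as simple as an environment-variable switch that regenerates the reference outputs; here is one possible sketch, where the helper name, env var, prompt, and file handling are illustrative rather than the repo's exact implementation:

import json
import os

from openai import OpenAI

client = OpenAI()
RECORD = os.getenv("RECORD", "0") == "1"

def maybe_record_reference(article: str, filename: str = "data.jsonl") -> None:
    # In RECORD mode, (re)generate the reference summary with a strong model
    # and append it to the dataset; otherwise leave the dataset untouched.
    if not RECORD:
        return
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Summarize the user's article in at most 30 words."},
            {"role": "user", "content": article},
        ],
    )
    reference = response.choices[0].message.content
    with open(filename, "a") as f:
        f.write(json.dumps({"article": article, "summary": reference}) + "\n")

Putting it together, the fixture and the two tests look like this: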
import jsonlines
import pandas as pd
import pytest

# summarize_to_30_words, cosine_distance, count_words, and the report helpers
# are imported from the example repo

@pytest.fixture
def data():
    # load (article, summary) pairs from a JSON Lines file
    filename = "data.jsonl"
    data = []
    with jsonlines.open(filename) as reader:
        for obj in reader:
            data.append((obj["article"], obj["summary"]))
    return data

@pytest.mark.repeat(2)
@pytest.mark.parametrize("sample_idx", range(3))
def test_summarize_to_30_words(data: list, sample_idx: int, results_bag):
    article, expected_summary = data[sample_idx]
    output = summarize_to_30_words(article)
    metric = cosine_distance(expected_summary, output)
    num_words = count_words(output)
    # record inputs, outputs, and metrics for the report
    results_bag.test_name = f"test_summarize_to_30_words_{sample_idx}"
    results_bag.article = article
    results_bag.expected_summary = expected_summary
    results_bag.output = output
    results_bag.cos_sim = metric
    results_bag.num_words = num_words
    assert num_words <= 30

def test_pass_rate_of_30_words(module_results_df: pd.DataFrame):
    df = filter_results_by_test_name(module_results_df, "test_summarize_to_30_words")
    pass_rate, pass_rate_report_str = report_pass_rate(df)
    long_articles_section = _prepare_long_string_section(df, "article")
    selected_columns = [
        "status",
        "article",
        "expected_summary",
        "output",
        "cos_sim",
        "num_words",
    ]
    detailed_results = generate_results_table(df, selected_columns)
    generate_markdown_report(
        "test_summarize_to_30_words",
        [pass_rate_report_str, detailed_results, long_articles_section],
    )
    assert pass_rate > 0.60
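The reporting helpers used above (filter_results_by_test_name, report_pass_rate, generate_results_table, generate_markdown_report, and _prepare_long_string_section) are defined in the example repo; as a rough illustration of what two of them might look like, not their exact implementation:

import os
import pandas as pd

def report_pass_rate(df: pd.DataFrame) -> tuple[float, str]:
    # pytest-harvest records a "status" column per test run ("passed", "failed", ...)
    passed = (df["status"] == "passed").sum()
    rate = passed / len(df)
    return rate, f"## Pass rate\n\n{passed}/{len(df)} = {rate:.0%}\n"

def generate_markdown_report(name: str, sections: list[str], out_dir: str = "generated_reports") -> None:
    # Concatenate the sections and write them to a markdown file
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, f"{name}.md"), "w") as f:
        f.write(f"# {name}\n\n" + "\n\n".join(sections))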
Here's a screenshot of the generated report, showing the pass rate across six test runs (3 samples × 2 repeats) and the top of the full results table. You can find the full report here.
Example 3: Comparing the Performance of Different Prompts and Models
In this example, we demonstrate how to compare the performance of two different prompts and generate comprehensive reports. The same approach can easily be adapted to compare LLM models. For those interested in comparing the latest models, we have also included this functionality in the Log10 CLI and will discuss it in detail in our next post.
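As an illustration of the model-comparison variant, you could parametrize the model name; this sketch assumes a hypothetical summarize_with_model helper that forwards the model argument to the LLM call:

import pytest

@pytest.mark.repeat(3)
@pytest.mark.parametrize("model", ["gpt-3.5-turbo-0125", "gpt-4o"])
@pytest.mark.parametrize("sample_idx", range(3))
def test_summarize_models(data: list, sample_idx: int, model: str, results_bag):
    article, expected_summary = data[sample_idx]
    output = summarize_with_model(article, model=model)  # hypothetical helper
    # tag each run with the model so results can be grouped later
    results_bag.test_name = f"test_summarize_{model}_{sample_idx}"
    results_bag.cos_sim = cosine_distance(expected_summary, output)

Returning to the prompt comparison: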
We began by creating separate functions and tests tailored to each prompt. This approach simplifies post-processing of the data and keeps the evaluation process clear. Reusing the evaluation metrics from the previous examples kept our methodology consistent. We then introduced a third test, test_compare_prompts_results, to collect all test execution results, generate figures and tables, and prepare the final report (full report here).
import matplotlib.pyplot as plt
import pandas as pd
import pytest
from tabulate import tabulate

# summarize_with_sys_prompt_1/2, sys_message_1/2, cosine_distance, the report
# helpers, and the data fixture from the previous examples are assumed to be available

@pytest.mark.repeat(3)
@pytest.mark.parametrize("sample_idx", range(3))
def test_summarize_with_sys_prompt_1(data: list, sample_idx: int, results_bag):
    article, expected_summary = data[sample_idx]
    output = summarize_with_sys_prompt_1(article)
    metric = cosine_distance(expected_summary, output)
    results_bag.test_name = f"test_summarize_sys_prompt_1_{sample_idx}"
    results_bag.article = article
    results_bag.expected_summary = expected_summary
    results_bag.output = output
    results_bag.cos_sim = metric
    results_bag.prompt = sys_message_1

@pytest.mark.repeat(3)
@pytest.mark.parametrize("sample_idx", range(3))
def test_summarize_with_sys_prompt_2(data: list, sample_idx: int, results_bag):
    article, expected_summary = data[sample_idx]
    output = summarize_with_sys_prompt_2(article)
    metric = cosine_distance(expected_summary, output)
    results_bag.test_name = f"test_summarize_sys_prompt_2_{sample_idx}"
    results_bag.article = article
    results_bag.expected_summary = expected_summary
    results_bag.output = output
    results_bag.cos_sim = metric
    results_bag.prompt = sys_message_2

def test_compare_prompts_results(module_results_df: pd.DataFrame):
    # keep only the prompt-comparison runs; copy to avoid mutating the harvested frame
    df = module_results_df[module_results_df["test_name"].str.contains("test_summarize_sys_prompt_")].copy()
    # save df to csv
    df.to_csv("module_results_df.csv", index=True)
    # compare mean and std of cosine distance per prompt
    mean_1 = df[df["test_name"].str.contains("sys_prompt_1")]["cos_sim"].mean()
    std_1 = df[df["test_name"].str.contains("sys_prompt_1")]["cos_sim"].std()
    mean_2 = df[df["test_name"].str.contains("sys_prompt_2")]["cos_sim"].mean()
    std_2 = df[df["test_name"].str.contains("sys_prompt_2")]["cos_sim"].std()
    # create a summary dataframe
    summary_df = pd.DataFrame(
        {
            "Prompt": ["sys_prompt_1", "sys_prompt_2"],
            "Mean Cosine distance": [mean_1, mean_2],
            "Std Dev": [std_1, std_2],
        }
    )
    # plot summary_df as a bar chart and save it to a file
    plot_file = "generated_reports/test_compare_prompts_results.png"
    summary_df.plot(kind="bar", x="Prompt", y="Mean Cosine distance", yerr="Std Dev")
    plt.savefig(plot_file)
    markdown_table = tabulate(summary_df, headers="keys", tablefmt="pipe", showindex=False)
    prompt_comp_section = (
        "## Prompt Comparison\n\n" + markdown_table + "\n\n![Prompt Comparison](test_compare_prompts_results.png)"
    )
    # remove newlines in the output so the markdown table renders cleanly
    df["output"] = df["output"].str.replace("\n", " ")
    long_articles_section = _prepare_long_string_section(df, "article")
    selected_columns = [
        "prompt",
        "article",
        "expected_summary",
        "output",
        "cos_sim",
    ]
    detailed_results = generate_results_table(df, selected_columns)
    generate_markdown_report(
        "test_compare_prompts_results",
        [prompt_comp_section, detailed_results, long_articles_section],
    )
Evaluating Agent/Tool Calls
With the above building blocks in place, we now turn to scenarios where tool calling plays a crucial role, especially for constructing agents with multiple steps. To demonstrate, we developed a two-step function for retrieving the weather of a specific location. The first LLM call determines which API to use and its arguments; after executing the API, a second LLM call processes the results to prepare the final response. We designed this function to return not only the final output but also the intermediate tool calls. During testing, we validate both the intermediate and final responses, ensuring the integrity and accuracy of the agent functionality within our system. With more steps in a more complex system, you could save all intermediate and final responses in JSON format and validate them against expected responses, enhancing traceability and debugging.
# please refer to the above link for the complete code
from openai import OpenAI

client = OpenAI()

def weather_of(question):
    messages = [
        {
            "role": "user",
            "content": f"what's the weather in {question}?",
        }
    ]
    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_current_weather",
                "description": "Get the current weather in a given location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state, e.g. San Francisco, CA",
                        },
                        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                    },
                    "required": ["location"],
                },
            },
        }
    ]
    # step 1: the LLM decides which tool to call and with what arguments
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        messages=messages,
        tools=tools,
    )
    response_message = response.choices[0].message
    tool_calls = response_message.tool_calls
    step_1_tool_calls = tool_calls
    if tool_calls:
        # execute the tool functions and append their results to messages
        ...
        # step 2: a second LLM call composes the final answer from the tool results
        second_response = client.chat.completions.create(
            model="gpt-3.5-turbo-0125",
            messages=messages,
        )
        return second_response.choices[0].message.content, step_1_tool_calls
    # no tool call was made; return the model's direct answer
    return response_message.content, step_1_tool_calls

# test
def test_weather_of():
    final_response, intermediate_tool_call = weather_of("HQ of OpenAI")
    # check the intermediate tool call is correct
    assert len(intermediate_tool_call) == 1
    assert intermediate_tool_call[0].function.name == "get_current_weather"
    assert intermediate_tool_call[0].function.arguments == '{"location":"San Francisco, CA"}'
    # check the final response is correct
    assert "San Francisco" in final_response and "72" in final_response
CI/CD Integration
In addition, this integrates easily into CI/CD. For example, on a GitHub pull request, we trigger a GitHub Action to run the tests before the pull request can be merged.
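A minimal GitHub Actions workflow for this kind of gate might look like the following; the paths, Python version, and secret name are assumptions to adapt to your repo:

name: llm-evals
on:
  pull_request:
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: pytest tests/ -v
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}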
Tackle More Complex Evaluation Scenarios with Custom Evaluation Models
I hope you find these insights valuable for enhancing your LLM product testing and development. While Pytest is a solid foundation for an initial evaluation setup, it falls short for more complex evaluation methods, particularly those involving human judgment. Go beyond standard metric-, tool-, and model-based evaluations with our custom AutoFeedback and evaluation models, uniquely trained on your data via Log10. For more details or to start using these custom solutions, please don't hesitate to reach out to us at ai@log10.io or sign up for Log10 here. We're eager to hear from you and explore how we can tailor our tools to meet your needs!