This is the first in a series of posts documenting my journey taking LLM applications from prototype to production. Some of the future topics I plan to post about include the evolution of agents as an LLM abstraction, the end-to-end stack for getting jobs done with LLMs, and driving efficiency and scale for LLM-powered apps. Do comment or reach out if there are topics you’d like to hear more about. And do subscribe to get notified of new posts as they come out!
Quick level set: while LLMs such as GPT-3.5 and GPT-4 have amazing capabilities, they suffer from issues such as a knowledge cutoff back in 2021 and difficulty accurately answering math problems. A framework called Langchain has done a great job tackling these and other shortcomings by making it really easy for developers to set up chains of LLM calls that can call out to other tools, such as search engines and calculators, to solve the overall task in collaboration with the core LLM.
Despite this, several limitations remain when getting LLMs into production. This post is about some of the surprises I ran into along the way while chaining LLMs using Langchain.
😮 The first surprise was using the SQLAgentToolkit (blog, webinar), which provides NLP2SQL (natural-language-to-SQL) capabilities.
The SQLAgentToolkit consists of 4 tools:
list_tables_sql_db: tool to return tables in the database
schema_sql_db: tool to return the schema and sample rows
query_sql_db: tool to query a SQL DB and return results
query_checker_sql_db: tool to check the correctness of a SQL query
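For context, here is roughly how the toolkit gets wired up. This is a minimal sketch: the connection string is a placeholder, and the class and module names reflect older LangChain releases, so they may have moved in newer versions.

```python
# Minimal SQL agent setup (sketch; import paths reflect older LangChain releases).
from langchain.llms import OpenAI
from langchain.sql_database import SQLDatabase
from langchain.agents import create_sql_agent
from langchain.agents.agent_toolkits import SQLDatabaseToolkit

db = SQLDatabase.from_uri("sqlite:///users.db")  # placeholder dummy database
llm = OpenAI(temperature=0)  # text-davinci-003 by default in older releases

toolkit = SQLDatabaseToolkit(db=db, llm=llm)  # very early releases only needed db
agent = create_sql_agent(llm=llm, toolkit=toolkit, verbose=True)

agent.run("Who is the least recent user?")
```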
I tried these on a relatively simple dummy database example. The agent did a great job correctly answering a query such as:
“Who is the least recent user?”
There were 4 calls generated for a total of 3,381 tokens. At davinci-003’s rate of $0.02 per 1K tokens, that works out to about 6.76 cents per query in OpenAI API costs. For a typical application with thousands to millions of queries a day, and with more complex tables and schemas, these costs could quickly add up!
Lesson 1: Keep an eye on how many tokens all those intermediate LLM calls are adding up to and the associated costs when using LLM chains.
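One practical way to do this is with LangChain’s OpenAI callback, which tallies tokens and estimated cost per run. A hedged sketch (the import path has moved between releases):

```python
# Sketch: track token usage and estimated cost for a single agent run.
from langchain.callbacks import get_openai_callback

with get_openai_callback() as cb:
    agent.run("Who is the least recent user?")

print(f"Total tokens: {cb.total_tokens}")       # ~3381 in my run
print(f"Estimated cost: ${cb.total_cost:.4f}")  # ~$0.0676 with davinci-003
```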
😮 The second example is using the Zapier Toolkit which uses the recently launched Zapier Natural Language Actions API - the same API that was used in Greg Brockman’s TED talk via ChatGPT & OpenAI Plugins.
As a test, I connected Zapier to my Gmail and to a CRM system so that I could automatically populate new accounts based on new information in my emails.
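Roughly, the setup looked like this (a hedged sketch: it assumes a ZAPIER_NLA_API_KEY is set, and the import paths reflect older LangChain releases):

```python
# Sketch: Zapier NLA toolkit wired into a zero-shot agent.
from langchain.llms import OpenAI
from langchain.agents import AgentType, initialize_agent
from langchain.agents.agent_toolkits import ZapierToolkit
from langchain.utilities.zapier import ZapierNLAWrapper

llm = OpenAI(temperature=0)
zapier = ZapierNLAWrapper()  # reads ZAPIER_NLA_API_KEY from the environment
toolkit = ZapierToolkit.from_zapier_nla_wrapper(zapier)

agent = initialize_agent(
    toolkit.get_tools(), llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True,
)
agent.run("Summarize my most recent email")  # example query
```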
It worked great on queries such as:
“Summarize the last email I received from <company_name>. Send the summary to Google Docs”, or “Find the oldest user and create an account using their name and email address in my CRM”.
But when I pushed it to:
“Find the youngest user and create a task for me to wish them Happy Birthday in my CRM”
…I ran into some hiccups. The chain appeared to run to completion, but on further inspection I realized it had actually failed on the create-task step. After some more digging, it seemed this was because some of the required parameters in the CRM’s API weren’t semantically matched to fields in the NLP query. I tried prompting the Toolkit to do the matching explicitly, but still couldn’t quite get it to work!
Lesson 2: Use semantically meaningful names for API parameters to improve the reliability of LLM-chain-powered apps.
😮 The third and final example here was the most surprising to me - though perhaps it shouldn’t have been, given what we know to be one of the biggest challenges with LLMs.
I set up a custom agent with two tools, a scraper and a summarizer, to create custom summaries of websites to help populate a CRM.
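A sketch of that setup is below. The scrape and summarize helpers are hypothetical placeholders, not my actual implementations:

```python
# Sketch: custom agent with two tools (placeholder implementations).
from langchain.llms import OpenAI
from langchain.agents import AgentType, Tool, initialize_agent

def scrape(url: str) -> str:
    """Placeholder: fetch a page and return its raw text."""
    ...

def summarize(text: str) -> str:
    """Placeholder: return a short summary of the given text."""
    ...

tools = [
    Tool(name="Scraper", func=scrape,
         description="Fetches the raw text content of a URL."),
    Tool(name="Summarizer", func=summarize,
         description="Summarizes a block of text."),
]

llm = OpenAI(temperature=0)
agent = initialize_agent(
    tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True
)
```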
But when I called it using a query such as:
“Scrape espn.com and summarize it”
…the agent hallucinated the call to the tool and generated a response from the LLM without actually using the tool! Different variations of the prompt would hallucinate calling the scraper tool or the summarizer tool. Other folks have reported similar issues here.
More declarative approaches such as DSP have been proposed to define chains programmatically and ensure they run as intended, versus what I think of as Langchain’s more imperative approach (flashback to the early days of deep learning frameworks and TensorFlow vs PyTorch, anyone?)! However, requiring that chains be defined programmatically for reliability takes away from the wonderfully democratizing nature of interacting with AI via NLP-powered chat UXs. Ideally, we might infer programs implicitly from natural language rather than require non-technical users to write programs. Projects like Parsel seem to be a step in this direction.
Lesson 3: Ensure that LLM chains are actually calling out to the tools as intended and not hallucinating the output of the tool.
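One way to verify this (a hedged sketch: the return_intermediate_steps flag and the shape of the returned steps may vary across LangChain versions) is to inspect which tools the agent actually invoked:

```python
# Sketch: surface the agent's intermediate steps to confirm tool calls happened.
agent = initialize_agent(
    tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    return_intermediate_steps=True,
)

response = agent({"input": "Scrape espn.com and summarize it"})
for action, observation in response["intermediate_steps"]:
    print(action.tool, action.tool_input)  # which tool ran, and with what input

# An empty intermediate_steps list is a red flag: the "answer" came straight
# from the LLM without ever touching the scraper or summarizer.
```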
Over the course of realizing these issues, we started building an open source library to help debug LLM chains called log10, and an accompanying website, log10.io, to help developers visualize chains, collect feedback, and iterate faster. With the right tooling, the future is bright for LLM-powered applications!