I’ve had a lot of fun playing with LLMs recently. I think they get a lot of coverage in applications where they play the role of an agent, but I’ve had pretty mixed results building these kinds of tools. In particular, I’ve had trouble getting the LLM to reliably make multiple function calls in serial. While one call may be exactly what I’m looking for, there’s a decent chance at least one of the calls will be slightly off, breaking the whole flow.

Where I’ve had much more reliable results is in using LLMs for extracting structured results from unstructured input and in using them for text manipulation. These both rely less on the LLM being able to “think” and more on it being able to follow direct instructions.

Recipes

In 2021, I made a recipes page, and I seeded it with some recipes. Sadly, I never ended up adding more. While the main reason for this is my laziness, it’s also just hard to write up good recipes! Formatting the ingredients, writing easy-to-follow instructions, and creating the new page are all a real pain.

This is a bummer because I can easily explain out loud how to make any of my recipes, but I find writing those instructions down very difficult. Recently, I’ve turned to voice recognition and LLMs for help with this.

At a high level, I want to enable the following workflow:

  • Dictate a recipe using my voice
  • Have an LLM generate a nicely-formatted recipe
  • Put that recipe on my website
  • Dictate any modifications I want to make and have them reflected in the recipe.

Voice Transcription

The first step of all of this is to go from voice -> text. I don’t have much familiarity with recording in Python, so I mostly asked ChatGPT to write the code and told it what was broken. The code for this isn’t all that interesting, but it provides a nice interface that I can use to record audio:

# This will record audio to the file listen.RECORDING_FILE
listen.record_audio()
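
For the curious, the recording helper is roughly this shape. This is a minimal sketch using sounddevice and scipy rather than my exact code, and the filename, sample rate, and fixed duration are all placeholder assumptions:

# listen.py -- a sketch, not the exact code ChatGPT wrote for me
import sounddevice as sd
from scipy.io.wavfile import write

RECORDING_FILE = "recording.wav"
SAMPLE_RATE = 16000  # whisper expects 16 kHz audio

def record_audio(seconds: int = 30) -> None:
    """Record from the default microphone and save it as a 16-bit wav file."""
    audio = sd.rec(int(seconds * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1, dtype="int16")
    sd.wait()  # block until the recording finishes
    write(RECORDING_FILE, SAMPLE_RATE, audio)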

To go from this recording file to text, I used a Python port of the incredible whisper.cpp project. I am consistently stunned by how well whisper.cpp works and how quickly it runs. It outperforms the built-in macOS dictation by such a wide margin that you’ll never want to use Siri again.

With whisper set up, I now have a super simple function I can use to get voice input:

def get_voice_input(whisp: Whisper) -> str:
    listen.record_audio()
    return whisp.transcribe_from_file(listen.RECORDING_FILE)

Here’s where we are so far:

  • Dictate a recipe using my voice ✓
  • Have an LLM generate a nicely-formatted recipe
  • Put that recipe on my website
  • Dictate any modifications I want to make and have them reflected in the recipe.

Function calls

I alluded to function calls before, but it’s worth taking some time to talk about what they really are. In June, OpenAI announced an extension to their API to allow GPT chat completions to call functions instead of just responding with text. You simply provide a JSON schema describing the available functions, and the LLM will make a decision on whether to call a function and what arguments to call it with.

This has primarily been billed as a tool to build agents. For example, if you want to build an agent that can tell you the weather, you can give it a “get_current_weather” function that’ll make a call out to some weather API. Then, when you ask it “what’s the weather”, it’ll make the function call and respond to your question in plain English using the results of the call.
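
To make that concrete, here’s roughly what a raw function-calling request looks like with the pre-1.0 openai Python package (the same style as the completion calls later in this post). The get_current_weather schema is just the illustrative example from OpenAI’s announcement:

import openai

functions = [{
    "name": "get_current_weather",
    "description": "Get the current weather in a given location",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string", "description": "City name, e.g. San Francisco"},
        },
        "required": ["location"],
    },
}]

completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[{"role": "user", "content": "What's the weather in Boston?"}],
    functions=functions,
)

# If the model decided to call a function, the call shows up on the message
# instead of a normal text reply.
message = completion["choices"][0]["message"]
if message.get("function_call"):
    print(message["function_call"]["name"])       # get_current_weather
    print(message["function_call"]["arguments"])  # '{"location": "Boston"}'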

The powerful thing here is that function calling is extremely reliable about only making calls that conform to the JSON schema. I’ve hit very few issues, and the ones I did hit were fixed by prompting the LLM more clearly.

Autogenerated schemas

JSON schemas are pretty arduous to type out, in particular because you end up defining everything twice: once in your code and once in the JSON schema you provide to the LLM. To alleviate this pain, there’s a library that generates OpenAI-compatible JSON schemas straight from Pydantic objects. This allows you to define type-safe schemas that LLMs can understand “for free”.

From the docs, here’s an example of using the library to extract user information from input:

import openai
from pydantic import Field

# OpenAISchema comes from the schema-generation library described above
class UserDetails(OpenAISchema):
    """User Details"""
    name: str = Field(..., description="User's name")
    age: int = Field(..., description="User's age")

completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    functions=[UserDetails.openai_schema],
    messages=[
        {"role": "system", "content": "I'm going to ask for user details. Use UserDetails to parse this data."},
        {"role": "user", "content": "My name is John Doe and I'm 30 years old."},
    ],
)

user_details = UserDetails.from_response(completion)
print(user_details)  # UserDetails(name="John Doe", age=30)

You might be able to see where I’m going with this :)

Recipe extraction

In order to render a recipe onto a page, I need to know its components. I thought the following were pretty typical attributes of a recipe:

  • Description
  • Steps
  • Ingredients
  • Total Time
  • Active Time

Using the function call library, I wrote up a schema for recipes:

class RecipeDetails(OpenAISchema):
    """RecipeDetails are the details describing a recipe. Details are precise and
reflect the input from the user."""
    description: str = Field(..., description="casual description of every recipe step in approximately 200 characters")
    steps: List[str] = Field(..., description="""specific list of all steps in the recipe, in order.
If multiple steps can be combined into one, they will.""")
    recipe_title: str = Field(..., description="title of the recipe in 20 characters or less")
    ingredients: List[Ingredient] = Field(..., description="list of all ingredients in the recipe")
    total_time: int = Field(..., description="total time in minutes required for this recipe")
    active_time: int = Field(..., description="total active (non-waiting) cooking time in minutes required for this recipe")
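
The Ingredient model referenced above is just a small Pydantic class; I won’t reproduce it exactly, but based on the fields my rendering template uses later (name, quantity, unit), it’s along these lines, with the exact field types being approximate:

from pydantic import BaseModel, Field

class Ingredient(BaseModel):
    """A single ingredient line in a recipe."""
    name: str = Field(..., description="name of the ingredient")
    quantity: float = Field(..., description="amount of the ingredient")
    unit: str = Field(..., description="unit of measurement, e.g. cups or grams")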

I then wrote a generic function to extract any schema from unstructured text and called it on the transcript:

transcription = get_voice_input(w)
recipe = llm.get_shema_obj("RecipeDetails", elems.RecipeDetails, transcription)

Now, recipe is a structured object that I can pull all of my recipe information out of.
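
That llm.get_shema_obj helper isn’t doing anything fancy; it’s the UserDetails pattern from earlier, generalized over any schema class. A rough sketch (the prompt wording here is approximate, not my exact prompt):

import openai

def get_shema_obj(name, schema, text):
    """Extract an instance of `schema` from unstructured text."""
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        functions=[schema.openai_schema],
        messages=[
            {"role": "system", "content": f"I'm going to describe something. Use {name} to record the details."},
            {"role": "user", "content": text},
        ],
    )
    return schema.from_response(completion)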

  • Dictate a recipe using my voice ✓
  • Have an LLM generate a nicely-formatted recipe ✓
  • Put that recipe on my website
  • Dictate any modifications I want to make and have them reflected in the recipe.

Rendering the recipe

I already use Hugo to host my blog, so I won’t fix what ain’t broke. First, when I create a new recipe, I make a small markdown file inside of my content/recipes directory with some metadata and a reference to the JSON-formatted recipe info:

+++
title = "The Good Lemonade"
description = "Refreshing and sweet lemonade like they served at the country fair"
date = "2023-08-06"
total_time = "1440"
active_time = "20"
+++

{{< ai_recipe url="data/recipes/country_fair_lemonade.json" >}}

This ai_recipe shortcode is a pretty simple template that displays all the elements of the JSON recipe. The JSON is just the RecipeDetails class serialized.
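
Writing that JSON out is just Pydantic serialization. Something along these lines, with write_recipe_json being an illustrative name rather than my actual helper:

import os

def write_recipe_json(recipe: elems.RecipeDetails, recipe_name: str) -> None:
    """Serialize the recipe to where the ai_recipe shortcode expects it."""
    json_path = os.path.join(presentation.hugo_base_dir, "data", "recipes", f"{recipe_name}.json")
    with open(json_path, "w") as f:
        f.write(recipe.json(indent=2))  # pydantic v1-style serialization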

And that’s it for the first pass! I can now speak into my mic and get a nicely-formatted recipe on my website!

  • Dictate a recipe using my voice ✓
  • Have an LLM generate a nicely-formatted recipe ✓
  • Put that recipe on my website ✓
  • Dictate any modifications I want to make and have them reflected in the recipe.

Modifications

These recipes are never perfect in the first iteration. The ingredients are typically overspecified, and the steps definitely suffer from miscommunications.

To solve this, I built a new mode into my REPL for performing modifications. I just ask the user for a link to the recipe they’d like to modify, and I pull the unique identifier out of the URL. I then read in the JSON for that recipe so I can kick off modification:

import json
import os
import re

from pydantic import parse_obj_as

url = nice_input("give the url for the recipe: ")

# Pull the recipe's unique identifier out of the URL
pattern = r'https?://[^/]+/recipes/ai_([^/]+)/?'
match = re.search(pattern, url)

if match:
    recipe_name = match.group(1)

    json_path = os.path.join(presentation.hugo_base_dir, "data", "recipes", f"{recipe_name}.json")
    with open(json_path, 'r') as f:
        file_contents = f.read()
    recipe_dict = json.loads(file_contents)
    recipe = parse_obj_as(elems.RecipeDetails, recipe_dict)

Modification REPL

In order to make modifications relative to the current state of the recipe, I need to show the LLM that current state. I don’t want to pass in a full HTML page, so I made a text analog of my webpage using Jinja:

Recipe Title: {{ recipe.recipe_title }}

Total time: {{ recipe.total_time }} minutes
Active time: {{ recipe.active_time }} minutes

Recipe Description: {{ recipe.description }}

Recipe Ingredients:
{% for item in recipe.ingredients -%}
- {{ item.quantity }} {{ item.unit }} {{ item.name }}
{% endfor %}
Recipe steps:
{% for step in recipe.steps -%}
- {{ step }}
{% endfor %}
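
Rendering that template into text is a couple of lines with jinja2 (the template directory and filename here are assumptions):

from jinja2 import Environment, FileSystemLoader

def render_text_recipe(recipe: elems.RecipeDetails) -> str:
    """Render the recipe into the plain-text form the LLM sees."""
    env = Environment(loader=FileSystemLoader("templates"))
    return env.get_template("recipe.txt.j2").render(recipe=recipe)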

I then ask the user for voice input and instruct the LLM to make modifications to the recipe according to the user’s instructions:

modifications = get_voice_input(w)
new_recipe = llm.make_modifications(modifications, recipe)
# use the same parsing as before to turn unstructured recipe into pydantic class
recipe = parse_recipe(new_recipe)
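
make_modifications is another thin wrapper around a chat completion: it shows the model the rendered text recipe along with the transcribed instructions and asks for the full updated recipe back. Roughly (again, the prompt wording is approximate):

import openai

def make_modifications(modifications: str, recipe: elems.RecipeDetails) -> str:
    """Ask the LLM to rewrite the recipe according to the user's instructions."""
    current = render_text_recipe(recipe)  # the jinja-rendered text form from above
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        messages=[
            {"role": "system", "content": "You edit recipes. Apply the requested changes and return the full updated recipe as text."},
            {"role": "user", "content": f"Current recipe:\n{current}\n\nRequested changes:\n{modifications}"},
        ],
    )
    return completion["choices"][0]["message"]["content"]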

I can then write out the new recipe, and the user can look at their live Hugo view to see if it looks good.

There are some cases where the LLM makes modifications too aggressively, or it’s just easier for me to make the changes manually. For these cases, I open up a vim buffer with the recipe and allow the user to make whatever changes they wish. I then use the LLM to parse the required info back out, same as before.
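
The vim hand-off is the standard tempfile-plus-editor trick; a minimal sketch:

import os
import subprocess
import tempfile

def edit_in_vim(text: str) -> str:
    """Drop the recipe text into a vim buffer and return whatever the user saves."""
    with tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False) as f:
        f.write(text)
        path = f.name
    subprocess.run(["vim", path])
    with open(path) as f:
        edited = f.read()
    os.unlink(path)
    return edited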

  • Dictate a recipe using my voice ✓
  • Have an LLM generate a nicely-formatted recipe ✓
  • Put that recipe on my website ✓
  • Dictate any modifications I want to make and have them reflected in the recipe. ✓

Conclusion

This tool was a ton of fun to build and taught me to think more precisely about how to use LLMs. I really believe that the current assistant/agent-focused discourse is pretty misguided. While assistant applications may come into the mainstream, I think that in the near term, LLMs will be useful as tools to get jobs done rather than agents to replace humans.

Local LLM epilogue

I’ve been playing around with local LLMs a lot as well. They are not even close to GPT quality, but they’re getting better every day. Hopefully Llama 2’s commercial license will open the floodgates.

With grammar-based sampling merged into llama.cpp, I think we’re a JSON-trained model away from being able to implement function calling locally. Some have already taken a stab at it, and I have high hopes for restricted sampling being the key to eking out better performance from local models.

At some point in the future, I’d like to rip out the OpenAI calls here and replace them with a local model. That would be sweet.