Every Way To Get Structured Output From LLMs
By Sam Lijin - @sxlijin
Update (Nov 26): Added some more details to a few providers
Update (Jun 18): Check out the discussion on Hacker News and /r/LocalLLaMA! Thanks for all the feedback and comments, folks. Keep it coming!
This post will be interesting to you if:
- you're trying to get structured output from an LLM,
- you've tried `response_format: "json"` and function calling and been disappointed by the results,
- you're tired of stacking regex on regex on regex to extract JSON from an LLM,
- you're trying to figure out what your options are.
Everyone using LLMs in production runs into this problem sooner or later: what we really want is a magical black box that returns JSON in exactly the format we want. Unfortunately, LLMs return English, not JSON, and it turns out that converting English to JSON is kinda hard.
Here's every framework we could find that solves this problem, and how they compare.
(Disclaimer: as a player in this space, we're a little biased! We're the creators of BAML.)
Comparison
| Framework | Language Support | Does it handle or prevent malformed JSON? | How do I build the prompt? | Do I have full control over the prompt? | How do I see the final prompt? | Supported Model Providers | API flavors | How do I define output types? | Test Framework? |
|---|---|---|---|---|---|---|---|---|---|
| BAML | Python | ✅ Yes, using a new Rust-based error-tolerant parser (e.g. can parse `{"foo": "bar}`) | Jinja templates | ✅ Yes | VSCode extension | ✅ OpenAI | ✅ sync | BAML schemas, transpiled to Pydantic | ✅ VSCode Extension 🚧 CLI |
| BAML | TypeScript | ✅ Yes (same parser) | Jinja templates | ✅ Yes | VSCode extension | ✅ OpenAI | ✅ sync | BAML schemas, transpiled to TS | ✅ VSCode Extension 🚧 CLI |
| BAML | Ruby | ✅ Yes (same parser) | Jinja templates | ✅ Yes | VSCode extension | ✅ OpenAI | ✅ sync | BAML schemas, transpiled to Sorbet | ✅ VSCode Extension 🚧 CLI |
| BAML | All other languages (via REST server + OpenAPI adapter) | ✅ Yes (same parser) | Jinja templates | ✅ Yes | VSCode extension | ✅ OpenAI | ❌ sync | BAML schemas, hosted on REST server | ✅ VSCode Extension 🚧 CLI |
| Instructor | Python | ⚠️ Supports LLM-based retries (none by default) | Build the messages array | ❌ No (feature request) | No supported mechanism | ✅ OpenAI | ✅ sync | Pydantic | Via the Parea platform |
| instructor-js | TypeScript | | Build the messages array | ❌ No | | ✅ OpenAI | ❌ sync | Zod | |
| TypeChat | Python | ⚠️ Automatic LLM-based retries | pass in a string | ❌ No | n/a | ✅ OpenAI | ❌ sync | Pydantic | None |
| TypeChat | TypeScript | ⚠️ Automatic LLM-based retries | pass in a string | ❌ No | n/a | ✅ OpenAI | ❌ sync | Zod | None |
| TypeChat | C#/.NET | ⚠️ Automatic LLM-based retries | pass in a string | ❌ No | n/a | ✅ OpenAI | ❌ sync | C# class | None |
| Marvin | Python | ⚠️ Supports LLM-based retries (none by default) | Jinja templates | ❌ No | No supported mechanism | OpenAI | ✅ sync | Pydantic | None |
| | Python (Example pending) | ❌ OpenAI | pass in a string | ✅ Yes | n/a | ⚠️ OpenAI¹ | ✅ sync | Pydantic | None |
| | Python (Example pending) | ⚠️ OpenAI | pass in a string | ✅ Yes | n/a | ✅ llama.cpp | ✅ sync | | None |
| | Python (Example pending) | ❌ OpenAI | pass in a string | ✅ Yes | n/a | OpenAI¹ | ✅ sync | | None |
| | Python (Example pending) | ❌ OpenAI | pass in a string | ✅ Yes | n/a | Transformers² | ✅ sync | JSON schema | None |
| | TypeScript (Example pending) | ❌ No | TODO | TODO | TODO | ⚠️ Google AI | TODO | Zod | None |
| | | ❌ OpenAI | TODO | TODO | TODO | TODO | TODO | Regex | None |
| | | ❌ OpenAI | TODO | TODO | TODO | TODO | TODO | JSON schema | None |
**: Honorable mention to Microsoft's AICI, which is working on creating a shim for cooperative constraints implemented in Python/JS using a WASM runtime. We haven't included it in the list because it seems more low-level than the others, and setup is very involved.
1: Applying constraints to OpenAI models can be very error-prone, because the OpenAI API does not expose sufficient information about the underlying model operations for the framework to actually apply constraints effectively. See this discussion about limitations from the LMQL documentation.
2: Transformers refers to "HuggingFace Transformers"
3: Constrained streaming generation produces partial objects, but there is no good way of interacting with the partial objects, since they are not yet parse-able. We only consider a framework to support streaming if it allows interacting with the partial objects (e.g. if streaming back an object with properties `foo` and `bar`, you can access `obj.foo` before `bar` has been streamed to the client).
Criteria
Most of our criteria are pretty self-explanatory, but there are two that we want to call out:
Does it handle/prevent malformed JSON? If so, how?
LLMs make a lot of the same mistakes that humans do when producing JSON (e.g. a } in the wrong place or a missing comma), so it's important that the framework can help you handle these errors.
A lot of frameworks "solve" this by feeding the malformed JSON back into the LLM and asking it to repair the JSON. This kinda works, but it's also slow and expensive. If your LLM calls individually take multiple seconds already, you don't really want to make that even slower!
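The pattern usually looks like the sketch below (a generic illustration, not any particular framework's API; `call_llm` is a placeholder for whatever client you use). Note that every repair attempt is another full round trip to the model:

```python
import json

def get_json_with_retries(prompt: str, call_llm, max_retries: int = 3) -> dict:
    """Generic sketch of the LLM-based repair loop many frameworks use."""
    reply = call_llm(prompt)
    for attempt in range(max_retries + 1):
        try:
            return json.loads(reply)
        except json.JSONDecodeError as err:
            if attempt == max_retries:
                raise
            # Each repair attempt is another slow, token-billed LLM request.
            reply = call_llm(
                f"{prompt}\n\nYour previous reply was not valid JSON ({err}).\n"
                f"Previous reply:\n{reply}\n\nRespond again with only valid JSON."
            )
```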
There are two techniques for handling or preventing this: actually parsing the malformed JSON (BAML takes this approach), or constraining the LLM's token generation to guarantee that valid JSON is produced (this is what Outlines, Guidance, and a few others do).
Parsing the malformed JSON is our preferred approach: it most closely aligns with what the LLM was designed to do (emit tokens), it's fast (parsing takes microseconds), and it's flexible (it works with any LLM). It does have limitations: it can't magically make sense of completely nonsensical JSON, after all.
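As a toy illustration of the idea (this is not BAML's parser, which handles far more failure modes), an error-tolerant parser falls back to cheap textual repairs that a strict parser refuses to make:

```python
import json

def parse_lenient(text: str):
    """Toy error-tolerant parse: try strict parsing, then a couple of cheap repairs."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Repair 1: close an unterminated string, e.g. {"foo": "bar}
    if text.count('"') % 2 == 1:
        try:
            return json.loads(text.rstrip().rstrip("}") + '"}')
        except json.JSONDecodeError:
            pass
    # Repair 2: drop a trailing comma, e.g. {"foo": "bar",}
    return json.loads(text.replace(",}", "}").replace(",]", "]"))

print(parse_lenient('{"foo": "bar}'))    # {'foo': 'bar'}
print(parse_lenient('{"foo": "bar",}'))  # {'foo': 'bar'}
```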
Applying constraints to LLM token generation, by contrast, can be robust, but it has its own issues: doing it efficiently requires masking the model's token probabilities at every decoding step, so it only works with self-hosted models (e.g. Llama, Transformers) and does not work with API-only models like OpenAI's ChatGPT or Anthropic's Claude.
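To make the contrast concrete, here is a toy sketch of constrained decoding (a hand-picked vocabulary and a brute-force prefix check, not how Outlines or Guidance are actually implemented). At each step, only tokens that keep the output a prefix of some schema-conforming string are allowed, which is exactly the per-token intervention a hosted API does not let you perform:

```python
import random

# All outputs the "schema" allows, plus a toy token vocabulary.
VALID_OUTPUTS = [
    '{"sentiment": "negative"}',
    '{"sentiment": "neutral"}',
    '{"sentiment": "positive"}',
]
VOCAB = ['{"sentiment": "', 'neg', 'neu', 'pos', 'ative', 'tral', 'itive', '"}', 'oops']

def allowed_tokens(prefix: str) -> list[str]:
    """Tokens that keep prefix + token a prefix of some valid output."""
    return [t for t in VOCAB if any(v.startswith(prefix + t) for v in VALID_OUTPUTS)]

def constrained_decode() -> str:
    out = ""
    while out not in VALID_OUTPUTS:
        # A real model would rank the allowed tokens by probability; we pick at random.
        out += random.choice(allowed_tokens(out))
    return out

print(constrained_decode())  # always valid, schema-conforming JSON
```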
Can you see the actual prompt? Do you have full control over the prompt?
Prompts are how we "program" LLMs to give us output.
The best way to get an LLM to return structured data is to craft a prompt designed to return data matching your specific schema. To do that, you need to
- see the prompt that's actually being sent to ChatGPT, and
- try different prompts.
Most frameworks, unfortunately, have hardcoded prompt templates baked in, which prevents you from doing either.
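For reference, here is a hand-rolled sketch (not taken from any of the frameworks below, and the schema is made up) of what a schema-aware prompt can look like. Because you build the string yourself, you can print exactly what the model will see and tweak it freely:

```python
import json

# A made-up schema for illustration.
schema = {
    "type": "object",
    "properties": {"sentiment": {"enum": ["negative", "neutral", "positive"]}},
    "required": ["sentiment"],
}

prompt = f"""Classify the sentiment of the INPUT.

Answer with a JSON object matching this schema:
{json.dumps(schema, indent=2)}

INPUT: I love this product!"""

print(prompt)  # the exact text that would be sent to the model
```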
Example code
For each framework listed above, we've included example code from the framework's documentation showing how you would use it.
BAML (Python)
From baml-examples/fastapi-starter/baml_src/extract_resume.baml
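As a rough sketch of how this is used (assuming extract_resume.baml defines an ExtractResume function that returns a Resume class; the names and the input below are illustrative, not copied from the example), the generated Python client is typically called like this:

```python
# Illustrative sketch of calling a BAML-generated Python client; names are assumed.
from baml_client import b  # generated from the .baml files in baml_src

resume_text = """Jane Doe
jane@example.com
Experience: Software Engineer at Initech (2019-2024)"""

resume = b.ExtractResume(resume=resume_text)  # returns a typed Resume object
print(resume)
```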
BAML (TS)
From baml-examples/nextjs-starter/baml_src/classify_message.baml
BAML (Ruby)
From baml-ruby-starter/examples.rb:
require_relative "baml_client/client"
b = Baml::BamlClient.from_directory("baml_src")
input = "Can't access my account using my usual login credentials"
classified = b.ClassifyMessage(input: input)
puts classified.categories
From baml-ruby-starter/baml_src/classify_message.baml:
enum Category {
Refund
CancelOrder
TechnicalSupport
AccountIssue
Question
}
class MessageFeatures {
categories Category[]
}
function ClassifyMessage(input: string) -> MessageFeatures {
client GPT4Turbo
prompt #"
{# _.role("system") starts a system message #}
{{ _.role("system") }}
Classify the following INPUT.
{{ ctx.output_format }}
{# This starts a user message #}
{{ _.role("user") }}
INPUT: {{ input }}
Response:
"#
}
Instructor (Python)
From simple_prediction.py:
import enum

import instructor
from openai import OpenAI
from pydantic import BaseModel

# Wrap the OpenAI client with instructor so chat.completions.create accepts response_model.
client = instructor.from_openai(OpenAI())

class Labels(str, enum.Enum):
SPAM = "spam"
NOT_SPAM = "not_spam"
class SinglePrediction(BaseModel):
"""
Correct class label for the given text
"""
class_label: Labels
def classify(data: str) -> SinglePrediction:
return client.chat.completions.create(
model="gpt-3.5-turbo-0613",
response_model=SinglePrediction,
messages=[
{
"role": "user",
"content": f"Classify the following text: {data}",
},
],
) # type: ignore
prediction = classify("Hello there I'm a nigerian prince and I want to give you money")
assert prediction.class_label == Labels.SPAM
instructor-js
From simple_prediction/index.ts:
import assert from "assert"
import Instructor from "@instructor-ai/instructor"
import OpenAI from "openai"
import { z } from "zod"

// Wrap the OpenAI client with Instructor so chat.completions.create accepts response_model.
const oai = new OpenAI()
const client = Instructor({ client: oai, mode: "FUNCTIONS" })
enum CLASSIFICATION_LABELS {
"SPAM" = "SPAM",
"NOT_SPAM" = "NOT_SPAM"
}
const SimpleClassificationSchema = z.object({
class_label: z.nativeEnum(CLASSIFICATION_LABELS)
})
const createClassification = async (data: string) => {
const classification = await client.chat.completions.create({
messages: [{ role: "user", content: `"Classify the following text: ${data}` }],
model: "gpt-3.5-turbo",
response_model: { schema: SimpleClassificationSchema, name: "SimpleClassification" },
max_retries: 3,
seed: 1
})
return classification
}
const classification = await createClassification(
"Hello there I'm a nigerian prince and I want to give you money"
)
// OUTPUT: { class_label: 'SPAM' }
console.log({ classification })
assert(
classification?.class_label === CLASSIFICATION_LABELS.SPAM,
`Expected ${classification?.class_label} to be ${CLASSIFICATION_LABELS.SPAM}`
)
TypeChat (Python)
From examples/sentiment/demo.py:
import asyncio
import sys
from dotenv import dotenv_values
import schema as sentiment
from typechat import Failure, TypeChatJsonTranslator, TypeChatValidator, create_language_model, process_requests
async def main():
env_vals = dotenv_values()
model = create_language_model(env_vals)
validator = TypeChatValidator(sentiment.Sentiment)
translator = TypeChatJsonTranslator(model, validator, sentiment.Sentiment)
async def request_handler(message: str):
result = await translator.translate(message)
if isinstance(result, Failure):
print(result.message)
else:
result = result.value
print(f"The sentiment is {result.sentiment}")
file_path = sys.argv[1] if len(sys.argv) == 2 else None
await process_requests("😀> ", file_path, request_handler)
if __name__ == "__main__":
asyncio.run(main())
From examples/sentiment/schema.py:
from dataclasses import dataclass
from typing_extensions import Literal, Annotated, Doc
@dataclass
class Sentiment:
"""
The following is a schema definition for determining the sentiment of some user input.
"""
sentiment: Annotated[Literal["negative", "neutral", "positive"],
Doc("The sentiment for the text")]
TypeChat (TypeScript)
From examples/sentiment/src/main.ts:
import assert from "assert";
import dotenv from "dotenv";
import findConfig from "find-config";
import fs from "fs";
import path from "path";
import { createJsonTranslator, createLanguageModel } from "typechat";
import { processRequests } from "typechat/interactive";
import { createTypeScriptJsonValidator } from "typechat/ts";
import { SentimentResponse } from "./sentimentSchema";
const dotEnvPath = findConfig(".env");
assert(dotEnvPath, ".env file not found!");
dotenv.config({ path: dotEnvPath });
const model = createLanguageModel(process.env);
const schema = fs.readFileSync(path.join(__dirname, "sentimentSchema.ts"), "utf8");
const validator = createTypeScriptJsonValidator<SentimentResponse>(schema, "SentimentResponse");
const translator = createJsonTranslator(model, validator);
// Process requests interactively or from the input file specified on the command line
processRequests("😀> ", process.argv[2], async (request) => {
const response = await translator.translate(request);
if (!response.success) {
console.log(response.message);
return;
}
console.log(`The sentiment is ${response.data.sentiment}`);
});
From examples/sentiment/src/sentimentSchema.ts:
export interface SentimentResponse {
sentiment: "negative" | "neutral" | "positive"; // The sentiment of the text
}
TypeChat (C#/.NET)
From examples/Sentiment/Program.cs:
using Microsoft.TypeChat;
namespace Sentiment;
public class SentimentApp : ConsoleApp
{
JsonTranslator<SentimentResponse> _translator;
public SentimentApp()
{
OpenAIConfig config = Config.LoadOpenAI();
// Although this sample uses config files, you can also load config from environment variables
// OpenAIConfig config = OpenAIConfig.LoadFromJsonFile("your path");
// OpenAIConfig config = OpenAIConfig.FromEnvironment();
_translator = new JsonTranslator<SentimentResponse>(new LanguageModel(config));
}
public override async Task ProcessInputAsync(string input, CancellationToken cancelToken)
{
SentimentResponse response = await _translator.TranslateAsync(input, cancelToken);
Console.WriteLine($"The sentiment is {response.Sentiment}");
}
}
From examples/Sentiment/SentimentSchema.cs:
using System.Text.Json.Serialization;
using Microsoft.TypeChat.Schema;
namespace Sentiment;
public class SentimentResponse
{
[JsonPropertyName("sentiment")]
[JsonVocab("negative | neutral | positive")]
public string Sentiment { get; set; }
}
Marvin
From the Marvin docs:
import marvin
from pydantic import BaseModel
class Recipe(BaseModel):
name: str
cook_time_minutes: int
ingredients: list[str]
steps: list[str]
@marvin.fn
def recipe(
ingredients: list[str],
max_cook_time: int = 15,
cuisine: str = "North Italy",
experience_level: str = "beginner"
) -> Recipe:
"""
Returns a complete recipe that uses all the `ingredients` and
takes less than `max_cook_time` minutes to prepare. Takes
`cuisine` style and the chef's `experience_level` into account
as well.
"""
Last thoughts
This is a living document, and we'll be updating it as we learn more about other frameworks.
If you have any questions, comments, or suggestions, feel free to reach out to us on Discord or Twitter at @boundaryml. We're also happy to meet and help with any prompting or AI engineering questions you might have.