Aug 13, 2024

Beating OpenAI's structured outputs on cost, accuracy and speed — An interactive deep-dive


Using BAML, we achieved state-of-the-art results for every model in the Berkeley function-calling benchmark (BFCL) — nearly solving it[1].

Key Findings

  1. BAML is more accurate and cheaper for function calling than any native function calling API. It's easily 2-4x faster than OpenAI's FC-strict API.

  2. BAML's technique is model-agnostic and works with any model without modification (even open-source ones).

  3. gpt-3.5-turbo, gpt-4o-mini, and claude-haiku with BAML work almost as well as gpt-4o with structured outputs (within 2%).

  4. Using FC-strict over naive function calling improves every older OpenAI model, but makes gpt-4o-2024-08-06 worse.

Background

Until now, the only ways to get better results from LLMs were to:

  1. Prompt engineer the heck out of it with longer and more complex prompts
  2. Train a better model

What BAML does differently

  1. Replaces JSON schemas with TypeScript-like definitions; e.g., string[] is easier to understand than {"type": "array", "items": {"type": "string"}}.
  2. Uses a novel parsing technique (Schema-Aligned Parsing, or SAP) in place of JSON.parse. SAP lets the model emit fewer output tokens without introducing JSON parsing errors. For example, this output from PARALLEL-5 can be parsed even though there are no quotes around the keys (a minimal sketch of both ideas follows the example):
[
  {
    streaming_service: "Netflix",
    show_list: ["Friends"],
    sort_by_rating: true
  },
  {
    streaming_service: "Hulu",
    show_list: ["The Office", "Stranger Things"],
    sort_by_rating: true
  }
]
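To make this concrete, here is a minimal Python sketch of both ideas: rendering a JSON schema as a TypeScript-like definition for the prompt, and parsing output leniently so unquoted keys still yield valid data. This is not BAML's actual implementation (SAP also handles coercion, error recovery, and streaming); the helper names are illustrative.

import json
import re

# Hypothetical helper: render a JSON schema as a TypeScript-like definition.
# BAML generates this kind of text from its own class definitions.
def render_ts_like(schema: dict) -> str:
    t = schema.get("type")
    if t == "array":
        return render_ts_like(schema["items"]) + "[]"
    if t == "object":
        fields = ", ".join(
            f"{name}: {render_ts_like(sub)}"
            for name, sub in schema.get("properties", {}).items()
        )
        return "{ " + fields + " }"
    return {"string": "string", "boolean": "bool", "integer": "int", "number": "float"}.get(t, t)

# Toy "schema-aligned" parse: quote bare keys, then hand off to json.loads.
BARE_KEY = re.compile(r'([{,]\s*)([A-Za-z_][A-Za-z0-9_]*)(\s*:)')

def lenient_loads(text: str):
    return json.loads(BARE_KEY.sub(r'\1"\2"\3', text))

schema = {"type": "array", "items": {"type": "object", "properties": {
    "streaming_service": {"type": "string"},
    "show_list": {"type": "array", "items": {"type": "string"}},
    "sort_by_rating": {"type": "boolean"},
}}}

print(render_ts_like(schema))
# { streaming_service: string, show_list: string[], sort_by_rating: bool }[]

llm_output = '[{ streaming_service: "Netflix", show_list: ["Friends"], sort_by_rating: true }]'
print(lenient_loads(llm_output)[0]["streaming_service"])  # Netflix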

We used our prompting DSL (BAML) to achieve this[2], without using JSON mode or any kind of constrained generation. We also compared against OpenAI's structured outputs via the 'tools' API, which we call "FC-strict".
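For reference, "FC-strict" here means OpenAI's tools API with strict set to true on the function definition, roughly as sketched below. The tool name and schema are illustrative, not taken from the benchmark.

from openai import OpenAI

client = OpenAI()

# "FC-strict": a tool definition with strict=True, so generation is constrained
# to the supplied JSON Schema (every property required, no additional properties).
tools = [{
    "type": "function",
    "function": {
        "name": "sort_shows",
        "description": "Sort a streaming service's shows by rating",
        "strict": True,
        "parameters": {
            "type": "object",
            "properties": {
                "streaming_service": {"type": "string"},
                "show_list": {"type": "array", "items": {"type": "string"}},
                "sort_by_rating": {"type": "boolean"},
            },
            "required": ["streaming_service", "show_list", "sort_by_rating"],
            "additionalProperties": False,
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "Sort my Netflix shows by rating"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)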

We are excited to share these results, because relying on prompting rather than other approaches lowers the barrier to getting reliable structured data from any model, open-source or not.

Notable examples

  1. FC-strict leans towards calling fewer tools than the user query expects

  2. FC-strict will sometimes not choose a tool at all and respond in plain text instead

  3. The benchmark has quite a few ambiguous prompts and schemas (not an exhaustive list):

    • Prompt doesn't specify interest rate format (0-100 vs 0.00-1.00) SIMPLE-145
    • Schema doesn't allow LLM to indicate size of each grocery item: PARALLEL-54
    • Assertions compare int vs float when they should allow implicit conversion: PARALLEL-26, PARALLEL-9
  4. The Berkeley Function-Calling prompting technique fails on the newest model because the model's output is no longer parseable (prompting the model with "NO ADDITIONAL TEXT" no longer works), as illustrated below
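This failure mode is one reason a more forgiving parser helps: rather than requiring the reply to be nothing but JSON, you can scan for the first parseable payload inside the text. A toy illustration (not BAML's implementation):

import json

def extract_json(reply: str):
    """Pull the first parseable JSON value out of a chatty reply (toy version)."""
    decoder = json.JSONDecoder()
    for start, ch in enumerate(reply):
        if ch in "[{":
            try:
                value, _ = decoder.raw_decode(reply, start)
                return value
            except json.JSONDecodeError:
                continue
    raise ValueError("no JSON value found in model output")

reply = 'Sure! Here is the call you asked for:\n[{"name": "sort_shows", "args": {"sort_by_rating": true}}]'
print(extract_json(reply))  # [{'name': 'sort_shows', 'args': {'sort_by_rating': True}}]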

Thoughts on the future

Instead of directing effort towards training models for structured data or constraining tokens at generation time, we believe there is untapped value in applying engineering effort to areas like robustly handling the output of models.

Models are really, really good at semantic understanding.

Models are really bad at things that have to be exactly right: perfect JSON, perfect SQL, code that compiles, etc.

Research has shown that models perform worse when they are constrained:

"Comparing results without and with schema constraint, adding schema not only increase the sensitivity to prompt but also degrade in average performance."

- Let Me Speak Freely
Zhi Rui Tam et al., Aug 2024

Further reading

Footnotes

  1. No model/technique can ever reach 100% with the current data, since some prompts are confusing even to humans, or the results are not checked correctly. We did not change/fix the existing assertions (yet) so that our score can be compared more closely against previous runs of this benchmark. See the code on GitHub.
  2. BAML uses a prompting technique here, but will soon support the tool APIs.

BFCL Results




Thanks for reading!