Nov 27, 2024

Announcing LLM Eval Support for Python, Ruby, TypeScript, Go, and more.

By Greg Hale

We are excited to announce that BAML now supports evaluating LLM prompts using tests and assertions.

Here is how it works:

  1. Define an LLM prompt in BAML + the expected output type.
  2. Define a test with the arguments you want to pass to the LLM prompt.
  3. Add @@asserts and @@checks to fail the test if the LLM output doesn't match the expected output.
  4. Run the test in the LLM Playground!
  5. Run your tested LLM function in Python, Ruby, TypeScript, Go, and more!

Let's look at an interactive example!

[Interactive playground preview]

If you press Run, you'll see this test fail.
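
Here is a rough sketch of the kind of BAML you would write for steps 1 through 3. The class, client string, field names, and thresholds below are illustrative, not the exact contents of the interactive example:

```baml
class Recipe {
  title string
  ingredients string[]
  steps string[]
}

function ExtractRecipe(recipe_text: string) -> Recipe {
  // Shorthand client reference; a named client block works too.
  client "openai/gpt-4o-mini"
  prompt #"
    Extract a structured recipe from the text below.

    {{ ctx.output_format }}

    {{ recipe_text }}
  "#
}

test ExtractsBreadRecipe {
  functions [ExtractRecipe]
  args {
    recipe_text #"
      Mix 500g flour, 350g water, and 10g salt. Knead, rest for an hour,
      then bake at 450F for 40 minutes.
    "#
  }
  // A named assert fails the test when its expression is false.
  @@assert(has_ingredients, {{ this.ingredients|length >= 3 }})
  // A named check is recorded (and usable in later asserts) but does not
  // fail the test on its own.
  @@check(reasonably_fast, {{ _.latency_ms < 5000 }})
}
```

Running the test calls the LLM, parses the response into the declared output type, and then evaluates each @@check and @@assert against the parsed result.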

Test expressions

The expression inside an @@assert or @@check is a Jinja2 expression.

  • The _ variable exposes the fields result, checks, and latency_ms.
  • The this variable refers to the output of the function under test, and is shorthand for _.result.
  • Inside a given check or assert, _.checks.$NAME refers to the result of an earlier check named $NAME in the same test block. By referring to prior checks, you can build compound checks and asserts, for example asserting that all checks of a certain kind passed (see the sketch below).
[Interactive playground preview]
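
To make that concrete, here is a sketch of a test that combines earlier checks in a final assert. The function, check names, and thresholds are illustrative:

```baml
function SummarizeArticle(article: string) -> string {
  client "openai/gpt-4o-mini"
  prompt #"
    Summarize the following article in one short paragraph.

    {{ article }}
  "#
}

test SummaryLooksGood {
  functions [SummarizeArticle]
  args {
    article "BAML tests can declare named checks and combine them in asserts."
  }
  // Each named check is recorded under _.checks but does not fail the test by itself.
  @@check(not_empty, {{ this|length > 0 }})
  @@check(not_too_long, {{ this|length < 500 }})
  // Compound assert: refer back to earlier checks by name, so the test
  // fails unless both quality checks passed.
  @@assert(quality_checks_pass, {{ _.checks.not_empty and _.checks.not_too_long }})
}
```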

Unlike other LLM testing frameworks, BAML evals:

  1. Work for prompts with structured outputs -- with compile-time error checking of your test expressions.
  2. Support any language -- since you don't need Python, Ruby, TypeScript, Go, etc. to run your tests (only BAML).
  3. Run locally -- no logins required.
  4. Work with the BAML VSCode Playground.
  5. Integrate with Boundary Studio, our observability dashboard.

Read more about evaluating LLM functions and let us know what you think!

We are looking at supporting LLM-as-judge evaluations and adding more helper functions that make it easier to evaluate free-form text. Stay tuned for more updates!


Thanks for reading!