Nov 27, 2024
Announcing LLM Eval Support for Python, Ruby, Typescript, Go, and more.
By Greg Hale
We are excited to announce that BAML now supports evaluating LLM prompts using tests and assertions.
Here is how it works:
- Define an LLM prompt in BAML + the expected output type.
- Define a `test` with the arguments you want to pass to the LLM prompt.
- Add `@@assert`s and `@@check`s to fail the test if the LLM output doesn't match the expected output.
- Run the test in the LLM Playground!
- Run your tested LLM function in Python, Ruby, Typescript, Go, and more!
Let's look at an interactive example!
If you press Run, you'll see this test fail.
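As a rough sketch of what the pieces look like, here is a minimal BAML function with a test attached (the class, prompt, assertions, and client name `GPT4` below are illustrative, not the exact example from the playground):

```baml
class Sentiment {
  label string
  confidence float
}

function ClassifySentiment(text: string) -> Sentiment {
  // Assumes a client named GPT4 is defined elsewhere in your BAML project.
  client GPT4
  prompt #"
    Classify the sentiment of the following text as positive, negative, or neutral.

    {{ ctx.output_format }}

    Text: {{ text }}
  "#
}

test HappyMessage {
  functions [ClassifySentiment]
  args {
    text "I love this product, it works great!"
  }
  // Fails the test if the label is wrong.
  @@assert({{ this.label == "positive" }})
  // Named check, recorded under _.checks and usable in later expressions.
  @@check(confident, {{ this.confidence > 0.8 }})
}
```

Running the test calls `ClassifySentiment` with the given args, parses the output into a `Sentiment`, and then evaluates each `@@assert` and `@@check` against the result.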
Test expressions
The expression inside the `@@assert` is a jinja2 expression.
- The `_` variable contains the fields `result`, `checks`, and `latency_ms`.
- The `this` variable refers to the value computed by the test, and is shorthand for `_.result`.
- In a given check or assert, `_.checks.$NAME` can refer to the result of any earlier check named NAME that was run in the same test block. By referring to prior checks, you can build compound checks and asserts, for example asserting that all checks of a certain type passed (see the sketch below).
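For instance, a test can record several named checks and then assert on their combination. A sketch of that pattern (the function `ExtractResume`, its args, and the check names are made up for illustration):

```baml
test ResumeLooksComplete {
  functions [ExtractResume]
  args {
    resume_text #"
      Jane Doe
      jane@example.com
      Experience: ...
    "#
  }
  // Named checks, each recorded under _.checks.
  @@check(has_name, {{ this.name|length > 0 }})
  @@check(has_email, {{ "@" in this.email }})
  // Compound assert built from the earlier checks: the test fails
  // unless both checks passed.
  @@assert({{ _.checks.has_name and _.checks.has_email }})
}
```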
Unlike other LLM testing frameworks, BAML evals:
- Work for prompts with structured outputs -- with compile-time error checking of your test expressions.
- Support any language -- since you don't need Python, Ruby, Typescript, Go, etc. to run your tests (only BAML).
- Run locally -- no logins required.
- Work with the BAML VSCode Playground.
- Integrate with Boundary Studio, our observability dashboard.
Read more about evaluating LLM functions and let us know what you think!
We are looking at supporting LLM-as-judge evaluations in the future, and at providing more helper functions to make it easier to evaluate free-form text. Stay tuned for more updates!