Nov 27, 2024

Announcing LLM Eval Support for Python, Ruby, TypeScript, Go, and more.

By Greg Hale

We are excited to announce that BAML now supports evaluating LLM prompts using tests and assertions.

Here is how it works:

  1. Define an LLM prompt in BAML + the expected output type.
  2. Define a test with the arguments you want to pass to the LLM prompt.
  3. Add @@asserts and @@checks to fail the test if the LLM output doesn't match the expected output.
  4. Run the test in the LLM Playground!
  5. Run your tested LLM function in Python, Ruby, TypeScript, Go, and more!

Let's look at an interactive example!

[Interactive playground preview]

If you press Run, you'll see this test fail.
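
Here is a rough sketch of the kind of BAML you would write for steps 1 through 3. The class, client string, field names, and thresholds below are illustrative, not the exact contents of the interactive example:

```baml
class Recipe {
  title string
  ingredients string[]
  steps string[]
}

function ExtractRecipe(recipe_text: string) -> Recipe {
  // Shorthand client reference; a named client block works too.
  client "openai/gpt-4o-mini"
  prompt #"
    Extract a structured recipe from the text below.

    {{ ctx.output_format }}

    {{ recipe_text }}
  "#
}

test ExtractsBreadRecipe {
  functions [ExtractRecipe]
  args {
    recipe_text #"
      Mix 500g flour, 350g water, and 10g salt. Knead, rest for an hour,
      then bake at 450F for 40 minutes.
    "#
  }
  // A named assert fails the test when its expression is false.
  @@assert(has_ingredients, {{ this.ingredients|length >= 3 }})
  // A named check is recorded (and usable in later asserts) but does not
  // fail the test on its own.
  @@check(reasonably_fast, {{ _.latency_ms < 5000 }})
}
```

Running the test calls the LLM, parses the response into the declared output type, and then evaluates each @@check and @@assert against the parsed result.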

Test expressions

The expression inside an @@assert or @@check is a Jinja2 expression.

  • The _ variable exposes the fields result, checks, and latency_ms.
  • The this variable refers to the output of the function under test, and is shorthand for _.result.
  • Inside a given check or assert, _.checks.$NAME refers to the result of an earlier check named $NAME in the same test block. By referring to prior checks, you can build compound checks and asserts, for example asserting that all checks of a certain kind passed (see the sketch below).
[Interactive playground preview]
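
To make that concrete, here is a sketch of a test that combines earlier checks in a final assert. The function, check names, and thresholds are illustrative:

```baml
function SummarizeArticle(article: string) -> string {
  client "openai/gpt-4o-mini"
  prompt #"
    Summarize the following article in one short paragraph.

    {{ article }}
  "#
}

test SummaryLooksGood {
  functions [SummarizeArticle]
  args {
    article "BAML tests can declare named checks and combine them in asserts."
  }
  // Each named check is recorded under _.checks but does not fail the test by itself.
  @@check(not_empty, {{ this|length > 0 }})
  @@check(not_too_long, {{ this|length < 500 }})
  // Compound assert: refer back to earlier checks by name, so the test
  // fails unless both quality checks passed.
  @@assert(quality_checks_pass, {{ _.checks.not_empty and _.checks.not_too_long }})
}
```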

Unlike other LLM testing frameworks, BAML evals:

  1. Work for prompts with structured outputs -- with compile-time error checking of your test expressions.
  2. Support any language -- since you don't need Python, Ruby, TypeScript, Go, etc. to run your tests (only BAML).
  3. Run locally -- no logins required.
  4. Work with the BAML VSCode Playground.
  5. Integrate with Boundary Studio, our observability dashboard.

Read more about evaluating LLM functions and let us know what you think!

We are looking at supporting LLM-as-judge evaluations and adding more helper functions that make it easier to evaluate free-form text. Stay tuned for more updates!


Thanks for reading!