🦄 Designing Evals
Minimalist and high-performance testing/evals for LLM applications. Stay tuned for our season 2 kickoff topic on testing and evaluation strategies.
Overview
This session explores best practices for evaluating LLM applications, focusing on practical, efficient approaches that provide meaningful insights without unnecessary complexity.
Running this code
Installing dependencies

```shell
# Install dependencies
uv sync
```

Run the code

```shell
# Run the code
python hello.py
```
Key Topics
- Why evals are great: what you can do once you have an answer key
- How to get the answer key
  - We all start out with no answer key
  - How do you build one up over time?
- Structured vs. unstructured data
  - People view data as one or the other, but it's often semi-structured / a blend
    - JSON with sentences
    - Markdown with JSON
- Using rubrics to design evals
- LLM as judge
- The Enron email dataset
- Visualizing eval results
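The rubric and LLM-as-judge topics above can be sketched as a small helper. This is a minimal sketch, not the session's actual code: the rubric text, the `Verdict` category names, and the shape of the judge's reply are all illustrative assumptions (the judge model call itself is left out).

```python
from enum import Enum


class Verdict(Enum):
    """Categorical grades instead of raw confidence numbers: each level is defined."""
    POOR = "poor"
    ACCEPTABLE = "acceptable"
    EXCELLENT = "excellent"


# Hypothetical rubric: one concrete definition per category, so the judge
# (and any human reviewer) applies the same standard every time.
RUBRIC = """\
Grade the answer against the question using exactly one label:
- poor: factually wrong, or ignores the question
- acceptable: correct but incomplete or poorly explained
- excellent: correct, complete, and clearly explained
Reply with only the label on the final line."""


def build_judge_prompt(question: str, answer: str) -> str:
    """Assemble the prompt sent to the judge model (the model call is not shown)."""
    return f"{RUBRIC}\n\nQuestion: {question}\nAnswer: {answer}"


def parse_verdict(raw_judge_output: str) -> Verdict:
    """Map the judge's free-text reply onto the enum; fails if it invented a label."""
    label = raw_judge_output.strip().splitlines()[-1].strip().lower()
    return Verdict(label)  # raises ValueError if the judge strayed from the rubric
```

Parsing into an enum (rather than scoring "0.87 confidence") makes disagreements between judge runs easy to count and diff, which is the point of a rubric.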
Session Notes
Checklist
- Vibe evals: run your prompt (e.g. in a playground) and eyeball the output
- Write down a few test cases that work
- Write a few end-to-end tests that run your prompt chain (e.g. with pytest)
  - Great for tone
- Capture intermediate steps of your pipeline as probes and individually testable components
- An alternative to probes: structured outputs from an LLM
  - Helps you break your problem down into smaller components
  - e.g. lesson plan output --> "list of biases", "estimated cost"
- Don't use numbers for confidence; use a rubric
  - Categorical: "slow" vs. "medium" vs. "fast" (enum-based evals)
- Use prod data to build up your golden dataset over time
- Review diffs in either/both of the RAW OUTPUT and the STRUCTURED EVALUATION of your pipeline outputs
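The checklist's end-to-end-test and structured-output ideas can be sketched together in pytest style. Everything here is a hypothetical stand-in: the pipeline function, its field names (`list_of_biases`, `estimated_cost`, `speed`), and its stubbed return value are assumptions illustrating the pattern, not the session's real chain.

```python
from enum import Enum


class Speed(Enum):
    # Enum-based eval: categorical grades, not a raw confidence number.
    SLOW = "slow"
    MEDIUM = "medium"
    FAST = "fast"


def run_lesson_plan_pipeline(topic: str) -> dict:
    """Stand-in for a real prompt chain. A real version would call an LLM and
    return structured output; here we return a fixed dict so the test runs."""
    return {
        "lesson_plan": f"Day 1: introduce {topic}. Day 2: practice problems.",
        "list_of_biases": ["assumes prior algebra"],
        "estimated_cost": 0.0042,  # hypothetical dollars per run
        "speed": "fast",
    }


def test_lesson_plan_end_to_end():
    """End-to-end test: run the whole chain, then probe each structured field
    instead of diffing one opaque blob of text."""
    out = run_lesson_plan_pipeline("fractions")
    assert "fractions" in out["lesson_plan"]        # raw-output check
    assert isinstance(out["list_of_biases"], list)  # structured probe
    assert out["estimated_cost"] < 0.05             # budget guardrail
    assert Speed(out["speed"]) is Speed.FAST        # enum-based eval
```

Run it with `pytest` against the file, or call `test_lesson_plan_end_to_end()` directly; each assertion is one of the checklist's probes, so a regression tells you which component broke rather than just "the output changed".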
Links
- (Using only) integration tests are a scam: https://www.youtube.com/watch?v=VDfX44fZoMc