🦄 Policy to Prompt: Evaluating with the Enron Emails Dataset
One of the most common problems in AI engineering is looking at a set of policies/rules and evaluating evidence to determine if the rules were followed. In this session we'll explore turning policies into prompts and pipelines to evaluate which emails in the massive Enron email dataset violated SEC and Sarbanes-Oxley regulations.
Project Details
<a href="https://www.youtube.com/watch?v=gkekVC67iVs"></a>
Key Topics
Policy-to-Prompt Workflows
- Mapping compliance policies (Sarbanes-Oxley, JP Morgan Code of Conduct) to automated LLM checks
- Focusing on specific rules (e.g., gift-giving) rather than generic policy systems
- Building targeted evaluation pipelines
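The mapping from one narrow rule to an LLM check can be sketched as a prompt builder. This is illustrative: the rule text and prompt wording are assumptions, not the session's actual prompt.

```python
# Illustrative gift-giving rule (assumed wording, not a real policy excerpt).
GIFT_POLICY = (
    "Employees may not give or receive gifts worth more than $100 "
    "from parties doing business with the firm."
)

def build_check_prompt(policy: str, email_body: str) -> str:
    """Embed one specific policy rule and one email into a targeted LLM prompt."""
    return (
        "You are a compliance reviewer.\n\n"
        f"Policy rule:\n{policy}\n\n"
        f"Email:\n{email_body}\n\n"
        "Does this email violate the rule? Answer VIOLATION or OK, "
        "then briefly cite the evidence."
    )

prompt = build_check_prompt(
    GIFT_POLICY, "I'll send over the courtside tickets as thanks for the deal."
)
```

Keeping the prompt scoped to one rule, rather than pasting in an entire policy document, is what makes the downstream evaluation tractable.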
Iterative Evaluation Loop
- Start with vibe evals (playground testing)
- Add deterministic pytest cases
- Capture intermediate pipeline steps
- Use structured outputs (e.g. Pydantic models)
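The last two steps fit together: a Pydantic model pins down the LLM's output shape, and captured outputs become deterministic pytest cases. A minimal sketch, assuming Pydantic v2; the field names are assumptions, not the session's actual schema.

```python
from pydantic import BaseModel

class EmailAnalysis(BaseModel):
    """Structured output the LLM is asked to return (fields are illustrative)."""
    violates_policy: bool
    rule: str
    evidence: str
    confidence: float

# A captured intermediate output from the pipeline, replayed deterministically.
raw = (
    '{"violates_policy": true, "rule": "gift-giving", '
    '"evidence": "offered courtside tickets", "confidence": 0.9}'
)
analysis = EmailAnalysis.model_validate_json(raw)

def test_flags_gift_violation():
    # pytest-style assertion on the parsed, validated output.
    assert analysis.violates_policy
    assert analysis.rule == "gift-giving"
```

Validation failures (missing fields, wrong types) surface immediately instead of as silent downstream bugs, which is the payoff of structured outputs over free-text parsing.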
Scaling & Tooling Patterns
- Regex pre-filtering → async LLM calls → structured analysis
- Parallel processing with asyncio.gather
- Batch processing for large datasets
- Progress tracking with tqdm
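The pipeline shape above can be sketched end to end. The regex, batch size, and `analyze_email` coroutine are assumptions: the coroutine is a placeholder standing in for the real async LLM call.

```python
import asyncio
import re

# Cheap pre-filter: only emails mentioning gift-like terms reach the LLM.
GIFT_RE = re.compile(r"\b(gift|tickets?|complimentary)\b", re.IGNORECASE)

async def analyze_email(body: str) -> dict:
    """Placeholder for the real async LLM call (assumed interface)."""
    await asyncio.sleep(0)
    return {"body": body, "flagged": True}

async def run_pipeline(emails: list[str], batch_size: int = 50) -> list[dict]:
    candidates = [e for e in emails if GIFT_RE.search(e)]
    results: list[dict] = []
    # Process in batches to bound concurrency; on a large dataset, wrap this
    # range with tqdm for progress tracking (omitted to stay stdlib-only).
    for i in range(0, len(candidates), batch_size):
        batch = candidates[i : i + batch_size]
        results.extend(await asyncio.gather(*(analyze_email(e) for e in batch)))
    return results

emails = [
    "Lunch at noon?",
    "Sending the tickets as a gift for closing the deal.",
]
flagged = asyncio.run(run_pipeline(emails))
```

The regex stage is the cost lever: on a corpus the size of Enron's (~500k emails), filtering before the LLM call is the difference between a tractable run and an enormous bill.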
Human-in-the-Loop & Golden Dataset
- Store analyzed emails as JSON files
- Enable reviewer triage of high-risk cases
- Build golden dataset from production traffic
- Monitor for drift and expand test cases
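The storage and triage steps can be sketched with the stdlib: one JSON file per analyzed email, plus a query that surfaces high-risk cases for a reviewer. The file layout, field names, and threshold are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path

def save_analysis(out_dir: Path, email_id: str, analysis: dict) -> Path:
    """Persist one analyzed email as JSON so reviewers can triage it later."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{email_id}.json"
    path.write_text(json.dumps(analysis, indent=2))
    return path

def high_risk(out_dir: Path, threshold: float = 0.8) -> list[dict]:
    """Reviewer triage: load stored analyses and surface high-confidence flags."""
    cases = [json.loads(p.read_text()) for p in sorted(out_dir.glob("*.json"))]
    return [c for c in cases if c.get("confidence", 0.0) >= threshold]

out = Path(tempfile.mkdtemp())
save_analysis(out, "email_001", {"violates_policy": True, "confidence": 0.92})
save_analysis(out, "email_002", {"violates_policy": False, "confidence": 0.20})
risky = high_risk(out)
```

Reviewer verdicts on these stored cases are exactly the material for the golden dataset: each confirmed or rejected flag becomes a new deterministic test case, and re-running them over time is the drift monitor.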
Aside - 12-Factor / ShadCN-for-Agents Mindset
- Open, customizable scaffold approach vs closed systems
- Developers own and version their agent code
- Flexibility to tweak and adapt