#10
2 months ago
🦄 Entity Resolution: Extraction, Deduping, and Enriching
Disambiguating the many ways of naming the same thing (companies, skills, etc.), from entity extraction to resolution to deduping. We'll break the problem into extraction → resolution → enrichment stages, scale with a two-stage design, and build async workflows with human-in-the-loop review for production entity resolution systems.
Project Details
Video (1h15m) (available June 20, 8 am PST)
Links:
- https://github.com/BoundaryML/baml-examples/tree/main/extract-anything
- Related Session: Large Scale Classification
Key Takeaways
- Separate Extraction from Resolution: Extract "what string did the user type?" first, then resolve "which row in my DB?" separately
- Two-Stage Design for Scale: Putting the whole candidate list in the prompt fails beyond ~500 companies; use staged queues instead of bigger prompts
- Heuristics Before LLMs: Straight alias matching covers 80% of cases; save LLM calls for the hard 20%
- Type-Signature Mindset: Treat every LLM call as a pure function; swap implementations without rewriting call-sites
- Status-Driven Async Workflow: Use database status columns (proposed/ready/committed) to enable human-in-the-loop review and future automation
- Start Expensive, Then Optimize: Ship with big models first, collect ground-truth data, then optimize when it hurts
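The "heuristics before LLMs" and "type-signature" takeaways can be sketched together in Python. This is a minimal illustration, not the session's actual code: the `Resolver` type, the alias table, the `company:...` IDs, and `stub_llm_resolver` are all hypothetical names introduced here. The idea is that every resolver shares one signature (extracted string in, canonical ID or `None` out), so a cheap alias lookup can run first and a model-backed resolver can be swapped in later without touching call sites.

```python
from typing import Callable, Optional

# Assumption: a resolver is a pure function from an extracted string
# to a canonical ID, or None when it cannot decide.
Resolver = Callable[[str], Optional[str]]

# Hypothetical alias table: normalized alias -> canonical ID.
ALIASES = {
    "google": "company:google",
    "google llc": "company:google",
    "ibm": "company:ibm",
    "international business machines": "company:ibm",
}


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = "".join(
        ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower()
    )
    return " ".join(cleaned.split())


def alias_resolver(name: str) -> Optional[str]:
    """Cheap heuristic: exact match on the normalized alias."""
    return ALIASES.get(normalize(name))


def stub_llm_resolver(name: str) -> Optional[str]:
    """Placeholder for a model call (hypothetical); always gives up here."""
    return None


def chain(*resolvers: Resolver) -> Resolver:
    """Build one resolver that tries each stage in order, cheapest first."""
    def resolve(name: str) -> Optional[str]:
        for resolver in resolvers:
            hit = resolver(name)
            if hit is not None:
                return hit
        return None
    return resolve


# Call sites depend only on the Resolver signature, not on the stages.
resolve = chain(alias_resolver, stub_llm_resolver)
```

With this sketch, `resolve("Google, LLC.")` is answered by the alias table alone, while a name with no alias falls through to the second stage and, with the stub, comes back as `None`, which is the point at which a real system would queue it for enrichment.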
Whiteboards
Core Architecture
Pipeline Stages
- Extraction: Extract entities from raw text with small models (gpt-4o-mini, llama3:8b)
- Resolution: Match extracted entities to canonical database entries
- Enrichment: Queue unknown entities for web search and human review
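The enrichment stage and the status-driven workflow can be sketched with a single queue table. This is a rough sketch under stated assumptions: the table name `entity_queue`, the helper names, and the meaning given to each status (proposed = queued by the pipeline, ready = approved by a reviewer, committed = written to the canonical table) are illustrative guesses, not details confirmed by the session. It uses Python's built-in `sqlite3` with an in-memory database.

```python
import sqlite3
from typing import Optional


def open_db() -> sqlite3.Connection:
    """Create an in-memory queue table with a constrained status column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE entity_queue (
               id INTEGER PRIMARY KEY,
               raw_name TEXT NOT NULL UNIQUE,
               status TEXT NOT NULL DEFAULT 'proposed'
                   CHECK (status IN ('proposed', 'ready', 'committed'))
           )"""
    )
    return conn


def propose(conn: sqlite3.Connection, raw_name: str) -> None:
    """Queue an unknown entity; proposing the same name twice is a no-op."""
    conn.execute(
        "INSERT OR IGNORE INTO entity_queue (raw_name) VALUES (?)",
        (raw_name,),
    )


def advance(conn: sqlite3.Connection, raw_name: str,
            expected: str, new: str) -> bool:
    """Move a row from `expected` to `new`; True only if a row changed.

    Checking the current status in the WHERE clause means a human reviewer
    and an automated worker cannot both apply the same transition.
    """
    cur = conn.execute(
        "UPDATE entity_queue SET status = ? WHERE raw_name = ? AND status = ?",
        (new, raw_name, expected),
    )
    return cur.rowcount == 1


def status_of(conn: sqlite3.Connection, raw_name: str) -> Optional[str]:
    """Return the current status of a queued name, or None if absent."""
    row = conn.execute(
        "SELECT status FROM entity_queue WHERE raw_name = ?", (raw_name,)
    ).fetchone()
    return row[0] if row else None
```

Because each step only reads and writes the status column, the step that moves a row from `proposed` to `ready` can start as a human clicking "approve" and later be replaced by an automated check without changing the schema.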