In this episode of AI That Works, hosts Vaibhav Gupta and Dex, along with guest Kevin Gregory, explore building production-ready AI classification systems. They discuss dynamic UIs for flexible content creation, the challenges of large-scale classification, and the role of user experience in AI applications. The conversation covers using LLMs to improve classification accuracy, evaluating and tuning these systems, and the subjective nature of what counts as a 'correct' classification, along with model upgrades, user feedback, and iterative development with rapid prototyping for evaluating chatbot performance.
This episode dives deep into the practical challenges of building AI systems ready for production. The hosts explore large-scale classification systems handling 1000+ categories, demonstrating how to evaluate and tune these systems for real-world use.
Key Topics Covered
Building production-grade classification systems
Dynamic UIs for flexible content creation
Using LLMs to enhance classification accuracy
Evaluation strategies and custom dashboards
The subjective nature of classification correctness
Tuning classification pipelines for performance
Balancing accuracy, cost, and user experience
Key Takeaways
AI engineering concepts can be applied to real projects with measurable impact
Building production-grade classification systems requires careful attention to UX
Evaluating AI systems requires understanding both metrics and user experience
Subjectivity plays a significant role in defining correct classifications
Real user data is crucial for effective iteration and improvement
UI design should prioritize clarity and enable rapid spot-checking
This system solves the challenge of classifying text into large category sets (1000+ categories) by using a two-stage approach:
Narrowing Stage: Uses vector embeddings to quickly narrow down from 1000+ categories to ~5-10 candidates
Selection Stage: Uses LLM reasoning to select the best final category from the narrowed candidates
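To make the flow concrete, here is a minimal, self-contained sketch of the two stages calling the openai SDK directly. The helper names and model choices are illustrative, not the project's actual API; the real system routes the selection step through BAML and caches embeddings in ChromaDB.
# Minimal two-stage sketch using the openai SDK directly (illustrative only;
# the project itself uses BAML for the LLM call and ChromaDB for caching).
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def classify(text: str, categories: list[str], k: int = 10) -> str:
    # Stage 1: narrow 1000+ categories to k candidates via cosine similarity
    cat_vecs = embed(categories)  # in the real system these are cached
    q = embed([text])[0]
    sims = cat_vecs @ q / (np.linalg.norm(cat_vecs, axis=1) * np.linalg.norm(q))
    candidates = [categories[i] for i in np.argsort(sims)[::-1][:k]]
    # Stage 2: the LLM picks the single best category from the short list
    prompt = "Pick the single best category for: " + text + "\n" + "\n".join(candidates)
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content.strip()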
Quick Start
Prerequisites
Python 3.10+
OpenAI API key
uv package manager
Installation
# Clone and navigate to the project
cd level-3-code/large_scale_classification
# Install runtime dependencies (for running the system)
uv sync
# OR install with development dependencies (for contributing/development)
uv sync --extra dev
# Set up environment variables
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY
Note: Use uv sync --extra dev if you plan to contribute to the project or need development tools like linting (ruff), type checking (pyright), and testing (pytest). For just running the classification system, uv sync is sufficient.
Generate BAML Client
# Convert BAML files to Python client code
uv run baml-cli generate
Basic Usage
# Run the interactive classification system
uv run python src/main.py
This will prompt you to enter text for classification and return the most appropriate category.
How It Works
Category Loading: System loads 1000+ categories from data/categories.txt
Embedding Generation: Creates embeddings for input text and categories
Narrowing: Reduces categories to top candidates using similarity search
LLM Selection: Uses BAML/LLM to choose the best category from candidates
Result: Returns selected category with metadata and timing
Performance Features
Vector Store Caching
The system includes a ChromaDB-based vector store that caches embeddings for performance:
Faster lookups: Cached embeddings vs fresh API calls
Automatic caching: New categories are automatically added to the store
Model validation: Ensures compatibility between stored and current embeddings
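As a rough illustration of the caching pattern (not the project's actual module; the collection and path names here are hypothetical, and the real version lives in scripts/build_vector_store.py), a ChromaDB persistent collection can store category embeddings once and serve similarity lookups afterwards:
# Illustrative ChromaDB caching pattern (hypothetical names, assuming the
# chromadb package).
import chromadb

client = chromadb.PersistentClient(path=".chroma")
collection = client.get_or_create_collection(name="categories")

def add_categories(names: list[str], embeddings: list[list[float]]) -> None:
    # Store each category once; later lookups reuse the cached embeddings
    collection.add(ids=names, documents=names, embeddings=embeddings)

def nearest_categories(query_embedding: list[float], k: int = 10) -> list[str]:
    # Similarity search against the cached embeddings, no fresh embedding calls
    result = collection.query(query_embeddings=[query_embedding], n_results=k)
    return result["documents"][0]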
Build Vector Store
# Build the vector store from categories
python scripts/build_vector_store.py
# Force rebuild (e.g., after changing embedding models)
python scripts/build_vector_store.py --force-rebuild
Narrowing Strategies
The system supports multiple narrowing strategies:
narrowing_strategy = NarrowingStrategy.HYBRID # EMBEDDING, HYBRID, or LLM
max_narrowed_categories = 5 # Number of candidates to pass to final selection
Testing
The system includes comprehensive testing infrastructure with both unit and integration tests:
Run Tests
# Run all tests (unit + integration)
cd tests
python run_tests.py
# Run specific test types
python run_tests.py --unit # Unit tests only (fast, no API calls)
python run_tests.py --narrowing-accuracy # Narrowing accuracy integration test
python run_tests.py --selection-accuracy # Selection accuracy integration test
python run_tests.py --pipeline-accuracy # Complete pipeline integration test
python run_tests.py --all # All tests explicitly
Test Types
Unit Tests: Fast component testing with mocking (embeddings, narrowing, selection, pipeline, vector store)
Narrowing Accuracy: Tests how often the correct category is included in narrowed results
Selection Accuracy: Tests final category selection accuracy
Pipeline Accuracy: End-to-end pipeline testing with performance metrics
Test Results
Integration tests automatically save detailed JSON results with timestamps for performance tracking:
# Compare results across test runs
python tests/compare_results.py --narrowing file1.json file2.json
Running Individual Tests
# Unit tests (from project root)
uv run pytest tests/unit/classification/pipeline_test.py -v
uv run pytest tests/unit/classification/selection_test.py -v
# Integration tests (from tests/integration)
cd tests/integration
python test_pipeline_accuracy.py
Configuration
Environment Variables
Create a .env file with only the required API key:
# Required - the only thing needed in .env
OPENAI_API_KEY=your_api_key_here
Application Settings
All other configuration is done in src/config/settings.py. You can modify the default values directly in the file:
class Settings(BaseSettings):
    """Application configuration settings."""

    # OpenAI Configuration
    openai_api_key: str  # Loaded automatically from .env. Don't put your key here
    embedding_model: str = "text-embedding-3-small"

    # Classification Strategy
    narrowing_strategy: NarrowingStrategy = NarrowingStrategy.HYBRID
    max_narrowed_categories: int = 5

    # Hybrid Strategy Specific Settings
    max_embedding_candidates: int = 10  # How many categories embedding stage returns
    max_final_categories: int = 3  # How many categories LLM stage returns

    # Data Configuration
    categories_file_path: pathlib.Path = CWD.parents[1] / C.DATA / C.CATEGORIES_TXT
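For reference, the .env loading works the way pydantic settings classes normally do. A minimal standalone sketch, assuming the pydantic-settings package (the project's actual class in src/config/settings.py may be configured differently):
# Illustrative .env loading via pydantic-settings
from pydantic_settings import BaseSettings, SettingsConfigDict

class ExampleSettings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")
    openai_api_key: str  # populated from OPENAI_API_KEY in .env

settings = ExampleSettings()  # raises a validation error if the key is missing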
The narrowing_strategy setting selects one of three options:
NarrowingStrategy.EMBEDDING: Pure embedding-based narrowing (fastest)
NarrowingStrategy.HYBRID: Embedding narrowing followed by LLM refinement (default)
NarrowingStrategy.LLM: Pure LLM-based narrowing (most flexible)
Tuning Performance
Adjust these settings in settings.py to optimize for your use case:
max_narrowed_categories: Number of candidates passed to final selection (default: 5)
max_embedding_candidates: For hybrid strategy, how many categories the embedding stage returns (default: 10)
max_final_categories: For hybrid strategy, how many categories the LLM stage returns (default: 3)
embedding_model: OpenAI embedding model to use (default: "text-embedding-3-small")
Category Data
Categories are loaded from data/categories.txt. The format supports hierarchical categories:
/Appliances/Refrigerators/French Door Refrigerators
/Appliances/Dishwashers/Built-in Dishwashers
/Appliances/Appliance Parts/Dishwasher Parts
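Each line encodes a path from the top-level category down to the leaf; splitting on the separator recovers the levels, as in this small illustrative snippet:
# Illustrative: split one category line into its hierarchy levels
line = "/Appliances/Refrigerators/French Door Refrigerators"
levels = [part for part in line.split("/") if part]
print(levels)  # ['Appliances', 'Refrigerators', 'French Door Refrigerators']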
Development Workflow
Configuration → Testing → Analysis Workflow
The system supports a complete development workflow for optimizing classification performance:
Update Configuration: Modify settings in src/config/settings.py
Run Performance Tests: Execute pipeline tests with version tracking
Analyze Results: Use the Streamlit app to compare performance across versions
Example Workflow
# 1. Update configuration settings
# Edit src/config/settings.py - for example:
# max_narrowed_categories = 10 (was 5)
# max_embedding_candidates = 50 (was 10)
# 2. Run pipeline test with version tracking
uv run python tests/integration/test_pipeline_accuracy.py --save-as v7 --description "embedding 50, llm 10, model upgrade"
# 3. View results in Streamlit app
uv run streamlit run ui/app.py
# 4. Compare with previous versions in the UI
# The app will show performance comparisons across all saved versions
Configuration Parameters for Optimization
Key settings in src/config/settings.py that affect performance:
class Settings(BaseSettings):
    # Strategy Selection
    narrowing_strategy: NarrowingStrategy = NarrowingStrategy.HYBRID

    # Performance Tuning
    max_narrowed_categories: int = 5  # Final candidates passed to LLM
    max_embedding_candidates: int = 10  # Embedding stage candidates (hybrid only)
    max_final_categories: int = 3  # LLM stage candidates (hybrid only)

    # Model Selection
    embedding_model: str = "text-embedding-3-small"  # or "text-embedding-3-large"
Streamlit Analysis Dashboard
The Streamlit app (ui/app.py) provides:
Performance Comparison: Compare accuracy and timing across test versions
Detailed Analysis: Drill down into individual test case results
Configuration Tracking: See what settings were used for each version
Trend Analysis: Track performance improvements over time
Launch the dashboard:
uv run streamlit run ui/app.py
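As a rough sketch of how such a comparison view can be assembled (illustrative only; the real ui/app.py and the saved-run schema differ, and the "accuracy" field name is an assumption), a few lines of Streamlit are enough to tabulate saved runs:
# Illustrative comparison view over saved run files
import json
import pathlib
import streamlit as st

runs_dir = pathlib.Path("tests/results/saved_runs")
rows = []
for path in sorted(runs_dir.glob("*.json")):
    data = json.loads(path.read_text())
    # "accuracy" is an assumed field name for illustration
    rows.append({"version": path.stem, "accuracy": data.get("accuracy")})

st.title("Pipeline accuracy by version")
st.dataframe(rows)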
Version Management
Pipeline tests support version tracking for systematic performance analysis:
# Save test results with version and description
uv run python tests/integration/test_pipeline_accuracy.py --save-as v6 --description "baseline configuration"
uv run python tests/integration/test_pipeline_accuracy.py --save-as v7 --description "increased embedding candidates to 50"
uv run python tests/integration/test_pipeline_accuracy.py --save-as v8 --description "upgraded to text-embedding-3-large"
Results are saved to tests/results/saved_runs/ with metadata for easy comparison.
Advanced Usage
Programmatic Usage
from src.classification.pipeline import ClassificationPipeline
# Initialize pipeline
pipeline = ClassificationPipeline()
# Classify text
result = pipeline.classify("Samsung 17.5-cu ft French door refrigerator")
print(f"Category: {result.category.name}")
print(f"Confidence: {result.confidence}")
print(f"Processing time: {result.processing_time_ms:.1f}ms")
print(f"Candidates: {[c.name for c in result.candidates]}")
Custom Categories
To use your own category set:
Replace data/categories_full.txt with your categories
Rebuild the vector store: python scripts/build_vector_store.py --force-rebuild
Update test cases in tests/data/test_cases.py if needed
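A quick sanity check before rebuilding can catch malformed lines. A small illustrative snippet (adjust the filename to whichever categories file your setup loads):
# Illustrative sanity check: every non-empty category line should start with "/"
with open("data/categories.txt") as f:
    bad = [line.rstrip() for line in f if line.strip() and not line.startswith("/")]
print(f"{len(bad)} malformed lines")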
BAML Integration
The system uses BAML for LLM interactions. BAML files are in src/baml_src/: