Architecture
System Overview
The ETL pipeline is a Next.js application with serverless API routes handling the heavy processing. All state lives in a single Supabase table (etl_jobs), with Realtime subscriptions pushing progress updates to the frontend.
Layers
| Layer | Components | Role |
|---|---|---|
| Browser | React + Mantine, useEtlPipeline (orchestrator), useJobProgress (Realtime subscription) | UI, file upload, progress display |
| Serverless | /api/upload, /api/analyze, /api/extract + core libs (etl.ts, spot-check.ts) | All processing — scanning, structure analysis, extraction, validation |
| AI | Anthropic Claude API (Opus for structure, Sonnet for extraction) | Sheet classification, column mapping, complex data extraction, spot-check audits |
| Storage | Supabase — etl_jobs table (state + progress), etl-files bucket (uploads + outputs) | Persistence, Realtime subscriptions |
Data flow
Browser → fetch() → Serverless API → Claude API → Supabase
Browser ← Realtime subscription ← Supabase

The browser sends requests to the serverless functions, which do the heavy lifting (parsing Excel, calling Claude, writing output). Progress updates are written to etl_jobs and pushed to the browser via Supabase Realtime — no polling.
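As a sketch, the browser side of this flow might reduce each Realtime UPDATE payload from etl_jobs into timeline state like this (the field names `progress` and `message` are illustrative, not confirmed from the source):

```ts
// Fields of the etl_jobs row the UI cares about (illustrative names).
interface JobRow {
  status: string;
  progress: number; // 0–100
  message: string;  // current milestone text
}

interface TimelineState {
  percent: number;
  milestones: string[];
}

// Pure reducer: apply one Realtime UPDATE payload to the timeline state.
// Supabase Realtime delivers the updated row as `payload.new`; appending
// only on a new message keeps the scrolling timeline free of duplicates.
function applyJobUpdate(state: TimelineState, row: JobRow): TimelineState {
  const last = state.milestones[state.milestones.length - 1];
  const milestones =
    row.message && row.message !== last
      ? [...state.milestones, row.message]
      : state.milestones;
  return { percent: row.progress, milestones };
}
```

In the real useJobProgress hook, a reducer like this would sit inside a `supabase.channel(...).on("postgres_changes", ...)` subscription filtered to the current job's row.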
AI Models
The pipeline uses two Anthropic Claude models at different stages, chosen for the cost/capability tradeoff at each step.
| Stage | Model | Why |
|---|---|---|
| Structure analysis | claude-opus-4-6 | Deep reasoning about ambiguous spreadsheet layouts — classifying Table types, resolving overlapping regions, mapping columns to a 40-field schema. This is the hardest step and benefits from Opus-level reasoning. |
| Sheet pre-scanning | claude-sonnet-4-6 | Fast classification of each Sheet — finding Table boundaries, detecting charge layouts, identifying header rows. Lower complexity, high throughput. |
| Data extraction | claude-sonnet-4-6 | Extracting Unit data from chunks. Each chunk is 60 rows max, up to 8 chunks processed in parallel. Sonnet provides the best speed/accuracy balance for structured extraction. |
| Spot-check audit | claude-sonnet-4-6 | Re-extracts a biased sample of 8 Units independently to verify extraction accuracy. |
Prompt strategy
- System prompts define the canonical schema and expected output format (JSON)
- User prompts include the actual spreadsheet data as formatted text grids
- Temperature: 0 for all extraction calls (reproducible output)
- Structured output: All extraction calls use JSON mode with explicit field-level instructions
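A minimal sketch of what one extraction request could look like under this strategy — the schema text and grid formatting are placeholders, and `buildExtractionRequest` is a hypothetical helper; only the system/user split and `temperature: 0` come from the list above:

```ts
interface ExtractionRequest {
  model: string;
  temperature: number;
  system: string;
  messages: { role: "user"; content: string }[];
}

// Build request params for one extraction chunk: the canonical schema goes
// in the system prompt, the chunk's cells in the user prompt as a text grid.
function buildExtractionRequest(schemaJson: string, grid: string[][]): ExtractionRequest {
  const gridText = grid.map((row) => row.join(" | ")).join("\n");
  return {
    model: "claude-sonnet-4-6",
    temperature: 0, // reproducible output
    system: `Extract Units matching this schema. Respond with JSON only.\n${schemaJson}`,
    messages: [{ role: "user", content: gridText }],
  };
}
```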
Cost profile
Every file goes through the full AI pipeline; extraction is LLM-only (see Key Design Decisions below). AI calls per job:
- Structure analysis (one Opus call)
- Sheet scanning (one Sonnet call per Sheet)
- Data extraction (up to 8 parallel Sonnet chunk calls)
- Spot-check (one Sonnet call to audit a sample of extracted Units)
Typical single-Sheet file: 1 (structure) + 1 (scan) + chunk count + 1 (spot-check) API calls; e.g., 11 calls for a file with 8 extraction chunks.
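The count above can be expressed as a small helper (a sketch; the real pipeline may count multi-Sheet files differently):

```ts
// Total Anthropic API calls for one job:
// 1 structure analysis + one scan per Sheet + one call per chunk + 1 spot-check.
function apiCallsPerJob(sheetCount: number, chunkCount: number): number {
  return 1 + sheetCount + chunkCount + 1;
}
```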
Module Inventory
| File | Purpose | ~Lines |
|---|---|---|
| lib/etl.ts | Core ETL engine: Excel parsing, Sheet scanning, structure analysis, chunking, LLM extraction, validation, output writing | 1170 |
| lib/deterministic-extractor.ts | Legacy module — contains utility functions (charge code dictionary, status normalization helpers) imported by other modules; not used as an extraction mode | ~500 |
| lib/canonical-schema.ts | Canonical field definitions, type coercion (coerceFieldValue), status normalization, shared TypeScript types | ~355 |
| lib/api-utils.ts | Shared API route helpers (auth, error handling, job updates) | ~31 |
| lib/spot-check.ts | LLM-based post-extraction audit with biased sampling | 226 |
| lib/comparison.ts | Builds source-to-canonical comparison data for the review UI | 151 |
| lib/supabase.ts | Client-side Supabase singleton (anon key) | 14 |
| lib/supabase-server.ts | Server-side Supabase singleton (service role key) | 9 |
| lib/auth.ts | Server-side Bearer token authentication | 12 |
| lib/auth-client.ts | Client-side auth header builder (reads from sessionStorage) | 10 |
Key Design Decisions
LLM-Only Extraction
LLM extraction consistently produces the highest accuracy across all vendor formats, so the pipeline uses it exclusively. Data is chunked intelligently (up to 8 chunks in parallel) and processed via Claude Sonnet. This single-path design eliminates mode-selection logic and quality-gate fallbacks, simplifying the pipeline while maintaining top accuracy.
Two-Phase API Design
Structure analysis and extraction are separate API calls:
- /api/analyze (120s timeout) — reads the spreadsheet, classifies Sheets into Tables, maps columns to the schema
- /api/extract (300s timeout) — uses the structure map to pull Units, validate, and generate output
This separation enables the human-in-the-loop review step between phases.
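A sketch of the two-phase flow with the review pause in between; the `analyze`, `review`, and `extract` parameters stand in for the fetch() calls and UI step, and their shapes here are illustrative:

```ts
interface StructureMap { tables: string[] }
interface ExtractResult { units: number }

// Orchestrate the two phases, pausing for human review of the structure map.
// In the real app, analyze/extract are fetch() calls to /api/analyze and
// /api/extract, and review resolves when the user approves the map.
async function runPipeline(
  analyze: () => Promise<StructureMap>,
  review: (map: StructureMap) => Promise<StructureMap>,
  extract: (map: StructureMap) => Promise<ExtractResult>,
): Promise<ExtractResult> {
  const proposed = await analyze();        // phase 1: /api/analyze (120s)
  const approved = await review(proposed); // human-in-the-loop edit/confirm
  return extract(approved);                // phase 2: /api/extract (300s)
}
```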
Streaming Progress via Supabase Realtime
Both API phases push granular progress updates to the etl_jobs table. The frontend subscribes via Supabase Realtime and renders milestones in a scrolling timeline. Progress percentages are distributed:
- Upload: 0–5%
- Analysis: 8–40%
- Extraction: 45–95%
- Validation + Output: 95–100%
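One way to map a phase-local completion fraction onto these global ranges (the range endpoints come from the list above; the helper itself is a sketch):

```ts
type Phase = "upload" | "analysis" | "extraction" | "finalize";

// Global progress ranges per phase, matching the distribution above.
const RANGES: Record<Phase, [number, number]> = {
  upload: [0, 5],
  analysis: [8, 40],
  extraction: [45, 95],
  finalize: [95, 100],
};

// Map a 0–1 fraction of work inside a phase to the overall 0–100 percentage.
function overallProgress(phase: Phase, fraction: number): number {
  const [lo, hi] = RANGES[phase];
  const f = Math.min(1, Math.max(0, fraction));
  return Math.round(lo + (hi - lo) * f);
}
```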
Spot-Check Verification
After extraction, the system audits a sample of extracted Units against the source spreadsheet. It samples 8 Units using biased selection (first 5, last 5, any vacant Units, random fill) and re-extracts them independently via a separate Sonnet call. Discrepancies are scored and surfaced as a confidence rating.
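A sketch of the biased selection described above, assuming Units carry a status field that marks vacancy (how the real sampler resolves the cap when first 5 + last 5 already exceed 8 is not specified; this version just stops adding at the cap):

```ts
interface Unit { id: string; status: string }

// Pick up to `size` Unit indices: first 5, last 5, any vacant, random fill.
function spotCheckSample(units: Unit[], size = 8): number[] {
  const picked = new Set<number>();
  const add = (i: number) => { if (picked.size < size) picked.add(i); };
  for (let i = 0; i < Math.min(5, units.length); i++) add(i);          // first 5
  for (let i = Math.max(0, units.length - 5); i < units.length; i++) add(i); // last 5
  units.forEach((u, i) => { if (u.status === "vacant") add(i); });     // vacant Units
  while (picked.size < Math.min(size, units.length)) {                 // random fill
    add(Math.floor(Math.random() * units.length));
  }
  return [...picked].sort((a, b) => a - b);
}
```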
Constants
```ts
const STRUCTURE_MODEL = "claude-opus-4-6";   // Structure analysis (deep reasoning)
const EXTRACTION_MODEL = "claude-sonnet-4-6"; // Data extraction chunks
const SCAN_MODEL = "claude-sonnet-4-6";       // Sheet pre-scanning
const CHUNK_ROWS = 60;       // Max rows per horizontal chunk
const OVERLAP_ROWS = 5;      // Overlap between chunks (dedup context)
const PARALLEL_CHUNKS = 8;   // Max concurrent extraction workers
const UNITS_PER_CHUNK = 18;  // Max unit blocks per vertical chunk
```
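A sketch of how horizontal chunk boundaries could be derived from CHUNK_ROWS and OVERLAP_ROWS, so that consecutive chunks share the overlap rows as dedup context (the real chunker in lib/etl.ts also does vertical, unit-block chunking via UNITS_PER_CHUNK, which this ignores):

```ts
const CHUNK_ROWS = 60;  // max rows per chunk
const OVERLAP_ROWS = 5; // rows repeated at each boundary for dedup context

// Compute [start, end) row ranges so consecutive chunks share OVERLAP_ROWS rows.
function chunkRanges(totalRows: number, size = CHUNK_ROWS, overlap = OVERLAP_ROWS): [number, number][] {
  const ranges: [number, number][] = [];
  const step = size - overlap; // advance 55 rows per chunk with the defaults
  for (let start = 0; start < totalRows; start += step) {
    const end = Math.min(start + size, totalRows);
    ranges.push([start, end]);
    if (end === totalRows) break;
  }
  return ranges;
}
```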