
Architecture

System Overview

The ETL pipeline is a Next.js application with serverless API routes handling the heavy processing. All state lives in a single Supabase table (etl_jobs), with Realtime subscriptions pushing progress updates to the frontend.

Layers

| Layer | Components | Role |
| --- | --- | --- |
| Browser | React + Mantine, useEtlPipeline (orchestrator), useJobProgress (Realtime subscription) | UI, file upload, progress display |
| Serverless | /api/upload, /api/analyze, /api/extract + core libs (etl.ts, spot-check.ts) | All processing: scanning, structure analysis, extraction, validation |
| AI | Anthropic Claude API (Opus for structure, Sonnet for extraction) | Sheet classification, column mapping, complex data extraction, spot-check audits |
| Storage | Supabase: etl_jobs table (state + progress), etl-files bucket (uploads + outputs) | Persistence, Realtime subscriptions |

Data flow

Browser → fetch() → Serverless API → Claude API → Supabase
Browser ← Realtime subscription ← Supabase

The browser sends requests to the serverless functions, which do the heavy lifting (parsing Excel, calling Claude, writing output). Progress updates are written to etl_jobs and pushed to the browser via Supabase Realtime — no polling.
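The client-side handling can be sketched as a small pure function that turns a Realtime UPDATE payload into a UI event; the row field names (progress, message) are assumptions, not confirmed names from etl_jobs:

```typescript
// Sketch: converting a Supabase Realtime UPDATE payload into a progress
// event for the timeline UI. Field names `progress` and `message` are
// illustrative assumptions about the etl_jobs row shape.
type RealtimePayload = { eventType: string; new: Record<string, unknown> };
type ProgressEvent = { percent: number; milestone: string };

export function toProgressEvent(payload: RealtimePayload): ProgressEvent | null {
  if (payload.eventType !== "UPDATE") return null; // ignore inserts/deletes
  const row = payload.new;
  return {
    percent: typeof row.progress === "number" ? row.progress : 0,
    milestone: typeof row.message === "string" ? row.message : "",
  };
}
```

A handler like this would run inside the useJobProgress subscription callback, keeping the Realtime wiring separate from the rendering logic.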


AI Models

The pipeline uses two Anthropic Claude models at different stages, chosen for the cost/capability tradeoff at each step.

| Stage | Model | Why |
| --- | --- | --- |
| Structure analysis | claude-opus-4-6 | Deep reasoning about ambiguous spreadsheet layouts: classifying Table types, resolving overlapping regions, mapping columns to a 40-field schema. This is the hardest step and benefits from Opus-level reasoning. |
| Sheet pre-scanning | claude-sonnet-4-6 | Fast classification of each Sheet: finding Table boundaries, detecting charge layouts, identifying header rows. Lower complexity, high throughput. |
| Data extraction | claude-sonnet-4-6 | Extracting Unit data from chunks. Each chunk is 60 rows max, up to 8 chunks processed in parallel. Sonnet provides the best speed/accuracy balance for structured extraction. |
| Spot-check audit | claude-sonnet-4-6 | Re-extracts a biased sample of 8 Units independently to verify extraction accuracy. |

Prompt strategy

  • System prompts define the canonical schema and expected output format (JSON)
  • User prompts include the actual spreadsheet data as formatted text grids
  • Temperature: 0 for all extraction calls (reproducible output)
  • Structured output: All extraction calls use JSON mode with explicit field-level instructions
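A minimal sketch of how these rules might come together in a request payload (Anthropic Messages API shape); the schema text, token budget, and grid formatting here are placeholders, not the pipeline's real prompts:

```typescript
// Sketch: assembling an extraction request per the prompt strategy above.
// The system prompt text and max_tokens value are illustrative assumptions.
type ExtractionRequest = {
  model: string;
  max_tokens: number;
  temperature: number;
  system: string;
  messages: { role: "user"; content: string }[];
};

export function buildExtractionRequest(grid: string[][]): ExtractionRequest {
  // User prompt carries the spreadsheet data as a formatted text grid
  const gridText = grid.map((row) => row.join("\t")).join("\n");
  return {
    model: "claude-sonnet-4-6",
    max_tokens: 8192,
    temperature: 0, // reproducible output for all extraction calls
    system: "You extract Units into the canonical JSON schema. Respond with JSON only.",
    messages: [{ role: "user", content: `Spreadsheet data:\n${gridText}` }],
  };
}
```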

Cost profile

Every file goes through the full AI pipeline; there is no non-LLM fast path. AI calls per job:

  1. Structure analysis (one Opus call)
  2. Sheet scanning (one Sonnet call per Sheet)
  3. Data extraction (up to 8 parallel Sonnet chunk calls)
  4. Spot-check (one Sonnet call to audit a sample of extracted Units)

Typical file: 1 structure call + 1 scan call per Sheet + one extraction call per chunk + 1 spot-check call (e.g., 11 calls for a single-Sheet file with 8 extraction chunks).
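The per-job call count above reduces to a one-line formula; a sketch (the function name is illustrative):

```typescript
// Sketch: estimated Claude API calls per job, following the breakdown above.
export function estimateApiCalls(sheetCount: number, chunkCount: number): number {
  const structure = 1;      // one Opus structure-analysis call
  const scans = sheetCount; // one Sonnet scan per Sheet
  const spotCheck = 1;      // one Sonnet audit call
  return structure + scans + chunkCount + spotCheck;
}
```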


Module Inventory

| File | Purpose | ~Lines |
| --- | --- | --- |
| lib/etl.ts | Core ETL engine: Excel parsing, Sheet scanning, structure analysis, chunking, LLM extraction, validation, output writing | 1170 |
| lib/deterministic-extractor.ts | Legacy module: contains utility functions (charge code dictionary, status normalization helpers) imported by other modules; not used as an extraction mode | ~500 |
| lib/canonical-schema.ts | Canonical field definitions, type coercion (coerceFieldValue), status normalization, shared TypeScript types | ~355 |
| lib/api-utils.ts | Shared API route helpers (auth, error handling, job updates) | ~31 |
| lib/spot-check.ts | LLM-based post-extraction audit with biased sampling | 226 |
| lib/comparison.ts | Builds source-to-canonical comparison data for the review UI | 151 |
| lib/supabase.ts | Client-side Supabase singleton (anon key) | 14 |
| lib/supabase-server.ts | Server-side Supabase singleton (service role key) | 9 |
| lib/auth.ts | Server-side Bearer token authentication | 12 |
| lib/auth-client.ts | Client-side auth header builder (reads from sessionStorage) | 10 |

Key Design Decisions

LLM-Only Extraction

LLM extraction consistently produces the highest accuracy across all vendor formats, so the pipeline uses it exclusively. Data is chunked intelligently (up to 8 chunks in parallel) and processed via Claude Sonnet. This single-path design eliminates mode-selection logic and quality-gate fallbacks, simplifying the pipeline while maintaining top accuracy.
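The chunking step can be sketched with the CHUNK_ROWS and OVERLAP_ROWS values from the Constants section; exact boundary handling in lib/etl.ts may differ:

```typescript
// Sketch: horizontal chunking with overlap. Each chunk holds up to 60 rows
// and shares 5 rows with its predecessor for dedup context.
const CHUNK_ROWS = 60;
const OVERLAP_ROWS = 5;

export function chunkRows<T>(rows: T[]): T[][] {
  const chunks: T[][] = [];
  const step = CHUNK_ROWS - OVERLAP_ROWS; // each chunk starts 55 rows after the last
  for (let start = 0; start < rows.length; start += step) {
    chunks.push(rows.slice(start, start + CHUNK_ROWS));
    if (start + CHUNK_ROWS >= rows.length) break; // last chunk reached the end
  }
  return chunks;
}
```

The 5-row overlap lets the extractor see the tail of the previous chunk, so Units that straddle a boundary can be deduplicated rather than lost.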

Two-Phase API Design

Structure analysis and extraction are separate API calls:

  • /api/analyze (120s timeout) — reads the spreadsheet, classifies Sheets into Tables, maps columns to schema
  • /api/extract (300s timeout) — uses the structure map to pull Units, validate, and generate output

This separation enables the human-in-the-loop review step between phases.

Streaming Progress via Supabase Realtime

Both API phases push granular progress updates to the etl_jobs table. The frontend subscribes via Supabase Realtime and renders milestones in a scrolling timeline. Progress percentages are distributed:

  • Upload: 0–5%
  • Analysis: 8–40%
  • Extraction: 45–95%
  • Validation + Output: 95–100%
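Mapping a phase-local fraction onto these global ranges might look like the following sketch (range boundaries come from the list above; the function and phase names are illustrative):

```typescript
// Sketch: converting a phase-local completion fraction (0–1) into the
// global progress percentage written to etl_jobs.
const RANGES: Record<string, [number, number]> = {
  upload: [0, 5],
  analysis: [8, 40],
  extraction: [45, 95],
  output: [95, 100], // validation + output writing
};

export function globalPercent(phase: keyof typeof RANGES, fraction: number): number {
  const [lo, hi] = RANGES[phase];
  const f = Math.min(1, Math.max(0, fraction)); // clamp to [0, 1]
  return Math.round(lo + f * (hi - lo));
}
```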

Spot-Check Verification

After extraction, the system audits a sample of extracted Units against the source spreadsheet. It samples 8 Units using biased selection (first 5, last 5, any vacant Units, random fill) and re-extracts them independently via a separate Sonnet call. Discrepancies are scored and surfaced as a confidence rating.
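The selection pools (first 5, last 5, vacant, random fill) could be combined as in this sketch; the priority order and dedup-by-id logic are assumptions, since the doc only names the pools:

```typescript
// Sketch: biased sample selection for the spot-check audit.
// Pools are filled in priority order until SAMPLE_SIZE is reached;
// the Unit shape and "vacant" status value are illustrative.
type Unit = { id: string; status?: string };
const SAMPLE_SIZE = 8;

export function pickSpotCheckSample(units: Unit[]): Unit[] {
  const picked = new Map<string, Unit>(); // dedup by Unit id
  const take = (candidates: Unit[]) => {
    for (const u of candidates) {
      if (picked.size >= SAMPLE_SIZE) return;
      picked.set(u.id, u);
    }
  };
  take(units.slice(0, 5));                          // first 5
  take(units.slice(-5));                            // last 5
  take(units.filter((u) => u.status === "vacant")); // any vacant Units
  take([...units].sort(() => Math.random() - 0.5)); // crude random fill
  return [...picked.values()];
}
```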

Constants

```typescript
const STRUCTURE_MODEL = "claude-opus-4-6";    // Structure analysis (deep reasoning)
const EXTRACTION_MODEL = "claude-sonnet-4-6"; // Data extraction chunks
const SCAN_MODEL = "claude-sonnet-4-6";       // Sheet pre-scanning
const CHUNK_ROWS = 60;                        // Max rows per horizontal chunk
const OVERLAP_ROWS = 5;                       // Overlap between chunks (dedup context)
const PARALLEL_CHUNKS = 8;                    // Max concurrent extraction workers
const UNITS_PER_CHUNK = 18;                   // Max unit blocks per vertical chunk
```