
Architecture

System Overview

The ETL pipeline is a Next.js application with serverless API routes handling the heavy processing. All state lives in a single Supabase table (etl_jobs), with Realtime subscriptions pushing progress updates to the frontend.

Layers

| Layer | Components | Role |
| --- | --- | --- |
| Browser | React + Mantine, useEtlPipeline (orchestrator), useJobProgress (Realtime subscription) | UI, file upload, progress display |
| Serverless | /api/upload, /api/analyze, /api/extract + core libs (etl.ts, spot-check.ts) | All processing: scanning, structure analysis, extraction, validation |
| AI | Anthropic Claude API (Opus for structure, Sonnet for extraction) | Sheet classification, column mapping, complex data extraction, spot-check audits |
| Storage | Supabase: etl_jobs table (state + progress), etl-files bucket (uploads + outputs) | Persistence, Realtime subscriptions |

Data flow

Browser → fetch() → Serverless API → Claude API → Supabase
Browser ← Realtime subscription ← Supabase

The browser sends requests to the serverless functions, which do the heavy lifting (parsing Excel, calling Claude, writing output). Progress updates are written to etl_jobs and pushed to the browser via Supabase Realtime — no polling.
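The client-side handling can be sketched as a small pure function that turns a Realtime UPDATE payload into a UI event; the row field names (progress, message) are assumptions, not confirmed names from etl_jobs:

```typescript
// Sketch: converting a Supabase Realtime UPDATE payload into a progress
// event for the timeline UI. Field names `progress` and `message` are
// illustrative assumptions about the etl_jobs row shape.
type RealtimePayload = { eventType: string; new: Record<string, unknown> };
type ProgressEvent = { percent: number; milestone: string };

export function toProgressEvent(payload: RealtimePayload): ProgressEvent | null {
  if (payload.eventType !== "UPDATE") return null; // ignore inserts/deletes
  const row = payload.new;
  return {
    percent: typeof row.progress === "number" ? row.progress : 0,
    milestone: typeof row.message === "string" ? row.message : "",
  };
}
```

A handler like this would run inside the useJobProgress subscription callback, keeping the Realtime wiring separate from the rendering logic.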


AI Models

The pipeline uses two Anthropic Claude models at different stages, chosen for the cost/capability tradeoff at each step.

| Stage | Model | Why |
| --- | --- | --- |
| Structure analysis | claude-opus-4-6 | Deep reasoning about ambiguous spreadsheet layouts: classifying Table types, resolving overlapping regions, mapping columns to a 40-field schema. This is the hardest step and benefits from Opus-level reasoning. |
| Sheet pre-scanning | claude-sonnet-4-6 | Fast classification of each Sheet: finding Table boundaries, detecting charge layouts, identifying header rows. Lower complexity, high throughput. |
| Data extraction | claude-sonnet-4-6 | Extracting Unit data from chunks. Each chunk is 60 rows max, up to 8 chunks processed in parallel. Sonnet provides the best speed/accuracy balance for structured extraction. |
| Spot-check audit | claude-sonnet-4-6 | Re-extracts a biased sample of 8 Units independently to verify extraction accuracy. |

Prompt strategy

  • System prompts define the canonical schema and expected output format (JSON)
  • User prompts include the actual spreadsheet data as formatted text grids
  • Temperature: 0 for all extraction calls (reproducible output)
  • Structured output: All extraction calls use JSON mode with explicit field-level instructions
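A minimal sketch of how these rules might come together in a request payload (Anthropic Messages API shape); the schema text, token budget, and grid formatting here are placeholders, not the pipeline's real prompts:

```typescript
// Sketch: assembling an extraction request per the prompt strategy above.
// The system prompt text and max_tokens value are illustrative assumptions.
type ExtractionRequest = {
  model: string;
  max_tokens: number;
  temperature: number;
  system: string;
  messages: { role: "user"; content: string }[];
};

export function buildExtractionRequest(grid: string[][]): ExtractionRequest {
  // User prompt carries the spreadsheet data as a formatted text grid
  const gridText = grid.map((row) => row.join("\t")).join("\n");
  return {
    model: "claude-sonnet-4-6",
    max_tokens: 8192,
    temperature: 0, // reproducible output for all extraction calls
    system: "You extract Units into the canonical JSON schema. Respond with JSON only.",
    messages: [{ role: "user", content: `Spreadsheet data:\n${gridText}` }],
  };
}
```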

Cost profile

Every file goes through the full AI pipeline; there is no non-LLM fast path. AI calls per job:

  1. Structure analysis (one Opus call)
  2. Sheet scanning (one Sonnet call per Sheet)
  3. Data extraction (up to 8 parallel Sonnet chunk calls)
  4. Spot-check (one Sonnet call to audit a sample of extracted Units)

Typical file: 1 structure call + 1 scan call per Sheet + one extraction call per chunk + 1 spot-check call (e.g., 11 calls for a single-Sheet file with 8 extraction chunks).
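The per-job call count above reduces to a one-line formula; a sketch (the function name is illustrative):

```typescript
// Sketch: estimated Claude API calls per job, following the breakdown above.
export function estimateApiCalls(sheetCount: number, chunkCount: number): number {
  const structure = 1;      // one Opus structure-analysis call
  const scans = sheetCount; // one Sonnet scan per Sheet
  const spotCheck = 1;      // one Sonnet audit call
  return structure + scans + chunkCount + spotCheck;
}
```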


Module Inventory

| File | Purpose | ~Lines |
| --- | --- | --- |
| lib/etl.ts | Core ETL engine: Excel parsing, Sheet scanning, structure analysis, chunking, LLM extraction, validation, output writing | 1170 |
| lib/deterministic-extractor.ts | Legacy module: contains utility functions (charge code dictionary, status normalization helpers) imported by other modules; not used as an extraction mode | ~500 |
| lib/canonical-schema.ts | Canonical field definitions, type coercion (coerceFieldValue), status normalization, shared TypeScript types | ~355 |
| lib/api-utils.ts | Shared API route helpers (auth, error handling, job updates) | ~31 |
| lib/spot-check.ts | LLM-based post-extraction audit with biased sampling | 226 |
| lib/comparison.ts | Builds source-to-canonical comparison data for the review UI | 151 |
| lib/supabase.ts | Client-side Supabase singleton (anon key) | 14 |
| lib/supabase-server.ts | Server-side Supabase singleton (service role key) | 9 |
| lib/auth.ts | Server-side Bearer token authentication | 12 |
| lib/auth-client.ts | Client-side auth header builder (reads from sessionStorage) | 10 |

Key Design Decisions

LLM-Only Extraction

LLM extraction consistently produces the highest accuracy across all vendor formats, so the pipeline uses it exclusively. Data is chunked intelligently (up to 8 chunks in parallel) and processed via Claude Sonnet. This single-path design eliminates mode-selection logic and quality-gate fallbacks, simplifying the pipeline while maintaining top accuracy.
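The chunking step can be sketched with the CHUNK_ROWS and OVERLAP_ROWS values from the Constants section; exact boundary handling in lib/etl.ts may differ:

```typescript
// Sketch: horizontal chunking with overlap. Each chunk holds up to 60 rows
// and shares 5 rows with its predecessor for dedup context.
const CHUNK_ROWS = 60;
const OVERLAP_ROWS = 5;

export function chunkRows<T>(rows: T[]): T[][] {
  const chunks: T[][] = [];
  const step = CHUNK_ROWS - OVERLAP_ROWS; // each chunk starts 55 rows after the last
  for (let start = 0; start < rows.length; start += step) {
    chunks.push(rows.slice(start, start + CHUNK_ROWS));
    if (start + CHUNK_ROWS >= rows.length) break; // last chunk reached the end
  }
  return chunks;
}
```

The 5-row overlap lets the extractor see the tail of the previous chunk, so Units that straddle a boundary can be deduplicated rather than lost.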

Two-Phase API Design

Structure analysis and extraction are separate API calls:

  • /api/analyze (120s timeout) — reads the spreadsheet, classifies Sheets into Tables, maps columns to schema
  • /api/extract (300s timeout) — uses the structure map to pull Units, validate, and generate output

This separation enables the human-in-the-loop review step between phases.

Streaming Progress via Supabase Realtime

Both API phases push granular progress updates to the etl_jobs table. The frontend subscribes via Supabase Realtime and renders milestones in a scrolling timeline. Progress percentages are distributed:

  • Upload: 0–5%
  • Analysis: 8–40%
  • Extraction: 45–95%
  • Validation + Output: 95–100%
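Mapping a phase-local fraction onto these global ranges might look like the following sketch (range boundaries come from the list above; the function and phase names are illustrative):

```typescript
// Sketch: converting a phase-local completion fraction (0–1) into the
// global progress percentage written to etl_jobs.
const RANGES: Record<string, [number, number]> = {
  upload: [0, 5],
  analysis: [8, 40],
  extraction: [45, 95],
  output: [95, 100], // validation + output writing
};

export function globalPercent(phase: keyof typeof RANGES, fraction: number): number {
  const [lo, hi] = RANGES[phase];
  const f = Math.min(1, Math.max(0, fraction)); // clamp to [0, 1]
  return Math.round(lo + f * (hi - lo));
}
```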

Spot-Check Verification

After extraction, the system audits a sample of extracted Units against the source spreadsheet. It samples 8 Units using biased selection (first 5, last 5, any vacant Units, random fill) and re-extracts them independently via a separate Sonnet call. Discrepancies are scored and surfaced as a confidence rating.
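The selection pools (first 5, last 5, vacant, random fill) could be combined as in this sketch; the priority order and dedup-by-id logic are assumptions, since the doc only names the pools:

```typescript
// Sketch: biased sample selection for the spot-check audit.
// Pools are filled in priority order until SAMPLE_SIZE is reached;
// the Unit shape and "vacant" status value are illustrative.
type Unit = { id: string; status?: string };
const SAMPLE_SIZE = 8;

export function pickSpotCheckSample(units: Unit[]): Unit[] {
  const picked = new Map<string, Unit>(); // dedup by Unit id
  const take = (candidates: Unit[]) => {
    for (const u of candidates) {
      if (picked.size >= SAMPLE_SIZE) return;
      picked.set(u.id, u);
    }
  };
  take(units.slice(0, 5));                          // first 5
  take(units.slice(-5));                            // last 5
  take(units.filter((u) => u.status === "vacant")); // any vacant Units
  take([...units].sort(() => Math.random() - 0.5)); // crude random fill
  return [...picked.values()];
}
```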

Constants

```typescript
const STRUCTURE_MODEL = "claude-opus-4-6";    // Structure analysis (deep reasoning)
const EXTRACTION_MODEL = "claude-sonnet-4-6"; // Data extraction chunks
const SCAN_MODEL = "claude-sonnet-4-6";       // Sheet pre-scanning
const CHUNK_ROWS = 60;                        // Max rows per horizontal chunk
const OVERLAP_ROWS = 5;                       // Overlap between chunks (dedup context)
const PARALLEL_CHUNKS = 8;                    // Max concurrent extraction workers
const UNITS_PER_CHUNK = 18;                   // Max unit blocks per vertical chunk
```