Database
The ETL pipeline uses a single Supabase Postgres table (etl_jobs) to track every job from upload through completion. Supabase Realtime pushes row-level changes to the frontend so the UI updates without polling. File storage uses a Supabase Storage bucket.
Schema: etl_jobs
The table was built up across four migrations: the first three added columns as the pipeline evolved, and the fourth added constraints and indexes.
All Columns
| Column | Type | Default | Description |
|---|---|---|---|
| id | UUID | gen_random_uuid() | Primary key |
| created_at | TIMESTAMPTZ | now() | Row creation timestamp |
| updated_at | TIMESTAMPTZ | now() | Last update timestamp |
| status | TEXT NOT NULL | 'pending' | Current pipeline status (see status flow below) |
| progress_pct | INTEGER | 0 | Progress percentage (0-100) |
| stage_message | TEXT | NULL | Human-readable description of what the pipeline is doing right now |
| original_filename | TEXT NOT NULL | — | Name of the uploaded File (e.g. Montrose_RentRoll_Feb2025.xlsx) |
| storage_path | TEXT NOT NULL | — | Supabase Storage path for the uploaded File |
| structure_result | JSONB | NULL | AI-detected structure analysis: column mapping, header row, data range, charge orientation |
| property_name | TEXT | NULL | Detected property name (e.g. Montrose at Vintage Park) |
| report_date | TEXT | NULL | Detected report date from the File |
| sheet_count | INTEGER | NULL | Number of Sheets found in the workbook |
| unit_count | INTEGER | NULL | Total Units extracted |
| error_count | INTEGER | 0 | Number of extraction errors |
| warning_count | INTEGER | 0 | Number of extraction warnings |
| errors | JSONB | '[]' | Array of error message strings |
| warnings | JSONB | '[]' | Array of warning message strings |
| output_storage_path | TEXT | NULL | Supabase Storage path for the Output File |
| file_size_bytes | BIGINT | NULL | Size of the uploaded File in bytes |
| total_rows | INTEGER | NULL | Total data rows detected in the spreadsheet |
| total_chunks | INTEGER | NULL | Number of extraction chunks/passes |
| analyze_started_at | TIMESTAMPTZ | NULL | When the analysis phase began |
| analyze_finished_at | TIMESTAMPTZ | NULL | When the analysis phase completed |
| extract_started_at | TIMESTAMPTZ | NULL | When the extraction phase began |
| extract_finished_at | TIMESTAMPTZ | NULL | When the extraction phase completed |
| total_duration_ms | BIGINT | NULL | Total pipeline duration in milliseconds (used for ETA estimates) |
| extraction_method | TEXT | NULL | Method used: always `llm` (legacy values `deterministic`, `hybrid` may exist in older rows) |
| spot_check_confidence | INTEGER | NULL | Spot-check confidence score (0-100) |
| spot_check_discrepancies | JSONB | '[]' | Array of discrepancy objects from spot-check validation |
| structure_confirmed_at | TIMESTAMPTZ | NULL | When the user confirmed the detected structure |
| structure_overrides | JSONB | NULL | User-provided overrides to the detected column mapping |
| unit_breakdown | JSONB | NULL | Accuracy breakdown: { successful, flagged, failed, total } |
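As a condensed sketch (not the literal migration SQL, which is spread across four files), the core of the table looks like:

```sql
-- Abbreviated: a representative subset of the columns listed above.
CREATE TABLE etl_jobs (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  created_at TIMESTAMPTZ DEFAULT now(),
  updated_at TIMESTAMPTZ DEFAULT now(),
  status TEXT NOT NULL DEFAULT 'pending',
  progress_pct INTEGER DEFAULT 0,
  original_filename TEXT NOT NULL,
  storage_path TEXT NOT NULL,
  structure_result JSONB,
  errors JSONB DEFAULT '[]',
  warnings JSONB DEFAULT '[]',
  output_storage_path TEXT
  -- ...remaining columns as listed above
);
```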
Structure Result Shape
The structure_result JSONB column stores the AI analysis output:
```json
{
  "column_mapping": { "A": "unit_id", "B": "unit_status", "C": "tenant_name", ... },
  "header_row": 4,
  "data_start_row": 5,
  "data_end_row": 110,
  "charge_orientation": "horizontal",
  "multi_row_per_unit": false,
  "property_name": "Montrose at Vintage Park",
  "report_date": "2025-02-10",
  "notes": "Charges are in columns G through M"
}
```

Unit Breakdown Shape
The unit_breakdown JSONB column stores accuracy funnel data:
```json
{
  "successful": 100,
  "flagged": 5,
  "failed": 1,
  "total": 106
}
```

Status Flow
Jobs progress through these statuses. Any stage can transition to failed.
```
pending --> uploading --> analyzing --> analyzed --> extracting --> validating --> complete
   |            |             |            |             |              |
   +------------+-------------+------------+-------------+--------------+--> failed
```

| Status | Description |
|---|---|
| pending | Job row created, File not yet uploaded |
| uploading | File is being uploaded to Supabase Storage |
| analyzing | AI is reading the spreadsheet structure |
| analyzed | Structure detected, awaiting user review |
| extracting | Unit data is being extracted from the spreadsheet |
| validating | Extracted Units are being validated and spot-checked |
| complete | Pipeline finished, Output File ready for download |
| failed | Something went wrong at any stage |
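Each transition is a plain UPDATE to the job row, which Realtime then fans out to subscribers. A hypothetical transition into the extracting stage might look like (the `$1` job id and the example values are illustrative):

```sql
-- Advance a job into 'extracting'; values shown are illustrative.
UPDATE etl_jobs
SET status = 'extracting',
    progress_pct = 40,
    stage_message = 'Extracting units from sheet 1',
    extract_started_at = now()
WHERE id = $1;
```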
Realtime
The etl_jobs table is added to the Supabase Realtime publication so the frontend can subscribe to row-level changes:
```sql
ALTER PUBLICATION supabase_realtime ADD TABLE etl_jobs;
```

The `useJobProgress` hook subscribes to `postgres_changes` events filtered by `id=eq.{jobId}`, receiving every UPDATE to the job row in real time.
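To confirm the table is actually in the publication, the standard Postgres catalog view can be queried directly:

```sql
-- List tables published via supabase_realtime;
-- etl_jobs should appear in the result.
SELECT schemaname, tablename
FROM pg_publication_tables
WHERE pubname = 'supabase_realtime';
```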
Row Level Security
RLS is enabled with a permissive policy for the demo — all operations are allowed without authentication:
```sql
ALTER TABLE etl_jobs ENABLE ROW LEVEL SECURITY;
CREATE POLICY "Allow all" ON etl_jobs FOR ALL USING (true) WITH CHECK (true);
```

This means any client (authenticated or anonymous) can read, insert, update, and delete rows. In production, this policy should be replaced with proper auth checks.
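As a sketch of what tightened access could look like — assuming a hypothetical `user_id` ownership column, which the current schema does not have — a per-user policy might be:

```sql
-- Sketch only: user_id does not exist in the current schema.
ALTER TABLE etl_jobs ADD COLUMN user_id UUID REFERENCES auth.users (id);
DROP POLICY "Allow all" ON etl_jobs;
CREATE POLICY "Owner only" ON etl_jobs
  FOR ALL
  USING (auth.uid() = user_id)
  WITH CHECK (auth.uid() = user_id);
```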
Supabase Storage
File uploads and outputs are stored in the etl-files bucket with the following path conventions:
| Path Pattern | Purpose |
|---|---|
| uploads/{jobId}/(unknown) | Original uploaded File |
| outputs/{jobId}/output.xlsx | Standardized Output File |
The storage_path column stores the upload path, and output_storage_path stores the output path. The download API generates signed URLs for the Output File.
Migrations
Migration 001: 001_create_etl_jobs.sql
Creates the etl_jobs table with all core columns: identity fields (id, timestamps), status tracking (status, progress_pct, stage_message), input metadata (original_filename, storage_path), structure analysis results (structure_result, property_name, report_date, sheet_count), extraction results (unit_count, error_count, warning_count, errors, warnings), output path, and timing fields for ETA calculation. Also enables Realtime and creates the permissive RLS policy.
Migration 002: 002_add_extraction_method_and_spot_check.sql
Adds five columns:
- `extraction_method` — tracks the extraction approach used; always `llm` in the current pipeline
- `spot_check_confidence` — integer confidence score from automated spot-check validation
- `spot_check_discrepancies` — JSONB array of discrepancy objects with `unit_id`, `field`, `extracted`, `expected`, `severity`, and `explanation`
- `structure_confirmed_at` — timestamp when the user confirmed the detected structure
- `structure_overrides` — JSONB object containing user-provided column mapping overrides
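In SQL terms, migration 002 amounts to something like the following (column names and types taken from the schema table above; the exact statement in the migration file may differ):

```sql
ALTER TABLE etl_jobs
  ADD COLUMN extraction_method TEXT,
  ADD COLUMN spot_check_confidence INTEGER,
  ADD COLUMN spot_check_discrepancies JSONB DEFAULT '[]',
  ADD COLUMN structure_confirmed_at TIMESTAMPTZ,
  ADD COLUMN structure_overrides JSONB;
```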
Migration 003: 003_add_unit_breakdown.sql
Adds the unit_breakdown JSONB column to support the accuracy funnel visualization. Contains counts of successful, flagged, and failed Units out of the total.
Migration 004: 004_add_constraints_and_indexes.sql
Adds database safety constraints and performance indexes:
- CHECK constraint on `status` — restricts to 8 valid pipeline states
- CHECK constraint on `progress_pct` — enforces 0-100 range
- Auto-update trigger for the `updated_at` column
- Indexes on `status`, `created_at` (DESC), and `property_name`
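A sketch of what this migration plausibly contains (the constraint, index, function, and trigger names here are illustrative, not taken from the migration file):

```sql
-- Restrict status to the 8 pipeline states and progress_pct to 0-100.
ALTER TABLE etl_jobs
  ADD CONSTRAINT etl_jobs_status_check CHECK (status IN
    ('pending', 'uploading', 'analyzing', 'analyzed',
     'extracting', 'validating', 'complete', 'failed')),
  ADD CONSTRAINT etl_jobs_progress_pct_check
    CHECK (progress_pct BETWEEN 0 AND 100);

CREATE INDEX idx_etl_jobs_status ON etl_jobs (status);
CREATE INDEX idx_etl_jobs_created_at ON etl_jobs (created_at DESC);
CREATE INDEX idx_etl_jobs_property_name ON etl_jobs (property_name);

-- Keep updated_at current on every UPDATE.
CREATE OR REPLACE FUNCTION set_updated_at() RETURNS trigger AS $$
BEGIN
  NEW.updated_at := now();
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER etl_jobs_set_updated_at
  BEFORE UPDATE ON etl_jobs
  FOR EACH ROW EXECUTE FUNCTION set_updated_at();
```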