Database
The ETL pipeline uses a single Supabase Postgres table (etl_jobs) to track every job from upload through completion. Supabase Realtime pushes row-level changes to the frontend so the UI updates without polling. File storage uses a Supabase Storage bucket.
Schema: etl_jobs
The table was built up across four migrations: the first three added columns as the pipeline evolved, and the fourth added constraints and indexes.
All Columns
| Column | Type | Default | Description |
|---|---|---|---|
| id | UUID | gen_random_uuid() | Primary key |
| created_at | TIMESTAMPTZ | now() | Row creation timestamp |
| updated_at | TIMESTAMPTZ | now() | Last update timestamp |
| status | TEXT NOT NULL | 'pending' | Current pipeline status (see status flow below) |
| progress_pct | INTEGER | 0 | Progress percentage (0-100) |
| stage_message | TEXT | NULL | Human-readable description of what the pipeline is doing right now |
| original_filename | TEXT NOT NULL | — | Name of the uploaded File (e.g. Montrose_RentRoll_Feb2025.xlsx) |
| storage_path | TEXT NOT NULL | — | Supabase Storage path for the uploaded File |
| structure_result | JSONB | NULL | AI-detected structure analysis: column mapping, header row, data range, charge orientation |
| property_name | TEXT | NULL | Detected property name (e.g. Montrose at Vintage Park) |
| report_date | TEXT | NULL | Detected report date from the File |
| sheet_count | INTEGER | NULL | Number of Sheets found in the workbook |
| unit_count | INTEGER | NULL | Total Units extracted |
| error_count | INTEGER | 0 | Number of extraction errors |
| warning_count | INTEGER | 0 | Number of extraction warnings |
| errors | JSONB | '[]' | Array of error message strings |
| warnings | JSONB | '[]' | Array of warning message strings |
| output_storage_path | TEXT | NULL | Supabase Storage path for the Output File |
| file_size_bytes | BIGINT | NULL | Size of the uploaded File in bytes |
| total_rows | INTEGER | NULL | Total data rows detected in the spreadsheet |
| total_chunks | INTEGER | NULL | Number of extraction chunks/passes |
| analyze_started_at | TIMESTAMPTZ | NULL | When the analysis phase began |
| analyze_finished_at | TIMESTAMPTZ | NULL | When the analysis phase completed |
| extract_started_at | TIMESTAMPTZ | NULL | When the extraction phase began |
| extract_finished_at | TIMESTAMPTZ | NULL | When the extraction phase completed |
| total_duration_ms | BIGINT | NULL | Total pipeline duration in milliseconds (used for ETA estimates) |
| extraction_method | TEXT | NULL | Method used: always `llm` (legacy values `deterministic`, `hybrid` may exist in older rows) |
| spot_check_confidence | INTEGER | NULL | Spot-check confidence score (0-100) |
| spot_check_discrepancies | JSONB | '[]' | Array of discrepancy objects from spot-check validation |
| structure_confirmed_at | TIMESTAMPTZ | NULL | When the user confirmed the detected structure |
| structure_overrides | JSONB | NULL | User-provided overrides to the detected column mapping |
| unit_breakdown | JSONB | NULL | Accuracy breakdown: { successful, flagged, failed, total } |
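As a condensed sketch (not the literal migration SQL, which is spread across four files), the core of the table looks like:

```sql
-- Abbreviated: a representative subset of the columns listed above.
CREATE TABLE etl_jobs (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  created_at TIMESTAMPTZ DEFAULT now(),
  updated_at TIMESTAMPTZ DEFAULT now(),
  status TEXT NOT NULL DEFAULT 'pending',
  progress_pct INTEGER DEFAULT 0,
  original_filename TEXT NOT NULL,
  storage_path TEXT NOT NULL,
  structure_result JSONB,
  errors JSONB DEFAULT '[]',
  warnings JSONB DEFAULT '[]',
  output_storage_path TEXT
  -- ...remaining columns as listed above
);
```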
Structure Result Shape
The structure_result JSONB column stores the AI analysis output:
```json
{
  "column_mapping": { "A": "unit_id", "B": "unit_status", "C": "tenant_name", ... },
  "header_row": 4,
  "data_start_row": 5,
  "data_end_row": 110,
  "charge_orientation": "horizontal",
  "multi_row_per_unit": false,
  "property_name": "Montrose at Vintage Park",
  "report_date": "2025-02-10",
  "notes": "Charges are in columns G through M"
}
```

Unit Breakdown Shape
The unit_breakdown JSONB column stores accuracy funnel data:
```json
{
  "successful": 100,
  "flagged": 5,
  "failed": 1,
  "total": 106
}
```

Status Flow
Jobs progress through these statuses. Any stage can transition to failed.
```
pending --> uploading --> analyzing --> analyzed --> extracting --> validating --> complete
   |            |             |            |             |              |
   +------------+-------------+------------+-------------+--------------+--> failed
```

| Status | Description |
|---|---|
| pending | Job row created, File not yet uploaded |
| uploading | File is being uploaded to Supabase Storage |
| analyzing | AI is reading the spreadsheet structure |
| analyzed | Structure detected, awaiting user review |
| extracting | Unit data is being extracted from the spreadsheet |
| validating | Extracted Units are being validated and spot-checked |
| complete | Pipeline finished, Output File ready for download |
| failed | Something went wrong at any stage |
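Each transition is a plain UPDATE to the job row, which Realtime then fans out to subscribers. A hypothetical transition into the extracting stage might look like (the `$1` job id and the example values are illustrative):

```sql
-- Advance a job into 'extracting'; values shown are illustrative.
UPDATE etl_jobs
SET status = 'extracting',
    progress_pct = 40,
    stage_message = 'Extracting units from sheet 1',
    extract_started_at = now()
WHERE id = $1;
```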
Realtime
The etl_jobs table is added to the Supabase Realtime publication so the frontend can subscribe to row-level changes:
```sql
ALTER PUBLICATION supabase_realtime ADD TABLE etl_jobs;
```

The `useJobProgress` hook subscribes to `postgres_changes` events filtered by `id=eq.{jobId}`, receiving every UPDATE to the job row in real time.
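To confirm the table is actually in the publication, the standard Postgres catalog view can be queried directly:

```sql
-- List tables published via supabase_realtime;
-- etl_jobs should appear in the result.
SELECT schemaname, tablename
FROM pg_publication_tables
WHERE pubname = 'supabase_realtime';
```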
Row Level Security
RLS is enabled with a permissive policy for the demo — all operations are allowed without authentication:
```sql
ALTER TABLE etl_jobs ENABLE ROW LEVEL SECURITY;
CREATE POLICY "Allow all" ON etl_jobs FOR ALL USING (true) WITH CHECK (true);
```

This means any client (authenticated or anonymous) can read, insert, update, and delete rows. In production, this policy should be replaced with proper auth checks.
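As a sketch of what tightened access could look like — assuming a hypothetical `user_id` ownership column, which the current schema does not have — a per-user policy might be:

```sql
-- Sketch only: user_id does not exist in the current schema.
ALTER TABLE etl_jobs ADD COLUMN user_id UUID REFERENCES auth.users (id);
DROP POLICY "Allow all" ON etl_jobs;
CREATE POLICY "Owner only" ON etl_jobs
  FOR ALL
  USING (auth.uid() = user_id)
  WITH CHECK (auth.uid() = user_id);
```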
Supabase Storage
File uploads and outputs are stored in the etl-files bucket with the following path conventions:
| Path Pattern | Purpose |
|---|---|
| uploads/{jobId}/(unknown) | Original uploaded File |
| outputs/{jobId}/output.xlsx | Standardized Output File |
The storage_path column stores the upload path, and output_storage_path stores the output path. The download API generates signed URLs for the Output File.
Migrations
Migration 001: 001_create_etl_jobs.sql
Creates the etl_jobs table with all core columns: identity fields (id, timestamps), status tracking (status, progress_pct, stage_message), input metadata (original_filename, storage_path), structure analysis results (structure_result, property_name, report_date, sheet_count), extraction results (unit_count, error_count, warning_count, errors, warnings), output path, and timing fields for ETA calculation. Also enables Realtime and creates the permissive RLS policy.
Migration 002: 002_add_extraction_method_and_spot_check.sql
Adds five columns:
- `extraction_method` — tracks the extraction approach used; always `llm` in the current pipeline
- `spot_check_confidence` — integer confidence score from automated spot-check validation
- `spot_check_discrepancies` — JSONB array of discrepancy objects with `unit_id`, `field`, `extracted`, `expected`, `severity`, and `explanation`
- `structure_confirmed_at` — timestamp when the user confirmed the detected structure
- `structure_overrides` — JSONB object containing user-provided column mapping overrides
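In SQL terms, migration 002 amounts to something like the following (column names and types taken from the schema table above; the exact statement in the migration file may differ):

```sql
ALTER TABLE etl_jobs
  ADD COLUMN extraction_method TEXT,
  ADD COLUMN spot_check_confidence INTEGER,
  ADD COLUMN spot_check_discrepancies JSONB DEFAULT '[]',
  ADD COLUMN structure_confirmed_at TIMESTAMPTZ,
  ADD COLUMN structure_overrides JSONB;
```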
Migration 003: 003_add_unit_breakdown.sql
Adds the unit_breakdown JSONB column to support the accuracy funnel visualization. Contains counts of successful, flagged, and failed Units out of the total.
Migration 004: 004_add_constraints_and_indexes.sql
Adds database safety constraints and performance indexes:
- CHECK constraint on `status` — restricts to 8 valid pipeline states
- CHECK constraint on `progress_pct` — enforces 0-100 range
- Auto-update trigger for the `updated_at` column
- Indexes on `status`, `created_at` (DESC), and `property_name`
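A sketch of what this migration plausibly contains (the constraint, index, function, and trigger names here are illustrative, not taken from the migration file):

```sql
-- Restrict status to the 8 pipeline states and progress_pct to 0-100.
ALTER TABLE etl_jobs
  ADD CONSTRAINT etl_jobs_status_check CHECK (status IN
    ('pending', 'uploading', 'analyzing', 'analyzed',
     'extracting', 'validating', 'complete', 'failed')),
  ADD CONSTRAINT etl_jobs_progress_pct_check
    CHECK (progress_pct BETWEEN 0 AND 100);

CREATE INDEX idx_etl_jobs_status ON etl_jobs (status);
CREATE INDEX idx_etl_jobs_created_at ON etl_jobs (created_at DESC);
CREATE INDEX idx_etl_jobs_property_name ON etl_jobs (property_name);

-- Keep updated_at current on every UPDATE.
CREATE OR REPLACE FUNCTION set_updated_at() RETURNS trigger AS $$
BEGIN
  NEW.updated_at := now();
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER etl_jobs_set_updated_at
  BEFORE UPDATE ON etl_jobs
  FOR EACH ROW EXECUTE FUNCTION set_updated_at();
```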