Database

The ETL pipeline uses a single Supabase Postgres table (etl_jobs) to track every job from upload through completion. Supabase Realtime pushes row-level changes to the frontend so the UI updates without polling. File storage uses a Supabase Storage bucket.

Schema: etl_jobs

The table has evolved across four migrations: the first three added columns as the pipeline grew, and the fourth added constraints and indexes.

All Columns

| Column | Type | Default | Description |
| --- | --- | --- | --- |
| id | UUID | gen_random_uuid() | Primary key |
| created_at | TIMESTAMPTZ | now() | Row creation timestamp |
| updated_at | TIMESTAMPTZ | now() | Last update timestamp |
| status | TEXT NOT NULL | 'pending' | Current pipeline status (see status flow below) |
| progress_pct | INTEGER | 0 | Progress percentage (0-100) |
| stage_message | TEXT | NULL | Human-readable description of what the pipeline is doing right now |
| original_filename | TEXT NOT NULL | (none) | Name of the uploaded File (e.g. Montrose_RentRoll_Feb2025.xlsx) |
| storage_path | TEXT NOT NULL | (none) | Supabase Storage path for the uploaded File |
| structure_result | JSONB | NULL | AI-detected structure analysis: column mapping, header row, data range, charge orientation |
| property_name | TEXT | NULL | Detected property name (e.g. Montrose at Vintage Park) |
| report_date | TEXT | NULL | Detected report date from the File |
| sheet_count | INTEGER | NULL | Number of Sheets found in the workbook |
| unit_count | INTEGER | NULL | Total Units extracted |
| error_count | INTEGER | 0 | Number of extraction errors |
| warning_count | INTEGER | 0 | Number of extraction warnings |
| errors | JSONB | '[]' | Array of error message strings |
| warnings | JSONB | '[]' | Array of warning message strings |
| output_storage_path | TEXT | NULL | Supabase Storage path for the Output File |
| file_size_bytes | BIGINT | NULL | Size of the uploaded File in bytes |
| total_rows | INTEGER | NULL | Total data rows detected in the spreadsheet |
| total_chunks | INTEGER | NULL | Number of extraction chunks/passes |
| analyze_started_at | TIMESTAMPTZ | NULL | When the analysis phase began |
| analyze_finished_at | TIMESTAMPTZ | NULL | When the analysis phase completed |
| extract_started_at | TIMESTAMPTZ | NULL | When the extraction phase began |
| extract_finished_at | TIMESTAMPTZ | NULL | When the extraction phase completed |
| total_duration_ms | BIGINT | NULL | Total pipeline duration in milliseconds (used for ETA estimates) |
| extraction_method | TEXT | NULL | Method used: always llm (legacy values deterministic, hybrid may exist in older rows) |
| spot_check_confidence | INTEGER | NULL | Spot-check confidence score (0-100) |
| spot_check_discrepancies | JSONB | '[]' | Array of discrepancy objects from spot-check validation |
| structure_confirmed_at | TIMESTAMPTZ | NULL | When the user confirmed the detected structure |
| structure_overrides | JSONB | NULL | User-provided overrides to the detected column mapping |
| unit_breakdown | JSONB | NULL | Accuracy breakdown: { successful, flagged, failed, total } |
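The total_duration_ms column feeds ETA estimates for in-flight jobs. A minimal sketch of how that could work, assuming durations from past completed jobs are scaled by the remaining progress fraction (estimateRemainingMs is a hypothetical helper, not part of the pipeline code):

```typescript
/**
 * Estimate time remaining for a running job, given total_duration_ms
 * values from previously completed jobs and the current progress_pct.
 * Returns null when there is no history to estimate from.
 */
function estimateRemainingMs(
  pastDurationsMs: number[],
  progressPct: number
): number | null {
  if (pastDurationsMs.length === 0) return null; // no completed jobs yet
  const sorted = [...pastDurationsMs].sort((a, b) => a - b);
  const median = sorted[Math.floor(sorted.length / 2)]; // robust to outlier jobs
  const remainingFraction = Math.max(0, 100 - progressPct) / 100;
  return Math.round(median * remainingFraction);
}
```

Using the median rather than the mean keeps one unusually large workbook from skewing every subsequent estimate.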

Structure Result Shape

The structure_result JSONB column stores the AI analysis output:

{
  "column_mapping": { "A": "unit_id", "B": "unit_status", "C": "tenant_name", ... },
  "header_row": 4,
  "data_start_row": 5,
  "data_end_row": 110,
  "charge_orientation": "horizontal",
  "multi_row_per_unit": false,
  "property_name": "Montrose at Vintage Park",
  "report_date": "2025-02-10",
  "notes": "Charges are in columns G through M"
}
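The same shape can be mirrored on the frontend as a TypeScript interface. This is an illustrative sketch; the interface name and helper are assumptions, and charge_orientation is typed loosely since only "horizontal" appears in the example above:

```typescript
// Mirrors the structure_result JSON stored on the job row.
interface StructureResult {
  column_mapping: Record<string, string>; // spreadsheet column letter -> canonical field
  header_row: number;
  data_start_row: number;
  data_end_row: number;
  charge_orientation: string;
  multi_row_per_unit: boolean;
  property_name: string | null;
  report_date: string | null;
  notes?: string;
}

/** Number of data rows the detected range covers (bounds are inclusive). */
function dataRowCount(s: StructureResult): number {
  return s.data_end_row - s.data_start_row + 1;
}
```

For the example above, rows 5 through 110 inclusive give 106 data rows, matching the total in the unit breakdown below.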

Unit Breakdown Shape

The unit_breakdown JSONB column stores accuracy funnel data:

{ "successful": 100, "flagged": 5, "failed": 1, "total": 106 }
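A small sketch of how the funnel data could be summarized for display (the interface and helper names are assumptions, not pipeline code):

```typescript
// Mirrors the unit_breakdown JSON on the job row.
interface UnitBreakdown {
  successful: number;
  flagged: number;
  failed: number;
  total: number;
}

/** Fraction of Units extracted cleanly, as a rounded 0-100 percentage. */
function successRatePct(b: UnitBreakdown): number {
  return b.total === 0 ? 0 : Math.round((b.successful / b.total) * 100);
}
```

For the example above, 100 of 106 Units succeed, which rounds to 94%.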

Status Flow

Jobs progress through these statuses. Any stage can transition to failed.

pending --> uploading --> analyzing --> analyzed --> extracting --> validating --> complete
   |            |             |            |             |              |
   +------------+-------------+------------+-------------+--------------+--> failed
| Status | Description |
| --- | --- |
| pending | Job row created, File not yet uploaded |
| uploading | File is being uploaded to Supabase Storage |
| analyzing | AI is reading the spreadsheet structure |
| analyzed | Structure detected, awaiting user review |
| extracting | Unit data is being extracted from the spreadsheet |
| validating | Extracted Units are being validated and spot-checked |
| complete | Pipeline finished, Output File ready for download |
| failed | Something went wrong at any stage |
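The linear flow plus the universal failed transition can be encoded as a lookup table. A sketch of how a client or server guard might validate transitions (canTransition is hypothetical; the real pipeline may enforce this differently):

```typescript
// Allowed next statuses for each pipeline status. Terminal states allow none.
const NEXT_STATUS: Record<string, string[]> = {
  pending: ["uploading", "failed"],
  uploading: ["analyzing", "failed"],
  analyzing: ["analyzed", "failed"],
  analyzed: ["extracting", "failed"],
  extracting: ["validating", "failed"],
  validating: ["complete", "failed"],
  complete: [],
  failed: [],
};

/** True when `to` is a legal successor of `from` in the status flow. */
function canTransition(from: string, to: string): boolean {
  return (NEXT_STATUS[from] ?? []).includes(to);
}
```

This mirrors the 8-state CHECK constraint added in migration 004, but at the application layer.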

Realtime

The etl_jobs table is added to the Supabase Realtime publication so the frontend can subscribe to row-level changes:

ALTER PUBLICATION supabase_realtime ADD TABLE etl_jobs;

The useJobProgress hook subscribes to postgres_changes events filtered by id=eq.{jobId}, receiving every UPDATE to the job row in real time.
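A minimal sketch of the parameters such a subscription would pass to the supabase-js v2 channel API (assuming `supabase.channel(...).on("postgres_changes", params, cb)`; jobFilter and subscriptionParams are hypothetical helpers shown without the client setup):

```typescript
// Filter string for postgres_changes, scoping events to a single job row.
function jobFilter(jobId: string): string {
  return `id=eq.${jobId}`;
}

// Parameter object the hook would hand to channel.on("postgres_changes", ...).
function subscriptionParams(jobId: string) {
  return {
    event: "UPDATE",      // only row updates matter after the job row is created
    schema: "public",
    table: "etl_jobs",
    filter: jobFilter(jobId),
  };
}
```

Each UPDATE payload carries the new row, so the frontend can render status, progress_pct, and stage_message directly from the event without refetching.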

Row Level Security

RLS is enabled with a permissive policy for the demo — all operations are allowed without authentication:

ALTER TABLE etl_jobs ENABLE ROW LEVEL SECURITY;

CREATE POLICY "Allow all" ON etl_jobs
  FOR ALL
  USING (true)
  WITH CHECK (true);

This means any client (authenticated or anonymous) can read, insert, update, and delete rows. In production, this policy should be replaced with proper auth checks.

Supabase Storage

File uploads and outputs are stored in the etl-files bucket with the following path conventions:

| Path Pattern | Purpose |
| --- | --- |
| uploads/{jobId}/(unknown) | Original uploaded File |
| outputs/{jobId}/output.xlsx | Standardized Output File |

The storage_path column stores the upload path, and output_storage_path stores the output path. The download API generates signed URLs for the Output File.
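The output convention is fully determined by the job id, so the path can be derived rather than stored in more than one place. A tiny sketch (outputPath is a hypothetical helper mirroring the pattern above):

```typescript
// Builds the Storage key for a job's standardized Output File,
// following the outputs/{jobId}/output.xlsx convention.
function outputPath(jobId: string): string {
  return `outputs/${jobId}/output.xlsx`;
}
```

The upload path is not derivable the same way, since its final segment depends on the uploaded File; that is why storage_path is persisted on the row.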

Migrations

Migration 001: 001_create_etl_jobs.sql

Creates the etl_jobs table with all core columns: identity fields (id, timestamps), status tracking (status, progress_pct, stage_message), input metadata (original_filename, storage_path), structure analysis results (structure_result, property_name, report_date, sheet_count), extraction results (unit_count, error_count, warning_count, errors, warnings), output path, and timing fields for ETA calculation. Also enables Realtime and creates the permissive RLS policy.

Migration 002: 002_add_extraction_method_and_spot_check.sql

Adds five columns:

  • extraction_method — tracks the extraction approach used; always llm in the current pipeline
  • spot_check_confidence — integer confidence score from automated spot-check validation
  • spot_check_discrepancies — JSONB array of discrepancy objects with unit_id, field, extracted, expected, severity, and explanation
  • structure_confirmed_at — timestamp when the user confirmed the detected structure
  • structure_overrides — JSONB object containing user-provided column mapping overrides
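The discrepancy fields listed for spot_check_discrepancies can be expressed as a TypeScript shape. This is an illustrative sketch: the interface name and helper are assumptions, severity is typed loosely because its exact values are not documented here, and extracted/expected are assumed nullable:

```typescript
// One entry in the spot_check_discrepancies JSONB array.
interface SpotCheckDiscrepancy {
  unit_id: string;
  field: string;                // which extracted field disagreed
  extracted: string | null;     // value the pipeline produced
  expected: string | null;      // value the spot-check read from the File
  severity: string;             // exact severity levels not documented here
  explanation: string;
}

/** Count discrepancies at a given severity, e.g. to gate the confidence score. */
function countBySeverity(ds: SpotCheckDiscrepancy[], severity: string): number {
  return ds.filter((d) => d.severity === severity).length;
}
```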

Migration 003: 003_add_unit_breakdown.sql

Adds the unit_breakdown JSONB column to support the accuracy funnel visualization. Contains counts of successful, flagged, and failed Units out of the total.

Migration 004: 004_add_constraints_and_indexes.sql

Adds database safety constraints and performance indexes:

  • CHECK constraint on status — restricts to 8 valid pipeline states
  • CHECK constraint on progress_pct — enforces 0-100 range
  • Auto-update trigger for updated_at column
  • Indexes on status, created_at (DESC), and property_name