
POST /api/extract

The heavy lifter. Downloads the uploaded File, parses it, extracts every Unit row using the confirmed structure, validates the output, and writes a standardized 2-Sheet XLSX Output File to storage.

Request

{ "job_id": "uuid-string" }

Headers: Authorization: Bearer <token>

Max duration: 300 seconds (5 minutes — Vercel serverless timeout)
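The request above can be built in a few lines. This is an illustrative client-side helper, not part of the API itself; the function name and `token` parameter are placeholders.

```typescript
// Builds the POST /api/extract request described above.
// `token` stands in for whatever bearer token your auth flow provides.
function buildExtractRequest(jobId: string, token: string) {
  return {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${token}`,
    },
    body: JSON.stringify({ job_id: jobId }),
  };
}

// Usage: fetch("/api/extract", buildExtractRequest(jobId, token))
```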

Prerequisites

The job must have a structure_result (i.e., /api/analyze must have completed). If the structure has not been analyzed, the endpoint returns a 400 error prompting you to run /api/analyze first.
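A minimal sketch of this prerequisite check, assuming the job row exposes a nullable `structure_result` field; the function and interface names are illustrative, not the actual route code.

```typescript
// Hypothetical guard mirroring the prerequisite described above:
// a job with no structure_result is rejected with a 400.
interface JobRow {
  structure_result: unknown | null;
}

function checkPrerequisites(
  job: JobRow,
): { ok: true } | { ok: false; status: number; error: string } {
  if (!job.structure_result) {
    return { ok: false, status: 400, error: "Run /api/analyze first" };
  }
  return { ok: true };
}
```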

Processing Steps

  1. Authenticate via checkAuth() and fetch the full job row
  2. Record start — set extract_started_at, status to extracting, progress to 45%
  3. Download File from Supabase Storage using the job’s storage_path
  4. Parse Excel into CellGrid objects (one per Sheet)
  5. Extract data — data is chunked and extracted in parallel via Claude Sonnet (LLM extraction). Each chunk is up to 60 rows, with up to 8 chunks processed concurrently.
  6. Spot-check — a sample of extracted Units is audited against the source spreadsheet via an independent LLM call. Produces a confidence score (0–100) and a list of discrepancies.
  7. Validate — run validateRows() across all extracted rows, checking for missing required fields, out-of-range values, and row-count expectations derived from data_start_row / data_end_row. Any chunk extraction failures from step 5 are surfaced as warnings in this validation report rather than being silently dropped.
  8. Write output — generate a 2-Sheet XLSX (Rent Roll + Validation) via writeOutputBuffer()
  9. Upload output to Supabase Storage at outputs/{job_id}/output.xlsx
  10. Mark complete — compute total duration, store all results, set status to complete
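Step 5's chunking can be sketched as follows. This is an assumption-laden illustration of "chunks of up to 60 rows, up to 8 in flight", not the route's actual implementation; all names here are invented.

```typescript
// Illustrative chunking and bounded-concurrency helpers for step 5.
const CHUNK_SIZE = 60;       // max rows per extraction chunk (per the docs)
const MAX_CONCURRENCY = 8;   // max chunks processed at once (per the docs)

function chunkRows<T>(rows: T[], size = CHUNK_SIZE): T[][] {
  const chunks: T[][] = [];
  for (let i = 0; i < rows.length; i += size) {
    chunks.push(rows.slice(i, i + size));
  }
  return chunks;
}

// Runs fn over items with at most `limit` promises in flight,
// preserving result order by index.
async function mapConcurrent<T, R>(
  items: T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i], i);
    }
  }
  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    worker,
  );
  await Promise.all(workers);
  return results;
}
```

In the real route each `fn` call would be an LLM extraction request for one chunk; the worker-pool pattern keeps at most eight such requests outstanding.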

Progress Updates

| % | Stage Message |
|---|---|
| 45 | Loading the File… |
| 50–90 | Extraction progress (updated per chunk) |
| 91 | Spot-checking a sample of Units |
| 92 | Checking N Units look right… |
| 95 | Building your clean File… |
| 100 | All done! |

During step 5, the extraction callback fires incremental progress updates between 50% and 90%. If the extraction message contains a section/chunk count (e.g., “Processing 4 sections”), the total_chunks field is also written to the job row.
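One plausible mapping of per-chunk progress onto the 50–90% band, shown for illustration only; the function name and rounding behavior are assumptions.

```typescript
// Maps "chunksDone of totalChunks" onto the 50–90% band described above.
function extractionProgress(chunksDone: number, totalChunks: number): number {
  if (totalChunks <= 0) return 50; // nothing to extract yet
  return Math.round(50 + (chunksDone / totalChunks) * 40);
}
```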

Job Fields Updated

| Field | Type | Description |
|---|---|---|
| status | string | extracting during processing, complete on success, failed on error |
| progress_pct | number | 45 through 100 |
| stage_message | string | Human-readable progress label |
| extract_started_at | timestamp | When extraction began |
| extract_finished_at | timestamp | When extraction completed |
| total_duration_ms | number | Sum of analyze duration + extract duration (excludes the review/confirmation pause between the two steps) |
| unit_count | number | Total extracted Units |
| error_count | number | Validation errors |
| warning_count | number | Validation warnings |
| errors | string[] | List of validation error messages |
| warnings | string[] | List of validation warning messages |
| output_storage_path | string | Storage path to the Output File |
| total_chunks | number | Number of extraction sections (if reported) |

Enriched structure_result Fields

At completion, the following metadata fields are merged into the job’s structure_result JSON object (not stored as top-level job columns):

| Key | Type | Description |
|---|---|---|
| _extraction_method | string | Always "llm" |
| _spot_check_confidence | number \| null | 0–100 confidence score from spot-check audit |
| _spot_check_discrepancies | object[] | Discrepancies found during spot-check |
| _unit_breakdown | object | Per-unit-type counts from validation |
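The merge can be sketched as a shallow spread that layers the four metadata keys onto the existing `structure_result` object. The interfaces and function name below are illustrative assumptions.

```typescript
// Merges the enrichment metadata into structure_result, as described above.
// Existing keys are preserved; the underscore-prefixed keys are added.
interface SpotCheckOutcome {
  confidence: number | null;
  discrepancies: object[];
}

function enrichStructureResult(
  structureResult: Record<string, unknown>,
  spotCheck: SpotCheckOutcome,
  unitBreakdown: Record<string, number>,
): Record<string, unknown> {
  return {
    ...structureResult,
    _extraction_method: "llm",
    _spot_check_confidence: spotCheck.confidence,
    _spot_check_discrepancies: spotCheck.discrepancies,
    _unit_breakdown: unitBreakdown,
  };
}
```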

Response

Success (200):

{ "job_id": "uuid-string", "unit_count": 312, "error_count": 0, "warning_count": 2, "extraction_method": "llm" }

Error Sanitization

Raw error messages are sanitized before returning to the client:

| Condition | User-facing Message |
|---|---|
| Credit balance issues | AI service temporarily unavailable. Please try again later. |
| Rate limits | Too many requests. Please wait a moment and try again. |
| Overloaded | AI service is busy. Please try again in a minute. |
| Other | Something went wrong. Please try again. |

On failure, the job status is set to failed with the sanitized message.
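A hypothetical sanitizer implementing the mapping above. The substrings matched here ("credit", "rate limit", "overloaded") are guesses at what the raw provider errors contain, not confirmed behavior.

```typescript
// Maps raw provider error text to the user-facing messages in the table above.
// The matched substrings are assumptions about typical provider error strings.
function sanitizeError(raw: string): string {
  const msg = raw.toLowerCase();
  if (msg.includes("credit")) {
    return "AI service temporarily unavailable. Please try again later.";
  }
  if (msg.includes("rate limit") || msg.includes("429")) {
    return "Too many requests. Please wait a moment and try again.";
  }
  if (msg.includes("overloaded")) {
    return "AI service is busy. Please try again in a minute.";
  }
  return "Something went wrong. Please try again.";
}
```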

Error (400):

{ "error": "Run /api/analyze first" }

Error (500):

{ "error": "Something went wrong. Please try again." }

Abort Handling

If the client aborts the request (e.g., the user navigates away), the route catches the AbortError and returns a 499 response:

{ "error": "Cancelled" }

Unlike /api/analyze, the abort handler here does not update the job status — the job remains in extracting state. A subsequent call to /api/extract will re-run extraction from scratch.
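The abort path can be sketched as a catch-all error handler that special-cases `AbortError`. This is an illustration of the behavior described above; the function name and return shape are assumptions.

```typescript
// Maps a caught route error to an HTTP status and body:
// AbortError -> 499 "Cancelled", anything else -> sanitized 500.
function handleRouteError(err: unknown): {
  status: number;
  body: { error: string };
} {
  if (err instanceof Error && err.name === "AbortError") {
    // Note: per the docs, the job row is NOT updated here —
    // it stays in the "extracting" state.
    return { status: 499, body: { error: "Cancelled" } };
  }
  return {
    status: 500,
    body: { error: "Something went wrong. Please try again." },
  };
}
```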

Spot-Check Details

After extraction, a random sample of extracted Units is audited against the source spreadsheet via an independent LLM call. The result includes:

  • spot_check_confidence — a score from 0 to 100 indicating how well the extraction matches an independent LLM re-reading of the same source cells. Scores below 50 automatically generate a warning. On LLM failure, confidence defaults to 50 (not 100), with a discrepancy entry indicating the audit was inconclusive.
  • spot_check_discrepancies — an array of objects describing each field where the extraction and the independent audit disagree.
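The fallback and warning rules above can be expressed as a small resolver. The types and names below are illustrative; only the thresholds (default 50 on failure, warning below 50) come from the docs.

```typescript
// Applies the spot-check rules described above:
// - a failed/absent audit defaults to confidence 50 with an
//   "inconclusive" discrepancy entry;
// - confidence below 50 produces a warning.
interface SpotCheckResult {
  confidence: number;
  discrepancies: { field: string; note: string }[];
}

function resolveSpotCheck(
  llmResult: SpotCheckResult | null,
): { result: SpotCheckResult; warnings: string[] } {
  const result = llmResult ?? {
    confidence: 50,
    discrepancies: [{ field: "*", note: "Spot-check audit was inconclusive" }],
  };
  const warnings: string[] = [];
  if (result.confidence < 50) {
    warnings.push(`Low spot-check confidence: ${result.confidence}`);
  }
  return { result, warnings };
}
```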