
POST /api/extract

The heavy lifter. Downloads the uploaded File, parses it, extracts every Unit row using the confirmed structure, validates the output, and writes a standardized 2-Sheet XLSX Output File to storage.

Request

{ "job_id": "uuid-string" }

Headers: Authorization: Bearer <token>

Max duration: 300 seconds (5 minutes — Vercel serverless timeout)
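The request above can be built in a few lines. This is an illustrative client-side helper, not part of the API itself; the function name and `token` parameter are placeholders.

```typescript
// Builds the POST /api/extract request described above.
// `token` stands in for whatever bearer token your auth flow provides.
function buildExtractRequest(jobId: string, token: string) {
  return {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${token}`,
    },
    body: JSON.stringify({ job_id: jobId }),
  };
}

// Usage: fetch("/api/extract", buildExtractRequest(jobId, token))
```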

Prerequisites

The job must have a structure_result (i.e., /api/analyze must have completed). If the structure has not been analyzed, the endpoint returns a 400 error prompting you to run /api/analyze first.
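A minimal sketch of this prerequisite check, assuming the job row exposes a nullable `structure_result` field; the function and interface names are illustrative, not the actual route code.

```typescript
// Hypothetical guard mirroring the prerequisite described above:
// a job with no structure_result is rejected with a 400.
interface JobRow {
  structure_result: unknown | null;
}

function checkPrerequisites(
  job: JobRow,
): { ok: true } | { ok: false; status: number; error: string } {
  if (!job.structure_result) {
    return { ok: false, status: 400, error: "Run /api/analyze first" };
  }
  return { ok: true };
}
```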

Processing Steps

  1. Authenticate via checkAuth() and fetch the full job row
  2. Record start — set extract_started_at, status to extracting, progress to 45%
  3. Download File from Supabase Storage using the job’s storage_path
  4. Parse Excel into CellGrid objects (one per Sheet)
  5. Extract data — data is chunked and extracted in parallel via Claude Sonnet (LLM extraction). Each chunk is up to 60 rows, with up to 8 chunks processed concurrently.
  6. Spot-check — a sample of extracted Units is audited against the source spreadsheet via an independent LLM call. Produces a confidence score (0–100) and a list of discrepancies.
  7. Validate — run validateRows() across all extracted rows, checking for missing required fields, out-of-range values, and row-count expectations derived from data_start_row / data_end_row. Any chunk extraction failures from step 5 are surfaced as warnings in this validation report rather than being silently dropped.
  8. Write output — generate a 2-Sheet XLSX (Rent Roll + Validation) via writeOutputBuffer()
  9. Upload output to Supabase Storage at outputs/{job_id}/output.xlsx
  10. Mark complete — compute total duration, store all results, set status to complete
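Step 5's chunking can be sketched as follows. This is an assumption-laden illustration of "chunks of up to 60 rows, up to 8 in flight", not the route's actual implementation; all names here are invented.

```typescript
// Illustrative chunking and bounded-concurrency helpers for step 5.
const CHUNK_SIZE = 60;       // max rows per extraction chunk (per the docs)
const MAX_CONCURRENCY = 8;   // max chunks processed at once (per the docs)

function chunkRows<T>(rows: T[], size = CHUNK_SIZE): T[][] {
  const chunks: T[][] = [];
  for (let i = 0; i < rows.length; i += size) {
    chunks.push(rows.slice(i, i + size));
  }
  return chunks;
}

// Runs fn over items with at most `limit` promises in flight,
// preserving result order by index.
async function mapConcurrent<T, R>(
  items: T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i], i);
    }
  }
  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    worker,
  );
  await Promise.all(workers);
  return results;
}
```

In the real route each `fn` call would be an LLM extraction request for one chunk; the worker-pool pattern keeps at most eight such requests outstanding.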

Progress Updates

| % | Stage Message |
|---|---|
| 45 | Loading the File… |
| 50–90 | Extraction progress (updated per chunk) |
| 91 | Spot-checking a sample of Units |
| 92 | Checking N Units look right… |
| 95 | Building your clean File… |
| 100 | All done! |

During step 5, the extraction callback fires incremental progress updates between 50% and 90%. If the extraction message contains a section/chunk count (e.g., “Processing 4 sections”), the total_chunks field is also written to the job row.
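One plausible mapping of per-chunk progress onto the 50–90% band, shown for illustration only; the function name and rounding behavior are assumptions.

```typescript
// Maps "chunksDone of totalChunks" onto the 50–90% band described above.
function extractionProgress(chunksDone: number, totalChunks: number): number {
  if (totalChunks <= 0) return 50; // nothing to extract yet
  return Math.round(50 + (chunksDone / totalChunks) * 40);
}
```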

Job Fields Updated

| Field | Type | Description |
|---|---|---|
| status | string | extracting during processing, complete on success, failed on error |
| progress_pct | number | 45 through 100 |
| stage_message | string | Human-readable progress label |
| extract_started_at | timestamp | When extraction began |
| extract_finished_at | timestamp | When extraction completed |
| total_duration_ms | number | Sum of analyze duration + extract duration (excludes the review/confirmation pause between the two steps) |
| unit_count | number | Total extracted Units |
| error_count | number | Validation errors |
| warning_count | number | Validation warnings |
| errors | string[] | List of validation error messages |
| warnings | string[] | List of validation warning messages |
| output_storage_path | string | Storage path to the Output File |
| total_chunks | number | Number of extraction sections (if reported) |

Enriched structure_result Fields

At completion, the following metadata fields are merged into the job’s structure_result JSON object (not stored as top-level job columns):

| Key | Type | Description |
|---|---|---|
| _extraction_method | string | Always "llm" |
| _spot_check_confidence | number \| null | 0–100 confidence score from spot-check audit |
| _spot_check_discrepancies | object[] | Discrepancies found during spot-check |
| _unit_breakdown | object | Per-unit-type counts from validation |
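The merge can be sketched as a shallow spread that layers the four metadata keys onto the existing `structure_result` object. The interfaces and function name below are illustrative assumptions.

```typescript
// Merges the enrichment metadata into structure_result, as described above.
// Existing keys are preserved; the underscore-prefixed keys are added.
interface SpotCheckOutcome {
  confidence: number | null;
  discrepancies: object[];
}

function enrichStructureResult(
  structureResult: Record<string, unknown>,
  spotCheck: SpotCheckOutcome,
  unitBreakdown: Record<string, number>,
): Record<string, unknown> {
  return {
    ...structureResult,
    _extraction_method: "llm",
    _spot_check_confidence: spotCheck.confidence,
    _spot_check_discrepancies: spotCheck.discrepancies,
    _unit_breakdown: unitBreakdown,
  };
}
```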

Response

Success (200):

{ "job_id": "uuid-string", "unit_count": 312, "error_count": 0, "warning_count": 2, "extraction_method": "llm" }

Error Sanitization

Raw error messages are sanitized before returning to the client:

| Condition | User-facing Message |
|---|---|
| Credit balance issues | AI service temporarily unavailable. Please try again later. |
| Rate limits | Too many requests. Please wait a moment and try again. |
| Overloaded | AI service is busy. Please try again in a minute. |
| Other | Something went wrong. Please try again. |

On failure, the job status is set to failed with the sanitized message.
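A hypothetical sanitizer implementing the mapping above. The substrings matched here ("credit", "rate limit", "overloaded") are guesses at what the raw provider errors contain, not confirmed behavior.

```typescript
// Maps raw provider error text to the user-facing messages in the table above.
// The matched substrings are assumptions about typical provider error strings.
function sanitizeError(raw: string): string {
  const msg = raw.toLowerCase();
  if (msg.includes("credit")) {
    return "AI service temporarily unavailable. Please try again later.";
  }
  if (msg.includes("rate limit") || msg.includes("429")) {
    return "Too many requests. Please wait a moment and try again.";
  }
  if (msg.includes("overloaded")) {
    return "AI service is busy. Please try again in a minute.";
  }
  return "Something went wrong. Please try again.";
}
```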

Error (400):

{ "error": "Run /api/analyze first" }

Error (500):

{ "error": "Something went wrong. Please try again." }

Abort Handling

If the client aborts the request (e.g., the user navigates away), the route catches the AbortError and returns a 499 response:

{ "error": "Cancelled" }

Unlike /api/analyze, the abort handler here does not update the job status — the job remains in extracting state. A subsequent call to /api/extract will re-run extraction from scratch.
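The abort path can be sketched as a catch-all error handler that special-cases `AbortError`. This is an illustration of the behavior described above; the function name and return shape are assumptions.

```typescript
// Maps a caught route error to an HTTP status and body:
// AbortError -> 499 "Cancelled", anything else -> sanitized 500.
function handleRouteError(err: unknown): {
  status: number;
  body: { error: string };
} {
  if (err instanceof Error && err.name === "AbortError") {
    // Note: per the docs, the job row is NOT updated here —
    // it stays in the "extracting" state.
    return { status: 499, body: { error: "Cancelled" } };
  }
  return {
    status: 500,
    body: { error: "Something went wrong. Please try again." },
  };
}
```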

Spot-Check Details

After extraction, a random sample of extracted Units is audited against the source spreadsheet via an independent LLM call. The result includes:

  • spot_check_confidence — a score from 0 to 100 indicating how well the extraction matches an independent LLM re-reading of the same source cells. Scores below 50 automatically generate a warning. On LLM failure, confidence defaults to 50 (not 100), with a discrepancy entry indicating the audit was inconclusive.
  • spot_check_discrepancies — an array of objects describing each field where the extraction and the independent audit disagree.
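The fallback and warning rules above can be expressed as a small resolver. The types and names below are illustrative; only the thresholds (default 50 on failure, warning below 50) come from the docs.

```typescript
// Applies the spot-check rules described above:
// - a failed/absent audit defaults to confidence 50 with an
//   "inconclusive" discrepancy entry;
// - confidence below 50 produces a warning.
interface SpotCheckResult {
  confidence: number;
  discrepancies: { field: string; note: string }[];
}

function resolveSpotCheck(
  llmResult: SpotCheckResult | null,
): { result: SpotCheckResult; warnings: string[] } {
  const result = llmResult ?? {
    confidence: 50,
    discrepancies: [{ field: "*", note: "Spot-check audit was inconclusive" }],
  };
  const warnings: string[] = [];
  if (result.confidence < 50) {
    warnings.push(`Low spot-check confidence: ${result.confidence}`);
  }
  return { result, warnings };
}
```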