Extraction
Once structure analysis is confirmed, the system extracts individual Units from each data Table. All extraction uses Claude Sonnet via chunked parallel LLM calls. LLM extraction consistently produces the highest accuracy across all vendor formats, so the pipeline uses it exclusively.
LLM Extraction
How it works:
Chunking
The data region is split into chunks based on the charge orientation:
- Vertical: Chunks are built from Unit blocks (groups of rows separated by blank rows). Each chunk contains up to `UNITS_PER_CHUNK` (18) blocks, with no overlap between chunks (block boundaries are clean separators).
- Horizontal: Chunks are sliced by row count. Each chunk contains up to `CHUNK_ROWS` (60) data rows, with `OVERLAP_ROWS` (5) rows of context from the previous chunk.
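The two strategies above can be sketched as follows. This is an illustrative sketch, not the pipeline's actual code; the function names `chunkVertical` and `chunkHorizontal` are assumptions, and only the constants come from the doc.

```typescript
const UNITS_PER_CHUNK = 18;
const CHUNK_ROWS = 60;
const OVERLAP_ROWS = 5;

// Vertical: group Unit blocks (already split on blank rows) into chunks
// of up to UNITS_PER_CHUNK blocks, with no overlap between chunks.
function chunkVertical(blocks: string[][]): string[][][] {
  const chunks: string[][][] = [];
  for (let i = 0; i < blocks.length; i += UNITS_PER_CHUNK) {
    chunks.push(blocks.slice(i, i + UNITS_PER_CHUNK));
  }
  return chunks;
}

// Horizontal: slice raw data rows into windows of CHUNK_ROWS, carrying
// OVERLAP_ROWS trailing rows of context from the previous chunk.
function chunkHorizontal(rows: string[]): string[][] {
  const chunks: string[][] = [];
  let start = 0;
  while (start < rows.length) {
    const from = start === 0 ? 0 : start - OVERLAP_ROWS;
    chunks.push(rows.slice(from, start + CHUNK_ROWS));
    start += CHUNK_ROWS;
  }
  return chunks;
}
```

Note that every horizontal chunk after the first starts 5 rows early, which is why deduplication is needed downstream.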
Parallel Processing
`runParallel(chunks, PARALLEL_CHUNKS, extractChunkFn)` processes up to 8 chunks concurrently via Claude Sonnet. Each chunk receives:
- The full structure analysis result (column mapping, charge handling, etc.)
- The canonical schema definition and extraction rules
- Header rows for context, plus the grid text for that chunk’s row range
- Overlap instructions (if applicable) telling the LLM to skip units whose first row falls in the overlap region
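A bounded-concurrency runner in the spirit of `runParallel(chunks, PARALLEL_CHUNKS, extractChunkFn)` might look like the sketch below. The real implementation may differ; only the call shape comes from the doc.

```typescript
const PARALLEL_CHUNKS = 8;

// Run fn over items with at most `limit` calls in flight, preserving
// result order by index.
async function runParallel<T, R>(
  items: T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  // Each worker claims the next unprocessed index until the queue is
  // drained; since JS is single-threaded, next++ needs no locking.
  const worker = async () => {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i], i);
    }
  };
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, worker),
  );
  return results;
}
```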
Deduplication
Overlapping rows between chunks can produce duplicate Units. After all chunks complete, Units are deduplicated by unit_id — the first occurrence wins (since earlier chunks have more context).
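The first-occurrence-wins rule is a simple seen-set filter; the function name below is hypothetical.

```typescript
interface Unit {
  unit_id: string;
  [key: string]: unknown;
}

// Keep the first occurrence of each unit_id: earlier chunks saw more
// preceding context, so their version of a duplicated Unit wins.
function dedupeByUnitId(units: Unit[]): Unit[] {
  const seen = new Set<string>();
  return units.filter((u) => {
    if (seen.has(u.unit_id)) return false;
    seen.add(u.unit_id);
    return true;
  });
}
```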
Chunk Failure Handling
Individual chunk extraction failures surface as validation warnings in the final report rather than being silently dropped, so operators are aware of data gaps even when the overall extraction succeeds.
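One way to realize this is to settle all chunk promises and convert rejections into warnings rather than failures. This is a sketch under stated assumptions: the warning shape and the `extractAll` name are hypothetical.

```typescript
interface ChunkWarning {
  chunk: number;
  message: string;
}

// Gather Units from all chunks; a failed chunk becomes a warning
// entry instead of aborting or silently dropping data.
async function extractAll<T, R>(
  chunks: T[],
  extract: (chunk: T) => Promise<R[]>,
): Promise<{ units: R[]; warnings: ChunkWarning[] }> {
  const settled = await Promise.allSettled(chunks.map(extract));
  const units: R[] = [];
  const warnings: ChunkWarning[] = [];
  settled.forEach((res, i) => {
    if (res.status === "fulfilled") {
      units.push(...res.value);
    } else {
      warnings.push({ chunk: i, message: `chunk ${i} failed: ${res.reason}` });
    }
  });
  return { units, warnings };
}
```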
Response Parsing
The LLM returns JSON arrays of Unit objects. The parser (`parseJsonResponse`) tries multiple strategies, in order:
- Extract JSON from markdown code fences
- Direct `JSON.parse` if the response starts with `[` or `{`
- Search for balanced JSON brackets within the response text
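The three fallbacks above can be sketched as a single function. This is illustrative only; `parseJsonResponse`'s actual signature and regexes may differ, and the bracket scan here is naive (it ignores brackets inside string literals).

```typescript
function parseJsonResponse(text: string): unknown | null {
  // 1. Extract JSON from a markdown code fence.
  const fence = text.match(/`{3}(?:json)?\s*([\s\S]*?)`{3}/);
  if (fence) {
    try { return JSON.parse(fence[1]); } catch { /* fall through */ }
  }
  // 2. Direct parse when the response starts with [ or {.
  const trimmed = text.trim();
  if (trimmed.startsWith("[") || trimmed.startsWith("{")) {
    try { return JSON.parse(trimmed); } catch { /* fall through */ }
  }
  // 3. Scan for a balanced bracket span and parse that substring.
  const open = text.search(/[[{]/);
  if (open !== -1) {
    const close = text[open] === "[" ? "]" : "}";
    let depth = 0;
    for (let i = open; i < text.length; i++) {
      if (text[i] === text[open]) depth++;
      else if (text[i] === close) depth--;
      if (depth === 0) {
        try { return JSON.parse(text.slice(open, i + 1)); } catch { break; }
      }
    }
  }
  return null;
}
```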
Post-Extraction Processing
After LLM extraction, the pipeline performs cleanup:
| Step | Behavior |
|---|---|
| Deduplication | Rows are deduplicated by `property_name\|unit_id`; the first occurrence wins |
| No-ID filtering | Rows without a unit_id are silently dropped (typically summary rows, future residents, or applicants) |
| Row-count reconciliation | If extracted count differs from expected data range by >20%, a warning is surfaced in validation |
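Taken together, the three cleanup steps might be composed like this. The `postProcess` name and return shape are assumptions; the composite key, the silent no-ID drop, and the 20% threshold come from the table above.

```typescript
interface Row {
  property_name: string;
  unit_id?: string;
  [k: string]: unknown;
}

function postProcess(rows: Row[], expectedCount: number) {
  const warnings: string[] = [];
  // No-ID filtering: silently drop rows without a unit_id
  // (typically summary rows, future residents, or applicants).
  const withIds = rows.filter((r) => r.unit_id);
  // Deduplicate by the composite property_name|unit_id key;
  // the first occurrence wins.
  const seen = new Set<string>();
  const units = withIds.filter((r) => {
    const key = `${r.property_name}|${r.unit_id}`;
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
  // Row-count reconciliation: warn when the extracted count deviates
  // from the expected data range by more than 20%.
  if (
    expectedCount > 0 &&
    Math.abs(units.length - expectedCount) / expectedCount > 0.2
  ) {
    warnings.push(`extracted ${units.length} units, expected ~${expectedCount}`);
  }
  return { units, warnings };
}
```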
Progress Updates During Extraction
| Progress | Message |
|---|---|
| 50% | Starting extraction |
| 50–90% | Incremental Unit counts as chunks complete |
| 90% | All chunks done, deduplication |
| 91% | Spot-check |
| 92% | Validation |
| 95% | Writing output file |
| 100% | Complete |