Extraction
Once structure analysis is confirmed, the system extracts individual Units from each data Table. All extraction uses Claude Sonnet via chunked parallel LLM calls. LLM extraction consistently produces the highest accuracy across all vendor formats, so the pipeline uses it exclusively.
LLM Extraction
How it works:
Chunking
The data region is split into chunks based on the charge orientation:
- Vertical: Chunks are built from Unit blocks (groups of rows separated by blank rows). Each chunk contains up to `UNITS_PER_CHUNK` (18) blocks, with no overlap between chunks (block boundaries are clean separators).
- Horizontal: Chunks are sliced by row count. Each chunk contains up to `CHUNK_ROWS` (60) data rows, with `OVERLAP_ROWS` (5) rows of context from the previous chunk.
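The two strategies above can be sketched as follows. This is an illustrative sketch, not the pipeline's actual code; the function names `chunkVertical` and `chunkHorizontal` are assumptions, and only the constants come from the doc.

```typescript
const UNITS_PER_CHUNK = 18;
const CHUNK_ROWS = 60;
const OVERLAP_ROWS = 5;

// Vertical: group Unit blocks (already split on blank rows) into chunks
// of up to UNITS_PER_CHUNK blocks, with no overlap between chunks.
function chunkVertical(blocks: string[][]): string[][][] {
  const chunks: string[][][] = [];
  for (let i = 0; i < blocks.length; i += UNITS_PER_CHUNK) {
    chunks.push(blocks.slice(i, i + UNITS_PER_CHUNK));
  }
  return chunks;
}

// Horizontal: slice raw data rows into windows of CHUNK_ROWS, carrying
// OVERLAP_ROWS trailing rows of context from the previous chunk.
function chunkHorizontal(rows: string[]): string[][] {
  const chunks: string[][] = [];
  let start = 0;
  while (start < rows.length) {
    const from = start === 0 ? 0 : start - OVERLAP_ROWS;
    chunks.push(rows.slice(from, start + CHUNK_ROWS));
    start += CHUNK_ROWS;
  }
  return chunks;
}
```

Note that every horizontal chunk after the first starts 5 rows early, which is why deduplication is needed downstream.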
Parallel Processing
`runParallel(chunks, PARALLEL_CHUNKS, extractChunkFn)` processes up to 8 chunks concurrently via Claude Sonnet. Each chunk receives:
- The full structure analysis result (column mapping, charge handling, etc.)
- The canonical schema definition and extraction rules
- Header rows for context, plus the grid text for that chunk’s row range
- Overlap instructions (if applicable) telling the LLM to skip units whose first row falls in the overlap region
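A bounded-concurrency runner in the spirit of `runParallel(chunks, PARALLEL_CHUNKS, extractChunkFn)` might look like the sketch below. The real implementation may differ; only the call shape comes from the doc.

```typescript
const PARALLEL_CHUNKS = 8;

// Run fn over items with at most `limit` calls in flight, preserving
// result order by index.
async function runParallel<T, R>(
  items: T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  // Each worker claims the next unprocessed index until the queue is
  // drained; since JS is single-threaded, next++ needs no locking.
  const worker = async () => {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i], i);
    }
  };
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, worker),
  );
  return results;
}
```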
Deduplication
Overlapping rows between chunks can produce duplicate Units. After all chunks complete, Units are deduplicated by unit_id — the first occurrence wins (since earlier chunks have more context).
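The first-occurrence-wins rule is a simple seen-set filter; the function name below is hypothetical.

```typescript
interface Unit {
  unit_id: string;
  [key: string]: unknown;
}

// Keep the first occurrence of each unit_id: earlier chunks saw more
// preceding context, so their version of a duplicated Unit wins.
function dedupeByUnitId(units: Unit[]): Unit[] {
  const seen = new Set<string>();
  return units.filter((u) => {
    if (seen.has(u.unit_id)) return false;
    seen.add(u.unit_id);
    return true;
  });
}
```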
Chunk Failure Handling
Individual chunk extraction failures surface as validation warnings in the final report rather than being silently dropped, so operators are aware of data gaps even when the overall extraction succeeds.
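One way to realize this is to settle all chunk promises and convert rejections into warnings rather than failures. This is a sketch under stated assumptions: the warning shape and the `extractAll` name are hypothetical.

```typescript
interface ChunkWarning {
  chunk: number;
  message: string;
}

// Gather Units from all chunks; a failed chunk becomes a warning
// entry instead of aborting or silently dropping data.
async function extractAll<T, R>(
  chunks: T[],
  extract: (chunk: T) => Promise<R[]>,
): Promise<{ units: R[]; warnings: ChunkWarning[] }> {
  const settled = await Promise.allSettled(chunks.map(extract));
  const units: R[] = [];
  const warnings: ChunkWarning[] = [];
  settled.forEach((res, i) => {
    if (res.status === "fulfilled") {
      units.push(...res.value);
    } else {
      warnings.push({ chunk: i, message: `chunk ${i} failed: ${res.reason}` });
    }
  });
  return { units, warnings };
}
```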
Response Parsing
The LLM returns JSON arrays of Unit objects. The parser (`parseJsonResponse`) tries multiple strategies, in order:
- Extract JSON from markdown code fences
- Direct `JSON.parse` if the response starts with `[` or `{`
- Search for balanced JSON brackets within the response text
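The three fallbacks above can be sketched as a single function. This is illustrative only; `parseJsonResponse`'s actual signature and regexes may differ, and the bracket scan here is naive (it ignores brackets inside string literals).

```typescript
function parseJsonResponse(text: string): unknown | null {
  // 1. Extract JSON from a markdown code fence.
  const fence = text.match(/`{3}(?:json)?\s*([\s\S]*?)`{3}/);
  if (fence) {
    try { return JSON.parse(fence[1]); } catch { /* fall through */ }
  }
  // 2. Direct parse when the response starts with [ or {.
  const trimmed = text.trim();
  if (trimmed.startsWith("[") || trimmed.startsWith("{")) {
    try { return JSON.parse(trimmed); } catch { /* fall through */ }
  }
  // 3. Scan for a balanced bracket span and parse that substring.
  const open = text.search(/[[{]/);
  if (open !== -1) {
    const close = text[open] === "[" ? "]" : "}";
    let depth = 0;
    for (let i = open; i < text.length; i++) {
      if (text[i] === text[open]) depth++;
      else if (text[i] === close) depth--;
      if (depth === 0) {
        try { return JSON.parse(text.slice(open, i + 1)); } catch { break; }
      }
    }
  }
  return null;
}
```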
Post-Extraction Processing
After LLM extraction, the pipeline performs cleanup:
| Step | Behavior |
|---|---|
| Deduplication | Rows are deduplicated by `property_name\|unit_id`; the first occurrence wins |
| No-ID filtering | Rows without a unit_id are silently dropped (typically summary rows, future residents, or applicants) |
| Row-count reconciliation | If extracted count differs from expected data range by >20%, a warning is surfaced in validation |
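Taken together, the three cleanup steps might be composed like this. The `postProcess` name and return shape are assumptions; the composite key, the silent no-ID drop, and the 20% threshold come from the table above.

```typescript
interface Row {
  property_name: string;
  unit_id?: string;
  [k: string]: unknown;
}

function postProcess(rows: Row[], expectedCount: number) {
  const warnings: string[] = [];
  // No-ID filtering: silently drop rows without a unit_id
  // (typically summary rows, future residents, or applicants).
  const withIds = rows.filter((r) => r.unit_id);
  // Deduplicate by the composite property_name|unit_id key;
  // the first occurrence wins.
  const seen = new Set<string>();
  const units = withIds.filter((r) => {
    const key = `${r.property_name}|${r.unit_id}`;
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
  // Row-count reconciliation: warn when the extracted count deviates
  // from the expected data range by more than 20%.
  if (
    expectedCount > 0 &&
    Math.abs(units.length - expectedCount) / expectedCount > 0.2
  ) {
    warnings.push(`extracted ${units.length} units, expected ~${expectedCount}`);
  }
  return { units, warnings };
}
```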
Progress Updates During Extraction
| Progress | Message |
|---|---|
| 50% | Starting extraction |
| 50–90% | Incremental Unit counts as chunks complete |
| 90% | All chunks done, deduplication |
| 91% | Spot-check |
| 92% | Validation |
| 95% | Writing output file |
| 100% | Complete |