Skip to Content
PipelineSheet Scanning

Sheet Scanning

The scanner is the first processing phase after upload. It reads every Sheet in the Excel file and breaks each one into classified sections (Tables).

Why Scanning Is Needed

Rent rolls are messy. A single Sheet might contain:

  • A header block with property name and report date
  • A unit data Table with tenant info and charges
  • A summary Table with totals
  • A charge code breakdown Table
  • A floor plan summary

The scanner identifies these distinct regions so the structure analysis phase knows where to look for actual unit data.

The CellGrid Model

Every Excel Sheet is parsed into a CellGrid — a sparse cell map using "row,col" string keys (1-indexed):

interface CellGrid { sheetName: string; cells: Map<string, string>; // Key: "row,col", Value: cell text rowCount: number; colCount: number; }

Cell type handling during parsing:

  • Dates: Converted to YYYY-MM-DD ISO format
  • Numbers: Integers stored as-is, decimals to 2 decimal places
  • Booleans: Lowercase string ("true" / "false")
  • Empty strings: Skipped entirely (sparse storage)

Section Classification

Each Sheet is classified into a SheetMap containing SheetSection entries:

interface SheetSection { startRow: number; // 1-indexed inclusive endRow: number; // 1-indexed inclusive classification: | "unit_data" // Individual unit records | "summary_totals" // Aggregate totals/subtotals | "charge_code_summary"// Charge code breakdowns | "floor_plan_summary" // Floor plan aggregates | "header_metadata" // Property name, report date, headers | "other"; // Unclassified hasVerticalCharges: boolean; reason: string; // Human-readable explanation } interface SheetMap { sheetName: string; sections: SheetSection[]; primaryDataSection: SheetSection | null; // Largest unit_data section }

How It Works

  1. readExcelBuffer(buffer, filename) — parses the XLSX into a dictionary of CellGrid objects, keyed by Sheet name
  2. scanAllSheets(sheets, onProgress?, signal?) — scans all Sheets in parallel via Promise.all
  3. For each Sheet, scanSheet(grid) sends the Sheet text to Claude Sonnet with a classification prompt
  4. The LLM returns a JSON array of section descriptors
  5. The primaryDataSection is automatically selected as the largest unit_data section by row span

Grid Text Serialization

Three functions convert grid regions to text for LLM prompts:

FunctionBehavior
gridToText(grid, maxRows?)Full grid dump. Format: R{n} | A: value | B: value | ...
gridRangeToText(grid, startRow, endRow)Specific row range only
buildSheetText(grid)Smart sampling: full dump for sheets ≤200 rows; first 30 + every 5th middle row + last 30 for larger sheets (capped at 30 columns)

Progress Updates

  • 16%: Starting scan
  • 19%: Scan complete, reports section count (e.g., “Found 5 sections across 3 sheets”)
Last updated on