What we shipped.
Public build log. Truthful, technical, no marketing. If you'd rather read the why behind it, see the thesis.
- 2026-05-05
Entity resolution v1: deterministic floor + probabilistic match
Layered architecture for resolving the same contractor / project across messy public bid sources. Deterministic exact-match catches the obvious cases; Splink-backed probabilistic matching handles the long tail of name/address variations. Eval set + Wilson confidence intervals make the lift from the probabilistic layer measurable.
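A minimal sketch of the layering idea, not the shipped pipeline. It assumes a normalized-name lookup as the deterministic floor and swaps in the standard library's `difflib` similarity ratio as a stand-in for the Splink model, so the second layer is labeled "fuzzy" rather than probabilistic. The `normalize` rules, the suffix list, and the 0.85 threshold are all hypothetical.

```python
import re
from difflib import SequenceMatcher

# Hypothetical suffix list; a real one would be tuned against the eval set.
SUFFIXES = {"inc", "llc", "co", "corp", "company"}

def normalize(name: str) -> str:
    """Lowercase, strip punctuation, and drop common corporate suffixes."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(t for t in cleaned.split() if t not in SUFFIXES)

def resolve(name: str, known: dict[str, str], threshold: float = 0.85):
    """Return (entity_id, layer). `known` maps normalized names to entity ids."""
    key = normalize(name)
    # Layer 1: deterministic floor, exact match on the normalized key.
    if key in known:
        return known[key], "deterministic"
    # Layer 2: fuzzy fallback (difflib stand-in, NOT Splink's model).
    best_id, best_score = None, 0.0
    for candidate, entity_id in known.items():
        score = SequenceMatcher(None, key, candidate).ratio()
        if score > best_score:
            best_id, best_score = entity_id, score
    if best_score >= threshold:
        return best_id, "fuzzy"
    return None, "unmatched"

known = {normalize("Acme Paving, Inc."): "c-1"}
print(resolve("ACME PAVING INC", known))
print(resolve("Acme Pavng LLC", known))
```

Returning the layer alongside the id is the useful part: it lets an eval set report how many matches came from each layer.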
- 2026-05-05
NCDOT 2024 ingestion + cross-source resolution
North Carolina DOT bid-tab XLS ingestion, end-to-end. The interesting part isn't the scrape — it's resolving entities across NCDOT, FDOT, and the project ontology so the same contractor in two states is one record.
- 2026-05-05
Next.js 16 site + Supabase waitlist wired
Migrated cassandri.com from a static prototype to a Next.js 16 app on Vercel. Waitlist signups now persist to Supabase. Foundation for the rest of the surface area.
- 2026-05-05
Scrapers SOP: onboarding skill + manifest + health checks
Standard operating procedure for adding a new public source: onboarding skill, manifest schema, health check, fixture invariants. Phase 1 + Phase 6 gates hard-enforced. Adding the next state is now a checklist, not a snowflake.
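For illustration, a manifest plus health check could look roughly like the sketch below. Every field name, the allowed formats, and the row-count rule are assumptions made for the example; the actual manifest schema and gate definitions are not shown in this log.

```python
import re
from dataclasses import dataclass

ALLOWED_FORMATS = {"xls", "xlsx", "csv", "pdf", "html"}  # hypothetical list

@dataclass(frozen=True)
class SourceManifest:
    """Hypothetical per-source manifest; the real schema may differ."""
    source_id: str
    base_url: str
    file_format: str
    min_rows_per_run: int

def validate_manifest(m: SourceManifest) -> list[str]:
    """Return a list of human-readable problems; empty means the manifest passes."""
    errors = []
    if not re.fullmatch(r"[a-z0-9_]+", m.source_id):
        errors.append("source_id must be lowercase snake_case")
    if not m.base_url.startswith("https://"):
        errors.append("base_url must use https")
    if m.file_format not in ALLOWED_FORMATS:
        errors.append(f"unsupported file_format: {m.file_format}")
    if m.min_rows_per_run <= 0:
        errors.append("min_rows_per_run must be positive")
    return errors

def health_check(m: SourceManifest, rows_scraped: int) -> bool:
    """A run is healthy if it produced at least the manifest's row floor."""
    return rows_scraped >= m.min_rows_per_run

manifest = SourceManifest("ncdot_bid_tabs", "https://example.org/bidtabs", "xls", 10)
print(validate_manifest(manifest), health_check(manifest, rows_scraped=25))
```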
- 2026-05-04
ER eval set + Wilson CI baseline
Cross-source entity resolution now has a versioned eval set, with Wilson confidence intervals on precision/recall. Future ER changes are measured against this baseline, not eyeballed.
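For reference, the Wilson score interval for a proportion can be computed as in the sketch below. The formula is the standard one; the function names and the way true positives, false positives, and false negatives are passed in are assumptions for the example, not the eval harness itself.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)

def precision_recall_ci(tp: int, fp: int, fn: int) -> dict[str, tuple[float, float]]:
    """Intervals on precision = tp/(tp+fp) and recall = tp/(tp+fn)."""
    return {
        "precision": wilson_interval(tp, tp + fp),
        "recall": wilson_interval(tp, tp + fn),
    }

print(precision_recall_ci(tp=90, fp=10, fn=30))
```

With 90 correct out of 100 predicted matches, the precision interval comes out to roughly 0.83 to 0.94, which is the kind of spread a small eval set produces. Unlike the normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly near 0% or 100%.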
- 2026-05-04
Splink probabilistic ER pipeline + Parquet bridge
Splink (probabilistic record linkage) integrated into the ER pipeline with a Parquet bridge between the warehouse and the matcher. Handles the cases deterministic match can't: name variants, address normalization, partial matches.
- 2026-05-04
FDOT 2025 backfill + multi-slug URL resolution
Florida DOT publishes the same bid letting under multiple URL slugs depending on the year/letting type. Backfilled 2025 and made the resolver tolerant of the slug variations so we don't miss future lettings.
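One way to make a resolver slug-tolerant is to generate every candidate URL and take the first one that exists, as sketched below. The slug templates, the `example.org` base URL, and the `exists` callback are hypothetical placeholders; FDOT's real URL patterns are not reproduced here.

```python
from typing import Callable, Iterable, Optional

def candidate_urls(base: str, year: int, letting_id: str,
                   slug_templates: Iterable[str]) -> list[str]:
    """Expand each slug template into a full candidate URL, in priority order."""
    root = base.rstrip("/")
    return [f"{root}/{t.format(year=year, letting_id=letting_id)}"
            for t in slug_templates]

def resolve_letting_url(urls: Iterable[str],
                        exists: Callable[[str], bool]) -> Optional[str]:
    """Return the first candidate that `exists` accepts, or None if none do."""
    for url in urls:
        if exists(url):
            return url
    return None

# Hypothetical templates; `exists` would wrap an HTTP HEAD request in practice.
templates = ["lettings/{year}/{letting_id}", "contracts/lettings/{letting_id}-{year}"]
urls = candidate_urls("https://example.org/", 2025, "T1234", templates)
live = {urls[1]}
print(resolve_letting_url(urls, live.__contains__))
```

Injecting `exists` keeps the resolver testable without network access, and a `None` result is a natural signal for a health check to flag.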
- 2026-05-04
FDOT bid-tab scraper end-to-end
First public source landed: Florida DOT bid tabs, scraped, parsed, and persisted. Sets the pattern every subsequent state follows.
- 2026-05-04
dbt scaffolding + synthetic bid_tabs source
dbt models for bid_tabs and a synthetic source so transforms can be tested before real data lands. Lets us evolve the schema with confidence.
- 2026-05-04
Engine + ontology layer scaffolded
Foundation for the intelligence layer: an engine that runs over a typed ontology of construction-domain entities (contractors, projects, lettings, line items) so downstream questions are queries against meaning, not text.
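To make "queries against meaning" concrete, here is a rough sketch of a typed ontology using the four entity kinds named above. All field names and the `total_bid` helper are assumptions for illustration; the actual ontology and engine are not shown in this log.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Iterable

@dataclass(frozen=True)
class Contractor:
    contractor_id: str
    name: str

@dataclass(frozen=True)
class Project:
    project_id: str
    state: str

@dataclass(frozen=True)
class Letting:
    letting_id: str
    project_id: str
    letting_date: str  # ISO 8601 date string, kept simple for the sketch

@dataclass(frozen=True)
class LineItem:
    letting_id: str
    contractor_id: str
    item_code: str
    quantity: Decimal
    unit_price: Decimal

    @property
    def extended(self) -> Decimal:
        """Extended amount for this bid line."""
        return self.quantity * self.unit_price

def total_bid(items: Iterable[LineItem], letting_id: str, contractor_id: str) -> Decimal:
    """Sum a contractor's line items for one letting: a typed query, not a text search."""
    return sum(
        (i.extended for i in items
         if i.letting_id == letting_id and i.contractor_id == contractor_id),
        Decimal("0"),
    )

items = [
    LineItem("L1", "c-1", "0101", Decimal("10"), Decimal("2.50")),
    LineItem("L1", "c-1", "0202", Decimal("3"), Decimal("100")),
    LineItem("L1", "c-2", "0101", Decimal("10"), Decimal("2.75")),
]
print(total_bid(items, "L1", "c-1"))
```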
- 2026-05-04
Supabase pooler + R2 archival
Connected to Supabase via Session Pooler for warehouse workloads; raw scraped artifacts archived to Cloudflare R2 so we always have provenance.