Skip to content

#304/#305 Phase 3: viewport + search facet counts via mask index (stacked on #308)#309

Draft
rdhyee wants to merge 14 commits into
isamplesorg:mainfrom
rdhyee:feat/304-viewport-search-counts
Draft

#304/#305 Phase 3: viewport + search facet counts via mask index (stacked on #308)#309
rdhyee wants to merge 14 commits into
isamplesorg:mainfrom
rdhyee:feat/304-viewport-search-counts

Conversation

@rdhyee

@rdhyee rdhyee commented Jun 20, 2026

Copy link
Copy Markdown
Contributor

Phase 3 of #305 — viewport + search facet counts via the mask index

🔗 Stacked on #308 (Phase 2) → #307 (Phase 1). Draft until the base PRs merge; the net-new change is explorer.qmd.

Phase 2 routed global, no-search multi-filter counts through sample_facet_index. Phase 3 extends the same index path to the two cases it gated out:

  • Viewport — the index has no coordinates, so a non-global view JOINs samples_map_lite (l.latitude/l.longitude) on pid (one row/pid → no grain change) and appends the same VIEWPORT_PAD_FACTOR=0.3 padded bbox the table / point loader / heatmap use.
  • Search — appends the search_pids semi-join.

Both fold into the single UNION ALL statement via a bare-predicate WHERE builder. Index columns are now s.-qualified (samples_map_lite also has a source column).

Gate + honesty

The index path now runs whenever there's any constraint (cross-filter, viewport, or search) in full tree mode. The fallback pivots on whether a viewport bbox exists, not on search:

  • Global (bbox null), with or without search → an index miss shows the "(—)" dash (the legacy global tree path baselines, ignoring filters/search — Interactive Explorer: Facet counts wrong when filtering by multiple facet values #304); never baseline.
  • Viewport (±search) → an index miss falls through to the legacy membership path, which is correct there (JOINs lite + bbox + search). Both paths coexist; membership-path retirement is deferred to a browser-verified follow-up (plan: retire only after in-app parity).

Characterization (native DuckDB parity, real 202608 data)

object_type counts for material=anthropogenic scoped to a Mediterranean bbox are IDENTICAL between the index path and the membership path (COUNT(*) over the mask join == COUNT(DISTINCT pid) over membership). The search semi-join is structurally identical to the membership path's, so search parity follows from the proven mask≡membership set identity.

Review

LGTM after 2 Codex rounds (the round-1 fix closed a global-search honesty defect where an index miss could show baseline tree counts).

Deferred to Phase 4

Playwright (incl. global+search forced-unavailable → "(—)" not baseline), CI validator gate, remove the legacy baseline fallback + retire the membership count path, WASM cold/warm benchmark, staging verify.

Refs #305 #304

rdhyee and others added 14 commits June 19, 2026 21:51
Joint Claude+Codex plan for computing multi-filter facet counts from the
isamplesorg#299 bitmask index (avoiding the 39M-row membership scan). Covers the
masks histogram approach, native DuckDB benchmarks, source handling, the
isamplesorg#306 missing-pid prerequisite, the honesty rule (never baseline under
active filters), a 4-phase rollout, and risks.

Refs isamplesorg#305, isamplesorg#304, isamplesorg#306, isamplesorg#276.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…_facet_index)

Build sample_facet_index: one row per LOCATED sample (samp_geo), not only
those with tree membership. sample_facet_masks is built FROM membership and
silently omits ~29,917 located samples with no tree concept (isamplesorg#306); the
multi-filter global-view count path (isamplesorg#304/isamplesorg#305) must count over the whole
located universe, so it scans this complete index instead.

- start from samp_geo, LEFT JOIN membership-derived masks, zero-mask the
  no-membership pids (correct: in no subtree, but still located + still
  counts toward source/total)
- add `source` VARCHAR (exclusive, not multi-valued -> not a mask)
- build_id = "<membership_id>:<coverage_id>": membership half matches
  facet_node_bits (mask-bit interpretation gate); coverage half fingerprints
  the samp_geo (pid, source) universe so isamplesorg#306-class drift can't go stale
  silently (today's membership-only id is blind to it)
- schema_version column for forward-compat

Validator: --index gate asserts schema, one-row-per-pid, pid==facets_v2
(located universe completeness, isamplesorg#306), source==facets_v2, structured single
build_id matching node_bits, masks bit-identical to sample_facet_masks for
shared pids, and zero-mask for every no-membership pid.

Tests: a located no-membership sample (root-only material) — present + zero
masked in the index, absent from masks; coverage-id changes on source flip;
all corruption gates bite. 29/29 pass.

Docs: SERIALIZATIONS.md §4.12 ratifies the isamplesorg#276 count contract (membership /
"anywhere in tree") served by this index.

Refs isamplesorg#305 isamplesorg#306 isamplesorg#304 isamplesorg#276
… (Codex round 1)

Address Codex adversarial review of the sample_facet_index Phase-1 diff:

- #1 staleness gate: validator now INDEPENDENTLY recomputes both build_id
  halves from the written siblings (coverage from sample_facets_v2,
  membership from sample_facet_membership) and asserts equality — a stale or
  hand-edited coverage id now FAILS at build time even though node_bits is
  unchanged. Runtime explorer handshake documented as a Phase 2 item.
- #2 validator metadata/NULL gates: schema_version must equal the exact
  contract version (999 was accepted); build_id must match the exact shape
  <xor>_<sum>_<cnt>:<xor>_<sum>_<cnt> ("a colon somewhere" accepted
  <m>:bogus:extra); NOT-NULL contract on pid/masks/build_id/version; zero-mask
  checks use IS DISTINCT FROM 0 (a NULL mask escaped the old <> 0).
- #3 independent mask check: re-derive masks from membership + node_bits for
  EVERY index pid (incl. zero rows) and symmetric-diff — no longer trusts the
  optional sample_facet_masks file.
- #4 orphan artifact: facet_node_bits is force-emitted whenever masks/index is
  built, so --only sample_facet_index can't ship an uninterpretable mask file.
- #5 fingerprint robustness: shared MEMBERSHIP/COVERAGE token exprs; XOR+SUM+
  COUNT trio (resists XOR cancellation / 2-row swaps); NULL source encoded with
  a sentinel byte so NULL≠''; honest non-cryptographic caveat.
- SQL gap: reject NULL pids in build_base_tables (a single NULL slipped the
  dup check, broke pid joins, and was absent from the coverage hash).

Tests: +5 adversarial (null mask, schema_version=999, malformed build_id,
stale coverage id, null pid) — all gates bite. 34/34 pass.

Refs isamplesorg#305 isamplesorg#306
… (Codex round 2)

- r2 #1: index validation now FAILS CLOSED when sample_facet_membership or
  facet_node_bits is absent — those siblings are what the build_id recompute
  and mask re-derivation depend on, so a missing sibling must gate, not run a
  silent partial pass (Codex reproduced a corrupt index + restamped node_bits
  passing once the siblings were removed). They're always co-produced, so
  --dir/--tag satisfies it.
- r2 #2: assert node_bits has EXACTLY ONE non-NULL build_id before the
  membership-half match — 0 or >1 distinct ids previously skipped the match
  entirely (fail-open).

Codex r2 confirmed: builder _fingerprint() == validator _fp() exactly;
coverage-from-facets_v2 == samp_geo; the three round-1 bypasses stay rejected;
node_bits force-emit has no double-build and masks/index can't ship without
node_bits (even with --skip).

Tests: +2 (missing-siblings fail-closed, multi-id node_bits). 36/36 pass.

Refs isamplesorg#305 isamplesorg#306
…h index (Codex round 3)

Codex r3 caught a regression my own fixes introduced: --only sample_facet_index
emitted index + node_bits but NOT sample_facet_membership, so the new validator
gate (which requires membership as its independent oracle) would reject that
output. Index validation legitimately also needs sample_facets_v2 (the isamplesorg#306
completeness oracle), so --only sample_facet_index is a PARTIAL rebuild against
an existing full build, by design.

- force_dep(): whenever masks or index is built, force-emit BOTH
  sample_facet_membership and facet_node_bits (the bundle the index validator
  depends on), exactly once, recorded for the manifest.
- test now proves the real workflow: full build -> --only sample_facet_index
  rebuild into the same dir -> re-validate PASSES (not just file-existence).

Codex r3 confirmed both round-2 fail-open gaps stay closed (rc=1 on every
bypass) with no new skip path. 36/36 pass.

Refs isamplesorg#305 isamplesorg#306
…s bundle (Codex round 4)

Delete index+node_bits+membership after the full build and before the
--only sample_facet_index rebuild, so a stale full-build file can't satisfy
the existence checks. Retains the re-emission + validator-pass assertions.

Codex r4 confirmed the implementation correct: index-only and masks-only emit
membership+node_bits exactly once, no duplicate emissions in a full build,
partial-rebuild validation passes, manifest records each forced dep once, and
no invocation produces masks/index without node_bits+membership.

36/36 pass.

Refs isamplesorg#305 isamplesorg#306
…atible with deployed node_bits

The Codex-r1 hardening switched membership_build_id to the XOR+SUM+COUNT trio —
but that id is a DEPLOYED CONTRACT: the live 202608 facet_node_bits/sample_facet_masks
carry it as a bare bit_xor decimal (e.g. 11317573279759780618), and the explorer's
facetIndexReady preflight matches the index's membership-half against the deployed
node_bits.build_id. The trio format would never match, leaving multi-filter counts
permanently "unavailable" against already-published artifacts.

Revert membership_build_id to the bare order-independent XOR (grain is unique →
no cancellation, as the original argued). Keep the richer trio ONLY for
coverage_build_id (a NEW id, no compat constraint). Index build_id is therefore
"<m_xor>:<c_xor>_<c_sum>_<c_cnt>". Validator recomputes the membership half with
bare XOR (_xor_fp) and the well-formedness regex matches the asymmetric shape.

Result: a freshly-built index's membership-half == the DEPLOYED node_bits.build_id,
so the index can be published standalone (new filename, non-destructive) without
republishing node_bits/masks/membership. 36/36 pass.

Refs isamplesorg#305 isamplesorg#306
… global view via mask index

The single-filter cube (isamplesorg#298) serves correct cross-filtered counts for exactly
ONE active filter at global view. With 2+ filters, effectiveSingleFilter()
returns null, the cube is skipped, and updateCrossFilteredCounts falls to the
legacy early-return that reverts every tree dim to the UNFILTERED baseline
(explorer.qmd) — the isamplesorg#304 bug (e.g. Material=anthropogenic + Object=artifact
showed 261,086 artifacts instead of the cross-filtered 145,770).

Fix: route the global / full-tree-mode / no-search / multi-filter case through
the complete per-pid index (sample_facet_index, isamplesorg#305/isamplesorg#306) with the node_bits
bitmask predicate:
- directFilterSnapshot(): reads ALL selections from the controls WITHOUT zeroing
  tree dims at global view (that zeroing in describeCrossFilters is the root cause).
- maskIndexWhere(): per-dim cross-filter predicate (OR within a dim via the bit
  mask, AND across dims, exclude-self); source-impossible zeroes tree targets but
  not source's own histogram (the plan's special case).
- applyMaskIndexCounts(): one columnar bitmask query per dim, applied ATOMICALLY
  after a single facetCountsReqId stale check; COUNT(*) over the one-row-per-pid
  index == distinct pids.
- HONESTY RULE: on query failure → 'unavailable' (a "(—)" dash, .count-unavailable),
  NEVER the baseline. If the index/bitmap isn't usable yet (unpublished) →
  'fallthrough' to the legacy paths unchanged, so this is safe to ship first.
- facetIndexReady preflight: index present + schema_version==1 + build_id
  membership-half == node_bits generation; else stays unavailable, not wrong.

Single-filter cube kept as a pure optimization. Viewport + search still take the
live membership path (Phase 3).

Verified on real 202608 data: index covers 6,026,242 located pids vs masks'
5,996,325 — exactly the 29,917 isamplesorg#306 samples recovered; full validator passes;
Eric's case returns 145,770 (cross-filtered) not 261,086 (baseline).

Refs isamplesorg#305 isamplesorg#304 isamplesorg#306
…y, boot race, coverage, decoupling)

- P1 fallthrough reintroduced isamplesorg#304: at global + full tree + active filters there
  is NO correct legacy path (it reverts to baseline), so 'unavailable' AND
  'fallthrough' now both show the dash, never legacy baseline. + defensive stale
  check before markFacetCountsUnavailable.
- P1 boot race: facetIndexReady now calls window.__onFacetIndexReady() when it
  settles true; the main script registers it to refreshFacetCounts() if filters
  are active, so a URL-restored multi-filter that computed before readiness
  reconciles to correct counts instead of staying on the dash.
- P1/P2 runtime coverage handshake: facetIndexReady now also verifies the index
  row count == facets (located universe) — catches stale/partial/mixed-cache
  indexes (the isamplesorg#306 drift class a membership-only id is blind to). Full
  fingerprint stays a build-time gate (SERIALIZATIONS §4.12).
- P2 decouple node_bits from masks: nodeBitsReady publishes __nodeBitsMap +
  __nodeBitsBuild as soon as node_bits is valid (count path needs only these);
  __nodeBits (masks-gated) stays for facetFilterSQL. A missing masks file no
  longer disables a valid count bundle. Count path now reads __nodeBitsMap.
- P2 WASM concurrency: the 4 per-dim queries are now ONE UNION ALL statement
  (one connection slot) instead of 4 concurrent full-index scans.
- P2 SQL injection: R2_BASE (user-controllable via ?data_base=/localStorage) is
  rejected if it contains a quote/backslash/control char or isn't http(s)/root-
  relative, so it can't break out of read_parquet('...') in any query.

Re-verified on real 202608 data: single UNION ALL query returns Eric's case as
145,770 cross-filtered across all dims (vs 261,086 baseline).

Codex confirmed sound: COUNT(*) per concept == distinct pids; exclude-self /
OR-within / AND-across / source-impossible / multi-value-single-dim semantics;
single post-Promise stale check. WASM cold/warm browser benchmark remains the
Phase 4 release gate.

Refs isamplesorg#305 isamplesorg#304 isamplesorg#306
… source-only refresh (Codex r2)

- P1: nodeBitsReady split into TWO try blocks — a masks-query failure no longer
  throws into the shared catch that nulled __nodeBitsMap/__nodeBitsBuild. node_bits
  publishes the count-path values in step 1; masks preflight is step 2 and only
  withholds __nodeBits (the FILTER signal) on failure. Count path now truly
  survives a missing/mismatched masks file.
- P1: __onFacetIndexReady refreshes UNCONDITIONALLY — hasFacetFilters() excludes
  source, so a source-only filter that painted dashes before readiness now also
  reconciles. Debounce + reqId guard make a redundant no-filter refresh harmless.
- P2 staleness: coverage handshake upgraded from bare row-count to per-SOURCE
  histogram equality (index vs facets_v3) — catches equal-cardinality PID
  replacement + source drift, not just a smaller index. Verified mismatch=0 on real
  202608 data (GEOME/OPENCONTEXT/SESAR/SMITHSONIAN all match). Still a proxy, not
  the full fingerprint (build-time gate); generation-id match guards mask corruption.

Codex r2 confirmed otherwise sound: no global active-filter route reaches baseline;
UNION ALL exclude-self + dim regrouping correct; unqualified cols bind to index `s`;
all R2_BASE forms pass; 36 py + 13 node tests pass.

Refs isamplesorg#305 isamplesorg#304 isamplesorg#306
…laim (Codex r3)

The per-source histogram handshake does NOT catch a same-source, same-cardinality
PID swap (mismatch stays 0) — correct the overclaim. That residual is closed by
(i) the generation-id match (the membership half hashes pid, so any swap of a pid
WITH membership changes it — leaving only isamplesorg#306 no-membership pids) and (ii) the
authoritative BUILD-TIME coverage fingerprint + pid-set equality gate. Runtime
stays a cheap source/cardinality proxy by design (boot-query budget); a future
co-published expected-id pointer would let it assert the full fingerprint.

Codex r3 confirmed all P1s closed: masks failure never clears the count-path
values; map/nbBuild correctly scoped; unconditional readiness refresh can't loop;
coverage SQL valid + NULL-safe + cheap.

Refs isamplesorg#305 isamplesorg#304 isamplesorg#306
…erage check cheap (boot-perf)

In-browser DuckDB-WASM verification against the live index showed the per-source
coverage handshake cost ~8.6s at boot — it scanned the 60MB facets_v3. Compare
the index's per-source histogram to facet_summaries' source rows instead: the
builder computes those as GROUP BY source over the SAME located set (samp_geo),
the file is ~2KB and already loaded at boot, and it's verified equivalent
(mismatch=0 on real 202608 data). This removes a multi-second connection hold at
boot — exactly the single-connection starvation the app guards against — with no
loss of coverage signal. Index source filter mirrors the builder's
NULLIF(TRIM(source),'') so empty-string sources can't cause a false mismatch.

Verified end-to-end in WASM: preflight PASS, Eric's case = 145,770, cold 1121ms /
warm 174ms.

Refs isamplesorg#305 isamplesorg#304
…viewport + search

Phase 2 routed global, no-search multi-filter counts through sample_facet_index.
Phase 3 extends the SAME path to the two cases it gated OUT:
- VIEWPORT: the index has no coordinates, so a non-global view JOINs
  samples_map_lite (l.latitude/l.longitude) on pid (one row/pid → no grain
  change) and appends the same VIEWPORT_PAD_FACTOR=0.3 padded bbox the table,
  point loader and heatmap use.
- SEARCH: appends the search_pids semi-join (s.pid IN (SELECT pid FROM
  search_pids)).
Both fold into the single UNION ALL statement via a bare-predicate WHERE builder.
Index columns are now s.-qualified because lite ALSO has a `source` column.

Gate generalized: the index path now runs whenever there's ANY constraint
(cross-filter, viewport, or search) in full tree mode. Fallback differs by view
(honesty rule): at GLOBAL+no-search an index miss shows the dash (legacy =
baseline = isamplesorg#304); at VIEWPORT/SEARCH an index miss falls through to the legacy
membership path, which is CORRECT there. Both paths coexist — the membership
count path is RETAINED this phase; retirement is deferred to a browser-verified
follow-up (plan: retire only after parity is confirmed in-app).

Characterization (native DuckDB parity, real 202608 data): object_type counts for
material=anthropogenic scoped to a Mediterranean bbox are IDENTICAL between the
index path and the membership path (COUNT(*) over the mask join == COUNT(DISTINCT
pid) over membership). The search semi-join is structurally identical to the
membership path's, so search parity follows from the proven mask≡membership set
identity.

Refs isamplesorg#305 isamplesorg#304
…llback (Codex P3)

At global view WITH active search, an index miss/error fell through (atGlobalNoSearch
was false because search was active) to the legacy tree path, which hits
`if (treeActive && !bboxSQL) applyFacetCounts(d.key, null)` — the unfiltered
BASELINE, ignoring search. Honesty violation.

The correct pivot is whether a VIEWPORT BBOX exists, not whether search is active:
at global (bboxSQL === null) there is no correct legacy tree path regardless of
search, so an index miss now shows the dash. Fall through to the legacy membership
path ONLY when a viewport bbox exists (where it is correct). Drops atGlobalNoSearch
in favor of `bboxSQL === null`.

Codex confirmed everything else sound: grain (lite join one-row-per-pid under the
validated PID-set-equality contract), column binding, bbox regex (strips only the
first leading AND, anti-meridian inner ANDs intact), escaping, viewport-only
histograms, concurrency/atomicity.

The requested regression test (global+search + forced index-unavailable → tree
counts "(—)" not baseline) is recorded as a Phase 4 Playwright case (isamplesorg#43) — the
count engine runs in-browser, so it needs the Playwright harness, not a headless
unit test.

Refs isamplesorg#305 isamplesorg#304
@rdhyee rdhyee force-pushed the feat/304-viewport-search-counts branch from 7c6b23d to 4aa989b Compare June 20, 2026 13:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant