#304/#305 Phase 3: viewport + search facet counts via mask index (stacked on #308)#309
Draft
rdhyee wants to merge 14 commits into
Draft
#304/#305 Phase 3: viewport + search facet counts via mask index (stacked on #308)#309rdhyee wants to merge 14 commits into
rdhyee wants to merge 14 commits into
Conversation
Joint Claude+Codex plan for computing multi-filter facet counts from the isamplesorg#299 bitmask index (avoiding the 39M-row membership scan). Covers the masks histogram approach, native DuckDB benchmarks, source handling, the isamplesorg#306 missing-pid prerequisite, the honesty rule (never baseline under active filters), a 4-phase rollout, and risks. Refs isamplesorg#305, isamplesorg#304, isamplesorg#306, isamplesorg#276. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…_facet_index) Build sample_facet_index: one row per LOCATED sample (samp_geo), not only those with tree membership. sample_facet_masks is built FROM membership and silently omits ~29,917 located samples with no tree concept (isamplesorg#306); the multi-filter global-view count path (isamplesorg#304/isamplesorg#305) must count over the whole located universe, so it scans this complete index instead. - start from samp_geo, LEFT JOIN membership-derived masks, zero-mask the no-membership pids (correct: in no subtree, but still located + still counts toward source/total) - add `source` VARCHAR (exclusive, not multi-valued -> not a mask) - build_id = "<membership_id>:<coverage_id>": membership half matches facet_node_bits (mask-bit interpretation gate); coverage half fingerprints the samp_geo (pid, source) universe so isamplesorg#306-class drift can't go stale silently (today's membership-only id is blind to it) - schema_version column for forward-compat Validator: --index gate asserts schema, one-row-per-pid, pid==facets_v2 (located universe completeness, isamplesorg#306), source==facets_v2, structured single build_id matching node_bits, masks bit-identical to sample_facet_masks for shared pids, and zero-mask for every no-membership pid. Tests: a located no-membership sample (root-only material) — present + zero masked in the index, absent from masks; coverage-id changes on source flip; all corruption gates bite. 29/29 pass. Docs: SERIALIZATIONS.md §4.12 ratifies the isamplesorg#276 count contract (membership / "anywhere in tree") served by this index. Refs isamplesorg#305 isamplesorg#306 isamplesorg#304 isamplesorg#276
… (Codex round 1) Address Codex adversarial review of the sample_facet_index Phase-1 diff: - #1 staleness gate: validator now INDEPENDENTLY recomputes both build_id halves from the written siblings (coverage from sample_facets_v2, membership from sample_facet_membership) and asserts equality — a stale or hand-edited coverage id now FAILS at build time even though node_bits is unchanged. Runtime explorer handshake documented as a Phase 2 item. - #2 validator metadata/NULL gates: schema_version must equal the exact contract version (999 was accepted); build_id must match the exact shape <xor>_<sum>_<cnt>:<xor>_<sum>_<cnt> ("a colon somewhere" accepted <m>:bogus:extra); NOT-NULL contract on pid/masks/build_id/version; zero-mask checks use IS DISTINCT FROM 0 (a NULL mask escaped the old <> 0). - #3 independent mask check: re-derive masks from membership + node_bits for EVERY index pid (incl. zero rows) and symmetric-diff — no longer trusts the optional sample_facet_masks file. - #4 orphan artifact: facet_node_bits is force-emitted whenever masks/index is built, so --only sample_facet_index can't ship an uninterpretable mask file. - #5 fingerprint robustness: shared MEMBERSHIP/COVERAGE token exprs; XOR+SUM+ COUNT trio (resists XOR cancellation / 2-row swaps); NULL source encoded with a sentinel byte so NULL≠''; honest non-cryptographic caveat. - SQL gap: reject NULL pids in build_base_tables (a single NULL slipped the dup check, broke pid joins, and was absent from the coverage hash). Tests: +5 adversarial (null mask, schema_version=999, malformed build_id, stale coverage id, null pid) — all gates bite. 34/34 pass. Refs isamplesorg#305 isamplesorg#306
… (Codex round 2) - r2 #1: index validation now FAILS CLOSED when sample_facet_membership or facet_node_bits is absent — those siblings are what the build_id recompute and mask re-derivation depend on, so a missing sibling must gate, not run a silent partial pass (Codex reproduced a corrupt index + restamped node_bits passing once the siblings were removed). They're always co-produced, so --dir/--tag satisfies it. - r2 #2: assert node_bits has EXACTLY ONE non-NULL build_id before the membership-half match — 0 or >1 distinct ids previously skipped the match entirely (fail-open). Codex r2 confirmed: builder _fingerprint() == validator _fp() exactly; coverage-from-facets_v2 == samp_geo; the three round-1 bypasses stay rejected; node_bits force-emit has no double-build and masks/index can't ship without node_bits (even with --skip). Tests: +2 (missing-siblings fail-closed, multi-id node_bits). 36/36 pass. Refs isamplesorg#305 isamplesorg#306
…h index (Codex round 3) Codex r3 caught a regression my own fixes introduced: --only sample_facet_index emitted index + node_bits but NOT sample_facet_membership, so the new validator gate (which requires membership as its independent oracle) would reject that output. Index validation legitimately also needs sample_facets_v2 (the isamplesorg#306 completeness oracle), so --only sample_facet_index is a PARTIAL rebuild against an existing full build, by design. - force_dep(): whenever masks or index is built, force-emit BOTH sample_facet_membership and facet_node_bits (the bundle the index validator depends on), exactly once, recorded for the manifest. - test now proves the real workflow: full build -> --only sample_facet_index rebuild into the same dir -> re-validate PASSES (not just file-existence). Codex r3 confirmed both round-2 fail-open gaps stay closed (rc=1 on every bypass) with no new skip path. 36/36 pass. Refs isamplesorg#305 isamplesorg#306
…s bundle (Codex round 4) Delete index+node_bits+membership after the full build and before the --only sample_facet_index rebuild, so a stale full-build file can't satisfy the existence checks. Retains the re-emission + validator-pass assertions. Codex r4 confirmed the implementation correct: index-only and masks-only emit membership+node_bits exactly once, no duplicate emissions in a full build, partial-rebuild validation passes, manifest records each forced dep once, and no invocation produces masks/index without node_bits+membership. 36/36 pass. Refs isamplesorg#305 isamplesorg#306
…atible with deployed node_bits The Codex-r1 hardening switched membership_build_id to the XOR+SUM+COUNT trio — but that id is a DEPLOYED CONTRACT: the live 202608 facet_node_bits/sample_facet_masks carry it as a bare bit_xor decimal (e.g. 11317573279759780618), and the explorer's facetIndexReady preflight matches the index's membership-half against the deployed node_bits.build_id. The trio format would never match, leaving multi-filter counts permanently "unavailable" against already-published artifacts. Revert membership_build_id to the bare order-independent XOR (grain is unique → no cancellation, as the original argued). Keep the richer trio ONLY for coverage_build_id (a NEW id, no compat constraint). Index build_id is therefore "<m_xor>:<c_xor>_<c_sum>_<c_cnt>". Validator recomputes the membership half with bare XOR (_xor_fp) and the well-formedness regex matches the asymmetric shape. Result: a freshly-built index's membership-half == the DEPLOYED node_bits.build_id, so the index can be published standalone (new filename, non-destructive) without republishing node_bits/masks/membership. 36/36 pass. Refs isamplesorg#305 isamplesorg#306
… global view via mask index The single-filter cube (isamplesorg#298) serves correct cross-filtered counts for exactly ONE active filter at global view. With 2+ filters, effectiveSingleFilter() returns null, the cube is skipped, and updateCrossFilteredCounts falls to the legacy early-return that reverts every tree dim to the UNFILTERED baseline (explorer.qmd) — the isamplesorg#304 bug (e.g. Material=anthropogenic + Object=artifact showed 261,086 artifacts instead of the cross-filtered 145,770). Fix: route the global / full-tree-mode / no-search / multi-filter case through the complete per-pid index (sample_facet_index, isamplesorg#305/isamplesorg#306) with the node_bits bitmask predicate: - directFilterSnapshot(): reads ALL selections from the controls WITHOUT zeroing tree dims at global view (that zeroing in describeCrossFilters is the root cause). - maskIndexWhere(): per-dim cross-filter predicate (OR within a dim via the bit mask, AND across dims, exclude-self); source-impossible zeroes tree targets but not source's own histogram (the plan's special case). - applyMaskIndexCounts(): one columnar bitmask query per dim, applied ATOMICALLY after a single facetCountsReqId stale check; COUNT(*) over the one-row-per-pid index == distinct pids. - HONESTY RULE: on query failure → 'unavailable' (a "(—)" dash, .count-unavailable), NEVER the baseline. If the index/bitmap isn't usable yet (unpublished) → 'fallthrough' to the legacy paths unchanged, so this is safe to ship first. - facetIndexReady preflight: index present + schema_version==1 + build_id membership-half == node_bits generation; else stays unavailable, not wrong. Single-filter cube kept as a pure optimization. Viewport + search still take the live membership path (Phase 3). Verified on real 202608 data: index covers 6,026,242 located pids vs masks' 5,996,325 — exactly the 29,917 isamplesorg#306 samples recovered; full validator passes; Eric's case returns 145,770 (cross-filtered) not 261,086 (baseline). Refs isamplesorg#305 isamplesorg#304 isamplesorg#306
…y, boot race, coverage, decoupling) - P1 fallthrough reintroduced isamplesorg#304: at global + full tree + active filters there is NO correct legacy path (it reverts to baseline), so 'unavailable' AND 'fallthrough' now both show the dash, never legacy baseline. + defensive stale check before markFacetCountsUnavailable. - P1 boot race: facetIndexReady now calls window.__onFacetIndexReady() when it settles true; the main script registers it to refreshFacetCounts() if filters are active, so a URL-restored multi-filter that computed before readiness reconciles to correct counts instead of staying on the dash. - P1/P2 runtime coverage handshake: facetIndexReady now also verifies the index row count == facets (located universe) — catches stale/partial/mixed-cache indexes (the isamplesorg#306 drift class a membership-only id is blind to). Full fingerprint stays a build-time gate (SERIALIZATIONS §4.12). - P2 decouple node_bits from masks: nodeBitsReady publishes __nodeBitsMap + __nodeBitsBuild as soon as node_bits is valid (count path needs only these); __nodeBits (masks-gated) stays for facetFilterSQL. A missing masks file no longer disables a valid count bundle. Count path now reads __nodeBitsMap. - P2 WASM concurrency: the 4 per-dim queries are now ONE UNION ALL statement (one connection slot) instead of 4 concurrent full-index scans. - P2 SQL injection: R2_BASE (user-controllable via ?data_base=/localStorage) is rejected if it contains a quote/backslash/control char or isn't http(s)/root- relative, so it can't break out of read_parquet('...') in any query. Re-verified on real 202608 data: single UNION ALL query returns Eric's case as 145,770 cross-filtered across all dims (vs 261,086 baseline). Codex confirmed sound: COUNT(*) per concept == distinct pids; exclude-self / OR-within / AND-across / source-impossible / multi-value-single-dim semantics; single post-Promise stale check. WASM cold/warm browser benchmark remains the Phase 4 release gate. Refs isamplesorg#305 isamplesorg#304 isamplesorg#306
… source-only refresh (Codex r2) - P1: nodeBitsReady split into TWO try blocks — a masks-query failure no longer throws into the shared catch that nulled __nodeBitsMap/__nodeBitsBuild. node_bits publishes the count-path values in step 1; masks preflight is step 2 and only withholds __nodeBits (the FILTER signal) on failure. Count path now truly survives a missing/mismatched masks file. - P1: __onFacetIndexReady refreshes UNCONDITIONALLY — hasFacetFilters() excludes source, so a source-only filter that painted dashes before readiness now also reconciles. Debounce + reqId guard make a redundant no-filter refresh harmless. - P2 staleness: coverage handshake upgraded from bare row-count to per-SOURCE histogram equality (index vs facets_v3) — catches equal-cardinality PID replacement + source drift, not just a smaller index. Verified mismatch=0 on real 202608 data (GEOME/OPENCONTEXT/SESAR/SMITHSONIAN all match). Still a proxy, not the full fingerprint (build-time gate); generation-id match guards mask corruption. Codex r2 confirmed otherwise sound: no global active-filter route reaches baseline; UNION ALL exclude-self + dim regrouping correct; unqualified cols bind to index `s`; all R2_BASE forms pass; 36 py + 13 node tests pass. Refs isamplesorg#305 isamplesorg#304 isamplesorg#306
…laim (Codex r3) The per-source histogram handshake does NOT catch a same-source, same-cardinality PID swap (mismatch stays 0) — correct the overclaim. That residual is closed by (i) the generation-id match (the membership half hashes pid, so any swap of a pid WITH membership changes it — leaving only isamplesorg#306 no-membership pids) and (ii) the authoritative BUILD-TIME coverage fingerprint + pid-set equality gate. Runtime stays a cheap source/cardinality proxy by design (boot-query budget); a future co-published expected-id pointer would let it assert the full fingerprint. Codex r3 confirmed all P1s closed: masks failure never clears the count-path values; map/nbBuild correctly scoped; unconditional readiness refresh can't loop; coverage SQL valid + NULL-safe + cheap. Refs isamplesorg#305 isamplesorg#304 isamplesorg#306
…erage check cheap (boot-perf) In-browser DuckDB-WASM verification against the live index showed the per-source coverage handshake cost ~8.6s at boot — it scanned the 60MB facets_v3. Compare the index's per-source histogram to facet_summaries' source rows instead: the builder computes those as GROUP BY source over the SAME located set (samp_geo), the file is ~2KB and already loaded at boot, and it's verified equivalent (mismatch=0 on real 202608 data). This removes a multi-second connection hold at boot — exactly the single-connection starvation the app guards against — with no loss of coverage signal. Index source filter mirrors the builder's NULLIF(TRIM(source),'') so empty-string sources can't cause a false mismatch. Verified end-to-end in WASM: preflight PASS, Eric's case = 145,770, cold 1121ms / warm 174ms. Refs isamplesorg#305 isamplesorg#304
…viewport + search Phase 2 routed global, no-search multi-filter counts through sample_facet_index. Phase 3 extends the SAME path to the two cases it gated OUT: - VIEWPORT: the index has no coordinates, so a non-global view JOINs samples_map_lite (l.latitude/l.longitude) on pid (one row/pid → no grain change) and appends the same VIEWPORT_PAD_FACTOR=0.3 padded bbox the table, point loader and heatmap use. - SEARCH: appends the search_pids semi-join (s.pid IN (SELECT pid FROM search_pids)). Both fold into the single UNION ALL statement via a bare-predicate WHERE builder. Index columns are now s.-qualified because lite ALSO has a `source` column. Gate generalized: the index path now runs whenever there's ANY constraint (cross-filter, viewport, or search) in full tree mode. Fallback differs by view (honesty rule): at GLOBAL+no-search an index miss shows the dash (legacy = baseline = isamplesorg#304); at VIEWPORT/SEARCH an index miss falls through to the legacy membership path, which is CORRECT there. Both paths coexist — the membership count path is RETAINED this phase; retirement is deferred to a browser-verified follow-up (plan: retire only after parity is confirmed in-app). Characterization (native DuckDB parity, real 202608 data): object_type counts for material=anthropogenic scoped to a Mediterranean bbox are IDENTICAL between the index path and the membership path (COUNT(*) over the mask join == COUNT(DISTINCT pid) over membership). The search semi-join is structurally identical to the membership path's, so search parity follows from the proven mask≡membership set identity. Refs isamplesorg#305 isamplesorg#304
…llback (Codex P3) At global view WITH active search, an index miss/error fell through (atGlobalNoSearch was false because search was active) to the legacy tree path, which hits `if (treeActive && !bboxSQL) applyFacetCounts(d.key, null)` — the unfiltered BASELINE, ignoring search. Honesty violation. The correct pivot is whether a VIEWPORT BBOX exists, not whether search is active: at global (bboxSQL === null) there is no correct legacy tree path regardless of search, so an index miss now shows the dash. Fall through to the legacy membership path ONLY when a viewport bbox exists (where it is correct). Drops atGlobalNoSearch in favor of `bboxSQL === null`. Codex confirmed everything else sound: grain (lite join one-row-per-pid under the validated PID-set-equality contract), column binding, bbox regex (strips only the first leading AND, anti-meridian inner ANDs intact), escaping, viewport-only histograms, concurrency/atomicity. The requested regression test (global+search + forced index-unavailable → tree counts "(—)" not baseline) is recorded as a Phase 4 Playwright case (isamplesorg#43) — the count engine runs in-browser, so it needs the Playwright harness, not a headless unit test. Refs isamplesorg#305 isamplesorg#304
7c6b23d to
4aa989b
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Phase 3 of #305 — viewport + search facet counts via the mask index
🔗 Stacked on #308 (Phase 2) → #307 (Phase 1). Draft until the base PRs merge; the net-new change is
explorer.qmd.Phase 2 routed global, no-search multi-filter counts through
sample_facet_index. Phase 3 extends the same index path to the two cases it gated out:samples_map_lite(l.latitude/l.longitude) onpid(one row/pid → no grain change) and appends the sameVIEWPORT_PAD_FACTOR=0.3padded bbox the table / point loader / heatmap use.search_pidssemi-join.Both fold into the single
UNION ALLstatement via a bare-predicate WHERE builder. Index columns are nows.-qualified (samples_map_lite also has asourcecolumn).Gate + honesty
The index path now runs whenever there's any constraint (cross-filter, viewport, or search) in full tree mode. The fallback pivots on whether a viewport bbox exists, not on search:
Characterization (native DuckDB parity, real 202608 data)
object_typecounts formaterial=anthropogenicscoped to a Mediterranean bbox are IDENTICAL between the index path and the membership path (COUNT(*)over the mask join ==COUNT(DISTINCT pid)over membership). The search semi-join is structurally identical to the membership path's, so search parity follows from the proven mask≡membership set identity.Review
LGTM after 2 Codex rounds (the round-1 fix closed a global-search honesty defect where an index miss could show baseline tree counts).
Deferred to Phase 4
Playwright (incl. global+search forced-unavailable → "(—)" not baseline), CI validator gate, remove the legacy baseline fallback + retire the membership count path, WASM cold/warm benchmark, staging verify.
Refs #305 #304