#306/#305 Phase 1: complete per-pid facet index (sample_facet_index) #307
Open
rdhyee wants to merge 7 commits into
Joint Claude+Codex plan for computing multi-filter facet counts from the isamplesorg#299 bitmask index (avoiding the 39M-row membership scan). Covers the masks histogram approach, native DuckDB benchmarks, source handling, the isamplesorg#306 missing-pid prerequisite, the honesty rule (never baseline under active filters), a 4-phase rollout, and risks. Refs isamplesorg#305, isamplesorg#304, isamplesorg#306, isamplesorg#276. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…_facet_index)

Build sample_facet_index: one row per LOCATED sample (samp_geo), not only those with tree membership. sample_facet_masks is built FROM membership and silently omits ~29,917 located samples with no tree concept (isamplesorg#306); the multi-filter global-view count path (isamplesorg#304/isamplesorg#305) must count over the whole located universe, so it scans this complete index instead.

- start from samp_geo, LEFT JOIN membership-derived masks, zero-mask the no-membership pids (correct: in no subtree, but still located + still counts toward source/total)
- add `source` VARCHAR (exclusive, not multi-valued -> not a mask)
- build_id = "<membership_id>:<coverage_id>": membership half matches facet_node_bits (mask-bit interpretation gate); coverage half fingerprints the samp_geo (pid, source) universe so isamplesorg#306-class drift can't go stale silently (today's membership-only id is blind to it)
- schema_version column for forward-compat

Validator: --index gate asserts schema, one-row-per-pid, pid==facets_v2 (located universe completeness, isamplesorg#306), source==facets_v2, structured single build_id matching node_bits, masks bit-identical to sample_facet_masks for shared pids, and zero-mask for every no-membership pid.

Tests: a located no-membership sample (root-only material) is present + zero masked in the index and absent from masks; coverage-id changes on source flip; all corruption gates bite. 29/29 pass.

Docs: SERIALIZATIONS.md §4.12 ratifies the isamplesorg#276 count contract (membership / "anywhere in tree") served by this index.

Refs isamplesorg#305 isamplesorg#306 isamplesorg#304 isamplesorg#276
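The build step in this commit (start from samp_geo, LEFT JOIN the masks, zero-mask the pids with no membership) can be illustrated with a small sketch. It is not the PR's builder: it uses Python's stdlib `sqlite3` as a stand-in for DuckDB, three made-up rows with placeholder source names, and it omits the `build_id` and `schema_version` columns.

```python
import sqlite3

# Stand-in engine: stdlib sqlite3 instead of DuckDB; the rows are toy data.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE samp_geo (pid TEXT PRIMARY KEY, source TEXT);
CREATE TABLE sample_facet_masks (
    pid TEXT PRIMARY KEY,
    material_mask INTEGER,
    context_mask INTEGER,
    object_type_mask INTEGER
);
INSERT INTO samp_geo VALUES ('p1', 'SRC_A'), ('p2', 'SRC_B'), ('p3', 'SRC_C');
-- p3 is located but has no tree membership, so it has no masks row.
INSERT INTO sample_facet_masks VALUES ('p1', 5, 1, 2), ('p2', 2, 0, 4);

-- One row per located sample; a missing masks row becomes all-zero masks.
CREATE TABLE sample_facet_index AS
SELECT g.pid,
       g.source,
       COALESCE(m.material_mask, 0)    AS material_mask,
       COALESCE(m.context_mask, 0)     AS context_mask,
       COALESCE(m.object_type_mask, 0) AS object_type_mask
FROM samp_geo AS g
LEFT JOIN sample_facet_masks AS m ON m.pid = g.pid;
""")

index_rows = con.execute(
    "SELECT pid, source, material_mask, context_mask, object_type_mask "
    "FROM sample_facet_index ORDER BY pid"
).fetchall()
masks_pids = [r[0] for r in con.execute(
    "SELECT pid FROM sample_facet_masks ORDER BY pid")]

for row in index_rows:
    print(row)
```

In this toy data `p3` appears in the index with zero masks while staying absent from `sample_facet_masks`, which is the situation the commit's no-membership test describes.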
… (Codex round 1)

Address Codex adversarial review of the sample_facet_index Phase-1 diff:

- #1 staleness gate: validator now INDEPENDENTLY recomputes both build_id halves from the written siblings (coverage from sample_facets_v2, membership from sample_facet_membership) and asserts equality; a stale or hand-edited coverage id now FAILS at build time even though node_bits is unchanged. Runtime explorer handshake documented as a Phase 2 item.
- #2 validator metadata/NULL gates: schema_version must equal the exact contract version (999 was accepted); build_id must match the exact shape <xor>_<sum>_<cnt>:<xor>_<sum>_<cnt> ("a colon somewhere" accepted <m>:bogus:extra); NOT-NULL contract on pid/masks/build_id/version; zero-mask checks use IS DISTINCT FROM 0 (a NULL mask escaped the old <> 0).
- #3 independent mask check: re-derive masks from membership + node_bits for EVERY index pid (incl. zero rows) and symmetric-diff; no longer trusts the optional sample_facet_masks file.
- #4 orphan artifact: facet_node_bits is force-emitted whenever masks/index is built, so --only sample_facet_index can't ship an uninterpretable mask file.
- #5 fingerprint robustness: shared MEMBERSHIP/COVERAGE token exprs; XOR+SUM+COUNT trio (resists XOR cancellation / 2-row swaps); NULL source encoded with a sentinel byte so NULL≠''; honest non-cryptographic caveat.
- SQL gap: reject NULL pids in build_base_tables (a single NULL slipped the dup check, broke pid joins, and was absent from the coverage hash).

Tests: +5 adversarial (null mask, schema_version=999, malformed build_id, stale coverage id, null pid); all gates bite. 34/34 pass.

Refs isamplesorg#305 isamplesorg#306
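The fingerprint idea in item #5 (an order-independent XOR+SUM+COUNT trio with a sentinel for NULL sources) can be sketched in plain Python. This is an illustration only: the per-row hash (`blake2b`), the sentinel byte values, and the wrap of the sum to 64 bits are all assumptions, and the PR's actual token expressions live in SQL.

```python
import hashlib
from typing import Iterable, Optional, Tuple

MASK64 = (1 << 64) - 1
NULL_SENTINEL = b"\x00"  # assumed marker for a NULL source
VALUE_PREFIX = b"\x01"   # assumed marker for a present source, so '' != NULL


def token(pid: str, source: Optional[str]) -> int:
    """Hash one (pid, source) row to a 64-bit integer (hash choice is an assumption)."""
    src = NULL_SENTINEL if source is None else VALUE_PREFIX + source.encode("utf-8")
    digest = hashlib.blake2b(pid.encode("utf-8") + b"\x1f" + src, digest_size=8).digest()
    return int.from_bytes(digest, "big")


def fingerprint(rows: Iterable[Tuple[str, Optional[str]]]) -> str:
    """Combine row tokens into '<xor>_<sum>_<cnt>'; every part ignores row order."""
    xor_acc, sum_acc, count = 0, 0, 0
    for pid, source in rows:
        t = token(pid, source)
        xor_acc ^= t
        sum_acc = (sum_acc + t) & MASK64
        count += 1
    return f"{xor_acc}_{sum_acc}_{count}"


rows = [("p1", "SRC_A"), ("p2", "SRC_B"), ("p3", None)]
base = fingerprint(rows)

# Row order does not matter.
same_when_reordered = fingerprint(list(reversed(rows))) == base

# A NULL source and an empty-string source hash differently.
null_differs_from_empty = fingerprint([("p3", None)]) != fingerprint([("p3", "")])

# Appending the same row twice cancels in the XOR part alone,
# but the count (and normally the sum) still changes.
with_dup = fingerprint(rows + [("p9", "SRC_A"), ("p9", "SRC_A")])
xor_alone_is_blind = with_dup.split("_")[0] == base.split("_")[0]
trio_detects_dup = with_dup != base
```

As the commit notes, a fingerprint of this kind is not cryptographic; it is meant to catch accidental drift, not a deliberate forgery.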
… (Codex round 2)

- r2 #1: index validation now FAILS CLOSED when sample_facet_membership or facet_node_bits is absent. Those siblings are what the build_id recompute and mask re-derivation depend on, so a missing sibling must gate, not run a silent partial pass (Codex reproduced a corrupt index + restamped node_bits passing once the siblings were removed). They're always co-produced, so --dir/--tag satisfies it.
- r2 #2: assert node_bits has EXACTLY ONE non-NULL build_id before the membership-half match; 0 or >1 distinct ids previously skipped the match entirely (fail-open).

Codex r2 confirmed: builder _fingerprint() == validator _fp() exactly; coverage-from-facets_v2 == samp_geo; the three round-1 bypasses stay rejected; node_bits force-emit has no double-build and masks/index can't ship without node_bits (even with --skip).

Tests: +2 (missing-siblings fail-closed, multi-id node_bits). 36/36 pass.

Refs isamplesorg#305 isamplesorg#306
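The two round-2 gates follow a fail-closed pattern: a check that cannot run counts as a failure, not a pass. A minimal sketch of that pattern follows. The function names and the `.parquet` filenames are hypothetical and are not taken from the PR's validator.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List, Optional

# Hypothetical filenames for the two siblings the index validation depends on.
REQUIRED_SIBLINGS = ("sample_facet_membership.parquet", "facet_node_bits.parquet")


def check_siblings(build_dir: Path) -> List[str]:
    """Return failure messages; a missing sibling is a failure, never a skipped check."""
    return [
        f"missing sibling: {name}"
        for name in REQUIRED_SIBLINGS
        if not (build_dir / name).exists()
    ]


def check_single_build_id(node_bits_build_ids: Iterable[Optional[str]]) -> List[str]:
    """Require exactly one distinct non-NULL build_id before any membership-half match."""
    distinct = {b for b in node_bits_build_ids if b is not None}
    if len(distinct) != 1:
        return [f"node_bits must carry exactly one build_id, found {len(distinct)}"]
    return []


with tempfile.TemporaryDirectory() as d:
    empty_dir_failures = check_siblings(Path(d))

no_ids = check_single_build_id([None, None])    # zero distinct ids: fail
two_ids = check_single_build_id(["a", "b"])     # more than one id: fail
one_id = check_single_build_id(["a", "a"])      # exactly one id: pass
```

The key design point is that both functions return a non-empty failure list in the cases that previously skipped the comparison.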
…h index (Codex round 3)

Codex r3 caught a regression my own fixes introduced: --only sample_facet_index emitted index + node_bits but NOT sample_facet_membership, so the new validator gate (which requires membership as its independent oracle) would reject that output. Index validation legitimately also needs sample_facets_v2 (the isamplesorg#306 completeness oracle), so --only sample_facet_index is a PARTIAL rebuild against an existing full build, by design.

- force_dep(): whenever masks or index is built, force-emit BOTH sample_facet_membership and facet_node_bits (the bundle the index validator depends on), exactly once, recorded for the manifest.
- test now proves the real workflow: full build -> --only sample_facet_index rebuild into the same dir -> re-validate PASSES (not just file-existence).

Codex r3 confirmed both round-2 fail-open gaps stay closed (rc=1 on every bypass) with no new skip path. 36/36 pass.

Refs isamplesorg#305 isamplesorg#306
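The dependency rule described for force_dep() (build masks or index, and the membership + node_bits bundle is emitted too, exactly once) can be sketched as a small planning function. `plan_emissions` is a hypothetical helper written for illustration; the real force_dep() signature and manifest format are not shown in this PR page.

```python
from typing import Iterable, List, Tuple

# Artifacts whose validation depends on the bundle below.
DEPENDENTS = {"sample_facet_masks", "sample_facet_index"}
BUNDLE = ("sample_facet_membership", "facet_node_bits")


def plan_emissions(requested: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (emit_order, forced): the bundle is added once if a dependent is requested."""
    requested = list(requested)
    emit: List[str] = []
    forced: List[str] = []
    if any(target in DEPENDENTS for target in requested):
        for dep in BUNDLE:
            if dep not in requested:
                forced.append(dep)  # would be recorded in the manifest as a forced dep
            emit.append(dep)
    for target in requested:
        if target not in emit:      # never emit the same artifact twice
            emit.append(target)
    return emit, forced


index_only = plan_emissions(["sample_facet_index"])
full_build = plan_emissions([
    "sample_facet_membership", "facet_node_bits",
    "sample_facet_masks", "sample_facet_index",
])
```

With this shape an index-only request still produces the bundle the validator needs, while a full build lists each artifact once and forces nothing.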
…s bundle (Codex round 4) Delete index+node_bits+membership after the full build and before the --only sample_facet_index rebuild, so a stale full-build file can't satisfy the existence checks. Retains the re-emission + validator-pass assertions. Codex r4 confirmed the implementation correct: index-only and masks-only emit membership+node_bits exactly once, no duplicate emissions in a full build, partial-rebuild validation passes, manifest records each forced dep once, and no invocation produces masks/index without node_bits+membership. 36/36 pass. Refs isamplesorg#305 isamplesorg#306
…atible with deployed node_bits The Codex-r1 hardening switched membership_build_id to the XOR+SUM+COUNT trio — but that id is a DEPLOYED CONTRACT: the live 202608 facet_node_bits/sample_facet_masks carry it as a bare bit_xor decimal (e.g. 11317573279759780618), and the explorer's facetIndexReady preflight matches the index's membership-half against the deployed node_bits.build_id. The trio format would never match, leaving multi-filter counts permanently "unavailable" against already-published artifacts. Revert membership_build_id to the bare order-independent XOR (grain is unique → no cancellation, as the original argued). Keep the richer trio ONLY for coverage_build_id (a NEW id, no compat constraint). Index build_id is therefore "<m_xor>:<c_xor>_<c_sum>_<c_cnt>". Validator recomputes the membership half with bare XOR (_xor_fp) and the well-formedness regex matches the asymmetric shape. Result: a freshly-built index's membership-half == the DEPLOYED node_bits.build_id, so the index can be published standalone (new filename, non-destructive) without republishing node_bits/masks/membership. 36/36 pass. Refs isamplesorg#305 isamplesorg#306
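The asymmetric build_id shape "<m_xor>:<c_xor>_<c_sum>_<c_cnt>" and the membership-half match can be sketched with a regular expression. This assumes every component is a plain decimal integer; the coverage numbers below are placeholders, and the function names are hypothetical rather than the validator's own.

```python
import re
from typing import Dict, Optional

# Bare XOR for the membership half, XOR+SUM+COUNT trio for the coverage half.
BUILD_ID_RE = re.compile(
    r"(?P<m_xor>[0-9]+):(?P<c_xor>[0-9]+)_(?P<c_sum>[0-9]+)_(?P<c_cnt>[0-9]+)"
)


def parse_build_id(build_id: str) -> Optional[Dict[str, str]]:
    """Return the four components, or None if the id is not exactly well-formed."""
    match = BUILD_ID_RE.fullmatch(build_id)
    return match.groupdict() if match else None


def membership_half_matches(index_build_id: str, node_bits_build_id: str) -> bool:
    """True when the index's membership half equals the node_bits build_id."""
    parsed = parse_build_id(index_build_id)
    return parsed is not None and parsed["m_xor"] == node_bits_build_id


deployed = "11317573279759780618"        # example id quoted in the commit message
good = f"{deployed}:123_456_3"           # placeholder coverage values
bogus = f"{deployed}:bogus:extra"        # the "a colon somewhere" case
old_trio = "1_2_3:4_5_6"                 # round-1 symmetric shape, no longer valid

results = {
    "good": membership_half_matches(good, deployed),
    "bogus": parse_build_id(bogus),
    "old_trio": parse_build_id(old_trio),
}
```

Using `fullmatch` rather than `search` is what rejects trailing or leading junk around an otherwise valid id.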
Phase 1 of #305: the complete per-pid facet index (sample_facet_index)
Part of the #305 meta (facet-count correctness beyond the single-filter cube). This is the data/build foundation that the later phases consume; it also fixes the #306 data bug. Stacked PR series: this is PR 1 of 4.
Why
At the global/zoomed-out view, facet counts only cross-filter for a single active filter (served by the facet_tree_cross_filter cube). With 2+ selections the counts revert to the unfiltered baseline (#304) because the live multi-filter path scanned a ~39M-row membership table that stalls DuckDB-WASM. Phase 2 will route multi-filter counts through a mask-index histogram instead, but it needs a complete per-pid index to scan. sample_facet_masks (#299) is built from membership, so it silently omits ~29,917 located samples that carry no tree concept (#306). Counting off it undercounts the located universe.
What this PR adds
- sample_facet_index: one row per located sample (from samp_geo), LEFT JOINed to the membership-derived masks and zero-masked for no-membership pids. Schema: pid, source, material_mask, context_mask, object_type_mask, build_id, schema_version.
- source is a plain VARCHAR (exclusive, not multi-valued → not a mask).
- build_id = "<membership_id>:<coverage_id>": the membership half gates mask-bit interpretation against facet_node_bits; the coverage half fingerprints the (pid, source) universe so #306-class drift can't go stale silently. The coverage fingerprint is an XOR+SUM+COUNT trio with a NULL-source sentinel; the membership half stays a bare XOR so it matches the deployed node_bits.build_id.
- Validator --index gates (an independent oracle): schema + NOT-NULL, one-row-per-pid, pid == facets_v2 (located-universe completeness, #306), source == facets_v2, exact schema_version, well-formed single build_id recomputed from siblings, masks re-derived from membership + node_bits for every pid (incl. zero rows), and fail-closed if siblings are absent.
- Force-emits the dependency bundle (membership + node_bits) with masks/index so a build never ships an orphan/unvalidatable artifact.
- SERIALIZATIONS.md §4.12 documents the artifact and ratifies the #276 count contract (membership / "anywhere in tree").
- Rejects NULL pids in build_base_tables.
Immutability
New artifact name under a new versioned tag; it never overwrites the cached 202608 files.
Tests & review
Not in this PR (later phases)
Refs #305 #306 #304 #276