#306/#305 Phase 1: complete per-pid facet index (sample_facet_index) #307

Open
rdhyee wants to merge 7 commits into isamplesorg:main from rdhyee:feat/306-complete-facet-index

rdhyee commented Jun 20, 2026


Phase 1 of #305 — the complete per-pid facet index (sample_facet_index)

Part of the #305 meta (facet-count correctness beyond the single-filter cube). This is the data/build foundation that the later phases consume; it also fixes the #306 data bug. Stacked PR series — this is PR 1 of 4.

Includes the joint Claude+Codex plan of record (PLAN_305_facet_counts.md) so the roadmap lands with the foundation.

Why

At the global/zoomed-out view, facet counts only cross-filter for a single active filter (served by the facet_tree_cross_filter cube). With 2+ selections the counts revert to the unfiltered baseline (#304), because the live multi-filter path scans a ~39M-row membership table, which stalls DuckDB-WASM. Phase 2 will route multi-filter counts through a mask-index histogram instead, but it needs a complete per-pid index to scan.

sample_facet_masks (#299) is built from membership, so it silently omits ~29,917 located samples that carry no tree concept (#306). Counting from it therefore undercounts the located universe.
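
To make the Phase 2 idea concrete, here is a minimal sketch of counting from a per-pid mask histogram. It is not the PR's implementation: the two facet names, the one-integer-mask-per-facet layout, the OR-within-facet / AND-across-facets semantics, and the sample rows are all assumptions for illustration. The all-zero row stands in for a located sample with no tree concept: it never appears under any node, but it still belongs in the located total.

```python
from collections import Counter

# Hypothetical layout: one (material_mask, context_mask) tuple per located pid.
# Bit i set means the sample sits anywhere under tree node i of that facet.
ROWS = [
    (0b011, 0b001),
    (0b001, 0b010),
    (0b100, 0b001),
    (0b011, 0b001),  # same masks as the first row
    (0b000, 0b000),  # located sample with no tree concept (zero-masked)
]
FACETS = ("material", "context")

# Histogram: distinct mask tuple -> number of pids carrying it.
HIST = Counter(ROWS)


def passes(masks, selection, skip=None):
    """True if the mask tuple satisfies every active filter except `skip`."""
    for i, facet in enumerate(FACETS):
        wanted = selection.get(facet, 0)
        if facet != skip and wanted and not (masks[i] & wanted):
            return False
    return True


def node_counts(facet, n_bits, selection):
    """Per-node counts for `facet`, cross-filtered by the other facets."""
    idx = FACETS.index(facet)
    counts = Counter()
    for masks, n in HIST.items():
        if passes(masks, selection, skip=facet):
            for bit in range(n_bits):
                if (masks[idx] >> bit) & 1:
                    counts[bit] += n
    return counts


two_filters = {"material": 0b001, "context": 0b001}
material_counts = node_counts("material", 3, two_filters)
context_counts = node_counts("context", 3, two_filters)
located_total = sum(HIST.values())  # includes the zero-masked row
```

The scan touches only the distinct mask tuples, not one row per (pid, node) membership pair, which is the property the paragraph above relies on.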

What this PR adds

Immutability

New artifact name under a new versioned tag — never overwrites the cached 202608 files.

Tests & review

  • 36/36 fixture tests pass, including a located no-membership sample (present + zero-masked in the index, absent from masks) and adversarial gates for every corruption (stale coverage id, schema_version=999, malformed build_id, NULL mask, missing siblings, multi-id node_bits, NULL pid).
  • Iterated to convergence with OpenAI Codex (4 adversarial review rounds): coverage-staleness gate, validator metadata/NULL bypasses, fail-open-without-siblings, and an orphan-bundle regression were all found and closed, each locked in by a test.

Not in this PR (later phases)

Refs #305 #306 #304 #276

rdhyee and others added 6 commits June 19, 2026 21:51
Joint Claude+Codex plan for computing multi-filter facet counts from the
isamplesorg#299 bitmask index (avoiding the 39M-row membership scan). Covers the
masks histogram approach, native DuckDB benchmarks, source handling, the
isamplesorg#306 missing-pid prerequisite, the honesty rule (never baseline under
active filters), a 4-phase rollout, and risks.

Refs isamplesorg#305, isamplesorg#304, isamplesorg#306, isamplesorg#276.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…_facet_index)

Build sample_facet_index: one row per LOCATED sample (samp_geo), not only
those with tree membership. sample_facet_masks is built FROM membership and
silently omits ~29,917 located samples with no tree concept (isamplesorg#306); the
multi-filter global-view count path (isamplesorg#304/isamplesorg#305) must count over the whole
located universe, so it scans this complete index instead.

- start from samp_geo, LEFT JOIN membership-derived masks, zero-mask the
  no-membership pids (correct: in no subtree, but still located + still
  counts toward source/total)
- add `source` VARCHAR (exclusive, not multi-valued -> not a mask)
- build_id = "<membership_id>:<coverage_id>": membership half matches
  facet_node_bits (mask-bit interpretation gate); coverage half fingerprints
  the samp_geo (pid, source) universe so isamplesorg#306-class drift can't go stale
  silently (today's membership-only id is blind to it)
- schema_version column for forward-compat
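
The first bullet (start from samp_geo, LEFT JOIN the masks, zero-fill) can be sketched as follows. This is an illustration only: it uses Python's stdlib sqlite3 as a stand-in for DuckDB, a single hypothetical `material_mask` column, and made-up pids and source names.

```python
import sqlite3
from collections import Counter

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Located universe: every located sample, with its exclusive source.
    CREATE TABLE samp_geo (pid TEXT PRIMARY KEY, source TEXT);
    -- Membership-derived masks: only pids that carry a tree concept.
    CREATE TABLE masks (pid TEXT PRIMARY KEY, material_mask INTEGER);
    INSERT INTO samp_geo VALUES ('p1', 'SRC_A'), ('p2', 'SRC_B'), ('p3', 'SRC_A');
    INSERT INTO masks VALUES ('p1', 3), ('p2', 1);
""")

# One row per located pid; pids without membership get an all-zero mask.
index_rows = con.execute("""
    SELECT g.pid, g.source, COALESCE(m.material_mask, 0) AS material_mask
    FROM samp_geo AS g
    LEFT JOIN masks AS m ON m.pid = g.pid
    ORDER BY g.pid
""").fetchall()

# The zero-masked pid still counts toward its source and the total.
per_source = Counter(source for _, source, _ in index_rows)
```

Here `p3` has no row in `masks`, so an INNER JOIN would drop it; the LEFT JOIN keeps it with mask 0.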

Validator: --index gate asserts schema, one-row-per-pid, pid==facets_v2
(located universe completeness, isamplesorg#306), source==facets_v2, structured single
build_id matching node_bits, masks bit-identical to sample_facet_masks for
shared pids, and zero-mask for every no-membership pid.
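
A rough sketch of the shape of these gates, using plain Python collections rather than the real validator. The function name, argument shapes, and messages are hypothetical; only the checks themselves (one row per pid, pid set equals the located universe, masks identical for shared pids, zero mask for the rest) come from the description above.

```python
def validate_index(index_rows, located_pids, facet_masks):
    """Return a list of failures; an empty list means the gate passes.

    index_rows   : list of (pid, mask) tuples read from the index
    located_pids : set of pids in the located universe
    facet_masks  : dict pid -> mask from the membership-built masks file
    """
    failures = []
    pids = [pid for pid, _ in index_rows]
    if len(pids) != len(set(pids)):
        failures.append("duplicate pid in index")
    if set(pids) != set(located_pids):
        failures.append("index pids differ from the located universe")
    for pid, mask in index_rows:
        if pid in facet_masks:
            if mask != facet_masks[pid]:
                failures.append(f"{pid}: mask differs from sample_facet_masks")
        elif mask != 0:  # in Python, None != 0 is True as well
            failures.append(f"{pid}: no-membership pid must be zero-masked")
    return failures


good = [("p1", 3), ("p2", 1), ("p3", 0)]
ok = validate_index(good, {"p1", "p2", "p3"}, {"p1": 3, "p2": 1})
dropped = validate_index(good[:2], {"p1", "p2", "p3"}, {"p1": 3, "p2": 1})
```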

Tests: a located no-membership sample (root-only material) is present and
zero-masked in the index and absent from masks; the coverage id changes on a
source flip; all corruption gates bite. 29/29 pass.

Docs: SERIALIZATIONS.md §4.12 ratifies the isamplesorg#276 count contract (membership /
"anywhere in tree") served by this index.

Refs isamplesorg#305 isamplesorg#306 isamplesorg#304 isamplesorg#276
… (Codex round 1)

Address Codex adversarial review of the sample_facet_index Phase-1 diff:

- #1 staleness gate: validator now INDEPENDENTLY recomputes both build_id
  halves from the written siblings (coverage from sample_facets_v2,
  membership from sample_facet_membership) and asserts equality — a stale or
  hand-edited coverage id now FAILS at build time even though node_bits is
  unchanged. Runtime explorer handshake documented as a Phase 2 item.
- #2 validator metadata/NULL gates: schema_version must equal the exact
  contract version (999 was accepted); build_id must match the exact shape
  <xor>_<sum>_<cnt>:<xor>_<sum>_<cnt> ("a colon somewhere" accepted
  <m>:bogus:extra); NOT-NULL contract on pid/masks/build_id/version; zero-mask
  checks use IS DISTINCT FROM 0 (a NULL mask escaped the old <> 0).
- #3 independent mask check: re-derive masks from membership + node_bits for
  EVERY index pid (incl. zero rows) and symmetric-diff — no longer trusts the
  optional sample_facet_masks file.
- #4 orphan artifact: facet_node_bits is force-emitted whenever masks/index is
  built, so --only sample_facet_index can't ship an uninterpretable mask file.
- #5 fingerprint robustness: shared MEMBERSHIP/COVERAGE token exprs; XOR+SUM+
  COUNT trio (resists XOR cancellation / 2-row swaps); NULL source encoded with
  a sentinel byte so NULL≠''; honest non-cryptographic caveat.
- SQL gap: reject NULL pids in build_base_tables (a single NULL slipped the
  dup check, broke pid joins, and was absent from the coverage hash).
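
For #5, a minimal sketch of an order-independent XOR+SUM+COUNT fingerprint with a NULL sentinel. This is not the builder's `_fingerprint()`: the SHA-256 token, the 64-bit truncation, and the sentinel bytes are assumptions chosen for illustration, and like the original it is not a cryptographic integrity check.

```python
import hashlib

MASK64 = 2**64 - 1


def token(pid, source):
    """Hash one (pid, source) row to a 64-bit integer.

    A NULL source is encoded with a distinct leading byte so that
    NULL and the empty string produce different tokens.
    """
    src = b"\x00" if source is None else b"\x01" + source.encode("utf-8")
    digest = hashlib.sha256(pid.encode("utf-8") + b"|" + src).digest()
    return int.from_bytes(digest[:8], "big")


def fingerprint(rows):
    """Return '<xor>_<sum>_<cnt>' over the rows, independent of row order."""
    xor = total = count = 0
    for pid, source in rows:
        t = token(pid, source)
        xor ^= t                      # cancels if the same token appears twice
        total = (total + t) & MASK64  # SUM does not cancel, so it backs up XOR
        count += 1
    return f"{xor}_{total}_{count}"


rows = [("p1", "SRC_A"), ("p2", None), ("p3", "SRC_B")]
fp = fingerprint(rows)
```

XOR alone returns to its previous value when the same token is added twice; SUM and COUNT both change in that case, which is the weakness the trio is meant to cover.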

Tests: +5 adversarial (null mask, schema_version=999, malformed build_id,
stale coverage id, null pid) — all gates bite. 34/34 pass.
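
The NULL-mask case is easy to reproduce in isolation. The sketch below uses stdlib sqlite3, where the null-safe operator is spelled `IS NOT` (DuckDB's `IS DISTINCT FROM` plays the same role); the table and column names are made up.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Two no-membership pids; both should carry mask 0, but p2 is NULL.
    CREATE TABLE idx (pid TEXT, mask INTEGER);
    INSERT INTO idx VALUES ('p1', 0), ('p2', NULL);
""")

# NULL <> 0 evaluates to NULL, so the corrupt row is silently filtered out.
naive_violations = con.execute(
    "SELECT COUNT(*) FROM idx WHERE mask <> 0"
).fetchone()[0]

# The null-safe comparison treats NULL as different from 0 and reports it.
null_safe_violations = con.execute(
    "SELECT COUNT(*) FROM idx WHERE mask IS NOT 0"
).fetchone()[0]
```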

Refs isamplesorg#305 isamplesorg#306
… (Codex round 2)

- r2 #1: index validation now FAILS CLOSED when sample_facet_membership or
  facet_node_bits is absent — those siblings are what the build_id recompute
  and mask re-derivation depend on, so a missing sibling must gate, not run a
  silent partial pass (Codex reproduced a corrupt index + restamped node_bits
  passing once the siblings were removed). They're always co-produced, so
  --dir/--tag satisfies it.
- r2 #2: assert node_bits has EXACTLY ONE non-NULL build_id before the
  membership-half match — 0 or >1 distinct ids previously skipped the match
  entirely (fail-open).
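
Both round-2 fixes follow the same fail-closed pattern, sketched here with hypothetical names and plain Python values instead of the real files. A missing sibling, or anything other than exactly one non-NULL build_id in node_bits, returns a failure instead of skipping the check.

```python
REQUIRED_SIBLINGS = {"sample_facet_membership", "facet_node_bits"}


def check_bundle(present, node_bits_build_ids, index_membership_half):
    """Return (rc, reason); rc is 0 only when every gate actually ran and passed."""
    missing = REQUIRED_SIBLINGS - set(present)
    if missing:
        # Fail closed: without the siblings the recompute cannot run at all.
        return 1, f"missing siblings: {sorted(missing)}"
    distinct = {b for b in node_bits_build_ids if b is not None}
    if len(distinct) != 1:
        # 0 or >1 distinct ids must fail rather than skip the match.
        return 1, f"expected exactly one build_id, found {len(distinct)}"
    if next(iter(distinct)) != index_membership_half:
        return 1, "membership half does not match node_bits build_id"
    return 0, "ok"


full = {"sample_facet_index", "sample_facet_membership", "facet_node_bits"}
rc_ok, _ = check_bundle(full, ["42", "42"], "42")
rc_missing, _ = check_bundle({"sample_facet_index"}, ["42"], "42")
rc_multi, _ = check_bundle(full, ["42", "43"], "42")
rc_none, _ = check_bundle(full, [None, None], "42")
```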

Codex r2 confirmed: builder _fingerprint() == validator _fp() exactly;
coverage-from-facets_v2 == samp_geo; the three round-1 bypasses stay rejected;
node_bits force-emit has no double-build and masks/index can't ship without
node_bits (even with --skip).

Tests: +2 (missing-siblings fail-closed, multi-id node_bits). 36/36 pass.

Refs isamplesorg#305 isamplesorg#306
…h index (Codex round 3)

Codex r3 caught a regression my own fixes introduced: --only sample_facet_index
emitted index + node_bits but NOT sample_facet_membership, so the new validator
gate (which requires membership as its independent oracle) would reject that
output. Index validation legitimately also needs sample_facets_v2 (the isamplesorg#306
completeness oracle), so --only sample_facet_index is a PARTIAL rebuild against
an existing full build, by design.

- force_dep(): whenever masks or index is built, force-emit BOTH
  sample_facet_membership and facet_node_bits (the bundle the index validator
  depends on), exactly once, recorded for the manifest.
- test now proves the real workflow: full build -> --only sample_facet_index
  rebuild into the same dir -> re-validate PASSES (not just file-existence).
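
The dependency rule can be pictured as a small planning step. `force_dep()` is named in this commit, but the function below is only a hypothetical sketch of the behaviour described (force both siblings once whenever masks or index is requested, and record what was forced); the real signature and build order are not shown in this PR text.

```python
BUNDLE = ("sample_facet_membership", "facet_node_bits")
TRIGGERS = {"sample_facet_masks", "sample_facet_index"}


def plan_build(requested):
    """Return (artifacts_to_emit, forced_deps) for a list of requested artifacts.

    Each artifact appears at most once in the plan, and forced_deps lists
    only the siblings that were added because a trigger was requested.
    """
    plan = list(dict.fromkeys(requested))  # de-duplicate, keep order
    forced = []
    if TRIGGERS & set(plan):
        for dep in BUNDLE:
            if dep not in plan:
                forced.append(dep)
    # Assumption: dependencies are emitted before the artifacts that need them.
    return forced + plan, forced


only_index, forced_for_index = plan_build(["sample_facet_index"])
full_build, forced_for_full = plan_build(
    ["sample_facet_membership", "facet_node_bits",
     "sample_facet_masks", "sample_facet_index"]
)
```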

Codex r3 confirmed both round-2 fail-open gaps stay closed (rc=1 on every
bypass) with no new skip path. 36/36 pass.

Refs isamplesorg#305 isamplesorg#306
…s bundle (Codex round 4)

Delete index+node_bits+membership after the full build and before the
--only sample_facet_index rebuild, so a stale full-build file can't satisfy
the existence checks. Retains the re-emission + validator-pass assertions.

Codex r4 confirmed the implementation correct: index-only and masks-only emit
membership+node_bits exactly once, no duplicate emissions in a full build,
partial-rebuild validation passes, manifest records each forced dep once, and
no invocation produces masks/index without node_bits+membership.

36/36 pass.

Refs isamplesorg#305 isamplesorg#306
…atible with deployed node_bits

The Codex-r1 hardening switched membership_build_id to the XOR+SUM+COUNT trio —
but that id is a DEPLOYED CONTRACT: the live 202608 facet_node_bits/sample_facet_masks
carry it as a bare bit_xor decimal (e.g. 11317573279759780618), and the explorer's
facetIndexReady preflight matches the index's membership-half against the deployed
node_bits.build_id. The trio format would never match, leaving multi-filter counts
permanently "unavailable" against already-published artifacts.

Revert membership_build_id to the bare order-independent XOR (grain is unique →
no cancellation, as the original argued). Keep the richer trio ONLY for
coverage_build_id (a NEW id, no compat constraint). Index build_id is therefore
"<m_xor>:<c_xor>_<c_sum>_<c_cnt>". Validator recomputes the membership half with
bare XOR (_xor_fp) and the well-formedness regex matches the asymmetric shape.
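
A small sketch of the asymmetric shape check and the compatibility test against deployed node_bits. The regex and helper names are illustrative, not the validator's; it assumes every component is a plain decimal integer, which the commit confirms only for the membership half.

```python
import re

# "<m_xor>:<c_xor>_<c_sum>_<c_cnt>": bare XOR on the left, trio on the right.
BUILD_ID_RE = re.compile(r"(\d+):(\d+)_(\d+)_(\d+)")


def membership_half(build_id):
    """Return the membership half, or raise ValueError if the shape is wrong."""
    match = BUILD_ID_RE.fullmatch(build_id)
    if match is None:
        raise ValueError(f"malformed build_id: {build_id!r}")
    return match.group(1)


def compatible(index_build_id, node_bits_build_id):
    """True if the index can be interpreted with the given node_bits."""
    return membership_half(index_build_id) == node_bits_build_id


deployed = "11317573279759780618"  # example value quoted in this commit
fresh_index_id = f"{deployed}:7_8_9"  # coverage trio values are made up
is_compatible = compatible(fresh_index_id, deployed)
```

With the symmetric trio format on both halves, `membership_half` would reject the id outright, which mirrors why the trio could never match the deployed bare-XOR value.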

Result: a freshly-built index's membership-half == the DEPLOYED node_bits.build_id,
so the index can be published standalone (new filename, non-destructive) without
republishing node_bits/masks/membership. 36/36 pass.

Refs isamplesorg#305 isamplesorg#306