diff --git a/PLAN_305_facet_counts.md b/PLAN_305_facet_counts.md new file mode 100644 index 0000000..22f4c7a --- /dev/null +++ b/PLAN_305_facet_counts.md @@ -0,0 +1,128 @@ +# Plan — #305 Facet-count correctness & coherence (counts beyond the single-filter cube) + +**Status:** plan of record · **Authors:** Claude + Codex (joint, 2026-06-19) · **Tracks:** #305 (meta), #304 (live bug), #306 (data bug), #276 (semantics), #230, #286 + +## Problem + +At the global/zoomed-out view, facet **counts** only cross-filter for a **single** active +filter (served by the precomputed `facet_tree_cross_filter` cube via `effectiveSingleFilter()`, +`explorer.qmd:3374`). With **2+ active selections** the cube is skipped and tree-dim counts +revert to the **unfiltered baseline** (`explorer.qmd:3557-3562`) — deliberately, to avoid a +~39M-row `sample_facet_membership` GROUP BY that stalls DuckDB-WASM. So the **map is correctly +filtered** (#299 bitmask) but the **count numbers are wrong** for multi-filter at world view. +Zoomed-in is already correct (live membership path, `explorer.qmd:3564-3598`). Reported by +Eric in #304 (e.g. *Material: Anthropogenic* + *Specimen Type: Artifact*). + +## Core idea — count per node FROM the bitmasks (no membership scan) + +For a count dimension D over a qualifying pid set Q: + +```sql +SELECT nb.concept_uri AS value, COUNT(*) AS count +FROM s +JOIN read_parquet('${node_bits_url}') nb ON nb.facet_type = '' + AND (s._mask & (1::BIGINT << nb.bit_index)) <> 0 +WHERE +GROUP BY nb.concept_uri +``` + +- Masks are **one row per pid** → `COUNT(*)` == `COUNT(DISTINCT pid)` (do **not** add `COUNT(DISTINCT)` — wasted work). +- Membership encodes **ancestor closure** → child bit ⇒ parent bit, so **parent ≥ child** holds for free. +- The bitwise join is against `facet_node_bits` (≤63 rows/dim), not the 39M-row membership table. + +**Cross-filter semantics (confirmed correct):** OR among selected nodes *within* a dimension; +AND *across* dimensions; **omit the target dimension's own predicate** (exclude-self), matching +the UI contract (`explorer.qmd:3305`) and the existing cube (`build_frontend_derived.py:460`). +Special case: **zero selected sources** ⇒ impossible filter for tree targets, but the source +dimension's own histogram must still exclude source and stay constrained by the three trees. + +## Feasibility (Codex native DuckDB benchmarks, real 6M-pid data) + +| Query | Time | +|---|---| +| One unfiltered 22-node histogram | ~0.17s | +| All three tree dims via one 56-node join | ~0.69s | +| Eric's Material+Specimen-Type → context histogram | ~0.02s | +| Embedded source predicate/group | ~0.014s | +| `facets_v3` semi-join (rejected approach) | ~0.145s | + +Strongly supports feasibility. **Worst case ≈ 6M × 56 = ~336M cheap bit-tests**, not merely a +6M-row scan → **WASM cold+warm benchmarking remains a hard release gate** (the app has documented +DuckDB-WASM concurrency/deadlock sensitivity). + +## Source handling + +Add `source` as a plain **VARCHAR** column to a **new complete per-pid index** (source is +exclusive, not multi-valued → a mask is wrong; a `facets_v3` semi-join adds a second 6M-row +read + hash join). New schema: + +``` +pid · source · material_mask · context_mask · object_type_mask · build_id · schema_version +``` + +## Data-correctness prerequisite (#306) + +The current `sample_facet_masks` is built from `membership`, so it **omits ~29,917 located +samples** with coordinates but no tree membership (`facets_v3` = 6,026,242 vs masks = 5,996,325). +The new index must start from **`samp_geo`**, LEFT JOIN masks, and emit **zero masks** for +no-membership pids. Tracked as **#306** (must land in Phase 1). + +## Honesty rule (non-negotiable) + +**Never fall back to the unfiltered baseline when filters are active** — that is exactly the +#304 bug. `applyFacetCounts(..., null)` means baseline (`explorer.qmd:1381`). On any +miss/error under active filters, show **"—" / "unavailable" / stale styling / hidden counts** — +never authoritative-looking baseline numbers. Apply all dimension maps **atomically after one +stale-guard check** (`facetCountsReqId`); never partially repaint dimensions. + +## Phased plan + +### Phase 1 — Artifact + semantic foundation +- Build the complete per-pid index above. Start from `samp_geo`; include zero-mask pids (#306). +- Add `source` VARCHAR; extend the build **fingerprint** to cover the pid/source universe + (today it fingerprints membership only → would go stale silently once source/coverage change). +- AI-free validator: pid completeness vs `facets_v3`; source equality; node-bit coverage; + mask ≡ membership for pids that have membership; build_id consistency. +- **Publish under a new immutable filename/tag — do NOT overwrite the cached `202608` artifact.** +- Ratify the count contract (#276) as **membership / "anywhere in tree"** — already the shipped + semantics, so this is confirmation, not a redesign. + +### Phase 2 — Fix #304 at global view +- Generalized **direct-control filter snapshot** (today's `describeCrossFilters()` zeroes tree + selections at global view — the root cause). Generalize `effectiveSingleFilter()` to all selections. +- Route **multi-filter / full-tree-mode / global / no-search** counts through one mask-index + histogram query (the `tree_counts ∪ source_counts` shape). +- Keep the single-filter cube as a **pure optimization**, not a required semantic path. +- Apply dimension maps atomically after the stale check. On failure → unavailable, **not** baseline. +- **Benchmark cold + warm DuckDB-WASM** across empty / narrow / broad / all-pid qualifying sets; + set a concrete interaction budget before release. + +### Phase 3 — Extend (not premature unification) +- Add **viewport** support (identical padded bbox, `VIEWPORT_PAD_FACTOR = 0.3`). +- Add **search** via `search_pids`. +- Characterization-test the new path against the existing membership path; **then** retire the + membership count implementation. (Immediate unification would balloon the first fix into + viewport + search + flat-mode + stale-update refactoring — defer.) + +### Phase 4 — Verification + cleanup +- Unit-test predicate generation (exclude-self, zero-source states). +- Data validator gates from Phase 1, run in CI. +- Playwright: Eric's exact URL, multi-value-in-one-dim, source+tree, impossible combos, + global↔viewport transitions, superseded requests. +- Remove the misleading active-filter baseline fallback everywhere. +- Codex adversarial review; staging verify on real R2 data; deploy smoke gate. + +## Constraints / landmines +- **63-bit ceiling:** current dims are 19 / 22 / 15 nodes — ample headroom. Keep the build + hard-fail; design a future multiword schema but **do not implement now**. +- **Flat / mixed render mode:** keep the mask path gated OFF (subtree-membership semantics are + wrong for a flat dim) — defer to existing paths, as the cube already does. +- **Immutable data:** new tag/filename; never overwrite a cached artifact. + +## Top 3 risks +1. **WASM latency / connection starvation** — credible SQL, unmeasured in-browser; ~336M bit-tests worst case. +2. **Stale / incomplete index generations** — #306 missing pids + membership-only build_id; mixed cached generations → convincing-but-false counts. +3. **Honesty & state-transition failures** — baseline-as-truth fallback + async camera transitions unless results applied atomically. + +--- +*Joint plan: Claude (code trace, drafting) + Codex (native benchmarks, source design, #306 discovery, honesty rule, phasing). 2026-06-19.* diff --git a/SERIALIZATIONS.md b/SERIALIZATIONS.md index b5a9be5..2488f09 100644 --- a/SERIALIZATIONS.md +++ b/SERIALIZATIONS.md @@ -122,6 +122,7 @@ builder — a fresh build is NOT bit-for-bit identical to them (see | `isamples_202601_sample_facets_v2.parquet` | `(pid, source, material, context, object_type, label, description, place_name)` — all VARCHAR scalars; each facet column is a single URI per sample (not an array) | 63 MB | 6.0 M | wide | Search Explorer multi-dim facet filtering | QUERY_SPEC §3.3, §5.1 | | `isamples_202601_facet_summaries.parquet` | Baseline `(facet_type, facet_value, scheme, count)` | 2 KB | 56 | wide | Every tutorial (instant initial facet counts) | QUERY_SPEC §3.3 tier 1 | | `isamples_202601_facet_cross_filter.parquet` | Pre-computed counts for single-filter cross-facet queries | 6 KB | 526 | wide | Search Explorer cross-filter UI | QUERY_SPEC §3.3 tier 2a | +| `_sample_facet_index.parquet` | Complete per-pid facet index `(pid, source, material_mask, context_mask, object_type_mask, build_id, schema_version)` — **one row per located sample**, including samples with no tree membership (zero-masked, #306). Scanned by the multi-filter global-view count path (#304/#305). | ~60 MB | 6.0 M | wide (membership + samp_geo) | Interactive Explorer multi-filter facet counts | §4.12 below | ### Tier: vocabulary labels @@ -296,6 +297,53 @@ for the alias when you want "latest." - **Consumer**: `scripts/enrich_wide_with_oc_thumbnails.py` uses OC narrow to fill thumbnails into 202604 unified wide. Also used directly in PQG benchmark work. - **Future**: these become the prototype upstream for per-source sidecars (see §3, bottom row). +### 4.12 `_sample_facet_index.parquet` (complete per-pid facet index, #305/#306) + +- **Role**: The single artifact the **multi-filter** global-view count path scans + (#304/#305). Where `sample_facet_masks` is built FROM `membership` and therefore + silently omits located samples that carry no tree concept (~29,917 in the 202608 + generation — **#306**), this index starts from `samp_geo` (the authoritative + located set) and is **complete**: one row per located `pid`. Samples with no + membership are present and **zero-masked**. +- **Headline schema** (7 cols): `pid (VARCHAR), source (VARCHAR), material_mask + (BIGINT), context_mask (BIGINT), object_type_mask (BIGINT), build_id (VARCHAR), + schema_version (INTEGER)`. `source` is a plain VARCHAR (source is exclusive, not + multi-valued, so a mask would be wrong). Each `*_mask` bit `(1 << bit_index)` is + set iff the pid is a member of that node, under the bit assignment in + `facet_node_bits`; membership encodes ancestor closure, so a parent bit is set for + the whole subtree. +- **`build_id` is structured** as `":"`: + - the **membership half** equals `facet_node_bits.build_id` — the explorer must + only interpret the mask bits when the two agree (else the bits are read under a + foreign assignment); + - the **coverage half** is a fingerprint of the `(pid, source)` universe over + `samp_geo`, so a changed source value or located-pid set (the #306 drift class) + changes `build_id` instead of going stale silently. +- **Count contract (ratifies #276)**: a facet count is **membership / "anywhere in + the tree"** — a sample counts toward a node if that node is anywhere in the + sample's asserted-concept subtree closure (OR within a dimension, AND across + dimensions, **excluding the target dimension's own predicate**). This is the + already-shipped cube/UI semantics; this index does not change it, it serves the + same contract to the *multi-filter* case the single-filter cube cannot reach. +- **Validation**: `validate_frontend_derived.py --index ` is an INDEPENDENT + oracle — it asserts schema + NOT-NULL columns, one row per pid, `pid` set == + `sample_facets_v2.pid` (the located universe), `source` == facets_v2 source, + `schema_version` == the exact contract version, a well-formed single `build_id`, + and then **recomputes** both `build_id` halves from the written sibling files (the + coverage half from `sample_facets_v2`, the membership half from + `sample_facet_membership`) and asserts equality — so a stale/edited coverage id is + caught at build time even though `node_bits` is unchanged. Masks are re-derived + directly from `membership` + `node_bits` for **every** index pid (zero rows + included) and symmetric-diffed against the file. +- **Runtime staleness handshake (Phase 2)**: the *explorer* loads the index by URL + and so cannot, on its own, know the expected coverage half. The load-time refusal + of a stale-coverage index requires an external trusted pointer (e.g. the expected + composite id co-published in the manifest / a `current` alias the explorer reads); + that handshake lands with the count-query consumer in Phase 2. Build-time + validation (above) is the gate for now. +- **Immutability**: published under a **new** versioned filename — it is a new + artifact name and never overwrites a cached `sample_facet_masks` or any prior tag. + ## 5. URL convention All substrate files live under `https://data.isamples.org/` — a diff --git a/scripts/build_frontend_derived.py b/scripts/build_frontend_derived.py index d92cff0..dd96baf 100755 --- a/scripts/build_frontend_derived.py +++ b/scripts/build_frontend_derived.py @@ -19,6 +19,7 @@ - {tag}_facet_summaries.parquet facet_type, facet_value, scheme, count - {tag}_facet_cross_filter.parquet filter_source/material/context/object_type, facet_type, facet_value, count - {tag}_wide_h3.parquet wide + h3_res4/6/8 (large; built only on --only wide_h3) + - {tag}_sample_facet_index.parquet pid, source, material_mask, context_mask, object_type_mask(BIGINT), build_id, schema_version(INT) — COMPLETE per-pid index over ALL located samples (incl. #306 no-membership pids, zero-masked); the multi-filter global-view count path (#304/#305) scans this - {tag}_manifest.json provenance + per-output rowcount/schema/sha256 MATERIAL SELECTION (issue #265/#271): the broad SKOS root @@ -64,11 +65,16 @@ ARTIFACTS = ["sample_facets_v2", "samples_map_lite", "h3_summaries", "facet_summaries", "facet_cross_filter", "wide_h3", "sample_facet_membership", "facet_tree_summaries", - "facet_tree_cross_filter", "facet_node_bits", "sample_facet_masks"] + "facet_tree_cross_filter", "facet_node_bits", "sample_facet_masks", + "sample_facet_index"] # #293: max tree nodes per dim that fit in a signed BIGINT mask (bits 0..62). # Live max is 22 (context); guard so a future vocab explosion fails loudly # instead of silently overflowing a mask bit. MASK_MAX_BITS = 63 +# #305/#306: schema version for the complete per-pid facet index. Bump when the +# column set / semantics of sample_facet_index change so the explorer can refuse +# an index it doesn't understand instead of mis-reading it. +INDEX_SCHEMA_VERSION = 1 # Shared SQL expression for sample_facets_v2.description (#277 part 2). # Appends space-joined concept labels (IC labels across all 4 concept dims) @@ -202,12 +208,19 @@ def build_base_tables(con, wide, t0): n_geo = con.sql("SELECT COUNT(*) FROM samp_geo").fetchone()[0] n_dup = con.sql("SELECT COUNT(*) FROM (SELECT pid FROM samp_geo GROUP BY pid HAVING COUNT(*)>1)").fetchone()[0] n_icdup = con.sql("SELECT COUNT(*) FROM (SELECT row_id FROM ic GROUP BY row_id HAVING COUNT(*)>1)").fetchone()[0] - log(f"samp={n_samp:,} samp_geo={n_geo:,} duplicate_pids={n_dup:,} duplicate_concept_row_ids={n_icdup:,}", t0) - if n_dup or n_icdup: - # HARD fail (Codex): non-unique keys make the output grain wrong (inflated + # #305/#306 (Codex): the GROUP BY...HAVING COUNT>1 dup check above collapses all + # NULL pids into ONE group, so a SINGLE NULL pid slips through — yet a NULL pid + # breaks every pid join (joins drop it) and is silently absent from the coverage + # fingerprint. Reject it explicitly. + n_null = con.sql("SELECT COUNT(*) FROM samp_geo WHERE pid IS NULL").fetchone()[0] + log(f"samp={n_samp:,} samp_geo={n_geo:,} duplicate_pids={n_dup:,} " + f"null_pids={n_null:,} duplicate_concept_row_ids={n_icdup:,}", t0) + if n_dup or n_icdup or n_null: + # HARD fail (Codex): non-unique/NULL keys make the output grain wrong (inflated # facet counts, ambiguous joins, non-total ORDER BY pid). Abort before writing. raise SystemExit( - f"FATAL: non-unique keys — duplicate_pids={n_dup}, duplicate_concept_row_ids={n_icdup}. " + f"FATAL: non-unique/NULL keys — duplicate_pids={n_dup}, null_pids={n_null}, " + f"duplicate_concept_row_ids={n_icdup}. " f"Output grain/joins would be wrong; refusing to write.") @@ -493,6 +506,32 @@ def build_facet_tree_cross_filter(con, out): ) TO '{out}' (FORMAT PARQUET, COMPRESSION ZSTD)""") +# --- generation fingerprints (#293, #305/#306) ----------------------------- +# The per-row token expressions are module constants so the VALIDATOR can +# recompute the exact same fingerprint independently from the written sibling +# files (membership, sample_facets_v2) and assert the index's build_id — i.e. +# the builder is not the sole authority on the index's generation identity. +# chr(31) (US) separates fields. A NULL source is encoded with a distinct +# sentinel byte so NULL and '' can never collide (Codex #5). +MEMBERSHIP_TOKEN_EXPR = "pid || chr(31) || facet_type || chr(31) || concept_uri" +COVERAGE_TOKEN_EXPR = ("pid || chr(31) || CASE WHEN source IS NULL THEN chr(0) ELSE chr(1) END " + "|| chr(31) || COALESCE(source, '')") + + +def _fingerprint(con, relation, token_expr): + # Order-independent generation fingerprint over `relation`. Combines THREE + # order-independent accumulators — XOR, SUM (HUGEINT, no overflow), and COUNT — + # of per-row hashes. XOR alone cancels identical rows and is vulnerable to a + # 2-row swap; SUM+COUNT defeat that. NOT a cryptographic digest: this defends + # against accidental drift / stale generations, not an adversary who engineers a + # multi-row hash collision. Grain is unique (validated), so the trio is stable. + x, s, n = con.sql( + f"SELECT COALESCE(bit_xor(hash({token_expr})), 0), " + f" COALESCE(SUM(hash({token_expr})::HUGEINT), 0), COUNT(*) " + f"FROM {relation}").fetchone() + return f"{x}_{s}_{n}" + + def membership_build_id(con): # #293 (Codex P1, r2): a fingerprint of the FULL membership generation — not # just the node set. Both node_bits (positional bit assignment) and masks are @@ -500,12 +539,46 @@ def membership_build_id(con): # change that would alter either artifact (new/dropped pids, re-mapped concepts, # AND node-set changes). Embedding this id in both lets the explorer refuse the # mask path unless the two are from the SAME generation (guards a stale-cached - # masks file). Order-independent XOR of per-row hashes (membership grain is - # unique per (pid,facet_type,concept_uri) — validated — so no XOR cancellation). - return con.sql(""" - SELECT CAST(COALESCE(bit_xor(hash(pid || chr(31) || facet_type || chr(31) || concept_uri)), 0) - AS VARCHAR) - FROM membership""").fetchone()[0] + # masks file). + # + # FORMAT IS A DEPLOYED CONTRACT (#305): the live facet_node_bits / sample_facet_masks + # (202608) carry this id as a BARE bit_xor decimal, and the explorer's facetIndexReady + # preflight matches the index's membership-half against the DEPLOYED node_bits.build_id. + # So this MUST stay the plain order-independent XOR — do NOT switch it to the + # _fingerprint trio (that would change the string and break generation-matching + # against every already-published artifact). membership grain is unique per + # (pid,facet_type,concept_uri) — validated — so no XOR cancellation. The richer + # trio is reserved for coverage_build_id, which is a NEW id with no compat constraint. + return con.sql( + f"SELECT CAST(COALESCE(bit_xor(hash({MEMBERSHIP_TOKEN_EXPR})), 0) AS VARCHAR) FROM membership" + ).fetchone()[0] + + +def coverage_build_id(con): + # #305/#306: fingerprint of the COMPLETE per-pid universe the index covers — + # the (pid, source) pairs over the LOCATED set (samp_geo), independent of tree + # membership. membership_build_id() alone is blind to this: an index built from + # the SAME membership but a changed source value or a changed located-pid set + # (exactly the #306 class of drift — located pids with no membership) would carry + # an identical membership id and go stale SILENTLY. NULL vs '' source are encoded + # distinctly (sentinel byte) so a NULL↔'' flip changes the fingerprint. + return _fingerprint(con, "samp_geo", COVERAGE_TOKEN_EXPR) + + +def index_build_id(con): + # #305/#306: the index's generation identity = ":". + # The two halves are deliberately separable so the explorer can BOTH: + # 1. gate the mask-bit interpretation by matching the membership half against + # facet_node_bits.build_id (the bit assignment is a pure function of + # membership — the index masks are only meaningful under the SAME node_bits + # generation), and + # 2. detect coverage/source drift via the coverage half (would otherwise be + # invisible to a membership-only id). + # FORMATS DIFFER BY DESIGN: membership half is the BARE bit_xor decimal (the + # deployed node_bits/masks contract — see membership_build_id), coverage half is + # the richer "__" trio (a new id). ':' appears in neither half + # (decimals + underscores only), so split(':', 1) is unambiguous. + return f"{membership_build_id(con)}:{coverage_build_id(con)}" def build_facet_node_bits(con, out, build_id): @@ -559,6 +632,51 @@ def build_sample_facet_masks(con, out, build_id): ) TO '{out}' (FORMAT PARQUET, COMPRESSION ZSTD)""") +def build_sample_facet_index(con, out, build_id): + # #305/#306: the COMPLETE per-pid facet index — one row for EVERY located + # sample (samp_geo), not only those with tree membership. This is the artifact + # the multi-filter global-view count path (#304) scans: it must be able to count + # over the whole located universe, including samples that carry NO tree concept. + # + # WHY this exists separately from sample_facet_masks: masks is built FROM + # `membership`, so it silently OMITS located samples with no membership row + # (~29,917 in the 202608 generation — #306). Counting/ filtering off masks alone + # undercounts the located universe. Here we start from samp_geo (the authoritative + # located set), LEFT JOIN the same membership-derived masks, and emit a ZERO mask + # for no-membership pids (a zero mask matches no node bit → contributes 0 to every + # tree facet, which is correct: the sample is in no subtree — but it IS still a + # located sample and still counts toward `source` and the located total). + # + # `source` is a plain VARCHAR (source is exclusive, not multi-valued — a mask + # would be wrong, a facets_v3 semi-join would add a second 6M-row read). The + # mask columns are bit-identical to sample_facet_masks for pids that have + # membership (validated). bit assignment == build_facet_node_bits / masks. + con.execute(f"""COPY ( + WITH nb AS ( + SELECT facet_type, concept_uri, + (1::BIGINT << (ROW_NUMBER() OVER (PARTITION BY facet_type ORDER BY concept_uri) - 1)) AS bitval + FROM (SELECT DISTINCT facet_type, concept_uri FROM membership) + ), + masks AS ( + SELECT m.pid, + COALESCE(bit_or(CASE WHEN m.facet_type='material' THEN nb.bitval END), 0)::BIGINT AS material_mask, + COALESCE(bit_or(CASE WHEN m.facet_type='context' THEN nb.bitval END), 0)::BIGINT AS context_mask, + COALESCE(bit_or(CASE WHEN m.facet_type='object_type' THEN nb.bitval END), 0)::BIGINT AS object_type_mask + FROM membership m JOIN nb ON nb.facet_type=m.facet_type AND nb.concept_uri=m.concept_uri + GROUP BY m.pid + ) + SELECT sg.pid, + sg.source::VARCHAR AS source, + COALESCE(mk.material_mask, 0)::BIGINT AS material_mask, + COALESCE(mk.context_mask, 0)::BIGINT AS context_mask, + COALESCE(mk.object_type_mask, 0)::BIGINT AS object_type_mask, + '{build_id}' AS build_id, + {INDEX_SCHEMA_VERSION}::INTEGER AS schema_version + FROM samp_geo sg LEFT JOIN masks mk ON mk.pid = sg.pid + ORDER BY sg.pid + ) TO '{out}' (FORMAT PARQUET, COMPRESSION ZSTD)""") + + def file_meta(con, path): n = con.sql(f"SELECT COUNT(*) FROM read_parquet('{path}')").fetchone()[0] schema = [(r[0], r[1]) for r in con.sql(f"DESCRIBE SELECT * FROM read_parquet('{path}')").fetchall()] @@ -623,7 +741,8 @@ def emit(name, fn): # Hierarchy artifacts (#281/#282) — need vocab_labels for the SKOS tree. HIER_ARTIFACTS = {"sample_facet_membership", "facet_tree_summaries", - "facet_tree_cross_filter", "facet_node_bits", "sample_facet_masks"} + "facet_tree_cross_filter", "facet_node_bits", "sample_facet_masks", + "sample_facet_index"} if any(want(a) for a in HIER_ARTIFACTS): if not args.vocab_labels: # Fail loud if the user EXPLICITLY asked for a hierarchy artifact @@ -634,17 +753,44 @@ def emit(name, fn): log("SKIP hierarchy artifacts: pass --vocab-labels ", t0) else: build_concept_membership(con, args.wide, args.vocab_labels, t0) - emit("sample_facet_membership", lambda o: build_sample_facet_membership(con, o)) + # The mask fast-path bundle = {membership, node_bits, masks?, index}. + # sample_facet_masks AND sample_facet_index store raw mask bits that are + # uninterpretable without facet_node_bits, and that the validator can only + # gate by re-deriving from sample_facet_membership. So whenever masks or + # index is requested, FORCE-emit membership + node_bits too (even under + # `--only sample_facet_index`) — otherwise the build ships an artifact its + # own validator must reject (Codex #4 / r3). force_dep() builds a not-wanted + # artifact exactly once and records it for the manifest. + need_fastpath = want("facet_node_bits") or want("sample_facet_masks") or want("sample_facet_index") + force_deps = want("sample_facet_masks") or want("sample_facet_index") + def force_dep(name, fn): + if want(name): + emit(name, fn) + elif p(name) not in produced: + fn(p(name)); produced.append(p(name)); log(f"{name} ✓ (auto-paired with masks/index)", t0) + + if force_deps: + force_dep("sample_facet_membership", lambda o: build_sample_facet_membership(con, o)) + else: + emit("sample_facet_membership", lambda o: build_sample_facet_membership(con, o)) emit("facet_tree_summaries", lambda o: build_facet_tree_summaries(con, o)) # #290/#293 cross-filter cube — needs membership (above) + samp_geo (source). emit("facet_tree_cross_filter", lambda o: build_facet_tree_cross_filter(con, o)) - # #293 bitmask filter artifacts — needs membership (above). node_bits - # and masks share a node-set build_id so the explorer only uses the mask - # path when the two are from the same generation (Codex P1). - if want("facet_node_bits") or want("sample_facet_masks"): + # #293 bitmask filter artifacts — needs membership (above). node_bits and + # masks share a node-set build_id so the explorer only uses the mask path + # when the two are from the same generation (Codex P1). + if need_fastpath: _bid = membership_build_id(con) - emit("facet_node_bits", lambda o: build_facet_node_bits(con, o, _bid)) + if force_deps: + force_dep("facet_node_bits", lambda o: build_facet_node_bits(con, o, _bid)) + else: + emit("facet_node_bits", lambda o: build_facet_node_bits(con, o, _bid)) emit("sample_facet_masks", lambda o: build_sample_facet_masks(con, o, _bid)) + # #305/#306: complete per-pid index (every located pid + source, + # zero-mask for no-membership pids). Its build_id embeds the SAME + # membership id as node_bits (mask-bit interpretation gate) PLUS a + # coverage id over samp_geo's (pid, source) universe (staleness gate). + emit("sample_facet_index", lambda o: build_sample_facet_index(con, o, index_build_id(con))) if not args.no_manifest: log("hashing inputs/outputs for manifest…", t0) diff --git a/scripts/validate_frontend_derived.py b/scripts/validate_frontend_derived.py index debd8e4..754c9d0 100755 --- a/scripts/validate_frontend_derived.py +++ b/scripts/validate_frontend_derived.py @@ -13,6 +13,8 @@ python scripts/validate_frontend_derived.py \ --facets URL --map-lite URL --summaries URL --cross-filter URL \ --h3 URL4 URL6 URL8 + # hierarchy + #305/#306 complete index (auto-discovered with --dir/--tag, or): + python scripts/validate_frontend_derived.py --dir DIR --tag TAG --index INDEX.parquet """ import argparse, hashlib, json, os, sys import duckdb @@ -55,6 +57,7 @@ def main(): ap.add_argument("--tree-cross-filter", help="facet_tree_cross_filter parquet (#290/#293); optional") ap.add_argument("--node-bits", help="facet_node_bits parquet (#293); optional") ap.add_argument("--masks", help="sample_facet_masks parquet (#293); optional") + ap.add_argument("--index", help="sample_facet_index parquet (#305/#306); optional") ap.add_argument("--wide", help="source wide parquet — enables the SEMANTIC gate " "(re-derive and diff the written files against a fresh build)") ap.add_argument("--min-rows", type=int, default=1_000_000, @@ -300,6 +303,7 @@ def _opt(name, attr): treexf = _opt("facet_tree_cross_filter", "tree_cross_filter") nodebits = _opt("facet_node_bits", "node_bits") masks = _opt("sample_facet_masks", "masks") + index = _opt("sample_facet_index", "index") if tree: T = f"read_parquet('{tree}')" # parent ≥ child for every edge, every dim (distinct-pid UNION semantics — @@ -477,6 +481,161 @@ def _opt(name, attr): + (SELECT COUNT(*) FROM (SELECT * FROM b EXCEPT SELECT * FROM a))""") check(f"masks filter == membership ({dim})", d == 0, f"{d} pids differ for a real {dim} node") + # --- #305/#306 complete per-pid facet index (sample_facet_index) --- + # The multi-filter global-view count path scans THIS file. It must cover the + # ENTIRE located universe (facets_v2 = samp_geo), carry the correct source per + # pid, zero-mask the no-membership pids (#306), and be bit-identical to + # sample_facet_masks for pids that DO have membership — otherwise counts drift. + if index: + # Import the builder's canonical schema version + fingerprint token exprs so + # the validator is an INDEPENDENT oracle (Codex #1/#5): it recomputes the + # expected build_id from the written sibling files and asserts equality, + # rather than trusting whatever the builder stamped. + sys.path.insert(0, os.path.dirname(os.path.abspath(__file__))) + from build_frontend_derived import (INDEX_SCHEMA_VERSION, MEMBERSHIP_TOKEN_EXPR, + COVERAGE_TOKEN_EXPR) + + def _fp(relation, token_expr): # mirror build_frontend_derived._fingerprint (coverage trio) + x, s, n = con.sql( + f"SELECT COALESCE(bit_xor(hash({token_expr})), 0), " + f"COALESCE(SUM(hash({token_expr})::HUGEINT), 0), COUNT(*) FROM {relation}").fetchone() + return f"{x}_{s}_{n}" + + def _xor_fp(relation, token_expr): # mirror membership_build_id (bare XOR — deployed contract) + return str(con.sql( + f"SELECT CAST(COALESCE(bit_xor(hash({token_expr})), 0) AS VARCHAR) FROM {relation}").fetchone()[0]) + + IX = f"read_parquet('{index}')" + # GATING PRECONDITION (Codex r2 #1): the index cannot be fully validated + # without BOTH sample_facet_membership and facet_node_bits — they are what + # the independent build_id recompute and mask re-derivation depend on. If + # either is absent we must FAIL rather than silently run a partial (fail-open) + # validation that would pass a corrupt index/node_bits pair. They are always + # co-produced by a build, so --dir/--tag (or passing both) satisfies this. + check("index validation has membership + node_bits (required to gate)", + bool(mem and nodebits), + "pass --membership and --node-bits (or --dir/--tag with both present); " + "the index can't be validated in isolation") + # schema/types contract (the explorer reads these columns by name) + ix_sch = [(r[0], r[1]) for r in con.sql(f"DESCRIBE SELECT * FROM {IX}").fetchall()] + EXP_IX = [("pid", "VARCHAR"), ("source", "VARCHAR"), + ("material_mask", "BIGINT"), ("context_mask", "BIGINT"), + ("object_type_mask", "BIGINT"), ("build_id", "VARCHAR"), + ("schema_version", "INTEGER")] + check("index schema matches contract", ix_sch == EXP_IX, f"got {ix_sch}") + # NOT-NULL contract (source MAY be NULL — it mirrors facets_v2; everything + # else must be present). A NULL mask is a real corruption (Codex #2): the + # zero-mask check below uses IS DISTINCT FROM so it also rejects NULL. + nn_bad = scalar(f"""SELECT COUNT(*) FROM {IX} WHERE pid IS NULL + OR material_mask IS NULL OR context_mask IS NULL OR object_type_mask IS NULL + OR build_id IS NULL OR schema_version IS NULL""") + check("index: required columns non-NULL", nn_bad == 0, f"{nn_bad} rows with NULL pid/mask/build_id/version") + # one row per pid + ixdup = scalar(f"SELECT COUNT(*) FROM (SELECT pid FROM {IX} GROUP BY pid HAVING COUNT(*)>1)") + check("index: one row per pid", ixdup == 0, f"{ixdup} duplicate pids in index") + # COMPLETENESS (#306): index pid SET == facets_v2 pid SET (the located universe). + # This is the whole point — masks omits no-membership located pids; the index + # must not. A symmetric set-diff catches both missing and extra pids. + ix_diff = scalar(f"""SELECT + (SELECT COUNT(*) FROM (SELECT pid FROM {F} EXCEPT SELECT pid FROM {IX})) + + (SELECT COUNT(*) FROM (SELECT pid FROM {IX} EXCEPT SELECT pid FROM {F}))""") + check("index.pid == facets_v2.pid (complete located universe, #306)", ix_diff == 0, + f"{ix_diff} pids differ between index and facets_v2") + # SOURCE equality: index.source must equal facets_v2.source for every pid + # (IS DISTINCT FROM so NULL==NULL). A source-value drift is exactly what the + # coverage half of build_id is meant to catch — assert it directly too. + src_bad = scalar(f"""SELECT COUNT(*) FROM {IX} i JOIN {F} f ON i.pid=f.pid + WHERE i.source IS DISTINCT FROM f.source""") + check("index.source == facets_v2.source", src_bad == 0, f"{src_bad} pids with mismatched source") + # schema_version must be EXACTLY the contract version (Codex #2: "one distinct + # value" let 999 through). The explorer refuses an index it can't interpret. + ix_sv_bad = scalar(f"SELECT COUNT(*) FROM {IX} WHERE schema_version IS DISTINCT FROM {INDEX_SCHEMA_VERSION}") + check(f"index: schema_version == {INDEX_SCHEMA_VERSION}", ix_sv_bad == 0, + f"{ix_sv_bad} rows with schema_version != {INDEX_SCHEMA_VERSION}") + # build_id must be a SINGLE value of the EXACT shape ":__" + # — the membership half is the BARE bit_xor decimal (deployed node_bits contract), + # the coverage half is the three '_'-separated decimal trio. (Codex #2: "a colon + # somewhere" accepted ":bogus:extra".) + ix_bids = scalar(f"SELECT COUNT(DISTINCT build_id) FROM {IX}") + check("index: single build_id", ix_bids == 1, f"{ix_bids} distinct build_ids (want 1)") + bid_re = r'^[0-9]+:[0-9]+_[0-9]+_[0-9]+$' + ix_bid_fmt = scalar(f"SELECT COUNT(*) FROM {IX} WHERE regexp_matches(build_id, '{bid_re}') = FALSE") + check("index: build_id is well-formed ':'", ix_bid_fmt == 0, + "build_id(s) not exactly :__") + # INDEPENDENT build_id recomputation (Codex #1/#5): the coverage half is + # recomputed from facets_v2 (the located (pid, source) universe — equivalent + # to samp_geo) and, when membership is present, the membership half from it. + # This is the gate that actually catches coverage/source staleness at + # validation time — a membership-only id (or a hand-edited coverage half) + # FAILS here even though node_bits is unchanged. + if ix_bids == 1: + ix_bid = scalar(f"SELECT MIN(build_id) FROM {IX}") + exp_cov = _fp(F, COVERAGE_TOKEN_EXPR) + got_cov = ix_bid.split(":", 1)[1] if ":" in ix_bid else "" + check("index build_id coverage-half == recomputed from facets_v2", got_cov == exp_cov, + f"coverage half {got_cov!r} != expected {exp_cov!r} (stale/edited coverage)") + if mem: + exp_mem = _xor_fp(f"read_parquet('{mem}')", MEMBERSHIP_TOKEN_EXPR) # bare XOR (deployed contract) + got_mem = ix_bid.split(":", 1)[0] + check("index build_id membership-half == recomputed from membership", got_mem == exp_mem, + f"membership half {got_mem!r} != expected {exp_mem!r} (stale/foreign generation)") + if nodebits and ix_bids == 1: + # the membership half of the index build_id MUST equal node_bits.build_id + # so the mask bits are interpreted under the SAME assignment (else counts + # are computed against a stale/foreign bit layout — silently wrong). + # (Codex r2 #2): assert node_bits has EXACTLY ONE non-NULL build_id FIRST — + # 0 or >1 distinct ids previously skipped the match entirely (fail-open). + nb_bid = scalar(f"SELECT COUNT(DISTINCT build_id) FROM read_parquet('{nodebits}') WHERE build_id IS NOT NULL") + nb_null = scalar(f"SELECT COUNT(*) FROM read_parquet('{nodebits}') WHERE build_id IS NULL") + check("node_bits: exactly one non-NULL build_id", nb_bid == 1 and nb_null == 0, + f"{nb_bid} distinct non-NULL build_ids, {nb_null} NULLs (want 1 / 0)") + if nb_bid == 1 and nb_null == 0: + membership_half_matches = scalar( + f"SELECT (SELECT split_part(MIN(build_id), ':', 1) FROM {IX}) " + f"= (SELECT MIN(build_id) FROM read_parquet('{nodebits}'))") + check("index build_id membership-half == node_bits build_id", + bool(membership_half_matches), + "index masks would be read under a different bit assignment than node_bits") + # INDEPENDENT mask re-derivation (Codex #3): re-derive the expected masks for + # EVERY index pid directly from membership + node_bits (zero for pids with no + # membership) and symmetric-diff against the index. This validates the masks + # without trusting the optional sample_facet_masks file, AND covers the + # zero-mask rows that a masks-only diff never sees. + if mem and nodebits: + M = f"read_parquet('{mem}')" + NB = f"read_parquet('{nodebits}')" + ref = (f"WITH nb AS (SELECT facet_type, concept_uri, (1::BIGINT << bit_index) AS bitval FROM {NB}), " + f"mm AS (SELECT m.pid, " + f"COALESCE(bit_or(CASE WHEN m.facet_type='material' THEN nb.bitval END),0)::BIGINT material_mask, " + f"COALESCE(bit_or(CASE WHEN m.facet_type='context' THEN nb.bitval END),0)::BIGINT context_mask, " + f"COALESCE(bit_or(CASE WHEN m.facet_type='object_type' THEN nb.bitval END),0)::BIGINT object_type_mask " + f"FROM {M} m JOIN nb ON nb.facet_type=m.facet_type AND nb.concept_uri=m.concept_uri GROUP BY m.pid) " + f"SELECT i.pid, COALESCE(mm.material_mask,0)::BIGINT material_mask, " + f"COALESCE(mm.context_mask,0)::BIGINT context_mask, " + f"COALESCE(mm.object_type_mask,0)::BIGINT object_type_mask " + f"FROM (SELECT pid FROM {IX}) i LEFT JOIN mm ON mm.pid=i.pid") + fil = f"SELECT pid, material_mask, context_mask, object_type_mask FROM {IX}" + mm_diff = scalar(f"SELECT (SELECT COUNT(*) FROM (({ref}) EXCEPT ({fil}))) " + f"+ (SELECT COUNT(*) FROM (({fil}) EXCEPT ({ref})))") + check("index masks == re-derived from membership+node_bits (all pids)", mm_diff == 0, + f"{mm_diff} index rows disagree with an independent re-derivation") + elif masks: + # Fallback when membership/node_bits aren't both available: diff shared + # pids against the masks file and assert no-membership extras are zero. + MK = f"read_parquet('{masks}')" + mk_diff = scalar(f"""SELECT COUNT(*) FROM {IX} i JOIN {MK} m ON i.pid=m.pid + WHERE i.material_mask IS DISTINCT FROM m.material_mask + OR i.context_mask IS DISTINCT FROM m.context_mask + OR i.object_type_mask IS DISTINCT FROM m.object_type_mask""") + check("index masks == sample_facet_masks (shared pids)", mk_diff == 0, + f"{mk_diff} shared pids with differing masks") + extra_nonzero = scalar(f"""SELECT COUNT(*) FROM {IX} i + WHERE NOT EXISTS (SELECT 1 FROM {MK} m WHERE m.pid=i.pid) + AND (i.material_mask IS DISTINCT FROM 0 OR i.context_mask IS DISTINCT FROM 0 + OR i.object_type_mask IS DISTINCT FROM 0)""") + check("index: no-membership extra pids are zero-masked (#306)", extra_nonzero == 0, + f"{extra_nonzero} index-only pids carry a non-zero/NULL mask (should be 0)") + print(f"\n{'CHECK':<44} {'RESULT':<6} DETAIL\n" + "-" * 90) ok = True for name, passed, detail in R: diff --git a/tests/test_frontend_derived.py b/tests/test_frontend_derived.py index 94e05bc..8ef109c 100644 --- a/tests/test_frontend_derived.py +++ b/tests/test_frontend_derived.py @@ -211,6 +211,24 @@ def test_duplicate_pid_hard_fails(tmp_path): f"builder should hard-fail on duplicate pids:\n{r.stdout}\n{r.stderr}" +def test_null_pid_hard_fails(tmp_path): + """Codex (#305 SQL gap): a SINGLE NULL pid slips the GROUP BY...HAVING COUNT>1 + dup check but breaks every pid join + the coverage fingerprint. Reject it.""" + wide = str(tmp_path / "wide.parquet"); build_fixture_wide(wide, "blob") + con = duckdb.connect(); con.execute("INSTALL spatial; LOAD spatial;") + nullp = str(tmp_path / "wide_nullpid.parquet") + # append one located row with a NULL pid + con.execute(f"""COPY ( + SELECT * FROM read_parquet('{wide}') + UNION ALL + SELECT * REPLACE (NULL::VARCHAR AS pid) FROM read_parquet('{wide}') WHERE pid='m-real-first' + ) TO '{nullp}' (FORMAT PARQUET)""") + con.close() + r = _build(tmp_path, nullp) + assert r.returncode != 0 and "null_pids=1" in (r.stdout + r.stderr).lower(), \ + f"builder should hard-fail on a NULL pid:\n{r.stdout}\n{r.stderr}" + + def test_manifest_emitted(tmp_path): import json wide = str(tmp_path / "wide.parquet"); build_fixture_wide(wide, "blob") @@ -519,6 +537,307 @@ def test_facet_masks_only_builds(tmp_path): assert (tmp_path / "t_sample_facet_masks.parquet").exists() +# --------------------------------------------------------------------------- +# sample_facet_index — COMPLETE per-pid index (#305/#306) +# Reuses the tree vocab but adds a LOCATED sample with NO tree membership +# (root-only material, null context/object) — the exact #306 class that +# sample_facet_masks silently omits. The index must include it, zero-masked. +# --------------------------------------------------------------------------- +# (pid, source, material_rids, context_rids, object_rids) +INDEX_SAMPLES = [ + ("s1", "A", [2], [11], [21]), # mineral, earthinterior, othersolidobject (full membership) + ("s2", "A", [3], [11], [21]), # rock, earthinterior, othersolidobject (full membership) + ("s3", "B", [2], [11], [21]), # mineral, earthinterior, othersolidobject (full membership) + ("s-nomem", "C", [1], None, None), # material ROOT-only -> NULL material -> NO membership; located +] + + +def build_index_fixture(wide, vocab): + """Tree fixture + a located no-membership sample (the #306 case).""" + con = duckdb.connect(); con.execute("INSTALL spatial; LOAD spatial;") + ic_rows = " UNION ALL ".join( + f"SELECT 'IdentifiedConcept' AS otype, '{uri}' AS pid, {rid}::BIGINT AS row_id, NULL::VARCHAR AS n, " + f"'lbl' AS label, NULL::VARCHAR AS description, NULL::VARCHAR[] AS place_name, NULL::TIMESTAMP AS result_time, " + f"NULL AS geometry, NULL::BIGINT[] AS p__has_material_category, NULL::BIGINT[] AS p__has_context_category, " + f"NULL::BIGINT[] AS p__has_sample_object_type, NULL::BIGINT[] AS p__keywords" + for rid, uri in TREE_CONCEPTS) + msr = [] + for i, (pid, src, m, c, o) in enumerate(INDEX_SAMPLES): + msr.append( + f"SELECT 'MaterialSampleRecord' AS otype, '{pid}' AS pid, NULL::BIGINT AS row_id, '{src}' AS n, " + f"'label {pid}' AS label, 'desc {pid}' AS description, ['plc-{pid}']::VARCHAR[] AS place_name, " + f"NULL::TIMESTAMP AS result_time, ST_AsWKB(ST_Point({10.0+i},{40.0+i})) AS geometry, " + f"{_arr(m)} AS p__has_material_category, {_arr(c)} AS p__has_context_category, " + f"{_arr(o)} AS p__has_sample_object_type, NULL::BIGINT[] AS p__keywords") + con.execute(f"COPY ({ic_rows} UNION ALL {' UNION ALL '.join(msr)}) TO '{wide}' (FORMAT PARQUET)") + edges = " UNION ALL ".join( + f"SELECT '{u}' uri, {('NULL' if p is None else repr(p))}::VARCHAR broader, 'data_v1' uri_form" + for u, p in TREE_EDGES) + con.execute(f"COPY ({edges}) TO '{vocab}' (FORMAT PARQUET)") + con.close() + + +def _build_index(tmp_path, wide, vocab, tag="t"): + cmd = [sys.executable, BUILD, "--wide", wide, "--outdir", str(tmp_path), "--tag", tag, + "--skip", "wide_h3", "--no-manifest", "--vocab-labels", vocab] + return subprocess.run(cmd, capture_output=True, text=True) + + +def test_sample_facet_index_complete_and_zero_masked(tmp_path): + """#305/#306: the index covers EVERY located pid (incl. the no-membership one, + which masks omits), carries source, and zero-masks no-membership pids while + staying bit-identical to masks for membership pids.""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + r = _build_index(tmp_path, wide, vocab) + assert r.returncode == 0, f"index build failed:\n{r.stdout}\n{r.stderr}" + con = duckdb.connect() + IX = f"read_parquet('{tmp_path / 't_sample_facet_index.parquet'}')" + MK = f"read_parquet('{tmp_path / 't_sample_facet_masks.parquet'}')" + F = f"read_parquet('{tmp_path / 't_sample_facets_v2.parquet'}')" + + # COMPLETENESS: index has all 4 located pids; masks has only the 3 with membership + assert con.sql(f"SELECT COUNT(*) FROM {IX}").fetchone()[0] == 4 + assert con.sql(f"SELECT COUNT(*) FROM {MK}").fetchone()[0] == 3 + # index pid set == facets_v2 pid set (the located universe) + diff = con.sql(f"""SELECT (SELECT COUNT(*) FROM (SELECT pid FROM {F} EXCEPT SELECT pid FROM {IX})) + + (SELECT COUNT(*) FROM (SELECT pid FROM {IX} EXCEPT SELECT pid FROM {F}))""").fetchone()[0] + assert diff == 0, "index pid set != facets_v2 located universe" + # the no-membership located sample IS present and FULLY zero-masked (#306) + row = con.sql(f"SELECT material_mask, context_mask, object_type_mask, source FROM {IX} WHERE pid='s-nomem'").fetchone() + assert row is not None, "s-nomem (located, no membership) missing from index — the #306 bug" + assert row[:3] == (0, 0, 0), f"s-nomem should be zero-masked, got {row[:3]}" + assert row[3] == "C", f"s-nomem source should be 'C', got {row[3]!r}" + # masks bit-identical for shared pids + mk_diff = con.sql(f"""SELECT COUNT(*) FROM {IX} i JOIN {MK} m ON i.pid=m.pid + WHERE i.material_mask IS DISTINCT FROM m.material_mask + OR i.context_mask IS DISTINCT FROM m.context_mask + OR i.object_type_mask IS DISTINCT FROM m.object_type_mask""").fetchone()[0] + assert mk_diff == 0, "index masks drift from sample_facet_masks for shared pids" + # build_id is structured ":" and its membership half == node_bits + bid = con.sql(f"SELECT DISTINCT build_id FROM {IX}").fetchall() + assert len(bid) == 1 and ":" in bid[0][0], f"index build_id not single/structured: {bid}" + nb_bid = con.sql(f"SELECT DISTINCT build_id FROM read_parquet('{tmp_path / 't_facet_node_bits.parquet'}')").fetchone()[0] + assert bid[0][0].split(":", 1)[0] == nb_bid, "index build_id membership-half != node_bits build_id" + + +def test_sample_facet_index_coverage_id_changes_with_source(tmp_path): + """The coverage half of build_id must change when source/coverage changes even + though membership is identical (the silent-staleness hole the plan calls out).""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + assert _build_index(tmp_path, wide, vocab, tag="a").returncode == 0 + con = duckdb.connect() + bid_a = con.sql(f"SELECT DISTINCT build_id FROM read_parquet('{tmp_path / 'a_sample_facet_index.parquet'}')").fetchone()[0] + + # rebuild with a flipped source on s-nomem (same membership, different coverage) + wide2 = str(tmp_path / "wide2.parquet") + con.execute(f"""COPY ( + SELECT * REPLACE (CASE WHEN pid='s-nomem' THEN 'Z' ELSE n END AS n) FROM read_parquet('{wide}') + ) TO '{wide2}' (FORMAT PARQUET)""") + assert _build_index(tmp_path, wide2, vocab, tag="b").returncode == 0 + bid_b = con.sql(f"SELECT DISTINCT build_id FROM read_parquet('{tmp_path / 'b_sample_facet_index.parquet'}')").fetchone()[0] + + mem_a, cov_a = bid_a.split(":", 1) + mem_b, cov_b = bid_b.split(":", 1) + assert mem_a == mem_b, "membership half should be unchanged (membership identical)" + assert cov_a != cov_b, "coverage half should change when a source value changes" + + +def test_sample_facet_index_validator_passes_and_gates_bite(tmp_path): + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + assert _build_index(tmp_path, wide, vocab).returncode == 0 + # clean: validator (incl. the new index gates) must PASS + v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", + "--min-rows", "1", "--wide", wide], capture_output=True, text=True) + assert v.returncode == 0, f"validator failed on clean index fixture:\n{v.stdout}\n{v.stderr}" + assert "index.pid == facets_v2.pid" in v.stdout + + # corrupt 1: drop the no-membership pid (re-introduce the #306 bug) -> completeness gate fires + ix = str(tmp_path / "t_sample_facet_index.parquet") + con = duckdb.connect(); tmp_i = ix + ".tmp" + con.execute(f"""COPY (SELECT * FROM read_parquet('{ix}') WHERE pid <> 's-nomem') + TO '{tmp_i}' (FORMAT PARQUET)"""); os.replace(tmp_i, ix) + v2 = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], + capture_output=True, text=True) + assert v2.returncode != 0 and "index.pid == facets_v2.pid" in v2.stdout, \ + f"completeness gate failed to catch a dropped located pid:\n{v2.stdout}" + + +def test_sample_facet_index_nonzero_mask_on_nomem_caught(tmp_path): + """A no-membership pid carrying a non-zero mask is a corruption the gate must + catch. With membership+node_bits present the independent re-derivation fires.""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + assert _build_index(tmp_path, wide, vocab).returncode == 0 + ix = str(tmp_path / "t_sample_facet_index.parquet") + con = duckdb.connect(); tmp_i = ix + ".tmp" + con.execute(f"""COPY (SELECT pid, source, + CASE WHEN pid='s-nomem' THEN 1::BIGINT ELSE material_mask END AS material_mask, + context_mask, object_type_mask, build_id, schema_version + FROM read_parquet('{ix}')) TO '{tmp_i}' (FORMAT PARQUET)"""); os.replace(tmp_i, ix) + v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], + capture_output=True, text=True) + assert v.returncode != 0 and "re-derived from membership" in v.stdout, \ + f"mask gate failed to catch a non-zero no-membership pid:\n{v.stdout}" + + +def test_sample_facet_index_null_mask_caught(tmp_path): + """Codex #2: a NULL mask escaped the old `<> 0` predicate (NULL=unknown). The + NOT-NULL gate must catch it.""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + assert _build_index(tmp_path, wide, vocab).returncode == 0 + ix = str(tmp_path / "t_sample_facet_index.parquet") + con = duckdb.connect(); tmp_i = ix + ".tmp" + con.execute(f"""COPY (SELECT pid, source, + CASE WHEN pid='s-nomem' THEN NULL::BIGINT ELSE material_mask END AS material_mask, + context_mask, object_type_mask, build_id, schema_version + FROM read_parquet('{ix}')) TO '{tmp_i}' (FORMAT PARQUET)"""); os.replace(tmp_i, ix) + v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], + capture_output=True, text=True) + assert v.returncode != 0 and "non-NULL" in v.stdout, \ + f"NOT-NULL gate failed to catch a NULL mask:\n{v.stdout}" + + +def test_sample_facet_index_bad_schema_version_caught(tmp_path): + """Codex #2: schema_version=999 passed the old 'one distinct value' check.""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + assert _build_index(tmp_path, wide, vocab).returncode == 0 + ix = str(tmp_path / "t_sample_facet_index.parquet") + con = duckdb.connect(); tmp_i = ix + ".tmp" + con.execute(f"""COPY (SELECT pid, source, material_mask, context_mask, object_type_mask, + build_id, 999::INTEGER AS schema_version FROM read_parquet('{ix}')) + TO '{tmp_i}' (FORMAT PARQUET)"""); os.replace(tmp_i, ix) + v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], + capture_output=True, text=True) + assert v.returncode != 0 and "schema_version ==" in v.stdout, \ + f"schema_version gate failed to catch 999:\n{v.stdout}" + + +def test_sample_facet_index_malformed_build_id_caught(tmp_path): + """Codex #2: ':bogus:extra' passed the old 'a colon somewhere' check.""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + assert _build_index(tmp_path, wide, vocab).returncode == 0 + ix = str(tmp_path / "t_sample_facet_index.parquet") + con = duckdb.connect(); tmp_i = ix + ".tmp" + # keep a correct membership half, append junk extra ':' segments + con.execute(f"""COPY (SELECT pid, source, material_mask, context_mask, object_type_mask, + split_part(build_id, ':', 1) || ':bogus:extra' AS build_id, schema_version + FROM read_parquet('{ix}')) TO '{tmp_i}' (FORMAT PARQUET)"""); os.replace(tmp_i, ix) + v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], + capture_output=True, text=True) + assert v.returncode != 0 and "well-formed" in v.stdout, \ + f"build_id format gate failed to catch a malformed id:\n{v.stdout}" + + +def test_sample_facet_index_stale_coverage_id_caught(tmp_path): + """Codex #1: a build_id whose coverage half doesn't match the actual (pid,source) + universe (e.g. membership-only id, or hand-edited coverage) must fail — even + though node_bits is unchanged.""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + assert _build_index(tmp_path, wide, vocab).returncode == 0 + ix = str(tmp_path / "t_sample_facet_index.parquet") + con = duckdb.connect(); tmp_i = ix + ".tmp" + # replace the coverage half with a structurally-valid but WRONG fingerprint + con.execute(f"""COPY (SELECT pid, source, material_mask, context_mask, object_type_mask, + split_part(build_id, ':', 1) || ':1_2_3' AS build_id, schema_version + FROM read_parquet('{ix}')) TO '{tmp_i}' (FORMAT PARQUET)"""); os.replace(tmp_i, ix) + v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], + capture_output=True, text=True) + assert v.returncode != 0 and "coverage-half" in v.stdout, \ + f"coverage staleness gate failed to catch a wrong coverage id:\n{v.stdout}" + + +def test_sample_facet_index_source_drift_caught(tmp_path): + """index.source must equal facets_v2.source; a drifted source must fail the gate.""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + assert _build_index(tmp_path, wide, vocab).returncode == 0 + ix = str(tmp_path / "t_sample_facet_index.parquet") + con = duckdb.connect(); tmp_i = ix + ".tmp" + con.execute(f"""COPY (SELECT pid, + CASE WHEN pid='s1' THEN 'WRONG' ELSE source END AS source, + material_mask, context_mask, object_type_mask, build_id, schema_version + FROM read_parquet('{ix}')) TO '{tmp_i}' (FORMAT PARQUET)"""); os.replace(tmp_i, ix) + v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], + capture_output=True, text=True) + assert v.returncode != 0 and "index.source == facets_v2.source" in v.stdout, \ + f"source-equality gate failed to catch drift:\n{v.stdout}" + + +def test_sample_facet_index_validation_requires_siblings(tmp_path): + """Codex r2 #1: validating the index WITHOUT membership+node_bits must FAIL + (fail-closed), not silently run a partial pass. Reproduces the bypass where a + corrupt index + restamped node_bits passed once the siblings were removed.""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + assert _build_index(tmp_path, wide, vocab).returncode == 0 + # delete the siblings the index validation depends on + os.remove(tmp_path / "t_sample_facet_membership.parquet") + os.remove(tmp_path / "t_facet_node_bits.parquet") + os.remove(tmp_path / "t_sample_facet_masks.parquet") + v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], + capture_output=True, text=True) + assert v.returncode != 0 and "required to gate" in v.stdout, \ + f"index validation should fail-closed without membership+node_bits:\n{v.stdout}" + + +def test_sample_facet_index_bad_node_bits_build_id_caught(tmp_path): + """Codex r2 #2: a node_bits file with !=1 non-NULL build_id previously skipped + the membership-half match entirely. It must now FAIL.""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + assert _build_index(tmp_path, wide, vocab).returncode == 0 + nb = str(tmp_path / "t_facet_node_bits.parquet") + con = duckdb.connect(); tmp_nb = nb + ".tmp" + # give node_bits TWO distinct build_ids (one row restamped) + con.execute(f"""COPY (SELECT facet_type, concept_uri, bit_index, + CASE WHEN bit_index = (SELECT MIN(bit_index) FROM read_parquet('{nb}')) + THEN 'OTHER_GEN' ELSE build_id END AS build_id + FROM read_parquet('{nb}')) TO '{tmp_nb}' (FORMAT PARQUET)"""); os.replace(tmp_nb, nb) + v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], + capture_output=True, text=True) + assert v.returncode != 0 and "exactly one non-NULL build_id" in v.stdout, \ + f"node_bits build_id gate failed to catch a multi-id file:\n{v.stdout}" + + +def test_sample_facet_index_only_auto_pairs_bundle(tmp_path): + """Codex #4/r3: --only sample_facet_index must NOT ship an orphan. It + force-emits its whole fast-path bundle — facet_node_bits (bit interpretation) + AND sample_facet_membership (the validator's independent oracle). A partial + rebuild into a directory holding a prior FULL build must then re-validate + cleanly (the real iterate-on-one-artifact workflow).""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + # 1) full build so the validator's non-index siblings (facets_v2, h3, …) exist + assert _build_index(tmp_path, wide, vocab).returncode == 0 + # delete the bundle the index-only rebuild must RE-EMIT, so a stale full-build + # file can't satisfy the existence checks (Codex r4: prove re-emission, not + # leftover presence). + bundle = ("t_sample_facet_index.parquet", "t_facet_node_bits.parquet", + "t_sample_facet_membership.parquet") + for sib in bundle: + os.remove(tmp_path / sib) + # 2) rebuild ONLY the index into the same dir + r = subprocess.run([sys.executable, BUILD, "--wide", wide, "--outdir", str(tmp_path), "--tag", "t", + "--no-manifest", "--vocab-labels", vocab, + "--only", "sample_facet_index"], capture_output=True, text=True) + assert r.returncode == 0, f"{r.stdout}\n{r.stderr}" + # the bundle the index needs to be validatable was (re)emitted, not orphaned + for sib in bundle: + assert (tmp_path / sib).exists(), f"--only sample_facet_index did not re-emit {sib}:\n{r.stdout}" + # 3) the rebuilt artifacts validate against the existing full-build siblings + v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], + capture_output=True, text=True) + assert v.returncode == 0, f"--only sample_facet_index rebuild failed validation:\n{v.stdout}\n{v.stderr}" + assert "index.pid == facets_v2.pid" in v.stdout + + def test_scheme_corruption_caught(tmp_path): wide = str(tmp_path / "wide.parquet"); build_fixture_wide(wide, "blob") assert _build(tmp_path, wide).returncode == 0