From f155783dfc4b57efc41d310bdfce23aea4ec939e Mon Sep 17 00:00:00 2001 From: Raymond Yee Date: Fri, 19 Jun 2026 21:51:40 -0700 Subject: [PATCH 01/14] docs: plan of record for #305 facet-count coherence Joint Claude+Codex plan for computing multi-filter facet counts from the #299 bitmask index (avoiding the 39M-row membership scan). Covers the masks histogram approach, native DuckDB benchmarks, source handling, the #306 missing-pid prerequisite, the honesty rule (never baseline under active filters), a 4-phase rollout, and risks. Refs #305, #304, #306, #276. Co-Authored-By: Claude Opus 4.8 (1M context) --- PLAN_305_facet_counts.md | 128 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 128 insertions(+) create mode 100644 PLAN_305_facet_counts.md diff --git a/PLAN_305_facet_counts.md b/PLAN_305_facet_counts.md new file mode 100644 index 0000000..22f4c7a --- /dev/null +++ b/PLAN_305_facet_counts.md @@ -0,0 +1,128 @@ +# Plan — #305 Facet-count correctness & coherence (counts beyond the single-filter cube) + +**Status:** plan of record · **Authors:** Claude + Codex (joint, 2026-06-19) · **Tracks:** #305 (meta), #304 (live bug), #306 (data bug), #276 (semantics), #230, #286 + +## Problem + +At the global/zoomed-out view, facet **counts** only cross-filter for a **single** active +filter (served by the precomputed `facet_tree_cross_filter` cube via `effectiveSingleFilter()`, +`explorer.qmd:3374`). With **2+ active selections** the cube is skipped and tree-dim counts +revert to the **unfiltered baseline** (`explorer.qmd:3557-3562`) — deliberately, to avoid a +~39M-row `sample_facet_membership` GROUP BY that stalls DuckDB-WASM. So the **map is correctly +filtered** (#299 bitmask) but the **count numbers are wrong** for multi-filter at world view. +Zoomed-in is already correct (live membership path, `explorer.qmd:3564-3598`). Reported by +Eric in #304 (e.g. *Material: Anthropogenic* + *Specimen Type: Artifact*). + +## Core idea — count per node FROM the bitmasks (no membership scan) + +For a count dimension D over a qualifying pid set Q: + +```sql +SELECT nb.concept_uri AS value, COUNT(*) AS count +FROM s +JOIN read_parquet('${node_bits_url}') nb ON nb.facet_type = '' + AND (s._mask & (1::BIGINT << nb.bit_index)) <> 0 +WHERE +GROUP BY nb.concept_uri +``` + +- Masks are **one row per pid** → `COUNT(*)` == `COUNT(DISTINCT pid)` (do **not** add `COUNT(DISTINCT)` — wasted work). +- Membership encodes **ancestor closure** → child bit ⇒ parent bit, so **parent ≥ child** holds for free. +- The bitwise join is against `facet_node_bits` (≤63 rows/dim), not the 39M-row membership table. + +**Cross-filter semantics (confirmed correct):** OR among selected nodes *within* a dimension; +AND *across* dimensions; **omit the target dimension's own predicate** (exclude-self), matching +the UI contract (`explorer.qmd:3305`) and the existing cube (`build_frontend_derived.py:460`). +Special case: **zero selected sources** ⇒ impossible filter for tree targets, but the source +dimension's own histogram must still exclude source and stay constrained by the three trees. + +## Feasibility (Codex native DuckDB benchmarks, real 6M-pid data) + +| Query | Time | +|---|---| +| One unfiltered 22-node histogram | ~0.17s | +| All three tree dims via one 56-node join | ~0.69s | +| Eric's Material+Specimen-Type → context histogram | ~0.02s | +| Embedded source predicate/group | ~0.014s | +| `facets_v3` semi-join (rejected approach) | ~0.145s | + +Strongly supports feasibility. **Worst case ≈ 6M × 56 = ~336M cheap bit-tests**, not merely a +6M-row scan → **WASM cold+warm benchmarking remains a hard release gate** (the app has documented +DuckDB-WASM concurrency/deadlock sensitivity). + +## Source handling + +Add `source` as a plain **VARCHAR** column to a **new complete per-pid index** (source is +exclusive, not multi-valued → a mask is wrong; a `facets_v3` semi-join adds a second 6M-row +read + hash join). New schema: + +``` +pid · source · material_mask · context_mask · object_type_mask · build_id · schema_version +``` + +## Data-correctness prerequisite (#306) + +The current `sample_facet_masks` is built from `membership`, so it **omits ~29,917 located +samples** with coordinates but no tree membership (`facets_v3` = 6,026,242 vs masks = 5,996,325). +The new index must start from **`samp_geo`**, LEFT JOIN masks, and emit **zero masks** for +no-membership pids. Tracked as **#306** (must land in Phase 1). + +## Honesty rule (non-negotiable) + +**Never fall back to the unfiltered baseline when filters are active** — that is exactly the +#304 bug. `applyFacetCounts(..., null)` means baseline (`explorer.qmd:1381`). On any +miss/error under active filters, show **"—" / "unavailable" / stale styling / hidden counts** — +never authoritative-looking baseline numbers. Apply all dimension maps **atomically after one +stale-guard check** (`facetCountsReqId`); never partially repaint dimensions. + +## Phased plan + +### Phase 1 — Artifact + semantic foundation +- Build the complete per-pid index above. Start from `samp_geo`; include zero-mask pids (#306). +- Add `source` VARCHAR; extend the build **fingerprint** to cover the pid/source universe + (today it fingerprints membership only → would go stale silently once source/coverage change). +- AI-free validator: pid completeness vs `facets_v3`; source equality; node-bit coverage; + mask ≡ membership for pids that have membership; build_id consistency. +- **Publish under a new immutable filename/tag — do NOT overwrite the cached `202608` artifact.** +- Ratify the count contract (#276) as **membership / "anywhere in tree"** — already the shipped + semantics, so this is confirmation, not a redesign. + +### Phase 2 — Fix #304 at global view +- Generalized **direct-control filter snapshot** (today's `describeCrossFilters()` zeroes tree + selections at global view — the root cause). Generalize `effectiveSingleFilter()` to all selections. +- Route **multi-filter / full-tree-mode / global / no-search** counts through one mask-index + histogram query (the `tree_counts ∪ source_counts` shape). +- Keep the single-filter cube as a **pure optimization**, not a required semantic path. +- Apply dimension maps atomically after the stale check. On failure → unavailable, **not** baseline. +- **Benchmark cold + warm DuckDB-WASM** across empty / narrow / broad / all-pid qualifying sets; + set a concrete interaction budget before release. + +### Phase 3 — Extend (not premature unification) +- Add **viewport** support (identical padded bbox, `VIEWPORT_PAD_FACTOR = 0.3`). +- Add **search** via `search_pids`. +- Characterization-test the new path against the existing membership path; **then** retire the + membership count implementation. (Immediate unification would balloon the first fix into + viewport + search + flat-mode + stale-update refactoring — defer.) + +### Phase 4 — Verification + cleanup +- Unit-test predicate generation (exclude-self, zero-source states). +- Data validator gates from Phase 1, run in CI. +- Playwright: Eric's exact URL, multi-value-in-one-dim, source+tree, impossible combos, + global↔viewport transitions, superseded requests. +- Remove the misleading active-filter baseline fallback everywhere. +- Codex adversarial review; staging verify on real R2 data; deploy smoke gate. + +## Constraints / landmines +- **63-bit ceiling:** current dims are 19 / 22 / 15 nodes — ample headroom. Keep the build + hard-fail; design a future multiword schema but **do not implement now**. +- **Flat / mixed render mode:** keep the mask path gated OFF (subtree-membership semantics are + wrong for a flat dim) — defer to existing paths, as the cube already does. +- **Immutable data:** new tag/filename; never overwrite a cached artifact. + +## Top 3 risks +1. **WASM latency / connection starvation** — credible SQL, unmeasured in-browser; ~336M bit-tests worst case. +2. **Stale / incomplete index generations** — #306 missing pids + membership-only build_id; mixed cached generations → convincing-but-false counts. +3. **Honesty & state-transition failures** — baseline-as-truth fallback + async camera transitions unless results applied atomically. + +--- +*Joint plan: Claude (code trace, drafting) + Codex (native benchmarks, source design, #306 discovery, honesty rule, phasing). 2026-06-19.* From 473ec29eebce5e62e375ee6b99ddb2291908a5b7 Mon Sep 17 00:00:00 2001 From: Raymond Yee Date: Fri, 19 Jun 2026 22:08:09 -0700 Subject: [PATCH 02/14] #306/#305: complete per-pid facet index (sample_facet_index) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Build sample_facet_index: one row per LOCATED sample (samp_geo), not only those with tree membership. sample_facet_masks is built FROM membership and silently omits ~29,917 located samples with no tree concept (#306); the multi-filter global-view count path (#304/#305) must count over the whole located universe, so it scans this complete index instead. - start from samp_geo, LEFT JOIN membership-derived masks, zero-mask the no-membership pids (correct: in no subtree, but still located + still counts toward source/total) - add `source` VARCHAR (exclusive, not multi-valued -> not a mask) - build_id = ":": membership half matches facet_node_bits (mask-bit interpretation gate); coverage half fingerprints the samp_geo (pid, source) universe so #306-class drift can't go stale silently (today's membership-only id is blind to it) - schema_version column for forward-compat Validator: --index gate asserts schema, one-row-per-pid, pid==facets_v2 (located universe completeness, #306), source==facets_v2, structured single build_id matching node_bits, masks bit-identical to sample_facet_masks for shared pids, and zero-mask for every no-membership pid. Tests: a located no-membership sample (root-only material) — present + zero masked in the index, absent from masks; coverage-id changes on source flip; all corruption gates bite. 29/29 pass. Docs: SERIALIZATIONS.md §4.12 ratifies the #276 count contract (membership / "anywhere in tree") served by this index. Refs #305 #306 #304 #276 --- SERIALIZATIONS.md | 37 ++++++ scripts/build_frontend_derived.py | 92 +++++++++++++- scripts/validate_frontend_derived.py | 86 +++++++++++++ tests/test_frontend_derived.py | 174 +++++++++++++++++++++++++++ 4 files changed, 386 insertions(+), 3 deletions(-) diff --git a/SERIALIZATIONS.md b/SERIALIZATIONS.md index b5a9be5..bc9e0d1 100644 --- a/SERIALIZATIONS.md +++ b/SERIALIZATIONS.md @@ -122,6 +122,7 @@ builder — a fresh build is NOT bit-for-bit identical to them (see | `isamples_202601_sample_facets_v2.parquet` | `(pid, source, material, context, object_type, label, description, place_name)` — all VARCHAR scalars; each facet column is a single URI per sample (not an array) | 63 MB | 6.0 M | wide | Search Explorer multi-dim facet filtering | QUERY_SPEC §3.3, §5.1 | | `isamples_202601_facet_summaries.parquet` | Baseline `(facet_type, facet_value, scheme, count)` | 2 KB | 56 | wide | Every tutorial (instant initial facet counts) | QUERY_SPEC §3.3 tier 1 | | `isamples_202601_facet_cross_filter.parquet` | Pre-computed counts for single-filter cross-facet queries | 6 KB | 526 | wide | Search Explorer cross-filter UI | QUERY_SPEC §3.3 tier 2a | +| `_sample_facet_index.parquet` | Complete per-pid facet index `(pid, source, material_mask, context_mask, object_type_mask, build_id, schema_version)` — **one row per located sample**, including samples with no tree membership (zero-masked, #306). Scanned by the multi-filter global-view count path (#304/#305). | ~60 MB | 6.0 M | wide (membership + samp_geo) | Interactive Explorer multi-filter facet counts | §4.12 below | ### Tier: vocabulary labels @@ -296,6 +297,42 @@ for the alias when you want "latest." - **Consumer**: `scripts/enrich_wide_with_oc_thumbnails.py` uses OC narrow to fill thumbnails into 202604 unified wide. Also used directly in PQG benchmark work. - **Future**: these become the prototype upstream for per-source sidecars (see §3, bottom row). +### 4.12 `_sample_facet_index.parquet` (complete per-pid facet index, #305/#306) + +- **Role**: The single artifact the **multi-filter** global-view count path scans + (#304/#305). Where `sample_facet_masks` is built FROM `membership` and therefore + silently omits located samples that carry no tree concept (~29,917 in the 202608 + generation — **#306**), this index starts from `samp_geo` (the authoritative + located set) and is **complete**: one row per located `pid`. Samples with no + membership are present and **zero-masked**. +- **Headline schema** (7 cols): `pid (VARCHAR), source (VARCHAR), material_mask + (BIGINT), context_mask (BIGINT), object_type_mask (BIGINT), build_id (VARCHAR), + schema_version (INTEGER)`. `source` is a plain VARCHAR (source is exclusive, not + multi-valued, so a mask would be wrong). Each `*_mask` bit `(1 << bit_index)` is + set iff the pid is a member of that node, under the bit assignment in + `facet_node_bits`; membership encodes ancestor closure, so a parent bit is set for + the whole subtree. +- **`build_id` is structured** as `":"`: + - the **membership half** equals `facet_node_bits.build_id` — the explorer must + only interpret the mask bits when the two agree (else the bits are read under a + foreign assignment); + - the **coverage half** is a fingerprint of the `(pid, source)` universe over + `samp_geo`, so a changed source value or located-pid set (the #306 drift class) + changes `build_id` instead of going stale silently. +- **Count contract (ratifies #276)**: a facet count is **membership / "anywhere in + the tree"** — a sample counts toward a node if that node is anywhere in the + sample's asserted-concept subtree closure (OR within a dimension, AND across + dimensions, **excluding the target dimension's own predicate**). This is the + already-shipped cube/UI semantics; this index does not change it, it serves the + same contract to the *multi-filter* case the single-filter cube cannot reach. +- **Validation**: `validate_frontend_derived.py --index ` asserts: schema, + one row per pid, `pid` set == `sample_facets_v2.pid` (the located universe), + `source` == facets_v2 source, single structured `build_id` whose membership half + matches `node_bits`, masks bit-identical to `sample_facet_masks` for shared pids, + and that every no-membership pid is zero-masked. +- **Immutability**: published under a **new** versioned filename — it is a new + artifact name and never overwrites a cached `sample_facet_masks` or any prior tag. + ## 5. URL convention All substrate files live under `https://data.isamples.org/` — a diff --git a/scripts/build_frontend_derived.py b/scripts/build_frontend_derived.py index d92cff0..661a78e 100755 --- a/scripts/build_frontend_derived.py +++ b/scripts/build_frontend_derived.py @@ -19,6 +19,7 @@ - {tag}_facet_summaries.parquet facet_type, facet_value, scheme, count - {tag}_facet_cross_filter.parquet filter_source/material/context/object_type, facet_type, facet_value, count - {tag}_wide_h3.parquet wide + h3_res4/6/8 (large; built only on --only wide_h3) + - {tag}_sample_facet_index.parquet pid, source, material_mask, context_mask, object_type_mask(BIGINT), build_id, schema_version(INT) — COMPLETE per-pid index over ALL located samples (incl. #306 no-membership pids, zero-masked); the multi-filter global-view count path (#304/#305) scans this - {tag}_manifest.json provenance + per-output rowcount/schema/sha256 MATERIAL SELECTION (issue #265/#271): the broad SKOS root @@ -64,11 +65,16 @@ ARTIFACTS = ["sample_facets_v2", "samples_map_lite", "h3_summaries", "facet_summaries", "facet_cross_filter", "wide_h3", "sample_facet_membership", "facet_tree_summaries", - "facet_tree_cross_filter", "facet_node_bits", "sample_facet_masks"] + "facet_tree_cross_filter", "facet_node_bits", "sample_facet_masks", + "sample_facet_index"] # #293: max tree nodes per dim that fit in a signed BIGINT mask (bits 0..62). # Live max is 22 (context); guard so a future vocab explosion fails loudly # instead of silently overflowing a mask bit. MASK_MAX_BITS = 63 +# #305/#306: schema version for the complete per-pid facet index. Bump when the +# column set / semantics of sample_facet_index change so the explorer can refuse +# an index it doesn't understand instead of mis-reading it. +INDEX_SCHEMA_VERSION = 1 # Shared SQL expression for sample_facets_v2.description (#277 part 2). # Appends space-joined concept labels (IC labels across all 4 concept dims) @@ -508,6 +514,35 @@ def membership_build_id(con): FROM membership""").fetchone()[0] +def coverage_build_id(con): + # #305/#306: fingerprint of the COMPLETE per-pid universe the index covers — + # the (pid, source) pairs over the LOCATED set (samp_geo), independent of tree + # membership. membership_build_id() alone is blind to this: an index built from + # the SAME membership but a changed source value or a changed located-pid set + # (exactly the #306 class of drift — located pids with no membership) would + # carry an identical membership id and go stale SILENTLY. Order-independent XOR + # of per-row hashes; samp_geo.pid is unique (hard-checked in build_base_tables), + # so no XOR cancellation. COALESCE(source,'') so a NULL→'' source flip is caught. + return con.sql(""" + SELECT CAST(COALESCE(bit_xor(hash(pid || chr(31) || COALESCE(source, ''))), 0) + AS VARCHAR) + FROM samp_geo""").fetchone()[0] + + +def index_build_id(con): + # #305/#306: the index's generation identity = ":". + # The two halves are deliberately separable so the explorer can BOTH: + # 1. gate the mask-bit interpretation by matching the membership half against + # facet_node_bits.build_id (the bit assignment is a pure function of + # membership — the index masks are only meaningful under the SAME node_bits + # generation), and + # 2. detect coverage/source drift via the coverage half (would otherwise be + # invisible to a membership-only id). + # ':' never appears in either half (both are decimal hash strings), so a single + # split is unambiguous. node_bits/masks keep their plain membership id. + return f"{membership_build_id(con)}:{coverage_build_id(con)}" + + def build_facet_node_bits(con, out, build_id): # #293: authoritative concept_uri -> bit_index assignment per tree dim. The # explorer loads this to turn a node selection into a bitmask and filter @@ -559,6 +594,51 @@ def build_sample_facet_masks(con, out, build_id): ) TO '{out}' (FORMAT PARQUET, COMPRESSION ZSTD)""") +def build_sample_facet_index(con, out, build_id): + # #305/#306: the COMPLETE per-pid facet index — one row for EVERY located + # sample (samp_geo), not only those with tree membership. This is the artifact + # the multi-filter global-view count path (#304) scans: it must be able to count + # over the whole located universe, including samples that carry NO tree concept. + # + # WHY this exists separately from sample_facet_masks: masks is built FROM + # `membership`, so it silently OMITS located samples with no membership row + # (~29,917 in the 202608 generation — #306). Counting/ filtering off masks alone + # undercounts the located universe. Here we start from samp_geo (the authoritative + # located set), LEFT JOIN the same membership-derived masks, and emit a ZERO mask + # for no-membership pids (a zero mask matches no node bit → contributes 0 to every + # tree facet, which is correct: the sample is in no subtree — but it IS still a + # located sample and still counts toward `source` and the located total). + # + # `source` is a plain VARCHAR (source is exclusive, not multi-valued — a mask + # would be wrong, a facets_v3 semi-join would add a second 6M-row read). The + # mask columns are bit-identical to sample_facet_masks for pids that have + # membership (validated). bit assignment == build_facet_node_bits / masks. + con.execute(f"""COPY ( + WITH nb AS ( + SELECT facet_type, concept_uri, + (1::BIGINT << (ROW_NUMBER() OVER (PARTITION BY facet_type ORDER BY concept_uri) - 1)) AS bitval + FROM (SELECT DISTINCT facet_type, concept_uri FROM membership) + ), + masks AS ( + SELECT m.pid, + COALESCE(bit_or(CASE WHEN m.facet_type='material' THEN nb.bitval END), 0)::BIGINT AS material_mask, + COALESCE(bit_or(CASE WHEN m.facet_type='context' THEN nb.bitval END), 0)::BIGINT AS context_mask, + COALESCE(bit_or(CASE WHEN m.facet_type='object_type' THEN nb.bitval END), 0)::BIGINT AS object_type_mask + FROM membership m JOIN nb ON nb.facet_type=m.facet_type AND nb.concept_uri=m.concept_uri + GROUP BY m.pid + ) + SELECT sg.pid, + sg.source::VARCHAR AS source, + COALESCE(mk.material_mask, 0)::BIGINT AS material_mask, + COALESCE(mk.context_mask, 0)::BIGINT AS context_mask, + COALESCE(mk.object_type_mask, 0)::BIGINT AS object_type_mask, + '{build_id}' AS build_id, + {INDEX_SCHEMA_VERSION}::INTEGER AS schema_version + FROM samp_geo sg LEFT JOIN masks mk ON mk.pid = sg.pid + ORDER BY sg.pid + ) TO '{out}' (FORMAT PARQUET, COMPRESSION ZSTD)""") + + def file_meta(con, path): n = con.sql(f"SELECT COUNT(*) FROM read_parquet('{path}')").fetchone()[0] schema = [(r[0], r[1]) for r in con.sql(f"DESCRIBE SELECT * FROM read_parquet('{path}')").fetchall()] @@ -623,7 +703,8 @@ def emit(name, fn): # Hierarchy artifacts (#281/#282) — need vocab_labels for the SKOS tree. HIER_ARTIFACTS = {"sample_facet_membership", "facet_tree_summaries", - "facet_tree_cross_filter", "facet_node_bits", "sample_facet_masks"} + "facet_tree_cross_filter", "facet_node_bits", "sample_facet_masks", + "sample_facet_index"} if any(want(a) for a in HIER_ARTIFACTS): if not args.vocab_labels: # Fail loud if the user EXPLICITLY asked for a hierarchy artifact @@ -641,10 +722,15 @@ def emit(name, fn): # #293 bitmask filter artifacts — needs membership (above). node_bits # and masks share a node-set build_id so the explorer only uses the mask # path when the two are from the same generation (Codex P1). - if want("facet_node_bits") or want("sample_facet_masks"): + if want("facet_node_bits") or want("sample_facet_masks") or want("sample_facet_index"): _bid = membership_build_id(con) emit("facet_node_bits", lambda o: build_facet_node_bits(con, o, _bid)) emit("sample_facet_masks", lambda o: build_sample_facet_masks(con, o, _bid)) + # #305/#306: complete per-pid index (every located pid + source, + # zero-mask for no-membership pids). Its build_id embeds the SAME + # membership id as node_bits (mask-bit interpretation gate) PLUS a + # coverage id over samp_geo's (pid, source) universe (staleness gate). + emit("sample_facet_index", lambda o: build_sample_facet_index(con, o, index_build_id(con))) if not args.no_manifest: log("hashing inputs/outputs for manifest…", t0) diff --git a/scripts/validate_frontend_derived.py b/scripts/validate_frontend_derived.py index debd8e4..17d3904 100755 --- a/scripts/validate_frontend_derived.py +++ b/scripts/validate_frontend_derived.py @@ -13,6 +13,8 @@ python scripts/validate_frontend_derived.py \ --facets URL --map-lite URL --summaries URL --cross-filter URL \ --h3 URL4 URL6 URL8 + # hierarchy + #305/#306 complete index (auto-discovered with --dir/--tag, or): + python scripts/validate_frontend_derived.py --dir DIR --tag TAG --index INDEX.parquet """ import argparse, hashlib, json, os, sys import duckdb @@ -55,6 +57,7 @@ def main(): ap.add_argument("--tree-cross-filter", help="facet_tree_cross_filter parquet (#290/#293); optional") ap.add_argument("--node-bits", help="facet_node_bits parquet (#293); optional") ap.add_argument("--masks", help="sample_facet_masks parquet (#293); optional") + ap.add_argument("--index", help="sample_facet_index parquet (#305/#306); optional") ap.add_argument("--wide", help="source wide parquet — enables the SEMANTIC gate " "(re-derive and diff the written files against a fresh build)") ap.add_argument("--min-rows", type=int, default=1_000_000, @@ -300,6 +303,7 @@ def _opt(name, attr): treexf = _opt("facet_tree_cross_filter", "tree_cross_filter") nodebits = _opt("facet_node_bits", "node_bits") masks = _opt("sample_facet_masks", "masks") + index = _opt("sample_facet_index", "index") if tree: T = f"read_parquet('{tree}')" # parent ≥ child for every edge, every dim (distinct-pid UNION semantics — @@ -477,6 +481,88 @@ def _opt(name, attr): + (SELECT COUNT(*) FROM (SELECT * FROM b EXCEPT SELECT * FROM a))""") check(f"masks filter == membership ({dim})", d == 0, f"{d} pids differ for a real {dim} node") + # --- #305/#306 complete per-pid facet index (sample_facet_index) --- + # The multi-filter global-view count path scans THIS file. It must cover the + # ENTIRE located universe (facets_v2 = samp_geo), carry the correct source per + # pid, zero-mask the no-membership pids (#306), and be bit-identical to + # sample_facet_masks for pids that DO have membership — otherwise counts drift. + if index: + IX = f"read_parquet('{index}')" + # schema/types contract (the explorer reads these columns positionally-ish) + ix_sch = [(r[0], r[1]) for r in con.sql(f"DESCRIBE SELECT * FROM {IX}").fetchall()] + EXP_IX = [("pid", "VARCHAR"), ("source", "VARCHAR"), + ("material_mask", "BIGINT"), ("context_mask", "BIGINT"), + ("object_type_mask", "BIGINT"), ("build_id", "VARCHAR"), + ("schema_version", "INTEGER")] + check("index schema matches contract", ix_sch == EXP_IX, f"got {ix_sch}") + # one row per pid + ixdup = scalar(f"SELECT COUNT(*) FROM (SELECT pid FROM {IX} GROUP BY pid HAVING COUNT(*)>1)") + check("index: one row per pid", ixdup == 0, f"{ixdup} duplicate pids in index") + # COMPLETENESS (#306): index pid SET == facets_v2 pid SET (the located universe). + # This is the whole point — masks omits no-membership located pids; the index + # must not. A symmetric set-diff catches both missing and extra pids. + ix_diff = scalar(f"""SELECT + (SELECT COUNT(*) FROM (SELECT pid FROM {F} EXCEPT SELECT pid FROM {IX})) + + (SELECT COUNT(*) FROM (SELECT pid FROM {IX} EXCEPT SELECT pid FROM {F}))""") + check("index.pid == facets_v2.pid (complete located universe, #306)", ix_diff == 0, + f"{ix_diff} pids differ between index and facets_v2") + # SOURCE equality: index.source must equal facets_v2.source for every pid + # (IS NOT DISTINCT FROM so NULL==NULL). A source-value drift is exactly what + # the coverage half of build_id is meant to catch — assert it directly too. + src_bad = scalar(f"""SELECT COUNT(*) FROM {IX} i JOIN {F} f ON i.pid=f.pid + WHERE i.source IS DISTINCT FROM f.source""") + check("index.source == facets_v2.source", src_bad == 0, f"{src_bad} pids with mismatched source") + # single build_id, structured ":" + ix_bids = scalar(f"SELECT COUNT(DISTINCT build_id) FROM {IX}") + check("index: single build_id", ix_bids == 1, f"{ix_bids} distinct build_ids (want 1)") + ix_bid_fmt = scalar(f"SELECT COUNT(*) FROM {IX} WHERE build_id NOT LIKE '%:%'") + check("index: build_id is ':'", ix_bid_fmt == 0, + "build_id(s) missing the ':' coverage separator") + # single schema_version (sanity — the explorer keys behavior off it) + ix_sv = scalar(f"SELECT COUNT(DISTINCT schema_version) FROM {IX}") + check("index: single schema_version", ix_sv == 1, f"{ix_sv} distinct schema_versions") + if nodebits and ix_bids == 1: + # the membership half of the index build_id MUST equal node_bits.build_id + # so the mask bits are interpreted under the SAME assignment (else counts + # are computed against a stale/foreign bit layout — silently wrong). + nb_bid = scalar(f"SELECT COUNT(DISTINCT build_id) FROM read_parquet('{nodebits}')") + if nb_bid == 1: + membership_half_matches = scalar( + f"SELECT (SELECT split_part(MIN(build_id), ':', 1) FROM {IX}) " + f"= (SELECT MIN(build_id) FROM read_parquet('{nodebits}'))") + check("index build_id membership-half == node_bits build_id", + bool(membership_half_matches), + "index masks would be read under a different bit assignment than node_bits") + if masks: + MK = f"read_parquet('{masks}')" + # For pids that ALSO appear in masks (i.e. have membership), the three + # mask columns must be byte-identical — the index is a superset, not a + # re-computation that could drift. + mk_diff = scalar(f"""SELECT COUNT(*) FROM {IX} i JOIN {MK} m ON i.pid=m.pid + WHERE i.material_mask IS DISTINCT FROM m.material_mask + OR i.context_mask IS DISTINCT FROM m.context_mask + OR i.object_type_mask IS DISTINCT FROM m.object_type_mask""") + check("index masks == sample_facet_masks (shared pids)", mk_diff == 0, + f"{mk_diff} shared pids with differing masks") + # The ONLY pids allowed to differ between the two files are the + # no-membership located pids (#306): present in index, absent from masks, + # and they MUST be zero-masked. Verify (a) the extra pids are exactly the + # zero-mask ones and (b) none are in masks. + extra_nonzero = scalar(f"""SELECT COUNT(*) FROM {IX} i + WHERE NOT EXISTS (SELECT 1 FROM {MK} m WHERE m.pid=i.pid) + AND (i.material_mask <> 0 OR i.context_mask <> 0 OR i.object_type_mask <> 0)""") + check("index: no-membership extra pids are zero-masked (#306)", extra_nonzero == 0, + f"{extra_nonzero} index-only pids carry a non-zero mask (should be 0)") + elif mem: + # No masks file to diff against, but we can still assert the #306 invariant + # directly off membership: every index pid NOT in membership is zero-masked. + M = f"read_parquet('{mem}')" + extra_nonzero = scalar(f"""SELECT COUNT(*) FROM {IX} i + WHERE NOT EXISTS (SELECT 1 FROM {M} m WHERE m.pid=i.pid) + AND (i.material_mask <> 0 OR i.context_mask <> 0 OR i.object_type_mask <> 0)""") + check("index: no-membership pids zero-masked (#306, vs membership)", extra_nonzero == 0, + f"{extra_nonzero} no-membership pids carry a non-zero mask") + print(f"\n{'CHECK':<44} {'RESULT':<6} DETAIL\n" + "-" * 90) ok = True for name, passed, detail in R: diff --git a/tests/test_frontend_derived.py b/tests/test_frontend_derived.py index 94e05bc..73b5701 100644 --- a/tests/test_frontend_derived.py +++ b/tests/test_frontend_derived.py @@ -519,6 +519,180 @@ def test_facet_masks_only_builds(tmp_path): assert (tmp_path / "t_sample_facet_masks.parquet").exists() +# --------------------------------------------------------------------------- +# sample_facet_index — COMPLETE per-pid index (#305/#306) +# Reuses the tree vocab but adds a LOCATED sample with NO tree membership +# (root-only material, null context/object) — the exact #306 class that +# sample_facet_masks silently omits. The index must include it, zero-masked. +# --------------------------------------------------------------------------- +# (pid, source, material_rids, context_rids, object_rids) +INDEX_SAMPLES = [ + ("s1", "A", [2], [11], [21]), # mineral, earthinterior, othersolidobject (full membership) + ("s2", "A", [3], [11], [21]), # rock, earthinterior, othersolidobject (full membership) + ("s3", "B", [2], [11], [21]), # mineral, earthinterior, othersolidobject (full membership) + ("s-nomem", "C", [1], None, None), # material ROOT-only -> NULL material -> NO membership; located +] + + +def build_index_fixture(wide, vocab): + """Tree fixture + a located no-membership sample (the #306 case).""" + con = duckdb.connect(); con.execute("INSTALL spatial; LOAD spatial;") + ic_rows = " UNION ALL ".join( + f"SELECT 'IdentifiedConcept' AS otype, '{uri}' AS pid, {rid}::BIGINT AS row_id, NULL::VARCHAR AS n, " + f"'lbl' AS label, NULL::VARCHAR AS description, NULL::VARCHAR[] AS place_name, NULL::TIMESTAMP AS result_time, " + f"NULL AS geometry, NULL::BIGINT[] AS p__has_material_category, NULL::BIGINT[] AS p__has_context_category, " + f"NULL::BIGINT[] AS p__has_sample_object_type, NULL::BIGINT[] AS p__keywords" + for rid, uri in TREE_CONCEPTS) + msr = [] + for i, (pid, src, m, c, o) in enumerate(INDEX_SAMPLES): + msr.append( + f"SELECT 'MaterialSampleRecord' AS otype, '{pid}' AS pid, NULL::BIGINT AS row_id, '{src}' AS n, " + f"'label {pid}' AS label, 'desc {pid}' AS description, ['plc-{pid}']::VARCHAR[] AS place_name, " + f"NULL::TIMESTAMP AS result_time, ST_AsWKB(ST_Point({10.0+i},{40.0+i})) AS geometry, " + f"{_arr(m)} AS p__has_material_category, {_arr(c)} AS p__has_context_category, " + f"{_arr(o)} AS p__has_sample_object_type, NULL::BIGINT[] AS p__keywords") + con.execute(f"COPY ({ic_rows} UNION ALL {' UNION ALL '.join(msr)}) TO '{wide}' (FORMAT PARQUET)") + edges = " UNION ALL ".join( + f"SELECT '{u}' uri, {('NULL' if p is None else repr(p))}::VARCHAR broader, 'data_v1' uri_form" + for u, p in TREE_EDGES) + con.execute(f"COPY ({edges}) TO '{vocab}' (FORMAT PARQUET)") + con.close() + + +def _build_index(tmp_path, wide, vocab, tag="t"): + cmd = [sys.executable, BUILD, "--wide", wide, "--outdir", str(tmp_path), "--tag", tag, + "--skip", "wide_h3", "--no-manifest", "--vocab-labels", vocab] + return subprocess.run(cmd, capture_output=True, text=True) + + +def test_sample_facet_index_complete_and_zero_masked(tmp_path): + """#305/#306: the index covers EVERY located pid (incl. the no-membership one, + which masks omits), carries source, and zero-masks no-membership pids while + staying bit-identical to masks for membership pids.""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + r = _build_index(tmp_path, wide, vocab) + assert r.returncode == 0, f"index build failed:\n{r.stdout}\n{r.stderr}" + con = duckdb.connect() + IX = f"read_parquet('{tmp_path / 't_sample_facet_index.parquet'}')" + MK = f"read_parquet('{tmp_path / 't_sample_facet_masks.parquet'}')" + F = f"read_parquet('{tmp_path / 't_sample_facets_v2.parquet'}')" + + # COMPLETENESS: index has all 4 located pids; masks has only the 3 with membership + assert con.sql(f"SELECT COUNT(*) FROM {IX}").fetchone()[0] == 4 + assert con.sql(f"SELECT COUNT(*) FROM {MK}").fetchone()[0] == 3 + # index pid set == facets_v2 pid set (the located universe) + diff = con.sql(f"""SELECT (SELECT COUNT(*) FROM (SELECT pid FROM {F} EXCEPT SELECT pid FROM {IX})) + + (SELECT COUNT(*) FROM (SELECT pid FROM {IX} EXCEPT SELECT pid FROM {F}))""").fetchone()[0] + assert diff == 0, "index pid set != facets_v2 located universe" + # the no-membership located sample IS present and FULLY zero-masked (#306) + row = con.sql(f"SELECT material_mask, context_mask, object_type_mask, source FROM {IX} WHERE pid='s-nomem'").fetchone() + assert row is not None, "s-nomem (located, no membership) missing from index — the #306 bug" + assert row[:3] == (0, 0, 0), f"s-nomem should be zero-masked, got {row[:3]}" + assert row[3] == "C", f"s-nomem source should be 'C', got {row[3]!r}" + # masks bit-identical for shared pids + mk_diff = con.sql(f"""SELECT COUNT(*) FROM {IX} i JOIN {MK} m ON i.pid=m.pid + WHERE i.material_mask IS DISTINCT FROM m.material_mask + OR i.context_mask IS DISTINCT FROM m.context_mask + OR i.object_type_mask IS DISTINCT FROM m.object_type_mask""").fetchone()[0] + assert mk_diff == 0, "index masks drift from sample_facet_masks for shared pids" + # build_id is structured ":" and its membership half == node_bits + bid = con.sql(f"SELECT DISTINCT build_id FROM {IX}").fetchall() + assert len(bid) == 1 and ":" in bid[0][0], f"index build_id not single/structured: {bid}" + nb_bid = con.sql(f"SELECT DISTINCT build_id FROM read_parquet('{tmp_path / 't_facet_node_bits.parquet'}')").fetchone()[0] + assert bid[0][0].split(":", 1)[0] == nb_bid, "index build_id membership-half != node_bits build_id" + + +def test_sample_facet_index_coverage_id_changes_with_source(tmp_path): + """The coverage half of build_id must change when source/coverage changes even + though membership is identical (the silent-staleness hole the plan calls out).""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + assert _build_index(tmp_path, wide, vocab, tag="a").returncode == 0 + con = duckdb.connect() + bid_a = con.sql(f"SELECT DISTINCT build_id FROM read_parquet('{tmp_path / 'a_sample_facet_index.parquet'}')").fetchone()[0] + + # rebuild with a flipped source on s-nomem (same membership, different coverage) + wide2 = str(tmp_path / "wide2.parquet") + con.execute(f"""COPY ( + SELECT * REPLACE (CASE WHEN pid='s-nomem' THEN 'Z' ELSE n END AS n) FROM read_parquet('{wide}') + ) TO '{wide2}' (FORMAT PARQUET)""") + assert _build_index(tmp_path, wide2, vocab, tag="b").returncode == 0 + bid_b = con.sql(f"SELECT DISTINCT build_id FROM read_parquet('{tmp_path / 'b_sample_facet_index.parquet'}')").fetchone()[0] + + mem_a, cov_a = bid_a.split(":", 1) + mem_b, cov_b = bid_b.split(":", 1) + assert mem_a == mem_b, "membership half should be unchanged (membership identical)" + assert cov_a != cov_b, "coverage half should change when a source value changes" + + +def test_sample_facet_index_validator_passes_and_gates_bite(tmp_path): + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + assert _build_index(tmp_path, wide, vocab).returncode == 0 + # clean: validator (incl. the new index gates) must PASS + v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", + "--min-rows", "1", "--wide", wide], capture_output=True, text=True) + assert v.returncode == 0, f"validator failed on clean index fixture:\n{v.stdout}\n{v.stderr}" + assert "index.pid == facets_v2.pid" in v.stdout + + # corrupt 1: drop the no-membership pid (re-introduce the #306 bug) -> completeness gate fires + ix = str(tmp_path / "t_sample_facet_index.parquet") + con = duckdb.connect(); tmp_i = ix + ".tmp" + con.execute(f"""COPY (SELECT * FROM read_parquet('{ix}') WHERE pid <> 's-nomem') + TO '{tmp_i}' (FORMAT PARQUET)"""); os.replace(tmp_i, ix) + v2 = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], + capture_output=True, text=True) + assert v2.returncode != 0 and "index.pid == facets_v2.pid" in v2.stdout, \ + f"completeness gate failed to catch a dropped located pid:\n{v2.stdout}" + + +def test_sample_facet_index_nonzero_mask_on_nomem_caught(tmp_path): + """A no-membership pid carrying a non-zero mask is a corruption the gate must catch.""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + assert _build_index(tmp_path, wide, vocab).returncode == 0 + ix = str(tmp_path / "t_sample_facet_index.parquet") + con = duckdb.connect(); tmp_i = ix + ".tmp" + con.execute(f"""COPY (SELECT pid, source, + CASE WHEN pid='s-nomem' THEN 1::BIGINT ELSE material_mask END AS material_mask, + context_mask, object_type_mask, build_id, schema_version + FROM read_parquet('{ix}')) TO '{tmp_i}' (FORMAT PARQUET)"""); os.replace(tmp_i, ix) + v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], + capture_output=True, text=True) + assert v.returncode != 0 and "zero-masked" in v.stdout, \ + f"zero-mask gate failed to catch a non-zero no-membership pid:\n{v.stdout}" + + +def test_sample_facet_index_source_drift_caught(tmp_path): + """index.source must equal facets_v2.source; a drifted source must fail the gate.""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + assert _build_index(tmp_path, wide, vocab).returncode == 0 + ix = str(tmp_path / "t_sample_facet_index.parquet") + con = duckdb.connect(); tmp_i = ix + ".tmp" + con.execute(f"""COPY (SELECT pid, + CASE WHEN pid='s1' THEN 'WRONG' ELSE source END AS source, + material_mask, context_mask, object_type_mask, build_id, schema_version + FROM read_parquet('{ix}')) TO '{tmp_i}' (FORMAT PARQUET)"""); os.replace(tmp_i, ix) + v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], + capture_output=True, text=True) + assert v.returncode != 0 and "index.source == facets_v2.source" in v.stdout, \ + f"source-equality gate failed to catch drift:\n{v.stdout}" + + +def test_sample_facet_index_only_builds(tmp_path): + """--only sample_facet_index must produce the file (and its prereqs build).""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + r = subprocess.run([sys.executable, BUILD, "--wide", wide, "--outdir", str(tmp_path), "--tag", "t", + "--no-manifest", "--vocab-labels", vocab, + "--only", "sample_facet_index"], capture_output=True, text=True) + assert r.returncode == 0, f"{r.stdout}\n{r.stderr}" + assert (tmp_path / "t_sample_facet_index.parquet").exists(), \ + f"--only sample_facet_index produced no file:\n{r.stdout}" + + def test_scheme_corruption_caught(tmp_path): wide = str(tmp_path / "wide.parquet"); build_fixture_wide(wide, "blob") assert _build(tmp_path, wide).returncode == 0 From bf3da2a0293ece399874c6819ad89eae9654f904 Mon Sep 17 00:00:00 2001 From: Raymond Yee Date: Fri, 19 Jun 2026 22:19:10 -0700 Subject: [PATCH 03/14] #306/#305: harden index validator + fingerprint (Codex round 1) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Address Codex adversarial review of the sample_facet_index Phase-1 diff: - #1 staleness gate: validator now INDEPENDENTLY recomputes both build_id halves from the written siblings (coverage from sample_facets_v2, membership from sample_facet_membership) and asserts equality — a stale or hand-edited coverage id now FAILS at build time even though node_bits is unchanged. Runtime explorer handshake documented as a Phase 2 item. - #2 validator metadata/NULL gates: schema_version must equal the exact contract version (999 was accepted); build_id must match the exact shape __:__ ("a colon somewhere" accepted :bogus:extra); NOT-NULL contract on pid/masks/build_id/version; zero-mask checks use IS DISTINCT FROM 0 (a NULL mask escaped the old <> 0). - #3 independent mask check: re-derive masks from membership + node_bits for EVERY index pid (incl. zero rows) and symmetric-diff — no longer trusts the optional sample_facet_masks file. - #4 orphan artifact: facet_node_bits is force-emitted whenever masks/index is built, so --only sample_facet_index can't ship an uninterpretable mask file. - #5 fingerprint robustness: shared MEMBERSHIP/COVERAGE token exprs; XOR+SUM+ COUNT trio (resists XOR cancellation / 2-row swaps); NULL source encoded with a sentinel byte so NULL≠''; honest non-cryptographic caveat. - SQL gap: reject NULL pids in build_base_tables (a single NULL slipped the dup check, broke pid joins, and was absent from the coverage hash). Tests: +5 adversarial (null mask, schema_version=999, malformed build_id, stale coverage id, null pid) — all gates bite. 34/34 pass. Refs #305 #306 --- SERIALIZATIONS.md | 21 +++-- scripts/build_frontend_derived.py | 81 ++++++++++++++----- scripts/validate_frontend_derived.py | 111 ++++++++++++++++++++------- tests/test_frontend_derived.py | 105 +++++++++++++++++++++++-- 4 files changed, 257 insertions(+), 61 deletions(-) diff --git a/SERIALIZATIONS.md b/SERIALIZATIONS.md index bc9e0d1..2488f09 100644 --- a/SERIALIZATIONS.md +++ b/SERIALIZATIONS.md @@ -325,11 +325,22 @@ for the alias when you want "latest." dimensions, **excluding the target dimension's own predicate**). This is the already-shipped cube/UI semantics; this index does not change it, it serves the same contract to the *multi-filter* case the single-filter cube cannot reach. -- **Validation**: `validate_frontend_derived.py --index ` asserts: schema, - one row per pid, `pid` set == `sample_facets_v2.pid` (the located universe), - `source` == facets_v2 source, single structured `build_id` whose membership half - matches `node_bits`, masks bit-identical to `sample_facet_masks` for shared pids, - and that every no-membership pid is zero-masked. +- **Validation**: `validate_frontend_derived.py --index ` is an INDEPENDENT + oracle — it asserts schema + NOT-NULL columns, one row per pid, `pid` set == + `sample_facets_v2.pid` (the located universe), `source` == facets_v2 source, + `schema_version` == the exact contract version, a well-formed single `build_id`, + and then **recomputes** both `build_id` halves from the written sibling files (the + coverage half from `sample_facets_v2`, the membership half from + `sample_facet_membership`) and asserts equality — so a stale/edited coverage id is + caught at build time even though `node_bits` is unchanged. Masks are re-derived + directly from `membership` + `node_bits` for **every** index pid (zero rows + included) and symmetric-diffed against the file. +- **Runtime staleness handshake (Phase 2)**: the *explorer* loads the index by URL + and so cannot, on its own, know the expected coverage half. The load-time refusal + of a stale-coverage index requires an external trusted pointer (e.g. the expected + composite id co-published in the manifest / a `current` alias the explorer reads); + that handshake lands with the count-query consumer in Phase 2. Build-time + validation (above) is the gate for now. - **Immutability**: published under a **new** versioned filename — it is a new artifact name and never overwrites a cached `sample_facet_masks` or any prior tag. diff --git a/scripts/build_frontend_derived.py b/scripts/build_frontend_derived.py index 661a78e..13a1d5f 100755 --- a/scripts/build_frontend_derived.py +++ b/scripts/build_frontend_derived.py @@ -208,12 +208,19 @@ def build_base_tables(con, wide, t0): n_geo = con.sql("SELECT COUNT(*) FROM samp_geo").fetchone()[0] n_dup = con.sql("SELECT COUNT(*) FROM (SELECT pid FROM samp_geo GROUP BY pid HAVING COUNT(*)>1)").fetchone()[0] n_icdup = con.sql("SELECT COUNT(*) FROM (SELECT row_id FROM ic GROUP BY row_id HAVING COUNT(*)>1)").fetchone()[0] - log(f"samp={n_samp:,} samp_geo={n_geo:,} duplicate_pids={n_dup:,} duplicate_concept_row_ids={n_icdup:,}", t0) - if n_dup or n_icdup: - # HARD fail (Codex): non-unique keys make the output grain wrong (inflated + # #305/#306 (Codex): the GROUP BY...HAVING COUNT>1 dup check above collapses all + # NULL pids into ONE group, so a SINGLE NULL pid slips through — yet a NULL pid + # breaks every pid join (joins drop it) and is silently absent from the coverage + # fingerprint. Reject it explicitly. + n_null = con.sql("SELECT COUNT(*) FROM samp_geo WHERE pid IS NULL").fetchone()[0] + log(f"samp={n_samp:,} samp_geo={n_geo:,} duplicate_pids={n_dup:,} " + f"null_pids={n_null:,} duplicate_concept_row_ids={n_icdup:,}", t0) + if n_dup or n_icdup or n_null: + # HARD fail (Codex): non-unique/NULL keys make the output grain wrong (inflated # facet counts, ambiguous joins, non-total ORDER BY pid). Abort before writing. raise SystemExit( - f"FATAL: non-unique keys — duplicate_pids={n_dup}, duplicate_concept_row_ids={n_icdup}. " + f"FATAL: non-unique/NULL keys — duplicate_pids={n_dup}, null_pids={n_null}, " + f"duplicate_concept_row_ids={n_icdup}. " f"Output grain/joins would be wrong; refusing to write.") @@ -499,6 +506,32 @@ def build_facet_tree_cross_filter(con, out): ) TO '{out}' (FORMAT PARQUET, COMPRESSION ZSTD)""") +# --- generation fingerprints (#293, #305/#306) ----------------------------- +# The per-row token expressions are module constants so the VALIDATOR can +# recompute the exact same fingerprint independently from the written sibling +# files (membership, sample_facets_v2) and assert the index's build_id — i.e. +# the builder is not the sole authority on the index's generation identity. +# chr(31) (US) separates fields. A NULL source is encoded with a distinct +# sentinel byte so NULL and '' can never collide (Codex #5). +MEMBERSHIP_TOKEN_EXPR = "pid || chr(31) || facet_type || chr(31) || concept_uri" +COVERAGE_TOKEN_EXPR = ("pid || chr(31) || CASE WHEN source IS NULL THEN chr(0) ELSE chr(1) END " + "|| chr(31) || COALESCE(source, '')") + + +def _fingerprint(con, relation, token_expr): + # Order-independent generation fingerprint over `relation`. Combines THREE + # order-independent accumulators — XOR, SUM (HUGEINT, no overflow), and COUNT — + # of per-row hashes. XOR alone cancels identical rows and is vulnerable to a + # 2-row swap; SUM+COUNT defeat that. NOT a cryptographic digest: this defends + # against accidental drift / stale generations, not an adversary who engineers a + # multi-row hash collision. Grain is unique (validated), so the trio is stable. + x, s, n = con.sql( + f"SELECT COALESCE(bit_xor(hash({token_expr})), 0), " + f" COALESCE(SUM(hash({token_expr})::HUGEINT), 0), COUNT(*) " + f"FROM {relation}").fetchone() + return f"{x}_{s}_{n}" + + def membership_build_id(con): # #293 (Codex P1, r2): a fingerprint of the FULL membership generation — not # just the node set. Both node_bits (positional bit assignment) and masks are @@ -506,12 +539,8 @@ def membership_build_id(con): # change that would alter either artifact (new/dropped pids, re-mapped concepts, # AND node-set changes). Embedding this id in both lets the explorer refuse the # mask path unless the two are from the SAME generation (guards a stale-cached - # masks file). Order-independent XOR of per-row hashes (membership grain is - # unique per (pid,facet_type,concept_uri) — validated — so no XOR cancellation). - return con.sql(""" - SELECT CAST(COALESCE(bit_xor(hash(pid || chr(31) || facet_type || chr(31) || concept_uri)), 0) - AS VARCHAR) - FROM membership""").fetchone()[0] + # masks file). membership grain is unique per (pid,facet_type,concept_uri). + return _fingerprint(con, "membership", MEMBERSHIP_TOKEN_EXPR) def coverage_build_id(con): @@ -519,14 +548,10 @@ def coverage_build_id(con): # the (pid, source) pairs over the LOCATED set (samp_geo), independent of tree # membership. membership_build_id() alone is blind to this: an index built from # the SAME membership but a changed source value or a changed located-pid set - # (exactly the #306 class of drift — located pids with no membership) would - # carry an identical membership id and go stale SILENTLY. Order-independent XOR - # of per-row hashes; samp_geo.pid is unique (hard-checked in build_base_tables), - # so no XOR cancellation. COALESCE(source,'') so a NULL→'' source flip is caught. - return con.sql(""" - SELECT CAST(COALESCE(bit_xor(hash(pid || chr(31) || COALESCE(source, ''))), 0) - AS VARCHAR) - FROM samp_geo""").fetchone()[0] + # (exactly the #306 class of drift — located pids with no membership) would carry + # an identical membership id and go stale SILENTLY. NULL vs '' source are encoded + # distinctly (sentinel byte) so a NULL↔'' flip changes the fingerprint. + return _fingerprint(con, "samp_geo", COVERAGE_TOKEN_EXPR) def index_build_id(con): @@ -538,8 +563,9 @@ def index_build_id(con): # generation), and # 2. detect coverage/source drift via the coverage half (would otherwise be # invisible to a membership-only id). - # ':' never appears in either half (both are decimal hash strings), so a single - # split is unambiguous. node_bits/masks keep their plain membership id. + # Each half is "__" (decimal + underscore only); ':' never + # appears in either half, so split(':', 1) is unambiguous. node_bits/masks keep + # their plain membership id (== the membership half here, for the match gate). return f"{membership_build_id(con)}:{coverage_build_id(con)}" @@ -724,7 +750,20 @@ def emit(name, fn): # path when the two are from the same generation (Codex P1). if want("facet_node_bits") or want("sample_facet_masks") or want("sample_facet_index"): _bid = membership_build_id(con) - emit("facet_node_bits", lambda o: build_facet_node_bits(con, o, _bid)) + # facet_node_bits is the bit-INTERPRETATION file: sample_facet_masks + # AND sample_facet_index store raw mask bits that are meaningless + # without it. So node_bits is FORCE-emitted whenever either consumer + # is built — even under `--only sample_facet_masks/_index` — so a + # build never ships an orphan mask file (Codex #4). The shared _bid + # lets the explorer gate the two on the same generation (Codex #293). + force_node_bits = want("sample_facet_masks") or want("sample_facet_index") + if want("facet_node_bits") or force_node_bits: + build_node = lambda o: build_facet_node_bits(con, o, _bid) + if want("facet_node_bits"): + emit("facet_node_bits", build_node) + else: + build_node(p("facet_node_bits")); produced.append(p("facet_node_bits")) + log("facet_node_bits ✓ (auto-paired with masks/index)", t0) emit("sample_facet_masks", lambda o: build_sample_facet_masks(con, o, _bid)) # #305/#306: complete per-pid index (every located pid + source, # zero-mask for no-membership pids). Its build_id embeds the SAME diff --git a/scripts/validate_frontend_derived.py b/scripts/validate_frontend_derived.py index 17d3904..84a4294 100755 --- a/scripts/validate_frontend_derived.py +++ b/scripts/validate_frontend_derived.py @@ -487,14 +487,35 @@ def _opt(name, attr): # pid, zero-mask the no-membership pids (#306), and be bit-identical to # sample_facet_masks for pids that DO have membership — otherwise counts drift. if index: + # Import the builder's canonical schema version + fingerprint token exprs so + # the validator is an INDEPENDENT oracle (Codex #1/#5): it recomputes the + # expected build_id from the written sibling files and asserts equality, + # rather than trusting whatever the builder stamped. + sys.path.insert(0, os.path.dirname(os.path.abspath(__file__))) + from build_frontend_derived import (INDEX_SCHEMA_VERSION, MEMBERSHIP_TOKEN_EXPR, + COVERAGE_TOKEN_EXPR) + + def _fp(relation, token_expr): # mirror build_frontend_derived._fingerprint + x, s, n = con.sql( + f"SELECT COALESCE(bit_xor(hash({token_expr})), 0), " + f"COALESCE(SUM(hash({token_expr})::HUGEINT), 0), COUNT(*) FROM {relation}").fetchone() + return f"{x}_{s}_{n}" + IX = f"read_parquet('{index}')" - # schema/types contract (the explorer reads these columns positionally-ish) + # schema/types contract (the explorer reads these columns by name) ix_sch = [(r[0], r[1]) for r in con.sql(f"DESCRIBE SELECT * FROM {IX}").fetchall()] EXP_IX = [("pid", "VARCHAR"), ("source", "VARCHAR"), ("material_mask", "BIGINT"), ("context_mask", "BIGINT"), ("object_type_mask", "BIGINT"), ("build_id", "VARCHAR"), ("schema_version", "INTEGER")] check("index schema matches contract", ix_sch == EXP_IX, f"got {ix_sch}") + # NOT-NULL contract (source MAY be NULL — it mirrors facets_v2; everything + # else must be present). A NULL mask is a real corruption (Codex #2): the + # zero-mask check below uses IS DISTINCT FROM so it also rejects NULL. + nn_bad = scalar(f"""SELECT COUNT(*) FROM {IX} WHERE pid IS NULL + OR material_mask IS NULL OR context_mask IS NULL OR object_type_mask IS NULL + OR build_id IS NULL OR schema_version IS NULL""") + check("index: required columns non-NULL", nn_bad == 0, f"{nn_bad} rows with NULL pid/mask/build_id/version") # one row per pid ixdup = scalar(f"SELECT COUNT(*) FROM (SELECT pid FROM {IX} GROUP BY pid HAVING COUNT(*)>1)") check("index: one row per pid", ixdup == 0, f"{ixdup} duplicate pids in index") @@ -507,20 +528,42 @@ def _opt(name, attr): check("index.pid == facets_v2.pid (complete located universe, #306)", ix_diff == 0, f"{ix_diff} pids differ between index and facets_v2") # SOURCE equality: index.source must equal facets_v2.source for every pid - # (IS NOT DISTINCT FROM so NULL==NULL). A source-value drift is exactly what - # the coverage half of build_id is meant to catch — assert it directly too. + # (IS DISTINCT FROM so NULL==NULL). A source-value drift is exactly what the + # coverage half of build_id is meant to catch — assert it directly too. src_bad = scalar(f"""SELECT COUNT(*) FROM {IX} i JOIN {F} f ON i.pid=f.pid WHERE i.source IS DISTINCT FROM f.source""") check("index.source == facets_v2.source", src_bad == 0, f"{src_bad} pids with mismatched source") - # single build_id, structured ":" + # schema_version must be EXACTLY the contract version (Codex #2: "one distinct + # value" let 999 through). The explorer refuses an index it can't interpret. + ix_sv_bad = scalar(f"SELECT COUNT(*) FROM {IX} WHERE schema_version IS DISTINCT FROM {INDEX_SCHEMA_VERSION}") + check(f"index: schema_version == {INDEX_SCHEMA_VERSION}", ix_sv_bad == 0, + f"{ix_sv_bad} rows with schema_version != {INDEX_SCHEMA_VERSION}") + # build_id must be a SINGLE value of the EXACT shape "__:__" + # — exactly two ':'-separated halves, each exactly three '_'-separated decimals. + # (Codex #2: "a colon somewhere" accepted ":bogus:extra".) ix_bids = scalar(f"SELECT COUNT(DISTINCT build_id) FROM {IX}") check("index: single build_id", ix_bids == 1, f"{ix_bids} distinct build_ids (want 1)") - ix_bid_fmt = scalar(f"SELECT COUNT(*) FROM {IX} WHERE build_id NOT LIKE '%:%'") - check("index: build_id is ':'", ix_bid_fmt == 0, - "build_id(s) missing the ':' coverage separator") - # single schema_version (sanity — the explorer keys behavior off it) - ix_sv = scalar(f"SELECT COUNT(DISTINCT schema_version) FROM {IX}") - check("index: single schema_version", ix_sv == 1, f"{ix_sv} distinct schema_versions") + bid_re = r'^[0-9]+_[0-9]+_[0-9]+:[0-9]+_[0-9]+_[0-9]+$' + ix_bid_fmt = scalar(f"SELECT COUNT(*) FROM {IX} WHERE regexp_matches(build_id, '{bid_re}') = FALSE") + check("index: build_id is well-formed ':'", ix_bid_fmt == 0, + "build_id(s) not exactly __:__") + # INDEPENDENT build_id recomputation (Codex #1/#5): the coverage half is + # recomputed from facets_v2 (the located (pid, source) universe — equivalent + # to samp_geo) and, when membership is present, the membership half from it. + # This is the gate that actually catches coverage/source staleness at + # validation time — a membership-only id (or a hand-edited coverage half) + # FAILS here even though node_bits is unchanged. + if ix_bids == 1: + ix_bid = scalar(f"SELECT MIN(build_id) FROM {IX}") + exp_cov = _fp(F, COVERAGE_TOKEN_EXPR) + got_cov = ix_bid.split(":", 1)[1] if ":" in ix_bid else "" + check("index build_id coverage-half == recomputed from facets_v2", got_cov == exp_cov, + f"coverage half {got_cov!r} != expected {exp_cov!r} (stale/edited coverage)") + if mem: + exp_mem = _fp(f"read_parquet('{mem}')", MEMBERSHIP_TOKEN_EXPR) + got_mem = ix_bid.split(":", 1)[0] + check("index build_id membership-half == recomputed from membership", got_mem == exp_mem, + f"membership half {got_mem!r} != expected {exp_mem!r} (stale/foreign generation)") if nodebits and ix_bids == 1: # the membership half of the index build_id MUST equal node_bits.build_id # so the mask bits are interpreted under the SAME assignment (else counts @@ -533,35 +576,45 @@ def _opt(name, attr): check("index build_id membership-half == node_bits build_id", bool(membership_half_matches), "index masks would be read under a different bit assignment than node_bits") - if masks: + # INDEPENDENT mask re-derivation (Codex #3): re-derive the expected masks for + # EVERY index pid directly from membership + node_bits (zero for pids with no + # membership) and symmetric-diff against the index. This validates the masks + # without trusting the optional sample_facet_masks file, AND covers the + # zero-mask rows that a masks-only diff never sees. + if mem and nodebits: + M = f"read_parquet('{mem}')" + NB = f"read_parquet('{nodebits}')" + ref = (f"WITH nb AS (SELECT facet_type, concept_uri, (1::BIGINT << bit_index) AS bitval FROM {NB}), " + f"mm AS (SELECT m.pid, " + f"COALESCE(bit_or(CASE WHEN m.facet_type='material' THEN nb.bitval END),0)::BIGINT material_mask, " + f"COALESCE(bit_or(CASE WHEN m.facet_type='context' THEN nb.bitval END),0)::BIGINT context_mask, " + f"COALESCE(bit_or(CASE WHEN m.facet_type='object_type' THEN nb.bitval END),0)::BIGINT object_type_mask " + f"FROM {M} m JOIN nb ON nb.facet_type=m.facet_type AND nb.concept_uri=m.concept_uri GROUP BY m.pid) " + f"SELECT i.pid, COALESCE(mm.material_mask,0)::BIGINT material_mask, " + f"COALESCE(mm.context_mask,0)::BIGINT context_mask, " + f"COALESCE(mm.object_type_mask,0)::BIGINT object_type_mask " + f"FROM (SELECT pid FROM {IX}) i LEFT JOIN mm ON mm.pid=i.pid") + fil = f"SELECT pid, material_mask, context_mask, object_type_mask FROM {IX}" + mm_diff = scalar(f"SELECT (SELECT COUNT(*) FROM (({ref}) EXCEPT ({fil}))) " + f"+ (SELECT COUNT(*) FROM (({fil}) EXCEPT ({ref})))") + check("index masks == re-derived from membership+node_bits (all pids)", mm_diff == 0, + f"{mm_diff} index rows disagree with an independent re-derivation") + elif masks: + # Fallback when membership/node_bits aren't both available: diff shared + # pids against the masks file and assert no-membership extras are zero. MK = f"read_parquet('{masks}')" - # For pids that ALSO appear in masks (i.e. have membership), the three - # mask columns must be byte-identical — the index is a superset, not a - # re-computation that could drift. mk_diff = scalar(f"""SELECT COUNT(*) FROM {IX} i JOIN {MK} m ON i.pid=m.pid WHERE i.material_mask IS DISTINCT FROM m.material_mask OR i.context_mask IS DISTINCT FROM m.context_mask OR i.object_type_mask IS DISTINCT FROM m.object_type_mask""") check("index masks == sample_facet_masks (shared pids)", mk_diff == 0, f"{mk_diff} shared pids with differing masks") - # The ONLY pids allowed to differ between the two files are the - # no-membership located pids (#306): present in index, absent from masks, - # and they MUST be zero-masked. Verify (a) the extra pids are exactly the - # zero-mask ones and (b) none are in masks. extra_nonzero = scalar(f"""SELECT COUNT(*) FROM {IX} i WHERE NOT EXISTS (SELECT 1 FROM {MK} m WHERE m.pid=i.pid) - AND (i.material_mask <> 0 OR i.context_mask <> 0 OR i.object_type_mask <> 0)""") + AND (i.material_mask IS DISTINCT FROM 0 OR i.context_mask IS DISTINCT FROM 0 + OR i.object_type_mask IS DISTINCT FROM 0)""") check("index: no-membership extra pids are zero-masked (#306)", extra_nonzero == 0, - f"{extra_nonzero} index-only pids carry a non-zero mask (should be 0)") - elif mem: - # No masks file to diff against, but we can still assert the #306 invariant - # directly off membership: every index pid NOT in membership is zero-masked. - M = f"read_parquet('{mem}')" - extra_nonzero = scalar(f"""SELECT COUNT(*) FROM {IX} i - WHERE NOT EXISTS (SELECT 1 FROM {M} m WHERE m.pid=i.pid) - AND (i.material_mask <> 0 OR i.context_mask <> 0 OR i.object_type_mask <> 0)""") - check("index: no-membership pids zero-masked (#306, vs membership)", extra_nonzero == 0, - f"{extra_nonzero} no-membership pids carry a non-zero mask") + f"{extra_nonzero} index-only pids carry a non-zero/NULL mask (should be 0)") print(f"\n{'CHECK':<44} {'RESULT':<6} DETAIL\n" + "-" * 90) ok = True diff --git a/tests/test_frontend_derived.py b/tests/test_frontend_derived.py index 73b5701..a6140aa 100644 --- a/tests/test_frontend_derived.py +++ b/tests/test_frontend_derived.py @@ -211,6 +211,24 @@ def test_duplicate_pid_hard_fails(tmp_path): f"builder should hard-fail on duplicate pids:\n{r.stdout}\n{r.stderr}" +def test_null_pid_hard_fails(tmp_path): + """Codex (#305 SQL gap): a SINGLE NULL pid slips the GROUP BY...HAVING COUNT>1 + dup check but breaks every pid join + the coverage fingerprint. Reject it.""" + wide = str(tmp_path / "wide.parquet"); build_fixture_wide(wide, "blob") + con = duckdb.connect(); con.execute("INSTALL spatial; LOAD spatial;") + nullp = str(tmp_path / "wide_nullpid.parquet") + # append one located row with a NULL pid + con.execute(f"""COPY ( + SELECT * FROM read_parquet('{wide}') + UNION ALL + SELECT * REPLACE (NULL::VARCHAR AS pid) FROM read_parquet('{wide}') WHERE pid='m-real-first' + ) TO '{nullp}' (FORMAT PARQUET)""") + con.close() + r = _build(tmp_path, nullp) + assert r.returncode != 0 and "null_pids=1" in (r.stdout + r.stderr).lower(), \ + f"builder should hard-fail on a NULL pid:\n{r.stdout}\n{r.stderr}" + + def test_manifest_emitted(tmp_path): import json wide = str(tmp_path / "wide.parquet"); build_fixture_wide(wide, "blob") @@ -648,7 +666,8 @@ def test_sample_facet_index_validator_passes_and_gates_bite(tmp_path): def test_sample_facet_index_nonzero_mask_on_nomem_caught(tmp_path): - """A no-membership pid carrying a non-zero mask is a corruption the gate must catch.""" + """A no-membership pid carrying a non-zero mask is a corruption the gate must + catch. With membership+node_bits present the independent re-derivation fires.""" wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") build_index_fixture(wide, vocab) assert _build_index(tmp_path, wide, vocab).returncode == 0 @@ -660,8 +679,78 @@ def test_sample_facet_index_nonzero_mask_on_nomem_caught(tmp_path): FROM read_parquet('{ix}')) TO '{tmp_i}' (FORMAT PARQUET)"""); os.replace(tmp_i, ix) v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], capture_output=True, text=True) - assert v.returncode != 0 and "zero-masked" in v.stdout, \ - f"zero-mask gate failed to catch a non-zero no-membership pid:\n{v.stdout}" + assert v.returncode != 0 and "re-derived from membership" in v.stdout, \ + f"mask gate failed to catch a non-zero no-membership pid:\n{v.stdout}" + + +def test_sample_facet_index_null_mask_caught(tmp_path): + """Codex #2: a NULL mask escaped the old `<> 0` predicate (NULL=unknown). The + NOT-NULL gate must catch it.""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + assert _build_index(tmp_path, wide, vocab).returncode == 0 + ix = str(tmp_path / "t_sample_facet_index.parquet") + con = duckdb.connect(); tmp_i = ix + ".tmp" + con.execute(f"""COPY (SELECT pid, source, + CASE WHEN pid='s-nomem' THEN NULL::BIGINT ELSE material_mask END AS material_mask, + context_mask, object_type_mask, build_id, schema_version + FROM read_parquet('{ix}')) TO '{tmp_i}' (FORMAT PARQUET)"""); os.replace(tmp_i, ix) + v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], + capture_output=True, text=True) + assert v.returncode != 0 and "non-NULL" in v.stdout, \ + f"NOT-NULL gate failed to catch a NULL mask:\n{v.stdout}" + + +def test_sample_facet_index_bad_schema_version_caught(tmp_path): + """Codex #2: schema_version=999 passed the old 'one distinct value' check.""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + assert _build_index(tmp_path, wide, vocab).returncode == 0 + ix = str(tmp_path / "t_sample_facet_index.parquet") + con = duckdb.connect(); tmp_i = ix + ".tmp" + con.execute(f"""COPY (SELECT pid, source, material_mask, context_mask, object_type_mask, + build_id, 999::INTEGER AS schema_version FROM read_parquet('{ix}')) + TO '{tmp_i}' (FORMAT PARQUET)"""); os.replace(tmp_i, ix) + v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], + capture_output=True, text=True) + assert v.returncode != 0 and "schema_version ==" in v.stdout, \ + f"schema_version gate failed to catch 999:\n{v.stdout}" + + +def test_sample_facet_index_malformed_build_id_caught(tmp_path): + """Codex #2: ':bogus:extra' passed the old 'a colon somewhere' check.""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + assert _build_index(tmp_path, wide, vocab).returncode == 0 + ix = str(tmp_path / "t_sample_facet_index.parquet") + con = duckdb.connect(); tmp_i = ix + ".tmp" + # keep a correct membership half, append junk extra ':' segments + con.execute(f"""COPY (SELECT pid, source, material_mask, context_mask, object_type_mask, + split_part(build_id, ':', 1) || ':bogus:extra' AS build_id, schema_version + FROM read_parquet('{ix}')) TO '{tmp_i}' (FORMAT PARQUET)"""); os.replace(tmp_i, ix) + v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], + capture_output=True, text=True) + assert v.returncode != 0 and "well-formed" in v.stdout, \ + f"build_id format gate failed to catch a malformed id:\n{v.stdout}" + + +def test_sample_facet_index_stale_coverage_id_caught(tmp_path): + """Codex #1: a build_id whose coverage half doesn't match the actual (pid,source) + universe (e.g. membership-only id, or hand-edited coverage) must fail — even + though node_bits is unchanged.""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + assert _build_index(tmp_path, wide, vocab).returncode == 0 + ix = str(tmp_path / "t_sample_facet_index.parquet") + con = duckdb.connect(); tmp_i = ix + ".tmp" + # replace the coverage half with a structurally-valid but WRONG fingerprint + con.execute(f"""COPY (SELECT pid, source, material_mask, context_mask, object_type_mask, + split_part(build_id, ':', 1) || ':1_2_3' AS build_id, schema_version + FROM read_parquet('{ix}')) TO '{tmp_i}' (FORMAT PARQUET)"""); os.replace(tmp_i, ix) + v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], + capture_output=True, text=True) + assert v.returncode != 0 and "coverage-half" in v.stdout, \ + f"coverage staleness gate failed to catch a wrong coverage id:\n{v.stdout}" def test_sample_facet_index_source_drift_caught(tmp_path): @@ -681,8 +770,10 @@ def test_sample_facet_index_source_drift_caught(tmp_path): f"source-equality gate failed to catch drift:\n{v.stdout}" -def test_sample_facet_index_only_builds(tmp_path): - """--only sample_facet_index must produce the file (and its prereqs build).""" +def test_sample_facet_index_only_auto_pairs_node_bits(tmp_path): + """Codex #4: --only sample_facet_index must NOT ship an orphan — facet_node_bits + (the bit-interpretation file) is force-emitted alongside it. The resulting pair + must pass the validator (which needs node_bits to re-derive the masks).""" wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") build_index_fixture(wide, vocab) r = subprocess.run([sys.executable, BUILD, "--wide", wide, "--outdir", str(tmp_path), "--tag", "t", @@ -690,7 +781,9 @@ def test_sample_facet_index_only_builds(tmp_path): "--only", "sample_facet_index"], capture_output=True, text=True) assert r.returncode == 0, f"{r.stdout}\n{r.stderr}" assert (tmp_path / "t_sample_facet_index.parquet").exists(), \ - f"--only sample_facet_index produced no file:\n{r.stdout}" + f"--only sample_facet_index produced no index file:\n{r.stdout}" + assert (tmp_path / "t_facet_node_bits.parquet").exists(), \ + f"--only sample_facet_index did not auto-pair facet_node_bits (orphan):\n{r.stdout}" def test_scheme_corruption_caught(tmp_path): From a308fe2288d678f50aac2012227e3c49a651e26f Mon Sep 17 00:00:00 2001 From: Raymond Yee Date: Fri, 19 Jun 2026 22:24:32 -0700 Subject: [PATCH 04/14] #306/#305: close index validator fail-open gaps (Codex round 2) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - r2 #1: index validation now FAILS CLOSED when sample_facet_membership or facet_node_bits is absent — those siblings are what the build_id recompute and mask re-derivation depend on, so a missing sibling must gate, not run a silent partial pass (Codex reproduced a corrupt index + restamped node_bits passing once the siblings were removed). They're always co-produced, so --dir/--tag satisfies it. - r2 #2: assert node_bits has EXACTLY ONE non-NULL build_id before the membership-half match — 0 or >1 distinct ids previously skipped the match entirely (fail-open). Codex r2 confirmed: builder _fingerprint() == validator _fp() exactly; coverage-from-facets_v2 == samp_geo; the three round-1 bypasses stay rejected; node_bits force-emit has no double-build and masks/index can't ship without node_bits (even with --skip). Tests: +2 (missing-siblings fail-closed, multi-id node_bits). 36/36 pass. Refs #305 #306 --- scripts/validate_frontend_derived.py | 19 +++++++++++++-- tests/test_frontend_derived.py | 36 ++++++++++++++++++++++++++++ 2 files changed, 53 insertions(+), 2 deletions(-) diff --git a/scripts/validate_frontend_derived.py b/scripts/validate_frontend_derived.py index 84a4294..d4aac72 100755 --- a/scripts/validate_frontend_derived.py +++ b/scripts/validate_frontend_derived.py @@ -502,6 +502,16 @@ def _fp(relation, token_expr): # mirror build_frontend_derived._fingerprint return f"{x}_{s}_{n}" IX = f"read_parquet('{index}')" + # GATING PRECONDITION (Codex r2 #1): the index cannot be fully validated + # without BOTH sample_facet_membership and facet_node_bits — they are what + # the independent build_id recompute and mask re-derivation depend on. If + # either is absent we must FAIL rather than silently run a partial (fail-open) + # validation that would pass a corrupt index/node_bits pair. They are always + # co-produced by a build, so --dir/--tag (or passing both) satisfies this. + check("index validation has membership + node_bits (required to gate)", + bool(mem and nodebits), + "pass --membership and --node-bits (or --dir/--tag with both present); " + "the index can't be validated in isolation") # schema/types contract (the explorer reads these columns by name) ix_sch = [(r[0], r[1]) for r in con.sql(f"DESCRIBE SELECT * FROM {IX}").fetchall()] EXP_IX = [("pid", "VARCHAR"), ("source", "VARCHAR"), @@ -568,8 +578,13 @@ def _fp(relation, token_expr): # mirror build_frontend_derived._fingerprint # the membership half of the index build_id MUST equal node_bits.build_id # so the mask bits are interpreted under the SAME assignment (else counts # are computed against a stale/foreign bit layout — silently wrong). - nb_bid = scalar(f"SELECT COUNT(DISTINCT build_id) FROM read_parquet('{nodebits}')") - if nb_bid == 1: + # (Codex r2 #2): assert node_bits has EXACTLY ONE non-NULL build_id FIRST — + # 0 or >1 distinct ids previously skipped the match entirely (fail-open). + nb_bid = scalar(f"SELECT COUNT(DISTINCT build_id) FROM read_parquet('{nodebits}') WHERE build_id IS NOT NULL") + nb_null = scalar(f"SELECT COUNT(*) FROM read_parquet('{nodebits}') WHERE build_id IS NULL") + check("node_bits: exactly one non-NULL build_id", nb_bid == 1 and nb_null == 0, + f"{nb_bid} distinct non-NULL build_ids, {nb_null} NULLs (want 1 / 0)") + if nb_bid == 1 and nb_null == 0: membership_half_matches = scalar( f"SELECT (SELECT split_part(MIN(build_id), ':', 1) FROM {IX}) " f"= (SELECT MIN(build_id) FROM read_parquet('{nodebits}'))") diff --git a/tests/test_frontend_derived.py b/tests/test_frontend_derived.py index a6140aa..bcf0152 100644 --- a/tests/test_frontend_derived.py +++ b/tests/test_frontend_derived.py @@ -770,6 +770,42 @@ def test_sample_facet_index_source_drift_caught(tmp_path): f"source-equality gate failed to catch drift:\n{v.stdout}" +def test_sample_facet_index_validation_requires_siblings(tmp_path): + """Codex r2 #1: validating the index WITHOUT membership+node_bits must FAIL + (fail-closed), not silently run a partial pass. Reproduces the bypass where a + corrupt index + restamped node_bits passed once the siblings were removed.""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + assert _build_index(tmp_path, wide, vocab).returncode == 0 + # delete the siblings the index validation depends on + os.remove(tmp_path / "t_sample_facet_membership.parquet") + os.remove(tmp_path / "t_facet_node_bits.parquet") + os.remove(tmp_path / "t_sample_facet_masks.parquet") + v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], + capture_output=True, text=True) + assert v.returncode != 0 and "required to gate" in v.stdout, \ + f"index validation should fail-closed without membership+node_bits:\n{v.stdout}" + + +def test_sample_facet_index_bad_node_bits_build_id_caught(tmp_path): + """Codex r2 #2: a node_bits file with !=1 non-NULL build_id previously skipped + the membership-half match entirely. It must now FAIL.""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + assert _build_index(tmp_path, wide, vocab).returncode == 0 + nb = str(tmp_path / "t_facet_node_bits.parquet") + con = duckdb.connect(); tmp_nb = nb + ".tmp" + # give node_bits TWO distinct build_ids (one row restamped) + con.execute(f"""COPY (SELECT facet_type, concept_uri, bit_index, + CASE WHEN bit_index = (SELECT MIN(bit_index) FROM read_parquet('{nb}')) + THEN 'OTHER_GEN' ELSE build_id END AS build_id + FROM read_parquet('{nb}')) TO '{tmp_nb}' (FORMAT PARQUET)"""); os.replace(tmp_nb, nb) + v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], + capture_output=True, text=True) + assert v.returncode != 0 and "exactly one non-NULL build_id" in v.stdout, \ + f"node_bits build_id gate failed to catch a multi-id file:\n{v.stdout}" + + def test_sample_facet_index_only_auto_pairs_node_bits(tmp_path): """Codex #4: --only sample_facet_index must NOT ship an orphan — facet_node_bits (the bit-interpretation file) is force-emitted alongside it. The resulting pair From 23cefacb44916470204b258e07e974e19f2bbbb2 Mon Sep 17 00:00:00 2001 From: Raymond Yee Date: Fri, 19 Jun 2026 22:29:47 -0700 Subject: [PATCH 05/14] #306/#305: force-emit full fast-path bundle with index (Codex round 3) Codex r3 caught a regression my own fixes introduced: --only sample_facet_index emitted index + node_bits but NOT sample_facet_membership, so the new validator gate (which requires membership as its independent oracle) would reject that output. Index validation legitimately also needs sample_facets_v2 (the #306 completeness oracle), so --only sample_facet_index is a PARTIAL rebuild against an existing full build, by design. - force_dep(): whenever masks or index is built, force-emit BOTH sample_facet_membership and facet_node_bits (the bundle the index validator depends on), exactly once, recorded for the manifest. - test now proves the real workflow: full build -> --only sample_facet_index rebuild into the same dir -> re-validate PASSES (not just file-existence). Codex r3 confirmed both round-2 fail-open gaps stay closed (rc=1 on every bypass) with no new skip path. 36/36 pass. Refs #305 #306 --- scripts/build_frontend_derived.py | 47 ++++++++++++++++++------------- tests/test_frontend_derived.py | 26 +++++++++++------ 2 files changed, 46 insertions(+), 27 deletions(-) diff --git a/scripts/build_frontend_derived.py b/scripts/build_frontend_derived.py index 13a1d5f..bb1ac3c 100755 --- a/scripts/build_frontend_derived.py +++ b/scripts/build_frontend_derived.py @@ -741,29 +741,38 @@ def emit(name, fn): log("SKIP hierarchy artifacts: pass --vocab-labels ", t0) else: build_concept_membership(con, args.wide, args.vocab_labels, t0) - emit("sample_facet_membership", lambda o: build_sample_facet_membership(con, o)) + # The mask fast-path bundle = {membership, node_bits, masks?, index}. + # sample_facet_masks AND sample_facet_index store raw mask bits that are + # uninterpretable without facet_node_bits, and that the validator can only + # gate by re-deriving from sample_facet_membership. So whenever masks or + # index is requested, FORCE-emit membership + node_bits too (even under + # `--only sample_facet_index`) — otherwise the build ships an artifact its + # own validator must reject (Codex #4 / r3). force_dep() builds a not-wanted + # artifact exactly once and records it for the manifest. + need_fastpath = want("facet_node_bits") or want("sample_facet_masks") or want("sample_facet_index") + force_deps = want("sample_facet_masks") or want("sample_facet_index") + def force_dep(name, fn): + if want(name): + emit(name, fn) + elif p(name) not in produced: + fn(p(name)); produced.append(p(name)); log(f"{name} ✓ (auto-paired with masks/index)", t0) + + if force_deps: + force_dep("sample_facet_membership", lambda o: build_sample_facet_membership(con, o)) + else: + emit("sample_facet_membership", lambda o: build_sample_facet_membership(con, o)) emit("facet_tree_summaries", lambda o: build_facet_tree_summaries(con, o)) # #290/#293 cross-filter cube — needs membership (above) + samp_geo (source). emit("facet_tree_cross_filter", lambda o: build_facet_tree_cross_filter(con, o)) - # #293 bitmask filter artifacts — needs membership (above). node_bits - # and masks share a node-set build_id so the explorer only uses the mask - # path when the two are from the same generation (Codex P1). - if want("facet_node_bits") or want("sample_facet_masks") or want("sample_facet_index"): + # #293 bitmask filter artifacts — needs membership (above). node_bits and + # masks share a node-set build_id so the explorer only uses the mask path + # when the two are from the same generation (Codex P1). + if need_fastpath: _bid = membership_build_id(con) - # facet_node_bits is the bit-INTERPRETATION file: sample_facet_masks - # AND sample_facet_index store raw mask bits that are meaningless - # without it. So node_bits is FORCE-emitted whenever either consumer - # is built — even under `--only sample_facet_masks/_index` — so a - # build never ships an orphan mask file (Codex #4). The shared _bid - # lets the explorer gate the two on the same generation (Codex #293). - force_node_bits = want("sample_facet_masks") or want("sample_facet_index") - if want("facet_node_bits") or force_node_bits: - build_node = lambda o: build_facet_node_bits(con, o, _bid) - if want("facet_node_bits"): - emit("facet_node_bits", build_node) - else: - build_node(p("facet_node_bits")); produced.append(p("facet_node_bits")) - log("facet_node_bits ✓ (auto-paired with masks/index)", t0) + if force_deps: + force_dep("facet_node_bits", lambda o: build_facet_node_bits(con, o, _bid)) + else: + emit("facet_node_bits", lambda o: build_facet_node_bits(con, o, _bid)) emit("sample_facet_masks", lambda o: build_sample_facet_masks(con, o, _bid)) # #305/#306: complete per-pid index (every located pid + source, # zero-mask for no-membership pids). Its build_id embeds the SAME diff --git a/tests/test_frontend_derived.py b/tests/test_frontend_derived.py index bcf0152..42ac846 100644 --- a/tests/test_frontend_derived.py +++ b/tests/test_frontend_derived.py @@ -806,20 +806,30 @@ def test_sample_facet_index_bad_node_bits_build_id_caught(tmp_path): f"node_bits build_id gate failed to catch a multi-id file:\n{v.stdout}" -def test_sample_facet_index_only_auto_pairs_node_bits(tmp_path): - """Codex #4: --only sample_facet_index must NOT ship an orphan — facet_node_bits - (the bit-interpretation file) is force-emitted alongside it. The resulting pair - must pass the validator (which needs node_bits to re-derive the masks).""" +def test_sample_facet_index_only_auto_pairs_bundle(tmp_path): + """Codex #4/r3: --only sample_facet_index must NOT ship an orphan. It + force-emits its whole fast-path bundle — facet_node_bits (bit interpretation) + AND sample_facet_membership (the validator's independent oracle). A partial + rebuild into a directory holding a prior FULL build must then re-validate + cleanly (the real iterate-on-one-artifact workflow).""" wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") build_index_fixture(wide, vocab) + # 1) full build so the validator's non-index siblings (facets_v2, h3, …) exist + assert _build_index(tmp_path, wide, vocab).returncode == 0 + # 2) rebuild ONLY the index into the same dir r = subprocess.run([sys.executable, BUILD, "--wide", wide, "--outdir", str(tmp_path), "--tag", "t", "--no-manifest", "--vocab-labels", vocab, "--only", "sample_facet_index"], capture_output=True, text=True) assert r.returncode == 0, f"{r.stdout}\n{r.stderr}" - assert (tmp_path / "t_sample_facet_index.parquet").exists(), \ - f"--only sample_facet_index produced no index file:\n{r.stdout}" - assert (tmp_path / "t_facet_node_bits.parquet").exists(), \ - f"--only sample_facet_index did not auto-pair facet_node_bits (orphan):\n{r.stdout}" + # the bundle the index needs to be validatable was (re)emitted, not orphaned + for sib in ("t_sample_facet_index.parquet", "t_facet_node_bits.parquet", + "t_sample_facet_membership.parquet"): + assert (tmp_path / sib).exists(), f"--only sample_facet_index did not emit {sib}:\n{r.stdout}" + # 3) the rebuilt artifacts validate against the existing full-build siblings + v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], + capture_output=True, text=True) + assert v.returncode == 0, f"--only sample_facet_index rebuild failed validation:\n{v.stdout}\n{v.stderr}" + assert "index.pid == facets_v2.pid" in v.stdout def test_scheme_corruption_caught(tmp_path): From a20a94f0f691e221811ee141179b081745635951 Mon Sep 17 00:00:00 2001 From: Raymond Yee Date: Fri, 19 Jun 2026 22:32:57 -0700 Subject: [PATCH 06/14] #306/#305: prove index-only rebuild RE-EMITS its bundle (Codex round 4) Delete index+node_bits+membership after the full build and before the --only sample_facet_index rebuild, so a stale full-build file can't satisfy the existence checks. Retains the re-emission + validator-pass assertions. Codex r4 confirmed the implementation correct: index-only and masks-only emit membership+node_bits exactly once, no duplicate emissions in a full build, partial-rebuild validation passes, manifest records each forced dep once, and no invocation produces masks/index without node_bits+membership. 36/36 pass. Refs #305 #306 --- tests/test_frontend_derived.py | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/tests/test_frontend_derived.py b/tests/test_frontend_derived.py index 42ac846..8ef109c 100644 --- a/tests/test_frontend_derived.py +++ b/tests/test_frontend_derived.py @@ -816,15 +816,21 @@ def test_sample_facet_index_only_auto_pairs_bundle(tmp_path): build_index_fixture(wide, vocab) # 1) full build so the validator's non-index siblings (facets_v2, h3, …) exist assert _build_index(tmp_path, wide, vocab).returncode == 0 + # delete the bundle the index-only rebuild must RE-EMIT, so a stale full-build + # file can't satisfy the existence checks (Codex r4: prove re-emission, not + # leftover presence). + bundle = ("t_sample_facet_index.parquet", "t_facet_node_bits.parquet", + "t_sample_facet_membership.parquet") + for sib in bundle: + os.remove(tmp_path / sib) # 2) rebuild ONLY the index into the same dir r = subprocess.run([sys.executable, BUILD, "--wide", wide, "--outdir", str(tmp_path), "--tag", "t", "--no-manifest", "--vocab-labels", vocab, "--only", "sample_facet_index"], capture_output=True, text=True) assert r.returncode == 0, f"{r.stdout}\n{r.stderr}" # the bundle the index needs to be validatable was (re)emitted, not orphaned - for sib in ("t_sample_facet_index.parquet", "t_facet_node_bits.parquet", - "t_sample_facet_membership.parquet"): - assert (tmp_path / sib).exists(), f"--only sample_facet_index did not emit {sib}:\n{r.stdout}" + for sib in bundle: + assert (tmp_path / sib).exists(), f"--only sample_facet_index did not re-emit {sib}:\n{r.stdout}" # 3) the rebuilt artifacts validate against the existing full-build siblings v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], capture_output=True, text=True) From 39dc101f20c8312d80d17cae656ba9cf0e0634d0 Mon Sep 17 00:00:00 2001 From: Raymond Yee Date: Sat, 20 Jun 2026 06:10:17 -0700 Subject: [PATCH 07/14] #305/#306: keep membership_build_id format-compatible with deployed node_bits MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The Codex-r1 hardening switched membership_build_id to the XOR+SUM+COUNT trio — but that id is a DEPLOYED CONTRACT: the live 202608 facet_node_bits/sample_facet_masks carry it as a bare bit_xor decimal (e.g. 11317573279759780618), and the explorer's facetIndexReady preflight matches the index's membership-half against the deployed node_bits.build_id. The trio format would never match, leaving multi-filter counts permanently "unavailable" against already-published artifacts. Revert membership_build_id to the bare order-independent XOR (grain is unique → no cancellation, as the original argued). Keep the richer trio ONLY for coverage_build_id (a NEW id, no compat constraint). Index build_id is therefore ":__". Validator recomputes the membership half with bare XOR (_xor_fp) and the well-formedness regex matches the asymmetric shape. Result: a freshly-built index's membership-half == the DEPLOYED node_bits.build_id, so the index can be published standalone (new filename, non-destructive) without republishing node_bits/masks/membership. 36/36 pass. Refs #305 #306 --- scripts/build_frontend_derived.py | 22 +++++++++++++++++----- scripts/validate_frontend_derived.py | 19 ++++++++++++------- 2 files changed, 29 insertions(+), 12 deletions(-) diff --git a/scripts/build_frontend_derived.py b/scripts/build_frontend_derived.py index bb1ac3c..dd96baf 100755 --- a/scripts/build_frontend_derived.py +++ b/scripts/build_frontend_derived.py @@ -539,8 +539,19 @@ def membership_build_id(con): # change that would alter either artifact (new/dropped pids, re-mapped concepts, # AND node-set changes). Embedding this id in both lets the explorer refuse the # mask path unless the two are from the SAME generation (guards a stale-cached - # masks file). membership grain is unique per (pid,facet_type,concept_uri). - return _fingerprint(con, "membership", MEMBERSHIP_TOKEN_EXPR) + # masks file). + # + # FORMAT IS A DEPLOYED CONTRACT (#305): the live facet_node_bits / sample_facet_masks + # (202608) carry this id as a BARE bit_xor decimal, and the explorer's facetIndexReady + # preflight matches the index's membership-half against the DEPLOYED node_bits.build_id. + # So this MUST stay the plain order-independent XOR — do NOT switch it to the + # _fingerprint trio (that would change the string and break generation-matching + # against every already-published artifact). membership grain is unique per + # (pid,facet_type,concept_uri) — validated — so no XOR cancellation. The richer + # trio is reserved for coverage_build_id, which is a NEW id with no compat constraint. + return con.sql( + f"SELECT CAST(COALESCE(bit_xor(hash({MEMBERSHIP_TOKEN_EXPR})), 0) AS VARCHAR) FROM membership" + ).fetchone()[0] def coverage_build_id(con): @@ -563,9 +574,10 @@ def index_build_id(con): # generation), and # 2. detect coverage/source drift via the coverage half (would otherwise be # invisible to a membership-only id). - # Each half is "__" (decimal + underscore only); ':' never - # appears in either half, so split(':', 1) is unambiguous. node_bits/masks keep - # their plain membership id (== the membership half here, for the match gate). + # FORMATS DIFFER BY DESIGN: membership half is the BARE bit_xor decimal (the + # deployed node_bits/masks contract — see membership_build_id), coverage half is + # the richer "__" trio (a new id). ':' appears in neither half + # (decimals + underscores only), so split(':', 1) is unambiguous. return f"{membership_build_id(con)}:{coverage_build_id(con)}" diff --git a/scripts/validate_frontend_derived.py b/scripts/validate_frontend_derived.py index d4aac72..754c9d0 100755 --- a/scripts/validate_frontend_derived.py +++ b/scripts/validate_frontend_derived.py @@ -495,12 +495,16 @@ def _opt(name, attr): from build_frontend_derived import (INDEX_SCHEMA_VERSION, MEMBERSHIP_TOKEN_EXPR, COVERAGE_TOKEN_EXPR) - def _fp(relation, token_expr): # mirror build_frontend_derived._fingerprint + def _fp(relation, token_expr): # mirror build_frontend_derived._fingerprint (coverage trio) x, s, n = con.sql( f"SELECT COALESCE(bit_xor(hash({token_expr})), 0), " f"COALESCE(SUM(hash({token_expr})::HUGEINT), 0), COUNT(*) FROM {relation}").fetchone() return f"{x}_{s}_{n}" + def _xor_fp(relation, token_expr): # mirror membership_build_id (bare XOR — deployed contract) + return str(con.sql( + f"SELECT CAST(COALESCE(bit_xor(hash({token_expr})), 0) AS VARCHAR) FROM {relation}").fetchone()[0]) + IX = f"read_parquet('{index}')" # GATING PRECONDITION (Codex r2 #1): the index cannot be fully validated # without BOTH sample_facet_membership and facet_node_bits — they are what @@ -548,15 +552,16 @@ def _fp(relation, token_expr): # mirror build_frontend_derived._fingerprint ix_sv_bad = scalar(f"SELECT COUNT(*) FROM {IX} WHERE schema_version IS DISTINCT FROM {INDEX_SCHEMA_VERSION}") check(f"index: schema_version == {INDEX_SCHEMA_VERSION}", ix_sv_bad == 0, f"{ix_sv_bad} rows with schema_version != {INDEX_SCHEMA_VERSION}") - # build_id must be a SINGLE value of the EXACT shape "__:__" - # — exactly two ':'-separated halves, each exactly three '_'-separated decimals. - # (Codex #2: "a colon somewhere" accepted ":bogus:extra".) + # build_id must be a SINGLE value of the EXACT shape ":__" + # — the membership half is the BARE bit_xor decimal (deployed node_bits contract), + # the coverage half is the three '_'-separated decimal trio. (Codex #2: "a colon + # somewhere" accepted ":bogus:extra".) ix_bids = scalar(f"SELECT COUNT(DISTINCT build_id) FROM {IX}") check("index: single build_id", ix_bids == 1, f"{ix_bids} distinct build_ids (want 1)") - bid_re = r'^[0-9]+_[0-9]+_[0-9]+:[0-9]+_[0-9]+_[0-9]+$' + bid_re = r'^[0-9]+:[0-9]+_[0-9]+_[0-9]+$' ix_bid_fmt = scalar(f"SELECT COUNT(*) FROM {IX} WHERE regexp_matches(build_id, '{bid_re}') = FALSE") check("index: build_id is well-formed ':'", ix_bid_fmt == 0, - "build_id(s) not exactly __:__") + "build_id(s) not exactly :__") # INDEPENDENT build_id recomputation (Codex #1/#5): the coverage half is # recomputed from facets_v2 (the located (pid, source) universe — equivalent # to samp_geo) and, when membership is present, the membership half from it. @@ -570,7 +575,7 @@ def _fp(relation, token_expr): # mirror build_frontend_derived._fingerprint check("index build_id coverage-half == recomputed from facets_v2", got_cov == exp_cov, f"coverage half {got_cov!r} != expected {exp_cov!r} (stale/edited coverage)") if mem: - exp_mem = _fp(f"read_parquet('{mem}')", MEMBERSHIP_TOKEN_EXPR) + exp_mem = _xor_fp(f"read_parquet('{mem}')", MEMBERSHIP_TOKEN_EXPR) # bare XOR (deployed contract) got_mem = ix_bid.split(":", 1)[0] check("index build_id membership-half == recomputed from membership", got_mem == exp_mem, f"membership half {got_mem!r} != expected {exp_mem!r} (stale/foreign generation)") From 7b032f738f1667465f1b4c5bf0a3b3625ac1ab95 Mon Sep 17 00:00:00 2001 From: Raymond Yee Date: Sat, 20 Jun 2026 05:23:39 -0700 Subject: [PATCH 08/14] #304/#305 Phase 2: multi-filter facet counts at global view via mask index MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The single-filter cube (#298) serves correct cross-filtered counts for exactly ONE active filter at global view. With 2+ filters, effectiveSingleFilter() returns null, the cube is skipped, and updateCrossFilteredCounts falls to the legacy early-return that reverts every tree dim to the UNFILTERED baseline (explorer.qmd) — the #304 bug (e.g. Material=anthropogenic + Object=artifact showed 261,086 artifacts instead of the cross-filtered 145,770). Fix: route the global / full-tree-mode / no-search / multi-filter case through the complete per-pid index (sample_facet_index, #305/#306) with the node_bits bitmask predicate: - directFilterSnapshot(): reads ALL selections from the controls WITHOUT zeroing tree dims at global view (that zeroing in describeCrossFilters is the root cause). - maskIndexWhere(): per-dim cross-filter predicate (OR within a dim via the bit mask, AND across dims, exclude-self); source-impossible zeroes tree targets but not source's own histogram (the plan's special case). - applyMaskIndexCounts(): one columnar bitmask query per dim, applied ATOMICALLY after a single facetCountsReqId stale check; COUNT(*) over the one-row-per-pid index == distinct pids. - HONESTY RULE: on query failure → 'unavailable' (a "(—)" dash, .count-unavailable), NEVER the baseline. If the index/bitmap isn't usable yet (unpublished) → 'fallthrough' to the legacy paths unchanged, so this is safe to ship first. - facetIndexReady preflight: index present + schema_version==1 + build_id membership-half == node_bits generation; else stays unavailable, not wrong. Single-filter cube kept as a pure optimization. Viewport + search still take the live membership path (Phase 3). Verified on real 202608 data: index covers 6,026,242 located pids vs masks' 5,996,325 — exactly the 29,917 #306 samples recovered; full validator passes; Eric's case returns 145,770 (cross-filtered) not 261,086 (baseline). Refs #305 #304 #306 --- explorer.qmd | 205 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 205 insertions(+) diff --git a/explorer.qmd b/explorer.qmd index d52666d..122b13d 100644 --- a/explorer.qmd +++ b/explorer.qmd @@ -578,6 +578,9 @@ format: .facet-row.zero { opacity: 0.4; } .facet-row.zero:hover { opacity: 0.65; } .facet-count.recomputing { opacity: 0.55; font-style: italic; } + /* #304/#305: counts can't be computed for this filter combo (index + unpublished/inconsistent or query failed) — honest dash, NOT a baseline. */ + .facet-count.count-unavailable { opacity: 0.5; font-style: italic; cursor: help; } /* #281/#282 Material tree (preview) — compact rows */ .filter-body .facet-treeroot { font-weight: 600; font-size: 12px; line-height: 1.5; padding: 0 0 2px; color: #444; } .filter-body .facet-treenode { font-size: 12px; line-height: 1.5; } @@ -816,6 +819,15 @@ tree_cross_filter_url = `${R2_BASE}/isamples_202608_facet_tree_cross_filter.parq // facetFilterSQL falls back to the membership scan (no regression). masks_url = `${R2_BASE}/isamples_202608_sample_facet_masks.parquet` node_bits_url = `${R2_BASE}/isamples_202608_facet_node_bits.parquet` +// #304/#305 complete per-pid facet index: one row per LOCATED sample (incl. the +// ~29,917 with no tree membership that sample_facet_masks omits, #306) + source. +// The multi-filter global-view count path scans this with the node_bits bitmask +// predicate so 2+ active filters get correct counts instead of reverting to the +// unfiltered baseline (#304). Best-effort: if it isn't published/consistent, +// facetIndexReady stays false and the multi-filter path shows "unavailable" +// rather than a misleading baseline (the honesty rule — never baseline under +// active filters). +index_url = `${R2_BASE}/isamples_202608_sample_facet_index.parquet` // Canonical palette — see issue #113. Path-relative so this works under // both isamples.org (custom domain at root) and project-pages fork @@ -1390,6 +1402,8 @@ function applyFacetCounts(facetKey, countsMap) { } el.textContent = `(${Number(count).toLocaleString()})`; el.classList.remove('recomputing'); + el.classList.remove('count-unavailable'); // #304/#305: real count clears the dash + el.removeAttribute('title'); const row = document.querySelector(`.facet-row[data-facet="${facetKey}"][data-value="${CSS.escape(value)}"]`); if (row) row.classList.toggle('zero', count === 0); @@ -1400,6 +1414,23 @@ function markFacetCountsRecomputing() { document.querySelectorAll('.facet-count').forEach(el => el.classList.add('recomputing')); } +// #304/#305: the honesty rule — when active filters are in play but we cannot +// compute correct cross-filtered counts (the complete index is unpublished / +// inconsistent, or a count query failed), we must NOT fall back to the unfiltered +// baseline (that misleading baseline IS the #304 bug). Show an explicit +// "unavailable" dash instead. Scoped to the given dim keys (default: all). +function markFacetCountsUnavailable(facetKeys = null) { + const keys = facetKeys || ['source', 'material', 'context', 'object_type']; + for (const key of keys) { + document.querySelectorAll(`.facet-count[data-facet="${key}"]`).forEach(el => { + el.textContent = '(—)'; + el.classList.remove('recomputing'); + el.classList.add('count-unavailable'); + el.title = 'Count unavailable for this filter combination'; + }); + } +} + // === URL State: encode/decode globe state in hash fragment === // (decode side — parseNum + readHash — extracted to assets/js/explorer-utils.js, PR3) function buildHash(v) { @@ -1804,10 +1835,55 @@ nodeBitsReady = { return false; } window.__nodeBits = map; + // #304/#305: stash the node_bits generation id so the facet-index preflight + // can require the index's mask bits be interpreted under THIS assignment. + window.__nodeBitsBuild = nbBuild; return true; } catch (err) { console.warn('node_bits/masks preflight failed; facetFilterSQL will use the membership fallback:', err); window.__nodeBits = null; + window.__nodeBitsBuild = null; + return false; + } +} +``` + +```{ojs} +//| echo: false +//| output: false +// #304/#305: preflight the complete per-pid facet index (sample_facet_index). The +// multi-filter global-view count path scans it; it must be (a) present/readable, +// (b) schema_version we understand, and (c) from the SAME node_bits generation as +// the bit map (its build_id's membership half == window.__nodeBitsBuild) so the +// mask bits mean what we think. Only then advertise readiness; otherwise the +// multi-filter path shows "unavailable" instead of a baseline (honesty rule). +// Depends on nodeBitsReady so window.__nodeBitsBuild is populated first. +facetIndexReady = { + const _ = nodeBitsReady; // sequence after the node_bits preflight + window.__facetIndexReady = false; + const INDEX_SCHEMA_VERSION = 1; // mirrors build_frontend_derived.INDEX_SCHEMA_VERSION + try { + const nbBuild = window.__nodeBitsBuild; + if (!nbBuild) return false; // no usable bit map → index path can't run + const rows = await db.query( + `SELECT DISTINCT build_id, schema_version FROM read_parquet('${index_url}')`); + if (!rows || rows.length !== 1) return false; // missing / mixed generations + const sv = Number(rows[0].schema_version); + if (sv !== INDEX_SCHEMA_VERSION) { + console.warn('sample_facet_index schema_version unsupported; multi-filter counts unavailable', sv); + return false; + } + const membershipHalf = String(rows[0].build_id).split(':', 1)[0]; + if (membershipHalf !== nbBuild) { + console.warn('sample_facet_index/node_bits generation mismatch; multi-filter counts unavailable', + { indexMembershipHalf: membershipHalf, nbBuild }); + return false; + } + window.__facetIndexReady = true; + return true; + } catch (err) { + console.warn('sample_facet_index preflight failed; multi-filter global counts will show unavailable:', err); + window.__facetIndexReady = false; return false; } } @@ -3422,6 +3498,116 @@ zoomWatcher = { } } + // #304/#305: a DIRECT snapshot of every active selection read from the controls — + // unlike describeCrossFilters(), it does NOT zero tree selections at global view + // (that zeroing is the #304 root cause). Returns per-dim selected values (source + // strings; tree concept_uris), an active-dim count, and the source-impossible + // flag. Tree dims contribute only in full tree mode (flat dims keep facets_v3 + // paths). source: [] means "no constraint" (all or none selected — disambiguated + // by sourceImpossible). + function directFilterSnapshot() { + const sourceTotal = document.querySelectorAll('#sourceFilter input[type="checkbox"]').length; + const sources = getActiveSources(); + const sourceImpossible = sourceTotal > 0 && sources.length === 0; + // a partial source selection is a real constraint; all/none → no positive list + const sourceVals = (sources.length > 0 && sources.length < sourceTotal) ? sources : []; + const snap = { source: sourceVals, sourceImpossible, material: [], context: [], object_type: [] }; + for (const key of TREE_DIM_KEYS) { + if (treeActive(key)) snap[key] = treeSelection(key) || []; + } + // dims that impose a constraint (drives "is there anything to cross-filter?") + let active = sourceImpossible || sourceVals.length > 0 ? 1 : 0; + for (const key of TREE_DIM_KEYS) if (snap[key].length > 0) active++; + snap.activeCount = active; + return snap; + } + + // #304/#305: build the cross-filter WHERE conditions over sample_facet_index + // (alias-free; columns resolve to the index) for the OTHER active dims, EXCLUDING + // `excludeKey` (exclude-self). Tree dims → `(_mask & ) <> 0` from the + // node_bits bit map; source → `source IN (...)`. Returns null if a tree selection + // can't be turned into a mask (bit map absent / unknown node) so the caller can + // bail honestly instead of under-constraining. A source-impossible state zeroes + // every NON-source target (the source dim's own histogram still computes). + function maskIndexWhere(snap, excludeKey) { + const bits = (typeof window !== 'undefined') ? window.__nodeBits : null; + const conds = []; + for (const key of TREE_DIM_KEYS) { + if (key === excludeKey) continue; + const sel = snap[key]; + if (!sel || sel.length === 0) continue; + if (!bits || !bits[key]) return null; + let mask = 0n; + for (const uri of sel) { + const idx = bits[key].get(uri); + if (idx === undefined) return null; // unknown node → bail + mask |= (1n << BigInt(idx)); + } + if (mask === 0n) return null; + conds.push(`(${key}_mask & ${mask}) <> 0`); // BigInt → decimal BIGINT literal + } + if (excludeKey !== 'source') { + if (snap.sourceImpossible) return ['1=0']; // no source selectable → 0 for tree targets + if (snap.source.length > 0) { + conds.push(`source IN (${snap.source.map(v => `'${escSql(v)}'`).join(',')})`); + } + } + return conds; + } + + // #304/#305: compute multi-filter cross-filtered counts for ALL dims off the + // complete per-pid index and apply them ATOMICALLY after a single stale check. + // Returns: 'ok' (applied), 'superseded' (a newer request won), 'unavailable' + // (index/bitmap usable but a query failed → honesty: caller shows the dash), or + // 'fallthrough' (index/bitmap not usable → caller uses legacy paths). + async function applyMaskIndexCounts(snap, dims, myReq) { + if (!window.__facetIndexReady || !window.__nodeBits) return 'fallthrough'; + try { + const results = await Promise.all(dims.map(async (d) => { + const conds = maskIndexWhere(snap, d.key); + if (conds === null) throw { __fallthrough: true }; // can't build masks → legacy + const whereSQL = conds.length ? `WHERE ${conds.join(' AND ')}` : ''; + let rows; + if (d.key === 'source') { + // source target: COUNT(*) per source over the index (one row/pid), + // constrained by the tree dims only (exclude-self). + const w = conds.length ? `${whereSQL} AND source IS NOT NULL` : 'WHERE source IS NOT NULL'; + rows = await db.query(` + SELECT source AS value, COUNT(*) AS count + FROM read_parquet('${index_url}') s + ${w} + GROUP BY source + `); + } else { + // tree target: JOIN node_bits; each (pid,node) the pid is a member + // of yields one row, so COUNT(*) per concept_uri == distinct pids + // with that bit. Membership encodes ancestor closure → parent ≥ child. + rows = await db.query(` + SELECT nb.concept_uri AS value, COUNT(*) AS count + FROM read_parquet('${index_url}') s + JOIN read_parquet('${node_bits_url}') nb + ON nb.facet_type = '${d.key}' + AND (s.${d.key}_mask & (1::BIGINT << nb.bit_index)) <> 0 + ${whereSQL} + GROUP BY nb.concept_uri + `); + } + const map = new Map(); + for (const r of rows) map.set(r.value, Number(r.count)); + return { key: d.key, map }; + })); + if (myReq !== facetCountsReqId) return 'superseded'; + // atomic repaint: all dims at once, only after the single stale check + for (const { key, map } of results) applyFacetCounts(key, map); + return 'ok'; + } catch (err) { + if (myReq !== facetCountsReqId) return 'superseded'; + if (err && err.__fallthrough) return 'fallthrough'; + console.warn('mask-index cross-filter counts failed; counts unavailable (not baseline):', err); + return 'unavailable'; + } + } + async function updateCrossFilteredCounts(myReq) { if (myReq !== facetCountsReqId) return; const { dims, activeDims, totalActiveValues, sourceImpossible } = describeCrossFilters(); @@ -3490,6 +3676,25 @@ zoomWatcher = { if (myReq !== facetCountsReqId) return; } + // #304/#305: MULTI-filter (or multi-value single-dim, or source-impossible, + // or a single-filter cube MISS) at global view + full tree mode + no search. + // The single-filter cube above can't serve these, and the legacy slow path + // below reverts every tree dim to the unfiltered BASELINE at global view + // (the #304 bug). Route through the complete per-pid index instead — correct + // multi-filter counts in one columnar bitmask scan, applied atomically. + // HONESTY RULE: on any failure show "unavailable", NEVER the baseline. A + // 'fallthrough' (index/bitmap not usable, e.g. not yet published) defers to + // the legacy paths unchanged so this is safe to ship before the index lands. + if (bboxSQL === null && !searchIsActive() && TREE_DIM_KEYS.every(treeActive)) { + const snap = directFilterSnapshot(); + if (snap.activeCount >= 1) { + const res = await applyMaskIndexCounts(snap, dims, myReq); + if (res === 'ok' || res === 'superseded') return; + if (res === 'unavailable') { markFacetCountsUnavailable(); return; } + // res === 'fallthrough' → continue to the legacy paths below + } + } + // Baseline early-return only applies when there is no filter AND no // spatial constraint. In a non-global view with no facet filter, B1 // still wants per-value counts scoped to what's visible — fall From 6401ea0046d6669f7cfb1f1330bf62d87c3fd1a0 Mon Sep 17 00:00:00 2001 From: Raymond Yee Date: Sat, 20 Jun 2026 05:32:50 -0700 Subject: [PATCH 09/14] #304/#305 Phase 2: address Codex review (honesty, boot race, coverage, decoupling) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - P1 fallthrough reintroduced #304: at global + full tree + active filters there is NO correct legacy path (it reverts to baseline), so 'unavailable' AND 'fallthrough' now both show the dash, never legacy baseline. + defensive stale check before markFacetCountsUnavailable. - P1 boot race: facetIndexReady now calls window.__onFacetIndexReady() when it settles true; the main script registers it to refreshFacetCounts() if filters are active, so a URL-restored multi-filter that computed before readiness reconciles to correct counts instead of staying on the dash. - P1/P2 runtime coverage handshake: facetIndexReady now also verifies the index row count == facets (located universe) — catches stale/partial/mixed-cache indexes (the #306 drift class a membership-only id is blind to). Full fingerprint stays a build-time gate (SERIALIZATIONS §4.12). - P2 decouple node_bits from masks: nodeBitsReady publishes __nodeBitsMap + __nodeBitsBuild as soon as node_bits is valid (count path needs only these); __nodeBits (masks-gated) stays for facetFilterSQL. A missing masks file no longer disables a valid count bundle. Count path now reads __nodeBitsMap. - P2 WASM concurrency: the 4 per-dim queries are now ONE UNION ALL statement (one connection slot) instead of 4 concurrent full-index scans. - P2 SQL injection: R2_BASE (user-controllable via ?data_base=/localStorage) is rejected if it contains a quote/backslash/control char or isn't http(s)/root- relative, so it can't break out of read_parquet('...') in any query. Re-verified on real 202608 data: single UNION ALL query returns Eric's case as 145,770 cross-filtered across all dims (vs 261,086 baseline). Codex confirmed sound: COUNT(*) per concept == distinct pids; exclude-self / OR-within / AND-across / source-impossible / multi-value-single-dim semantics; single post-Promise stale check. WASM cold/warm browser benchmark remains the Phase 4 release gate. Refs #305 #304 #306 --- explorer.qmd | 164 ++++++++++++++++++++++++++++++++++----------------- 1 file changed, 110 insertions(+), 54 deletions(-) diff --git a/explorer.qmd b/explorer.qmd index 122b13d..0d2f8b3 100644 --- a/explorer.qmd +++ b/explorer.qmd @@ -752,9 +752,20 @@ Cesium.Ion.defaultAccessToken = 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJqdGkiOi // localStorage ISAMPLES_DATA_BASE (sticky per-browser override) // Defaults to the production R2 origin, so shipped builds are unchanged. R2_BASE = (() => { + const DEFAULT = "https://data.isamples.org"; const raw = new URLSearchParams(location.search).get('data_base') || (typeof localStorage !== 'undefined' && localStorage.getItem('ISAMPLES_DATA_BASE')) - || "https://data.isamples.org"; + || DEFAULT; + // The base is interpolated into `read_parquet('${url}/...')` in many queries, so + // a value containing a single quote (or backslash) could break out of the SQL + // string literal (Codex P2 — data_base is user-controllable via ?data_base= / + // localStorage). Reject any such value and fall back to the default. Also require + // an http(s) or root-relative form (httpfs only range-fetches absolute http(s); + // a relative is resolved against origin below). + if (/['"\\\x00-\x1f]/.test(raw) || !(/^https?:\/\//.test(raw) || raw.startsWith('/'))) { + if (raw !== DEFAULT) console.warn('Ignoring unsafe/invalid data_base override; using default origin:', raw); + return DEFAULT; + } // DuckDB-WASM's httpfs only range-fetches ABSOLUTE http(s) URLs. A root- // relative override (?data_base=/data) is read as a virtual-filesystem // glob and fails with "No files found that match the pattern" — silently, @@ -1806,7 +1817,17 @@ db = { // null and facetFilterSQL falls back to the membership scan — so this is safe to // ship before sample_facet_masks / facet_node_bits are published. nodeBitsReady = { + // Two SEPARATE readiness signals (Codex P2 #4 — they were wrongly coupled): + // __nodeBitsMap + __nodeBitsBuild → the concept_uri→bit map + its generation. + // Valid as soon as node_bits itself is. The #304/#305 COUNT path + // (sample_facet_index) needs ONLY this, not masks. + // __nodeBits → the SAME map, but advertised only after the masks file + // preflights & generation-matches. facetFilterSQL's mask FILTER path reads + // masks_url, so it must not run until masks is proven present/consistent. + // A missing/mismatched masks file must NOT disable the (valid) count bundle. window.__nodeBits = null; + window.__nodeBitsMap = null; + window.__nodeBitsBuild = null; try { const rows = await db.query( `SELECT facet_type, concept_uri, bit_index, build_id FROM read_parquet('${node_bits_url}')`); @@ -1821,27 +1842,30 @@ nodeBitsReady = { // let a last-row-wins value coincidentally match masks — Codex r2). if (!haveBits || nbBuilds.size !== 1) return false; const nbBuild = [...nbBuilds][0]; + // node_bits is valid on its own → publish the bit map + generation NOW so the + // count path (which depends only on node_bits + the index) can run even if + // masks is absent. facetIndexReady consumes window.__nodeBitsBuild. + window.__nodeBitsMap = map; + window.__nodeBitsBuild = nbBuild; // Codex P1.1 + P1.2: PREFLIGHT the masks file AND require its build_id to // match node_bits. This proves masks is present/readable (else the mask - // query path would fail with no retry) and that both artifacts are from the - // SAME generation (else a stale-cached masks file would map bits wrong). - // Only then advertise readiness; otherwise facetFilterSQL uses membership. + // FILTER path would fail with no retry) and that both artifacts are from the + // SAME generation. Only then advertise __nodeBits for facetFilterSQL; + // otherwise facetFilterSQL uses the membership fallback (count path unaffected). const mrows = await db.query( `SELECT DISTINCT build_id FROM read_parquet('${masks_url}')`); const maskBuilds = mrows.map(r => r.build_id); if (maskBuilds.length !== 1 || maskBuilds[0] !== nbBuild) { console.warn('masks/node_bits build_id mismatch or missing; using membership fallback', { nbBuild, maskBuilds }); - return false; + return false; // count path stays enabled via __nodeBitsMap } window.__nodeBits = map; - // #304/#305: stash the node_bits generation id so the facet-index preflight - // can require the index's mask bits be interpreted under THIS assignment. - window.__nodeBitsBuild = nbBuild; return true; } catch (err) { - console.warn('node_bits/masks preflight failed; facetFilterSQL will use the membership fallback:', err); + console.warn('node_bits preflight failed; facetFilterSQL + count path use fallbacks:', err); window.__nodeBits = null; + window.__nodeBitsMap = null; window.__nodeBitsBuild = null; return false; } @@ -1853,11 +1877,15 @@ nodeBitsReady = { //| output: false // #304/#305: preflight the complete per-pid facet index (sample_facet_index). The // multi-filter global-view count path scans it; it must be (a) present/readable, -// (b) schema_version we understand, and (c) from the SAME node_bits generation as -// the bit map (its build_id's membership half == window.__nodeBitsBuild) so the -// mask bits mean what we think. Only then advertise readiness; otherwise the -// multi-filter path shows "unavailable" instead of a baseline (honesty rule). -// Depends on nodeBitsReady so window.__nodeBitsBuild is populated first. +// (b) schema_version we understand, (c) from the SAME node_bits generation as the +// bit map (build_id membership half == window.__nodeBitsBuild) so the mask bits +// mean what we think, AND (d) COVERAGE-COMPLETE — its pid count equals the located +// universe (facets) so a stale/incomplete index (the #306 drift class, which a +// membership-only id is blind to) can't silently undercount. The full coverage +// fingerprint is enforced at BUILD time (validate_frontend_derived); this is the +// cheap runtime handshake the plan calls for (SERIALIZATIONS §4.12). Only then +// advertise readiness; otherwise the multi-filter path shows "unavailable" instead +// of a baseline (honesty rule). Depends on nodeBitsReady for __nodeBitsBuild. facetIndexReady = { const _ = nodeBitsReady; // sequence after the node_bits preflight window.__facetIndexReady = false; @@ -1879,7 +1907,23 @@ facetIndexReady = { { indexMembershipHalf: membershipHalf, nbBuild }); return false; } + // (d) runtime coverage handshake: the index must cover the SAME located + // universe the counts are about. Compare its row count to facets (the + // located set). A stale/partial index (fewer pids) or a mixed-cache + // deployment fails here rather than returning convincing-but-low counts. + const cov = await db.query( + `SELECT (SELECT COUNT(*) FROM read_parquet('${index_url}')) AS idx_n, + (SELECT COUNT(*) FROM read_parquet('${facets_url}')) AS facets_n`); + const idxN = Number(cov[0].idx_n), facetsN = Number(cov[0].facets_n); + if (idxN !== facetsN) { + console.warn('sample_facet_index coverage != located universe; multi-filter counts unavailable', + { idxN, facetsN }); + return false; + } window.__facetIndexReady = true; + // readiness can flip true AFTER boot already computed counts once (lost the + // async race) — reconcile any active multi-filter now (Codex P1 boot race). + if (typeof window.__onFacetIndexReady === 'function') window.__onFacetIndexReady(); return true; } catch (err) { console.warn('sample_facet_index preflight failed; multi-filter global counts will show unavailable:', err); @@ -3530,7 +3574,9 @@ zoomWatcher = { // bail honestly instead of under-constraining. A source-impossible state zeroes // every NON-source target (the source dim's own histogram still computes). function maskIndexWhere(snap, excludeKey) { - const bits = (typeof window !== 'undefined') ? window.__nodeBits : null; + // #304/#305: the count path uses __nodeBitsMap (valid as soon as node_bits is) + // — NOT __nodeBits, which is gated on the masks FILTER artifact (Codex #4). + const bits = (typeof window !== 'undefined') ? window.__nodeBitsMap : null; const conds = []; for (const key of TREE_DIM_KEYS) { if (key === excludeKey) continue; @@ -3561,48 +3607,45 @@ zoomWatcher = { // (index/bitmap usable but a query failed → honesty: caller shows the dash), or // 'fallthrough' (index/bitmap not usable → caller uses legacy paths). async function applyMaskIndexCounts(snap, dims, myReq) { - if (!window.__facetIndexReady || !window.__nodeBits) return 'fallthrough'; + if (!window.__facetIndexReady || !window.__nodeBitsMap) return 'fallthrough'; + // Build ONE query (UNION ALL of per-dim arms) rather than 4 concurrent + // full-index scans — the documented DuckDB-WASM single-connection + // starvation risk (Codex P2). One statement = one connection slot. Each arm + // carries its OWN exclude-self cross-filter WHERE. Common output shape + // (dim, value, count) so one pass groups them. + const arms = []; + for (const d of dims) { + const conds = maskIndexWhere(snap, d.key); + if (conds === null) return 'unavailable'; // selected node missing a bit → can't compute honestly + const whereSQL = conds.length ? `WHERE ${conds.join(' AND ')}` : ''; + if (d.key === 'source') { + // source target: COUNT(*) per source over the index (one row/pid), + // constrained by the tree dims only (exclude-self). + const w = conds.length ? `${whereSQL} AND source IS NOT NULL` : 'WHERE source IS NOT NULL'; + arms.push(`SELECT 'source' AS dim, source AS value, COUNT(*) AS count + FROM read_parquet('${index_url}') s ${w} GROUP BY source`); + } else { + // tree target: JOIN node_bits; each (pid,node) the pid is a member of + // yields one row, so COUNT(*) per concept_uri == distinct pids with + // that bit. Membership encodes ancestor closure → parent ≥ child. + arms.push(`SELECT '${d.key}' AS dim, nb.concept_uri AS value, COUNT(*) AS count + FROM read_parquet('${index_url}') s + JOIN read_parquet('${node_bits_url}') nb + ON nb.facet_type = '${d.key}' + AND (s.${d.key}_mask & (1::BIGINT << nb.bit_index)) <> 0 + ${whereSQL} GROUP BY nb.concept_uri`); + } + } try { - const results = await Promise.all(dims.map(async (d) => { - const conds = maskIndexWhere(snap, d.key); - if (conds === null) throw { __fallthrough: true }; // can't build masks → legacy - const whereSQL = conds.length ? `WHERE ${conds.join(' AND ')}` : ''; - let rows; - if (d.key === 'source') { - // source target: COUNT(*) per source over the index (one row/pid), - // constrained by the tree dims only (exclude-self). - const w = conds.length ? `${whereSQL} AND source IS NOT NULL` : 'WHERE source IS NOT NULL'; - rows = await db.query(` - SELECT source AS value, COUNT(*) AS count - FROM read_parquet('${index_url}') s - ${w} - GROUP BY source - `); - } else { - // tree target: JOIN node_bits; each (pid,node) the pid is a member - // of yields one row, so COUNT(*) per concept_uri == distinct pids - // with that bit. Membership encodes ancestor closure → parent ≥ child. - rows = await db.query(` - SELECT nb.concept_uri AS value, COUNT(*) AS count - FROM read_parquet('${index_url}') s - JOIN read_parquet('${node_bits_url}') nb - ON nb.facet_type = '${d.key}' - AND (s.${d.key}_mask & (1::BIGINT << nb.bit_index)) <> 0 - ${whereSQL} - GROUP BY nb.concept_uri - `); - } - const map = new Map(); - for (const r of rows) map.set(r.value, Number(r.count)); - return { key: d.key, map }; - })); + const rows = await db.query(arms.join('\nUNION ALL\n')); if (myReq !== facetCountsReqId) return 'superseded'; + const grouped = { source: new Map(), material: new Map(), context: new Map(), object_type: new Map() }; + for (const r of rows) if (grouped[r.dim]) grouped[r.dim].set(r.value, Number(r.count)); // atomic repaint: all dims at once, only after the single stale check - for (const { key, map } of results) applyFacetCounts(key, map); + for (const d of dims) applyFacetCounts(d.key, grouped[d.key]); return 'ok'; } catch (err) { if (myReq !== facetCountsReqId) return 'superseded'; - if (err && err.__fallthrough) return 'fallthrough'; console.warn('mask-index cross-filter counts failed; counts unavailable (not baseline):', err); return 'unavailable'; } @@ -3689,9 +3732,16 @@ zoomWatcher = { const snap = directFilterSnapshot(); if (snap.activeCount >= 1) { const res = await applyMaskIndexCounts(snap, dims, myReq); - if (res === 'ok' || res === 'superseded') return; - if (res === 'unavailable') { markFacetCountsUnavailable(); return; } - // res === 'fallthrough' → continue to the legacy paths below + if (res === 'superseded' || res === 'ok') return; + // res is 'unavailable' OR 'fallthrough' (index not ready / a node has + // no bit). At global + full tree + active filters there is NO correct + // legacy path — the legacy slow path reverts tree dims to the + // unfiltered BASELINE, which IS the #304 bug. Honesty rule: show the + // dash, never baseline. (When readiness flips true later, + // __onFacetIndexReady re-runs this and the dash becomes real counts.) + if (myReq !== facetCountsReqId) return; // defensive stale check (Codex) + markFacetCountsUnavailable(); + return; } } @@ -4547,6 +4597,12 @@ zoomWatcher = { window.__onFilteredClustersReady = () => { if (hasFacetFilters()) reconcileGlobeForFilters(); }; + // #304/#305: when the facet-index preflight settles true AFTER boot already + // computed counts (async race), recompute so a multi-filter that fell back to + // the "unavailable" dash becomes correct counts without a user action. + window.__onFacetIndexReady = () => { + if (hasFacetFilters()) refreshFacetCounts(); + }; setTimeout(() => { if (window.__filteredClustersReady === true && hasFacetFilters()) { const h = viewer.camera.positionCartographic.height; From e8e18bcbde5143dbd7d2217a97ba9851071d26c0 Mon Sep 17 00:00:00 2001 From: Raymond Yee Date: Sat, 20 Jun 2026 05:38:27 -0700 Subject: [PATCH 10/14] #304/#305 Phase 2: fix masks-catch recoupling + source-only refresh (Codex r2) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - P1: nodeBitsReady split into TWO try blocks — a masks-query failure no longer throws into the shared catch that nulled __nodeBitsMap/__nodeBitsBuild. node_bits publishes the count-path values in step 1; masks preflight is step 2 and only withholds __nodeBits (the FILTER signal) on failure. Count path now truly survives a missing/mismatched masks file. - P1: __onFacetIndexReady refreshes UNCONDITIONALLY — hasFacetFilters() excludes source, so a source-only filter that painted dashes before readiness now also reconciles. Debounce + reqId guard make a redundant no-filter refresh harmless. - P2 staleness: coverage handshake upgraded from bare row-count to per-SOURCE histogram equality (index vs facets_v3) — catches equal-cardinality PID replacement + source drift, not just a smaller index. Verified mismatch=0 on real 202608 data (GEOME/OPENCONTEXT/SESAR/SMITHSONIAN all match). Still a proxy, not the full fingerprint (build-time gate); generation-id match guards mask corruption. Codex r2 confirmed otherwise sound: no global active-filter route reaches baseline; UNION ALL exclude-self + dim regrouping correct; unqualified cols bind to index `s`; all R2_BASE forms pass; 36 py + 13 node tests pass. Refs #305 #304 #306 --- explorer.qmd | 76 +++++++++++++++++++++++++++++++--------------------- 1 file changed, 46 insertions(+), 30 deletions(-) diff --git a/explorer.qmd b/explorer.qmd index 0d2f8b3..307963c 100644 --- a/explorer.qmd +++ b/explorer.qmd @@ -1828,10 +1828,15 @@ nodeBitsReady = { window.__nodeBits = null; window.__nodeBitsMap = null; window.__nodeBitsBuild = null; + let map, nbBuild; + // STEP 1: node_bits — its OWN try, so the count-path values it publishes are + // never cleared by a downstream masks failure (Codex r2 P1: the masks query + // previously threw into a shared catch that nulled __nodeBitsMap/__nodeBitsBuild, + // re-coupling the two and disabling valid counts). try { const rows = await db.query( `SELECT facet_type, concept_uri, bit_index, build_id FROM read_parquet('${node_bits_url}')`); - const map = { material: new Map(), context: new Map(), object_type: new Map() }; + map = { material: new Map(), context: new Map(), object_type: new Map() }; const nbBuilds = new Set(); for (const r of rows) { if (map[r.facet_type]) map[r.facet_type].set(r.concept_uri, Number(r.bit_index)); @@ -1841,32 +1846,37 @@ nodeBitsReady = { // node_bits must carry exactly ONE build_id (a mixed file is corrupt; don't // let a last-row-wins value coincidentally match masks — Codex r2). if (!haveBits || nbBuilds.size !== 1) return false; - const nbBuild = [...nbBuilds][0]; + nbBuild = [...nbBuilds][0]; // node_bits is valid on its own → publish the bit map + generation NOW so the - // count path (which depends only on node_bits + the index) can run even if - // masks is absent. facetIndexReady consumes window.__nodeBitsBuild. + // count path (which depends only on node_bits + the index) runs even if masks + // is absent. facetIndexReady consumes window.__nodeBitsBuild. window.__nodeBitsMap = map; window.__nodeBitsBuild = nbBuild; - // Codex P1.1 + P1.2: PREFLIGHT the masks file AND require its build_id to - // match node_bits. This proves masks is present/readable (else the mask - // FILTER path would fail with no retry) and that both artifacts are from the - // SAME generation. Only then advertise __nodeBits for facetFilterSQL; - // otherwise facetFilterSQL uses the membership fallback (count path unaffected). + } catch (err) { + console.warn('node_bits preflight failed; facetFilterSQL + count path use fallbacks:', err); + window.__nodeBits = null; + window.__nodeBitsMap = null; + window.__nodeBitsBuild = null; + return false; + } + // STEP 2: masks — a SEPARATE try. A masks failure here must leave __nodeBitsMap / + // __nodeBitsBuild intact (count path stays enabled) and only withhold __nodeBits + // (the masks-gated FILTER signal for facetFilterSQL). Codex P1.1 + P1.2: masks + // must be present/readable AND the SAME generation as node_bits, else + // facetFilterSQL uses the membership fallback. + try { const mrows = await db.query( `SELECT DISTINCT build_id FROM read_parquet('${masks_url}')`); const maskBuilds = mrows.map(r => r.build_id); if (maskBuilds.length !== 1 || maskBuilds[0] !== nbBuild) { - console.warn('masks/node_bits build_id mismatch or missing; using membership fallback', + console.warn('masks/node_bits build_id mismatch; using membership fallback (count path unaffected)', { nbBuild, maskBuilds }); - return false; // count path stays enabled via __nodeBitsMap + return false; } window.__nodeBits = map; return true; } catch (err) { - console.warn('node_bits preflight failed; facetFilterSQL + count path use fallbacks:', err); - window.__nodeBits = null; - window.__nodeBitsMap = null; - window.__nodeBitsBuild = null; + console.warn('masks preflight failed; facetFilterSQL uses membership fallback (count path unaffected):', err); return false; } } @@ -1908,16 +1918,21 @@ facetIndexReady = { return false; } // (d) runtime coverage handshake: the index must cover the SAME located - // universe the counts are about. Compare its row count to facets (the - // located set). A stale/partial index (fewer pids) or a mixed-cache - // deployment fails here rather than returning convincing-but-low counts. - const cov = await db.query( - `SELECT (SELECT COUNT(*) FROM read_parquet('${index_url}')) AS idx_n, - (SELECT COUNT(*) FROM read_parquet('${facets_url}')) AS facets_n`); - const idxN = Number(cov[0].idx_n), facetsN = Number(cov[0].facets_n); - if (idxN !== facetsN) { - console.warn('sample_facet_index coverage != located universe; multi-filter counts unavailable', - { idxN, facetsN }); + // universe the counts are about. Compare the per-SOURCE histogram of the + // index to facets (the located set) — a symmetric diff that catches a + // stale/partial index (fewer pids), an equal-cardinality PID replacement, + // AND source drift, all of which a bare row-count check would miss (Codex r2). + // It is still a cardinality/source proxy, NOT the full coverage fingerprint + // (that's the build-time gate, SERIALIZATIONS §4.12) — a same-source-histogram + // mask corruption could pass, which the generation-id match above guards. + const cov = await db.query(` + WITH i AS (SELECT source, COUNT(*) c FROM read_parquet('${index_url}') GROUP BY source), + f AS (SELECT source, COUNT(*) c FROM read_parquet('${facets_url}') GROUP BY source) + SELECT (SELECT COUNT(*) FROM (SELECT * FROM i EXCEPT SELECT * FROM f)) + + (SELECT COUNT(*) FROM (SELECT * FROM f EXCEPT SELECT * FROM i)) AS mismatch`); + if (Number(cov[0].mismatch) !== 0) { + console.warn('sample_facet_index per-source coverage != located universe; multi-filter counts unavailable', + { mismatch: Number(cov[0].mismatch) }); return false; } window.__facetIndexReady = true; @@ -4598,11 +4613,12 @@ zoomWatcher = { if (hasFacetFilters()) reconcileGlobeForFilters(); }; // #304/#305: when the facet-index preflight settles true AFTER boot already - // computed counts (async race), recompute so a multi-filter that fell back to - // the "unavailable" dash becomes correct counts without a user action. - window.__onFacetIndexReady = () => { - if (hasFacetFilters()) refreshFacetCounts(); - }; + // computed counts (async race), recompute so any filter that fell back to the + // "unavailable" dash becomes correct counts without a user action. Refresh + // UNCONDITIONALLY — hasFacetFilters() excludes SOURCE selections (Codex r2 P1), + // and refreshFacetCounts is debounced + facetCountsReqId-guarded, so a redundant + // no-filter refresh just recomputes the baseline harmlessly. + window.__onFacetIndexReady = () => { refreshFacetCounts(); }; setTimeout(() => { if (window.__filteredClustersReady === true && hasFacetFilters()) { const h = viewer.camera.positionCartographic.height; From 0832091239e6c8b15624aec73b193289d14b50ec Mon Sep 17 00:00:00 2001 From: Raymond Yee Date: Sat, 20 Jun 2026 05:41:25 -0700 Subject: [PATCH 11/14] #304/#305 Phase 2: correct coverage-handshake claim (Codex r3) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The per-source histogram handshake does NOT catch a same-source, same-cardinality PID swap (mismatch stays 0) — correct the overclaim. That residual is closed by (i) the generation-id match (the membership half hashes pid, so any swap of a pid WITH membership changes it — leaving only #306 no-membership pids) and (ii) the authoritative BUILD-TIME coverage fingerprint + pid-set equality gate. Runtime stays a cheap source/cardinality proxy by design (boot-query budget); a future co-published expected-id pointer would let it assert the full fingerprint. Codex r3 confirmed all P1s closed: masks failure never clears the count-path values; map/nbBuild correctly scoped; unconditional readiness refresh can't loop; coverage SQL valid + NULL-safe + cheap. Refs #305 #304 #306 --- explorer.qmd | 17 ++++++++++++----- 1 file changed, 12 insertions(+), 5 deletions(-) diff --git a/explorer.qmd b/explorer.qmd index 307963c..f6dc97f 100644 --- a/explorer.qmd +++ b/explorer.qmd @@ -1920,11 +1920,18 @@ facetIndexReady = { // (d) runtime coverage handshake: the index must cover the SAME located // universe the counts are about. Compare the per-SOURCE histogram of the // index to facets (the located set) — a symmetric diff that catches a - // stale/partial index (fewer pids), an equal-cardinality PID replacement, - // AND source drift, all of which a bare row-count check would miss (Codex r2). - // It is still a cardinality/source proxy, NOT the full coverage fingerprint - // (that's the build-time gate, SERIALIZATIONS §4.12) — a same-source-histogram - // mask corruption could pass, which the generation-id match above guards. + // stale/partial index (a per-source count drift) and SOURCE drift, which a + // bare total-row-count check would miss (Codex r2). This is a deliberately + // CHEAP proxy (a one-time projected source-column scan), NOT a complete + // staleness check: it does NOT detect a same-source, same-cardinality PID + // swap (mismatch stays 0). That residual is closed by (i) the generation-id + // match above — the membership half is a hash over membership *including pid*, + // so any swap of a pid that HAS membership changes it — leaving only swaps of + // the #306 no-membership pids, and (ii) the BUILD-TIME full coverage + // fingerprint + pid-set equality gate (validate_frontend_derived, + // SERIALIZATIONS §4.12), which is the authoritative coverage check. A future + // co-published expected-composite-id pointer would let the runtime assert the + // full fingerprint directly (deferred). const cov = await db.query(` WITH i AS (SELECT source, COUNT(*) c FROM read_parquet('${index_url}') GROUP BY source), f AS (SELECT source, COUNT(*) c FROM read_parquet('${facets_url}') GROUP BY source) From fad4201450f9a98041c8851797fb173f2c184ff8 Mon Sep 17 00:00:00 2001 From: Raymond Yee Date: Sat, 20 Jun 2026 06:49:05 -0700 Subject: [PATCH 12/14] #304/#305 Phase 2: make the facetIndexReady coverage check cheap (boot-perf) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In-browser DuckDB-WASM verification against the live index showed the per-source coverage handshake cost ~8.6s at boot — it scanned the 60MB facets_v3. Compare the index's per-source histogram to facet_summaries' source rows instead: the builder computes those as GROUP BY source over the SAME located set (samp_geo), the file is ~2KB and already loaded at boot, and it's verified equivalent (mismatch=0 on real 202608 data). This removes a multi-second connection hold at boot — exactly the single-connection starvation the app guards against — with no loss of coverage signal. Index source filter mirrors the builder's NULLIF(TRIM(source),'') so empty-string sources can't cause a false mismatch. Verified end-to-end in WASM: preflight PASS, Eric's case = 145,770, cold 1121ms / warm 174ms. Refs #305 #304 --- explorer.qmd | 34 +++++++++++++++++++--------------- 1 file changed, 19 insertions(+), 15 deletions(-) diff --git a/explorer.qmd b/explorer.qmd index f6dc97f..2ea754c 100644 --- a/explorer.qmd +++ b/explorer.qmd @@ -1919,22 +1919,26 @@ facetIndexReady = { } // (d) runtime coverage handshake: the index must cover the SAME located // universe the counts are about. Compare the per-SOURCE histogram of the - // index to facets (the located set) — a symmetric diff that catches a - // stale/partial index (a per-source count drift) and SOURCE drift, which a - // bare total-row-count check would miss (Codex r2). This is a deliberately - // CHEAP proxy (a one-time projected source-column scan), NOT a complete - // staleness check: it does NOT detect a same-source, same-cardinality PID - // swap (mismatch stays 0). That residual is closed by (i) the generation-id - // match above — the membership half is a hash over membership *including pid*, - // so any swap of a pid that HAS membership changes it — leaving only swaps of - // the #306 no-membership pids, and (ii) the BUILD-TIME full coverage - // fingerprint + pid-set equality gate (validate_frontend_derived, - // SERIALIZATIONS §4.12), which is the authoritative coverage check. A future - // co-published expected-composite-id pointer would let the runtime assert the - // full fingerprint directly (deferred). + // index to facet_summaries' source rows (which the builder computes as + // GROUP BY source over the SAME located set, samp_geo) — a symmetric diff + // that catches a stale/partial index (per-source count drift) and SOURCE + // drift, which a bare total-row-count check would miss (Codex r2). + // facet_summaries is ~2 KB and already loaded at boot, so this is a near-free + // check; an earlier draft scanned the 60 MB facets_v3 and cost ~8.6 s of + // DuckDB-WASM connection time at boot (measured in-browser) — exactly the + // single-connection starvation this app guards against. It is still a CHEAP + // proxy, NOT a complete staleness check: it does NOT detect a same-source, + // same-cardinality PID swap (mismatch stays 0). That residual is closed by + // (i) the generation-id match above — the membership half is a hash over + // membership *including pid*, so any swap of a pid that HAS membership changes + // it, leaving only swaps of the #306 no-membership pids — and (ii) the + // BUILD-TIME full coverage fingerprint + pid-set equality gate + // (validate_frontend_derived, SERIALIZATIONS §4.12), the authoritative check. const cov = await db.query(` - WITH i AS (SELECT source, COUNT(*) c FROM read_parquet('${index_url}') GROUP BY source), - f AS (SELECT source, COUNT(*) c FROM read_parquet('${facets_url}') GROUP BY source) + WITH i AS (SELECT source AS v, COUNT(*) c FROM read_parquet('${index_url}') + WHERE NULLIF(TRIM(source), '') IS NOT NULL GROUP BY source), + f AS (SELECT facet_value AS v, count AS c + FROM read_parquet('${facet_summaries_url}') WHERE facet_type = 'source') SELECT (SELECT COUNT(*) FROM (SELECT * FROM i EXCEPT SELECT * FROM f)) + (SELECT COUNT(*) FROM (SELECT * FROM f EXCEPT SELECT * FROM i)) AS mismatch`); if (Number(cov[0].mismatch) !== 0) { From c2b3553124ff36ee0c4f29ae0f31d25bec5daa5a Mon Sep 17 00:00:00 2001 From: Raymond Yee Date: Sat, 20 Jun 2026 06:19:35 -0700 Subject: [PATCH 13/14] #304/#305 Phase 3: extend mask-index counts to viewport + search MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Phase 2 routed global, no-search multi-filter counts through sample_facet_index. Phase 3 extends the SAME path to the two cases it gated OUT: - VIEWPORT: the index has no coordinates, so a non-global view JOINs samples_map_lite (l.latitude/l.longitude) on pid (one row/pid → no grain change) and appends the same VIEWPORT_PAD_FACTOR=0.3 padded bbox the table, point loader and heatmap use. - SEARCH: appends the search_pids semi-join (s.pid IN (SELECT pid FROM search_pids)). Both fold into the single UNION ALL statement via a bare-predicate WHERE builder. Index columns are now s.-qualified because lite ALSO has a `source` column. Gate generalized: the index path now runs whenever there's ANY constraint (cross-filter, viewport, or search) in full tree mode. Fallback differs by view (honesty rule): at GLOBAL+no-search an index miss shows the dash (legacy = baseline = #304); at VIEWPORT/SEARCH an index miss falls through to the legacy membership path, which is CORRECT there. Both paths coexist — the membership count path is RETAINED this phase; retirement is deferred to a browser-verified follow-up (plan: retire only after parity is confirmed in-app). Characterization (native DuckDB parity, real 202608 data): object_type counts for material=anthropogenic scoped to a Mediterranean bbox are IDENTICAL between the index path and the membership path (COUNT(*) over the mask join == COUNT(DISTINCT pid) over membership). The search semi-join is structurally identical to the membership path's, so search parity follows from the proven mask≡membership set identity. Refs #305 #304 --- explorer.qmd | 85 ++++++++++++++++++++++++++++++++++------------------ 1 file changed, 56 insertions(+), 29 deletions(-) diff --git a/explorer.qmd b/explorer.qmd index 2ea754c..881cf9e 100644 --- a/explorer.qmd +++ b/explorer.qmd @@ -3616,12 +3616,15 @@ zoomWatcher = { mask |= (1n << BigInt(idx)); } if (mask === 0n) return null; - conds.push(`(${key}_mask & ${mask}) <> 0`); // BigInt → decimal BIGINT literal + // s.-qualified: Phase 3 JOINs samples_map_lite (which ALSO has a `source` + // column) for viewport scoping, so unqualified `source` would be ambiguous. + // Mask columns exist only on the index, but qualify everything for clarity. + conds.push(`(s.${key}_mask & ${mask}) <> 0`); // BigInt → decimal BIGINT literal } if (excludeKey !== 'source') { if (snap.sourceImpossible) return ['1=0']; // no source selectable → 0 for tree targets if (snap.source.length > 0) { - conds.push(`source IN (${snap.source.map(v => `'${escSql(v)}'`).join(',')})`); + conds.push(`s.source IN (${snap.source.map(v => `'${escSql(v)}'`).join(',')})`); } } return conds; @@ -3632,24 +3635,40 @@ zoomWatcher = { // Returns: 'ok' (applied), 'superseded' (a newer request won), 'unavailable' // (index/bitmap usable but a query failed → honesty: caller shows the dash), or // 'fallthrough' (index/bitmap not usable → caller uses legacy paths). - async function applyMaskIndexCounts(snap, dims, myReq) { + async function applyMaskIndexCounts(snap, dims, myReq, bboxSQL = null) { if (!window.__facetIndexReady || !window.__nodeBitsMap) return 'fallthrough'; // Build ONE query (UNION ALL of per-dim arms) rather than 4 concurrent // full-index scans — the documented DuckDB-WASM single-connection // starvation risk (Codex P2). One statement = one connection slot. Each arm // carries its OWN exclude-self cross-filter WHERE. Common output shape // (dim, value, count) so one pass groups them. + // + // #305 Phase 3 — viewport + search: the index carries no coordinates, so a + // viewport-scoped count JOINs samples_map_lite (l.latitude/l.longitude) on + // pid (one row/pid → no grain change) and appends the SAME padded bbox the + // table/heatmap use; an active search appends the search_pids semi-join. + // bboxSQL (when set) is already built against l.latitude/l.longitude by the + // caller. These are bare predicates (no leading AND) so the WHERE assembles + // cleanly even when there is no cross-filter constraint (viewport-only). + const liteJoin = bboxSQL ? `JOIN read_parquet('${lite_url}') l ON l.pid = s.pid` : ''; + const bboxBare = bboxSQL ? bboxSQL.replace(/^\s*AND\s+/i, '') : null; + const searchBare = searchIsActive() ? 's.pid IN (SELECT pid FROM search_pids)' : null; + const buildWhere = (conds, extra = []) => { + const preds = [...conds, ...extra]; + if (bboxBare) preds.push(bboxBare); + if (searchBare) preds.push(searchBare); + return preds.length ? `WHERE ${preds.join(' AND ')}` : ''; + }; const arms = []; for (const d of dims) { const conds = maskIndexWhere(snap, d.key); if (conds === null) return 'unavailable'; // selected node missing a bit → can't compute honestly - const whereSQL = conds.length ? `WHERE ${conds.join(' AND ')}` : ''; if (d.key === 'source') { // source target: COUNT(*) per source over the index (one row/pid), - // constrained by the tree dims only (exclude-self). - const w = conds.length ? `${whereSQL} AND source IS NOT NULL` : 'WHERE source IS NOT NULL'; - arms.push(`SELECT 'source' AS dim, source AS value, COUNT(*) AS count - FROM read_parquet('${index_url}') s ${w} GROUP BY source`); + // constrained by the tree dims only (exclude-self) + viewport/search. + arms.push(`SELECT 'source' AS dim, s.source AS value, COUNT(*) AS count + FROM read_parquet('${index_url}') s ${liteJoin} + ${buildWhere(conds, ['s.source IS NOT NULL'])} GROUP BY s.source`); } else { // tree target: JOIN node_bits; each (pid,node) the pid is a member of // yields one row, so COUNT(*) per concept_uri == distinct pids with @@ -3659,7 +3678,8 @@ zoomWatcher = { JOIN read_parquet('${node_bits_url}') nb ON nb.facet_type = '${d.key}' AND (s.${d.key}_mask & (1::BIGINT << nb.bit_index)) <> 0 - ${whereSQL} GROUP BY nb.concept_uri`); + ${liteJoin} + ${buildWhere(conds)} GROUP BY nb.concept_uri`); } } try { @@ -3745,29 +3765,36 @@ zoomWatcher = { if (myReq !== facetCountsReqId) return; } - // #304/#305: MULTI-filter (or multi-value single-dim, or source-impossible, - // or a single-filter cube MISS) at global view + full tree mode + no search. - // The single-filter cube above can't serve these, and the legacy slow path - // below reverts every tree dim to the unfiltered BASELINE at global view - // (the #304 bug). Route through the complete per-pid index instead — correct - // multi-filter counts in one columnar bitmask scan, applied atomically. - // HONESTY RULE: on any failure show "unavailable", NEVER the baseline. A - // 'fallthrough' (index/bitmap not usable, e.g. not yet published) defers to - // the legacy paths unchanged so this is safe to ship before the index lands. - if (bboxSQL === null && !searchIsActive() && TREE_DIM_KEYS.every(treeActive)) { + // #304/#305: route facet counts through the complete per-pid index whenever + // there is ANY constraint (cross-filter, viewport, or search) in full tree + // mode. Phase 2 covered global+no-search multi-filter; Phase 3 extends the + // SAME index path to VIEWPORT (JOIN samples_map_lite + padded bbox) and + // SEARCH (search_pids semi-join). The single-filter cube above stays a pure + // global fast-path optimization. + // + // FALLBACK SEMANTICS differ by view (the honesty rule): + // - GLOBAL + no search: the legacy slow path reverts tree dims to the + // unfiltered BASELINE — that IS the #304 bug. So on index miss/error here + // we show the "unavailable" dash, NEVER baseline. (__onFacetIndexReady + // re-runs this when readiness flips, turning the dash into real counts.) + // - VIEWPORT or SEARCH: the legacy membership path below computes the + // CORRECT viewport/search counts, so an index miss/error falls through to + // it (both paths coexist until the membership path is retired — Phase 3 + // end, after characterization confirms parity). + if (TREE_DIM_KEYS.every(treeActive)) { const snap = directFilterSnapshot(); - if (snap.activeCount >= 1) { - const res = await applyMaskIndexCounts(snap, dims, myReq); + const atGlobalNoSearch = (bboxSQL === null && !searchIsActive()); + const hasConstraint = snap.activeCount >= 1 || bboxSQL !== null || searchIsActive(); + if (hasConstraint) { + const res = await applyMaskIndexCounts(snap, dims, myReq, bboxSQL); if (res === 'superseded' || res === 'ok') return; - // res is 'unavailable' OR 'fallthrough' (index not ready / a node has - // no bit). At global + full tree + active filters there is NO correct - // legacy path — the legacy slow path reverts tree dims to the - // unfiltered BASELINE, which IS the #304 bug. Honesty rule: show the - // dash, never baseline. (When readiness flips true later, - // __onFacetIndexReady re-runs this and the dash becomes real counts.) if (myReq !== facetCountsReqId) return; // defensive stale check (Codex) - markFacetCountsUnavailable(); - return; + if (atGlobalNoSearch) { + // no correct legacy path here → honest dash, never baseline (#304) + markFacetCountsUnavailable(); + return; + } + // viewport/search: fall through to the correct legacy membership path } } From 4aa989bef0295f2deefe4d353110608a0518fec1 Mon Sep 17 00:00:00 2001 From: Raymond Yee Date: Sat, 20 Jun 2026 06:23:53 -0700 Subject: [PATCH 14/14] #304/#305 Phase 3: fix global-search honesty fallback (Codex P3) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit At global view WITH active search, an index miss/error fell through (atGlobalNoSearch was false because search was active) to the legacy tree path, which hits `if (treeActive && !bboxSQL) applyFacetCounts(d.key, null)` — the unfiltered BASELINE, ignoring search. Honesty violation. The correct pivot is whether a VIEWPORT BBOX exists, not whether search is active: at global (bboxSQL === null) there is no correct legacy tree path regardless of search, so an index miss now shows the dash. Fall through to the legacy membership path ONLY when a viewport bbox exists (where it is correct). Drops atGlobalNoSearch in favor of `bboxSQL === null`. Codex confirmed everything else sound: grain (lite join one-row-per-pid under the validated PID-set-equality contract), column binding, bbox regex (strips only the first leading AND, anti-meridian inner ANDs intact), escaping, viewport-only histograms, concurrency/atomicity. The requested regression test (global+search + forced index-unavailable → tree counts "(—)" not baseline) is recorded as a Phase 4 Playwright case (#43) — the count engine runs in-browser, so it needs the Playwright harness, not a headless unit test. Refs #305 #304 --- explorer.qmd | 29 ++++++++++++++++------------- 1 file changed, 16 insertions(+), 13 deletions(-) diff --git a/explorer.qmd b/explorer.qmd index 881cf9e..412d372 100644 --- a/explorer.qmd +++ b/explorer.qmd @@ -3772,29 +3772,32 @@ zoomWatcher = { // SEARCH (search_pids semi-join). The single-filter cube above stays a pure // global fast-path optimization. // - // FALLBACK SEMANTICS differ by view (the honesty rule): - // - GLOBAL + no search: the legacy slow path reverts tree dims to the - // unfiltered BASELINE — that IS the #304 bug. So on index miss/error here - // we show the "unavailable" dash, NEVER baseline. (__onFacetIndexReady - // re-runs this when readiness flips, turning the dash into real counts.) - // - VIEWPORT or SEARCH: the legacy membership path below computes the - // CORRECT viewport/search counts, so an index miss/error falls through to - // it (both paths coexist until the membership path is retired — Phase 3 - // end, after characterization confirms parity). + // FALLBACK SEMANTICS differ by view (the honesty rule). The pivot is whether + // a viewport bbox exists, NOT whether search is active (Codex P3): + // - GLOBAL (bboxSQL === null), with OR without search: the legacy tree path + // below hits `if (treeActive && !bboxSQL) applyFacetCounts(d.key, null)` — + // the unfiltered BASELINE — which IGNORES search and the cross-filters + // (#304). There is NO correct legacy global tree path. So on index + // miss/error at global we show the "unavailable" dash, NEVER baseline. + // (__onFacetIndexReady re-runs this when readiness flips → real counts.) + // - VIEWPORT (bboxSQL !== null), ±search: the legacy membership path JOINs + // samples_map_lite + bbox + the search semi-join and is CORRECT, so an + // index miss/error falls through to it (both paths coexist until the + // membership path is retired — after in-app parity is confirmed). if (TREE_DIM_KEYS.every(treeActive)) { const snap = directFilterSnapshot(); - const atGlobalNoSearch = (bboxSQL === null && !searchIsActive()); const hasConstraint = snap.activeCount >= 1 || bboxSQL !== null || searchIsActive(); if (hasConstraint) { const res = await applyMaskIndexCounts(snap, dims, myReq, bboxSQL); if (res === 'superseded' || res === 'ok') return; if (myReq !== facetCountsReqId) return; // defensive stale check (Codex) - if (atGlobalNoSearch) { - // no correct legacy path here → honest dash, never baseline (#304) + if (bboxSQL === null) { + // global (incl. global+search): no correct legacy tree path → + // honest dash, never baseline (#304). markFacetCountsUnavailable(); return; } - // viewport/search: fall through to the correct legacy membership path + // viewport (±search): fall through to the correct legacy membership path } }