Summary
sample_facet_masks (and the sample_facet_membership it derives from) omit 29,917 located samples that have coordinates but no hierarchical facet membership. Any count computed from the masks therefore silently undercounts the true located universe.
Evidence
Measured on the deployed 202608 data:
- facets_v3 (located pid universe): 6,026,242 pids
- sample_facet_masks / membership pid universe: 5,996,325 pids
- Missing from masks: 29,917 located pids (6,026,242 - 5,996,325)
Root cause
build_sample_facet_masks() (scripts/build_frontend_derived.py:537) starts from membership:
FROM membership m JOIN nb ON ...
GROUP BY m.pid
A located sample that asserts no concept in any of the three SKOS trees (material / context / object_type) never appears in membership, so it gets no mask row at all, where it should get a row with all-zero masks. It is therefore invisible to the bitmask path.
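The effect can be reproduced on a toy dataset. The following is a minimal sketch using Python's built-in sqlite3, not the project's actual engine or schema: the column names (tree, node) and the sample pids are hypothetical, and only the table names samp_geo and membership come from this issue. It shows that grouping from membership yields no row for a located pid without membership, and that an anti-join from samp_geo finds exactly those pids.

```python
import sqlite3

# Toy schema: table names follow the issue, columns and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE samp_geo (pid TEXT PRIMARY KEY);
    CREATE TABLE membership (pid TEXT, tree TEXT, node TEXT);
    INSERT INTO samp_geo VALUES ('p1'), ('p2'), ('p3');
    INSERT INTO membership VALUES
        ('p1', 'material', 'rock'),
        ('p1', 'context', 'marine'),
        ('p2', 'object_type', 'specimen');
    """
)

# Starting from membership: p3 has coordinates but no membership, so no row appears.
mask_pids = [
    row[0]
    for row in conn.execute(
        "SELECT m.pid FROM membership m GROUP BY m.pid ORDER BY m.pid"
    )
]

# Anti-join from the located universe: pids that the mask build never sees.
missing = [
    row[0]
    for row in conn.execute(
        """
        SELECT g.pid
        FROM samp_geo g
        LEFT JOIN membership m ON m.pid = g.pid
        WHERE m.pid IS NULL
        ORDER BY g.pid
        """
    )
]

print("pids with a mask row:", mask_pids)
print("located pids missing from masks:", missing)
```

Run against the real tables, the same anti-join should list the 29,917 missing pids, assuming samp_geo and facets_v3 describe the same located universe.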
Why it matters now
It's latent today (the masks are only used for filtering, where a no-membership sample legitimately matches no tree node). But the #305 plan computes facet counts and the located-sample baseline from a masks-based index. Those counts, such as the "source" facet count and any "samples in view" total, must reflect the full located universe, including zero-membership samples, so this gap would produce convincing but wrong numbers.
Fix
Build the per-pid index starting from samp_geo (the located universe), LEFT JOIN the aggregated masks, and emit zero masks for samples with no hierarchical membership:
samp_geo LEFT JOIN (aggregated masks) -> COALESCE(mask, 0)
Carry this into the new complete index described in #305 Phase 1 (pid, source, material_mask, context_mask, object_type_mask, build_id, schema_version), and add a validator gate asserting index.pid_count == facets_v3.pid_count.
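One way the LEFT JOIN shape could look is sketched below, again with sqlite3 as a stand-in. This is an assumption-laden illustration, not the build script's code: the membership layout (pid, tree, bit), the source values, and the helper name build_complete_index are all hypothetical, and build_id / schema_version are left out for brevity.

```python
import sqlite3


def build_complete_index(conn):
    """Return one row per located pid; pids without membership get zero masks."""
    return conn.execute(
        """
        WITH agg AS (
            -- Summing distinct powers of two is equivalent to OR-ing the bits.
            SELECT pid,
                   SUM(CASE WHEN tree = 'material' THEN 1 << bit ELSE 0 END) AS material_mask,
                   SUM(CASE WHEN tree = 'context' THEN 1 << bit ELSE 0 END) AS context_mask,
                   SUM(CASE WHEN tree = 'object_type' THEN 1 << bit ELSE 0 END) AS object_type_mask
            FROM (SELECT DISTINCT pid, tree, bit FROM membership)
            GROUP BY pid
        )
        SELECT g.pid,
               g.source,
               COALESCE(a.material_mask, 0),
               COALESCE(a.context_mask, 0),
               COALESCE(a.object_type_mask, 0)
        FROM samp_geo g                     -- start from the located universe
        LEFT JOIN agg a ON a.pid = g.pid    -- keep pids that have no membership
        ORDER BY g.pid
        """
    ).fetchall()


# Toy data: hypothetical columns and values, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE samp_geo (pid TEXT PRIMARY KEY, source TEXT);
    CREATE TABLE membership (pid TEXT, tree TEXT, bit INTEGER);
    INSERT INTO samp_geo VALUES ('a', 'src1'), ('b', 'src2'), ('c', 'src1');
    INSERT INTO membership VALUES
        ('a', 'material', 0), ('a', 'material', 2), ('a', 'context', 1),
        ('b', 'object_type', 3);
    """
)

rows = build_complete_index(conn)
for row in rows:
    print(row)
```

Here pid 'c' has no membership and still appears, with all three masks equal to 0, which is the behaviour the fix asks for.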
Verification
- Validator: pid-set equality between the new index and facets_v3; source equality; mask≡membership for samples that DO have membership.
- Confirm the 29,917 previously-missing pids appear with all-zero masks.
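The checks above could be expressed roughly as follows. This is a sketch over plain in-memory dictionaries, not the project's validator: the function name validate_index, the data shapes, and the sample values are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Masks = Tuple[int, int, int]  # (material_mask, context_mask, object_type_mask)


def validate_index(
    index: Dict[str, Tuple[str, Masks]],   # pid -> (source, masks) from the new index
    facets_sources: Dict[str, str],         # pid -> source from facets_v3
    membership_masks: Dict[str, Masks],     # pid -> masks derived from membership
) -> List[str]:
    """Return human-readable errors; an empty list means the gate passes."""
    errors: List[str] = []
    index_pids, facets_pids = set(index), set(facets_sources)

    # 1. pid-set equality (stronger than comparing counts alone).
    if index_pids != facets_pids:
        errors.append(
            f"pid sets differ: {len(facets_pids - index_pids)} missing from index, "
            f"{len(index_pids - facets_pids)} not in facets_v3"
        )

    # 2. Source equality for pids present on both sides.
    for pid in sorted(index_pids & facets_pids):
        if index[pid][0] != facets_sources[pid]:
            errors.append(f"source mismatch for {pid}")

    # 3. Masks match membership where it exists, and are all-zero otherwise.
    for pid in sorted(index_pids):
        expected = membership_masks.get(pid, (0, 0, 0))
        if index[pid][1] != expected:
            errors.append(f"mask mismatch for {pid}")

    return errors


# Hypothetical toy inputs.
facets = {"a": "src1", "b": "src2", "c": "src1"}
membership = {"a": (5, 2, 0), "b": (0, 0, 8)}
complete = {"a": ("src1", (5, 2, 0)), "b": ("src2", (0, 0, 8)), "c": ("src1", (0, 0, 0))}
membership_only = {pid: row for pid, row in complete.items() if pid in membership}

ok_errors = validate_index(complete, facets, membership)
gap_errors = validate_index(membership_only, facets, membership)
print(ok_errors)
print(gap_errors)
```

The second call mimics the current membership-first build: the index lacks pid 'c', so the pid-set check fails even though every row that is present looks correct.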
Found during the #305 plan review (Codex). Related: #305, #299 (introduced the masks), #304.