Data: sample_facet_masks omits ~29,917 located samples with no tree membership (built from membership, not samp_geo) #306

@rdhyee

Summary

sample_facet_masks (and the sample_facet_membership it derives from) omit 29,917 located samples that have coordinates but no hierarchical facet membership. Any count or filter computed from the masks therefore silently undercounts the located universe by that many samples.

Evidence

Measured on the deployed 202608 data:

  • facets_v3 (located pid universe): 6,026,242 pids
  • sample_facet_masks / membership pid universe: 5,996,325 pids
  • Missing from masks: 29,917 located pids

Root cause

build_sample_facet_masks() (scripts/build_frontend_derived.py:537) starts from membership:

FROM membership m JOIN nb ON ...
GROUP BY m.pid

A located sample that asserts no concept in any of the three SKOS trees (material / context / object_type) never appears in membership, so it gets no mask row at all, where it should get a row with all-zero masks. The bitmask path cannot see it.
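A minimal sketch of the failure mode, run against an in-memory SQLite database. The schema, the sample data, and the join condition on nb are hypothetical (the real join is elided above), and SUM over distinct bits stands in for a bitwise-OR aggregate, which SQLite lacks:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, simplified schema; the real tables carry more columns.
    CREATE TABLE samp_geo (pid TEXT PRIMARY KEY);              -- located universe
    CREATE TABLE membership (pid TEXT, tree TEXT, node TEXT);  -- one row per asserted concept
    CREATE TABLE nb (tree TEXT, node TEXT, bit INTEGER);       -- node -> bit position
    INSERT INTO samp_geo VALUES ('p1'), ('p2'), ('p3');
    INSERT INTO nb VALUES ('material', 'rock', 0), ('material', 'soil', 1);
    -- p3 is located but asserts no concept in any tree.
    INSERT INTO membership VALUES ('p1', 'material', 'rock'), ('p2', 'material', 'soil');
""")

# Same shape as the current build: start from membership, group by pid.
# The ON clause is an assumption made for this sketch.
mask_rows = conn.execute("""
    SELECT m.pid, SUM(1 << nb.bit) AS material_mask
    FROM membership m JOIN nb ON nb.tree = m.tree AND nb.node = m.node
    GROUP BY m.pid
    ORDER BY m.pid
""").fetchall()

located = {pid for (pid,) in conn.execute("SELECT pid FROM samp_geo")}
masked = {pid for pid, _ in mask_rows}
missing = located - masked  # located pids with no mask row at all

print(mask_rows)
print(missing)
```

In this toy data p3 has no mask row, which is the same gap the 29,917 pids fall into.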

Why it matters now

The gap is latent today: the masks are used only for filtering, where a no-membership sample legitimately matches no tree node. The #305 plan, however, computes facet counts and the located-sample baseline from a masks-based index. Those counts (e.g. the "source" facet count and any "samples in view" total) must cover the full located universe, including zero-membership samples, so the gap would produce plausible but wrong numbers.

Fix

Build the per-pid index starting from samp_geo (the located universe), LEFT JOIN the aggregated masks, and emit zero masks for samples with no hierarchical membership:

samp_geo  LEFT JOIN (aggregated masks)  ->  COALESCE(mask, 0)

Carry this into the new complete index described in #305 Phase 1 (pid, source, material_mask, context_mask, object_type_mask, build_id, schema_version), and add a validator gate asserting index.pid_count == facets_v3.pid_count.
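The LEFT JOIN plus COALESCE shape could look roughly like the sketch below. It assumes a pre-aggregated table named agg_masks (a hypothetical name) and leaves out source, build_id, and schema_version, since the issue does not say where those columns come from:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, simplified tables for illustration only.
    CREATE TABLE samp_geo (pid TEXT PRIMARY KEY);  -- located universe
    CREATE TABLE agg_masks (                       -- masks aggregated from membership
        pid TEXT PRIMARY KEY,
        material_mask INTEGER,
        context_mask INTEGER,
        object_type_mask INTEGER
    );
    INSERT INTO samp_geo VALUES ('p1'), ('p2'), ('p3');
    -- p3 has no membership, so it has no aggregated mask row.
    INSERT INTO agg_masks VALUES ('p1', 1, 0, 4), ('p2', 2, 8, 0);
""")

# Start from the located universe; missing masks become 0, not a missing row.
index_rows = conn.execute("""
    SELECT g.pid,
           COALESCE(a.material_mask, 0)    AS material_mask,
           COALESCE(a.context_mask, 0)     AS context_mask,
           COALESCE(a.object_type_mask, 0) AS object_type_mask
    FROM samp_geo g
    LEFT JOIN agg_masks a ON a.pid = g.pid
    ORDER BY g.pid
""").fetchall()

print(index_rows)
```

Every located pid now gets exactly one row, so the row count of the index matches the located universe by construction.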

Verification

  • Validator: pid-set equality between the new index and facets_v3; source equality; mask≡membership for samples that DO have membership.
  • Confirm the 29,917 previously-missing pids appear with all-zero masks.
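The checks above can be sketched as a small pure-Python gate. The function name, the IndexRow type, and the dict-based inputs are all hypothetical; a real validator would read the built index and facets_v3 directly:

```python
from typing import NamedTuple


class IndexRow(NamedTuple):
    source: str
    material_mask: int
    context_mask: int
    object_type_mask: int


def validate_index(index, facets_sources, membership_masks):
    """Return a list of error strings; an empty list means the gate passes.

    index:            pid -> IndexRow (the new complete index)
    facets_sources:   pid -> source (stand-in for facets_v3)
    membership_masks: pid -> (material, context, object_type) masks,
                      present only for pids that have membership
    """
    errors = []
    missing = sorted(set(facets_sources) - set(index))
    extra = sorted(set(index) - set(facets_sources))
    if missing:
        errors.append(f"missing from index: {missing}")
    if extra:
        errors.append(f"not in facets_v3: {extra}")
    for pid in sorted(set(index) & set(facets_sources)):
        row = index[pid]
        if row.source != facets_sources[pid]:
            errors.append(f"{pid}: source mismatch")
        # Pids without membership are expected to carry all-zero masks.
        expected = membership_masks.get(pid, (0, 0, 0))
        if tuple(row[1:]) != expected:
            errors.append(f"{pid}: mask mismatch")
    return errors


facets = {"p1": "A", "p2": "B", "p3": "A"}
membership = {"p1": (1, 0, 4), "p2": (2, 8, 0)}
good = {
    "p1": IndexRow("A", 1, 0, 4),
    "p2": IndexRow("B", 2, 8, 0),
    "p3": IndexRow("A", 0, 0, 0),
}
gapped = {pid: row for pid, row in good.items() if pid != "p3"}

print(validate_index(good, facets, membership))
print(validate_index(gapped, facets, membership))
```

Comparing pid sets is stricter than comparing counts alone: two indexes can have equal pid_count while differing in which pids they contain.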

Found during the #305 plan review (Codex). Related: #305, #299 (introduced the masks), #304.

Labels: bug (Something isn't working), explorer (Interactive Explorer features), infrastructure (Hosting, CI/CD, domain, Cloudflare)