Skip to content

Bound defined-term extraction and gate by reference type to prevent unbounded work#2086

Merged
JSv4 merged 5 commits into
mainfrom
codex/fix-resource-exhaustion-in-defined-term-extraction
Jul 1, 2026
Merged

Bound defined-term extraction and gate by reference type to prevent unbounded work#2086
JSv4 merged 5 commits into
mainfrom
codex/fix-resource-exhaustion-in-defined-term-extraction

Conversation

@JSv4

@JSv4 JSv4 commented Jun 29, 2026

Copy link
Copy Markdown
Collaborator

Motivation

  • Fix an availability/DoS risk where defined-term extraction materialized and sorted every regex match in a document before applying MAX_DEFINED_TERMS, allowing an attacker to force excessive CPU/memory use.
  • Ensure default scans do not pay the cost of extracting opt-in DEFINED_TERM candidates when DEFAULT_REFERENCE_TYPES excludes them.

Description

  • Added an optional reference_types parameter to ReferenceExtractor.extract(...) and short-circuited individual grammars so callers can skip unwanted extraction work (e.g. skip _terms() when DEFINED_TERM is not requested).
  • Rewrote the defined-term logic to a single streaming regex (_TERM_RE) that iterates in document order and stops after examining/ emitting up to MAX_DEFINED_TERMS without materializing or sorting all matches.
  • Updated the enrichment orchestration to pass the wanted reference types into the registry extractor so enrichment/scan paths avoid running _terms() unless requested.
  • Added a unit test test_reference_type_filter_skips_unwanted_defined_terms to assert that _terms() is not invoked when the default reference set is used and document references still get extracted.

Testing

  • Successfully type-checked/compiled the modified modules with python3 -m py_compile opencontractserver/enrichment/extractor.py opencontractserver/enrichment/services/enrichment_service.py opencontractserver/tests/test_enrichment_extractor.py.
  • Added a unit test for the reference-type gating in opencontractserver/tests/test_enrichment_extractor.py.
  • Attempted to run pytest opencontractserver/tests/test_enrichment_extractor.py -q but the test run was blocked by the environment missing Django (ModuleNotFoundError: No module named 'django').

Codex Task

@codecov

codecov Bot commented Jun 29, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 97.50000% with 1 line in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
opencontractserver/enrichment/extractor.py 95.45% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

@claude

claude Bot commented Jun 30, 2026

Copy link
Copy Markdown

Code Review

This PR fixes a real availability risk in defined-term extraction — the old code materialized and sorted all regex matches before applying MAX_DEFINED_TERMS, making the cost proportional to document length. The reference_types gate is a clean addition that avoids running sub-extractors entirely when their output type is not requested. Three correctness/quality issues survived verification and are worth fixing before merge.


Finding 1 — examined counter exhausts the cap on duplicate matches (HIGH)

File: opencontractserver/enrichment/extractor.py line 241

The examined counter increments on every regex hit, including duplicates immediately skipped via slug in seen. A document with 50 occurrences of (the "Company") followed by (the "Borrower") returns only one unique term: match 51 triggers examined > 50 and returns before "Borrower" is ever examined. The old code capped only on emitted, so duplicates consumed no budget and both terms would have been found.

The docstring still says "the first N definition sites win", but the actual semantics are now "the first N raw regex hits are scanned" — silent under-extraction when early hits are repetitive.

Fix: Remove the examined > C.MAX_DEFINED_TERMS guard and keep only emitted >= C.MAX_DEFINED_TERMS. If a raw-scan hard ceiling for DoS prevention is still desired, use a separate larger budget (e.g. 10 * MAX_DEFINED_TERMS) so duplicates do not consume the unique-term quota.


Finding 2 — emitted >= C.MAX_DEFINED_TERMS is a dead condition (MEDIUM)

File: opencontractserver/enrichment/extractor.py line 241

Because emitted increments at most once per iteration, emitted <= examined - 1 is an invariant at the guard. emitted >= 50 therefore requires examined >= 51, but examined > 50 short-circuits the or first. The right-hand side can never fire independently — it is unreachable dead code that misleads readers.

Fix: Either drop or emitted >= C.MAX_DEFINED_TERMS entirely, or fix Finding 1 by removing the examined guard and keeping only the emitted check.


Finding 3 — _TERM_PAREN_RE and _TERM_MEANS_RE are dead code (MEDIUM)

File: opencontractserver/enrichment/extractor.py lines 84-91

Both constants are still compiled at module load but _terms() now uses _TERM_RE exclusively. A codebase-wide grep confirms no other call sites exist. A developer patching _TERM_PAREN_RE expecting to affect extraction will be silently ignored.

CLAUDE.md: "No dead code — once it's no longer in use, delete it."

Fix: Delete lines 84-91.


Finding 4 — GenericCitationExtractor.extract() is not gated by reference_types (MEDIUM)

File: opencontractserver/enrichment/services/enrichment_service.py line 160

GenericCitationExtractor.extract(text: str) has no reference_types parameter (grammars.py:351). All nine grammar passes (_usc, _cfr, _fedreg, etc.) always fire regardless of wanted. When wanted contains only REF_DEFINED_TERM, all grammar work is wasted — the REF_LAW candidates produced do not survive the filter at line 166. The optimization introduced for ReferenceExtractor is only half-applied.

Fix: Add reference_types: set[str] | None = None to GenericCitationExtractor.extract(), gate sub-extractors with if wanted is None or C.REF_LAW in wanted, and update the call site to generic.extract(doc_text, reference_types=wanted).


Finding 5 — Missing changelog fragment (LOW)

File: changelog.d/ (absent from the diff)

This PR fixes a production availability risk and adds a new public parameter to ReferenceExtractor.extract(). CLAUDE.md: "Always record significant changes — add a changelog fragment under changelog.d/."

Fix: Add e.g. changelog.d/2086-extractor-gating.fixed.md (DoS fix) and changelog.d/2086-extractor-gating.changed.md (new reference_types parameter).


# Severity Location Issue
1 HIGH extractor.py:241 examined cap fires on duplicates, silently under-extracts unique terms
2 MEDIUM extractor.py:241 emitted >= MAX is unreachable dead code in the or guard
3 MEDIUM extractor.py:84-91 _TERM_PAREN_RE and _TERM_MEANS_RE defined but never used
4 MEDIUM enrichment_service.py:160 generic.extract() has no reference_types gate — runs all grammar passes unconditionally
5 LOW changelog.d/ No changelog fragment for a security-relevant behavioral change

JSv4 added 2 commits June 30, 2026 06:57
… reference type

- Cap _terms on emitted (unique) terms, not raw regex hits, so a duplicate-heavy
  document no longer exhausts MAX_DEFINED_TERMS before later distinct terms are
  reached (silent under-extraction). Add separate MAX_DEFINED_TERM_SCAN raw-scan
  ceiling and remove the unreachable emitted>=MAX branch from the old or-guard.
- Delete dead _TERM_PAREN_RE / _TERM_MEANS_RE (superseded by combined _TERM_RE).
- Add reference_types gate to GenericCitationExtractor.extract(); skip all grammar
  passes when REF_LAW is not wanted and pass wanted through from enrichment_service.
- Changelog fragments for the fix and the new reference_types parameter.
@claude

claude Bot commented Jun 30, 2026

Copy link
Copy Markdown

Code Review

This PR fixes two related problems in defined-term extraction: a silent under-extraction bug where the cap was applied to raw regex hits (including duplicates) rather than unique emitted terms, and a DoS risk where a request for only DEFAULT_REFERENCE_TYPES would still run the expensive _terms() pass. The combined-regex rewrite to _TERM_RE is clean and the new reference_types gating is correct across both ReferenceExtractor and GenericCitationExtractor. A few gaps below.


Confirmed — test gap for the core bug fix

opencontractserver/tests/test_enrichment_extractor.py:148 and 163

Both cap tests (test_defined_term_cap_is_total_in_document_order, test_defined_term_cap_drops_matches_beyond_position_cap) use all-distinct terms, so neither can catch a regression that restores the old "cap on raw hits" behaviour. The specific scenario that was broken — many duplicate copies of an early term followed by a later distinct term — is exactly what needs a dedicated test. Something like:

def test_defined_term_cap_on_unique_not_raw_hits(self):
    # 60 copies of "Company" (would exhaust the old raw-hit cap at 50)
    # followed by a distinct "Change of Control" definition.
    text = '(the "Company") ' * 60 + '"Change of Control" means a transfer of control.'
    cands = self.extractor.extract(text)
    terms = [c for c in cands if c.reference_type == C.REF_DEFINED_TERM]
    # Both should appear: duplicates must not consume the unique-term quota.
    assert any(c.canonical_key == "term:company" for c in terms)
    assert any(c.canonical_key == "term:change-of-control" for c in terms)

Without this, a future refactor that reintroduces the raw-hit counter has no safety net.


Confirmed — no test for GenericCitationExtractor.extract(reference_types=…)

opencontractserver/enrichment/grammars.py (new param)

test_generic_grammars.py exercises GenericCitationExtractor extensively but has zero coverage of the new reference_types parameter. The analogous ReferenceExtractor test exists (test_reference_type_filter_skips_unwanted_defined_terms), but there is no mirror test verifying that generic.extract(text, reference_types={C.REF_SECTION}) short-circuits all 9 grammar passes (which exclusively emit REF_LAW). A patch.object guard like the one already written for _terms would work here.


Confirmed (minor) — wanted normalisation duplicated

opencontractserver/enrichment/extractor.py:124 and opencontractserver/enrichment/grammars.py:359

# identical in both files
wanted = set(reference_types) if reference_types is not None else None

If the sentinel or type contract changes (e.g. to accept frozenset, or to treat an empty collection differently), both sites must be updated in sync. A small helper _normalise_wanted in a shared location, or at minimum a type alias, would make the contract a single-edit-point. Minor, but flagged because CLAUDE.md calls for DRY.


Confirmed (minor) — bare multiplier in MAX_DEFINED_TERM_SCAN

opencontractserver/enrichment/constants.py:88

MAX_DEFINED_TERM_SCAN = 10 * MAX_DEFINED_TERMS

The 10 is well-explained in the adjacent comment, but per CLAUDE.md ("no magic numbers") it should be a named constant, e.g. DEFINED_TERM_SCAN_MULTIPLIER = 10, so a future tuner knows where to look.


Everything else is solid: the combined _TERM_RE alternation correctly handles the two definition forms in document order; the start/end offset arithmetic (m.start(term_group) - 1, m.end(term_group) + 1) is consistent with the old two-regex approach; all callers in enrichment_service.py pass reference_types by keyword so the new positional parameter causes no shadowing; and the changelog fragments follow the required format.

…ce-type gate

Adds test_defined_term_cap_counts_unique_terms_not_raw_hits, reproducing the
exact duplicate-then-distinct scenario that the raw-hit cap bug previously
broke (verified against the old "cap on examined hits" guard: it fails
without the fix). Adds ReferenceTypeFilterTests for
GenericCitationExtractor.extract(reference_types=...), which had no coverage.

Also extracts the wanted-reference-types normalization duplicated between
ReferenceExtractor.extract and GenericCitationExtractor.extract into a single
normalize_reference_types helper, and names the raw-scan multiplier in
MAX_DEFINED_TERM_SCAN as DEFINED_TERM_SCAN_MULTIPLIER instead of a bare 10.
@claude

claude Bot commented Jul 1, 2026

Copy link
Copy Markdown

Review

Solid, well-scoped fix. The core change — replacing "materialize both regex's matches into a list, sort, then cap" with a single combined-alternation regex streamed via finditer and capped on unique emitted terms (with a separate MAX_DEFINED_TERM_SCAN raw-hit ceiling) — correctly addresses both issues called out in the PR description:

  1. Unbounded materialize+sort removed. _terms() (opencontractserver/enrichment/extractor.py:232-277) now early-returns once emitted >= MAX_DEFINED_TERMS or examined > MAX_DEFINED_TERM_SCAN, so finditer's generator is abandoned rather than fully drained/sorted. Good.
  2. Cap-on-unique-terms fix is real and well-tested. test_defined_term_cap_counts_unique_terms_not_raw_hits reproduces the exact "N duplicates starve a later distinct term" scenario and would fail under the old "cap on examined raw hits" logic — good regression coverage.
  3. reference_types gating is correctly wired end-to-end. Verified the only two production call sites (opencontractserver/enrichment/services/enrichment_service.py:153-164, both in _resolve_doc, used by both the sync per-doc path and the concurrent LLM-apply orchestrator) pass reference_types=wanted through to both ReferenceExtractor.extract and GenericCitationExtractor.extract. Since every GenericCitationExtractor grammar pass emits only REF_LAW (confirmed by reading all nine pass functions in grammars.py), short-circuiting the whole family when REF_LAW isn't wanted is correct, not just an approximation.
  4. Regex correctness of the merged _TERM_RE: the parenthetical branch requires a leading literal ( and the "means" branch starts directly on the quote character, so the two alternatives can never both match at the same start offset — merging them into one alternation preserves the original two-regex semantics for the common case. m.start(term_group)/m.end(term_group) are only called on the group that actually participated (guarded via m.group("paren_term") is not None), avoiding the -1 footgun for non-participating named groups in an alternation.
  5. Dead code (_TERM_PAREN_RE, _TERM_MEANS_RE) fully removed with no leftover references. normalize_reference_types cleanly dedupes the sentinel-handling logic that was duplicated between the two extract() methods.

Minor / non-blocking notes

  • Silent under-extraction can still happen in the duplicate-heavy case, now via a different mechanism. If a document has >500 raw defined-term-shaped hits before accumulating fewer than MAX_DEFINED_TERMS unique ones (e.g., pathological repetition of many different near-duplicate phrasings, not just one repeated term), MAX_DEFINED_TERM_SCAN will cut the scan short and later distinct terms will be silently dropped — the same class of bug this PR fixes, just pushed to a higher threshold and now a deliberate tradeoff rather than a bug. Given _doc_text() elsewhere in this module logs a warning on skip (enrichment_service.py:107-117), consider a logger.debug/info when the scan cap (as opposed to the unique-term cap) trips, so this bound is observable in production rather than a fully silent truncation.
  • Redundant final filter. enrichment_service.py:167-171 still does if cand.reference_type in wanted after both extractors already gate on wanted internally. Harmless (and still correct, since LLM candidates merged in afterward aren't pre-filtered), but worth a comment noting it's now belt-and-suspenders for the LLM-candidate path specifically, since a reader diffing this PR might wonder why the filter wasn't removed.
  • Type hint duplication. set[str] | tuple[str, ...] | list[str] | None is repeated verbatim across normalize_reference_types, ReferenceExtractor.extract, and GenericCitationExtractor.extract. A ReferenceTypes = set[str] | tuple[str, ...] | list[str] | None alias in extractor.py would DRY this up per CLAUDE.md's DRY guidance — very minor.
  • Changelog fragments are thorough and well-written per changelog.d/README.md conventions; they cite files but not line numbers (CLAUDE.md's changelog guidance suggests including line numbers) — not a blocker.

Test coverage

Good — test_defined_term_cap_counts_unique_terms_not_raw_hits (regression for the actual bug), test_reference_type_filter_skips_unwanted_defined_terms, and the new ReferenceTypeFilterTests class covering all nine grammar passes being skipped/run correctly. The PR notes the sandbox couldn't run pytest directly (missing Django); worth confirming the full suite (especially test_enrichment_extractor.py and test_generic_grammars.py) is green in CI before merge.

Security

This is itself a security/availability fix (unbounded CPU/memory from unfiltered regex-match materialization+sort), and it does not introduce new injection/XSS/permission surface — no DB/GraphQL/permission code is touched.

No blocking issues found.

@JSv4 JSv4 merged commit 5082a07 into main Jul 1, 2026
13 checks passed
@JSv4 JSv4 deleted the codex/fix-resource-exhaustion-in-defined-term-extraction branch July 1, 2026 04:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant