Bound defined-term extraction and gate by reference type to prevent unbounded work#2086
Conversation
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
…haustion-in-defined-term-extraction
Code ReviewThis PR fixes a real availability risk in defined-term extraction — the old code materialized and sorted all regex matches before applying Finding 1 —
|
| # | Severity | Location | Issue |
|---|---|---|---|
| 1 | HIGH | extractor.py:241 |
examined cap fires on duplicates, silently under-extracts unique terms |
| 2 | MEDIUM | extractor.py:241 |
emitted >= MAX is unreachable dead code in the or guard |
| 3 | MEDIUM | extractor.py:84-91 |
_TERM_PAREN_RE and _TERM_MEANS_RE defined but never used |
| 4 | MEDIUM | enrichment_service.py:160 |
generic.extract() has no reference_types gate — runs all grammar passes unconditionally |
| 5 | LOW | changelog.d/ |
No changelog fragment for a security-relevant behavioral change |
…haustion-in-defined-term-extraction
… reference type - Cap _terms on emitted (unique) terms, not raw regex hits, so a duplicate-heavy document no longer exhausts MAX_DEFINED_TERMS before later distinct terms are reached (silent under-extraction). Add separate MAX_DEFINED_TERM_SCAN raw-scan ceiling and remove the unreachable emitted>=MAX branch from the old or-guard. - Delete dead _TERM_PAREN_RE / _TERM_MEANS_RE (superseded by combined _TERM_RE). - Add reference_types gate to GenericCitationExtractor.extract(); skip all grammar passes when REF_LAW is not wanted and pass wanted through from enrichment_service. - Changelog fragments for the fix and the new reference_types parameter.
Code ReviewThis PR fixes two related problems in defined-term extraction: a silent under-extraction bug where the cap was applied to raw regex hits (including duplicates) rather than unique emitted terms, and a DoS risk where a request for only Confirmed — test gap for the core bug fix
Both cap tests ( def test_defined_term_cap_on_unique_not_raw_hits(self):
# 60 copies of "Company" (would exhaust the old raw-hit cap at 50)
# followed by a distinct "Change of Control" definition.
text = '(the "Company") ' * 60 + '"Change of Control" means a transfer of control.'
cands = self.extractor.extract(text)
terms = [c for c in cands if c.reference_type == C.REF_DEFINED_TERM]
# Both should appear: duplicates must not consume the unique-term quota.
assert any(c.canonical_key == "term:company" for c in terms)
assert any(c.canonical_key == "term:change-of-control" for c in terms)Without this, a future refactor that reintroduces the raw-hit counter has no safety net. Confirmed — no test for
|
…ce-type gate Adds test_defined_term_cap_counts_unique_terms_not_raw_hits, reproducing the exact duplicate-then-distinct scenario that the raw-hit cap bug previously broke (verified against the old "cap on examined hits" guard: it fails without the fix). Adds ReferenceTypeFilterTests for GenericCitationExtractor.extract(reference_types=...), which had no coverage. Also extracts the wanted-reference-types normalization duplicated between ReferenceExtractor.extract and GenericCitationExtractor.extract into a single normalize_reference_types helper, and names the raw-scan multiplier in MAX_DEFINED_TERM_SCAN as DEFINED_TERM_SCAN_MULTIPLIER instead of a bare 10.
ReviewSolid, well-scoped fix. The core change — replacing "materialize both regex's matches into a list, sort, then cap" with a single combined-alternation regex streamed via
Minor / non-blocking notes
Test coverageGood — SecurityThis is itself a security/availability fix (unbounded CPU/memory from unfiltered regex-match materialization+sort), and it does not introduce new injection/XSS/permission surface — no DB/GraphQL/permission code is touched. No blocking issues found. |
Motivation
MAX_DEFINED_TERMS, allowing an attacker to force excessive CPU/memory use.DEFINED_TERMcandidates whenDEFAULT_REFERENCE_TYPESexcludes them.Description
reference_typesparameter toReferenceExtractor.extract(...)and short-circuited individual grammars so callers can skip unwanted extraction work (e.g. skip_terms()whenDEFINED_TERMis not requested)._TERM_RE) that iterates in document order and stops after examining/ emitting up toMAX_DEFINED_TERMSwithout materializing or sorting all matches.wantedreference types into the registry extractor so enrichment/scan paths avoid running_terms()unless requested.test_reference_type_filter_skips_unwanted_defined_termsto assert that_terms()is not invoked when the default reference set is used and document references still get extracted.Testing
python3 -m py_compile opencontractserver/enrichment/extractor.py opencontractserver/enrichment/services/enrichment_service.py opencontractserver/tests/test_enrichment_extractor.py.opencontractserver/tests/test_enrichment_extractor.py.pytest opencontractserver/tests/test_enrichment_extractor.py -qbut the test run was blocked by the environment missing Django (ModuleNotFoundError: No module named 'django').Codex Task