Runaway whitespace fix#2
Open
will-exaforce wants to merge 1 commit into
Open
Conversation
Harden the LLM meta-analyzer against unreliable served structured output and tighten the model-output normalization contract. Reliability / correctness: - run_batches / arun_batches skip a batch whose structured call fails after retries (StructuredOutputError) instead of aborting the whole pass; other exceptions (e.g. credential ValueError) still propagate. arun_batches uses gather(return_exceptions=True) so one failure no longer discards sibling results. - apply_filter distinguishes "evaluated but not confirmed" (drop) from "never evaluated" because its batch failed (preserve un-enriched), at batch granularity. A security tool must not silently drop an unreviewed finding. - Drop the lenient truncation-salvage parser: a truncated stringified findings array now fails cleanly -> retried -> batch skipped -> findings preserved, which is safer than salvaging a partial array and dropping the rest. Normalization: - Replace fragile substring impact/intent coercion with a shared token-scan + negation guard (_coerce_to_enum). Fixes 'unsafe'->benign, 'non-critical'-> critical, 'medium-low'->low, and 'follow-up'->low, while still covering phrases like 'high impact'. Ambiguous compounds bias to the higher level. Cleanup: - Remove dead OverallAssessment / overall_assessment (never consumed); shrinks the model's required output, reducing truncation surface. - Dedup the sync/async retry methods via shared helpers and add an explicit terminal raise so they can never silently return None. - Use dataclasses.replace when splitting batches; correct the StructuredOutputError docstring (LengthFinishReasonError is not a ValueError). Tests cover failed-batch preservation, run-loop resilience, non-structured error propagation, and the coercion regressions. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR improves the output generated by the model and the model output normalization when llm is enabled.
This was part of an effort to address issues I observed while validating SkillSpector LLM support locally.
All tests pass locally. I'll get another PR up later to add the proper workflow. In the meantime:
627 passed, 12 skipped, 26 deselected, 1 warning in 1.86sThe benchmark tool was used to compare accuracy and performance against main
Finding summary
Comparison performed across 50 (or all if < 50) samples of each type of data set in https://github.com/lxyeternal/MalSkillBench/tree/main/Dataset.
Performance
runaway-fix completed ~30% faster
runaway-fix - 35:07
main - 46:54
Errors
runaway-fix:
main:
False Negatives
Unchanged across PRs
False Positives
runaway-fix (5):
main (1):