Runaway whitespace fix by will-exaforce · Pull Request #2 · ExaForce/SkillSpector

will-exaforce · 2026-06-18T18:04:32Z

This PR improves the model's generated output, and the normalization of that output, when the LLM is enabled.

This was part of an effort to address issues I observed while validating SkillSpector LLM support locally.

All tests pass locally. I'll get another PR up later to add the proper workflow. In the meantime:

627 passed, 12 skipped, 26 deselected, 1 warning in 1.86s

The benchmark tool was used to compare accuracy and performance against main.

Finding summary

The comparison covers 50 samples (or all of them, if fewer than 50) from each type of dataset in https://github.com/lxyeternal/MalSkillBench/tree/main/Dataset.

Performance

runaway-fix completed in about 25% less time (roughly 1.3x as fast as main):
runaway-fix - 35:07
main - 46:54
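
For reference, the relative figures can be derived from the two durations above. This is only a quick arithmetic sketch; it assumes the timings are in mm:ss (the ratio is the same if they are hh:mm), and the helper name `to_seconds` is made up for illustration.

```python
def to_seconds(duration: str) -> int:
    """Convert an 'mm:ss' string to seconds (assumed format)."""
    minutes, seconds = duration.split(":")
    return int(minutes) * 60 + int(seconds)


fix_s = to_seconds("35:07")   # runaway-fix
main_s = to_seconds("46:54")  # main

time_saved = 1 - fix_s / main_s  # fraction of main's run time saved
speedup = main_s / fix_s         # how many times faster runaway-fix is

print(f"time saved: {time_saved:.1%}, speedup: {speedup:.2f}x")
```

"Faster" can be read either way, so both numbers are shown: the reduction in run time and the throughput ratio.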

Errors

runaway-fix:

None

main:

15 (all timeouts)

Impacted files:
- 7 of the 50 sampled malicious skills
- 4 of the 50 sampled malicious Python JSON files

False Negatives

Unchanged between runaway-fix and main

False Positives

runaway-fix (5):

Skills/benign/4chan-reader
Skills/benign/abn-skill
Skills/benign/aaveclaw
Skills/benign/aap-passport
Skills/benign/2captcha

main (1):

Skills/benign/abn-skill

Harden the LLM meta-analyzer against unreliable served structured output and tighten the model-output normalization contract.

Reliability / correctness:
- run_batches / arun_batches skip a batch whose structured call fails after retries (StructuredOutputError) instead of aborting the whole pass; other exceptions (e.g. credential ValueError) still propagate. arun_batches uses gather(return_exceptions=True) so one failure no longer discards sibling results.
- apply_filter distinguishes "evaluated but not confirmed" (drop) from "never evaluated" because its batch failed (preserve un-enriched), at batch granularity. A security tool must not silently drop an unreviewed finding.
- Drop the lenient truncation-salvage parser: a truncated stringified findings array now fails cleanly -> retried -> batch skipped -> findings preserved, which is safer than salvaging a partial array and dropping the rest.

Normalization:
- Replace fragile substring impact/intent coercion with a shared token-scan + negation guard (_coerce_to_enum). Fixes 'unsafe'->benign, 'non-critical'->critical, 'medium-low'->low, and 'follow-up'->low, while still covering phrases like 'high impact'. Ambiguous compounds bias to the higher level.

Cleanup:
- Remove dead OverallAssessment / overall_assessment (never consumed); shrinks the model's required output, reducing truncation surface.
- Dedup the sync/async retry methods via shared helpers and add an explicit terminal raise so they can never silently return None.
- Use dataclasses.replace when splitting batches; correct the StructuredOutputError docstring (LengthFinishReasonError is not a ValueError).

Tests cover failed-batch preservation, run-loop resilience, non-structured error propagation, and the coercion regressions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
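
To make the failed-batch behavior easier to review, here is a minimal sketch of the pattern described under "Reliability / correctness". It is not the SkillSpector code: `Finding`, `analyze_batch`, and the bodies of `arun_batches` and `apply_filter` are simplified stand-ins, and the id prefixes exist only to trigger each path. Only `asyncio.gather(return_exceptions=True)` and `dataclasses.replace` are real standard-library calls.

```python
import asyncio
import dataclasses


class StructuredOutputError(Exception):
    """Stand-in: the structured call failed even after retries."""


@dataclasses.dataclass(frozen=True)
class Finding:
    id: str
    enriched: bool = False


async def analyze_batch(batch):
    # Hypothetical model call; returns the ids the model confirmed.
    ids = [f.id for f in batch]
    if any(i.startswith("nocreds") for i in ids):
        raise ValueError("missing credentials")
    if any(i.startswith("bad") for i in ids):
        raise StructuredOutputError("truncated structured output")
    return {i for i in ids if i.startswith("mal")}


async def arun_batches(batches):
    # return_exceptions=True keeps sibling results when one batch fails.
    results = await asyncio.gather(
        *(analyze_batch(b) for b in batches), return_exceptions=True
    )
    outcomes = []
    for batch, result in zip(batches, results):
        if isinstance(result, StructuredOutputError):
            outcomes.append((batch, None))  # never evaluated: skip batch
        elif isinstance(result, BaseException):
            raise result  # any other error still propagates
        else:
            outcomes.append((batch, result))
    return outcomes


def apply_filter(outcomes):
    kept = []
    for batch, confirmed in outcomes:
        if confirmed is None:
            kept.extend(batch)  # preserve un-enriched, at batch granularity
        else:
            # Evaluated: keep confirmed findings, drop the rest.
            kept.extend(
                dataclasses.replace(f, enriched=True)
                for f in batch
                if f.id in confirmed
            )
    return kept


batches = [
    [Finding("mal-1"), Finding("ok-1")],  # evaluated normally
    [Finding("bad-1"), Finding("x-2")],   # structured call fails
]
kept = apply_filter(asyncio.run(arun_batches(batches)))
print([f.id for f in kept])  # ['mal-1', 'bad-1', 'x-2']
```

In this sketch `ok-1` is dropped because it was evaluated and not confirmed, while `bad-1` and `x-2` survive un-enriched because their batch was never evaluated.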
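
The normalization change can be illustrated the same way. The sketch below shows one way a token scan with a negation guard might work; it is an assumption-laden illustration, not the actual `_coerce_to_enum`. The `Impact` and `Intent` members, the `aliases` parameter, the negator list, and returning `default` for unmatched or negated input are all choices made here, not details stated in this PR.

```python
import re
from enum import IntEnum


class Impact(IntEnum):  # hypothetical members, ordered by severity
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


class Intent(IntEnum):  # hypothetical members
    BENIGN = 1
    SUSPICIOUS = 2
    MALICIOUS = 3


_NEGATORS = {"no", "non", "not"}  # assumed negation tokens


def coerce_to_enum(text, enum_cls, aliases=None, default=None):
    """Map free-form model text to an enum member by whole tokens."""
    # Split on anything that is not a letter, so 'follow-up' becomes
    # ['follow', 'up'] and never matches 'low' as a substring.
    tokens = re.findall(r"[a-z]+", text.lower())
    names = {member.name.lower(): member for member in enum_cls}
    names.update(aliases or {})
    matches = []
    for i, token in enumerate(tokens):
        member = names.get(token)
        if member is None:
            continue
        # Negation guard: 'non-critical' must not count as critical.
        if i > 0 and tokens[i - 1] in _NEGATORS:
            continue
        matches.append(member)
    # Ambiguous compounds such as 'medium-low' bias to the higher level.
    return max(matches) if matches else default


print(coerce_to_enum("high impact", Impact))  # Impact.HIGH
print(coerce_to_enum("medium-low", Impact))   # Impact.MEDIUM
```

Because matching is on whole tokens, `'unsafe'` does not match a `safe` alias and `'follow-up'` does not match `low`, which are the substring regressions the commit message lists.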

will-exaforce requested review from pupapaik and smoy June 18, 2026 18:04

will-exaforce changed the base branch from main to sync-upstream June 18, 2026 19:43

will-exaforce changed the base branch from sync-upstream to main June 18, 2026 20:35

will-exaforce wants to merge 1 commit into main from runaway-fix
