Skip to content

fix(intrinsics): restore aLoRA offsets and fix groundedness parser#1292

Merged
ajbozarth merged 8 commits into
mainfrom
issue-1291
Jun 23, 2026
Merged

fix(intrinsics): restore aLoRA offsets and fix groundedness parser#1292
ajbozarth merged 8 commits into
mainfrom
issue-1291

Conversation

@planetf1

@planetf1 planetf1 commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Pull Request

Issue

Fixes #1291
Fixes #1286

Description

Four issues causing intrinsic tests to fail on GPU hardware, fixed in one PR:

  1. aLoRA silently disabledmodel.generate(inputs=) instead of input_ids= meant PEFT could never compute aLoRA offsets, falling back to plain LoRA weights and producing wrong outputs
  2. Groundedness parser bug_parse_batch_support_output() only handled flat JSON; granite-4.0-micro produces nested format ~70% of the time, silently downgrading valid citations to NOT_SUPPORTED
  3. GPU non-determinismrequirement_check_alora and context-attribution tests used exact float/list equality; relaxed to direction check (> 0.5) and subset check respectively
  4. context_relevance_alora truncation — HF io.yaml missing max_completion_tokens, model hits transformers default max_length=153 and truncates JSON mid-string; xfailed pending upstream fix (see context_relevance aLoRA adapter: missing max_completion_tokens causes JSONDecodeError in test; consider deprecating #1301)

Changes

mellea/formatters/granite/base/util.py

inputs=input_ids= in generate_with_transformers(). One-line fix that restores aLoRA for all intrinsic tests. Absorbed from PR #1294.

mellea/stdlib/requirements/rag.py

  • Extended _parse_batch_support_output() to handle nested citation format with pessimistic aggregation (NOT_SUPPORTED > PARTIALLY > FULLY)
  • Added _norm() helper so near-miss labels like "FULLY SUPPORTED" (space) normalise correctly on both the flat and nested paths
  • Improved prompt with explicit flat-format example; added (no nested arrays) instruction

test/formatters/granite/test_intrinsics_formatters.py

  • requirement_check_alora: direction check (> 0.5, matching production requirement_check_to_bool) instead of exact float tolerance
  • uncertainty_alora: direction check (>= 0.5) instead of exact float tolerance
  • context-attribution: subset check instead of exact list equality; _key() uses 5-field tuple including attribution_doc_id
  • context_relevance_alora: pytest.xfail before transform call (JSONDecodeError from truncation) — see context_relevance aLoRA adapter: missing max_completion_tokens causes JSONDecodeError in test; consider deprecating #1301
  • context_relevance (non-alora): pytest.xfail instead of bare continue so non-deterministic GPU results surface in the report
  • Nit: context_relevance condition tightened to exact == match; uncertainty f-string operator corrected to >=0.5?

test/stdlib/requirements/test_groundedness_requirement.py

Four new unit tests covering the parser fix:

  • test_parse_batch_support_output_nested_citations — basic nested format, one citation per span
  • test_parse_batch_support_output_nested_citations_pessimistic_aggregate — mixed levels on one span → NOT_SUPPORTED wins
  • test_parse_batch_support_output_nested_multiple_citationsFULLY + NOT → NOT; empty citations list → NOT_SUPPORTED
  • test_parse_batch_support_output_nested_label_variants"FULLY SUPPORTED" (space) normalises to FULLY_SUPPORTED

HPC verification (single full sweep, granite-4.1-3b + granite-4.0-micro, GPU p4-r30-n1)

Phase Tests Result
test_run_transformers (15 combos) 15 13 passed, 2 xfailed ✅
test_core.py (qualitative) 5 3 passed, 2 xfailed ✅
test_rag.py (qualitative) 14 14 passed ✅
test_guardian.py (qualitative) 6 6 passed ✅
test_groundedness_requirement_documents_in_message 1 1 passed ✅

xfails: hallucination_detection (pre-existing Transformers 5.0), context_relevance_alora (#1301), context-attribution ×2 in test_core.py (pre-existing).

Follow-up

Supersedes #1294

Testing

  • Tests added to the respective file if code was changed
  • New code has 100% coverage if code was added
  • Ensure existing tests and github automation passes (a maintainer will kick off the github automation when the rest of the PR is populated)

Attribution

  • AI coding assistants used

Adding a new component, requirement, sampling strategy, or tool?

If your PR adds or modifies one of the types below, check the matching box. A checklist of type-specific review items will be posted as a comment.

  • Component
  • Requirement
  • Sampling Strategy
  • Tool

@github-actions github-actions Bot added the bug Something isn't working label Jun 18, 2026
@planetf1 planetf1 marked this pull request as ready for review June 18, 2026 14:45
@planetf1 planetf1 requested review from a team as code owners June 18, 2026 14:45
@planetf1 planetf1 marked this pull request as draft June 18, 2026 15:39

@ajbozarth ajbozarth left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed that you marked this draft again, but heres a quick first pass review from Claude

Comment thread mellea/stdlib/requirements/rag.py
Comment thread test/formatters/granite/test_intrinsics_formatters.py
Comment thread mellea/stdlib/requirements/rag.py Outdated
@planetf1 planetf1 changed the title fix(rag): handle nested JSON format in LLM judge parser + relax GPU-non-deterministic tests fix(intrinsics): fix aLoRA generation, parser bug, and flaky GPU tests Jun 19, 2026
@planetf1 planetf1 changed the title fix(intrinsics): fix aLoRA generation, parser bug, and flaky GPU tests fix(intrinsics): restore aLoRA offsets and fix groundedness parser Jun 19, 2026
planetf1 added 4 commits June 19, 2026 09:14
The model frequently mirrors the input span structure from the prompt and
nests support_level inside a 'citations' array instead of the flat format
specified in the instructions. The parser only handled the flat path,
causing correct LLM judgments to be silently downgraded to NOT_SUPPORTED.

Also improve the prompt to add explicit flat-format example to reduce
structure confusion.

Verified on HPC: granite-4.0-micro e2e test 10/10 passes (was 5/10).

See issue #1291 for deep analysis.

Assisted-by: opencode
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
The HF adapter io.yaml is missing max_completion_tokens, causing
transformers to fall back to max_length=153. The model sometimes
generates extra fields beyond the required JSON and gets truncated
mid-string, producing a JSONDecodeError in result_processor.transform().

Also clean up the stale TODO comment in the context_relevance branch.

See issue #1301 for full analysis and upstream fix decision.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
…te aLoRA offsets

generate_with_transformers() used inputs= instead of input_ids= when
calling model.generate(), causing pefts aLoRA hook to silently fall back
to standard LoRA. This affected context_relevance and uncertainty
intrinsics which produce wrong or truncated output without aLoRA.

Closes #1286

Assisted-by: opencode
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
…mples

Replace first-wins break in nested-citations fallback with a pessimistic
aggregate (NOT_SUPPORTED > PARTIALLY_SUPPORTED > FULLY_SUPPORTED) so the
result is deterministic regardless of citation order from the LLM.

Drop literal placeholder syntax (`, ...` and `...` values) from the two
GroundednessRequirement prompt examples; granite-4.0-micro sometimes echoes
placeholder text verbatim, producing unparseable output.

Clarify context-attribution test comment: extras in actual are intentionally
tolerated, not an oversight.

Closes #1286

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
@planetf1

Copy link
Copy Markdown
Contributor Author

CI status note

The code-checks / quality (3.13) job is currently failing with a >900s timeout on test/telemetry/test_metrics_backend.py::test_ollama_generate_from_raw_metrics_integration. This is a pre-existing flaky failure unrelated to this PR — the same timeout appeared in the same CI batch on worktree-issue-1136 and has been seen before on the 3.13 leg when Ollama runs in CPU-only mode on the ephemeral runner.

Tracked in #1272.

@planetf1 planetf1 marked this pull request as ready for review June 19, 2026 13:20

@akihikokuroda akihikokuroda left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ajbozarth ajbozarth left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Three previous comments addressed cleanly — pessimistic aggregate is the right call. Two new findings on the latest commits: one reliability bug and one test-coverage gap.

Comment thread mellea/stdlib/requirements/rag.py
Comment thread test/stdlib/requirements/test_groundedness_requirement.py
@ajbozarth

Copy link
Copy Markdown
Contributor

I think that review above is valid, Claude went off the rails when I asked it to re-review and posted that without my permission and without even telling me its findings.

planetf1 added a commit to planetf1/mellea that referenced this pull request Jun 19, 2026
When an LLM emits `"citations": null`, `judgment.get("citations", [])`
returns `None` (the default only applies when the key is absent). The
set comprehension then raises `TypeError`, which the surrounding
`except (JSONDecodeError, ValueError)` does not catch — propagating
out of the parser instead of falling to the conservative NOT_SUPPORTED
default.

Fix by using `(judgment.get("citations") or [])` so an explicit null
is treated the same as an absent key.

Also adds a unit test for the pessimistic-aggregate path (NOT_SUPPORTED
beats PARTIALLY_SUPPORTED beats FULLY_SUPPORTED across mixed nested
citations on a single span) that was missing from the previous commit.

Closes review comment from ajbozarth on PR generative-computing#1292.

Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Assisted-by: Claude Code
Comment thread test/stdlib/requirements/test_groundedness_requirement.py
Comment thread test/stdlib/requirements/test_groundedness_requirement.py
@planetf1 planetf1 enabled auto-merge June 22, 2026 11:42

@psschwei psschwei left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the intrinsics fixes. The two production changes (aLoRA input_ids= and the nested-citation parser) are correct and root-caused — I verified the aLoRA diagnosis against PEFT's get_alora_offsets_for_generate, which silently disables aLoRA when input_ids is absent from kwargs. Leaving three non-blocking suggestions inline, each with a proposed fix. Note the rag.py and test-variant suggestions are paired: the new space-variant assertion fails until the parser normalizes nested labels.

Comment thread mellea/stdlib/requirements/rag.py
Comment thread test/stdlib/requirements/test_groundedness_requirement.py
Comment thread test/formatters/granite/test_intrinsics_formatters.py Outdated

@ajbozarth ajbozarth left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heads up — 4b649f4 looks right (both fixes are exactly what we asked for), but its
parent is 2178bf1, not 71b5611. So it sits on a branch off main rather than on
issue-1291. The PR head is still 71b5611 and both findings still apply on what GitHub
shows.

To land it: git cherry-pick 4b649f42 onto issue-1291 and push (force-push needed since
the existing 71b5611 is what's on remote).

Once that is fixed I think this will LGTM

@planetf1 planetf1 marked this pull request as draft June 22, 2026 16:41
auto-merge was automatically disabled June 22, 2026 16:41

Pull request was converted to draft

@jakelorocco jakelorocco left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approving; nothing here actually needs my review; I will defer to other reviewers. I'll add the do-not-merge label so that others can re-review their suggested changes

@jakelorocco jakelorocco added the do-not-merge/hold Block merging this PR label Jun 22, 2026
planetf1 added 3 commits June 23, 2026 08:56
When an LLM emits `"citations": null`, `judgment.get("citations", [])`
returns `None` (the default only applies when the key is absent). The
set comprehension then raises `TypeError`, which the surrounding
`except (JSONDecodeError, ValueError)` does not catch — propagating
out of the parser instead of falling to the conservative NOT_SUPPORTED
default.

Fix by using `(judgment.get("citations") or [])` so an explicit null
is treated the same as an absent key.

Also adds a unit test for the pessimistic-aggregate path (NOT_SUPPORTED
beats PARTIALLY_SUPPORTED beats FULLY_SUPPORTED across mixed nested
citations on a single span) that was missing from the previous commit.

Closes review comment from ajbozarth on PR #1292.

Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Assisted-by: Claude Code
- Add _norm() helper (hoisted before judgment loop) to normalise nested
  citation labels through the same substring logic the flat path uses, so
  near-miss labels like "FULLY SUPPORTED" (space) no longer silently
  downgrade to NOT_SUPPORTED.
- Add test_parse_batch_support_output_nested_multiple_citations: exercises
  the pessimistic-aggregation loop with mixed levels and empty citations [].
- Add test_parse_batch_support_output_nested_label_variants: regression
  guard for space-variant label normalisation; would fail without _norm().
- Replace bare `continue` with pytest.xfail for context_relevance (non-alora)
  so the case surfaces in the report rather than passing silently.
- Fix requirement_check direction threshold: >= 0.5 → > 0.5 to match
  requirement_check_to_bool() at requirements/requirement.py:52.
- Correct uncertainty comment: production check_certainty() returns the raw
  float; the >= 0.5 bucket is test-only tolerance, not a production mirror.
- Restore "with one object for each span" count guidance to the support-batch
  prompt (accidentally dropped when concrete example was added; necessity
  prompt already has it).

Closes review threads from psschwei on PR #1292.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
_key() previously matched records on span offsets only. A record pointing
to the right span but the wrong source document would pass silently.

The existing comment limits the stated non-determinism to count — how many
spans get attributed — not to which document a matched span points to. A
doc_id mismatch on a present record is therefore a real regression, not
tolerable hardware variance.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
@planetf1

Copy link
Copy Markdown
Contributor Author

@ajbozarth — thanks for catching the branch issue. The cherry-pick is in place: 4b649f4 landed on issue-1291 as 9d52875, so the PR head now reflects both your findings. The three points from psschwei's latest review are addressed in 18ef90c and c102724. Happy for another look when you have a moment.

- uncertainty f-string: ">0.5?" → ">=0.5?" to match the actual >= comparison above
- context_relevance branch: substring "in" → exact "==" to decouple from the
  early alora xfail above it; mirrors how context-attribution already uses ==

Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Assisted-by: Claude Code
@planetf1

Copy link
Copy Markdown
Contributor Author

Intermittent CI failure — not caused by this PR

The `code-checks / quality (3.11)` job failed on run 28017020057 with:

test/telemetry/test_tracing_backend.py::test_multiple_generations_separate_spans - httpx.ReadTimeout

This test was not touched in this branch. It fails when Ollama is under load late in the CI run (the model gets evicted by `OLLAMA_KEEP_ALIVE=5m` and reload exceeds the 300 s timeout introduced by PR #1302). The root cause and fix are tracked in issue #1318. Retriggering CI.

@planetf1 planetf1 enabled auto-merge June 23, 2026 11:55
@planetf1

Copy link
Copy Markdown
Contributor Author

@ajbozarth One for you to check. thanks

@ajbozarth ajbozarth left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for addressing all the review

@planetf1 planetf1 added this pull request to the merge queue Jun 23, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 23, 2026
@ajbozarth ajbozarth added this pull request to the merge queue Jun 23, 2026
Merged via the queue into main with commit 87b39f0 Jun 23, 2026
14 of 16 checks passed
@ajbozarth ajbozarth deleted the issue-1291 branch June 23, 2026 18:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

5 participants