fix(intrinsics): restore aLoRA offsets and fix groundedness parser by planetf1 · Pull Request #1292 · generative-computing/mellea

planetf1 · 2026-06-18T14:24:09Z

Pull Request

Issue

Fixes #1291
Fixes #1286

Description

Four issues causing intrinsic tests to fail on GPU hardware, fixed in one PR:

aLoRA silently disabled — model.generate(inputs=) instead of input_ids= meant PEFT could never compute aLoRA offsets, falling back to plain LoRA weights and producing wrong outputs
Groundedness parser bug — _parse_batch_support_output() only handled flat JSON; granite-4.0-micro produces nested format ~70% of the time, silently downgrading valid citations to NOT_SUPPORTED
GPU non-determinism — requirement_check_alora and context-attribution tests used exact float/list equality; relaxed to direction check (> 0.5) and subset check respectively
context_relevance_alora truncation — HF io.yaml missing max_completion_tokens, model hits transformers default max_length=153 and truncates JSON mid-string; xfailed pending upstream fix (see context_relevance aLoRA adapter: missing max_completion_tokens causes JSONDecodeError in test; consider deprecating #1301)

Changes

`mellea/formatters/granite/base/util.py`

inputs= → input_ids= in generate_with_transformers(). One-line fix that restores aLoRA for all intrinsic tests. Absorbed from PR #1294.

`mellea/stdlib/requirements/rag.py`

Extended _parse_batch_support_output() to handle nested citation format with pessimistic aggregation (NOT_SUPPORTED > PARTIALLY > FULLY)
Added _norm() helper so near-miss labels like "FULLY SUPPORTED" (space) normalise correctly on both the flat and nested paths
Improved prompt with explicit flat-format example; added (no nested arrays) instruction

`test/formatters/granite/test_intrinsics_formatters.py`

requirement_check_alora: direction check (> 0.5, matching production requirement_check_to_bool) instead of exact float tolerance
uncertainty_alora: direction check (>= 0.5) instead of exact float tolerance
context-attribution: subset check instead of exact list equality; _key() uses 5-field tuple including attribution_doc_id
context_relevance_alora: pytest.xfail before transform call (JSONDecodeError from truncation) — see context_relevance aLoRA adapter: missing max_completion_tokens causes JSONDecodeError in test; consider deprecating #1301
context_relevance (non-alora): pytest.xfail instead of bare continue so non-deterministic GPU results surface in the report
Nit: context_relevance condition tightened to exact == match; uncertainty f-string operator corrected to >=0.5?

`test/stdlib/requirements/test_groundedness_requirement.py`

Four new unit tests covering the parser fix:

test_parse_batch_support_output_nested_citations — basic nested format, one citation per span
test_parse_batch_support_output_nested_citations_pessimistic_aggregate — mixed levels on one span → NOT_SUPPORTED wins
test_parse_batch_support_output_nested_multiple_citations — FULLY + NOT → NOT; empty citations list → NOT_SUPPORTED
test_parse_batch_support_output_nested_label_variants — "FULLY SUPPORTED" (space) normalises to FULLY_SUPPORTED

HPC verification (single full sweep, granite-4.1-3b + granite-4.0-micro, GPU p4-r30-n1)

Phase	Tests	Result
`test_run_transformers` (15 combos)	15	13 passed, 2 xfailed ✅
`test_core.py` (qualitative)	5	3 passed, 2 xfailed ✅
`test_rag.py` (qualitative)	14	14 passed ✅
`test_guardian.py` (qualitative)	6	6 passed ✅
`test_groundedness_requirement_documents_in_message`	1	1 passed ✅

xfails: hallucination_detection (pre-existing Transformers 5.0), context_relevance_alora (#1301), context-attribution ×2 in test_core.py (pre-existing).

Follow-up

context_relevance aLoRA adapter: missing max_completion_tokens causes JSONDecodeError in test; consider deprecating #1301 — context_relevance_alora io.yaml missing max_completion_tokens (upstream fix needed)
fix(rag): groundedness support prompt presents nested citations in input while requesting flat output #1316 — groundedness support prompt presents nested citations in input while requesting flat output (root cause analysis; the parser fallback is sufficient for now)

Supersedes #1294

Testing

Tests added to the respective file if code was changed
New code has 100% coverage if code was added
Ensure existing tests and github automation passes (a maintainer will kick off the github automation when the rest of the PR is populated)

Attribution

AI coding assistants used

Adding a new component, requirement, sampling strategy, or tool?

If your PR adds or modifies one of the types below, check the matching box. A checklist of type-specific review items will be posted as a comment.

Component
Requirement
Sampling Strategy
Tool

ajbozarth

I missed that you marked this draft again, but heres a quick first pass review from Claude

The model frequently mirrors the input span structure from the prompt and nests support_level inside a 'citations' array instead of the flat format specified in the instructions. The parser only handled the flat path, causing correct LLM judgments to be silently downgraded to NOT_SUPPORTED. Also improve the prompt to add explicit flat-format example to reduce structure confusion. Verified on HPC: granite-4.0-micro e2e test 10/10 passes (was 5/10). See issue #1291 for deep analysis. Assisted-by: opencode Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

The HF adapter io.yaml is missing max_completion_tokens, causing transformers to fall back to max_length=153. The model sometimes generates extra fields beyond the required JSON and gets truncated mid-string, producing a JSONDecodeError in result_processor.transform(). Also clean up the stale TODO comment in the context_relevance branch. See issue #1301 for full analysis and upstream fix decision. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

…te aLoRA offsets generate_with_transformers() used inputs= instead of input_ids= when calling model.generate(), causing pefts aLoRA hook to silently fall back to standard LoRA. This affected context_relevance and uncertainty intrinsics which produce wrong or truncated output without aLoRA. Closes #1286 Assisted-by: opencode Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

…mples Replace first-wins break in nested-citations fallback with a pessimistic aggregate (NOT_SUPPORTED > PARTIALLY_SUPPORTED > FULLY_SUPPORTED) so the result is deterministic regardless of citation order from the LLM. Drop literal placeholder syntax (`, ...` and `...` values) from the two GroundednessRequirement prompt examples; granite-4.0-micro sometimes echoes placeholder text verbatim, producing unparseable output. Clarify context-attribution test comment: extras in actual are intentionally tolerated, not an oversight. Closes #1286 Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

planetf1 · 2026-06-19T09:03:05Z

CI status note

The code-checks / quality (3.13) job is currently failing with a >900s timeout on test/telemetry/test_metrics_backend.py::test_ollama_generate_from_raw_metrics_integration. This is a pre-existing flaky failure unrelated to this PR — the same timeout appeared in the same CI batch on worktree-issue-1136 and has been seen before on the 3.13 leg when Ollama runs in CPU-only mode on the ephemeral runner.

Tracked in #1272.

akihikokuroda

LGTM

ajbozarth

Three previous comments addressed cleanly — pessimistic aggregate is the right call. Two new findings on the latest commits: one reliability bug and one test-coverage gap.

ajbozarth · 2026-06-19T15:36:33Z

I think that review above is valid, Claude went off the rails when I asked it to re-review and posted that without my permission and without even telling me its findings.

When an LLM emits `"citations": null`, `judgment.get("citations", [])` returns `None` (the default only applies when the key is absent). The set comprehension then raises `TypeError`, which the surrounding `except (JSONDecodeError, ValueError)` does not catch — propagating out of the parser instead of falling to the conservative NOT_SUPPORTED default. Fix by using `(judgment.get("citations") or [])` so an explicit null is treated the same as an absent key. Also adds a unit test for the pessimistic-aggregate path (NOT_SUPPORTED beats PARTIALLY_SUPPORTED beats FULLY_SUPPORTED across mixed nested citations on a single span) that was missing from the previous commit. Closes review comment from ajbozarth on PR generative-computing#1292. Signed-off-by: Nigel Jones <jonesn@uk.ibm.com> Assisted-by: Claude Code

psschwei

Reviewed the intrinsics fixes. The two production changes (aLoRA input_ids= and the nested-citation parser) are correct and root-caused — I verified the aLoRA diagnosis against PEFT's get_alora_offsets_for_generate, which silently disables aLoRA when input_ids is absent from kwargs. Leaving three non-blocking suggestions inline, each with a proposed fix. Note the rag.py and test-variant suggestions are paired: the new space-variant assertion fails until the parser normalizes nested labels.

ajbozarth

Heads up — 4b649f4 looks right (both fixes are exactly what we asked for), but its
parent is 2178bf1, not 71b5611. So it sits on a branch off main rather than on
issue-1291. The PR head is still 71b5611 and both findings still apply on what GitHub
shows.

To land it: git cherry-pick 4b649f42 onto issue-1291 and push (force-push needed since
the existing 71b5611 is what's on remote).

Once that is fixed I think this will LGTM

jakelorocco

approving; nothing here actually needs my review; I will defer to other reviewers. I'll add the do-not-merge label so that others can re-review their suggested changes

When an LLM emits `"citations": null`, `judgment.get("citations", [])` returns `None` (the default only applies when the key is absent). The set comprehension then raises `TypeError`, which the surrounding `except (JSONDecodeError, ValueError)` does not catch — propagating out of the parser instead of falling to the conservative NOT_SUPPORTED default. Fix by using `(judgment.get("citations") or [])` so an explicit null is treated the same as an absent key. Also adds a unit test for the pessimistic-aggregate path (NOT_SUPPORTED beats PARTIALLY_SUPPORTED beats FULLY_SUPPORTED across mixed nested citations on a single span) that was missing from the previous commit. Closes review comment from ajbozarth on PR #1292. Signed-off-by: Nigel Jones <jonesn@uk.ibm.com> Assisted-by: Claude Code

- Add _norm() helper (hoisted before judgment loop) to normalise nested citation labels through the same substring logic the flat path uses, so near-miss labels like "FULLY SUPPORTED" (space) no longer silently downgrade to NOT_SUPPORTED. - Add test_parse_batch_support_output_nested_multiple_citations: exercises the pessimistic-aggregation loop with mixed levels and empty citations []. - Add test_parse_batch_support_output_nested_label_variants: regression guard for space-variant label normalisation; would fail without _norm(). - Replace bare `continue` with pytest.xfail for context_relevance (non-alora) so the case surfaces in the report rather than passing silently. - Fix requirement_check direction threshold: >= 0.5 → > 0.5 to match requirement_check_to_bool() at requirements/requirement.py:52. - Correct uncertainty comment: production check_certainty() returns the raw float; the >= 0.5 bucket is test-only tolerance, not a production mirror. - Restore "with one object for each span" count guidance to the support-batch prompt (accidentally dropped when concrete example was added; necessity prompt already has it). Closes review threads from psschwei on PR #1292. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

_key() previously matched records on span offsets only. A record pointing to the right span but the wrong source document would pass silently. The existing comment limits the stated non-determinism to count — how many spans get attributed — not to which document a matched span points to. A doc_id mismatch on a present record is therefore a real regression, not tolerable hardware variance. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

planetf1 · 2026-06-23T09:14:46Z

@ajbozarth — thanks for catching the branch issue. The cherry-pick is in place: 4b649f4 landed on issue-1291 as 9d52875, so the PR head now reflects both your findings. The three points from psschwei's latest review are addressed in 18ef90c and c102724. Happy for another look when you have a moment.

- uncertainty f-string: ">0.5?" → ">=0.5?" to match the actual >= comparison above - context_relevance branch: substring "in" → exact "==" to decouple from the early alora xfail above it; mirrors how context-attribution already uses == Signed-off-by: Nigel Jones <jonesn@uk.ibm.com> Assisted-by: Claude Code

planetf1 · 2026-06-23T11:39:43Z

Intermittent CI failure — not caused by this PR

The `code-checks / quality (3.11)` job failed on run 28017020057 with:

test/telemetry/test_tracing_backend.py::test_multiple_generations_separate_spans - httpx.ReadTimeout

This test was not touched in this branch. It fails when Ollama is under load late in the CI run (the model gets evicted by `OLLAMA_KEEP_ALIVE=5m` and reload exceeds the 300 s timeout introduced by PR #1302). The root cause and fix are tracked in issue #1318. Retriggering CI.

planetf1 · 2026-06-23T16:08:43Z

@ajbozarth One for you to check. thanks

ajbozarth

LGTM, thanks for addressing all the review

github-actions Bot added the bug Something isn't working label Jun 18, 2026

planetf1 mentioned this pull request Jun 18, 2026

test: GPU score non-determinism causes flaky failures in context-attribution, requirement_check_alora, groundedness_e2e #1291

Closed

planetf1 marked this pull request as ready for review June 18, 2026 14:45

planetf1 requested review from a team as code owners June 18, 2026 14:45

planetf1 requested review from AngeloDanducci, ajbozarth and akihikokuroda June 18, 2026 14:45

planetf1 marked this pull request as draft June 18, 2026 15:39

ajbozarth reviewed Jun 18, 2026

View reviewed changes

Comment thread mellea/stdlib/requirements/rag.py

Comment thread test/formatters/granite/test_intrinsics_formatters.py

Comment thread mellea/stdlib/requirements/rag.py Outdated

planetf1 mentioned this pull request Jun 19, 2026

context_relevance aLoRA adapter: missing max_completion_tokens causes JSONDecodeError in test; consider deprecating #1301

Open

planetf1 changed the title ~~fix(rag): handle nested JSON format in LLM judge parser + relax GPU-non-deterministic tests~~ fix(intrinsics): fix aLoRA generation, parser bug, and flaky GPU tests Jun 19, 2026

planetf1 changed the title ~~fix(intrinsics): fix aLoRA generation, parser bug, and flaky GPU tests~~ fix(intrinsics): restore aLoRA offsets and fix groundedness parser Jun 19, 2026

planetf1 mentioned this pull request Jun 19, 2026

feat(intrinsics): shim old adapters onto Adapter dataclass; call_intrinsic rewritten (Epic #929 Phase 1) #1269

Merged

1 task

planetf1 added 4 commits June 19, 2026 09:14

planetf1 force-pushed the issue-1291 branch from ca5963e to 71b5611 Compare June 19, 2026 08:15

planetf1 mentioned this pull request Jun 19, 2026

test: test_streaming_span_duration hangs >900s on slow CPU-only Ollama runner #1272

Closed

planetf1 marked this pull request as ready for review June 19, 2026 13:20

akihikokuroda approved these changes Jun 19, 2026

View reviewed changes

ajbozarth reviewed Jun 19, 2026

View reviewed changes

Comment thread mellea/stdlib/requirements/rag.py

Comment thread test/stdlib/requirements/test_groundedness_requirement.py

psschwei reviewed Jun 19, 2026

View reviewed changes

Comment thread test/stdlib/requirements/test_groundedness_requirement.py

Comment thread test/stdlib/requirements/test_groundedness_requirement.py

planetf1 enabled auto-merge June 22, 2026 11:42

psschwei reviewed Jun 22, 2026

View reviewed changes

Comment thread mellea/stdlib/requirements/rag.py

Comment thread test/stdlib/requirements/test_groundedness_requirement.py

Comment thread test/formatters/granite/test_intrinsics_formatters.py Outdated

ajbozarth requested changes Jun 22, 2026

View reviewed changes

planetf1 marked this pull request as draft June 22, 2026 16:41

auto-merge was automatically disabled June 22, 2026 16:41
Pull request was converted to draft

jakelorocco approved these changes Jun 22, 2026

View reviewed changes

jakelorocco added the do-not-merge/hold Block merging this PR label Jun 22, 2026

planetf1 added 3 commits June 23, 2026 08:56

planetf1 removed the do-not-merge/hold Block merging this PR label Jun 23, 2026

planetf1 mentioned this pull request Jun 23, 2026

fix(rag): groundedness support prompt presents nested citations in input while requesting flat output #1316

Open

planetf1 requested a review from ajbozarth June 23, 2026 09:42

planetf1 marked this pull request as ready for review June 23, 2026 09:46

This was referenced Jun 23, 2026

Unstable test suite when running locally #1314

Open

test: backend tracing tests flake with httpx.ReadTimeout on loaded CI runners #1318

Open

planetf1 enabled auto-merge June 23, 2026 11:55

ajbozarth approved these changes Jun 23, 2026

View reviewed changes

planetf1 added this pull request to the merge queue Jun 23, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 23, 2026

ajbozarth added this pull request to the merge queue Jun 23, 2026

Merged via the queue into main with commit 87b39f0 Jun 23, 2026
14 of 16 checks passed

ajbozarth deleted the issue-1291 branch June 23, 2026 18:18

Conversation

planetf1 commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Pull Request

Issue

Description

Changes

mellea/formatters/granite/base/util.py

mellea/stdlib/requirements/rag.py

test/formatters/granite/test_intrinsics_formatters.py

test/stdlib/requirements/test_groundedness_requirement.py

HPC verification (single full sweep, granite-4.1-3b + granite-4.0-micro, GPU p4-r30-n1)

Follow-up

Testing

Attribution

Adding a new component, requirement, sampling strategy, or tool?

Uh oh!

ajbozarth left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

planetf1 commented Jun 19, 2026

CI status note

Uh oh!

akihikokuroda left a comment

Choose a reason for hiding this comment

Uh oh!

ajbozarth left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

ajbozarth commented Jun 19, 2026

Uh oh!

Uh oh!

Uh oh!

psschwei left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ajbozarth left a comment

Choose a reason for hiding this comment

Uh oh!

jakelorocco left a comment

Choose a reason for hiding this comment

Uh oh!

planetf1 commented Jun 23, 2026

Uh oh!

planetf1 commented Jun 23, 2026

Uh oh!

planetf1 commented Jun 23, 2026

Uh oh!

ajbozarth left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

planetf1 commented Jun 18, 2026 •

edited

Loading

`mellea/formatters/granite/base/util.py`

`mellea/stdlib/requirements/rag.py`

`test/formatters/granite/test_intrinsics_formatters.py`

`test/stdlib/requirements/test_groundedness_requirement.py`