add streaming beam search for cache aware models to NeMo inference by lilithgrigoryan · Pull Request #15768 · NVIDIA-NeMo/NeMo

lilithgrigoryan · 2026-06-08T19:02:15Z

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do ?

Adds MALSD beam search to cache-aware RNNT streaming inference, including K-beam hypothesis carry-over across chunks, cumulative hypothesis publishing, per-stream phrase boosting, and optional n-gram LM fusion.

In beam mode, a single stream may contain multiple utterances, and hypothesis state is managed as follows:

Between chunks: keep the top K hypotheses active and publish the current best hypothesis as the partial transcript.
At EOU / pause detection: finalize the best beam for the current utterance, commit it to the cumulative transcript, clear only the in-flight suffix, and continue processing the same stream. Decoder state is preserved so the next utterance can continue in context.
At stream end: clear all beam and transcript state for that stream, matching greedy-mode cleanup behavior.

Endpointing behavior is unchanged and still relies on the existing silence/VAD logic. Beam mode only changes what happens when a segment ends: the best beam is selected and committed as the finalized utterance.
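
The hypothesis lifecycle described above can be sketched roughly as follows. This is a minimal illustration only: `Hypothesis`, `BeamStreamState`, and all method names are hypothetical and are not the classes added in this PR, and the decoder state is represented by an opaque placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    """A scored candidate for the in-flight utterance (hypothetical type)."""
    text: str
    score: float  # higher is better (e.g. a log-probability)


@dataclass
class BeamStreamState:
    """Hypothetical per-stream state; names are illustrative, not NeMo's."""
    beam_size: int = 4
    committed: list = field(default_factory=list)  # finalized utterances
    beams: list = field(default_factory=list)      # in-flight top-K hypotheses
    decoder_state: object = None                   # kept across utterances

    def partial(self) -> str:
        """Cumulative transcript: committed utterances plus the current best beam."""
        best = self.beams[0].text if self.beams else ""
        return " ".join(self.committed + ([best] if best else []))

    def on_chunk(self, candidates) -> str:
        """Between chunks: keep the top K hypotheses and publish the best one."""
        ranked = sorted(candidates, key=lambda h: h.score, reverse=True)
        self.beams = ranked[: self.beam_size]
        return self.partial()

    def on_eou(self) -> str:
        """At EOU: commit the best beam and clear only the in-flight suffix."""
        if self.beams:
            self.committed.append(self.beams[0].text)
        self.beams = []  # decoder_state is intentionally left untouched
        return " ".join(self.committed)

    def on_stream_end(self) -> None:
        """At stream end: clear all beam and transcript state."""
        self.committed = []
        self.beams = []
        self.decoder_state = None


state = BeamStreamState(beam_size=2)
state.decoder_state = "ctx"  # placeholder for real decoder state
first_partial = state.on_chunk(
    [Hypothesis("hello word", -1.2), Hypothesis("hello world", -0.7), Hypothesis("yellow world", -3.0)]
)
after_eou = state.on_eou()
second_partial = state.on_chunk([Hypothesis("how are you", -0.5)])
```

The point of the sketch is that EOU handling and stream-end handling differ only in what they clear: the former drops the in-flight beams, the latter drops everything, including the decoder state.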

Collection: asr (inference pipelines, model wrappers, streaming state, decoding)

Changelog

Add specific line by line info of high level changes in this PR.

Usage

Example usage with n-gram LM fusion and per-stream biasing enabled:

python examples/asr/asr_streaming_inference/asr_streaming_infer.py \
  --config-path=../conf/asr_streaming_inference \
  --config-name=cache_aware_rnnt \
  asr.decoding.strategy=malsd_batch \
  asr.decoding.beam.beam_size=4 \
  asr.decoding.beam.ngram_lm_model=/path/to/lm.nemo \
  asr.decoding.beam.ngram_lm_alpha=0.5 \
  asr.decoding.beam.enable_per_stream_biasing=true \
  audio_file=/path/to/manifest.json
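
To illustrate what a weight such as `ngram_lm_alpha` typically controls, here is a minimal sketch of shallow fusion, assuming the fused score is the ASR log-probability plus alpha times the LM log-probability. The PR description does not spell out the exact formula, so treat this as an assumption; `fused_score`, `prune`, and the toy scores are hypothetical.

```python
def fused_score(asr_logp: float, lm_logp: float, alpha: float) -> float:
    """Shallow-fusion score (assumed form): ASR score plus a weighted LM score."""
    return asr_logp + alpha * lm_logp


def prune(candidates, beam_size: int, alpha: float):
    """Rank (token, asr_logp, lm_logp) candidates by fused score and keep the top K."""
    ranked = sorted(
        candidates,
        key=lambda c: fused_score(c[1], c[2], alpha),
        reverse=True,
    )
    return [token for token, _, _ in ranked[:beam_size]]


# Toy candidates: (token, ASR log-prob, LM log-prob). Values are made up.
candidates = [("cat", -1.0, -4.0), ("cap", -1.2, -1.0), ("cab", -2.5, -0.5)]

without_lm = prune(candidates, beam_size=2, alpha=0.0)
with_lm = prune(candidates, beam_size=2, alpha=0.5)
```

With alpha set to 0 the ranking follows the ASR scores alone; a non-zero alpha lets the LM promote candidates the acoustic model ranked lower, which is why the weight usually needs tuning per domain.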

Cache-aware streaming WER — Riva word-boosting eval

Model: nvidia/nemotron-speech-streaming-en-0.6b
Manifest: internal set
Word boosting: internal
Batch size: 256
Metric: Whisper-normalized WER (%)
Beam search: MALSD, beam size = 4 (malsd_batch)

WER (%) — lower is better

Decoder	[70, 13]	[70, 6]	[70, 1]	[70, 0]
Greedy	25.54	26.50	29.57	32.80
Greedy + WB	22.40	23.74	26.92	30.30
MALSD (b=4)	24.98	25.89	29.07	32.77
MALSD + WB	18.70	20.05	23.78	28.27
Δ WB (greedy)	−3.14	−2.76	−2.65	−2.50
Δ WB (beam)	−6.28	−5.84	−5.29	−4.50
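
The two Δ rows can be reproduced from the WER rows above with a few lines of Python. The numbers are copied from the table; the variable and function names are only illustrative.

```python
# WER (%) per column, copied from the table above.
COLUMNS = ["[70, 13]", "[70, 6]", "[70, 1]", "[70, 0]"]
WER = {
    "Greedy": [25.54, 26.50, 29.57, 32.80],
    "Greedy + WB": [22.40, 23.74, 26.92, 30.30],
    "MALSD (b=4)": [24.98, 25.89, 29.07, 32.77],
    "MALSD + WB": [18.70, 20.05, 23.78, 28.27],
}


def delta(with_wb, without_wb):
    """Absolute WER change from enabling word boosting (negative is better)."""
    return [round(a - b, 2) for a, b in zip(with_wb, without_wb)]


delta_greedy = delta(WER["Greedy + WB"], WER["Greedy"])
delta_beam = delta(WER["MALSD + WB"], WER["MALSD (b=4)"])

for name, row in (("Δ WB (greedy)", delta_greedy), ("Δ WB (beam)", delta_beam)):
    print(name, dict(zip(COLUMNS, row)))
```

In every column the absolute gain from word boosting is larger under beam search than under greedy decoding.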

RTFx — higher is faster

Decoder	[70, 13]	[70, 6]	[70, 1]	[70, 0]
Greedy	546.40	394.50	179.26	109.63
Greedy + WB	522.08	390.60	177.78	108.94
MALSD (b=4)	512.49	349.99	141.53	77.91
MALSD + WB	474.22	330.63	133.34	74.56
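
To put the throughput cost of beam search in relative terms, the RTFx rows can be divided column by column. The sketch below uses the numbers from the table above; the helper name is illustrative.

```python
# RTFx per column, copied from the table above (higher is faster).
COLUMNS = ["[70, 13]", "[70, 6]", "[70, 1]", "[70, 0]"]
RTFX = {
    "Greedy": [546.40, 394.50, 179.26, 109.63],
    "Greedy + WB": [522.08, 390.60, 177.78, 108.94],
    "MALSD (b=4)": [512.49, 349.99, 141.53, 77.91],
    "MALSD + WB": [474.22, 330.63, 133.34, 74.56],
}


def relative_throughput(decoder: str, baseline: str = "Greedy"):
    """Fraction of the baseline's RTFx retained by `decoder`, per column."""
    return [d / b for d, b in zip(RTFX[decoder], RTFX[baseline])]


beam_vs_greedy = relative_throughput("MALSD (b=4)")
beam_wb_vs_greedy = relative_throughput("MALSD + WB")

for column, ratio in zip(COLUMNS, beam_vs_greedy):
    print(f"{column}: MALSD retains {ratio:.1%} of greedy RTFx")
```

The retained fraction shrinks from the leftmost column to the rightmost one, so the relative overhead of beam search is largest in the `[70, 0]` configuration.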

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

Related to # (issue)

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

github-actions · 2026-06-08T22:51:10Z

[🤖]: Hi @lilithgrigoryan 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

naymaraq

This is really great work. Thanks @lilithgrigoryan

I’ll do another round of review after this one, since I see the PR is still a work in progress.

naymaraq · 2026-06-18T16:51:31Z

-            prompt_vectors: (Tensor | None) Optional prompt vectors of shape [B, num_prompts].
-        Returns:
-            (tuple[list[Hypothesis], CacheAwareContext]) best hypothesis and new context.
+        Run the cache-aware encoder for one streaming chunk, returning the (trimmed)


For consistency, please bring back the argument descriptions in the docstring.

lilithgrigoryan · 2026-06-18

I believe I kept the original docstring for execute_step unchanged. For some reason, Git's diff makes the docstrings of execute_step and encoder_step (a new function) look confusing.

LMK if I missed something.

naymaraq · 2026-06-19

Argument descriptions are missing for encoder_step.

github-actions · 2026-06-18T18:04:25Z

[🤖]: Hi @lilithgrigoryan 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

github-actions · 2026-06-18T21:56:05Z

[🤖]: Hi @lilithgrigoryan 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

naymaraq · 2026-06-19T12:44:00Z

@@ -0,0 +1,105 @@
+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.


I assume 2025 should be 2026.

naymaraq · 2026-06-19T12:48:26Z

+        self.init_decoding_computer()
+        strategy = str(getattr(cfg.asr.decoding, "strategy", "greedy_batch"))
+        if strategy not in {"greedy_batch", "malsd_batch"}:
+            raise ValueError(


The error should be raised as early as possible, before the model is loaded.
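
One way to act on this comment is to validate the strategy from the config before any expensive model loading happens. The sketch below is an assumption about how that could look, not the code that was merged: `validate_decoding_strategy`, `Pipeline`, and `load_model` are hypothetical names, and only the two strategy strings and the `greedy_batch` default come from the diff above.

```python
from types import SimpleNamespace

SUPPORTED_STRATEGIES = frozenset({"greedy_batch", "malsd_batch"})


def validate_decoding_strategy(decoding_cfg) -> str:
    """Return the configured strategy, or raise before any model is loaded."""
    strategy = str(getattr(decoding_cfg, "strategy", "greedy_batch"))
    if strategy not in SUPPORTED_STRATEGIES:
        raise ValueError(
            f"Unsupported decoding strategy {strategy!r}; "
            f"expected one of {sorted(SUPPORTED_STRATEGIES)}"
        )
    return strategy


class Pipeline:
    """Hypothetical pipeline: cheap config checks first, model loading second."""

    def __init__(self, cfg, load_model):
        self.strategy = validate_decoding_strategy(cfg.asr.decoding)  # fail fast
        self.model = load_model(cfg)  # only reached for a valid strategy


# Usage: a bad strategy fails without ever invoking the (stand-in) loader.
load_calls = []


def fake_load_model(cfg):
    load_calls.append(cfg)
    return "model"


bad_cfg = SimpleNamespace(asr=SimpleNamespace(decoding=SimpleNamespace(strategy="beam")))
try:
    Pipeline(bad_cfg, fake_load_model)
    raised = False
except ValueError:
    raised = True
```

Keeping the check in a standalone function also makes it easy to unit-test without a model checkpoint.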

naymaraq · 2026-06-19T12:54:07Z

The rest looks good to me! Just 2–3 minor comments.

Ideally, for completeness, you could also add a comparison of greedy decoding, beam search, and beam search with nGPULM, but that's up to you.

naymaraq

LGTM! Great work. Thanks @lilithgrigoryan !
Just make sure the tests are passing; otherwise, I don’t have any other comments.

merge

3a6f29e

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

github-actions Bot added the ASR label Jun 8, 2026

lilithgrigoryan added 5 commits June 8, 2026 23:18

add n-chunk reseting working

03937b0

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

saving config

e889eea

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

add n-chunk reseting working

e51ee3c

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

add eou resetting

3765acc

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

clean up debug prints

0e11a4f

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

lilithgrigoryan added 2 commits June 16, 2026 16:49

Merge branch 'main' of https://github.com/NVIDIA/NeMo into lgrigoryan…

913000a

…/streaming-beam-search-niva-cache-aware

typecast fix

6dc1423

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

clean up

0a69dee

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

isort and black + clean up

e051b12

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

github-advanced-security AI found potential problems Jun 16, 2026

Comment thread nemo/collections/asr/inference/streaming/state/cache_aware_rnnt_state.py Fixed

naymaraq requested changes Jun 18, 2026

lilithgrigoryan added 7 commits June 18, 2026 23:29

restore docstring

a299ee4

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

move malsd stream step to model wrapper

957084f

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

clean up

49eb9fe

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

refactor per-stream biasing, add utils

7be3088

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

add malsd-only warning

629de90

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

isort and black

c59bc00

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

Merge branch 'main' of https://github.com/NVIDIA/NeMo into lgrigoryan…

36e1702

…/streaming-beam-search-niva-cache-aware

lilithgrigoryan added 2 commits June 19, 2026 00:21

restore releasing biaing models

3656938

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

minor clean up

7840a22

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

clean up

fadef4e

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

lilithgrigoryan added 2 commits June 19, 2026 00:42

isort and black

b2b7116

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

clean up

205e85c

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

naymaraq requested changes Jun 19, 2026

lilithgrigoryan added 2 commits June 22, 2026 17:10

minor changes

211632e

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

Merge branch 'main' of https://github.com/NVIDIA/NeMo into lgrigoryan…

58c7df9

…/streaming-beam-search-niva-cache-aware

naymaraq approved these changes Jun 23, 2026
