feat(run-engine,webapp): always report worker queue length metrics #4029

Merged
ericallam merged 4 commits into main from feat/worker-queue-length-observer
Jun 24, 2026

@ericallam ericallam commented Jun 24, 2026

Member

Summary

The runqueue.workerQueue.length gauge only reported a worker queue's depth while runs were being dequeued from it. Once dequeues stop, the metric goes stale or disappears, so a queue that has backed up because nothing is draining it can't be alerted on. This PR adds a small observer that refreshes the observed set of worker queues from the WorkerInstanceGroup records on an interval, so every active worker queue (and its scheduled split variant) keeps reporting its length regardless of dequeue activity.
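
As a rough illustration of how the observed set could be derived from the group records, here is a minimal sketch. It is not the code from this PR: the record shape (`workerQueue`, `cloudProvider`), the function name `buildObservedWorkerQueues`, and the `:scheduled` suffix are all hypothetical stand-ins, since the PR text does not show the real field names or the suffix value.

```typescript
// Hypothetical record shape; the real WorkerInstanceGroup model may differ.
type WorkerGroupRecord = {
  workerQueue: string; // assumed field name
  cloudProvider: string | null; // assumed field name
};

// Assumed suffix for the "scheduled split variant"; the real value is not given here.
const SCHEDULED_SUFFIX = ":scheduled";

function buildObservedWorkerQueues(
  records: WorkerGroupRecord[],
  excludedCloudProviders: string[],
  suffixes: string[] = [SCHEDULED_SUFFIX]
): string[] {
  const excluded = new Set(excludedCloudProviders);
  // A Set collapses several groups that point at the same worker queue
  // into a single observed entry.
  const observed = new Set<string>();

  for (const record of records) {
    // Skip groups that run on an excluded provider.
    if (record.cloudProvider !== null && excluded.has(record.cloudProvider)) {
      continue;
    }
    observed.add(record.workerQueue);
    for (const suffix of suffixes) {
      observed.add(`${record.workerQueue}${suffix}`);
    }
  }

  return Array.from(observed);
}
```

Because the result depends only on the records and not on dequeue traffic, a queue that nothing is draining would still appear in the set on every refresh.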

The observer is off by default and enabled per service via RUN_ENGINE_WORKER_QUEUE_OBSERVER_ENABLED, reads from the read replica, and skips a configurable set of cloud providers (RUN_ENGINE_WORKER_QUEUE_OBSERVER_EXCLUDED_CLOUD_PROVIDERS, default digitalocean). When enabled it is the source of truth for the observed set, so the per-dequeue registration is skipped on that instance, and it groups by worker queue so the per-instance duplicates collapse to the true depth.
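
The configuration described above might be read along these lines. This is a sketch under assumptions: the accepted truthy values (`"1"`, `"true"`), the comma-separated provider list, and the 10000 ms fallback interval are guesses, and the full name `RUN_ENGINE_WORKER_QUEUE_OBSERVER_INTERVAL_MS` is inferred from the `_INTERVAL_MS` abbreviation in the review walkthrough. Only the off-by-default behaviour and the `digitalocean` default come from the description.

```typescript
type ObserverConfig = {
  enabled: boolean;
  intervalMs: number;
  excludedCloudProviders: string[];
};

// Hypothetical helper; the webapp's real env schema is not shown in this PR page.
function parseObserverConfig(env: Record<string, string | undefined>): ObserverConfig {
  const enabledRaw = env.RUN_ENGINE_WORKER_QUEUE_OBSERVER_ENABLED ?? "0";
  const intervalRaw = Number(env.RUN_ENGINE_WORKER_QUEUE_OBSERVER_INTERVAL_MS);
  const providersRaw =
    env.RUN_ENGINE_WORKER_QUEUE_OBSERVER_EXCLUDED_CLOUD_PROVIDERS ?? "digitalocean";

  return {
    // Off unless explicitly enabled (accepted values are an assumption).
    enabled: enabledRaw === "1" || enabledRaw.toLowerCase() === "true",
    // Fall back to an assumed 10s interval when unset or invalid.
    intervalMs: Number.isFinite(intervalRaw) && intervalRaw > 0 ? intervalRaw : 10000,
    // Assumed to be a comma-separated list.
    excludedCloudProviders: providersRaw
      .split(",")
      .map((provider) => provider.trim())
      .filter((provider) => provider.length > 0),
  };
}
```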

Also removes the unused GET/POST /api/v1/workers endpoints. Their only consumer was a CLI command group that is no longer registered.

Verification

Verified end to end against a local stack: the gauge reports each worker queue's length with no dequeues happening, excludes the configured providers, includes hidden groups, and the removed endpoints return as if they never existed. Added a run-engine test (workerQueueObservation.test.ts).

The runqueue.workerQueue.length gauge only observed worker queues that a
dequeue had registered, so a queue's depth stopped being reported once
dequeues stopped (or was never reported for a queue that backed up before
anything dequeued from it). A periodic observer now refreshes the observed
set from the WorkerInstanceGroup records instead, so every active worker
queue (and its scheduled split variant) keeps reporting its length
regardless of dequeue activity. Off by default; enable per-service via an
env var. Reads from the replica and skips configured cloud providers.
The GET and POST /api/v1/workers endpoints backed a CLI command group that
is no longer registered, so they had no reachable consumer. Remove them.
… timeout

Disable the execution workers and batch consumers in the worker queue
observation test. It only needs enqueue + processMasterQueue + the observer
gauge, and the extra workers add Redis connections and make engine.quit()
hang on worker shutdown when the shard's Redis is under pressure, timing the
test out in CI.
changeset-bot Bot commented Jun 24, 2026


⚠️ No Changeset found

Latest commit: 6495e99

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

coderabbitai Bot commented Jun 24, 2026

Contributor

Walkthrough

This PR adds a continuous worker-queue observer to RunEngine. A new workerQueueObserver option is added to RunEngineOptions, and RunQueue gains a setObservableWorkerQueues method. RunEngine implements refreshWorkerQueueObservation(), which queries workerInstanceGroup records from the read replica, filters by excluded cloud providers, appends configured queue suffixes, and calls setObservableWorkerQueues. An interval-driven loop (startWorkerQueueObserver) runs this refresh continuously, controlled by an AbortController that is aborted on quit(). When the observer is enabled, per-dequeue queue registration in dequeueFromWorkerQueue is skipped. Three new environment variables (RUN_ENGINE_WORKER_QUEUE_OBSERVER_ENABLED, _INTERVAL_MS, _EXCLUDED_CLOUD_PROVIDERS) wire the feature into the webapp. The unused GET and POST /api/v1/workers route handlers are also removed.
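
The interval-driven loop with abort-on-quit described in this walkthrough can be sketched as follows. The names `startObserverLoop` and `sleep` are hypothetical, and swallowing refresh errors so the loop keeps running is an assumed design choice rather than something the PR states; only the use of an AbortController that is aborted on shutdown comes from the walkthrough.

```typescript
// Resolves after `ms`, or immediately once the signal aborts.
function sleep(ms: number, signal: AbortSignal): Promise<void> {
  return new Promise((resolve) => {
    if (signal.aborted) return resolve();
    const onAbort = () => {
      clearTimeout(timer);
      resolve();
    };
    const timer = setTimeout(() => {
      signal.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    signal.addEventListener("abort", onAbort, { once: true });
  });
}

// Hypothetical stand-in for startWorkerQueueObserver: calls `refresh`
// repeatedly until `stop()` aborts the controller.
function startObserverLoop(
  refresh: () => Promise<void>,
  intervalMs: number,
  onError: (error: unknown) => void = () => {}
): { stop: () => void; done: Promise<void> } {
  const controller = new AbortController();
  const { signal } = controller;

  const done = (async () => {
    while (!signal.aborted) {
      try {
        await refresh();
      } catch (error) {
        // Assumption: one failed refresh should not end the loop.
        onError(error);
      }
      await sleep(intervalMs, signal);
    }
  })();

  return { stop: () => controller.abort(), done };
}
```

A caller would invoke `stop()` from its shutdown path and can await `done` to be sure no further refresh is scheduled.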

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The description omits required template sections like Closes, checklist, Testing, Changelog, and Screenshots. | Add the missing template sections, including Closes #, checklist items, test steps, changelog, and screenshots if applicable. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly names the main change: always reporting worker queue length metrics. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint install failed: private package registry requires authentication. Disable ESLint in CodeRabbit settings or use public packages.



coderabbitai[bot]

This comment was marked as resolved.

@ericallam ericallam marked this pull request as ready for review June 24, 2026 14:39

@devin-ai-integration devin-ai-integration Bot left a comment

Contributor

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

coderabbitai Bot commented Jun 24, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: cad9367d-98cd-40ba-8954-3cc7d0eee931

📥 Commits

Reviewing files that changed from the base of the PR and between 9c5633b and 6495e99.

📒 Files selected for processing (2)
  • internal-packages/run-engine/src/engine/index.ts
  • internal-packages/run-engine/src/engine/tests/workerQueueObservation.test.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • internal-packages/run-engine/src/engine/tests/workerQueueObservation.test.ts
  • internal-packages/run-engine/src/engine/index.ts
📜 Recent review details
⏰ Context from checks skipped due to timeout. (26)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (3, 10)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (7, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (5, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (11, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (9, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (3, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (12, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (1, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (10, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (4, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (6, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (8, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (2, 12)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (10, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (9, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (8, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (2, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (4, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (7, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (6, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (1, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (5, 10)
  • GitHub Check: typecheck / typecheck
  • GitHub Check: e2e-webapp / 🧪 E2E Tests: Webapp
  • GitHub Check: 🛡️ E2E Auth Tests (full)
  • GitHub Check: Analyze (javascript-typescript)

Walkthrough

This PR adds a continuous worker-queue observer to RunEngine. A new workerQueueObserver option is added to RunEngineOptions, and RunQueue gains a setObservableWorkerQueues method. RunEngine implements refreshWorkerQueueObservation(), which queries workerInstanceGroup records from the read replica, filters by excluded cloud providers, appends configured queue suffixes, and calls setObservableWorkerQueues. An interval-driven loop runs this refresh continuously and is aborted on quit(). When enabled, per-dequeue queue registration is skipped. New environment variables wire the feature into the webapp. The unused GET and POST /api/v1/workers route handlers are removed, and matching server-change notes are added.

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The description covers the change and verification, but it misses the required issue link, checklist, testing, changelog, and screenshots sections. | Add the Closes #issue line and the template sections for checklist, testing steps, changelog, and screenshots, even if some are brief or marked N/A. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly states the main change: continuously reporting worker queue length metrics across run-engine and webapp. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

@devin-ai-integration devin-ai-integration Bot left a comment

Contributor

Devin Review found 1 new potential issue.

Comment thread .server-changes/remove-worker-create-endpoint.md
@ericallam ericallam merged commit 8890d7a into main Jun 24, 2026
46 checks passed
@ericallam ericallam deleted the feat/worker-queue-length-observer branch June 24, 2026 18:04