Skip to content

Add new HookMetricEmit that can emit arbitrary metrics#1285

Open
brandur wants to merge 1 commit into
masterfrom
brandur-metrics-hook
Open

Add new HookMetricEmit that can emit arbitrary metrics#1285
brandur wants to merge 1 commit into
masterfrom
brandur-metrics-hook

Conversation

@brandur

@brandur brandur commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

I was thinking about observability in River the other day, and how if
one were trying to track River performance degradation due to dead
tuples, how you might do so. Surprisingly, I don't think it's really
possible under our current framework -- the best proxy for it is time to
lock jobs, a metric which we don't provide any way of extracting out of
River.

Here, add a new hook called HookMetricEmit. The idea behind it is that it
may receive an arbitrary metric that River starts to emit, and handle it
by passing it onto a metrics system of choice, like OTEL, DataDog, or
whatever.

Within HookMetricEmit, add two specific metrics to get us started:

  • JobGetAvailableDurationMetric: Time to lock available jobs. This
    will start to show significant degradation in case of dead tuples.

  • JobGetAvailableCountMetric: Number of jobs locked while getting
    available. This is somewhat useful in seeing how well batch locking is
    working out in a program (i.e. are we locking one job at a time or
    100?).

@brandur brandur changed the title Add new HookMetric that can emit arbitrary metrics Add new HookMetricEmit that can emit arbitrary metrics Jun 12, 2026
brandur added a commit to riverqueue/rivercontrib that referenced this pull request Jun 12, 2026
Here, build on the proposal in [1] to add a metric hook to River, and
start emitting metrics for the time to lock jobs and the number of jobs
locked. Especially the first metric is generally very useful for looking
for queue degradation due to dead tuples.

[1] riverqueue/river#1285
@brandur brandur force-pushed the brandur-metrics-hook branch 2 times, most recently from c3c56ed to e54c7ad Compare June 12, 2026 15:09
@brandur brandur force-pushed the brandur-metrics-hook branch 2 times, most recently from 945be6d to e4a4b69 Compare June 20, 2026 20:38
@brandur

brandur commented Jun 22, 2026

Copy link
Copy Markdown
Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e4a4b69bd8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread producer.go
return
}

if len(p.metricEmitHooks) > 0 {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Skip metrics when no JobGetAvailable query runs

When a queue is paused, innerFetchLoop still calls dispatchWork with count == 0, and StandardPilot.JobGetAvailable returns immediately for MaxToLock <= 0 without issuing the DB query. This block still emits both metrics, so paused queues produce near-zero job_get_available_duration samples and zero-count fetches that never actually locked anything, skewing the observability signal this hook is meant to expose; gate emission on count > 0 or on an actual fetch being attempted.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified dispatchWork to return early in case of a count <= 0 (i.e. paused queue).

I was thinking about observability in River the other day, and how if
one were trying to track River performance degradation due to dead
tuples, how you might do so. Surprisingly, I don't think it's really
possible under our current framework -- the best proxy for it is time to
lock jobs, a metric which we don't provide any way of extracting out of
River.

Here, add a new hook called `HookMetricEmit`. The idea behind it is that it
may receive an arbitrary metric that River starts to emit, and handle it
by passing it onto a metrics system of choice, like OTEL, DataDog, or
whatever.

Within `HookMetricEmit`, add two specific metrics to get us started:

* `JobGetAvailableDurationMetric`: Time to lock available jobs. This
  will start to show significant degradation in case of dead tuples.

* `JobGetAvailableCountMetric`: Number of jobs locked while getting
  available. This is somewhat useful in seeing how well batch locking is
  working out in a program (i.e. are we locking one job at a time or
  100?).
@brandur brandur force-pushed the brandur-metrics-hook branch from e4a4b69 to 1717713 Compare June 22, 2026 21:19
@brandur

brandur commented Jun 22, 2026

Copy link
Copy Markdown
Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1717713632

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread producer.go

var startedAt time.Time
if len(p.metricEmitHooks) > 0 {
startedAt = time.Now()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 Badge Use the service clock for fetch duration metrics

When Config.Test.Time supplies a stubbed time generator, this captures real wall-clock time instead of River's configured service clock, and the later time.Since(startedAt) keeps the new metric nondeterministic under time stubbing. Metric hooks used in tests will see durations that don't advance with p.Time, unlike the rest of the producer's timing/database timestamps; capture and subtract with p.Time.Now() on both sides instead.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think it's okay to use the stub-free version for metric durations. I did it on purpose since the code is made a little clearer and I had no plans to sub the value.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant