Skip to content

columnar: register metrics for hub#10941

Merged
ti-chi-bot[bot] merged 5 commits into
pingcap:masterfrom
yongman:fix-columnar-metrics
Jul 1, 2026
Merged

columnar: register metrics for hub#10941
ti-chi-bot[bot] merged 5 commits into
pingcap:masterfrom
yongman:fix-columnar-metrics

Conversation

@yongman

@yongman yongman commented Jun 30, 2026

Copy link
Copy Markdown
Member

What problem does this PR solve?

Issue Number: close #10943

Problem Summary:
Metrics are not registed to columnar hub runtime and the store stats not collected.

What is changed and how it works?

Register the metrics during startup and add store stats collect in store pd task.


Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

None

Manual Test

image image

Summary by CodeRabbit

Summary by CodeRabbit

Summary

  • New Features

    • Added one-time initialization for core runtime metrics during proxy startup.
    • Made the Prometheus metric name prefix configurable via PROMETHEUS_METRIC_NAME_PREFIX (default: tiflash_proxy_), and included it in the proxy’s version/debug output.
  • Bug Fixes

    • Improved store heartbeat disk capacity/availability reporting by preferring engine-store derived statistics when available.
    • Preserved the “skip heartbeat when capacity is zero” behavior, with a fallback to the prior on-disk data source when engine-store stats can’t be read.

Signed-off-by: yongman <yming0221@gmail.com>
@ti-chi-bot

ti-chi-bot Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot Bot added do-not-merge/needs-linked-issue do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. labels Jun 30, 2026
@coderabbitai

coderabbitai Bot commented Jun 30, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: ac86b66c-2548-4082-905b-632e7375ba29

📥 Commits

Reviewing files that changed from the base of the PR and between 2dde5e4 and 99b1a70.

📒 Files selected for processing (1)
  • contrib/tiflash-proxy-cmake/CMakeLists.txt

📝 Walkthrough

Walkthrough

Adds engine-store stats access, one-time metrics startup, version-string prefix reporting, heartbeat disk-stat fallback, and build-time wiring for the columnar hub.

Changes

Store Stats & Metrics Init

Layer / File(s) Summary
Engine-store stats FFI
contrib/tiflash-columnar-hub/hub-runtime/src/engine_store_helper.rs
Declares and implements handle_compute_store_stats(&self) -> StoreStats in EngineStoreServerHelperExt via fn_handle_compute_store_stats.
Prometheus build wiring
contrib/tiflash-columnar-hub/.cargo/config.toml, contrib/tiflash-columnar-hub/Cargo.toml, contrib/tiflash-columnar-hub/Makefile, contrib/tiflash-columnar-hub/hub-runtime/Cargo.toml, contrib/tiflash-proxy-cmake/CMakeLists.txt
Adds the prometheus workspace dependency and patch, sets the metric prefix, and wires the Cargo config and linker flag into the proxy build.
Startup metrics initialization
contrib/tiflash-columnar-hub/hub-runtime/src/run.rs
Adds init_metrics() with Once, registers TiKV metrics monitors, and calls it in run_proxy after config-check handling.
Version prefix reporting
contrib/tiflash-columnar-hub/hub-runtime/src/lib.rs
Adds a Prometheus Prefix field to proxy_version_info() from the environment with an Unknown fallback.
Engine-store heartbeat disk stats
contrib/tiflash-columnar-hub/hub-runtime/src/run.rs
Adds engine-store-backed store space collection, prefers it in heartbeat stats, and updates the disk-usage unit test.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • pingcap/tiflash#10869: Overlaps in contrib/tiflash-columnar-hub/hub-runtime startup and version plumbing, including proxy_version_info() and run_proxy.

Suggested reviewers

  • JinheLin
  • Lloyd-Pottiger

Poem

🐇 I hopped with metrics, bright and keen,
A prefix tagged the proxy stream.
The heartbeat found its space to breathe,
While engine-store stats spun underneath.
With linker flags and startup cheer,
The columnar hub rang crisp and clear.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 8.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title is concise and accurately summarizes the main change: registering metrics for the columnar hub.
Description check ✅ Passed The description covers the problem, issue reference, changed behavior, test status, and release-note section.
Linked Issues check ✅ Passed The changes address the linked issue by registering metrics during startup and collecting store stats for the columnar compute node.
Out of Scope Changes check ✅ Passed The additional build and runtime config changes appear directly tied to metric registration and store-stat collection.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@ti-chi-bot ti-chi-bot Bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jun 30, 2026
Signed-off-by: yongman <yming0221@gmail.com>
@ti-chi-bot ti-chi-bot Bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. do-not-merge/needs-linked-issue labels Jun 30, 2026
@yongman yongman marked this pull request as ready for review June 30, 2026 06:44
@ti-chi-bot ti-chi-bot Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 30, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
contrib/tiflash-columnar-hub/.cargo/config.toml (1)

1-2: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Document the rationale for allowing multiple symbol definitions.

The --allow-multiple-definition linker flag weakens a standard safety check and can mask real symbol collisions. Given this is an FFI-heavy codebase interacting with TiKV/C++ components, the flag may be necessary due to duplicate extern "C" definitions across crate boundaries, but the reason is not self-evident from the config.

Add a comment explaining:

  1. Why duplicates occur (e.g., metrics symbols, FFI helpers).
  2. That this is a temporary workaround or permanent architectural necessity.
  3. Any risks if definitions diverge.

As per coding guidelines, Rust projects should use cargo clippy for linting; while this is config, consider if the root cause can be fixed instead.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@contrib/tiflash-columnar-hub/.cargo/config.toml` around lines 1 - 2, The
Cargo config enables the linker’s allow-multiple-definition flag without
explaining why, so add a clear comment near the rustflags setting in the tiFlash
columnar hub config describing the specific duplicate symbol sources (for
example shared FFI helpers or metrics symbols across crates), whether this is a
temporary workaround or required by the current architecture, and the risk that
divergent definitions could hide real collisions; also note in the surrounding
context that if the duplicates are unintentional, the better fix is to resolve
the root cause rather than relying on the linker workaround.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@contrib/tiflash-columnar-hub/hub-runtime/src/run.rs`:
- Around line 940-945: The issue is in
collect_store_space_stats_from_engine_store: it currently returns Some even when
stats.fs_stats.capacity_size is zero, which prevents the fallback path in the
heartbeat logic from calling collect_store_space_stats(data_dir). Update this
helper to treat zero capacity as unavailable by returning None when fs_stats.ok
is zero or capacity_size is zero, so the existing fallback in the surrounding
store-space selection logic can use the local filesystem stats instead.

---

Nitpick comments:
In `@contrib/tiflash-columnar-hub/.cargo/config.toml`:
- Around line 1-2: The Cargo config enables the linker’s
allow-multiple-definition flag without explaining why, so add a clear comment
near the rustflags setting in the tiFlash columnar hub config describing the
specific duplicate symbol sources (for example shared FFI helpers or metrics
symbols across crates), whether this is a temporary workaround or required by
the current architecture, and the risk that divergent definitions could hide
real collisions; also note in the surrounding context that if the duplicates are
unintentional, the better fix is to resolve the root cause rather than relying
on the linker workaround.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 9dc6fcf4-c0ac-4703-96f7-113cd1a55acc

📥 Commits

Reviewing files that changed from the base of the PR and between 0fff0b7 and 6c6686d.

📒 Files selected for processing (3)
  • contrib/tiflash-columnar-hub/.cargo/config.toml
  • contrib/tiflash-columnar-hub/hub-runtime/src/engine_store_helper.rs
  • contrib/tiflash-columnar-hub/hub-runtime/src/run.rs

Comment thread contrib/tiflash-columnar-hub/hub-runtime/src/run.rs
Signed-off-by: yongman <yming0221@gmail.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@contrib/tiflash-proxy-cmake/CMakeLists.txt`:
- Around line 132-133: The ENABLE_NEXT_GEN_COLUMNAR branch in the CMakeLists
logic is hardcoding the Rust metric prefix, which breaks the existing build-time
override contract. Update the TIFLASH_RUST_ENV setup to read and pass through
the exported PROMETHEUS_METRIC_NAME_PREFIX value from the build
environment/Makefile instead of always injecting tiflash_proxy_, and keep the
change localized near the ENABLE_NEXT_GEN_COLUMNAR handling so the
Makefile/CMake integration remains consistent.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: fea451a0-2cc6-43c6-878f-534309e5631e

📥 Commits

Reviewing files that changed from the base of the PR and between 6c6686d and 51e24fe.

⛔ Files ignored due to path filters (1)
  • contrib/tiflash-columnar-hub/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (6)
  • contrib/tiflash-columnar-hub/.cargo/config.toml
  • contrib/tiflash-columnar-hub/Cargo.toml
  • contrib/tiflash-columnar-hub/Makefile
  • contrib/tiflash-columnar-hub/hub-runtime/Cargo.toml
  • contrib/tiflash-columnar-hub/hub-runtime/src/run.rs
  • contrib/tiflash-proxy-cmake/CMakeLists.txt
✅ Files skipped from review due to trivial changes (1)
  • contrib/tiflash-columnar-hub/.cargo/config.toml
🚧 Files skipped from review as they are similar to previous changes (1)
  • contrib/tiflash-columnar-hub/hub-runtime/src/run.rs

Comment thread contrib/tiflash-proxy-cmake/CMakeLists.txt
Signed-off-by: yongman <yming0221@gmail.com>
@ti-chi-bot ti-chi-bot Bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 30, 2026
Signed-off-by: yongman <yming0221@gmail.com>
@ti-chi-bot ti-chi-bot Bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 30, 2026
@yongman

yongman commented Jun 30, 2026

Copy link
Copy Markdown
Member Author

/test pull-unit-test

@ti-chi-bot ti-chi-bot Bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Jul 1, 2026
@JaySon-Huang JaySon-Huang requested a review from JinheLin July 1, 2026 01:42
@ti-chi-bot

ti-chi-bot Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JaySon-Huang, Lloyd-Pottiger

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [JaySon-Huang,Lloyd-Pottiger]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Jul 1, 2026
@ti-chi-bot

ti-chi-bot Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

[LGTM Timeline notifier]

Timeline:

  • 2026-07-01 01:42:50.04727033 +0000 UTC m=+176511.747649753: ☑️ agreed by JaySon-Huang.
  • 2026-07-01 02:17:29.580969688 +0000 UTC m=+178591.281349111: ☑️ agreed by Lloyd-Pottiger.

@ti-chi-bot ti-chi-bot Bot merged commit eb0af29 into pingcap:master Jul 1, 2026
11 checks passed
@yongman

yongman commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

/cherry-pick release-nextgen-202603

@yongman yongman deleted the fix-columnar-metrics branch July 1, 2026 02:23
@ti-chi-bot

Copy link
Copy Markdown
Member

@yongman: new pull request created to branch release-nextgen-202603: #10944.

Details

In response to this:

/cherry-pick release-nextgen-202603

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved lgtm release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

some metrics loss in columnar compute node

4 participants