
fix: prevent service crashes from unhandled promise rejections #6314

Open
vshvets-bc wants to merge 2 commits into hashgraph:develop from Climission:fix/unhandled-rejection-hardening


@vshvets-bc

Collaborator

Problem

No service registers a process-level error handler, so a single stray promise rejection terminates the whole Node process (exit 1). We hit this in production: guardian-service crashed and cascaded to api-gateway.

Two concrete paths trigger it:

  1. NATS reply handler: in common/src/mq/nats-service.ts, the reply-subject callback awaits this.codec.decode(msg.data) outside a try/catch. For large payloads, ZipCodec.decode fetches a directLink over HTTP; if the responding service died mid-request, that fetch rejects with ECONNREFUSED, the rejection escapes the async subscribe callback as an unhandledRejection, and the process exits.
  2. PolicyEngine.generateModel: the ready-callback tries to settle its promise more than once (there is no return after reject) and only cleans up on the resolve path.

Because there is no process.on('unhandledRejection' | 'uncaughtException') anywhere in the codebase, either path crashes the service.

Changes

Global safety net

  • New shared helper common/src/helpers/process-error-handlers.ts (registerGlobalErrorHandlers / setGlobalErrorLogger / markServiceBooted).
  • Registered before bootstrap in guardian-service and api-gateway (via a side-effect module imported ahead of app.js, so handlers are active before any bootstrap rejection).
  • After boot, unhandledRejection is logged and swallowed so one failed operation cannot take down the process; before boot the handler exits with code 1 for a clean restart. uncaughtException always logs and exits with code 1.
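
The helper's source is not reproduced in this description, so the following is only a minimal sketch of the pre-boot/post-boot behavior described above. `ProcessLike`, `ErrorLogger`, `toError`, and `createGlobalErrorHandlers` are hypothetical names introduced for illustration; the real helper exposes registerGlobalErrorHandlers / setGlobalErrorLogger / markServiceBooted and may be structured differently.

```typescript
// Hypothetical minimal process surface, so the sketch can run against a fake.
interface ProcessLike {
    on(event: string, listener: (reason: unknown) => void): unknown;
    exit(code: number): void;
}

// Hypothetical logger signature (assumption, not the PR's API).
type ErrorLogger = (message: string, error: Error) => void;

// Rejection reasons are not always Error instances; normalize before logging.
function toError(reason: unknown): Error {
    return reason instanceof Error ? reason : new Error(String(reason));
}

function createGlobalErrorHandlers(proc: ProcessLike, log: ErrorLogger) {
    let booted = false;

    proc.on('unhandledRejection', (reason) => {
        const error = toError(reason);
        if (booted) {
            // Post-boot: one failed operation is logged, the service keeps running.
            log('unhandledRejection (swallowed)', error);
            return;
        }
        // Pre-boot: a failed bootstrap should exit so the supervisor restarts it.
        log('unhandledRejection during bootstrap', error);
        proc.exit(1);
    });

    proc.on('uncaughtException', (reason) => {
        // Always log and exit, matching the behavior described above.
        log('uncaughtException', toError(reason));
        proc.exit(1);
    });

    return {
        markServiceBooted: (): void => {
            booted = true;
        },
    };
}
```

In a service, Node's own `process` object would presumably be passed as `proc`; taking it as a parameter is a choice made here so the exit path can be tested without terminating the test runner.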

NATS reply decode guard

  • Wrap codec.decode() in the reply handler so a failed reply (e.g. a large-payload directLink fetch failure) fails just that request — the caller's sendMessage promise settles with an error — instead of throwing out of the async callback. The rest of the handler is unchanged; validation still returns 401.
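
As a rough illustration of the guard, and not the actual nats-service.ts code, the sketch below decodes inside a try/catch and reports a failure to the waiting caller instead of letting it escape the async callback. `ReplyMessage`, `Codec`, `ReplyCallback`, `handleReply`, and the `correlationId` field are all hypothetical; the 500 code follows the test description in this PR.

```typescript
// All types here are illustrative assumptions, not the NATS client's API.
interface ReplyMessage {
    data: Uint8Array;
    correlationId: string;
}

interface Codec {
    decode(data: Uint8Array): Promise<unknown>;
}

interface ReplyError {
    code: number;
    message: string;
}

type ReplyCallback = (body: unknown, error?: ReplyError) => void;

async function handleReply(
    msg: ReplyMessage,
    codec: Codec,
    pending: Map<string, ReplyCallback>,
    logError: (error: unknown) => void,
): Promise<void> {
    const callback = pending.get(msg.correlationId);
    if (!callback) {
        // Unknown or already-settled request: nothing to deliver, nothing to throw.
        return;
    }
    pending.delete(msg.correlationId);

    let body: unknown;
    try {
        body = await codec.decode(msg.data);
    } catch (error) {
        // Keep the detail in the service log; give the caller a generic message.
        logError(error);
        callback(null, { code: 500, message: 'Failed to decode reply' });
        return;
    }
    callback(body);
}
```

Because the function never rejects for a decode failure, the subscribe callback that invokes it cannot produce an unhandledRejection from this path.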

generateModel ready-callback

  • Delete the callback first and settle exactly once (return after reject), so a duplicate ready-event can't double-settle.
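
The pattern can be sketched as follows. This is an assumption-laden illustration rather than the PolicyEngine code: `ReadyEvent`, `ReadyCallbacks`, `waitForReady`, and `onReady` are hypothetical names chosen to show "delete first, then settle once".

```typescript
// Hypothetical event shape; the real ready-event payload is not shown in this PR text.
interface ReadyEvent {
    policyId: string;
    error?: string;
}

class ReadyCallbacks {
    private readonly callbacks = new Map<string, (event: ReadyEvent) => void>();

    // Returns a promise that settles when the matching ready-event arrives.
    waitForReady(policyId: string): Promise<void> {
        return new Promise<void>((resolve, reject) => {
            this.callbacks.set(policyId, (event) => {
                if (event.error) {
                    reject(new Error(event.error));
                    return; // without this return, execution would fall through to resolve()
                }
                resolve();
            });
        });
    }

    // Returns false for duplicate or unknown events instead of settling again.
    onReady(event: ReadyEvent): boolean {
        const callback = this.callbacks.get(event.policyId);
        if (!callback) {
            return false;
        }
        // Clean up before settling, so both the resolve and reject paths release the entry.
        this.callbacks.delete(event.policyId);
        callback(event);
        return true;
    }

    get pendingCount(): number {
        return this.callbacks.size;
    }
}
```

Deleting the entry before invoking it means a second ready-event for the same policy finds nothing to call, which is what prevents the double settle.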

Tests

New unit tests in common/tests/unit-tests/:

  • process-error-handlers.test.mjs — pre-boot exit vs post-boot swallow, non-Error reason normalization, logger routing, uncaughtException exit.
  • nats-service-reply-decode.test.mjs — decode failure routed to caller as 500 without throwing; happy path still delivers the body; unknown-callback replies don't throw.

interfaces, common, guardian-service, and api-gateway all build clean; the new tests pass.

No service registers a process-level error handler, so a single stray promise
rejection terminates the whole Node process (exit 1). Two concrete paths can
trigger this:

- The NATS reply handler awaits `codec.decode()` outside a try/catch. For large
  payloads decode fetches a directLink over HTTP; if the responding service died
  mid-request the fetch throws (ECONNREFUSED) out of the async subscribe
  callback and becomes an unhandledRejection.
- PolicyEngine.generateModel's ready-callback settles its promise more than once
  (no `return` after `reject`) and only cleans up on the resolve path.

Changes:

- Add global unhandledRejection/uncaughtException handlers (new shared helper in
  @guardian/common), registered before bootstrap in guardian-service and
  api-gateway. After boot, unhandled rejections are logged and swallowed so one
  failed operation cannot kill the process; before boot they exit for a clean
  restart. uncaughtException always logs and exits.
- Guard `codec.decode()` in the NATS reply handler so a failed reply fails just
  that request (returned to the caller) instead of throwing out of the callback.
- Settle the generateModel ready-callback exactly once and always clean it up.

Adds unit tests for the global handlers and the reply-decode guard.

Signed-off-by: Volodymyr Shvets <volodymyr.shvets@climission.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@vshvets-bc vshvets-bc requested review from a team as code owners July 1, 2026 11:44
@Pyatakov Pyatakov self-assigned this Jul 1, 2026
- sendMessage/sendRawMessage: move async work (sign/encode/publish) out of the
  Promise executor. A rejection inside an async executor is neither caught nor
  settles the promise - it became an unhandledRejection and left the caller
  hanging. Now it rejects the promise and clears the pending response callback.
- Reply decode failure: return a generic message to the caller instead of the
  raw exception text (which could contain the internal directLink URL); the
  detail is still logged server-side.
- registerGlobalErrorHandlers: add an idempotency guard so repeated calls don't
  attach duplicate process listeners.
- Post-boot swallowed rejections are now counted and logged with a running
  total (getSwallowedRejectionCount) for observability.
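
To illustrate the first bullet, here is one possible shape for keeping the executor synchronous while routing failures of the async work to `reject`. It is a sketch under stated assumptions, not the actual sendMessage: the `encode` and `publish` parameters, the `PendingReplies` map, and the string reply type are all hypothetical.

```typescript
// Hypothetical registry of reply callbacks keyed by message id.
type PendingReplies = Map<string, (reply: string) => void>;

// For contrast, the problematic form is `new Promise(async (resolve) => { await ... })`:
// a rejection inside that async executor is not connected to the promise being built.
function sendMessage(
    id: string,
    payload: string,
    encode: (payload: string) => Promise<Uint8Array>,
    publish: (id: string, bytes: Uint8Array) => void,
    pending: PendingReplies,
): Promise<string> {
    return new Promise<string>((resolve, reject) => {
        // Synchronous executor: register the reply callback first.
        pending.set(id, resolve);

        // Run the async work separately and connect its failure to reject.
        (async () => {
            const bytes = await encode(payload);
            publish(id, bytes);
        })().catch((error: unknown) => {
            // Clear the pending callback so a late reply cannot find a stale entry.
            pending.delete(id);
            reject(error);
        });
    });
}
```

With this shape a failing `encode` rejects the returned promise, so the caller sees an error instead of waiting indefinitely for a reply that was never requested.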

Extends the unit tests accordingly.

Signed-off-by: Volodymyr Shvets <volodymyr.shvets@climission.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2 participants