fix: prevent service crashes from unhandled promise rejections#6314
Open
vshvets-bc wants to merge 2 commits into
Open
fix: prevent service crashes from unhandled promise rejections#6314vshvets-bc wants to merge 2 commits into
vshvets-bc wants to merge 2 commits into
Conversation
No service registers a process-level error handler, so a single stray promise rejection terminates the whole Node process (exit 1). Two concrete paths can trigger this: - The NATS reply handler awaits `codec.decode()` outside a try/catch. For large payloads decode fetches a directLink over HTTP; if the responding service died mid-request the fetch throws (ECONNREFUSED) out of the async subscribe callback and becomes an unhandledRejection. - PolicyEngine.generateModel's ready-callback settles its promise more than once (no `return` after `reject`) and only cleans up on the resolve path. Changes: - Add global unhandledRejection/uncaughtException handlers (new shared helper in @guardian/common), registered before bootstrap in guardian-service and api-gateway. After boot, unhandled rejections are logged and swallowed so one failed operation cannot kill the process; before boot they exit for a clean restart. uncaughtException always logs and exits. - Guard `codec.decode()` in the NATS reply handler so a failed reply fails just that request (returned to the caller) instead of throwing out of the callback. - Settle the generateModel ready-callback exactly once and always clean it up. Adds unit tests for the global handlers and the reply-decode guard. Signed-off-by: Volodymyr Shvets <volodymyr.shvets@climission.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- sendMessage/sendRawMessage: move async work (sign/encode/publish) out of the Promise executor. A rejection inside an async executor is neither caught nor settles the promise - it became an unhandledRejection and left the caller hanging. Now it rejects the promise and clears the pending response callback. - Reply decode failure: return a generic message to the caller instead of the raw exception text (which could contain the internal directLink URL); the detail is still logged server-side. - registerGlobalErrorHandlers: add an idempotency guard so repeated calls don't attach duplicate process listeners. - Post-boot swallowed rejections are now counted and logged with a running total (getSwallowedRejectionCount) for observability. Extends the unit tests accordingly. Signed-off-by: Volodymyr Shvets <volodymyr.shvets@climission.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
No service registers a process-level error handler, so a single stray promise rejection terminates the whole Node process (exit 1). We hit this in production:
guardian-servicecrashed and cascaded toapi-gateway.Two concrete paths trigger it:
common/src/mq/nats-service.ts, the reply-subject callback awaitsthis.codec.decode(msg.data)outside a try/catch. For large payloads,ZipCodec.decodefetches adirectLinkover HTTP; if the responding service died mid-request, that fetch throwsECONNREFUSEDout of the async subscribe callback →unhandledRejection→ process exit.returnafterreject) and only cleans up on the resolve path.Because there is no
process.on('unhandledRejection' | 'uncaughtException')anywhere in the codebase, either path crashes the service.Changes
Global safety net
common/src/helpers/process-error-handlers.ts(registerGlobalErrorHandlers/setGlobalErrorLogger/markServiceBooted).guardian-serviceandapi-gateway(via a side-effect module imported ahead ofapp.js, so handlers are active before any bootstrap rejection).unhandledRejectionis logged and swallowed so one failed operation cannot take down the process; before boot it exits(1) for a clean restart.uncaughtExceptionalways logs and exits(1).NATS reply decode guard
codec.decode()in the reply handler so a failed reply (e.g. a large-payloaddirectLinkfetch failure) fails just that request — the caller'ssendMessagepromise settles with an error — instead of throwing out of the async callback. The rest of the handler is unchanged; validation still returns401.generateModel ready-callback
returnafterreject), so a duplicate ready-event can't double-settle.Tests
New unit tests in
common/tests/unit-tests/:process-error-handlers.test.mjs— pre-boot exit vs post-boot swallow, non-Error reason normalization, logger routing,uncaughtExceptionexit.nats-service-reply-decode.test.mjs— decode failure routed to caller as500without throwing; happy path still delivers the body; unknown-callback replies don't throw.interfaces,common,guardian-service, andapi-gatewayall build clean; the new tests pass.