feat(qwp): connect timeout, ingest callbacks, and write-only mode on the QuestDB facade#60

Open
bluestreak01 wants to merge 14 commits into main from feat/connect-timeout

bluestreak01 commented Jun 28, 2026

Member

Summary

Three related ergonomics/resilience improvements for the QWP (WebSocket) client:

  1. Application-level TCP connect timeout (transport-wide) — bound the connect itself instead of riding the OS-level timeout.
  2. Expose the ingest callbacks on the QuestDB facade: errorHandler / connectionListener, previously unreachable from the pooled facade.
  3. Write-only facade mode — start the handle even when the server / read primary is down (the reader pool is otherwise fail-fast and sinks the whole facade).

Each is independently usable and off by default — no behaviour change unless opted into.


1. Configurable TCP connect timeout

A connect to a black-holed/firewalled host blocks on the OS-level TCP connect timeout (often 60–120s), because the socket is created blocking and is switched to non-blocking only after connect() returns. The code calls this out:

// SenderPool.java
// connect to a black-holed/firewalled host blocks on the OS connect timeout
// (the transport exposes no application-level connect timeout to clamp it).

Approach (native, cross-platform): non-blocking connect() (EINPROGRESS) → poll/select for writability bounded by the caller's budget → confirm via getsockopt(SO_ERROR). A sentinel (CONNECT_TIMEOUT = -3) lets Java raise a timeout-flagged exception. Generalises the existing handleEintrInConnect helper.
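The same sequence can be sketched on the JVM with NIO, purely as an analogy to the native primitive: switch the channel to non-blocking before connect, wait for connectability under a monotonic deadline, then let finishConnect() report the outcome (the role getsockopt(SO_ERROR) plays in C). BoundedConnect and CONNECT_FAILED are illustrative names, not part of this PR; only the CONNECT_TIMEOUT value (-3) is taken from the description. Unlike the PR's setting, where 0 means "fall back to the OS timeout", this sketch treats a zero budget as already expired.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

final class BoundedConnect {
    static final int CONNECTED = 0;
    static final int CONNECT_FAILED = -1;   // illustrative value, not from the PR
    static final int CONNECT_TIMEOUT = -3;  // sentinel value named in the PR description

    /** Probes addr within the budget. The channel is closed before returning; only the outcome is reported. */
    static int connect(InetSocketAddress addr, long timeoutMillis) throws IOException {
        try (SocketChannel ch = SocketChannel.open(); Selector sel = Selector.open()) {
            ch.configureBlocking(false);            // non-blocking BEFORE connect
            try {
                if (ch.connect(addr)) {
                    return CONNECTED;               // completed immediately (common on loopback)
                }
            } catch (IOException e) {
                return CONNECT_FAILED;              // immediate failure, e.g. refused or unreachable
            }
            ch.register(sel, SelectionKey.OP_CONNECT);
            long deadline = System.nanoTime() + timeoutMillis * 1_000_000L; // monotonic clock
            while (true) {
                long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
                if (remainingMs <= 0) {
                    return CONNECT_TIMEOUT;         // budget spent without a verdict
                }
                sel.select(remainingMs);            // select(0) would block forever, hence the guard above
                sel.selectedKeys().clear();
                try {
                    if (ch.finishConnect()) {
                        return CONNECTED;
                    }
                } catch (IOException e) {
                    return CONNECT_FAILED;          // the connect completed with an error
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback)) {
            InetSocketAddress addr = new InetSocketAddress(loopback, server.getLocalPort());
            System.out.println("loopback connect rc=" + connect(addr, 2_000));
        }
    }
}
```

Recomputing the remaining budget on every loop iteration keeps spurious wakeups from extending the total wait, which is the same concern the native code addresses for EINTR.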

Sender.builder("https::addr=host:9000;connect_timeout=5000;")...   // or .connectTimeoutMillis(5000)
QwpQueryClient.fromConfig("ws::addr=host:9000;connect_timeout=5000;");

Touches: native share/net.c + windows/net.c + net.h; Net / NetworkFacade(Impl); HttpClientConfiguration.getConnectTimeout(); HttpClient.connect() / WebSocketClient.doConnect(); ConfigSchema COMMON key connect_timeout; Sender builder + both parsers; QwpWebSocketSender / QwpQueryClient (withConnectTimeout). Bounds only the TCP connect; TLS/upgrade stay under auth_timeout_ms.
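As a rough illustration of how the connect_timeout key travels in a connect string, the sketch below splits a `schema::key=value;` string and reads the timeout, defaulting to 0 (the "OS fallback" default stated for HttpClientConfiguration.getConnectTimeout()). ConnectStringSketch, parse and connectTimeoutMillis are hypothetical names; the client's real parsers and ConfigSchema are not reproduced here and may handle escaping and validation differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ConnectStringSketch {
    /** Splits "schema::k=v;k=v;" into its key/value pairs (no escaping support in this sketch). */
    static Map<String, String> parse(String conf) {
        int sep = conf.indexOf("::");
        if (sep < 0) {
            throw new IllegalArgumentException("missing '::' after schema");
        }
        Map<String, String> out = new LinkedHashMap<>();
        for (String pair : conf.substring(sep + 2).split(";")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');            // split on the first '=' so "addr=host:9000" stays intact
            if (eq <= 0) {
                throw new IllegalArgumentException("malformed pair: " + pair);
            }
            out.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
        return out;
    }

    /** Returns the configured connect timeout in milliseconds, or 0 when the key is absent. */
    static int connectTimeoutMillis(String conf) {
        String value = parse(conf).get("connect_timeout");
        if (value == null) {
            return 0;                              // unset: leave the connect to the OS-level timeout
        }
        int millis = Integer.parseInt(value);
        if (millis < 0) {
            throw new IllegalArgumentException("connect_timeout must be >= 0: " + value);
        }
        return millis;
    }

    public static void main(String[] args) {
        System.out.println(connectTimeoutMillis("ws::addr=host:9000;connect_timeout=5000;")); // prints 5000
        System.out.println(connectTimeoutMillis("ws::addr=host:9000;"));                      // prints 0
    }
}
```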


2. Ingest callbacks on the QuestDB facade

The facade built ingest senders from config strings only (SenderPool → Sender.fromConfig), so the programmatic SenderErrorHandler / SenderConnectionListener were unreachable — a facade user got the default loud-not-silent handlers with no way to observe async ingest errors or connection transitions.

QuestDB.builder()
    .ingestConfig("ws::addr=host:9000;")
    .queryConfig("ws::addr=host:9000;")
    .errorHandler(myErrorHandler)
    .connectionListener(myConnectionListener)
    .build();

QuestDBImpl / SenderPool each gain a full constructor carrying the callbacks; the white-box test-seam constructors are preserved as delegating shims. SenderPool.applyUserCallbacks() applies them to every pooled sender (non-SF and SF paths); internal recovery delegates are excluded. Defaults null.
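The wiring described above can be pictured with a small stand-alone model. This is only a sketch under stated assumptions: CallbackPoolSketch, ErrorHandler and PooledSender are hypothetical stand-ins for SenderPool, SenderErrorHandler and the pooled Sender, and the real signatures are not shown in this PR description. It illustrates the three stated rules: callbacks are applied to every pooled sender, recovery delegates are excluded, and a null callback leaves the default handler in place.

```java
import java.util.ArrayList;
import java.util.List;

final class CallbackPoolSketch {
    /** Hypothetical stand-in for the client's SenderErrorHandler. */
    interface ErrorHandler {
        void onError(String message);
    }

    /** Hypothetical pooled sender; it starts with a default handler that reports loudly. */
    static final class PooledSender {
        ErrorHandler errorHandler = msg -> System.err.println("unhandled ingest error: " + msg);

        void fail(String message) {
            errorHandler.onError(message);
        }
    }

    private final ErrorHandler userHandler;        // null means "keep the default handler"
    private final List<PooledSender> senders = new ArrayList<>();

    CallbackPoolSketch(ErrorHandler userHandler) {
        this.userHandler = userHandler;
    }

    /** Builds a user-visible sender and applies the user's callback to it. */
    PooledSender newPooledSender() {
        PooledSender sender = new PooledSender();
        applyUserCallbacks(sender);
        senders.add(sender);
        return sender;
    }

    /** Builds an internal recovery sender; user callbacks are deliberately not applied. */
    PooledSender newRecoveryDelegate() {
        return new PooledSender();
    }

    private void applyUserCallbacks(PooledSender sender) {
        if (userHandler != null) {
            sender.errorHandler = userHandler;
        }
    }

    public static void main(String[] args) {
        List<String> seen = new ArrayList<>();
        CallbackPoolSketch pool = new CallbackPoolSketch(seen::add);
        pool.newPooledSender().fail("reconnect budget exhausted");
        pool.newRecoveryDelegate().fail("internal drain error"); // goes to the default handler (stderr)
        System.out.println(seen); // prints [reconnect budget exhausted]
    }
}
```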


3. Write-only facade mode

The facade always built a reader (QueryClientPool), which prewarms synchronously and fails fast (default query_pool_min=1; the query client has no async connect). So a down server / read primary failed the whole facade build, taking the write side with it.

QuestDBBuilder.writeOnly() builds an ingest-only handle: the query pool is never created (read side can't fail startup), a query config is not required (any set is ignored), and query() / newQuery() throw a clear IllegalStateException.

// starts with no server present; query() is disabled
QuestDB db = QuestDB.builder()
    .ingestConfig("ws::addr=host:9000;initial_connect_retry=async;")
    .writeOnly()
    .build();

QuestDBImpl gains a write-only constructor + a writeOnly flag on the full constructor (12-arg test seam unchanged); PoolHousekeeper tolerates a null query pool. Pair with initial_connect_retry=async (or sender_pool_min=0) so the write side doesn't fail-fast either.
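The effect of skipping the query pool can be modelled in a few lines. This is a simplified sketch, not the PR's QuestDBImpl: WriteOnlyFacadeSketch and its nested QueryPool are hypothetical, the fail-fast prewarm is reduced to a constructor that throws when the server is down, and writes are simply buffered in a list.

```java
import java.util.ArrayList;
import java.util.List;

final class WriteOnlyFacadeSketch {
    /** Hypothetical query pool whose constructor models the synchronous, fail-fast prewarm. */
    static final class QueryPool {
        QueryPool(boolean serverUp) {
            if (!serverUp) {
                throw new IllegalStateException("query pool prewarm failed: server down");
            }
        }

        String query(String sql) {
            return "result of " + sql;
        }
    }

    private final QueryPool queryPool;             // null in write-only mode
    private final List<String> buffered = new ArrayList<>();

    WriteOnlyFacadeSketch(boolean writeOnly, boolean serverUp) {
        // In write-only mode the pool is never constructed, so the read side cannot fail startup.
        this.queryPool = writeOnly ? null : new QueryPool(serverUp);
    }

    void write(String row) {
        buffered.add(row);                         // stands in for an async sender buffering a row
    }

    int bufferedRows() {
        return buffered.size();
    }

    String query(String sql) {
        if (queryPool == null) {
            throw new IllegalStateException("write-only handle: query() is disabled");
        }
        return queryPool.query(sql);
    }

    public static void main(String[] args) {
        WriteOnlyFacadeSketch db = new WriteOnlyFacadeSketch(true, false); // builds with the server down
        db.write("trades,symbol=BTC price=1");
        System.out.println("buffered rows: " + db.bufferedRows());         // prints buffered rows: 1
        try {
            db.query("SELECT 1");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Constructing the same sketch with writeOnly=false and serverUp=false throws from the constructor, which mirrors the failure mode this section describes for the default facade.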


Testing

  • NetConnectTimeoutTest — loopback success, refused-vs-timeout disambiguation, black-hole timeout within budget.
  • QuestDBFacadeCallbacksTest — facade-wired errorHandler receives the async budget-exhaustion SenderError; connectionListener observes connection events (no server needed).
  • QuestDBWriteOnlyTest — facade builds with no server, query()/newQuery() disabled, no query config required, async warm sender buffers a write while serverless.
  • QuestDBServerRecoveryTest — full lifecycle: server down → facade starts → client writes (buffered) → server starts → write side reconnects and the reader connects on the first query. (The mock can't serve real SELECT rows, so the read step asserts the query client connects, not row contents.)

Full impl + network + facade suites pass locally. The JDK 8 CI build is green (the rebuilt native libraries ship the new connectAddrInfoTimeout symbol).

CI / native

  • ci(native): the rebuild_native_libs.yml linux-x86-64 job moved from manylinux2014 (glibc 2.17) to manylinux_2_28 (glibc 2.28), mirroring linux-aarch64 — GitHub now forces actions onto Node 24 (glibc ≥ 2.27), which couldn't run in the 2.17 container (pre-existing breakage, unrelated to the C change).
  • Native libraries rebuilt and committed for all five platforms (Rebuild CXX libraries).

Compatibility

Fully backward compatible. No behaviour change unless connect_timeout, a facade callback, or writeOnly() is used.

Note: the PR bundles three independent features. Happy to split any of them into standalone PRs off main if that's easier to review.

bluestreak01 and others added 5 commits June 28, 2026 21:41
Establish a real, cross-platform connect timeout for the HTTP and
WebSocket (QWP) transports. Previously a connect to a black-holed or
firewalled host blocked on the OS-level TCP connect timeout (often
60-120s) because the socket was created blocking and only switched to
non-blocking *after* connect; the transport exposed no knob to clamp it.

Approach: a new native primitive switches the socket to non-blocking
*before* connect, so connect() returns EINPROGRESS immediately, then
polls for writability bounded by the caller's budget and confirms the
outcome via SO_ERROR. A distinct return code (CONNECT_TIMEOUT, -3) lets
the Java layer raise a timeout-flagged exception rather than decode errno.

Native:
- share/net.c: connectAddrInfoTimeout + awaitConnectComplete (poll +
  getsockopt(SO_ERROR), monotonic-clock EINTR handling)
- windows/net.c: Winsock equivalent (select write/except sets)
- share/net.h: ECONNTIMEOUT (-3) sentinel

Java:
- Net / NetworkFacade(Impl): connectAddrInfoTimeout + CONNECT_TIMEOUT
- HttpClientConfiguration.getConnectTimeout() (default 0 = OS fallback)
- HttpClient.connect() / WebSocketClient.doConnect() honor it and throw a
  timeout-flagged HttpClientException on CONNECT_TIMEOUT
- Sender builder: connectTimeoutMillis() + connect_timeout connect-string
  key (legacy http and ws/wss parsers) + ConfigSchema COMMON key
- QwpWebSocketSender / QwpQueryClient: thread the value through to their
  WebSocketClient (adds QwpQueryClient.withConnectTimeout)

Default is unset (0): behaviour is unchanged unless connect_timeout is
configured.

Tests: NetConnectTimeoutTest covers loopback success, refused-vs-timeout
disambiguation, and a black-hole timeout that fires within budget;
config-honored drift guards updated for the new COMMON key.

On a runner with no route to TEST-NET-1 (192.0.2.0/24) connect() fails
fast with ENETUNREACH instead of dropping the SYN, so the timeout path
can't be exercised. Skip (Assume) in that case rather than asserting a
timeout, while still proving the call never blocked on the OS connect
timeout.

GitHub now forces actions onto Node 24 (glibc >= 2.27), which cannot run
inside the manylinux2014 (glibc 2.17) container the linux-x86-64 native
build used; actions/checkout failed before compilation. The old
Node-20-glibc-217 override only patched /__e/node20, not /__e/node24.

Switch the job to quay.io/pypa/manylinux_2_28_x86_64 (glibc 2.28, runs
stock Node 24) and drop the Node hack, nasm src.rpm rebuild, and manual
CMake download, mirroring the linux-aarch64 job that already builds on
manylinux_2_28.

The pooled QuestDB facade built its ingest Senders from config strings
only (SenderPool -> Sender.fromConfig), so the programmatic ingest
callbacks -- SenderErrorHandler and SenderConnectionListener -- were
unreachable: a facade user got the default loud-not-silent handlers with
no way to observe async ingest errors or connection transitions.

Expose both as QuestDBBuilder setters and thread them to every pooled
Sender:
- QuestDBBuilder.errorHandler(...) / .connectionListener(...)
- QuestDBImpl gains a full constructor carrying the callbacks; the public
  constructor forwards them and the 12-arg white-box test-seam constructor
  is preserved as a delegating shim (null callbacks).
- SenderPool gains a full constructor + applyUserCallbacks() that applies
  the callbacks to every sender it builds (both the non-SF and SF paths);
  the 8-arg test-seam constructor is preserved as a shim.

Recovery delegates (internal, short-lived, OFF-mode drain senders) are
deliberately excluded so the user's callbacks never see events from
internal machinery.

Defaults are null -> behaviour is unchanged unless a callback is set.

Tests: QuestDBFacadeCallbacksTest prewarms one ingest sender at a dead
port in async mode with a tight reconnect budget and asserts the
facade-wired errorHandler receives the budget-exhaustion SenderError and
the facade-wired connectionListener observes connection events -- no
server required.

bluestreak01 changed the title from "feat(net): add application-level TCP connect timeout" to "feat: TCP connect timeout + expose ingest callbacks on the QuestDB facade" on Jun 28, 2026

The QuestDB facade always built a reader (QueryClientPool), which prewarms
synchronously and fail-fast (default query_pool_min=1, QwpQueryClient has no
async connect). So a down server / read primary sank the whole facade build,
taking the write side with it.

Add QuestDBBuilder.writeOnly(): build an ingest-only handle that never
constructs the query pool, so the read side cannot fail startup. A query
config is no longer required in this mode (any query config set is ignored),
and query()/newQuery() throw a clear "write-only" IllegalStateException.

- QuestDBImpl gains a write-only public constructor + a writeOnly flag on the
  full constructor; the 12-arg white-box test-seam constructor stays unchanged
  (delegates with writeOnly=false). queryPool/queryThreadLocal are null in
  write-only mode.
- PoolHousekeeper tolerates a null query pool.
- QuestDBBuilder.buildWriteOnly() validates + resolves only the sender/shared
  pool knobs from the ingest config.

Pair with initial_connect_retry=async (or sender_pool_min=0) on the ingest
config so the write side does not fail-fast either -> the facade starts with
no server present.

Tests: QuestDBWriteOnlyTest proves the facade builds with no server, that
query()/newQuery() are disabled, that no query config is required, and that an
async warm sender can buffer a write while serverless.

bluestreak01 changed the title from "feat: TCP connect timeout + expose ingest callbacks on the QuestDB facade" to "feat: connect timeout, ingest callbacks, and write-only mode on the QuestDB facade" on Jun 28, 2026

…nnects

End-to-end resilience test for the QuestDB facade: build with the server
down (ingest initial_connect_retry=async + query_pool_min=0), buffer a
write, then bring the server up and assert the write side reconnects and
the previously-deferred reader connects on the first query.

Uses two TestWebSocketServers bound-but-not-accepting to model a reachable
-but-down server (handshakeCount stays 0 until start()). The mock cannot
serve real SELECT rows, so the read step asserts the query client connects
once the server is up, not the row contents. Stable across repeated runs.

bluestreak01 changed the title from "feat: connect timeout, ingest callbacks, and write-only mode on the QuestDB facade" to "feat(qwp): connect timeout, ingest callbacks, and write-only mode on the QuestDB facade" on Jun 28, 2026

Remove the committed Linux/Windows native binaries (libquestdb.so,
libquestdb.dll) and compile them locally during the Azure test CI.

- New ci/build_native.yaml template compiles libquestdb on the runner:
  Linux (cmake+nasm+build-essential) and Windows (MinGW-w64+NASM via choco).
  macOS keeps using the committed .dylib. Inits the zstd submodule first.
- Output is copied into src/main/resources/.../bin/<platform>/ so mvn install
  packages it into the client jar for both client and OSS server tests; the
  loader also picks up the CMake bin-local output directly.
- Wired the template into run_tests_pipeline.yaml before client install.

Committed binaries are still produced by the release GitHub Action.

Remove the committed darwin-aarch64/darwin-x86-64 libquestdb.dylib and build
them on the macOS runners, matching the Linux/Windows approach. No native
binaries remain committed; all are compiled during the test CI.

- build_native.yaml: add a macOS build step (brew cmake/nasm,
  MACOSX_DEPLOYMENT_TARGET=13.0), detect darwin-aarch64 vs darwin-x86-64 via
  uname -m, and copy the dylib into src/main/resources/.../bin/<platform>/.
- Init the zstd submodule on all platforms (it was skipped on Darwin).

Release artifacts are still produced by the release GitHub Action.

The macos-15 (x64) agent hardware no longer exists, so remove the mac-x64
matrix entry. macOS is now tested on mac-aarch64 only. The darwin-x86-64
.dylib is still produced by the release GitHub Action, and build_native.yaml
keeps its uname-based arch detection so an x64 macOS runner would still build
correctly if ever reintroduced.

The GitHub Actions build-jdk8 job ran the full test suite against the
committed native libraries, which are now removed. Without the .so the
io.questdb.client.std.{Os,Files,Unsafe,...} static initializers fail with
NoClassDefFound (1289 errors).

Compile the native .so from source first (zstd submodule + cmake/nasm/
build-essential), against the JDK 8 JNI headers, and copy it into
src/main/resources/.../bin/linux-x86-64 so it survives 'mvn clean' and loads
via the production bin/<platform> path. Update the now-stale comment.

glibc 2.17 moved clock_gettime() into libc under a new GLIBC_2.17 version
node. Building the release .so in a modern container (manylinux_2_28) binds
clock_gettime@GLIBC_2.17, which raises the whole library's glibc floor to 2.17
and breaks loading on glibc 2.14-2.16 hosts.

Add src/main/c/share/glibc_compat.h with a .symver directive forcing the
reference back to clock_gettime@GLIBC_2.2.5 (x86-64 glibc only; no-op on
aarch64/macOS/Windows), include it from net.c and os.c, list it in the
CMake sources, and document the glibc floor in rebuild_native_libs.yml.

The Coverage Report job runs 'mvn -P jacoco test' on core but had no native
build step, so after dropping the committed binaries it failed to load
libquestdb.so (NoClassDefFound in io.questdb.client.std.*). Add the
build_native.yaml template before the coverage test run, matching the
BuildAndTest job. The job runs on Linux, so it compiles libquestdb.so.

@mtopolnik

Contributor

[PR Coverage check]

😍 pass : 78 / 95 (82.11%)

file detail

| path | covered line | new line | coverage |
|---|---|---|---|
| 🔵 io/questdb/client/impl/PoolHousekeeper.java | 0 | 2 | 0.00% |
| 🔵 io/questdb/client/cutlass/http/client/WebSocketClient.java | 7 | 11 | 63.64% |
| 🔵 io/questdb/client/Sender.java | 11 | 16 | 68.75% |
| 🔵 io/questdb/client/cutlass/http/client/HttpClient.java | 5 | 7 | 71.43% |
| 🔵 io/questdb/client/impl/QuestDBImpl.java | 9 | 12 | 75.00% |
| 🔵 io/questdb/client/cutlass/qwp/client/QwpQueryClient.java | 11 | 12 | 91.67% |
| 🔵 io/questdb/client/network/NetworkFacadeImpl.java | 1 | 1 | 100.00% |
| 🔵 io/questdb/client/HttpClientConfiguration.java | 1 | 1 | 100.00% |
| 🔵 io/questdb/client/QuestDBBuilder.java | 18 | 18 | 100.00% |
| 🔵 io/questdb/client/impl/ConfigSchema.java | 1 | 1 | 100.00% |
| 🔵 io/questdb/client/cutlass/qwp/client/QwpWebSocketSender.java | 3 | 3 | 100.00% |
| 🔵 io/questdb/client/impl/SenderPool.java | 11 | 11 | 100.00% |
