diff --git a/design-proposals/external-database-exposure/README.md b/design-proposals/external-database-exposure/README.md
new file mode 100644
index 0000000..54aac9e
--- /dev/null
+++ b/design-proposals/external-database-exposure/README.md
@@ -0,0 +1,264 @@
+
+# External database exposure via Gateway API TLS-passthrough (SNI) and end-to-end TLS
+
+- **Title:** `External database exposure via Gateway API TLS-passthrough (SNI) and end-to-end TLS`
+- **Author(s):** `@lexfrei`
+- **Date:** `2026-06-24`
+- **Status:** Draft
+
+## Overview
+
+Today every managed database a tenant exposes externally gets its own `LoadBalancer` Service, and therefore its own public IP. A tenant with ten external databases burns ten IPs, and there is no managed end-to-end TLS story that does not terminate somewhere in the middle. This proposal exposes managed databases through the Gateway API TLS-passthrough listeners that Cozystack already operates, routed by SNI on each engine's native port. Multiple databases per tenant collapse onto a single LoadBalancer IP, and because a passthrough listener never terminates TLS, the certificate the external client validates is byte-identical to the operator-issued server certificate the database already presents inside the cluster — end-to-end TLS with no second certificate, no re-encryption hop, and no private key held at the edge.
+
+This is the design-proposal artifact required by `cozystack/cozystack#2816`, and it records the decision that issue asks for. The trade-off that issue frames is CNI mesh encryption (datapath lock-in) versus application-level TLS: this proposal chooses **application-level, operator-owned TLS carried through a non-terminating gateway** for the external leg — no edge termination, no second certificate, no private key at the edge, and no dependency on a particular CNI.
The other half of that framing — in-cluster (east-west) pod-to-pod encryption, which never leaves the cluster and is necessarily datapath-specific — is recorded and executed separately under `cozystack/cozystack#2977` (PR `cozystack/cozystack#2984`); it complements this proposal rather than competing with it. It is the external-exposure half of epic `cozystack/cozystack#2811`; the certificate/PKI half is covered by the sibling proposal `design-proposals/unified-tls-pki`, on which this one depends for the trust-anchor object. + +## Scope and related proposals + +- **Depends on:** `design-proposals/unified-tls-pki` — provides the `-ca-cert` key-free trust anchor that external clients use to verify the endpoint. This proposal does not re-specify it. It is a companion submission under the same epic, on its own branch; the path resolves once both proposals merge. The dependency is **per-engine, not blanket**: an engine is exposable here only once its `ca.crt` is actually delivered under that contract, and the path differs by engine — redis self-publishes a key-free `-ca-cert` through its forked operator, while postgres and mongodb obtain theirs through the extraction controller in `unified-tls-pki`. So SNI exposure for a given engine is gated on that engine's `unified-tls-pki` convergence, not merely on the contract existing. +- **Umbrella:** `cozystack/cozystack#2811`. This proposal covers WS4 (`cozystack/cozystack#2815`, SNI exposure) and WS5 (`cozystack/cozystack#2816`, end-to-end TLS). +- **Referenced, not designed here:** WS6 east-west / in-cluster CNI encryption (`cozystack/cozystack#2977`, PR `cozystack/cozystack#2984`). It is complementary defense-in-depth for pod-to-pod traffic and is explicitly out of scope (see Non-goals). + +All repository paths below refer to the `cozystack/cozystack` repository. + +## Context + +### TLS-passthrough already exists + +Cozystack already runs Gateway API TLS-passthrough for layer-7 system services. 
The `TenantGateway` controller (`internal/controller/tenantgateway/reconciler.go`) renders, for each entry in `TenantGateway.Spec.TLSPassthroughServices`, a listener named `tls-` on port 443 with `Protocol: TLS`, `Mode: Passthrough`, hostname `.`, and `AllowedRoutes` restricted to `TLSRoute`. The default services are `api`, `vm-exportproxy`, and `cdi-uploadproxy`, each with a `TLSRoute` that attaches by `sectionName: tls-` (for example `packages/system/cozystack-api/templates/api-tlsroute.yaml`). The platform runs Cilium 1.19.3 with GatewayClass `cilium`. `TLSRoute` is GA in the Gateway API Standard channel as `gateway.networking.k8s.io/v1` since Gateway API v1.5.0; Cilium 1.19.x still consumes it from the **experimental-channel** CRD, so that CRD must be installed and the operative API version for this design is Cilium's channel, not the upstream graduation. `mode: Passthrough` TLSRoute is supported in Cilium since 1.14. + +The mechanism this proposal needs is therefore already in production for layer-7 services. The work is to extend it to databases, which speak on native (non-443) ports. + +### Passthrough on a native port works; the real constraint is SNI-hostname overlap + +Routing a `mode: Passthrough` listener on a non-443 port is supported, not exotic: Cilium has shipped TLSRoute passthrough since 1.14, its own conformance fixtures exercise passthrough listeners on a non-443 port, and field reports route passthrough on database ports (for example a mongo `27017` case). The native port (5432/6379/27017/3306) is not the risk. + +The real constraint is **SNI-hostname uniqueness across listeners on one Gateway**. 
On every released Cilium up to and including 1.19.x, two listeners that share a hostname but differ in port — or a passthrough listener that shares a hostname with an HTTPS-terminate listener — collide and do not isolate correctly (the cross-port hostname-isolation defect; fixed by an upstream change that targets Cilium 1.20 and is not in any release tag as of this writing). This matters here because the listener model below puts every engine-type passthrough listener on the wildcard hostname `*.`, which is the same hostname the tenant Gateway's HTTPS-terminate listeners already use. This proposal keeps the flat `*.` scheme — for clean `.` connection hostnames — and therefore **depends on Cilium 1.20** for correct multi-listener isolation. If that dependency is unacceptable, the documented fallback is distinct per-engine subdomains (`*.pg.`, `*.redis.`, …), which sidestep the overlap at the cost of longer hostnames and per-engine SAN/recipe changes. Either way this is the Phase 1 validation gate (Rollout): the gate validates the shared-hostname case on the target Cilium version, not merely that passthrough binds on a non-443 port. + +### Each external database burns its own IP today + +When a database chart sets `external: true`, it provisions a dedicated `Service` of `type: LoadBalancer`: postgres `-external-write` on 5432 (`packages/apps/postgres/templates/external-svc.yaml`), redis `-external-lb` on 6379 (`packages/apps/redis/templates/service.yaml`), mongodb `-external` on 27017 (`packages/apps/mongodb/templates/external-svc.yaml`), kafka on 9094 plus a LoadBalancer per broker (`packages/apps/kafka/templates/kafka.yaml`), and mariadb via a `primaryService` of `type: LoadBalancer` (`packages/apps/mariadb/templates/mariadb.yaml`). Each is a separate IP from the shared MetalLB (or Cilium LB-IPAM) pool; there is no per-tenant IP isolation and no SNI multiplexing. 
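The shared-hostname constraint described in this Context section can be made concrete with a small check. The following is a minimal sketch, not Cilium's actual logic: it only groups listeners by identical hostname, which is the case the text calls out. The `Listener` type, the `sharedHostnames` helper, and the `https` listener name are hypothetical, introduced purely for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Listener is a simplified, hypothetical view of one Gateway listener.
type Listener struct {
	Name     string
	Port     int32
	Hostname string
}

// sharedHostnames returns, per hostname, the sorted names of the listeners
// that declare it, keeping only hostnames used by two or more listeners.
// Assumption: only identical hostnames are treated as overlapping.
func sharedHostnames(listeners []Listener) map[string][]string {
	groups := make(map[string][]string)
	for _, l := range listeners {
		groups[l.Hostname] = append(groups[l.Hostname], l.Name)
	}
	for host, names := range groups {
		if len(names) < 2 {
			delete(groups, host)
			continue
		}
		sort.Strings(names)
	}
	return groups
}

func main() {
	// Flat scheme: every engine-type listener reuses the tenant wildcard.
	flat := []Listener{
		{Name: "https", Port: 443, Hostname: "*.alice.example.org"},
		{Name: "tls-postgres", Port: 5432, Hostname: "*.alice.example.org"},
		{Name: "tls-redis", Port: 6379, Hostname: "*.alice.example.org"},
	}
	// Fallback scheme: distinct per-engine subdomains.
	fallback := []Listener{
		{Name: "https", Port: 443, Hostname: "*.alice.example.org"},
		{Name: "tls-postgres", Port: 5432, Hostname: "*.pg.alice.example.org"},
		{Name: "tls-redis", Port: 6379, Hostname: "*.redis.alice.example.org"},
	}
	fmt.Println("flat scheme shared hostnames:", sharedHostnames(flat))
	fmt.Println("fallback scheme shared hostnames:", sharedHostnames(fallback))
}
```

Under these assumptions the flat scheme reports one hostname shared by all three listeners, while the per-engine-subdomain fallback reports none, which is the property the fallback relies on.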
+ +### The certificate hooks already exist + +The external hostname is `.<_namespace.host>`, where `_namespace.host` is the tenant apex. The SAN-injection hook already exists on `main` for postgres: it adds the hostname to the CNPG `Cluster` via `spec.certificates.serverAltDNSNames` in `packages/apps/postgres/templates/db.yaml`, gated by the `tls.enabled` tri-state in `_tls.tpl` (which defaults on when `external` is true). The other engines acquire the same SAN hook through the per-app TLS series tracked by the `unified-tls-pki` proposal — on `main` today redis and mariadb carry no TLS templates at all, and their certificate/SAN support lives in the open PRs `cozystack/cozystack#2729` and `cozystack/cozystack#2680`. The trust anchor (`ca.crt`) is delivered to the tenant through that same `unified-tls-pki` contract. So this proposal builds on hooks that are present for postgres and arriving for the rest — it does not invent new ones. + +### The one-IP-per-tenant ceiling + +Multiple per-tenant Gateways cannot share a single LoadBalancer IP on current Cilium. Cilium LB-IPAM shares one IP across Services via `lbipam.cilium.io/sharing-key` only when their ports do not conflict; every Gateway claims `443/TCP`, so the shared-IP path is inactive on that port collision (`packages/extra/gateway/README.md`; the port-conflict rule is documented in the Cilium LB-IPAM docs). Within a single Gateway, a parent and its inheriting children share one IP. `Gateway.spec.listeners` is hard-capped at 64 (Gateway API CRD `MaxItems`). The practical consequence: the unit of IP sharing is one tenant Gateway, and a tenant's database fan-out draws from the 64-listener budget shared with the existing http/https/https-apex and per-child-apex wildcard listeners. + +### The problem + +Two concrete problems, both solved by SNI-passthrough plus certificate reuse: + +1. **IPv4 scarcity.** N externally-exposed databases cost N public IPs per tenant. 
This does not scale, and IPs are the scarce resource. +2. **No managed end-to-end TLS.** Exposing a database externally today either lacks a managed-TLS-to-the-client story or would require terminating at the edge — which means a second certificate, a re-encryption hop, and the edge holding a private key for a database it does not own. + +## Goals + +- Multiple external databases per tenant share a single LoadBalancer IP, distinguished by SNI on their native ports. +- End-to-end TLS using the **same** operator-issued certificate client-side and pod-side: no edge termination, no edge-held private key, no second certificate. +- Trust established through the existing `ca.crt`-only object from `unified-tls-pki` — no new PKI, issuer, or rotation machinery. +- A minimal, auditable API surface that reuses the existing passthrough listener and SAN-injection hooks. +- Per-engine honesty: ship what fits the model, defer what does not, and say why. + +### Non-goals + +- **East-west / pod-to-pod CNI encryption** — separate workstream (`cozystack/cozystack#2977`, PR `cozystack/cozystack#2984`); complementary, not designed here. +- **Kafka external SNI exposure** — deferred; its per-broker advertised-address topology does not fit a single SNI entrypoint (see the matrix). Kafka keeps today's per-broker LoadBalancer behavior. +- **MongoDB non-sharded replica-set SNI exposure** — deferred; per-member LoadBalancers plus `rs.conf` rewrite are the same class of problem as Kafka. Only the sharded `mongos` topology fits. +- **Edge TLS termination / re-encryption / `BackendTLSPolicy`** — explicitly rejected; the recorded decision lands on non-terminating passthrough. +- **New PKI, issuer, or certificate-rotation machinery** — out of scope; this reuses `unified-tls-pki` entirely. +- **Multi-Gateway single-IP sharing** — out of scope until Cilium implements ListenerSet; the design targets one IP per tenant Gateway. 
+- **Replacing `TLSPassthroughServices`** — the existing layer-7 field stays; this adds a parallel field.
+- **SNI confidentiality (ECH)** — database clients do not use ECH; hiding the external hostname is not a goal.
+- **Non-Cilium GatewayClasses** — the design assumes GatewayClass `cilium`.
+
+## Design
+
+### 1. Listener and port model
+
+```mermaid
+flowchart TB
+  subgraph ext["External clients"]
+    C1["psql<br/>SNI=db1.alice.example.org:5432"]
+    C2["psql<br/>SNI=db2.alice.example.org:5432"]
+    C3["redis-cli<br/>SNI=cache.alice.example.org:6379"]
+  end
+  subgraph gw["Tenant Gateway — ONE LoadBalancer IP, one listener per engine type"]
+    L1["listener tls-postgres<br/>5432 / Passthrough<br/>hostname *.alice.example.org"]
+    L2["listener tls-redis<br/>6379 / Passthrough<br/>hostname *.alice.example.org"]
+  end
+  subgraph rt["Per-release TLSRoutes (SNI hostname)"]
+    R1["TLSRoute db1"]
+    R2["TLSRoute db2"]
+    R3["TLSRoute cache"]
+  end
+  subgraph pods["Operator-owned database pods"]
+    P1["postgres db1<br/>presents CNPG server cert"]
+    P2["postgres db2<br/>presents CNPG server cert"]
+    P3["redis cache<br/>presents operator server cert"]
+  end
+  CA["&lt;release&gt;-ca-cert (ca.crt only)<br/>projected via tenantsecrets"]
+  C1 -->|"raw TLS, SNI"| L1
+  C2 -->|"raw TLS, SNI"| L1
+  C3 -->|"raw TLS, SNI"| L2
+  L1 -->|"SNI db1"| R1 -->|"no termination"| P1
+  L1 -->|"SNI db2"| R2 -->|"no termination"| P2
+  L2 -->|"SNI cache"| R3 -->|"no termination"| P3
+  CA -.->|"verifies"| C1
+  CA -.->|"verifies"| C3
+```
+
+A Gateway listener is keyed by the tuple (port, protocol, hostname/SNI), and **multiple `TLSRoute` objects can attach to one listener**, each selecting its backend by its own `spec.hostnames`. SNI-based routing therefore works on any TCP port, not only 443, and — crucially — it does not need a listener per database. The design uses **one passthrough listener per engine type**, on that engine's native port, with a wildcard hostname: `tls-postgres` on 5432, `tls-redis` on 6379, `tls-mongos` on 27017, `tls-mariadb` on 3306 — each `mode: Passthrough`, hostname `*.` (or unrestricted), `AllowedRoutes` limited to `TLSRoute`. Every database release of that engine then attaches a per-release `TLSRoute` carrying `spec.hostnames: ["."]`, and the Gateway SNI-routes each connection to the right backend. All of a tenant's database listeners live on the one tenant Gateway and therefore share its single IP.
+
+This keeps listener consumption at **O(engine types)** — at most four or five — rather than O(database instances). A tenant running thirty Postgres releases spends one `tls-postgres` slot, not thirty, so the 64-listener budget (§3) stops being a per-database ceiling and the consolidation goal scales with instance count, not against it.
+
+**The shared-hostname dependency is load-bearing.** Putting `tls-postgres` (5432) and `tls-redis` (6379) both on `*.` means several listeners share one hostname across ports, and that hostname also overlaps the Gateway's HTTPS-terminate listeners. As noted in Context, Cilium isolates that case correctly only from 1.20 onward, so this scheme depends on Cilium 1.20.
The per-release `TLSRoute` hostnames (`.`) are always distinct and are not the issue; the overlap is at the listener level. The flat scheme is chosen for clean connection hostnames; distinct per-engine subdomains are the documented fallback if the Cilium 1.20 dependency is unacceptable. + +Native ports are chosen over forcing everything onto 443 because the latter buys nothing: it does not improve IP consolidation (SNI already does that on any port) and it breaks client ergonomics and tooling defaults (`psql -h host` assumes 5432) while still demanding direct-TLS. The all-on-443 variant is retained as a documented opt-in for operators who want a single-port firewall surface (see Alternatives). + +**The Postgres caveat is load-bearing.** libpq has historically performed a StartTLS-style negotiation: it sends a plaintext `SSLRequest` and waits for the server's single-byte reply *before* the TLS `ClientHello`. There is no SNI in the first packet, so a passthrough listener cannot route it. This is resolved by `sslnegotiation=direct` (libpq, PostgreSQL 17+), which sends the `ClientHello` immediately, carrying SNI. It comes with hard prerequisites, not preferences: the **server must also be PostgreSQL 17+**; `sslnegotiation=direct` requires `sslmode=require` or stronger; direct SSL mandates ALPN (`postgresql`); and SNI is emitted only when the client dials by **hostname** with `sslsni=1` (the default) — an IP literal carries no SNI. Driver coverage is uneven (recent pgjdbc 42.7.x and Npgsql 9.0+ support it; node-postgres historically did not). Any client older than these, against a pre-17 server, or with `sslmode` below `require`, falls back to the legacy negotiation and cannot be SNI-routed. Postgres-over-passthrough is therefore conditional on direct-TLS-capable clients and is opt-in, not default-on. + +### 2. Certificate reuse is a property of passthrough (the WS5 core) + +Passthrough *is* the certificate reuse. 
Because the Gateway in `mode: Passthrough` does not terminate TLS, it forwards the raw handshake bytes to the backend pod. The certificate the external client validates is byte-identical to the operator-issued server certificate the database already presents internally — CNPG-managed for postgres, Strimzi-managed for kafka, PSMDB-managed for mongo. There is no second certificate, no re-encryption, and no new issuance. End-to-end TLS is a consequence of not terminating, not a feature to build. + +One subtlety the "single certificate" framing hides: an external client doing `sslmode=verify-full` (or the equivalent) validates the presented certificate's SANs against the hostname it dialed. CNPG presents one server certificate across its `-rw`/`-r`/`-ro` services with no per-listener split, so the external passthrough hostname `.` **must be in that certificate's `serverAltDNSNames`** for verify-full to pass — which is exactly what the SAN hook below injects. Engines with a per-listener certificate model (Strimzi `brokerCertChainAndKey`) could serve a different cert externally, but the fitting engines here present one server cert, so "reuse" is literal. + +WS5 adds nothing beyond two hooks that already exist: + +1. **Trust-anchor delivery** — the `-ca-cert` `ca.crt`-only object from `unified-tls-pki`, projected to the tenant. The external client verifies against that `ca.crt`. +2. **SAN coverage** — the chart already injects `.` into the operator-issued certificate (postgres `serverAltDNSNames` in `db.yaml`, gated by `tls.enabled`). The SAN the client's SNI will carry is already in the certificate. + +Stated plainly, where this does **not** work: + +- **Kafka** — the external listener advertises per-broker addresses; after bootstrap the client connects directly to each broker. A single SNI endpoint cannot represent N per-broker endpoints. Deferred. 
+- **MongoDB non-sharded** — one LoadBalancer per member plus an `rs.conf` rewrite; the driver reaches each member by its advertised address. A single SNI front cannot represent the replica-set topology. Only sharded `mongos` fits. +- **Clients that omit or mis-emit SNI** — a passthrough listener has no certificate to fall back to, so a missing SNI is a hard connection failure, not a downgrade. This includes pre-direct-TLS Postgres clients. + +### 3. IP consolidation and its limits + +The headline payoff is IPv4 scarcity relief. Today ten external databases cost ten public IPs (more for Kafka and Mongo replica sets, which add per-broker / per-member IPs). Under this proposal, N SNI-routed databases on one tenant Gateway share one IP, differentiated by SNI on their native ports. Ten databases become one IP. + +The ceiling, stated honestly: + +- `Gateway.spec.listeners` is capped at 64. Because database listeners are **one per engine type** (§1), not one per release, their contribution to that budget is bounded by the four or five supported engines no matter how many instances a tenant runs; the per-release fan-out lands on `TLSRoute` objects, which are not capped. The remaining slots go to the http/https/https-apex listeners, the per-child-apex wildcard listeners, and the three default passthrough services. A tenant therefore approaches the cap through child-apex fan-out, not database fan-out; the controller's existing advice to split a high-fanout subtree onto its own Gateway via `tenant.spec.gateway=true` still applies there. (The 64 cap is on one Gateway's inline `listeners` array; a future ListenerSet could raise the effective count, see Open questions.) +- Multi-Gateway IP sharing is not possible today (the one-IP-per-tenant blocker above). The win is one IP per tenant Gateway. 
Lifting that ceiling depends on Cilium implementing ListenerSet, which would let many tenants' listeners be consolidated onto **one shared Gateway** (and thus one IP) — it does not make N separate Gateways share an IP. ListenerSet graduated toward the Gateway API Standard channel in v1.5, but Cilium does not implement it on any stable release (tracked by the Cilium CFP `cilium#42756`, implementation PR `cilium#46303` in progress).
+
+### 4. Per-engine treatment
+
+| Engine | Native port | Fits SNI-passthrough? | Client requirement | Recommendation |
+| --- | --- | --- | --- | --- |
+| Redis | 6379 | Yes — immediate TLS, no pre-TLS negotiation | `--tls` + `ca.crt`; modern client emits SNI from a hostname | cleanest fit; ships opt-in first, default-on candidate once validated |
+| PostgreSQL (CNPG) | 5432 | Yes, only with direct-TLS | `sslnegotiation=direct` + `sslmode=verify-full` + `ca.crt`, hostname (not IP); libpq PG17+ **and** server PG17+ | opt-in (pre-PG17 client or server fails closed) |
+| MongoDB — sharded (mongos) | 27017 | Yes — single stateless endpoint | `tls=true` + `tlsCAFile`; seed pointing at the SNI hostname | opt-in (sharded only) |
+| MariaDB | 3306 | Likely — modern connectors do TLS-first with SNI | TLS connector emitting SNI + `ca.crt`; verify per connector | opt-in pending connector conformance |
+| MongoDB — replica set | 27017 | No — per-member LB + `rs.conf` rewrite | n/a | deferred |
+| Kafka (Strimzi) | 9094 | No — per-broker advertised addresses | n/a | deferred |
+
+Kafka is deferred because its wire protocol redirects: the client connects to a bootstrap endpoint, receives metadata listing per-broker advertised host:port pairs, then opens direct connections to each broker. Strimzi's `type: loadbalancer` external listener provisions a LoadBalancer per broker plus a bootstrap LB precisely for this.
A single SNI front cannot satisfy the per-broker fan-out — this is a topology problem, not a certificate one (Strimzi could serve any cert per listener). Making Kafka fit would require either per-broker SNI listeners and TLSRoutes (N listeners against the 64-listener cap, partially defeating the consolidation goal) plus rewriting Strimzi's advertised listeners, or a Kafka-aware re-advertising proxy — both out of scope. Revisit if per-broker SNI proves worth the listener budget.
+
+### 5. API and controller extension
+
+The trigger and listener synthesis stay in the controller; database charts do not render Gateway plumbing. Database charts already receive the `_cluster` values channel and read parts of it (for example `packages/apps/postgres/templates/db.yaml` reads `_cluster.scheduling`, and mongodb reads `_cluster["cluster-domain"]`), but they do not read the gateway-discovery keys that `cozystack-api` consumes (the gateway-enabled flag and the gateway name) and have no logic to locate the tenant Gateway. Teaching every database chart that discovery dance would duplicate it across five charts and couple application charts to networking topology. The controller already owns listener synthesis; keep it there.
+
+Two objects implement the model, at different cardinalities: one shared listener per engine type, and one `TLSRoute` per database release.
+
+The **listener** is the new API surface. The existing `TLSPassthroughServices []string` field (`api/gateway/v1alpha1/tenantgateway_types.go`) is too weak — a bare service name hardcodes the layer-7 convention of port 443 and hostname `.`. Add one structured field alongside it, leaving the existing field untouched for backward compatibility:
+
+```go
+// TLSPassthroughListener declares one shared passthrough listener for a
+// database engine type, on its native port, with a wildcard SNI hostname.
+// Every release of that engine attaches its own TLSRoute by SNI hostname.
+//
+// Validation contract (enforced by CEL / admission):
+//   - Name: DNS-1123 label; unique within the list; renders as "tls-".
+//   - Port: 1..65535; unique within the list; must not collide with a
+//     synthesized layer-7 listener port (443) or another entry.
+//   - Hostname: optional; a valid DNS wildcard or exact hostname; defaults
+//     to "*." when empty.
+type TLSPassthroughListener struct {
+	Name     string `json:"name"`               // listener suffix -> "tls-" (e.g. "postgres")
+	Port     int32  `json:"port"`               // native port (5432/6379/27017/3306)
+	Hostname string `json:"hostname,omitempty"` // wildcard SNI match, default "*."
+}
+
+// TLSPassthroughListeners renders one Passthrough listener "tls-" on
+// .Port per entry, AllowedRoutes restricted to TLSRoute. Independent of the
+// layer-7 TLSPassthroughServices field.
+// +optional
+TLSPassthroughListeners []TLSPassthroughListener `json:"tlsPassthroughListeners,omitempty"`
+```
+
+The **route** is a standard Gateway API `TLSRoute` (no new type), rendered once per exposed database release: `spec.parentRefs` attaches to the shared `tls-` listener by `sectionName`, `spec.hostnames: ["."]` carries the SNI match, and `spec.rules[].backendRefs` is a standard `BackendRef` (whose embedded `BackendObjectReference` points at the database Service on its native port, cross-namespace via `ReferenceGrant`). `TLSRoute` is GA as `gateway.networking.k8s.io/v1` since Gateway API v1.5.0, but Cilium 1.19.x consumes the experimental-channel CRD — pin to whatever API version the targeted Cilium ships, not the upstream `v1` graduation.
+
+The controller change is to render, per `TLSPassthroughListeners` entry, one Passthrough listener on the supplied `Port` instead of the hardcoded 443, and to attach each per-release `TLSRoute` by `sectionName: tls-` — the same attachment pattern as the existing `api-tlsroute.yaml`, except that many routes share one listener and the Gateway disambiguates them by SNI hostname.
Both are populated by the Tenant / HelmRelease orchestration layer that already knows the tenant Gateway and the database release — not the database chart and not the human — so a database's surface stays a single `external`-adjacent toggle. The engine-type listener is created on first exposure of that engine and removed once its last release is gone; per-release add/remove only touches the route, never the shared listener. + +### 6. Trust-anchor and SAN flow + +End to end: the chart injects the external hostname `.` into the operator-issued certificate's SAN list (gated by the `tls.enabled` tri-state, which defaults on with `external`). The operator issues the server certificate with that SAN. The `ca.crt`-only object is projected to the tenant. The external client connects with SNI `.` and verifies the presented certificate against that `ca.crt`. The passthrough listener forwards the raw handshake; the pod presents the SAN-matching certificate. No hop in the path terminates TLS. + +## User-facing changes + +A database gains an `external`-adjacent toggle to select passthrough/SNI mode. Per-engine connection recipes are documented: `psql "sslnegotiation=direct sslmode=verify-full sslrootcert=ca.crt host=."` (libpq and server both PG17+), `redis-cli --tls --cacert ca.crt -h .`, `mongosh --tls --tlsCAFile ca.crt --host .`, `mysql --ssl-mode=VERIFY_IDENTITY --ssl-ca=ca.crt -h .`. Kafka and non-sharded MongoDB keep today's per-LoadBalancer behavior. + +## Upgrade and rollback compatibility + +The default stays today's per-database LoadBalancer, so no existing external database changes its IP on upgrade. Passthrough/SNI mode is opt-in. Migrating an existing external database to SNI mode is a breaking, opt-in change — the IP changes and the client must be reconfigured (new host, direct-TLS for Postgres); it is not automatic. 
The existing `TLSPassthroughServices` field continues to work unchanged, and reverting the feature removes the per-release routes and the engine-type listeners without touching the database's own PKI. + +## Security + +The edge never holds the database's private key — the central strength of passthrough over termination. SNI is sent in cleartext on TLS 1.3 except under ECH, which database clients do not use, so the external hostname `.` is observable on the wire; this is no worse than DNS or SNI exposure for any TLS service, and the payload stays encrypted end-to-end. A missing or mis-emitted SNI is a hard connection failure (a passthrough listener has no certificate to fall back to) — fail-closed, which is the secure default. Exposing a database externally with TLS explicitly off is not silently corrected: when `tls.enabled` is left unset the tri-state turns TLS on together with `external`, but an explicit `tls.enabled: false` combined with `external: true` is **rejected by a ValidatingAdmissionPolicy** — the same admission mechanism that already enforces tenant hostname/IP isolation — so a tenant cannot stand up a plaintext external endpoint by accident or by overriding the default, and the conflict surfaces as a clear admission error rather than a silent override the operator never sees. Cross-namespace `backendRef` requires a `ReferenceGrant`, consistent with the existing attached-namespaces model. Because trust rides the `ca.crt`-only object, clients never receive private key material. + +## Failure and edge cases + +- A pre-PG17 (client or server) or non-direct-TLS Postgres client sends no SNI → no route → connection reset/timeout (document the symptom). +- The 64-listener budget is exceeded → the listener is rejected; because listeners are one per engine type this is reached only through child-apex fan-out, and the mitigation is to split that subtree onto its own Gateway. 
+- A passthrough listener shares a hostname with a terminate listener on Cilium <1.20 → routing does not isolate correctly; mitigated by the Cilium 1.20 dependency or the distinct-subdomain fallback, and caught by the Phase 1 gate. +- An explicit `tls.enabled: false` together with `external: true` → rejected at admission by a ValidatingAdmissionPolicy, not silently overridden; an unset `tls` tri-state auto-enables TLS with `external`. +- An operator expects multi-Gateway IP sharing → each Gateway still gets its own IP (expected under the Cilium constraint). +- A database is deleted → its per-release `TLSRoute` is removed; the shared engine-type listener is removed only when its last release is gone, so a single deletion never orphans a listener. + +## Testing + +- Helm-template assertions that the certificate SAN includes `.` per engine, mirroring the existing TLS test fixtures. +- A controller unit test that a `TLSPassthroughListeners` entry renders a shared listener on the native port with `mode: Passthrough` and `AllowedRoutes` restricted to `TLSRoute`, and that two per-release `TLSRoute` objects on that one listener SNI-route to their respective backends. +- An admission test that `tls.enabled: false` with `external: true` is rejected by the ValidatingAdmissionPolicy, while an unset `tls` is admitted and auto-enables; and that the listener validation contract (name/port uniqueness, port range, 443 collision) is enforced. +- An end-to-end test per fitting engine: connect from outside the cluster with SNI and `ca.crt`, and assert that the serial of the presented certificate equals the operator-issued internal certificate — proving reuse, not re-issuance. +- A negative test: a client without SNI fails closed. + +## Rollout + +1. API field plus controller listener rendering, no engine wired. 
**Gate:** on the target Cilium version, stand up a `mode: Passthrough` listener on a native database port (for example 5432) whose hostname `*.` overlaps the Gateway's terminate listeners, and confirm it SNI-routes to the right backend with exactly one Envoy filter chain. The risk is not the non-443 port (supported since Cilium 1.14) but the shared-hostname isolation that lands in Cilium 1.20; this gate must pass on the version actually deployed before any engine is wired, or the distinct-subdomain fallback adopted. +2. Redis (the cleanest fit) behind an opt-in; default-on candidate once the gate and client-emits-SNI behavior are confirmed in practice. +3. PostgreSQL (direct-TLS) and sharded MongoDB, opt-in. +4. MariaDB after connector-conformance validation. + +Kafka and non-sharded MongoDB are explicitly out of this rollout. Each phase ships documentation, a connection recipe, and an end-to-end gate. + +## Open questions + +- **Cilium shared-hostname isolation on the target version** — the listener model overlaps `*.` across the engine-type passthrough listeners and the HTTPS-terminate listeners, which Cilium isolates correctly only from 1.20. Do we pin Cilium 1.20 as a hard prerequisite for this feature, or ship the distinct-per-engine-subdomain scheme so it runs on 1.19.x? This is the single biggest implementation risk and the Phase 1 gate. +- **MariaDB / MySQL connector conformance** — do the dominant connectors (libmysqlclient, MariaDB Connector/C, JDBC, Go drivers) emit SNI and do TLS-first such that passthrough routes correctly, or does any do a server-greeting-then-STARTTLS dance like pre-direct-TLS Postgres? Default-on versus opt-in for MariaDB hinges on this. +- **Direct-TLS client/server floor for Postgres** — what fraction of the tenant base predates `sslnegotiation=direct` on either client (libpq PG17+, driver support) or server (PG17+)? Is a per-release attestation gate enough, given there is no graceful downgrade on a passthrough listener? 
+- **TLSRoute ownership** — does the controller synthesize the per-release `TLSRoute` alongside the shared listener (single source of truth, but the controller must know the backend Service), or does the database chart render it (the chart owns the backend selector, but must learn a gateway name/namespace it does not have today)? This decides whether database charts gain gateway-discovery logic.
+- **Who writes the listener/route set** — the orchestration layer auto-creating the engine-type listener on first exposure and a per-release route on each `external: true` + passthrough, an operator-managed explicit list, or both; and which reconciliation owns route add/remove plus reference-counting the shared listener down to zero on the last database deletion.
+- **64-listener budget accounting** — with the parent listeners, per-child-apex wildcards, and default passthrough services now joined by one listener per database engine type (not per instance), what is the realistic per-tenant ceiling, and should the controller surface a status condition as the budget nears 64?
+- **ListenerSet timeline** — consolidating onto one shared Gateway/IP is blocked on Cilium implementing ListenerSet (CFP `cilium#42756`, PR `cilium#46303`). Do we design the field and status to be ListenerSet-ready now, or revisit when Cilium ships it? This determines whether one-IP-per-tenant is a permanent or temporary ceiling.
+
+## Alternatives considered
+
+- **Edge TLS termination plus re-encryption (`BackendTLSPolicy`).** Rejected: the edge holds keys, there are two certificates, and it defeats reuse. This is the recorded decision WS5 asks for — passthrough chooses application-level, CNI-agnostic TLS over edge termination.
+- **Distinct per-engine subdomains for the passthrough listeners** (`*.pg.`, `*.redis.`, …). Sidesteps the Cilium shared-hostname isolation defect on 1.19.x without waiting for 1.20, but at the cost of longer connection hostnames and per-engine SAN/recipe changes.
Retained as the documented fallback; the flat `*.` scheme is preferred for clean hostnames where Cilium 1.20 is available.
+- **All-on-443 SNI.** Mirrors the existing three services exactly with no new port plumbing, retained as a documented opt-in, but rejected as the default because of its client-ergonomics and direct-TLS-everywhere tax.
+- **Route-driven implicit listeners** — the controller watches `TLSRoute` objects (it already does) and auto-synthesizes a passthrough listener for any route targeting a non-443 port. Less API surface, but it inverts the explicit-intent model and makes the listener set implicit and hard to audit. Offered as a future ergonomic improvement; the explicit field is recommended for v1.
+- **Per-broker SNI for Kafka.** Rejected for now (see the matrix); revisit if the listener-budget cost is justified.
+- **Status quo: a LoadBalancer per database.** Rejected: this is the IPv4-scarcity problem this proposal exists to solve.
+
+---
+
+
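The listener lifecycle described in the edge cases and open questions (one `TLSRoute` per release, one shared listener per engine type, reference-counted down to zero on the last deletion) can be sketched as below. This is a minimal illustration only: `ListenerRegistry`, `expose`, and `unexpose` are hypothetical names, not part of the Cozystack controller, and the sketch ignores Kubernetes objects entirely. It tracks releases in a set rather than an integer counter, on the assumption that reconciliation may replay the same event.

```python
from dataclasses import dataclass, field


@dataclass
class ListenerRegistry:
    """Hypothetical in-memory model of shared engine-type listeners.

    Maps an engine type (for example "postgres") to the set of release
    names that currently hold a per-release route on its shared listener.
    """

    routes: dict[str, set[str]] = field(default_factory=dict)

    def expose(self, engine: str, release: str) -> bool:
        """Add a per-release route; return True if the listener was just created."""
        created = engine not in self.routes
        self.routes.setdefault(engine, set()).add(release)
        return created

    def unexpose(self, engine: str, release: str) -> bool:
        """Remove a per-release route; return True if the listener was removed."""
        releases = self.routes.get(engine)
        if releases is None:
            # No listener for this engine type: nothing to remove.
            return False
        releases.discard(release)
        if releases:
            # Other releases still route through the shared listener.
            return False
        # Last release gone: the shared listener can be dropped.
        del self.routes[engine]
        return True

    def listeners(self) -> list[str]:
        """Engine types that currently have a shared listener, sorted."""
        return sorted(self.routes)
```

Under these assumptions, exposing two PostgreSQL releases creates the listener once, deleting the first leaves the listener in place, and deleting the second removes it. Repeating either operation for the same release changes nothing.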