feat(migrate-prod): prod → multiplayer migration (ETL + plan + runbook) by cevian · Pull Request #79 · timescale/memory-engine

cevian · 2026-06-18T10:57:59Z

One-time migration of production from the old org/engine/role + RLS model
(deployed at server/v0.2.5, SHA a6cfabf) to the new auth/core/space model
introduced in #71. Verified against live prod (read-only).

Topology

Production runs two separate clusters — DB_ACCOUNTS (identity) and
DB_SHARD (memories, one me_<slug> schema per engine) — and the ETL writes a
third, new database (auth + core + per-space me_<slug>). Sources are
read-only; rollback = repoint the app at them.

What's here

- PROD_MIGRATION_PLAN.md: design, full old→new mapping, decisions, run
  procedure, and the §9 survey results.
- PROD_MIGRATION_RUNBOOK.md: cutover (pre-flight, rollback, modes,
  decommission, cross-DB verification SQL).
- packages/migrate-prod: the ETL (migrateProdToMultiplayer(conns) over
  {accounts, shard, target}; run.ts runner) + survey.ts (the read-only
  §9 tool). Reuses the new code's own provisioning + core SQL functions.

Approach

Fresh target DB (no collisions; slugs reused verbatim). Provision auth/core,
migrate identities; per engine create+provision the space, build roster/grants,
and stream-copy memories cross-database (meta via sql.json). Run the ETL
first, then point DATABASE_URL at the target (the app's idempotent boot
migration becomes a no-op, dodging the helm --wait --atomic crashloop).

Key outcomes

- Identities → auth.users + core.principal (id preserved); oauth links;
  sessions migrate (same sha256(token) bytea, so users stay logged in).
- Engines → spaces; org roster → space roster; org owner/admin → admin + owner@root;
  tree_owner/tree_grant → grants (lossy, over-permissive, documented); RBAC
  roles → groups.
- Service users → an agent owned by the engine's org owner (confirmed: each is
  the owner's coding agent), with grants preserved.
- API keys do NOT migrate (argon2, unrecoverable); agents must re-issue. This is
  the one hard, user-visible break.
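
A minimal sketch of why copying the session hash verbatim keeps users logged in: an unsalted SHA-256 over the token is deterministic, so the hash the new server computes from a presented token equals the bytes copied from the old database. The function names and the UTF-8 encoding of the token are assumptions for illustration, not taken from the PR's code.

```typescript
import { createHash } from "node:crypto";

// Plain SHA-256 over the token, returned as raw bytes (what a bytea column holds).
// Assumption: the token is hashed as UTF-8 text with no salt.
function sessionTokenHash(token: string): Buffer {
  return createHash("sha256").update(token, "utf8").digest();
}

// Hypothetical check: a hash copied verbatim from the old database still
// matches what is computed from the token the user presents after cutover.
function sessionStillValid(presentedToken: string, migratedHash: Buffer): boolean {
  return sessionTokenHash(presentedToken).equals(migratedHash);
}

// Stand-in for a row copied from the source sessions table.
const oldRow = sessionTokenHash("example-session-token");
console.log(sessionStillValid("example-session-token", oldRow)); // true
```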

§9 — verified against live prod (read-only)

- Two distinct physical clusters confirmed → cross-DB ETL correct. No DDL drift.
- 32 identities, 34 active engines (1 deleted), 62,111 memories, 0 orphans,
  0 pending invites, only 4 memories without embeddings.
- 33/34 orgs are single-owner/single-engine (trivial path); 1 multi-member org;
  1 RBAC role; 6 service users (all owners' coding agents) → mapped to agents.

Tested

tsc + lint clean; 14 integration + 5 unit pass. The integration test stands
in one physical DB for all three connections and covers the simple + complex
scenarios (multi-member org, RBAC role→group, service-user→agent, grants, dangling
identity, invitations, deleted/orphan engines).

🤖 Generated with Claude Code

One-time, in-place migration from the old org/engine/role + RLS model (deployed at server/v0.2.5) to the new auth/core/space model (PR #71), all within the single existing database.

PROD_MIGRATION_PLAN.md captures the full old→new mapping, decisions (in-place, reuse slugs, rename-aside), the phased run procedure, and the "verify against live prod" checklist (no DB access yet; drafted off code).

packages/migrate-prod implements it, reusing the new code's own provisioning (migrateAuth/migrateCore/provisionSpace) and core SQL functions rather than re-implementing DDL:

- Phase A: provision auth+core beside the live accounts schema; migrate identities → auth.users + core.principal (id preserved), oauth links, and live sessions (token_hash copied verbatim; same sha256 scheme).
- Phase B (per engine, one txn): rename old me_<slug> aside, provision a fresh one, build the roster + tree-access grants from org membership / superuser / tree_owner / tree_grant / role_membership, same-DB copy memories (carrying embeddings).
- Phase C: explicit dropLegacy/dropAccounts teardown.

Not migrated (by constraint): api keys (argon2, unrecoverable; agents re-issue), oauth tokens, device-flow rows. Grant {actions}→level mapping is intentionally lossy/over-permissive (documented).

Tested end-to-end against a real Postgres (simple + complex scenarios: multi-member org, RBAC role→group, explicit grants, dangling identity, invitations, deleted/orphan engines) plus unit tests for the mapping.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Operational step-by-step for executing the migration: pre-flight checklist (privileges, backup, §9 verification), rollback plan, the maintenance-window mode (recommended) vs per-engine zero-downtime mode, Phase-C teardown, and reconciliation/verification SQL. Grounds the steps in how prod actually deploys/migrates: the new server auto-migrates idempotently on boot (so run the ETL first to avoid the helm --wait --atomic crashloop), and connects via DATABASE_URL with a temporary ENGINE_DATABASE_URL fallback (so the single-DB cutover needs no chart connection change). Flags the one hard break (API-key re-issue) and the cross-schema privilege requirement for the ETL connection. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Prod actually runs two separate databases (DB_ACCOUNTS + DB_SHARD), not one, so the migration now targets a brand-new third database instead of an in-place schema swap.

- ETL takes three connections {accounts, shard, target}; sources are read-only.
- No rename-aside / collision handling (sources and target are different DBs); each engine slug is reused verbatim as its space slug.
- Memories are copied cross-database by streaming (cursor over DB_SHARD → batched insert into the target) instead of insert…select; meta re-sent via sql.json to dodge the postgres.js text-in-jsonb double-encoding footgun.
- Removed the dropLegacy/dropAccounts teardown helpers: sources are never modified, so rollback is just repointing the app at the old databases, and decommissioning them is out of band.
- run.ts reads DB_ACCOUNTS / DB_SHARD / DATABASE_URL(target).
- Plan + runbook rewritten for the three-DB topology (fresh target, cross-DB copy, repoint-to-sources rollback, chart DB-secret repoint, cross-DB verification queries).

Tests updated: the integration test stands in one physical DB for all three connections (source schemas carry a distinct prefix so they don't collide with the target). typecheck + lint clean; 13 integration + 5 unit pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
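
For illustration, the three-connection shape described in this commit could be resolved from the environment roughly as follows. The env var names come from the commit message; the type and function names (MigrationConnUrls, readConnUrls) are hypothetical, and the sketch only validates presence rather than opening any connection.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical shape mirroring {accounts, shard, target} from the commit message.
interface MigrationConnUrls {
  accounts: string; // DB_ACCOUNTS: read-only source (identity)
  shard: string;    // DB_SHARD: read-only source (memories)
  target: string;   // DATABASE_URL: the fresh third database
}

// Resolve all three URLs up front and fail fast if any is missing,
// so a partial configuration never starts a migration.
function readConnUrls(env: Env): MigrationConnUrls {
  const need = (name: string): string => {
    const value = env[name];
    if (!value) throw new Error(`missing required env var ${name}`);
    return value;
  };
  return {
    accounts: need("DB_ACCOUNTS"),
    shard: need("DB_SHARD"),
    target: need("DATABASE_URL"),
  };
}

// Example with placeholder URLs (not real hosts).
const exampleConns = readConnUrls({
  DB_ACCOUNTS: "postgres://accounts.example/db",
  DB_SHARD: "postgres://shard.example/db",
  DATABASE_URL: "postgres://target.example/db",
});
console.log(Object.keys(exampleConns).join(",")); // accounts,shard,target
```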

Added survey.ts: a READ-ONLY §9 verification tool (DB_ACCOUNTS + DB_SHARD) that checks DDL drift and surveys the data shape. Run against live prod:

- two distinct physical clusters confirmed → cross-DB ETL is correct
- no DDL drift; 32 identities, 34 active engines, 62,111 memories, 0 orphans
- 33/34 orgs single-owner; 1 multi-member org; 1 RBAC role
- 6 service users (login, no identity), all confirmed to be each owner's own coding agents (claude/codex/sidekick/…)

Service-user handling (decision): map each to a kind='a' agent owned by the engine's org owner, joined to the space, with its grants re-created (clamped under the owner's owner@root). Dangling identities (none in prod) still drop with a warning. Memory copy kept per-row (cursor fetched in batches).

Fixture + test cover the new service-user→agent path; plan updated with the §9 results and the decision (§4.1, §10). typecheck + lint clean; 14 integration + 5 unit pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Drop the per-engine zero-downtime mode (Mode B) and the modes section; the runbook is now a single linear maintenance-window cutover. Fold in the §9 survey facts: humans keep working (sessions migrate) so only agents (former service users) need re-issued keys; expected reconciliation numbers (~32 users, 34 active engines, ~62k memories, 0 orphans, 0 skipped/warnings); note the row-by-row copy runtime. Fix the plan's stale runbook cross-refs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The target needs only an empty database (+ creatable/installed extensions + a schema-creating role), not a pre-migrated one. The ETL runs migrateAuth/migrateCore + provisionSpace itself. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add MigrateOptions.engineSlugs (run.ts: MIGRATE_ENGINES env) to restrict Phase B to a subset of engines — Phase A still migrates all identities. Lets you smoke test the ETL against a throwaway target with the real (read-only) prod sources: a fast few-engine pass first, then a full rehearsal. Requested slugs that aren't active engines are reported in skippedEngines. Runbook §0 documents the rehearsal procedure (incl. target reset SQL for re-runs). Tests cover the filter + the not-found-slug case. typecheck + lint clean; 16 integration + 5 unit pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
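
A rough sketch of the subset filter described above. Only engineSlugs, MIGRATE_ENGINES and skippedEngines are names from the commit; the helper names and the comma-separated format of MIGRATE_ENGINES are assumptions.

```typescript
// Parse MIGRATE_ENGINES into a slug list. Assumption: comma-separated slugs;
// unset or empty means "no filter" (migrate every active engine).
function parseEngineSlugs(raw: string | undefined): string[] | undefined {
  if (!raw) return undefined;
  const slugs = raw.split(",").map((s) => s.trim()).filter((s) => s.length > 0);
  return slugs.length > 0 ? slugs : undefined;
}

// Restrict Phase B to the requested slugs. Requested slugs that are not
// active engines are reported back instead of failing the run.
function selectEngines(
  activeSlugs: string[],
  engineSlugs?: string[],
): { selected: string[]; skippedEngines: string[] } {
  if (!engineSlugs) return { selected: [...activeSlugs], skippedEngines: [] };
  const active = new Set(activeSlugs);
  return {
    selected: engineSlugs.filter((slug) => active.has(slug)),
    skippedEngines: engineSlugs.filter((slug) => !active.has(slug)),
  };
}

// Example with made-up slugs: one requested engine is not active.
const smoke = selectEngines(["alpha", "beta", "gamma"], parseEngineSlugs("alpha, nope"));
console.log(smoke); // { selected: [ 'alpha' ], skippedEngines: [ 'nope' ] }
```

Reporting unknown slugs rather than throwing fits the rehearsal use case: a typo in the env var surfaces in the run summary without aborting the identities pass.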

A subset-aware, read-only verifier across DB_ACCOUNTS/DB_SHARD/target: identity↔user counts + the auth.users == core.principal invariant, oauth/session copy, per-space memory counts (target vs source shard), ≥1 admin per space, every member's effective build_tree_access non-empty, and the Tiger-Den access-parity spot-check (owner→owner@root, member→group grants). Prints a ✓/✗ checklist, exits non-zero on failure. Runbook §5 points at it. Verified the smoke-test target (Tiger Den + one small engine): all 12 checks passed: 32 identities/users/principals, 18 sessions, memory counts match, and the role→group access resolves for the collaborator. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
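
The checklist-and-exit-code behavior can be sketched as below. This is an illustrative shape only: the Check type, renderChecklist, and the sample check names are hypothetical, and the real verifier gathers its results from database queries rather than literals.

```typescript
// Hypothetical result of one verification step.
interface Check {
  name: string;
  ok: boolean;
  detail?: string;
}

// Render a ✓/✗ line per check and derive a process exit code:
// 0 when everything passed, 1 when at least one check failed.
function renderChecklist(checks: Check[]): { lines: string[]; exitCode: number } {
  const lines = checks.map(
    (c) => `${c.ok ? "✓" : "✗"} ${c.name}${c.detail ? ` (${c.detail})` : ""}`,
  );
  const failed = checks.filter((c) => !c.ok).length;
  lines.push(`${checks.length - failed}/${checks.length} checks passed`);
  return { lines, exitCode: failed === 0 ? 0 : 1 };
}

// Example with made-up results; a real run would end with process.exit(exitCode).
const report = renderChecklist([
  { name: "auth.users == core.principal", ok: true, detail: "32 == 32" },
  { name: "memory count matches source", ok: false, detail: "10 != 12" },
]);
console.log(report.lines.join("\n"));
```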

Replace the per-row cross-DB insert with one batched insert per cursor fetch: read every column as text, then `insert … select … from unnest($1::text[], …)` with scalar casts in the projection (meta::jsonb, tree::ltree, temporal::tstzrange, embedding::halfvec). Validated the cast pattern locally (jsonb objects, halfvec, ltree incl. root, tstzrange, nulls all preserved). Cuts ~62k target round-trips to ~125 → the memory copy drops from tens of minutes to a few. Behavior unchanged: the row-level enqueue trigger still fires per inserted row (null-embedding rows enqueue), counts/embeddings/tree paths identical. Docs (plan §5, runbook) updated; 16 integration + unit suite green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
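
A sketch of the batching pattern this commit describes: rows read as text are transposed into one array per column, and a single statement unnests them with casts in the projection. The table name, the id and content columns, the uuid cast, and the batch size of 500 are assumptions; only the meta, tree, temporal and embedding casts come from the commit message.

```typescript
// One source row with every column already read as text (null stays null).
interface MemoryTextRow {
  id: string;               // hypothetical column
  content: string | null;   // hypothetical column
  meta: string | null;      // JSON text, cast to jsonb on insert
  tree: string | null;      // ltree text
  temporal: string | null;  // tstzrange text
  embedding: string | null; // halfvec text
}

const COLUMNS = ["id", "content", "meta", "tree", "temporal", "embedding"] as const;

// One statement per cursor fetch; "me_example.memory" is a placeholder table.
const BATCH_INSERT_SQL = `
  insert into me_example.memory (id, content, meta, tree, temporal, embedding)
  select t.id::uuid, t.content, t.meta::jsonb, t.tree::ltree,
         t.temporal::tstzrange, t.embedding::halfvec
  from unnest($1::text[], $2::text[], $3::text[], $4::text[], $5::text[], $6::text[])
    as t(id, content, meta, tree, temporal, embedding)`;

// Transpose a fetched batch into one text[] parameter per column, in COLUMNS order.
function toColumnArrays(rows: MemoryTextRow[]): (string | null)[][] {
  return COLUMNS.map((col) => rows.map((row) => row[col]));
}

// Number of target round-trips for a given row count and batch size.
function roundTrips(rowCount: number, batchSize: number): number {
  return Math.ceil(rowCount / batchSize);
}

// With an assumed batch size of 500, 62,111 rows need 125 inserts,
// which lines up with the "~125" figure quoted above.
console.log(roundTrips(62_111, 500)); // 125
```

Because batches are fetched per engine, the real count would be somewhat higher than a single global ceiling; the arithmetic is only meant to show the order of magnitude.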

Full rehearsal against a test target (real prod sources, read-only): 34 spaces, 62,111 memories, 0 skipped, 0 warnings; verify.ts 114/114 checks pass (counts reconcile per space incl. the 20,862-row one; service-user→agent confirmed). Wall-clock 14m36s but ~3% CPU — I/O-bound (~1.8 GB of halfvec over the WAN), so run the real cutover in-region. Runbook timing notes corrected from "a few minutes" to the measured number. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
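
As a back-of-the-envelope check on the rehearsal numbers, the quoted wall-clock time and data volume imply a low sustained transfer rate, consistent with the I/O-bound reading. The sketch assumes "~1.8 GB" means 1.8 × 10^9 bytes and treats the whole run as memory copy time, which slightly understates the copy throughput.

```typescript
const wallSeconds = 14 * 60 + 36; // 14m36s = 876 s
const bytesMoved = 1.8e9;         // assumption: "~1.8 GB" taken as 1.8 * 10^9 bytes
const memoryRows = 62_111;

const mbPerSecond = bytesMoved / wallSeconds / 1e6;
const rowsPerSecond = memoryRows / wallSeconds;

console.log(mbPerSecond.toFixed(2));   // "2.05"
console.log(rowsPerSecond.toFixed(1)); // "70.9"
```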

cevian and others added 3 commits June 22, 2026 15:48

cevian force-pushed the prod-multiplayer-migration branch from 6a06e0c to 3bce9e1 Compare June 22, 2026 13:49

cevian and others added 7 commits June 22, 2026 16:08

cevian wants to merge 10 commits into main from prod-multiplayer-migration
