feat(migrate-prod): prod → multiplayer migration (ETL + plan + runbook)#79

Draft
cevian wants to merge 10 commits into main from prod-multiplayer-migration

@cevian cevian commented Jun 18, 2026

One-time migration of production from the old org/engine/role + RLS model
(deployed at server/v0.2.5, SHA a6cfabf) to the new auth/core/space model
introduced in #71. Verified against live prod (read-only).

Topology

Production runs two separate clusters, DB_ACCOUNTS (identity) and
DB_SHARD (memories, one me_<slug> schema per engine), and the ETL writes a
third, new database (auth + core + per-space me_<slug>). Sources are
read-only; rollback = repoint the app at them.

What's here

  • PROD_MIGRATION_PLAN.md — design, full old→new mapping, decisions, run
    procedure, and the §9 survey results.
  • PROD_MIGRATION_RUNBOOK.md — cutover: pre-flight, rollback, modes,
    decommission, cross-DB verification SQL.
  • packages/migrate-prod — the ETL (migrateProdToMultiplayer(conns) over
    {accounts, shard, target}; run.ts runner) + survey.ts (the read-only
    §9 tool). Reuses the new code's own provisioning + core SQL functions.

Approach

Fresh target DB (no collisions; slugs reused verbatim). Provision auth/core,
migrate identities; per engine create+provision the space, build roster/grants,
and stream-copy memories cross-database (meta via sql.json). Run the ETL
first, then point DATABASE_URL at the target (the app's idempotent boot
migration becomes a no-op, dodging the helm --wait --atomic crashloop).

Key outcomes

  • Identities → auth.users + core.principal (id preserved); oauth links;
    sessions migrate (same sha256(token) bytea — users stay logged in).
  • Engines → spaces; org roster → space roster; org owner/admin → admin + owner@root;
    tree_owner/tree_grant → grants (lossy, over-permissive, documented); RBAC
    roles → groups.
  • Service users → an agent owned by the engine's org owner (confirmed: each is
    the owner's coding agent), with grants preserved.
  • API keys do NOT migrate (argon2, unrecoverable) — agents must re-issue. The
    one hard, user-visible break.
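Why sessions survive the cutover can be shown in a few lines: if both models store the same digest of the bearer token, a row copied byte for byte still matches the lookup made after the switch. This is a minimal sketch, assuming the digest is plain SHA-256 over the UTF-8 token with no salt (the description only says "same sha256(token) bytea"); `hashSessionToken` is a hypothetical name, not a function from this PR.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: digest a bearer token the way both the old and the new
// model are described as doing (sha256 over the token, stored as bytea).
// Assumes UTF-8 encoding and no salt; neither is confirmed by the PR text.
function hashSessionToken(token: string): Buffer {
  return createHash("sha256").update(token, "utf8").digest();
}

// The value stored before cutover and the lookup key computed after it are
// byte-identical, so a verbatim copy of the hash column keeps users logged in.
const storedBeforeCutover = hashSessionToken("example-session-token");
const lookupAfterCutover = hashSessionToken("example-session-token");
const stillLoggedIn = storedBeforeCutover.equals(lookupAfterCutover);
```

API keys cannot take the same path: an argon2 hash is salted and one-way, so there is nothing to re-derive or re-verify without the original key.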

§9 — verified against live prod (read-only)

  • Two distinct physical clusters confirmed → cross-DB ETL correct. No DDL drift.
  • 32 identities, 34 active engines (1 deleted), 62,111 memories, 0 orphans,
    0 pending invites, only 4 memories without embeddings.
  • 33/34 orgs are single-owner/single-engine (trivial path); 1 multi-member org;
    1 RBAC role; 6 service users (all owners' coding agents) → mapped to agents.

Tested

tsc + lint clean; 14 integration + 5 unit pass. The integration test stands
in one physical DB for all three connections and covers the simple + complex
scenarios (multi-member org, RBAC role→group, service-user→agent, grants, dangling
identity, invitations, deleted/orphan engines).

🤖 Generated with Claude Code

cevian and others added 3 commits June 22, 2026 15:48

One-time, in-place migration from the old org/engine/role + RLS model
(deployed at server/v0.2.5) to the new auth/core/space model (PR #71),
all within the single existing database.

PROD_MIGRATION_PLAN.md captures the full old→new mapping, decisions
(in-place, reuse slugs, rename-aside), the phased run procedure, and the
"verify against live prod" checklist (no DB access yet — drafted off code).

packages/migrate-prod implements it, reusing the new code's own
provisioning (migrateAuth/migrateCore/provisionSpace) and core SQL
functions rather than re-implementing DDL:
  - Phase A: provision auth+core beside the live accounts schema; migrate
    identities → auth.users + core.principal (id preserved), oauth links,
    and live sessions (token_hash copied verbatim — same sha256 scheme).
  - Phase B (per engine, one txn): rename old me_<slug> aside, provision a
    fresh one, build the roster + tree-access grants from org membership /
    superuser / tree_owner / tree_grant / role_membership, same-DB copy
    memories (carrying embeddings).
  - Phase C: explicit dropLegacy/dropAccounts teardown.

Not migrated (by constraint): api keys (argon2, unrecoverable — agents
re-issue), oauth tokens, device-flow rows. Grant {actions}→level mapping
is intentionally lossy/over-permissive (documented).
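To illustrate why collapsing a set of actions into a single level ends up over-permissive, here is a minimal sketch. The action and level names below are hypothetical (this PR does not list the old actions or the new levels), and the real mapping lives in the plan document; only the shape of the problem is intended to carry over.

```typescript
// Hypothetical vocabularies, chosen for illustration only.
type Action = "read" | "create" | "update" | "delete";
type Level = "read" | "write" | "owner";

const LEVEL_RANK: Record<Level, number> = { read: 0, write: 1, owner: 2 };

// Assumed smallest level that permits each action.
const LEVEL_FOR_ACTION: Record<Action, Level> = {
  read: "read",
  create: "write",
  update: "write",
  delete: "owner",
};

// Collapse a set of actions into the lowest single level covering all of them.
// Returns undefined for an empty set, so no grant is created from nothing.
function levelForActions(actions: Action[]): Level | undefined {
  let level: Level | undefined = undefined;
  for (const action of actions) {
    const needed = LEVEL_FOR_ACTION[action];
    if (level === undefined || LEVEL_RANK[needed] > LEVEL_RANK[level]) {
      level = needed;
    }
  }
  return level;
}
```

Under these assumed tables, a grant of `["read", "create"]` becomes `"write"`, which also permits `update`. That extra permission is the kind of loss the commit message calls out as documented.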

Tested end-to-end against a real Postgres (simple + complex scenarios:
multi-member org, RBAC role→group, explicit grants, dangling identity,
invitations, deleted/orphan engines) plus unit tests for the mapping.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Operational step-by-step for executing the migration: pre-flight checklist
(privileges, backup, §9 verification), rollback plan, the maintenance-window
mode (recommended) vs per-engine zero-downtime mode, Phase-C teardown, and
reconciliation/verification SQL.

Grounds the steps in how prod actually deploys/migrates: the new server
auto-migrates idempotently on boot (so run the ETL first to avoid the
helm --wait --atomic crashloop), and connects via DATABASE_URL with a
temporary ENGINE_DATABASE_URL fallback (so the single-DB cutover needs no
chart connection change). Flags the one hard break (API-key re-issue) and the
cross-schema privilege requirement for the ETL connection.
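The connection fallback described above amounts to a two-step lookup. This is a sketch only: `resolveDatabaseUrl` is a hypothetical name, the environment is passed in as a plain map so it can be exercised without the real process environment, and treating an empty string as unset is an assumption.

```typescript
type Env = Record<string, string | undefined>;

// Prefer DATABASE_URL; fall back to ENGINE_DATABASE_URL while the temporary
// fallback exists. Empty strings are treated as unset (an assumption).
function resolveDatabaseUrl(env: Env): string {
  const url = env["DATABASE_URL"] || env["ENGINE_DATABASE_URL"];
  if (!url) {
    throw new Error("neither DATABASE_URL nor ENGINE_DATABASE_URL is set");
  }
  return url;
}
```

With a lookup of this shape, a deployment that only sets the old variable keeps booting, which is why the single-DB cutover in this commit needed no chart connection change.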

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Prod actually runs two separate databases (DB_ACCOUNTS + DB_SHARD), not one,
so the migration now targets a brand-new third database instead of an in-place
schema swap.

- ETL takes three connections {accounts, shard, target}; sources are read-only.
- No rename-aside / collision handling (sources and target are different DBs);
  each engine slug is reused verbatim as its space slug.
- Memories are copied cross-database by streaming (cursor over DB_SHARD →
  batched insert into the target) instead of insert…select; meta re-sent via
  sql.json to dodge the postgres.js text-in-jsonb double-encoding footgun.
- Removed the dropLegacy/dropAccounts teardown helpers — sources are never
  modified, so rollback is just repointing the app at the old databases and
  decommissioning them is out of band.
- run.ts reads DB_ACCOUNTS / DB_SHARD / DATABASE_URL(target).
- Plan + runbook rewritten for the three-DB topology (fresh target, cross-DB
  copy, repoint-to-sources rollback, chart DB-secret repoint, cross-DB
  verification queries).
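The cross-database copy in the list above follows a common pattern: read the source through a cursor that yields batches, and write each batch to the target. The sketch below shows only that loop, with the source and sink injected so it does not depend on a driver. `streamCopy` and `fakeCursor` are hypothetical names. In postgres.js, a query's `.cursor(n)` can be consumed with `for await` and yields arrays of rows, and `sql.json(value)` passes a value as a JSON parameter; how the ETL wires those together is not shown in this PR page.

```typescript
// Copy rows from a batched source (for example a database cursor yielding
// arrays of rows) into a sink, one insertBatch call per fetched batch.
// Returns the number of rows handed to the sink.
async function streamCopy<Row>(
  source: AsyncIterable<Row[]>,
  insertBatch: (rows: Row[]) => Promise<void>,
): Promise<number> {
  let copied = 0;
  for await (const batch of source) {
    if (batch.length === 0) continue; // nothing to write for an empty fetch
    await insertBatch(batch);
    copied += batch.length;
  }
  return copied;
}

// Stand-in for a cursor: yields fixed-size slices of an in-memory array.
async function* fakeCursor<Row>(
  rows: Row[],
  size: number,
): AsyncGenerator<Row[]> {
  for (let i = 0; i < rows.length; i += size) {
    yield rows.slice(i, i + size);
  }
}
```

Because only one batch is held in memory at a time, the copy's footprint stays bounded regardless of how many memories an engine has.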

Tests updated: the integration test stands in one physical DB for all three
connections (source schemas carry a distinct prefix so they don't collide with
the target). typecheck + lint clean; 13 integration + 5 unit pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@cevian cevian force-pushed the prod-multiplayer-migration branch from 6a06e0c to 3bce9e1 June 22, 2026 13:49

cevian and others added 7 commits June 22, 2026 16:08

Added survey.ts: a READ-ONLY §9 verification tool (DB_ACCOUNTS + DB_SHARD)
that checks DDL drift and surveys the data shape. Run against live prod:
  - two distinct physical clusters confirmed → cross-DB ETL is correct
  - no DDL drift; 32 identities, 34 active engines, 62,111 memories, 0 orphans
  - 33/34 orgs single-owner; 1 multi-member org; 1 RBAC role
  - 6 service users (login, no identity), all confirmed to be each owner's
    own coding agents (claude/codex/sidekick/…)

Service-user handling (decision): map each to a kind='a' agent owned by the
engine's org owner, joined to the space, with its grants re-created (clamped
under the owner's owner@root). Dangling identities (none in prod) still drop
with a warning. Memory copy kept per-row (cursor fetched in batches).

Fixture + test cover the new service-user→agent path; plan updated with the
§9 results and the decision (§4.1, §10). typecheck + lint clean; 14
integration + 5 unit pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Drop the per-engine zero-downtime mode (Mode B) and the modes section; the
runbook is now a single linear maintenance-window cutover. Fold in the §9
survey facts: humans keep working (sessions migrate) so only agents (former
service users) need re-issued keys; expected reconciliation numbers (~32
users, 34 active engines, ~62k memories, 0 orphans, 0 skipped/warnings);
note the row-by-row copy runtime. Fix the plan's stale runbook cross-refs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The target needs only an empty database (+ creatable/installed extensions + a
schema-creating role) — not a pre-migrated one. The ETL runs migrateAuth/
migrateCore + provisionSpace itself.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add MigrateOptions.engineSlugs (run.ts: MIGRATE_ENGINES env) to restrict Phase B
to a subset of engines — Phase A still migrates all identities. Lets you smoke
test the ETL against a throwaway target with the real (read-only) prod sources:
a fast few-engine pass first, then a full rehearsal. Requested slugs that aren't
active engines are reported in skippedEngines.
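The subset filter can be sketched as two small functions: one that turns the environment value into a slug list, and one that splits the request into engines to migrate and slugs that matched nothing. The comma separator, the trimming, and the function names are assumptions for illustration; only `engineSlugs`, `MIGRATE_ENGINES`, and `skippedEngines` come from the commit message.

```typescript
// Parse a MIGRATE_ENGINES-style value. Assumes comma-separated slugs;
// undefined or an effectively empty value means "no filter, migrate all".
function parseEngineSlugs(raw: string | undefined): string[] | undefined {
  if (raw === undefined) return undefined;
  const slugs = raw
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
  const unique = slugs.filter((s, i) => slugs.indexOf(s) === i);
  return unique.length > 0 ? unique : undefined;
}

// Split the active engines into those selected for Phase B and the requested
// slugs that are not active engines (reported rather than treated as fatal).
function selectEngines(
  activeSlugs: string[],
  requested: string[] | undefined,
): { selected: string[]; skippedEngines: string[] } {
  if (requested === undefined) {
    return { selected: activeSlugs.slice(), skippedEngines: [] };
  }
  return {
    selected: activeSlugs.filter((s) => requested.indexOf(s) !== -1),
    skippedEngines: requested.filter((s) => activeSlugs.indexOf(s) === -1),
  };
}
```

Reporting unknown slugs instead of failing lets a rehearsal run proceed while still making a typo in the slug list visible in the result.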

Runbook §0 documents the rehearsal procedure (incl. target reset SQL for re-runs).
Tests cover the filter + the not-found-slug case. typecheck + lint clean; 16
integration + 5 unit pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

A subset-aware, read-only verifier across DB_ACCOUNTS/DB_SHARD/target: identity↔
user counts + the auth.users == core.principal invariant, oauth/session copy,
per-space memory counts (target vs source shard), ≥1 admin per space, every
member's effective build_tree_access non-empty, and the Tiger-Den access-parity
spot-check (owner→owner@root, member→group grants). Prints a ✓/✗ checklist,
exits non-zero on failure. Runbook §5 points at it.
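The checklist-and-exit-code behaviour can be sketched independently of the queries that feed it. The `Check` type and both function names are hypothetical; the real verifier's checks come from the three databases, whereas the counts here are supplied by the caller.

```typescript
type Check = { label: string; ok: boolean; detail?: string };

// Render a ✓/✗ line per check plus a summary, and derive the exit code:
// 0 only when every check passed.
function summarize(checks: Check[]): { lines: string[]; exitCode: number } {
  const lines = checks.map((c) => {
    const mark = c.ok ? "✓" : "✗";
    return c.detail ? `${mark} ${c.label} (${c.detail})` : `${mark} ${c.label}`;
  });
  const passed = checks.filter((c) => c.ok).length;
  lines.push(`${passed}/${checks.length} checks passed`);
  return { lines, exitCode: passed === checks.length ? 0 : 1 };
}

// One example check: per-space memory count, target versus source shard.
function memoryCountCheck(
  space: string,
  sourceCount: number,
  targetCount: number,
): Check {
  return {
    label: `memory count matches for ${space}`,
    ok: sourceCount === targetCount,
    detail: `source=${sourceCount} target=${targetCount}`,
  };
}
```

A runner would print `lines` and pass `exitCode` to `process.exit`, so a single failed check is enough to stop an automated cutover script.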

Verified the smoke-test target (Tiger Den + one small engine): all 12 checks
passed — 32 identities/users/principals, 18 sessions, memory counts match, and
the role→group access resolves for the collaborator.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Replace the per-row cross-DB insert with one batched insert per cursor fetch:
read every column as text, then `insert … select … from unnest($1::text[], …)`
with scalar casts in the projection (meta::jsonb, tree::ltree,
temporal::tstzrange, embedding::halfvec). Validated the cast pattern locally
(jsonb objects, halfvec, ltree incl. root, tstzrange, nulls all preserved).
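The batched insert described above needs the fetched rows transposed into one text array per column, which are then bound as the `unnest` arguments. The sketch below shows that transposition and an example statement. The table name, the `id` and `content` columns, and the `::uuid` cast are placeholders; only the `meta`, `tree`, `temporal`, and `embedding` casts are taken from the commit message.

```typescript
// One source row with every column already read as text (null stays null).
type TextRow = {
  id: string;
  content: string | null;
  meta: string | null;
  tree: string | null;
  temporal: string | null;
  embedding: string | null;
};

const COLUMNS = ["id", "content", "meta", "tree", "temporal", "embedding"] as const;

// Transpose a batch of rows into one array per column, in COLUMNS order,
// ready to bind as $1..$6.
function toColumnArrays(rows: TextRow[]): (string | null)[][] {
  return COLUMNS.map((col) => rows.map((row) => row[col]));
}

// Hypothetical statement: every parameter is a text[]; the scalar casts are
// applied per row in the projection, so a null element stays null.
const INSERT_SQL = `
insert into me_example.memory (id, content, meta, tree, temporal, embedding)
select u.id::uuid, u.content, u.meta::jsonb, u.tree::ltree,
       u.temporal::tstzrange, u.embedding::halfvec
from unnest($1::text[], $2::text[], $3::text[], $4::text[], $5::text[], $6::text[])
  as u(id, content, meta, tree, temporal, embedding)
`;
```

The fetch size is not stated in this PR. As a consistency check only, 62,111 rows at 500 rows per fetch would be ceil(62111 / 500) = 125 inserts if everything came from a single cursor; per-engine cursors would add a partial batch each.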

Cuts ~62k target round-trips to ~125 → the memory copy drops from tens of
minutes to a few. Behavior unchanged: the row-level enqueue trigger still fires
per inserted row (null-embedding rows enqueue), counts/embeddings/tree paths
identical. Docs (plan §5, runbook) updated; 16 integration + unit suite green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Full rehearsal against a test target (real prod sources, read-only): 34 spaces,
62,111 memories, 0 skipped, 0 warnings; verify.ts 114/114 checks pass (counts
reconcile per space incl. the 20,862-row one; service-user→agent confirmed).
Wall-clock 14m36s but ~3% CPU — I/O-bound (~1.8 GB of halfvec over the WAN), so
run the real cutover in-region. Runbook timing notes corrected from "a few
minutes" to the measured number.
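The measured figures imply a fairly low sustained transfer rate, which supports the I/O-bound reading. A quick back-of-the-envelope check, taking "~1.8 GB" as decimal gigabytes (an assumption) and ignoring protocol overhead:

```typescript
// Figures quoted in the rehearsal commit message.
const wallClockSeconds = 14 * 60 + 36; // 14m36s = 876 s
const memoriesCopied = 62111;
const approxBytesMoved = 1.8e9; // "~1.8 GB", assumed to mean 1.8 * 10^9 bytes

// Derived rates (approximate, since the byte count itself is approximate).
const rowsPerSecond = memoriesCopied / wallClockSeconds; // about 71 rows/s
const megabytesPerSecond = approxBytesMoved / wallClockSeconds / 1e6; // about 2.05 MB/s
const megabitsPerSecond = megabytesPerSecond * 8; // about 16.4 Mbit/s
```

Roughly 2 MB/s is well below what an in-region link would normally sustain, so running the cutover next to the databases should shorten the copy; by how much is not something these numbers can predict.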

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>