fix(session-ingest): shrink cli_sessions_v2 status row lock#4322
Conversation
Reduce per-session row lock hold time on cli_sessions_v2 by removing the multi-step JS transaction around status metadata. Status transitions now run as a single SELECT ... FOR UPDATE + UPDATE ... RETURNING CTE, no-op status writes no longer stamp timestamps or lock the row, and parent and non-status metadata updates no longer execute while the status lock is held.
| ) | ||
| .limit(1); | ||
| let changedNonStatus = false; | ||
| let notificationRow = await updateSessionMetadata(db, kiloUserId, sessionId, updates); |
There was a problem hiding this comment.
WARNING: Splitting these writes across separate committed statements breaks the queue retry guarantee
applyMetadataChanges() used to persist metadata, parent updates, and status changes in one transaction, so a later failure rolled the whole batch back. With the three independent statements here, updateSessionMetadata() or updateSessionParent() can commit and then updateSessionStatus() can still fail. The queue will retry the message, but the Durable Object only re-emits metadata changes when its stored value differs, so the replay can come back with an empty changes list and Postgres will stay permanently out of sync on whichever fields were not written before the failure. This path still needs all-or-nothing persistence, or a retry path that can reconstruct and reapply the full merged change set.
Reply with @kilocode-bot fix it to have Kilo Code address this issue.
Code Review SummaryStatus: 1 Issue Found | Recommendation: Address before merge Overview
Fix these issues in Kilo Cloud Issue Details (click to expand)WARNING
Files Reviewed (3 files)
Reviewed by gpt-5.4-20260305 · Input: 90.7K · Output: 13.2K · Cached: 624.6K Review guidance: REVIEW.md from base branch |
Summary
Reduces per-session row-lock hold time on
cli_sessions_v2, the second major lock contributor identified in production Supabase lock-wait logs (theselect "status" from "cli_sessions_v2" ... for updatequery). Over a representative 7-day window this exact query accounted for 1,440 acquired lock waits, dominated byShareLock-on-transaction waits (max 110s) that the earlierExclusiveLock-only investigation missed.The previous
applyMetadataChangesflow opened a JS transaction, tookSELECT ... FOR UPDATEearly, then held that row lock across a metadataUPDATE, a parent-session lookup + update, and a final full-rowSELECTbefore commit. This change shrinks the locked critical section:WITH locked AS (SELECT ... FOR UPDATE)CTE plusUPDATE ... RETURNING, returning both the locked previous status and the updated row. No multi-round-trip JS transaction holds the lock.status_updated_at/updated_at, no longer hold the lock through a write, and emit no status notification.UPDATE ... RETURNINGstatements.RETURNINGinstead of a final in-transactionSELECT.Locked-status notification semantics are preserved:
session.status.updatedstill reports the persisted previous status captured under the row lock, not the queued intake snapshot.Verification
Visual Changes
N/A
Reviewer Notes
db.execute(sql\...`)) because Drizzle's query builder can't express "return the locked previous value plus the updated row" in one statement.previous_statusis read via a CTE subquery inRETURNINGrather than relying onFROM`-alias visibility, for cross-version safety.updated_at/status_updated_atand emits no notification. This is intended (it was previously incidental lock-holding work), but flagging in case any consumer relied on the timestamp bump.