Keep and backfill Usage Event records of running apps, tasks, and service instances by joyvuu-dave · Pull Request #5210 · cloudfoundry/cloud_controller_ng

joyvuu-dave · 2026-06-19T19:20:24Z

This PR fully addresses #4182.

Until now, consumers of Usage Records had no way to determine the current state of running apps, tasks, and service instances.

With this change, usage event records related to running Apps, Tasks, and Service Instances are kept from being pruned during the normal cleanup job. A one-time backfill also seeds a baseline event for resources that were already running when the change shipped, so consumers can reconstruct the current state even after the original events have been pruned.

I have reviewed the contributing guide
I have viewed, signed, and submitted the Contributor License Agreement
I have made this pull request to the main branch
I have run all the unit tests using bundle exec rake
I have run CF Acceptance Tests

App and service usage event cleanup previously pruned every record older than the cutoff, including the opening STARTED/CREATED event of a resource that is still running -- which makes it impossible to reconstruct current usage once that event ages out.

Database::OldRecordCleanup can now optionally keep "running" records. For each lifecycle a model declares via usage_lifecycles (beginning states, ending state, guid column), a beginning-state event (STARTED/CREATED/TASK_STARTED, and the WAS_RUNNING/TASK_WAS_RUNNING baselines) is retained unless:

* a later ending-state event (STOPPED/DELETED/TASK_STOPPED) for the same resource also falls outside the retention window -- the run is over; or
* it is a superseded baseline: an earlier beginning of the same run and a later beginning both exist outside the window.

Consumers only need the first beginning of the current run (the true start time) and the latest one (the current footprint), so the in-between events written by scaling an app or updating a service instance are pruned and cutoff_age_in_days keeps bounding the table size for long-running, frequently-changed resources.

The app and service usage event repositories enable the behavior with keep_running_records: true; requesting it for a model without usage_lifecycles raises instead of silently deleting the records of running resources.

Task events get their own lifecycle (TASK_STARTED/TASK_WAS_RUNNING -> TASK_STOPPED, keyed by task_guid), so the start events of long-running tasks survive cleanup as well. The task baseline state is distinct from WAS_RUNNING because task events share the app_usage_events table but carry an empty app_guid: reusing WAS_RUNNING would let the app lifecycle correlate every task baseline through app_guid = '' and wrongly prune them as superseded baselines of one phantom app (and the app backfill's stale-row sweep would delete them outright).
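
The retention rule above can be illustrated with a small in-memory sketch. This is not the PR's implementation (which works on the database through Database::OldRecordCleanup); the `Event` struct, the `kept_ids` helper, the use of `id` order as a stand-in for time, and the reading of "same run" as "no ending event in between" are all assumptions made for illustration.

```ruby
# Hypothetical in-memory model of a usage event; not the real Sequel model.
Event = Struct.new(:id, :state, :guid, :age_in_days, keyword_init: true)

BEGINNING_STATES = %w[STARTED WAS_RUNNING].freeze
ENDING_STATE = 'STOPPED'

# Returns the ids cleanup would keep. Ordering by id stands in for time (assumption).
def kept_ids(events, cutoff_age_in_days:)
  old, recent = events.partition { |e| e.age_in_days > cutoff_age_in_days }
  kept = recent.map(&:id) # everything inside the retention window survives

  old.group_by(&:guid).each_value do |rows|
    beginnings = rows.select { |e| BEGINNING_STATES.include?(e.state) }
    endings = rows.select { |e| e.state == ENDING_STATE }

    beginnings.each do |b|
      # Rule 1: a later ending outside the window means the run is over.
      next if endings.any? { |e| e.id > b.id }

      # Rule 2: superseded baseline, neither the first nor the latest
      # beginning of the current run.
      last_end_id = endings.map(&:id).select { |id| id < b.id }.max || 0
      earlier = beginnings.any? { |p| p.id < b.id && p.id > last_end_id }
      later = beginnings.any? { |n| n.id > b.id }
      next if earlier && later

      kept << b.id
    end
  end
  kept.sort
end

events = [
  Event.new(id: 1, state: 'STARTED', guid: 'app-a', age_in_days: 90),
  Event.new(id: 2, state: 'STARTED', guid: 'app-a', age_in_days: 80), # scale event
  Event.new(id: 3, state: 'STARTED', guid: 'app-a', age_in_days: 70), # scale event
  Event.new(id: 4, state: 'STARTED', guid: 'app-b', age_in_days: 60),
  Event.new(id: 5, state: 'STOPPED', guid: 'app-b', age_in_days: 50),
  Event.new(id: 6, state: 'STARTED', guid: 'app-c', age_in_days: 40),
  Event.new(id: 7, state: 'STOPPED', guid: 'app-c', age_in_days: 10)
]
kept_ids(events, cutoff_age_in_days: 31) # => [1, 3, 6, 7]
```

In this example app-a keeps only its first and latest beginning, app-b is pruned entirely because its stop has also aged out, and app-c keeps its start because the matching stop is still inside the window.
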
Deletion runs in ordered passes -- prunable beginning rows first, while the rows that make them prunable still exist, then everything else -- so a beginning row cannot be stranded when its pair is removed in an earlier batch.

The cleanup log line now reports the row counts BatchDelete returns instead of issuing extra COUNT queries, and BatchDelete fetches each batch's ids in the same query that checks for emptiness, halving evaluations of the (potentially expensive) filtered dataset.

Also renames the positional days_ago to a cutoff_age_in_days keyword.
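
A minimal sketch of the two ideas in this paragraph, under stated assumptions: the table is modelled as a plain Hash of id to row, and `batch_delete` is a hypothetical helper, not the real BatchDelete class. It evaluates the filter once per batch (the fetched ids double as the emptiness check) and returns the number of rows it deleted, so the caller can log counts without a separate COUNT.

```ruby
# Hypothetical stand-in for BatchDelete working on a Hash of id => row.
def batch_delete(table, batch_size:, &prunable)
  deleted = 0
  loop do
    # One evaluation of the filter yields both the ids and the emptiness check.
    ids = table.select { |_id, row| prunable.call(row) }.keys.sort.first(batch_size)
    break if ids.empty?

    ids.each { |id| table.delete(id) }
    deleted += ids.size
  end
  deleted
end

table = {
  1 => { state: 'STARTED', guid: 'a', old: true },
  2 => { state: 'STOPPED', guid: 'a', old: true },
  3 => { state: 'STARTED', guid: 'b', old: true } # still running
}
old_stop_exists = lambda do |guid|
  table.values.any? { |r| r[:guid] == guid && r[:state] == 'STOPPED' && r[:old] }
end

# Pass 1: prunable beginning rows, while the stop rows that justify pruning still exist.
first_pass = batch_delete(table, batch_size: 1) do |r|
  r[:state] == 'STARTED' && r[:old] && old_stop_exists.call(r[:guid])
end
# Pass 2: every other old row.
second_pass = batch_delete(table, batch_size: 1) { |r| r[:old] && r[:state] != 'STARTED' }
# first_pass == 1, second_pass == 1, table.keys == [3]
```

Running the passes in the opposite order would delete row 2 first, after which row 1 would look like the start of a running resource and stay in the table indefinitely.
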

Add a composite [state, <guid>, id] index on app_usage_events and service_usage_events to support the keep-running cleanup's correlated lookups of related lifecycle events and the backfill's existence checks. The indexes are created concurrently on Postgres.

The task lifecycle's correlated lookups (keyed by task_guid) are served by the existing app_usage_events_task_guid_index -- a task has only a handful of events, so probing by task_guid alone stays cheap and no [state, task_guid, id] index is needed.

Also adds a note on VCAP::BigintMigration.drop_pk_column that dropping the pk column silently drops composite indexes containing it (such as these), so any future swap migration must recreate them.

…ances

Seed a synthetic WAS_RUNNING usage event for every currently-running app process, a TASK_WAS_RUNNING event for every currently-running task, and a WAS_RUNNING event for every existing service instance, so billing consumers can bootstrap a complete baseline even after the original STARTED/TASK_STARTED/CREATED events have been pruned.

The backfill is a batched, idempotent VCAP::WasRunningBackfill helper invoked from thin no_transaction migrations (mirrors the bigint-migration pattern): each batch keysets over the started processes / running tasks / service instances by id and runs in its own READ COMMITTED transaction, so no statement risks the migration timeout and MySQL's INSERT..SELECT takes no shared next-key locks on the scanned source rows while the API serves traffic.

The app backfill scopes the package/droplet aggregates to each batch's apps (index-backed) to keep package_state fidelity without scanning the whole tables, and COALESCEs nullable legacy process/app/task columns so a single NULL row cannot abort a deploy.

Because the API stays live during migrations, a batch can race a concurrent stop/delete and insert a baseline row that no later ending event would ever prune; a post-seed sweep removes WAS_RUNNING/TASK_WAS_RUNNING rows whose resource is no longer running/present.

A skip_was_running_backfill config flag lets operators opt out (checked by the migrations, not the helper, since the migrations are recorded as applied either way); 'rake db:was_running_backfill' re-runs the seeding later for operators who skipped. Rollback deletes are batched too.

Document the WAS_RUNNING/TASK_WAS_RUNNING states and their created_at (migration-time) semantics on the V3 resources, and list the new states in the legacy V2 usage-event docs because V2 reads the same event rows.
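
The batched, idempotent seeding and the post-seed sweep described above can be sketched in memory as follows. This is only an illustration: the real helper is VCAP::WasRunningBackfill and runs SQL in per-batch transactions, whereas `backfill_was_running` and `sweep_stale_baselines` here are hypothetical functions over arrays of hashes, and the row shapes are assumptions.

```ruby
require 'set'

# Hypothetical stand-ins: processes and events are Arrays of Hashes.
def backfill_was_running(processes, events, batch_size:)
  last_id = 0
  loop do
    # Keyset pagination: resume after the last id seen instead of using OFFSET.
    batch = processes.select { |p| p[:id] > last_id && p[:state] == 'STARTED' }
                     .sort_by { |p| p[:id] }
                     .first(batch_size)
    break if batch.empty?

    seeded = events.select { |e| e[:state] == 'WAS_RUNNING' }.map { |e| e[:guid] }.to_set
    batch.each do |p|
      # Idempotent: a resource that already has a baseline is skipped.
      events << { state: 'WAS_RUNNING', guid: p[:guid] } unless seeded.include?(p[:guid])
    end
    last_id = batch.last[:id]
  end
end

# Removes baselines whose resource stopped while the backfill was running.
def sweep_stale_baselines(processes, events)
  running = processes.select { |p| p[:state] == 'STARTED' }.map { |p| p[:guid] }.to_set
  events.reject! { |e| e[:state] == 'WAS_RUNNING' && !running.include?(e[:guid]) }
end

processes = [
  { id: 1, guid: 'p1', state: 'STARTED' },
  { id: 2, guid: 'p2', state: 'STOPPED' },
  { id: 3, guid: 'p3', state: 'STARTED' }
]
events = []
backfill_was_running(processes, events, batch_size: 1)
backfill_was_running(processes, events, batch_size: 1) # re-running adds nothing
processes[0][:state] = 'STOPPED'                       # simulated concurrent stop
sweep_stale_baselines(processes, events)
# events now holds a single WAS_RUNNING baseline, for 'p3'
```
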

create_stop_event_if_needed skipped the TASK_STOPPED event whenever the TASK_STARTED event was absent -- so a task whose start event had been pruned never got a stop event on completion, and a billing consumer that recorded the start would bill the task forever. Emit the stop when either piece of recorded start evidence exists: the TASK_STARTED event, or the TASK_WAS_RUNNING baseline seeded by the backfill for tasks that were already running when the keep-running cleanup was introduced. Between the two, a legitimately started task always has one -- the cleanup no longer prunes the start event of a running task, and the backfill covers tasks that had already lost theirs. When neither exists (e.g. a task canceled before it ever ran), no consumer ever saw the task start, so a stop event would be unmatched noise.
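
The decision described above reduces to a single existence check. The sketch below is a simplification under stated assumptions: events are plain hashes in an array rather than rows accessed through the real repository, and the method's signature and boolean return value are hypothetical.

```ruby
START_EVIDENCE = %w[TASK_STARTED TASK_WAS_RUNNING].freeze

# Hypothetical simplification: events is an Array of Hashes, not the real repository.
# Returns true when a TASK_STOPPED event was recorded.
def create_stop_event_if_needed(events, task_guid)
  has_start = events.any? do |e|
    e[:task_guid] == task_guid && START_EVIDENCE.include?(e[:state])
  end
  # No recorded start means no consumer ever saw the task begin.
  return false unless has_start

  events << { task_guid: task_guid, state: 'TASK_STOPPED' }
  true
end

events = [
  { task_guid: 't1', state: 'TASK_STARTED' },
  { task_guid: 't2', state: 'TASK_WAS_RUNNING' } # start event pruned, baseline seeded
]
create_stop_event_if_needed(events, 't1') # => true
create_stop_event_if_needed(events, 't2') # => true
create_stop_event_if_needed(events, 't3') # => false, canceled before it ever ran
```
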

joyvuu-dave added 2 commits June 19, 2026 12:13

joyvuu-dave mentioned this pull request Jun 19, 2026

Keep Usage Event records of running apps and services #4646

Closed

5 tasks

joyvuu-dave added 2 commits June 19, 2026 15:32

joyvuu-dave force-pushed the was-running-backfill branch from 4af7d20 to ecd4d94 Compare June 19, 2026 20:34

joyvuu-dave wants to merge 4 commits into cloudfoundry:main from joyvuu-dave:was-running-backfill
