Keep and backfill Usage Event records of running apps, tasks, and service instances#5210

Open
joyvuu-dave wants to merge 4 commits into cloudfoundry:main from joyvuu-dave:was-running-backfill

@joyvuu-dave

Contributor

This PR fully addresses #4182.

It addresses the problem that consumers of usage event records have no way to determine the current state of running apps, tasks, and service instances.

With this change, the normal cleanup job no longer prunes usage event records that describe running apps, tasks, and service instances. A one-time backfill also seeds a baseline event for resources that were already running when the change shipped, so consumers can reconstruct the current state even after the original events have been pruned.

  • I have reviewed the contributing guide
  • I have viewed, signed, and submitted the Contributor License Agreement
  • I have made this pull request to the main branch
  • I have run all the unit tests using bundle exec rake
  • I have run CF Acceptance Tests

App and service usage event cleanup previously pruned every record older than
the cutoff, including the opening STARTED/CREATED event of a resource that is
still running -- which makes it impossible to reconstruct current usage once
that event ages out.

Database::OldRecordCleanup can now optionally keep "running" records. For each
lifecycle a model declares via usage_lifecycles (beginning states, ending
state, guid column), a beginning-state event (STARTED/CREATED/TASK_STARTED,
and the WAS_RUNNING/TASK_WAS_RUNNING baselines) is retained unless:

* a later ending-state event (STOPPED/DELETED/TASK_STOPPED) for the same
  resource also falls outside the retention window -- the run is over; or
* it is a superseded baseline: an earlier beginning of the same run and a
  later beginning both exist outside the window. Consumers only need the first
  beginning of the current run (the true start time) and the latest one (the
  current footprint), so the in-between events written by scaling an app or
  updating a service instance are pruned and cutoff_age_in_days keeps bounding
  the table size for long-running, frequently-changed resources.
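The two exceptions above can be sketched in plain Ruby over an in-memory list. This is an illustration only, not the SQL that Database::OldRecordCleanup runs: RetentionEvent, prunable_ids, and the sample data are hypothetical names, only the app lifecycle is modelled, and "same run" is approximated as "beginning events after the last ending event that is outside the window".

```ruby
require 'set'

# Hypothetical in-memory stand-in for a usage event row (not the real model).
RetentionEvent = Struct.new(:id, :state, :guid, :age_in_days, keyword_init: true)

RETENTION_BEGINNING_STATES = %w[STARTED WAS_RUNNING].freeze
RETENTION_ENDING_STATE = 'STOPPED'.freeze

# Returns the ids of events older than the cutoff that may be pruned.
def prunable_ids(events, cutoff_age_in_days:)
  old = events.select { |e| e.age_in_days > cutoff_age_in_days }
  keep = Set.new

  old.group_by(&:guid).each_value do |rows|
    rows = rows.sort_by(&:id)
    # Beginnings followed by an old ending event belong to a finished run.
    last_end_id = rows.select { |e| e.state == RETENTION_ENDING_STATE }.map(&:id).max
    current_run = rows.select do |e|
      RETENTION_BEGINNING_STATES.include?(e.state) && (last_end_id.nil? || e.id > last_end_id)
    end
    next if current_run.empty?

    # Keep the first beginning (true start) and the latest one (current footprint).
    keep << current_run.first.id << current_run.last.id
  end

  old.map(&:id).reject { |id| keep.include?(id) }.sort
end

SAMPLE_EVENTS = [
  RetentionEvent.new(id: 1, state: 'STARTED', guid: 'app-a', age_in_days: 40),
  RetentionEvent.new(id: 2, state: 'STARTED', guid: 'app-a', age_in_days: 38), # scale
  RetentionEvent.new(id: 3, state: 'STARTED', guid: 'app-a', age_in_days: 35), # scale
  RetentionEvent.new(id: 4, state: 'STARTED', guid: 'app-b', age_in_days: 40),
  RetentionEvent.new(id: 5, state: 'STOPPED', guid: 'app-b', age_in_days: 36),
  RetentionEvent.new(id: 6, state: 'STARTED', guid: 'app-c', age_in_days: 40),
  RetentionEvent.new(id: 7, state: 'STOPPED', guid: 'app-c', age_in_days: 5)
].freeze
```

With a 31-day cutoff, prunable_ids(SAMPLE_EVENTS, cutoff_age_in_days: 31) yields ids 2, 4, and 5: the in-between scale event of app-a and the finished run of app-b. Event 6 is kept because the stop of app-c is still inside the window.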

The app and service usage event repositories enable the behavior with
keep_running_records: true; requesting it for a model without usage_lifecycles
raises instead of silently deleting the records of running resources. Task
events get their own lifecycle (TASK_STARTED/TASK_WAS_RUNNING -> TASK_STOPPED,
keyed by task_guid), so the start events of long-running tasks survive cleanup
as well. The task baseline state is distinct from WAS_RUNNING because task
events share the app_usage_events table but carry an empty app_guid: reusing
WAS_RUNNING would let the app lifecycle correlate every task baseline through
app_guid = '' and wrongly prune them as superseded baselines of one phantom
app (and the app backfill's stale-row sweep would delete them outright).
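As a rough illustration of the declaration and the fail-fast behavior described above, the following plain-Ruby sketch uses hypothetical classes (SketchLifecycle, SketchAppUsageEvent, SketchPlainEvent) and a hypothetical helper, lifecycles_for. The real models are database-backed and the actual shape of usage_lifecycles may differ; only the states and guid columns are taken from this description.

```ruby
# Hypothetical value object: beginning states, ending state, guid column.
SketchLifecycle = Struct.new(:beginning_states, :ending_state, :guid_column, keyword_init: true)

# Stand-in for a model that declares its lifecycles (assumed shape).
class SketchAppUsageEvent
  def self.usage_lifecycles
    [
      SketchLifecycle.new(beginning_states: %w[STARTED WAS_RUNNING],
                          ending_state: 'STOPPED',
                          guid_column: :app_guid),
      # Tasks share the table but get their own baseline state and key.
      SketchLifecycle.new(beginning_states: %w[TASK_STARTED TASK_WAS_RUNNING],
                          ending_state: 'TASK_STOPPED',
                          guid_column: :task_guid)
    ]
  end
end

# Stand-in for a model without lifecycle information.
class SketchPlainEvent; end

# Raises instead of silently treating every old record as prunable.
def lifecycles_for(model, keep_running_records:)
  return [] unless keep_running_records

  unless model.respond_to?(:usage_lifecycles)
    raise ArgumentError, "#{model} does not declare usage_lifecycles"
  end

  model.usage_lifecycles
end
```

Because the task lifecycle has its own beginning states, an app-lifecycle lookup on app_guid never matches a TASK_WAS_RUNNING row, even though those rows all carry an empty app_guid.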

Deletion runs in ordered passes -- prunable beginning rows first, while the
rows that make them prunable still exist, then everything else -- so a
beginning row cannot be stranded when its pair is removed in an earlier batch.
The cleanup log line now reports the row counts BatchDelete returns instead of
issuing extra COUNT queries, and BatchDelete fetches each batch's ids in the
same query that checks for emptiness, halving evaluations of the (potentially
expensive) filtered dataset. Also renames the positional days_ago to a
cutoff_age_in_days keyword.
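The pass ordering and the single-query batch fetch can be sketched over an in-memory array. The names batch_delete_sketch and ordered_cleanup are hypothetical and this is not the real BatchDelete; the array stands in for the table and only a STARTED/STOPPED lifecycle is modelled.

```ruby
# Deletes rows matching the filter in batches; returns the number removed.
def batch_delete_sketch(rows, batch_size:, &filter)
  deleted = 0
  loop do
    # One evaluation of the filter gives both the emptiness check and the ids.
    ids = rows.select(&filter).first(batch_size).map { |row| row[:id] }
    break if ids.empty?

    rows.reject! { |row| ids.include?(row[:id]) }
    deleted += ids.size
  end
  deleted
end

def ordered_cleanup(rows, cutoff_age_in_days:)
  old = ->(row) { row[:age_in_days] > cutoff_age_in_days }
  ended = lambda do |row|
    rows.any? do |other|
      old.call(other) && other[:state] == 'STOPPED' &&
        other[:guid] == row[:guid] && other[:id] > row[:id]
    end
  end

  # Pass 1: prunable beginning rows, while the ending rows that make them
  # prunable still exist.
  beginnings = batch_delete_sketch(rows, batch_size: 2) do |row|
    old.call(row) && row[:state] == 'STARTED' && ended.call(row)
  end
  # Pass 2: every other old row.
  others = batch_delete_sketch(rows, batch_size: 2) do |row|
    old.call(row) && row[:state] != 'STARTED'
  end

  # The returned counts can feed the log line; no extra COUNT is needed.
  { beginnings: beginnings, others: others }
end
```

If pass 2 ran first, the old STOPPED row would disappear and its STARTED row would look like the start of a still-running resource, which is the stranding the ordering avoids.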

Add a composite [state, <guid>, id] index on app_usage_events and
service_usage_events to support the keep-running cleanup's correlated lookups
of related lifecycle events and the backfill's existence checks. Created
concurrently on Postgres.

The task lifecycle's correlated lookups (keyed by task_guid) are served by
the existing app_usage_events_task_guid_index -- a task has only a handful of
events, so probing by task_guid alone stays cheap and no [state, task_guid,
id] index is needed.

Also add a note to VCAP::BigintMigration.drop_pk_column that dropping the pk
column silently drops composite indexes containing it (such as these), so any
future swap migration must recreate them.
…ances

Seed a synthetic WAS_RUNNING usage event for every currently-running app
process, a TASK_WAS_RUNNING event for every currently-running task, and a
WAS_RUNNING event for every existing service instance, so billing consumers
can bootstrap a complete baseline even after the original
STARTED/TASK_STARTED/CREATED events have been pruned.

The backfill is a batched, idempotent VCAP::WasRunningBackfill helper invoked
from thin no_transaction migrations (mirrors the bigint-migration pattern):
each batch keysets over the started processes / running tasks / service
instances by id and runs in its own READ COMMITTED transaction, so no
statement risks the migration timeout and MySQL's INSERT..SELECT takes no
shared next-key locks on the scanned source rows while the API serves
traffic. The app backfill scopes the package/droplet aggregates to each
batch's apps (index-backed) to keep package_state fidelity without scanning
the whole tables, and COALESCEs nullable legacy process/app/task columns so a
single NULL row cannot abort a deploy.
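The batching and idempotency can be illustrated with an in-memory sketch. It is not the VCAP::WasRunningBackfill implementation: backfill_was_running_sketch, the hash-based rows, and the buildpack_name column with its empty-string default are assumptions made for the example, and the per-batch transactions are only indicated in a comment.

```ruby
# Seeds one WAS_RUNNING event per started process, batch by batch.
# Returns the number of events inserted by this run.
def backfill_was_running_sketch(processes, events, batch_size:)
  last_id = 0
  inserted = 0

  loop do
    # Keyset step: the next batch of started processes ordered by id.
    # In the real helper each batch runs in its own transaction.
    batch = processes
            .select { |p| p[:id] > last_id && p[:state] == 'STARTED' }
            .sort_by { |p| p[:id] }
            .first(batch_size)
    break if batch.empty?

    batch.each do |process|
      # Idempotency: skip resources that already have a baseline (assumed check).
      already_seeded = events.any? do |e|
        e[:state] == 'WAS_RUNNING' && e[:app_guid] == process[:guid]
      end
      next if already_seeded

      events << {
        state: 'WAS_RUNNING',
        app_guid: process[:guid],
        # Plays the role of COALESCE for a nullable legacy column.
        buildpack_name: process[:buildpack_name] || ''
      }
      inserted += 1
    end

    last_id = batch.last[:id]
  end

  inserted
end
```

Running the sketch a second time over the same data inserts nothing, which is the property that makes re-running the seeding safe.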

Because the API stays live during migrations, a batch can race a concurrent
stop/delete and insert a baseline row that no later ending event would ever
prune; a post-seed sweep removes WAS_RUNNING/TASK_WAS_RUNNING rows whose
resource is no longer running/present.
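A minimal sketch of the app-side sweep, assuming an in-memory event list and a precomputed list of running app guids (both hypothetical; the real sweep runs against the database):

```ruby
# Removes WAS_RUNNING baselines whose app is no longer running.
# Returns the number of rows removed.
def sweep_stale_baselines(events, running_app_guids)
  before = events.size
  events.reject! do |event|
    event[:state] == 'WAS_RUNNING' && !running_app_guids.include?(event[:app_guid])
  end
  before - events.size
end
```

Only rows in the WAS_RUNNING state are considered, so a TASK_WAS_RUNNING row with an empty app_guid is left alone by the app sweep.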

A skip_was_running_backfill config flag lets operators opt out (checked by the
migrations, not the helper, since the migrations are recorded as applied
either way); 'rake db:was_running_backfill' re-runs the seeding later for
operators who skipped. Rollback deletes are batched too.

Document the WAS_RUNNING/TASK_WAS_RUNNING states and their created_at
(migration-time) semantics on the V3 resources, and list the new states in
the legacy V2 usage-event docs because V2 reads the same event rows.

create_stop_event_if_needed skipped the TASK_STOPPED event whenever the
TASK_STARTED event was absent -- so a task whose start event had been pruned
never got a stop event on completion, and a billing consumer that recorded
the start would bill the task forever.

Emit the stop when either piece of recorded start evidence exists: the
TASK_STARTED event, or the TASK_WAS_RUNNING baseline seeded by the backfill
for tasks that were already running when the keep-running cleanup was
introduced. Between the two, a legitimately started task always has one --
the cleanup no longer prunes the start event of a running task, and the
backfill covers tasks that had already lost theirs. When neither exists
(e.g. a task canceled before it ever ran), no consumer ever saw the task
start, so a stop event would be unmatched noise.
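
The decision reduces to a single existence check. The sketch below uses a hypothetical predicate, stop_event_needed?, over in-memory hashes; it is not the body of create_stop_event_if_needed, only the rule it is described as applying.

```ruby
TASK_START_EVIDENCE_STATES = %w[TASK_STARTED TASK_WAS_RUNNING].freeze

# True when a consumer could have seen this task start, either through the
# original start event or through the backfilled baseline.
def stop_event_needed?(events, task_guid)
  events.any? do |event|
    event[:task_guid] == task_guid && TASK_START_EVIDENCE_STATES.include?(event[:state])
  end
end
```

A task with neither kind of event, such as one canceled before it ran, gets no stop event.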

@joyvuu-dave force-pushed the was-running-backfill branch from 4af7d20 to ecd4d94 on June 19, 2026 20:34