Keep and backfill Usage Event records of running apps, tasks, and service instances#5210
Open
joyvuu-dave wants to merge 4 commits into
App and service usage event cleanup previously pruned every record older than the cutoff, including the opening STARTED/CREATED event of a resource that is still running -- which makes it impossible to reconstruct current usage once that event ages out.

Database::OldRecordCleanup can now optionally keep "running" records. For each lifecycle a model declares via usage_lifecycles (beginning states, ending state, guid column), a beginning-state event (STARTED/CREATED/TASK_STARTED, and the WAS_RUNNING/TASK_WAS_RUNNING baselines) is retained unless:

* a later ending-state event (STOPPED/DELETED/TASK_STOPPED) for the same resource also falls outside the retention window -- the run is over; or
* it is a superseded baseline: an earlier beginning of the same run and a later beginning both exist outside the window.

Consumers only need the first beginning of the current run (the true start time) and the latest one (the current footprint), so the in-between events written by scaling an app or updating a service instance are pruned and cutoff_age_in_days keeps bounding the table size for long-running, frequently-changed resources.

The app and service usage event repositories enable the behavior with keep_running_records: true; requesting it for a model without usage_lifecycles raises instead of silently deleting the records of running resources.

Task events get their own lifecycle (TASK_STARTED/TASK_WAS_RUNNING -> TASK_STOPPED, keyed by task_guid), so the start events of long-running tasks survive cleanup as well. The task baseline state is distinct from WAS_RUNNING because task events share the app_usage_events table but carry an empty app_guid: reusing WAS_RUNNING would let the app lifecycle correlate every task baseline through app_guid = '' and wrongly prune them as superseded baselines of one phantom app (and the app backfill's stale-row sweep would delete them outright).
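The two "retained unless" conditions above can be sketched in plain Ruby. This is only an in-memory illustration, not the SQL that Database::OldRecordCleanup runs: the `Event` and `Lifecycle` structs, the `prunable_ids` helper, and the `age_in_days` field are all hypothetical names, and ids are assumed to increase with time.

```ruby
# Hypothetical in-memory model of the keep-running rule (not the real implementation).
Event = Struct.new(:id, :guid, :state, :age_in_days, keyword_init: true)
Lifecycle = Struct.new(:beginning_states, :ending_state, keyword_init: true)

APP_LIFECYCLE = Lifecycle.new(beginning_states: %w[STARTED WAS_RUNNING], ending_state: 'STOPPED')

# Returns the ids of events that may be deleted. Assumes ids increase with time.
def prunable_ids(events, lifecycle:, cutoff_age_in_days:)
  old = events.select { |e| e.age_in_days > cutoff_age_in_days }

  old.select do |event|
    # Anything that is not a beginning-state event is pruned once it is old.
    next true unless lifecycle.beginning_states.include?(event.state)

    same_resource = old.select { |o| o.guid == event.guid }
    earlier = same_resource.select { |o| o.id < event.id }
    later = same_resource.select { |o| o.id > event.id }

    # Condition 1: a later ending event is also outside the window, so the run is over.
    run_over = later.any? { |o| o.state == lifecycle.ending_state }

    # Condition 2: superseded baseline. An earlier beginning of the same run
    # (after the last old ending event) and a later beginning both exist.
    last_end = earlier.select { |o| o.state == lifecycle.ending_state }.map(&:id).max || 0
    has_earlier = earlier.any? { |o| o.id > last_end && lifecycle.beginning_states.include?(o.state) }
    has_later = later.any? { |o| lifecycle.beginning_states.include?(o.state) }

    run_over || (has_earlier && has_later)
  end.map(&:id)
end

events = [
  Event.new(id: 1, guid: 'app-1', state: 'STARTED', age_in_days: 100), # true start: kept
  Event.new(id: 2, guid: 'app-1', state: 'STARTED', age_in_days: 90),  # scale event: superseded
  Event.new(id: 3, guid: 'app-1', state: 'STARTED', age_in_days: 80),  # current footprint: kept
  Event.new(id: 4, guid: 'app-2', state: 'STARTED', age_in_days: 70),  # run is over
  Event.new(id: 5, guid: 'app-2', state: 'STOPPED', age_in_days: 60),
  Event.new(id: 6, guid: 'app-3', state: 'STARTED', age_in_days: 50),  # stop is still in the window: kept
  Event.new(id: 7, guid: 'app-3', state: 'STOPPED', age_in_days: 5)
]

pruned = prunable_ids(events, lifecycle: APP_LIFECYCLE, cutoff_age_in_days: 31)
# pruned is [2, 4, 5]
```

For app-1, only the in-between scale event goes away, which matches the statement that consumers need the first and the latest beginning of the current run.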
Deletion runs in ordered passes -- prunable beginning rows first, while the rows that make them prunable still exist, then everything else -- so a beginning row cannot be stranded when its pair is removed in an earlier batch. The cleanup log line now reports the row counts BatchDelete returns instead of issuing extra COUNT queries, and BatchDelete fetches each batch's ids in the same query that checks for emptiness, halving evaluations of the (potentially expensive) filtered dataset. Also renames the positional days_ago to a cutoff_age_in_days keyword.
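A minimal sketch of why the pass order matters, again in memory rather than SQL. The `batch_delete` and `cleanup` helpers and the `:old` flag are hypothetical stand-ins; the real BatchDelete operates on a Sequel dataset. Each batch fetches its ids once and uses the empty result as the stop condition, and the method returns the deleted row count so the caller can log it without a separate COUNT.

```ruby
# Hypothetical in-memory stand-in for batched deletion.
# Deletes rows matching the filter in batches and returns how many were deleted.
def batch_delete(table, batch_size:, &filter)
  deleted = 0
  loop do
    # One evaluation of the filtered set per batch: fetch ids, stop when none are left.
    ids = table.values.select { |row| filter.call(row) }.first(batch_size).map { |row| row[:id] }
    break if ids.empty?

    ids.each { |id| table.delete(id) }
    deleted += ids.size
  end
  deleted
end

def cleanup(table)
  # Pass 1: old beginning rows whose old ending row still exists.
  ended_run = lambda do |row|
    row[:old] && row[:state] == 'STARTED' &&
      table.values.any? do |o|
        o[:old] && o[:guid] == row[:guid] && o[:state] == 'STOPPED' && o[:id] > row[:id]
      end
  end
  beginnings = batch_delete(table, batch_size: 2, &ended_run)

  # Pass 2: every other old row that is not a beginning row.
  others = batch_delete(table, batch_size: 2) { |row| row[:old] && row[:state] != 'STARTED' }

  { beginnings: beginnings, others: others }
end

table = {
  1 => { id: 1, guid: 'app-1', state: 'STARTED', old: true },
  2 => { id: 2, guid: 'app-1', state: 'STOPPED', old: true },
  3 => { id: 3, guid: 'app-2', state: 'STARTED', old: true },  # still running
  4 => { id: 4, guid: 'app-3', state: 'STARTED', old: false }  # inside the window
}

counts = cleanup(table)
# counts is { beginnings: 1, others: 1 }; rows 3 and 4 remain
```

If pass 2 ran first, row 2 would be gone before row 1 was examined, and row 1 would then look like the start of a still-running app and be kept forever.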
Add a composite [state, <guid>, id] index on app_usage_events and service_usage_events to support the keep-running cleanup's correlated lookups of related lifecycle events and the backfill's existence checks. Created concurrently on Postgres. The task lifecycle's correlated lookups (keyed by task_guid) are served by the existing app_usage_events_task_guid_index -- a task has only a handful of events, so probing by task_guid alone stays cheap and no [state, task_guid, id] index is needed. Also note on VCAP::BigintMigration.drop_pk_column that dropping the pk column silently drops composite indexes containing it (such as these), so any future swap migration must recreate them.
…ances

Seed a synthetic WAS_RUNNING usage event for every currently-running app process, a TASK_WAS_RUNNING event for every currently-running task, and a WAS_RUNNING event for every existing service instance, so billing consumers can bootstrap a complete baseline even after the original STARTED/TASK_STARTED/CREATED events have been pruned.

The backfill is a batched, idempotent VCAP::WasRunningBackfill helper invoked from thin no_transaction migrations (mirrors the bigint-migration pattern): each batch keysets over the started processes / running tasks / service instances by id and runs in its own READ COMMITTED transaction, so no statement risks the migration timeout and MySQL's INSERT..SELECT takes no shared next-key locks on the scanned source rows while the API serves traffic. The app backfill scopes the package/droplet aggregates to each batch's apps (index-backed) to keep package_state fidelity without scanning the whole tables, and COALESCEs nullable legacy process/app/task columns so a single NULL row cannot abort a deploy.

Because the API stays live during migrations, a batch can race a concurrent stop/delete and insert a baseline row that no later ending event would ever prune; a post-seed sweep removes WAS_RUNNING/TASK_WAS_RUNNING rows whose resource is no longer running/present.

A skip_was_running_backfill config flag lets operators opt out (checked by the migrations, not the helper, since the migrations are recorded as applied either way); 'rake db:was_running_backfill' re-runs the seeding later for operators who skipped. Rollback deletes are batched too.

Document the WAS_RUNNING/TASK_WAS_RUNNING states and their created_at (migration-time) semantics on the V3 resources, and list the new states in the legacy V2 usage-event docs because V2 reads the same event rows.
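The keyset batching, idempotence, and post-seed sweep described above can be illustrated with a small in-memory sketch. It is not the VCAP::WasRunningBackfill helper itself: `backfill_was_running`, `sweep_stale_baselines`, and the `process_guid` key are hypothetical, and the real helper issues INSERT..SELECT statements in per-batch transactions.

```ruby
# Hypothetical in-memory sketch of a keyset-batched, idempotent backfill.
# Returns the number of baseline events inserted.
def backfill_was_running(processes, events, batch_size:)
  inserted = 0
  last_id = 0
  loop do
    # Keyset step: take the next started processes with id > last_id, in id order.
    batch = processes.select { |p| p[:id] > last_id && p[:state] == 'STARTED' }
                     .sort_by { |p| p[:id] }
                     .first(batch_size)
    break if batch.empty?

    batch.each do |process|
      # Idempotence: skip processes that already have a baseline row.
      next if events.any? { |e| e[:state] == 'WAS_RUNNING' && e[:process_guid] == process[:guid] }

      events << { state: 'WAS_RUNNING', process_guid: process[:guid] }
      inserted += 1
    end
    last_id = batch.last[:id]
  end
  inserted
end

# Post-seed sweep: remove baselines whose process is no longer running.
# Returns the number of rows removed.
def sweep_stale_baselines(processes, events)
  running = processes.select { |p| p[:state] == 'STARTED' }.map { |p| p[:guid] }
  before = events.size
  events.reject! { |e| e[:state] == 'WAS_RUNNING' && !running.include?(e[:process_guid]) }
  before - events.size
end

processes = [
  { id: 1, guid: 'p-1', state: 'STARTED' },
  { id: 2, guid: 'p-2', state: 'STOPPED' },
  { id: 3, guid: 'p-3', state: 'STARTED' },
  { id: 4, guid: 'p-4', state: 'STARTED' },
  { id: 5, guid: 'p-5', state: 'STARTED' }
]
events = []

first_run = backfill_was_running(processes, events, batch_size: 2)  # seeds 4 baselines
second_run = backfill_was_running(processes, events, batch_size: 2) # re-run inserts nothing

processes[2][:state] = 'STOPPED' # simulate a stop that raced the backfill
swept = sweep_stale_baselines(processes, events)
```

Because every batch starts from the last seen id instead of an offset, each pass touches only a bounded slice of the source rows, which is the property that keeps individual statements short in the real migration.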
create_stop_event_if_needed skipped the TASK_STOPPED event whenever the TASK_STARTED event was absent -- so a task whose start event had been pruned never got a stop event on completion, and a billing consumer that recorded the start would bill the task forever. Emit the stop when either piece of recorded start evidence exists: the TASK_STARTED event, or the TASK_WAS_RUNNING baseline seeded by the backfill for tasks that were already running when the keep-running cleanup was introduced. Between the two, a legitimately started task always has one -- the cleanup no longer prunes the start event of a running task, and the backfill covers tasks that had already lost theirs. When neither exists (e.g. a task canceled before it ever ran), no consumer ever saw the task start, so a stop event would be unmatched noise.
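The decision described above reduces to a check for either kind of start evidence. A minimal sketch, assuming events are plain hashes; `stop_event_needed?` and `START_EVIDENCE` are hypothetical names, and the real create_stop_event_if_needed queries the app_usage_events table instead.

```ruby
# Hypothetical sketch: emit TASK_STOPPED only when some start evidence was recorded.
START_EVIDENCE = %w[TASK_STARTED TASK_WAS_RUNNING].freeze

def stop_event_needed?(events, task_guid)
  events.any? { |e| e[:task_guid] == task_guid && START_EVIDENCE.include?(e[:state]) }
end

events = [
  { task_guid: 'task-a', state: 'TASK_STARTED' },     # normal start event
  { task_guid: 'task-b', state: 'TASK_WAS_RUNNING' }  # baseline seeded by the backfill
]

started_task = stop_event_needed?(events, 'task-a')   # true
baseline_task = stop_event_needed?(events, 'task-b')  # true
canceled_task = stop_event_needed?(events, 'task-c')  # false: it never ran
```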
This PR fully addresses #4182.
Consumers of Usage Event records previously had no way to determine the current state of running apps, tasks, and service instances; this PR gives them one.
With this change, usage event records related to running Apps, Tasks, and Service Instances are kept from being pruned during the normal cleanup job. A one-time backfill also seeds a baseline event for resources that were already running when the change shipped, so consumers can reconstruct the current state even after the original events have been pruned.