Make the async processing service (psv2) the default and update all existing projects by mihow · Pull Request #1353 · RolnickLab/antenna

mihow · 2026-06-26T22:59:05Z

Summary

This flips the switch on the new async processing service (psv2) platform-wide. It does two things together: every project that already exists is switched over, and every project created from now on uses psv2 by default. Concretely, the async_pipeline_workers feature flag drives whether a project's ML jobs run on workers that pull tasks from the NATS queue (psv2) instead of the synchronous push API; this PR makes that the default and rolls it out to the back-catalogue so operators don't have to opt each project in by hand.

We are flipping the default now because psv2 has proven substantially more reliable and much faster than the synchronous service in operator testing: roughly two orders of magnitude higher throughput on large capture sets (an operational observation on production data, not a controlled benchmark). The remaining psv2 work is tracked, but none of it blocks turning psv2 on by default; the open items are refinements, hardening, and follow-ups rather than correctness gates for the common path.

Operational precondition: this routes every project's ML jobs through async (NATS) workers. Each deployment it runs against must have psv2 workers live and consuming the queue. Stranded async jobs are caught by the stale-job reaper rather than hanging indefinitely.

List of Changes

Change (user/operator effect)	How (implementation)
New projects use the async processing service (psv2) by default.	`ProjectFeatureFlags.async_pipeline_workers` default changed from `False` to `True` in `ami/main/models.py`. No migration is needed: the `feature_flags` field deconstructs by its `schema=` class reference and default-factory reference, neither of which changes when a default inside the pydantic model changes — `makemigrations --check` confirms no model migration.
Every project that already exists is switched over to psv2, without an operator flipping each one by hand.	New data migration `0094_enable_async_pipeline_workers` runs a single DB-side `UPDATE ... jsonb_set(...)` that toggles only the `async_pipeline_workers` key in place for every existing `Project`, with a `WHERE ... IS DISTINCT FROM` guard so rows already at the target are skipped. Updating the one key server-side (rather than reading the whole JSON into Python and writing it back) leaves the other feature flags untouched even under a concurrent change, and runs as one statement instead of one save per row.
The existing-project rollout can be undone in one step.	The data migration's reverse flips the flag back to `False` for every project (a blanket disable). It does not restore per-project values from before the rollout — some projects may have been opted in individually beforehand, so the reverse returns every project to the off state. The reverse does not change the new model default; that is a code change, reverted by reverting the commit.
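To illustrate why the migration updates one key in place rather than rewriting the whole JSON value, here is a minimal plain-Python simulation. It is only a sketch: the dict stands in for a project's `feature_flags` column, `other_flag` is a hypothetical sibling flag, and the helper names are invented for this example rather than taken from the codebase.

```python
import copy


def read_modify_write(db_row, key, value, concurrent_change=None):
    """Whole-value rewrite: read a snapshot, change one key, write it all back."""
    snapshot = copy.deepcopy(db_row["feature_flags"])  # read
    if concurrent_change:
        # Another process commits a sibling-flag change after our read.
        db_row["feature_flags"].update(concurrent_change)
    snapshot[key] = value                               # modify
    db_row["feature_flags"] = snapshot                  # write back a stale copy


def single_key_update(db_row, key, value, concurrent_change=None):
    """In-place update of one key, analogous in spirit to a DB-side jsonb_set."""
    if concurrent_change:
        db_row["feature_flags"].update(concurrent_change)
    db_row["feature_flags"][key] = value


# Two identical rows; "other_flag" is a hypothetical unrelated flag.
row_a = {"feature_flags": {"async_pipeline_workers": False, "other_flag": False}}
row_b = copy.deepcopy(row_a)

read_modify_write(row_a, "async_pipeline_workers", True, {"other_flag": True})
single_key_update(row_b, "async_pipeline_workers", True, {"other_flag": True})

print(row_a["feature_flags"])  # the concurrent change to other_flag is overwritten
print(row_b["feature_flags"])  # the concurrent change to other_flag survives
```

The simulation compresses the race into a single function call; in a real deployment the interleaving would depend on transaction timing.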

Notes

The default change and the data migration are complementary: the default only affects rows created after deploy, so the data migration is still required to cover projects that already exist.
Validated on a fresh test database and end-to-end on a dev box: the data migration enables the flag for all projects and the reverse clears it, while an unrelated flag set on a project is preserved through both directions. The factory that backs new-project creation now returns async_pipeline_workers = True with the other defaults unchanged.
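The reasoning behind "no model migration" can be sketched as follows. This is an illustration under assumptions: a stdlib dataclass stands in for the pydantic model, `example_other_flag`, `default_feature_flags`, and `reference_path` are hypothetical names, and the dotted path only approximates the kind of reference a migration records for a class or function.

```python
from dataclasses import asdict, dataclass


@dataclass
class ProjectFeatureFlags:  # dataclass stand-in for the pydantic model
    async_pipeline_workers: bool = True  # was False before this change
    example_other_flag: bool = False     # hypothetical sibling flag


def default_feature_flags() -> ProjectFeatureFlags:
    """Hypothetical default factory that a model field would reference by name."""
    return ProjectFeatureFlags()


def reference_path(obj) -> str:
    """Dotted path for a class or function; it does not encode inner defaults."""
    return f"{obj.__module__}.{obj.__qualname__}"


# New rows get whatever the factory returns at creation time.
print(asdict(default_feature_flags()))

# The references stay the same no matter which default the class declares.
print(reference_path(ProjectFeatureFlags))
print(reference_path(default_feature_flags))
```

Because only the references are recorded, flipping the value inside the class changes what new rows receive without changing anything a schema migration would need to capture. Existing rows are unaffected, which is why the data migration is still needed.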

Known follow-ups (PSv2) — none blocking this change

psv2 is the new async/distributed ML backend tracked under the umbrella issue #515 and the PSv2 label. The items below are known and open at the time of this change. None of them block making psv2 the default — they are hardening, performance, and feature-completeness follow-ups. Full lists: Antenna PSv2 issues · Antenna PSv2 PRs · ADC (ami-data-companion) PRs.

Reliability / job-state correctness (Antenna)

Jobs that are created but not started get revoked #1354.
Saving job progress concurrently is the root of multiple issues related to incorrect job statuses #1337 / Review how the state and progress of jobs are tracked #1285 — Concurrent job-progress saves cause incorrect job statuses; broader review of how job state and progress are tracked.
Clean up task queue when job is revoked or re-started, fix duplicate tasks #1283 / CANCELLED jobs leak through /next filter, starve newer async_api jobs #1282 — Clean up the task queue on revoke/restart (duplicate tasks); CANCELLED jobs leak through the /next filter and starve newer jobs.
A few images are regularly stranded after long-running jobs #1247.
PSv2: Async jobs hang forever when NATS tasks exhaust max_deliver without posting results #1168 — Async jobs can hang forever when NATS tasks exhaust max_deliver without posting results (mitigated by the stale-job reaper).
fix: PSv2 - Tasks are queued, worker sees job but no tasks #1123 — Tasks are queued and the worker sees the job, but no tasks are handed out.
Handle Job.DoesNotExist exception properly in process_source_images_async #1061.

Performance / scaling (Antenna)

perf(jobs): defer aggregate count refresh in process_nats_pipeline_result (5-10x speedup expected) #1288 — Defer the aggregate count refresh in process_nats_pipeline_result (expected 5–10× on the result path).
feat(job): refactor job logging so it isn't a bottleneck #1256.
Update progress while queuing NATS tasks #1173 — Update progress while queuing NATS tasks (currently blocks).

Auth & permissions (Antenna + ADC)

Add required permissions to processing service users for processing jobs #1263 / Authentication workflow for Processing Service V2 #964 / PSv2: method for workers to re/authenticate & use an application token #1153 / ML Data Manager role can't fetch job tasks via API #1182 — Processing-service user permissions, the auth workflow for psv2, worker (re)authentication with an application token, and the ML Data Manager role not being able to fetch job tasks.
In flight: Antenna feat: API key auth and identity for processing services #1194 / feat(ui): API key management and client info for processing services #1201 (API-key auth + key management UI for processing services); ADC PR #136 (switch ADC workers to API-key auth).

Pipeline config & result contract (in flight)

Propagate PipelineRequestConfigParameters through pull-mode (NATS) tasks #1275 + Antenna PR feat(ml): propagate pipeline config through NATS pull-mode tasks #1279 with ADC PR #146: propagate PipelineRequestConfigParameters through pull-mode (NATS) tasks so workers honor per-request config.
ADC PR #149: split oversized result uploads so wide-taxonomy batches aren't rejected.
ADC PRs #148 / #139: DataLoader hardening (forkserver context, timeouts, teardown) and making the tuning knobs env-configurable.

Docs, templates, and infra (in flight)

Antenna PR PSv2: Async backend docs and diagram #1137 (async backend docs + diagram), Enhance processing service templates to consume from pipeline queues #1011 (processing-service templates consume from pipeline queues), PSv2: add push-mode worker to the official processing service templates #1154 (push-mode worker template), PSv2: Use connection pooling and retries for NATS #1130 (NATS connection pooling + retries), Update AWS deployment with PSv2 #972 (update AWS deployment for psv2).

Summary by CodeRabbit

New Features
- Async pipeline workers are now enabled by default for new projects.
- Existing projects have been updated to use the new default setting automatically.
Bug Fixes
- Improved consistency in job dispatch behavior by making related tests set the project flag explicitly.

Turn on the `async_pipeline_workers` feature flag for every project that exists at deploy time, rolling out async ML processing (workers that pull tasks from the NATS queue instead of the synchronous push API) across the whole platform at once. The flag lives in the `feature_flags` JSONB column. The data migration reads each project's flags, sets the one boolean, and writes it back, leaving the other feature flags untouched. The reverse flips the flag back off for every project. New projects keep the model default of False until opted in separately. Co-Authored-By: Claude <noreply@anthropic.com>

netlify · 2026-06-26T22:59:10Z

✅ Deploy Preview for antenna-ssec canceled.

Name	Link
🔨 Latest commit	`29cf619`
🔍 Latest deploy log	https://app.netlify.com/projects/antenna-ssec/deploys/6a3f153c035f360009d600ab

netlify · 2026-06-26T22:59:12Z

👷 Deploy Preview for antenna-preview processing.

Name	Link
🔨 Latest commit	`276d7c8`
🔍 Latest deploy log	https://app.netlify.com/projects/antenna-preview/deploys/6a3f043c5431e00008b6db68

netlify · 2026-06-26T22:59:13Z

✅ Deploy Preview for antenna-preview canceled.

Name	Link
🔨 Latest commit	`29cf619`
🔍 Latest deploy log	https://app.netlify.com/projects/antenna-preview/deploys/6a3f153bf3b46300080b06f4

coderabbitai · 2026-06-26T22:59:24Z

Caution

Review failed

Pull request was closed or merged during review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 99e28375-b79b-4dfc-9cf5-ce6c711c7527

📥 Commits

Reviewing files that changed from the base of the PR and between 276d7c8 and 29cf619.

📒 Files selected for processing (3)

ami/jobs/tests/test_jobs.py
ami/main/migrations/0094_enable_async_pipeline_workers.py
ami/main/models.py

📝 Walkthrough

Updates the async_pipeline_workers default to True, adds a migration that updates existing Project.feature_flags values with SQL, and adjusts one job test to set the flag explicitly before job creation.

Changes

Project flag rollout

Layer / File(s)	Summary
Default and data migration `ami/main/models.py`, `ami/main/migrations/0094_enable_async_pipeline_workers.py`	Sets the `ProjectFeatureFlags.async_pipeline_workers` default to `True` and updates existing projects through a SQL-backed `RunPython` migration with forward and reverse toggles.
Job test flag setup `ami/jobs/tests/test_jobs.py`	Updates the ML job dispatch test to disable `async_pipeline_workers` on the project before creating the auto-sync job.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: making psv2 default and rolling it out to existing projects.
Description check	✅ Passed	It covers the summary, change list, notes, and deployment/testing context, though some template sections like issues/screenshots/checklist are absent.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

🧹 Nitpick comments (1)

ami/main/migrations/0094_enable_async_pipeline_workers.py (1)

20-37: 🗄️ Data Integrity & Integration | 🔵 Trivial | ⚡ Quick win

Prefer an atomic JSONB update for this rollout.

The read/modify/save loop writes the whole feature_flags value back per project, so a concurrent update to another flag between Line 23/33 and Line 27/37 can be lost. A DB-side jsonb_set update also avoids caching every Project instance and issuing one save per row.

♻️ Proposed direction

+def set_async_pipeline_workers(apps, schema_editor, enabled):
+    Project = apps.get_model("main", "Project")
+    table = schema_editor.connection.ops.quote_name(Project._meta.db_table)
+    with schema_editor.connection.cursor() as cursor:
+        cursor.execute(
+            f"""
+            UPDATE {table}
+            SET feature_flags = jsonb_set(
+                COALESCE(feature_flags, '{{}}'::jsonb),
+                '{{async_pipeline_workers}}',
+                to_jsonb(%s::boolean),
+                true
+            )
+            WHERE COALESCE((feature_flags->>'async_pipeline_workers')::boolean, false)
+                  IS DISTINCT FROM %s
+            """,
+            [enabled, enabled],
+        )
+
+
 def enable_async_pipeline_workers(apps, schema_editor):
-    Project = apps.get_model("main", "Project")
-    for project in Project.objects.all():
-        flags = project.feature_flags
-        if not flags.async_pipeline_workers:
-            flags.async_pipeline_workers = True
-            project.feature_flags = flags
-            project.save(update_fields=["feature_flags"])
+    set_async_pipeline_workers(apps, schema_editor, True)
 
 
 def disable_async_pipeline_workers(apps, schema_editor):
-    Project = apps.get_model("main", "Project")
-    for project in Project.objects.all():
-        flags = project.feature_flags
-        if flags.async_pipeline_workers:
-            flags.async_pipeline_workers = False
-            project.feature_flags = flags
-            project.save(update_fields=["feature_flags"])
+    set_async_pipeline_workers(apps, schema_editor, False)
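To make the guard in the proposed SQL easier to follow, here is a rough Python emulation of which rows the `WHERE ... IS DISTINCT FROM` clause would select and what the `jsonb_set` call would do to them. It is an approximation of the SQL semantics, not the migration itself; `needs_update`, `set_flag`, and the sample rows (including the sibling key `x`) are hypothetical.

```python
def needs_update(feature_flags, key, target):
    """Approximates COALESCE((feature_flags->>key)::boolean, false) IS DISTINCT FROM target."""
    current = None if feature_flags is None else feature_flags.get(key)
    if current is None:  # NULL column, missing key, or JSON null all coalesce to false
        current = False
    return current != target


def set_flag(rows, key, target):
    """Apply the single-key update to rows passing the guard; return the count changed."""
    changed = 0
    for row in rows:
        if needs_update(row["feature_flags"], key, target):
            flags = dict(row["feature_flags"] or {})  # COALESCE(feature_flags, '{}')
            flags[key] = target                        # only this key is written
            row["feature_flags"] = flags
            changed += 1
    return changed


KEY = "async_pipeline_workers"
rows = [
    {"feature_flags": None},                  # NULL column
    {"feature_flags": {}},                    # key missing
    {"feature_flags": {KEY: True, "x": True}},  # already enabled, has a sibling flag
    {"feature_flags": {KEY: False}},          # explicitly disabled
]

print(set_flag(rows, KEY, True))   # forward: rows already at the target are skipped
print(set_flag(rows, KEY, False))  # reverse: blanket disable, siblings untouched
```

Running the forward step a second time would change nothing, which is the point of the guard: the update is idempotent and avoids rewriting rows that are already at the target value.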

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ami/main/migrations/0094_enable_async_pipeline_workers.py` around lines 20 -
37, The `enable_async_pipeline_workers` and `disable_async_pipeline_workers`
migration helpers currently read/modify/save `Project.feature_flags` in Python,
which can overwrite concurrent flag changes and is inefficient per row. Update
these rollout functions to use an atomic DB-side JSONB update on `feature_flags`
(for example via an `update()` with `jsonb_set`-style logic) so only
`async_pipeline_workers` is toggled in place without loading each `Project`
instance or rewriting the full JSON object.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9c5db873-d1d4-45fa-a0a1-b151588515b1

📥 Commits

Reviewing files that changed from the base of the PR and between 08ca0a4 and 276d7c8.

📒 Files selected for processing (1)

ami/main/migrations/0094_enable_async_pipeline_workers.py

Copilot

Pull request overview

This PR introduces a Django data migration that globally enables async ML processing for all existing projects by setting the feature_flags.async_pipeline_workers flag to True, with a reversible migration that disables it again.

Changes:

Add migration 0094_enable_async_pipeline_workers to enable async_pipeline_workers for every existing Project.
Add reverse migration logic to disable async_pipeline_workers for every Project.

Flip the `async_pipeline_workers` default in `ProjectFeatureFlags` to True so projects created from now on use the async processing service (psv2) — workers that pull tasks from the NATS queue — without an operator opting them in. No migration is required: the `feature_flags` field deconstructs by its schema class and default-factory references, neither of which changes when a default inside the pydantic model changes (`makemigrations --check` reports no changes). The companion data migration handles existing projects. Co-Authored-By: Claude <noreply@anthropic.com>

Address CodeRabbit review on the data migration: replace the read/modify/save loop with a single DB-side `jsonb_set` UPDATE that toggles only the `async_pipeline_workers` key. Updating the one key server-side leaves the other feature flags untouched even if another process changes one of them during the deploy (the previous loop rewrote the whole JSONB value and could clobber a concurrent sibling change), and it runs as one statement instead of one save per row. A `WHERE ... IS DISTINCT FROM` guard skips rows already at the target value. Co-Authored-By: Claude <noreply@anthropic.com>

mihow · 2026-06-26T23:17:24Z

Claude says: Addressed the data-migration nitpick in bd4d8814 — switched the rollout to a single DB-side jsonb_set UPDATE that toggles only the async_pipeline_workers key, with a WHERE ... IS DISTINCT FROM guard so already-correct rows are skipped. This removes the read/modify/save loop, so a concurrent change to a sibling flag during deploy can't be clobbered, and it runs as one statement instead of one save per row. Validated on a fresh test DB: flips both directions and preserves an unrelated flag set on a project.

`test_ml_job_dispatch_mode_set_on_creation` asserted that an ML job on a default project dispatches via sync_api, which relied on `async_pipeline_workers` defaulting to False. Now that the default is True, the sync branch must set the flag off explicitly to exercise that path — the async branch already sets it on. Pins both transitions instead of leaning on the default. Co-Authored-By: Claude <noreply@anthropic.com>
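The idea of pinning both branches instead of leaning on the default can be sketched like this. The selector below is hypothetical and much simpler than the real job-creation path; only the flag name and the `sync_api` / `async_api` mode names come from this PR.

```python
def dispatch_mode(project_flags: dict) -> str:
    """Hypothetical selector: flag on means pull-mode workers, off means the sync push API."""
    # A missing key falls back to the new default of True.
    enabled = project_flags.get("async_pipeline_workers", True)
    return "async_api" if enabled else "sync_api"


def test_dispatch_mode_pins_both_branches():
    # Each branch sets the flag explicitly, so the test keeps passing
    # even if the default is flipped again later.
    assert dispatch_mode({"async_pipeline_workers": False}) == "sync_api"
    assert dispatch_mode({"async_pipeline_workers": True}) == "async_api"


test_dispatch_mode_pins_both_branches()
```

A test that passed an empty flag set and expected `sync_api` would have started failing with this change, which is the situation the commit above addresses.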

The reverse is a blanket disable; some projects may have had the flag enabled individually before this rollout, so the docstring no longer claims no project was True beforehand — it states the reverse returns every project to the off state rather than to its prior value. Co-Authored-By: Claude <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings June 26, 2026 22:59

Copilot started reviewing on behalf of mihow June 26, 2026 22:59

coderabbitai Bot reviewed Jun 26, 2026

Copilot AI reviewed Jun 26, 2026

mihow changed the title ~~Turn on async ML processing for all existing projects~~ Make the async processing service (psv2) the default and switch all existing projects to it Jun 26, 2026

mihow changed the title ~~Make the async processing service (psv2) the default and switch all existing projects to it~~ Make the async processing service (psv2) the default and update all existing projects Jun 26, 2026

mihow mentioned this pull request Jun 26, 2026

PSv2: add push-mode worker to the official processing service templates #1154

Open

mihow merged commit cdcb3b1 into main Jun 27, 2026
6 of 7 checks passed

mihow deleted the chore/enable-async-pipeline-workers-all-projects branch June 27, 2026 00:13

This was referenced Jun 27, 2026

Update platform to use upcoming asynchronous ML backend API #505

Closed

Stage Implementation Monitoring #538

Closed

fix: PSv2 - Tasks are queued, worker sees job but no tasks #1123

Closed

New async & distributed ML backend (aka "PSv2") #515

Open
