fix(supervisor): cancel pending delayed snapshots when the run completes or disconnects by myftija · Pull Request #3894 · triggerdotdev/trigger.dev

myftija · 2026-06-10T16:14:23Z

wip

…nnects

The compute suspend flow delays snapshots by snapshotDelayMs to avoid wasted work on short-lived waitpoints, with the intent that a run which continues before the delay expires cancels the pending snapshot. But the only cancel() call site is the /continue workload action, which runners only invoke when restoring from an already-taken snapshot - so a pending snapshot is never actually cancelled (zero snapshot.canceled events in prod).

When a run resumes and completes within the delay window, the stale snapshot fires anyway and fcrun pauses the VM for ~6-13s while its controller is mid warm-start long-poll. The frozen guest can't fire its abort timer or send a FIN, so firestarter keeps the connection claimable past the client deadline and dispatches runs into it - each one a ~300s stall (TRI-10293).

Cancel the pending snapshot when the attempt completes and when the run socket disconnects. Genuine waitpoint suspensions keep the runner socket connected and the attempt incomplete, so neither hook cancels a snapshot that is still wanted. Cancellation is guarded by runnerId so a stale duplicate runner for a reassigned run can't cancel the new runner's pending snapshot.

changeset-bot · 2026-06-10T16:14:28Z

⚠️ No Changeset found

Latest commit: a76626c

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


coderabbitai · 2026-06-10T16:14:51Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 6aac65fa-9e92-4a3e-8aee-05919e5d2c6e

📥 Commits

Reviewing files that changed from the base of the PR and between dbbe9b3 and a76626c.

📒 Files selected for processing (2)

apps/supervisor/src/workloadManager/compute.test.ts
apps/supervisor/src/workloadManager/compute.ts

🚧 Files skipped from review as they are similar to previous changes (1)

apps/supervisor/src/workloadManager/compute.ts

📜 Recent review details

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: typecheck / typecheck
GitHub Check: build (supervisor)

🧰 Additional context used

📓 Path-based instructions (8)

**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead

Import from @trigger.dev/sdk when writing Trigger.dev tasks. Never use @trigger.dev/sdk/v3 or deprecated client.defineJob

Files:

apps/supervisor/src/workloadManager/compute.test.ts

**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use function declarations instead of default exports

**/*.{ts,tsx,js,jsx}: Prefer static imports over dynamic imports. Only use dynamic import() when circular dependencies cannot be resolved, code splitting is needed for performance, or the module must be loaded conditionally at runtime
Import subpaths only from packages/core (@trigger.dev/core), never import from the root

Files:

apps/supervisor/src/workloadManager/compute.test.ts

**/*.{test,spec}.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use vitest for all tests in the Trigger.dev repository

Files:

apps/supervisor/src/workloadManager/compute.test.ts

**/*.ts

📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)

**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries

Files:

apps/supervisor/src/workloadManager/compute.test.ts

apps/supervisor/src/workloadManager/**/*.{js,ts}

📄 CodeRabbit inference engine (apps/supervisor/CLAUDE.md)

Container orchestration abstraction (Docker or Kubernetes) should be implemented in src/workloadManager/

Files:

apps/supervisor/src/workloadManager/compute.test.ts

**/*.test.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.test.{ts,tsx}: Never mock anything in tests - use testcontainers instead
Test files should be placed next to source files (e.g., MyService.ts -> MyService.test.ts)

Files:

apps/supervisor/src/workloadManager/compute.test.ts

**/*.{js,ts,tsx,jsx,css,json,md}

📄 CodeRabbit inference engine (AGENTS.md)

Use Prettier for code formatting and run pnpm run format before committing

Files:

apps/supervisor/src/workloadManager/compute.test.ts

**/*.test.{js,ts,tsx}

📄 CodeRabbit inference engine (AGENTS.md)

**/*.test.{js,ts,tsx}: Test files should live beside the files under test and use descriptive describe and it blocks
Use vitest for unit testing
Tests should avoid mocks or stubs and use helpers from @internal/testcontainers when Redis or Postgres are needed

Files:

apps/supervisor/src/workloadManager/compute.test.ts

🧠 Learnings (7)

📚 Learning: 2026-03-22T13:26:12.060Z

Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3244
File: apps/webapp/app/components/code/TextEditor.tsx:81-86
Timestamp: 2026-03-22T13:26:12.060Z
Learning: In the triggerdotdev/trigger.dev codebase, do not flag `navigator.clipboard.writeText(...)` calls for `missing-await`/`unhandled-promise` issues. These clipboard writes are intentionally invoked without `await` and without `catch` handlers across the project; keep that behavior consistent when reviewing TypeScript/TSX files (e.g., usages like in `apps/webapp/app/components/code/TextEditor.tsx`).

Applied to files:

apps/supervisor/src/workloadManager/compute.test.ts

📚 Learning: 2026-03-22T19:24:14.403Z

Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 3187
File: apps/webapp/app/v3/services/alerts/deliverErrorGroupAlert.server.ts:200-204
Timestamp: 2026-03-22T19:24:14.403Z
Learning: In the triggerdotdev/trigger.dev codebase, webhook URLs are not expected to contain embedded credentials/secrets (e.g., fields like `ProjectAlertWebhookProperties` should only hold credential-free webhook endpoints). During code review, if you see logging or inclusion of raw webhook URLs in error messages, do not automatically treat it as a credential-leak/secrets-in-logs issue by default—first verify the URL does not contain embedded credentials (for example, no username/password in the URL, no obvious secret/token query params or fragments). If the URL is credential-free per this project’s conventions, allow the logging.

Applied to files:

apps/supervisor/src/workloadManager/compute.test.ts

📚 Learning: 2026-05-18T08:21:27.694Z

Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3632
File: apps/webapp/sentry.server.ts:4-21
Timestamp: 2026-05-18T08:21:27.694Z
Learning: When handling Prisma error P1001 ("Can't reach database server") in TypeScript, don’t assume a single error shape. Prisma can surface P1001 via two different error classes/fields: `PrismaClientKnownRequestError` exposes it as `err.code === "P1001"` (common during mid-query connection drops), while `PrismaClientInitializationError` exposes it as `err.errorCode === "P1001"` (common on client startup failure). Therefore, predicates should use `err.code === "P1001" || err.errorCode === "P1001"`. Do not flag `err.code === "P1001"` as “unreachable/never matches,” as it is expected in production.

Applied to files:

apps/supervisor/src/workloadManager/compute.test.ts

📚 Learning: 2026-05-18T08:21:27.694Z

Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3632
File: apps/webapp/sentry.server.ts:4-21
Timestamp: 2026-05-18T08:21:27.694Z
Learning: When handling Prisma errors for P1001 ("Can't reach database server"), do not assume it only appears under a single property name. Prisma may surface P1001 via either `PrismaClientKnownRequestError` (`err.code === "P1001"`, e.g., mid-query connection drops) or `PrismaClientInitializationError` (`err.errorCode === "P1001"`, e.g., client startup connection failure). To reliably detect the condition, check `err.code === "P1001" || err.errorCode === "P1001"`, and avoid review rules that would incorrectly flag `err.code === "P1001"` as unreachable/never-matching.

Applied to files:

apps/supervisor/src/workloadManager/compute.test.ts

📚 Learning: 2026-05-18T14:40:02.173Z

Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3658
File: packages/core/src/v3/realtimeStreams/manager.test.ts:1-147
Timestamp: 2026-05-18T14:40:02.173Z
Learning: In the triggerdotdev/trigger.dev repo, the policy “Never mock anything — use testcontainers instead” should only be enforced for integration tests that interact with real external services (e.g., Redis, Postgres) via actual infrastructure. For unit tests that exercise pure in-memory logic (e.g., cache semantics) it is OK to stub collaborators such as `ApiClient` using Vitest (`vi.fn()`) to assert call counts or control behavior. Do not flag `vi.fn()`-based `ApiClient` stubs in unit tests as violations of the testcontainers policy.

Applied to files:

apps/supervisor/src/workloadManager/compute.test.ts

📚 Learning: 2026-06-04T18:16:35.386Z

Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3836
File: apps/supervisor/src/backpressure/backpressureMonitor.ts:3-5
Timestamp: 2026-06-04T18:16:35.386Z
Learning: When reviewing TypeScript in this repo, apply the rule “prefer type aliases over interfaces” only to data/object shapes and union/intersection type modeling. If an interface is being used as a behavioral contract for collaborators to implement (e.g., method-shape interfaces that define required behavior, such as `BackpressureLogger` / `BackpressureSignalSource` in `apps/supervisor/src/backpressure/backpressureMonitor.ts`), keep it as an `interface` and do not flag it as a type-alias-vs-interface violation.

Applied to files:

apps/supervisor/src/workloadManager/compute.test.ts

📚 Learning: 2026-06-09T17:58:04.699Z

Learnt from: 0ski
Repo: triggerdotdev/trigger.dev PR: 3879
File: apps/webapp/app/models/vercelIntegration.server.ts:619-630
Timestamp: 2026-06-09T17:58:04.699Z
Learning: In this codebase, outbound raw `fetch` calls should typically rely on Node/undici’s default request timeout (about ~300s) rather than adding a per-call `AbortController` + `setTimeout` wrapper inside individual functions (e.g. in files like `apps/webapp/app/models/vercelIntegration.server.ts`). During code review, do not flag the absence of a per-call timeout on a single `fetch` as an issue; if per-call timeouts are needed, they should be implemented via a codebase-wide convention (e.g., a shared fetch wrapper or documented pattern) rather than ad-hoc per-function changes.

Applied to files:

apps/supervisor/src/workloadManager/compute.test.ts

🔇 Additional comments (3)

apps/supervisor/src/workloadManager/compute.test.ts (3)

5-14: LGTM!

16-56: LGTM!

3-3: Confirm compute.ts exports used in compute.test.ts

apps/supervisor/src/workloadManager/compute.ts exports both runnerNameForAttempt and isRetryableCreateError as named export functions, matching the imports in compute.test.ts.

Walkthrough

This PR makes snapshot cancellation runner-aware by adding TimerWheel.peek, changing ComputeSnapshotService.cancel to accept an optional runnerId and refuse cancellation if the pending entry belongs to a different runner, adds tests covering scheduling/cancellation and re-scheduling semantics, and updates workloadServer HTTP and WebSocket handlers to pass runnerId when cancelling. It also introduces isRetryableCreateError and implements a bounded retry loop with backoff for compute instance creation, plus tests for the retry rules.
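The runner-aware cancellation described in the walkthrough can be sketched roughly as below. This is a minimal illustration, not the code in computeSnapshotService.ts or timerWheel.ts: the class name `DelayedSnapshots`, the `schedule` method, the `onFire` callback, and the use of plain `setTimeout` instead of a timer wheel are all assumptions made for the example. Only the idea of a `peek` lookup and a `cancel` that takes an optional `runnerId` comes from the PR.

```typescript
// Hypothetical sketch: every name here except peek/cancel/runnerId is an assumption.
type PendingSnapshot = {
  runnerId: string;
  timer: ReturnType<typeof setTimeout>;
};

class DelayedSnapshots {
  private pending = new Map<string, PendingSnapshot>();
  private delayMs: number;
  private onFire: (runId: string, runnerId: string) => void;

  constructor(delayMs: number, onFire: (runId: string, runnerId: string) => void) {
    this.delayMs = delayMs;
    this.onFire = onFire;
  }

  // Schedule (or re-schedule) a delayed snapshot; a newer entry replaces the old one.
  schedule(runId: string, runnerId: string): void {
    const existing = this.pending.get(runId);
    if (existing) clearTimeout(existing.timer);

    const timer = setTimeout(() => {
      this.pending.delete(runId);
      this.onFire(runId, runnerId);
    }, this.delayMs);

    this.pending.set(runId, { runnerId, timer });
  }

  // Inspect the pending entry without removing it.
  peek(runId: string): { runnerId: string } | undefined {
    const entry = this.pending.get(runId);
    return entry ? { runnerId: entry.runnerId } : undefined;
  }

  // Cancel the pending snapshot. When a runnerId is supplied and the pending
  // entry belongs to a different runner, refuse and leave the entry armed.
  cancel(runId: string, runnerId?: string): boolean {
    const entry = this.pending.get(runId);
    if (!entry) return false;
    if (runnerId !== undefined && entry.runnerId !== runnerId) return false;

    clearTimeout(entry.timer);
    this.pending.delete(runId);
    return true;
  }
}
```

In this sketch a stale duplicate runner calling `cancel("run_1", "runner-b")` gets `false` back while the entry scheduled by `runner-a` stays armed, which is the property the runnerId guard is meant to provide.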

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The PR description is incomplete and provides minimal information; it only contains 'wip' (work in progress) instead of following the required template sections.	Complete the description by filling out all template sections: Testing (describe testing steps), Changelog (explain what changed), and optional Screenshots. Replace the 'wip' placeholder with actual content.
Docstring Coverage	⚠️ Warning	Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: cancelling pending delayed snapshots when runs complete or disconnect, which is reflected in the workload server updates.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


coderabbitai

Actionable comments posted: 1

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: ada40160-1246-43b9-a2a1-52dc2ea2e5ed

📥 Commits

Reviewing files that changed from the base of the PR and between 081b6ba and 829fec6.

📒 Files selected for processing (4)

apps/supervisor/src/services/computeSnapshotService.test.ts
apps/supervisor/src/services/computeSnapshotService.ts
apps/supervisor/src/services/timerWheel.ts
apps/supervisor/src/workloadServer/index.ts

📜 Review details

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: typecheck / typecheck

🧰 Additional context used

📓 Path-based instructions (9)

**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead

Import from @trigger.dev/sdk when writing Trigger.dev tasks. Never use @trigger.dev/sdk/v3 or deprecated client.defineJob

Files:

apps/supervisor/src/services/timerWheel.ts
apps/supervisor/src/services/computeSnapshotService.ts
apps/supervisor/src/workloadServer/index.ts
apps/supervisor/src/services/computeSnapshotService.test.ts

**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use function declarations instead of default exports

**/*.{ts,tsx,js,jsx}: Prefer static imports over dynamic imports. Only use dynamic import() when circular dependencies cannot be resolved, code splitting is needed for performance, or the module must be loaded conditionally at runtime
Import subpaths only from packages/core (@trigger.dev/core), never import from the root

Files:

apps/supervisor/src/services/timerWheel.ts
apps/supervisor/src/services/computeSnapshotService.ts
apps/supervisor/src/workloadServer/index.ts
apps/supervisor/src/services/computeSnapshotService.test.ts

**/*.ts

📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)

**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries

Files:

apps/supervisor/src/services/timerWheel.ts
apps/supervisor/src/services/computeSnapshotService.ts
apps/supervisor/src/workloadServer/index.ts
apps/supervisor/src/services/computeSnapshotService.test.ts

apps/supervisor/src/services/**/*.{js,ts}

📄 CodeRabbit inference engine (apps/supervisor/CLAUDE.md)

Core service logic should be organized in the src/services/ directory

Files:

apps/supervisor/src/services/timerWheel.ts
apps/supervisor/src/services/computeSnapshotService.ts
apps/supervisor/src/services/computeSnapshotService.test.ts

**/*.{js,ts,tsx,jsx,css,json,md}

📄 CodeRabbit inference engine (AGENTS.md)

Use Prettier for code formatting and run pnpm run format before committing

Files:

apps/supervisor/src/services/timerWheel.ts
apps/supervisor/src/services/computeSnapshotService.ts
apps/supervisor/src/workloadServer/index.ts
apps/supervisor/src/services/computeSnapshotService.test.ts

apps/supervisor/src/workloadServer/**/*.{js,ts}

📄 CodeRabbit inference engine (apps/supervisor/CLAUDE.md)

HTTP server for workload communication (heartbeats, snapshots) should be implemented in src/workloadServer/

Files:

apps/supervisor/src/workloadServer/index.ts

**/*.{test,spec}.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use vitest for all tests in the Trigger.dev repository

Files:

apps/supervisor/src/services/computeSnapshotService.test.ts

**/*.test.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.test.{ts,tsx}: Never mock anything in tests - use testcontainers instead
Test files should be placed next to source files (e.g., MyService.ts -> MyService.test.ts)

Files:

apps/supervisor/src/services/computeSnapshotService.test.ts

**/*.test.{js,ts,tsx}

📄 CodeRabbit inference engine (AGENTS.md)

**/*.test.{js,ts,tsx}: Test files should live beside the files under test and use descriptive describe and it blocks
Use vitest for unit testing
Tests should avoid mocks or stubs and use helpers from @internal/testcontainers when Redis or Postgres are needed

Files:

apps/supervisor/src/services/computeSnapshotService.test.ts

🧠 Learnings (7)

📚 Learning: 2026-03-22T13:26:12.060Z

Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3244
File: apps/webapp/app/components/code/TextEditor.tsx:81-86
Timestamp: 2026-03-22T13:26:12.060Z
Learning: In the triggerdotdev/trigger.dev codebase, do not flag `navigator.clipboard.writeText(...)` calls for `missing-await`/`unhandled-promise` issues. These clipboard writes are intentionally invoked without `await` and without `catch` handlers across the project; keep that behavior consistent when reviewing TypeScript/TSX files (e.g., usages like in `apps/webapp/app/components/code/TextEditor.tsx`).

Applied to files:

apps/supervisor/src/services/timerWheel.ts
apps/supervisor/src/services/computeSnapshotService.ts
apps/supervisor/src/workloadServer/index.ts
apps/supervisor/src/services/computeSnapshotService.test.ts

📚 Learning: 2026-03-22T19:24:14.403Z

Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 3187
File: apps/webapp/app/v3/services/alerts/deliverErrorGroupAlert.server.ts:200-204
Timestamp: 2026-03-22T19:24:14.403Z
Learning: In the triggerdotdev/trigger.dev codebase, webhook URLs are not expected to contain embedded credentials/secrets (e.g., fields like `ProjectAlertWebhookProperties` should only hold credential-free webhook endpoints). During code review, if you see logging or inclusion of raw webhook URLs in error messages, do not automatically treat it as a credential-leak/secrets-in-logs issue by default—first verify the URL does not contain embedded credentials (for example, no username/password in the URL, no obvious secret/token query params or fragments). If the URL is credential-free per this project’s conventions, allow the logging.

Applied to files:

apps/supervisor/src/services/timerWheel.ts
apps/supervisor/src/services/computeSnapshotService.ts
apps/supervisor/src/workloadServer/index.ts
apps/supervisor/src/services/computeSnapshotService.test.ts

📚 Learning: 2026-05-18T08:21:27.694Z

Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3632
File: apps/webapp/sentry.server.ts:4-21
Timestamp: 2026-05-18T08:21:27.694Z
Learning: When handling Prisma error P1001 ("Can't reach database server") in TypeScript, don’t assume a single error shape. Prisma can surface P1001 via two different error classes/fields: `PrismaClientKnownRequestError` exposes it as `err.code === "P1001"` (common during mid-query connection drops), while `PrismaClientInitializationError` exposes it as `err.errorCode === "P1001"` (common on client startup failure). Therefore, predicates should use `err.code === "P1001" || err.errorCode === "P1001"`. Do not flag `err.code === "P1001"` as “unreachable/never matches,” as it is expected in production.

Applied to files:

apps/supervisor/src/services/timerWheel.ts
apps/supervisor/src/services/computeSnapshotService.ts
apps/supervisor/src/workloadServer/index.ts
apps/supervisor/src/services/computeSnapshotService.test.ts

📚 Learning: 2026-05-18T08:21:27.694Z

Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3632
File: apps/webapp/sentry.server.ts:4-21
Timestamp: 2026-05-18T08:21:27.694Z
Learning: When handling Prisma errors for P1001 ("Can't reach database server"), do not assume it only appears under a single property name. Prisma may surface P1001 via either `PrismaClientKnownRequestError` (`err.code === "P1001"`, e.g., mid-query connection drops) or `PrismaClientInitializationError` (`err.errorCode === "P1001"`, e.g., client startup connection failure). To reliably detect the condition, check `err.code === "P1001" || err.errorCode === "P1001"`, and avoid review rules that would incorrectly flag `err.code === "P1001"` as unreachable/never-matching.

Applied to files:

apps/supervisor/src/services/timerWheel.ts
apps/supervisor/src/services/computeSnapshotService.ts
apps/supervisor/src/workloadServer/index.ts
apps/supervisor/src/services/computeSnapshotService.test.ts

📚 Learning: 2026-06-04T18:16:35.386Z

Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3836
File: apps/supervisor/src/backpressure/backpressureMonitor.ts:3-5
Timestamp: 2026-06-04T18:16:35.386Z
Learning: When reviewing TypeScript in this repo, apply the rule “prefer type aliases over interfaces” only to data/object shapes and union/intersection type modeling. If an interface is being used as a behavioral contract for collaborators to implement (e.g., method-shape interfaces that define required behavior, such as `BackpressureLogger` / `BackpressureSignalSource` in `apps/supervisor/src/backpressure/backpressureMonitor.ts`), keep it as an `interface` and do not flag it as a type-alias-vs-interface violation.

Applied to files:

apps/supervisor/src/services/timerWheel.ts
apps/supervisor/src/services/computeSnapshotService.ts
apps/supervisor/src/workloadServer/index.ts
apps/supervisor/src/services/computeSnapshotService.test.ts

📚 Learning: 2026-06-09T17:58:04.699Z

Learnt from: 0ski
Repo: triggerdotdev/trigger.dev PR: 3879
File: apps/webapp/app/models/vercelIntegration.server.ts:619-630
Timestamp: 2026-06-09T17:58:04.699Z
Learning: In this codebase, outbound raw `fetch` calls should typically rely on Node/undici’s default request timeout (about ~300s) rather than adding a per-call `AbortController` + `setTimeout` wrapper inside individual functions (e.g. in files like `apps/webapp/app/models/vercelIntegration.server.ts`). During code review, do not flag the absence of a per-call timeout on a single `fetch` as an issue; if per-call timeouts are needed, they should be implemented via a codebase-wide convention (e.g., a shared fetch wrapper or documented pattern) rather than ad-hoc per-function changes.

Applied to files:

apps/supervisor/src/services/timerWheel.ts
apps/supervisor/src/services/computeSnapshotService.ts
apps/supervisor/src/workloadServer/index.ts
apps/supervisor/src/services/computeSnapshotService.test.ts

📚 Learning: 2026-05-18T14:40:02.173Z

Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3658
File: packages/core/src/v3/realtimeStreams/manager.test.ts:1-147
Timestamp: 2026-05-18T14:40:02.173Z
Learning: In the triggerdotdev/trigger.dev repo, the policy “Never mock anything — use testcontainers instead” should only be enforced for integration tests that interact with real external services (e.g., Redis, Postgres) via actual infrastructure. For unit tests that exercise pure in-memory logic (e.g., cache semantics) it is OK to stub collaborators such as `ApiClient` using Vitest (`vi.fn()`) to assert call counts or control behavior. Do not flag `vi.fn()`-based `ApiClient` stubs in unit tests as violations of the testcontainers policy.

Applied to files:

apps/supervisor/src/services/computeSnapshotService.test.ts

🔇 Additional comments (4)

apps/supervisor/src/services/timerWheel.ts (1)

124-128: LGTM!

apps/supervisor/src/services/computeSnapshotService.ts (1)

95-107: LGTM!

apps/supervisor/src/services/computeSnapshotService.test.ts (1)

1-130: LGTM!

apps/supervisor/src/workloadServer/index.ts (1)

741-748: LGTM!

coderabbitai · 2026-06-10T16:24:02Z

+                // A completed attempt invalidates any pending delayed snapshot: the
+                // suspended execution state it was scheduled to capture no longer
+                // exists. Without this, the snapshot fires up to snapshotDelayMs
+                // later and pauses a VM that has long moved on, e.g. mid warm-start
+                // long-poll or already executing the next run.
+                this.snapshotService?.cancel(
+                  params.runFriendlyId,
+                  this.runnerIdFromRequest(req)
+                );


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Cancel pending delayed snapshots even when completeRunAttempt fails.

This cancellation currently runs only on the success path, so transient completion failures can still leave a stale delayed snapshot armed and later pause the VM unexpectedly.

Suggested fix

 async () => {
   const { req, reply, params, body } = ctx;
+  const runnerId = this.runnerIdFromRequest(req);
   const completeResponse = await this.workerClient.completeRunAttempt(
     params.runFriendlyId,
     params.snapshotFriendlyId,
     body,
-    this.runnerIdFromRequest(req)
+    runnerId
   );
+
+  // Always invalidate pending delayed snapshots once a completion is reported
+  // by this runner; runnerId guard prevents stale-runner cancellation.
+  this.snapshotService?.cancel(params.runFriendlyId, runnerId);

   if (!completeResponse.success) {
     this.logger.error("Failed to complete run", {
       params,
       error: completeResponse.error,
     });
     reply.empty(500);
     return;
   }
-
-  this.snapshotService?.cancel(
-    params.runFriendlyId,
-    this.runnerIdFromRequest(req)
-  );

…he run

ComputeWorkloadManager.create swallows gateway errors by design, so a cold start that fails placement (e.g. a netns slot with a busy tap, a full node disk) silently abandons the dequeued run until the run engine's PENDING_EXECUTING timeout redrives it minutes later. These failures are transient per placement - redriven runs virtually always succeed - so retry the create up to 3 times with short backoff before giving up. Gateway 5xx and network-level fetch failures are retried; 4xx responses (won't heal) and timeouts (the instance may still be provisioning) are not.
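The retry rules in this commit message can be illustrated with a short sketch. It is written under assumptions and is not the PR's isRetryableCreateError: the `CreateFailure` and `CreateResult` shapes, the names `shouldRetryCreate` and `createWithRetry`, the reading of "up to 3 times" as three attempts in total, and the linear backoff values are all invented for the example.

```typescript
// Hypothetical failure model; the real gateway error shape is not shown in the PR page.
type CreateFailure =
  | { kind: "http"; status: number }
  | { kind: "network" }
  | { kind: "timeout" };

type CreateResult = { ok: true } | { ok: false; failure: CreateFailure };

// Gateway 5xx and network-level failures are retried; 4xx and timeouts are not.
function shouldRetryCreate(failure: CreateFailure): boolean {
  switch (failure.kind) {
    case "http":
      return failure.status >= 500 && failure.status <= 599;
    case "network":
      return true;
    case "timeout":
      // The instance may still be provisioning, so a retry could double-create.
      return false;
  }
}

// Bounded retry loop with a short, linearly growing delay between attempts.
async function createWithRetry(
  attemptCreate: (attempt: number) => Promise<CreateResult>,
  maxAttempts = 3,
  baseDelayMs = 100
): Promise<CreateResult> {
  let result = await attemptCreate(1);

  for (let attempt = 2; attempt <= maxAttempts; attempt++) {
    if (result.ok || !shouldRetryCreate(result.failure)) return result;

    // Wait baseDelayMs before attempt 2, 2 * baseDelayMs before attempt 3, and so on.
    await new Promise((resolve) => setTimeout(resolve, baseDelayMs * (attempt - 1)));
    result = await attemptCreate(attempt);
  }

  return result;
}
```

Passing the attempt number into `attemptCreate` lets the caller vary the request per attempt, for example to derive a different instance name on retries.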

A failed create can leave its instance name registered gateway/fcrun-side until async cleanup runs, so a same-name retry can 409 against our own residue (observed: tap-EBUSY 500 at 18:29Z followed by 409 name_conflict on the retry 2.7s later, costing the full redrive anyway). Give retry attempts a deterministic -rN suffix; attempt 1 keeps the unsuffixed name so the non-retry path is unchanged. The suffixed name flows into both the instance name and TRIGGER_RUNNER_ID from the same variable - every downstream flow (suspend scheduling, snapshot dispatch, cancel guards, run-engine fields) treats it as one opaque self-reported token, and restored VMs already carry deterministic name suffixes. Temporary measure (TRI-10293): the proper fix is gateway-side cleanup of failed-create registrations.
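A deterministic retry suffix of the kind described above might look like the following sketch. The function names `nameForAttempt` and `buildCreateRequest`, the request shape, and the choice of N as the attempt number (so attempt 2 becomes `-r2`) are assumptions for illustration; the PR page does not show the actual numbering or the signature of runnerNameForAttempt.

```typescript
// Hypothetical helper: attempt 1 keeps the base name, later attempts get a -rN suffix.
// Assumption: N is the 1-based attempt number.
function nameForAttempt(baseName: string, attempt: number): string {
  if (!Number.isInteger(attempt) || attempt < 1) {
    throw new RangeError(`attempt must be a positive integer, got ${attempt}`);
  }
  return attempt === 1 ? baseName : `${baseName}-r${attempt}`;
}

// Hypothetical request shape: one variable feeds both the instance name and
// TRIGGER_RUNNER_ID, so downstream flows see a single consistent token.
function buildCreateRequest(baseName: string, attempt: number) {
  const runnerName = nameForAttempt(baseName, attempt);
  return {
    name: runnerName,
    env: { TRIGGER_RUNNER_ID: runnerName },
  };
}
```

Because the name is a pure function of the base name and the attempt number, a retry never reuses a name that an earlier failed attempt may still have registered, and the first attempt is unchanged from the non-retry path.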

coderabbitai Bot reviewed Jun 10, 2026


myftija added 2 commits June 10, 2026 20:02

myftija wants to merge 3 commits into main from tri-10293-cancel-stale-delayed-snapshots
