Run Engine 2: More robust attempt failing/retrying (inc. OOM retrying) #1773

matt-aitken · 2025-03-06T19:28:58Z

- Extracted the failed-attempt retry logic into a separate file to simplify the run engine.
- Added tests for attempts failing under different conditions.
- Added support for OOM retrying.
- Set `--max-old-space-size` on TaskRunProcess based on the machine size, so the limit now applies to dev runs as well as prod runs.
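
For readers skimming the description, here is a minimal sketch of how a machine size could be turned into a `--max-old-space-size` value. It follows the formula quoted later in the review thread (`machine.memory * 1_024 * (1 - overhead)` with a default overhead of 0.2). The `MachinePresetSketch` type, the `maxOldSpaceSizeFlag` helper, and the assumption that `memory` is in GB are illustrative and not taken from the PR.

```typescript
// Hypothetical minimal shape; the real MachinePreset type has more fields.
// Assumption: `memory` is expressed in GB.
type MachinePresetSketch = { name: string; memory: number };

// Reserve `overhead` (20% by default) of the machine's memory and hand the
// rest, converted to MB, to V8's old space.
function maxOldSpaceSizeForMachine(
  machine: MachinePresetSketch,
  overhead: number = 0.2
): number {
  return Math.round(machine.memory * 1_024 * (1 - overhead));
}

// Hypothetical helper: renders the value as a Node.js CLI flag.
function maxOldSpaceSizeFlag(machine: MachinePresetSketch, overhead?: number): string {
  return `--max-old-space-size=${maxOldSpaceSizeForMachine(machine, overhead)}`;
}

// Illustrative preset: 0.5 GB * 1024 * 0.8 = 409.6, which rounds to 410.
const small: MachinePresetSketch = { name: "small-1x", memory: 0.5 };
console.log(maxOldSpaceSizeFlag(small));
```

Because the value depends only on the machine preset, the same calculation can be applied wherever the task process is spawned, which is what lets dev and prod runs share the limit.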

Summary by CodeRabbit

New Features
- Added a new task configuration that streamlines package builds.
- Introduced enhanced Node.js runtime options for dynamic memory management.
- Added a function to deduplicate command line flags for improved command handling.
Bug Fixes
- Improved detection and handling of memory-related issues to increase overall system stability and reduce crashes.
Refactor
- Optimized retry and error-handling mechanisms for task runs.
- Updated background processing to support customizable retry settings, enhancing recovery and performance.
- Enhanced schema validation to enforce required machine configuration in task execution.
- Updated test configurations to include additional test file paths for comprehensive coverage.

changeset-bot · 2025-03-06T19:29:03Z

⚠️ No Changeset found

Latest commit: 271eaa8

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

coderabbitai · 2025-03-06T19:29:07Z

Walkthrough

This PR introduces a new task configuration for building packages and adjusts several error handling paths across the codebase. The error identification logic has been updated by replacing the general OOM check with a more specific one, and old functions have been removed. In addition, the run engine’s retry logic has been refactored into a centralized function, and new tests have been added or modified across multiple modules. Changes also include updates to task run initialization for memory management and expanded export mappings and schema validations in the core package.

Changes

| File(s) | Change Summary |
| --- | --- |
| `.zed/tasks.json` | Adds a new JSON configuration for the "Build packages" task with environment variable overrides and terminal behavior settings. |
| `apps/webapp/.../alerts/deliverAlert.server.ts` `apps/webapp/.../completeAttempt.server.ts` | Updates error handling by replacing `isOOMError` with `isOOMRunError`; removes the obsolete `isOOMError` function in the CompleteAttempt service. |
| `internal-packages/.../engine/index.ts` `internal-packages/.../engine/retrying.ts` | Refactors run engine error handling by introducing `retryOutcomeFromCompletion` and a helper (`retryOOMOnMachine`) for centralized retry outcome decisions. |
| `internal-packages/.../tests/attemptFailures.test.ts` `internal-packages/.../tests/batchTriggerAndWait.test.ts` `internal-packages/.../tests/heartbeats.test.ts` `internal-packages/.../tests/trigger.test.ts` | Adds new tests for various failure scenarios, updates imports to include `describe`, and removes a specific test case (“Single run (retry attempt, then succeed)”). |
| `internal-packages/testcontainers/src/setup.ts` | Updates the `setupBackgroundWorker` function signature by adding an optional `retryOptions` parameter for enhanced task retry configuration. |
| `packages/cli-v3/.../taskRunProcess.ts` | Enhances the initialization process by including `payload` in option destructuring and adding memory management through `nodeOptionsWithMaxOldSpaceSize`. |
| `packages/core/package.json` | Adds a new export entry for the `./v3/machines` module, updating both module and type definitions for ESM and CommonJS formats. |
| `packages/core/src/v3/build/flags.ts` `packages/core/src/v3/build/flags.test.ts` `packages/core/src/v3/build/index.ts` `packages/core/src/v3/build/runtime.ts` | Introduces the `dedupFlags` function with corresponding tests and export updates; modifies `execOptionsForRuntime` to accept additional Node options and deduplicate flags. |
| `packages/core/src/v3/errors.ts` `packages/core/src/v3/machines/index.ts` `packages/core/src/v3/machines/max-old-space.test.ts` | Adds the `isOOMRunError` function for improved OOM error identification; introduces new functions to compute and append max-old-space-size flags along with their tests. |
| `packages/core/src/v3/schemas/common.ts` `packages/core/vitest.config.ts` | Updates the TaskRunExecution schema to require the `machine` property and expands Vitest’s file matching pattern to include tests in `src/v3`. |
| `apps/docker-provider/src/index.ts` | Updates the import source for `testDockerCheckpoint` from `@trigger.dev/core/v3/checkpoints` to `@trigger.dev/core/v3/serverOnly`. |

Sequence Diagram(s)

sequenceDiagram
    participant Engine as RunEngine
    participant RetryFn as retryOutcomeFromCompletion
    participant DB as Prisma DB
    participant Worker as Background Worker

    Engine->>RetryFn: Pass runId, error, retry settings
    RetryFn->>DB: Query run settings & machine preset
    DB-->>RetryFn: Return settings
    alt Retry conditions met
        RetryFn-->>Engine: Return "retry" outcome
        Engine->>Worker: Schedule retry with updated machine config
    else No retry available
        RetryFn-->>Engine: Return "fail" outcome
        Engine->>Engine: Mark run as failed
    end
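
The flow in the diagram can be sketched as a pure decision function. This is only an illustration under stated assumptions: the outcome strings (`"cancel_run"`, `"fail_run"`, `"retry"`) and the `attemptNumber: number | null` default of 1 come from the review notes, while the `CompletionSketch` input shape, the exponential backoff, and the machine names are hypothetical. The real `retryOutcomeFromCompletion` also queries the database, which is omitted here.

```typescript
type RetryOutcomeSketch =
  | { outcome: "cancel_run" }
  | { outcome: "fail_run" }
  | { outcome: "retry"; delayMs: number; machine?: string };

// Hypothetical input: everything the decision needs, already loaded.
type CompletionSketch = {
  errorKind: "cancelled" | "oom" | "user_error";
  attemptNumber: number | null;
  maxAttempts: number;
  currentMachine: string;
  oomRetryMachine?: string; // larger machine to move to after an OOM, if configured
};

function retryOutcomeSketch(c: CompletionSketch): RetryOutcomeSketch {
  if (c.errorKind === "cancelled") return { outcome: "cancel_run" };

  // A missing attempt number is treated as the first attempt.
  const attempt = c.attemptNumber ?? 1;
  if (attempt >= c.maxAttempts) return { outcome: "fail_run" };

  if (c.errorKind === "oom") {
    // Retry immediately, but only if there is a different machine to move to;
    // otherwise repeated OOMs would loop on the same machine.
    if (c.oomRetryMachine !== undefined && c.oomRetryMachine !== c.currentMachine) {
      return { outcome: "retry", delayMs: 0, machine: c.oomRetryMachine };
    }
    return { outcome: "fail_run" };
  }

  // Assumed backoff for ordinary errors: 1s, 2s, 4s, ...
  return { outcome: "retry", delayMs: 1_000 * 2 ** (attempt - 1) };
}
```

Keeping the decision separate from the side effects (scheduling the retry, marking the run as failed) is what makes the failure scenarios easy to test in isolation.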

Suggested reviewers

ericallam

Poem

In the land of code, I hop with glee,
A rabbit’s cheer for changes, you see!
Memory flagged and errors fixed,
With retry logic so neatly mixed.
Carrots and commits in a joyful spree 🥕✨
Hopping forward, bug-free and free!

🐰 Happy coding!

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6f5cfac and 271eaa8.

📒 Files selected for processing (1)

apps/docker-provider/src/index.ts (1 hunks)

✅ Files skipped from review due to trivial changes (1)

apps/docker-provider/src/index.ts

⏰ Context from checks skipped due to timeout of 90000ms (7)

GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
GitHub Check: typecheck / typecheck
GitHub Check: units / 🧪 Unit Tests
GitHub Check: Analyze (javascript-typescript)

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (19)

.zed/tasks.json (1)

1-45: Well-structured task configuration for building packages

This task configuration is well-defined and provides a clear build command for Trigger.dev packages. The configuration includes detailed comments explaining each option, which is helpful for maintainability.

One minor observation: You have an environment variable foo: bar defined on line 7, which appears to be a placeholder. Consider replacing it with actual environment variables needed for the build process if any are required.

packages/core/src/v3/errors.ts (1)

117-118: Consider using optional chaining for error message check

The static analysis tool suggests using optional chaining for the error message property check.
-    if (error.message && error.message.includes("ffmpeg was killed with signal SIGKILL")) {
+    if (error.message?.includes("ffmpeg was killed with signal SIGKILL")) {
🧰 Tools

🪛 Biome (1.9.4)

[error] 117-117: Change to an optional chain.

Unsafe fix: Change to an optional chain.

(lint/complexity/useOptionalChain)

packages/core/src/v3/build/runtime.ts (1)

59-72: Validate flag ordering and deduplication
Combining process.env.NODE_OPTIONS with additionalNodeOptions and then deduplicating is sensible. However, consider edge cases where flags conflict between the environment variable and manual additions. Ensure that critical flags aren’t inadvertently removed or overridden.

internal-packages/testcontainers/src/setup.ts (1)

90-96: Reasonable default retry strategy
Providing a fallback retryConfig object when retryOptions is not passed helps maintain reliable behavior. You may consider logging or surfacing these defaults for debugging when diagnosing retry failures.
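
As a rough illustration of the suggestion above, the fallback and the logging could look like the sketch below. The field names in `RetryOptionsSketch`, the default values, and the `resolveRetryConfig` helper are assumptions made for this example; they are not taken from `setup.ts`.

```typescript
// Hypothetical shape of the retry settings used by the test helper.
type RetryOptionsSketch = {
  maxAttempts: number;
  factor: number;
  minTimeoutInMs: number;
  maxTimeoutInMs: number;
  randomize: boolean;
};

// Illustrative defaults only; the values in the PR may differ.
const DEFAULT_RETRY: RetryOptionsSketch = {
  maxAttempts: 3,
  factor: 1,
  minTimeoutInMs: 10,
  maxTimeoutInMs: 100,
  randomize: true,
};

// Returns the caller's options when given, otherwise the defaults, and
// reports the fallback so it shows up when diagnosing retry behavior.
function resolveRetryConfig(
  retryOptions?: RetryOptionsSketch,
  log: (message: string, data: unknown) => void = console.debug
): RetryOptionsSketch {
  if (retryOptions !== undefined) return retryOptions;
  log("No retryOptions passed; using default retry config", DEFAULT_RETRY);
  return DEFAULT_RETRY;
}
```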

internal-packages/run-engine/src/engine/index.ts (2)

2920-2934: Concise handling of “cancel_run” case
On “cancel_run,” you finalize the run immediately. This is straightforward, but watch out for states that might slip through if the run is partially updated in a different thread.

Do you need further validation scripts to confirm that no concurrency collisions occur during cancellation?

2946-2957: Dedicated branch for permanent failure
This branch calls #permanentlyFailRun and returns. Logic is clean and cohesive, but might be streamlined if “cancel_run” and “fail_run” share repeated code.

packages/core/src/v3/machines/index.ts (2)

11-13: Validate overhead range for safer usage.
Consider adding safeguards for unexpected overhead values (e.g., negative or >1) which could lead to zero or negative results for --max-old-space-size.
 export function maxOldSpaceSizeForMachine(machine: MachinePreset, overhead: number = 0.2): number {
-  return Math.round(machine.memory * 1_024 * (1 - overhead));
+  const safeOverhead = Math.min(Math.max(overhead, 0), 0.9);
+  return Math.max(
+    1,
+    Math.round(machine.memory * 1_024 * (1 - safeOverhead))
+  );
 }
34-48: Consider additional checks for multiple or invalid overhead flags.
The function works well for typical usage. Optionally, you could add extra validation to warn if overhead might exceed expected bounds.

internal-packages/run-engine/src/engine/tests/attemptFailures.test.ts (3)

13-169: Additional partial success test scenario may help.
This test suite thoroughly covers user error retry scenarios. Consider adding a partial success scenario or a scenario with incremental error resolution for comprehensive coverage.

395-501: Consider testing different memory configurations.
This OOM test is valuable; you could optionally expand coverage to include multiple memory presets to ensure consistent behavior.

663-830: Additional scenario: partial up-scaling.
It might be insightful to test a scenario where memory is increased incrementally beyond two tiers, ensuring the logic scales repeatedly until the run either succeeds or fully crashes.

packages/core/src/v3/machines/max-old-space.test.ts (2)

31-37: Correct replacement logic.
The test confirms only one --max-old-space-size is ultimately present.

52-56: Tests custom overhead usage.
Optional improvement: test boundary overhead values (e.g., 0, 1, or invalid) to confirm resilience.

internal-packages/run-engine/src/engine/retrying.ts (6)

13-13: Ensure clarity for relative import
The statement import { ServiceValidationError } from "."; can be confusing because "." usually references the current directory's index file. Consider explicitly referencing the correct file path for improved clarity and maintainability.

15-21: Consider making attemptNumber mandatory or defaulted
attemptNumber is declared as number | null. In practice, you ensure correctness by defaulting it to 1 in the OOM logic. Alternatively, consider making this parameter always defined or handle the initialization more explicitly to avoid potential confusion.

23-40: Review naming consistency
While the outcome strings such as "cancel_run", "fail_run", and "retry" are descriptive, consider aligning them further with your domain conventions (e.g., “canceled,” “failed,” “retrying”) for clarity. This is just a naming suggestion, but consistent domain terminology can make logs and error reporting more intuitive.

53-73: Validate concurrency for OOM-based retry
When handling OOM errors, you query the database to determine if a bigger machine is available. In highly concurrent scenarios, another process might concurrently modify these values. If concurrency can occur, ensure that each task run is effectively locked or that you re-check machine availability post-update to avoid race conditions.

92-99: Use findUnique for performance and clarity
When looking for a single record by primary key (id in this case), using findUnique instead of findFirst can be more semantically appropriate and might yield a minor performance benefit.

122-176: Refine error handling strategy and logging
The retryOOMOnMachine function logs an error to the console if the database query fails or parsing fails. In production, consider a structured logging approach or a logger instead of console.error, especially if you’re relying on logs for debugging. This will help you capture context more effectively and manage error levels in a production environment.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e297c7f and a5b6f06.

📒 Files selected for processing (21)

.zed/tasks.json (1 hunks)
apps/webapp/app/v3/services/alerts/deliverAlert.server.ts (2 hunks)
apps/webapp/app/v3/services/completeAttempt.server.ts (2 hunks)
internal-packages/run-engine/src/engine/index.ts (2 hunks)
internal-packages/run-engine/src/engine/retrying.ts (1 hunks)
internal-packages/run-engine/src/engine/tests/attemptFailures.test.ts (1 hunks)
internal-packages/run-engine/src/engine/tests/batchTriggerAndWait.test.ts (1 hunks)
internal-packages/run-engine/src/engine/tests/heartbeats.test.ts (1 hunks)
internal-packages/run-engine/src/engine/tests/trigger.test.ts (0 hunks)
internal-packages/testcontainers/src/setup.ts (2 hunks)
packages/cli-v3/src/executions/taskRunProcess.ts (2 hunks)
packages/core/package.json (3 hunks)
packages/core/src/v3/build/flags.test.ts (1 hunks)
packages/core/src/v3/build/flags.ts (1 hunks)
packages/core/src/v3/build/index.ts (1 hunks)
packages/core/src/v3/build/runtime.ts (3 hunks)
packages/core/src/v3/errors.ts (5 hunks)
packages/core/src/v3/machines/index.ts (1 hunks)
packages/core/src/v3/machines/max-old-space.test.ts (1 hunks)
packages/core/src/v3/schemas/common.ts (1 hunks)
packages/core/vitest.config.ts (1 hunks)

💤 Files with no reviewable changes (1)

internal-packages/run-engine/src/engine/tests/trigger.test.ts

⏰ Context from checks skipped due to timeout of 90000ms (7)

GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
GitHub Check: units / 🧪 Unit Tests
GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
GitHub Check: typecheck / typecheck
GitHub Check: Analyze (javascript-typescript)

🔇 Additional comments (35)

packages/core/src/v3/build/index.ts (1)

5-5: Export addition looks good

This change properly exposes the functionality from flags.js through the index file, making it accessible to consumers of this module.

packages/core/vitest.config.ts (1)

5-5: Good addition to test configuration

Adding the src/v3/**/*.test.ts pattern to the Vitest configuration ensures that tests located in the source directory will be executed alongside those in the dedicated test directory. This aligns with the approach of co-locating tests with their implementation in the v3 directory.

internal-packages/run-engine/src/engine/tests/heartbeats.test.ts (1)

8-8: Fixed missing import

Good catch adding the describe import, which was being used on line 12 but wasn't previously imported.

internal-packages/run-engine/src/engine/tests/batchTriggerAndWait.test.ts (1)

8-8: Fixed missing import

Good catch adding the describe import, which was being used on line 13 but wasn't previously imported.

packages/core/src/v3/build/flags.ts (1)

1-47: Well-implemented flag deduplication utility with good documentation

The dedupFlags function is well-documented with clear examples and handles various flag formats correctly. The implementation efficiently processes the flags in reverse order to preserve the last occurrence of each flag.

The approach of using a set to track seen keys and unshifting to preserve order is clean and effective. The function handles both flags with values (--key=value) and boolean flags (--verbose) appropriately.
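
A minimal sketch of the technique described above (walk the flags in reverse, track seen keys in a set, and `unshift` the survivors so their relative order is preserved). The string-in, string-out signature and the name `dedupFlagsSketch` are assumptions; the real `dedupFlags` may normalize keys or handle more formats.

```typescript
// Keeps only the last occurrence of each flag, preserving relative order.
function dedupFlagsSketch(flags: string): string {
  const seen = new Set<string>();
  const result: string[] = [];
  const parts = flags.split(/\s+/).filter((part) => part !== "");

  // Walk from the end so the last occurrence of a key is the one that wins.
  for (const flag of [...parts].reverse()) {
    // "--key=value" is keyed by "--key"; a boolean flag is keyed by itself.
    const eq = flag.indexOf("=");
    const key = eq === -1 ? flag : flag.slice(0, eq);
    if (seen.has(key)) continue;
    seen.add(key);
    result.unshift(flag);
  }
  return result.join(" ");
}

// The later --max-old-space-size value replaces the earlier one.
console.log(
  dedupFlagsSketch("--max-old-space-size=1024 --verbose --max-old-space-size=2048")
);
```

Letting the last occurrence win means that whichever source is appended last (for example, options computed from the machine size) takes priority over earlier ones.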

packages/core/src/v3/build/flags.test.ts (1)

1-53: Comprehensive test suite for flag deduplication

The test suite is thorough and covers a wide range of scenarios including:

Simple flag handling

Multiple flags

Duplicate flags

Mixed flag types

Various edge cases

This comprehensive testing will help ensure the robustness of the flag handling, which is critical for proper memory management in the run engine.

apps/webapp/app/v3/services/alerts/deliverAlert.server.ts (2)

16-16: Updated OOM error detection import

The import has been updated to use the more specific isOOMRunError function from the core package instead of the previous isOOMError function.

379-379: Improved OOM error detection

The code now uses the more specific isOOMRunError function instead of the previous isOOMError. This provides better detection of out-of-memory conditions by checking for specific error patterns that indicate memory issues.

packages/cli-v3/src/executions/taskRunProcess.ts (3)

22-22: Added machine-specific memory management import

This new import brings in the function that handles calculating appropriate memory limits based on machine configurations.

121-123: Enhanced memory management based on machine size

Added code to extract the payload from options and calculate the appropriate Node.js memory limit based on the machine configuration specified in the execution.

130-130: Applying dynamic memory limits to Node.js process

The Node.js options now include the machine-specific memory limits calculated above, which helps prevent OOM errors by configuring the proper --max-old-space-size parameter.

apps/webapp/app/v3/services/completeAttempt.server.ts (2)

14-14: Updated OOM error detection import

The import has been updated to use the more specific isOOMRunError function from the core package instead of the previous isOOMError function.

259-259: Improved OOM error detection

The code now uses the more specific isOOMRunError function instead of the previous isOOMError. This provides better detection of out-of-memory conditions which is critical for the retry logic that may upgrade the machine size on OOM errors.

packages/core/src/v3/errors.ts (4)

79-129: New comprehensive OOM error detection function

This new isOOMRunError function provides a robust way to detect Out of Memory errors by checking for:

Specific OOM error codes

Error messages containing SIGKILL signals

SIGABRT signals with OOM-related stack traces

ffmpeg killed with SIGKILL (common in memory-intensive operations)

Manual OOM kills

The enhanced detection capabilities will improve the system's ability to identify and respond to memory issues appropriately.
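
To make the list of checks concrete, here is a simplified sketch of this style of detection. Only the ffmpeg message string is quoted from the review; the `RunErrorSketch` shape, the error code names, and the heap-exhaustion text used for the SIGABRT case are assumptions for illustration, and the manual OOM kill case is left out.

```typescript
// Hypothetical, simplified error shape; the real error union is richer.
type RunErrorSketch =
  | { type: "INTERNAL_ERROR"; code: string; message?: string }
  | { type: "BUILT_IN_ERROR"; name: string; message: string };

// Assumed code names, for illustration only.
const OOM_CODES = new Set(["TASK_PROCESS_OOM_KILLED", "TASK_PROCESS_MAYBE_OOM_KILLED"]);

function isOOMRunErrorSketch(error: RunErrorSketch): boolean {
  if (error.type === "INTERNAL_ERROR") {
    if (OOM_CODES.has(error.code)) return true;

    const message = error.message ?? "";
    // A SIGKILL of the task process is treated as a likely OOM kill.
    if (message.includes("SIGKILL")) return true;
    // SIGABRT only counts when accompanied by V8's heap-exhaustion text.
    return message.includes("SIGABRT") && message.includes("JavaScript heap out of memory");
  }

  // Built-in errors: ffmpeg being SIGKILLed (string quoted in the review).
  return error.message.includes("ffmpeg was killed with signal SIGKILL");
}
```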

617-618: Added SIGABRT signal detection

SIGABRT signals can also indicate memory issues, especially when combined with specific stack traces. Adding this check improves error detection capabilities.

644-647: Enhanced error handling for SIGABRT signals

SIGABRT signals are now properly categorized as potential OOM errors in the error enhancer, ensuring consistent error handling throughout the codebase.

754-758: Extended exception handling for SIGABRT signals

Similar to the changes above, SIGABRT signals are now properly handled in the exception event enhancer, ensuring consistent error categorization across different error handling paths.

packages/core/src/v3/build/runtime.ts (2)

4-4: New import is consistent with code usage
The newly introduced dedupFlags import is properly referenced below. No issues found.

45-49: Ensure optional parameter usage is well-tested
The function signature now optionally accepts additionalNodeOptions; confirm that all callers either supply the argument or handle its absence. Unit tests would help prevent runtime errors (e.g., passing an unexpected type).

internal-packages/testcontainers/src/setup.ts (1)

71-73: Optional retry configurations
The addition of retryOptions?: RetryOptions after machineConfig is a practical extension. Be sure to handle scenarios where machineConfig is provided but retryOptions remains undefined, and vice versa.

packages/core/package.json (2)

62-62: New export path acknowledged
Exporting ./v3/machines aligns with existing patterns. No issues spotted here.

681-690: Additional export mapping
Paths for both ESM and CommonJS under ./v3/machines appear consistent. Ensuring these entries get tested in various bundler workflows (e.g., webpack, TS path resolution) would be beneficial.

internal-packages/run-engine/src/engine/index.ts (3)

74-74: New import for retry outcome logic
Importing retryOutcomeFromCompletion clarifies how retries are centrally handled. Ensure versioning or compatibility issues don’t arise if this method evolves.

2871-2898: Centralized retry logic and minimal run retrieval
Calling retryOutcomeFromCompletion is a clear way to consolidate retry handling, and fetching a “minimalRun” object for event emission is a clean approach. Verify that partial updates or concurrency locks don’t block subsequent changes.

2971-3004: OOM handling and scheduled retries
The logic for detecting OOM errors and scheduling an immediate or delayed retry seems robust. Ensure that repeated OOMs eventually converge on a final failure to avoid infinite loops.

packages/core/src/v3/machines/index.ts (2)

23-25: Looks solid.
Implementation is straightforward and aligns with the function description.

55-57: Helper function is concise.
Implementation is correct and ensures single spacing.

internal-packages/run-engine/src/engine/tests/attemptFailures.test.ts (3)

171-280: All looks good.
The test for no remaining retries is well-structured and verifies expected status transitions.

282-394: Robust test coverage.
Covers unretriable errors thoroughly; no major concerns noted.

502-661: Thorough coverage of OOM retry logic.
The test effectively verifies scaling up to a larger machine after failing.

packages/core/src/v3/machines/max-old-space.test.ts (5)

16-19: Handles undefined input well.
Ensures defaults are applied when NODE_OPTIONS is absent.

21-24: Correct fallback for empty string.
Verifies a typical edge case successfully.

26-29: Good test for preserving flags.
Ensures existing flags are retained and new flags are appended.

39-45: Multiple flags scenario is well-covered.
Demonstrates robust handling of duplicates.

47-50: Extra spaces are normalized.
Properly validated spacing logic.
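
The behaviors these tests describe (undefined or empty input, preserved flags, a single resulting `--max-old-space-size`, normalized spacing) can be sketched as follows. The function name, its signature, and passing the limit in MB directly are assumptions for this example; the PR's `nodeOptionsWithMaxOldSpaceSize` may take different arguments.

```typescript
// Appends a single --max-old-space-size flag to an existing NODE_OPTIONS
// string, dropping any earlier occurrences and collapsing extra whitespace.
function nodeOptionsWithMaxOldSpaceSizeSketch(
  existingOptions: string | undefined,
  maxOldSpaceMB: number
): string {
  const kept = (existingOptions ?? "")
    .split(/\s+/)
    .filter((part) => part !== "" && !part.startsWith("--max-old-space-size"));

  return [...kept, `--max-old-space-size=${maxOldSpaceMB}`].join(" ");
}

// Existing flags are kept; the old limit is replaced by the new one.
console.log(
  nodeOptionsWithMaxOldSpaceSizeSketch("--inspect  --max-old-space-size=4096", 410)
);
```

Note that this sketch only recognizes the `--max-old-space-size=<n>` form; a value passed as a separate token would need extra handling.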

coderabbitai · 2025-03-06T19:32:58Z

packages/core/src/v3/schemas/common.ts

@@ -308,7 +308,7 @@ export const TaskRunExecution = z.object({
  organization: TaskRunExecutionOrganization,
  project: TaskRunExecutionProject,
  batch: TaskRunExecutionBatch.optional(),
-  machine: MachinePreset.optional(),
+  machine: MachinePreset,


💡 Verification agent

🧩 Analysis chain

Breaking schema change: machine property now required

You've changed the machine property in the TaskRunExecution schema from optional to required. This is a breaking change that aligns with the PR's focus on improving memory management and OOM retrying.

Make sure all instances of TaskRunExecution creation in the codebase include the machine property to avoid validation errors.

🏁 Script executed:

#!/bin/bash
# Check if there are any places creating TaskRunExecution objects that might be missing the machine property
rg -A 3 -B 3 "TaskRunExecution" --type typescript | grep -v "machine:"

Length of output: 103

Below is an updated shell script to re-run the verification using a filename filter instead of the unsupported --type typescript. Please run this script to check for any instances where the TaskRunExecution object might be created without the required machine property:

Breaking Change: Mandatory 'machine' Property for TaskRunExecution Schema

The machine property in the TaskRunExecution schema (see packages/core/src/v3/schemas/common.ts, line 311) is now required.

This is a breaking change meant to improve memory management and OOM retrying.

Attention: Ensure that every instance of creating a TaskRunExecution object throughout the codebase includes the machine property to avoid validation errors.

Please re-run the updated script above or manually verify the affected parts of the codebase to confirm that all TaskRunExecution creations provide the machine property.

🏁 Script executed:

#!/bin/bash
# Re-run the search for TaskRunExecution usages in TS files,
# filtering on the "*.ts" file extension.
rg -A 3 -B 3 "TaskRunExecution" -g "*.ts" | grep -v "machine:"

Length of output: 65719

Action Required: Update TaskRunExecution creations with the required machine property

The schema in packages/core/src/v3/schemas/common.ts now mandates that every TaskRunExecution object includes a valid machine property. Our re-run search across the codebase revealed many references where TaskRunExecution objects are constructed without an obvious assignment for machine. For example, in:

apps/webapp/app/v3/services/createTaskRunAttempt.server.ts – the object literal assigned to const execution: TaskRunExecution = { ... } does not appear to include a machine property.

apps/webapp/app/v3/services/completeAttempt.server.ts and apps/webapp/app/v3/services/resumeAttempt.server.ts – similar patterns are visible where TaskRunExecution is referenced without the required machine field.

Please ensure that all locations where a TaskRunExecution object is created are updated to include the machine property (with a valid MachinePreset value) to prevent validation errors and potential runtime issues.

internal-packages/run-engine/src/engine/retrying.ts

matt-aitken added 17 commits March 6, 2025 11:49

Added describe to tests that were missing it

d5cd1c7

Added a function to get the maxOldSpaceSize

0ce5e0f

Make it easy to take NODE_OPTIONS and set the old space flag

c560a18

Merge remote-tracking branch 'origin/main' into run-engine-2-retry-oom

2b92cc0

Added a zed task to rebuild the packages

8b66469

Moved isOOMRunError and added SIGABRT condition

87bd15a

Deduplication flags function with tests

e60020a

Export flags file

408a42d

On TaskRunProcess, set max old space and deduplicate the flags with priority order

0a2c090

Move retrying logic to a separate function, it was getting very messy

a06b7d1

Created new test file for attempt failures

6766720

Allow setting retry settings for tests

8f1086c

Some retrying tests, including OOM

7aa4598

More failure condition tests

c46d560

Fix for OOM retrying

c8dec74

Complete the attempt span if it was an OOM error

206b9bc

Merge remote-tracking branch 'origin/main' into run-engine-2-retry-oom

a5b6f06

# Conflicts: # packages/core/package.json

mintlify bot deployed to staging - docs March 6, 2025 19:29 View deployment

coderabbitai bot reviewed Mar 6, 2025

matt-aitken added 3 commits March 6, 2025 19:39

Remove old broken import

a4fadc3

Fixed order of exports

6f5cfac

Fix for docker-provider checkpoints import

271eaa8

ericallam approved these changes Mar 7, 2025

ericallam merged commit 4dbf11a into main Mar 7, 2025
6 of 18 checks passed

ericallam deleted the run-engine-2-retry-oom branch March 7, 2025 09:07

coderabbitai bot mentioned this pull request Apr 4, 2025

v4: eagerly fork child processes before warm start initiates #1879

Merged
