-
-
Notifications
You must be signed in to change notification settings - Fork 705
Run Engine 2: More robust attempt failing/retrying (inc. OOM retrying) #1773
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
# Conflicts: # packages/core/package.json
|
WalkthroughThis PR introduces a new task configuration for building packages and adjusts several error handling paths across the codebase. The error identification logic has been updated by replacing the general OOM check with a more specific one, and old functions have been removed. In addition, the run engine’s retry logic has been refactored into a centralized function, and new tests have been added or modified across multiple modules. Changes also include updates to task run initialization for memory management and expanded export mappings and schema validations in the core package. Changes
Sequence Diagram(s)sequenceDiagram
participant Engine as RunEngine
participant RetryFn as retryOutcomeFromCompletion
participant DB as Prisma DB
participant Worker as Background Worker
Engine->>RetryFn: Pass runId, error, retry settings
RetryFn->>DB: Query run settings & machine preset
DB-->>RetryFn: Return settings
alt Retry conditions met
RetryFn-->>Engine: Return "retry" outcome
Engine->>Worker: Schedule retry with updated machine config
else No retry available
RetryFn-->>Engine: Return "fail" outcome
Engine->>Engine: Mark run as failed
end
Suggested reviewers
Poem
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)
⏰ Context from checks skipped due to timeout of 90000ms (7)
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (19)
.zed/tasks.json (1)
1-45
: Well-structured task configuration for building packagesThis task configuration is well-defined and provides a clear build command for Trigger.dev packages. The configuration includes detailed comments explaining each option, which is helpful for maintainability.
One minor observation: You have an environment variable
foo: bar
defined on line 7, which appears to be a placeholder. Consider replacing it with actual environment variables needed for the build process if any are required.packages/core/src/v3/errors.ts (1)
117-118
: Consider using optional chaining for error message checkThe static analysis tool suggests using optional chaining for the error message property check.
- if (error.message && error.message.includes("ffmpeg was killed with signal SIGKILL")) { + if (error.message?.includes("ffmpeg was killed with signal SIGKILL")) {🧰 Tools
🪛 Biome (1.9.4)
[error] 117-117: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
packages/core/src/v3/build/runtime.ts (1)
59-72
: Validate flag ordering and deduplication
Combiningprocess.env.NODE_OPTIONS
withadditionalNodeOptions
and then deduplicating is sensible. However, consider edge cases where flags conflict between the environment variable and manual additions. Ensure that critical flags aren’t inadvertently removed or overridden.internal-packages/testcontainers/src/setup.ts (1)
90-96
: Reasonable default retry strategy
Providing a fallbackretryConfig
object whenretryOptions
is not passed helps maintain reliable behavior. You may consider logging or surfacing these defaults for debugging when diagnosing retry failures.internal-packages/run-engine/src/engine/index.ts (2)
2920-2934
: Concise handling of “cancel_run” case
On “cancel_run,” you finalize the run immediately. This is straightforward, but watch out for states that might slip through if the run is partially updated in a different thread.Do you need further validation scripts to confirm that no concurrency collisions occur during cancellation?
2946-2957
: Dedicated branch for permanent failure
This branch calls#permanentlyFailRun
and returns. Logic is clean and cohesive, but might be streamlined if “cancel_run” and “fail_run” share repeated code.packages/core/src/v3/machines/index.ts (2)
11-13
: Validate overhead range for safer usage.
Consider adding safeguards for unexpected overhead values (e.g., negative or >1) which could lead to zero or negative results for--max-old-space-size
.export function maxOldSpaceSizeForMachine(machine: MachinePreset, overhead: number = 0.2): number { - return Math.round(machine.memory * 1_024 * (1 - overhead)); + const safeOverhead = Math.min(Math.max(overhead, 0), 0.9); + return Math.max( + 1, + Math.round(machine.memory * 1_024 * (1 - safeOverhead)) + ); }
34-48
: Consider additional checks for multiple or invalid overhead flags.
The function works well for typical usage. Optionally, you could add extra validation to warn if overhead might exceed expected bounds.internal-packages/run-engine/src/engine/tests/attemptFailures.test.ts (3)
13-169
: Additional partial success test scenario may help.
This test suite thoroughly covers user error retry scenarios. Consider adding a partial success scenario or a scenario with incremental error resolution for comprehensive coverage.
395-501
: Consider testing different memory configurations.
This OOM test is valuable; you could optionally expand coverage to include multiple memory presets to ensure consistent behavior.
663-830
: Additional scenario: partial up-scaling.
It might be insightful to test a scenario where memory is increased incrementally beyond two tiers, ensuring the logic scales repeatedly until the run either succeeds or fully crashes.packages/core/src/v3/machines/max-old-space.test.ts (2)
31-37
: Correct replacement logic.
The test confirms only one--max-old-space-size
is ultimately present.
52-56
: Tests custom overhead usage.
Optional improvement: test boundary overhead values (e.g., 0, 1, or invalid) to confirm resilience.internal-packages/run-engine/src/engine/retrying.ts (6)
13-13
: Ensure clarity for relative import
The statementimport { ServiceValidationError } from ".";
can be confusing because"."
usually references the current directory'sindex
file. Consider explicitly referencing the correct file path for improved clarity and maintainability.
15-21
: Consider makingattemptNumber
mandatory or defaulted
attemptNumber
is declared asnumber | null
. In practice, you ensure correctness by defaulting it to 1 in the OOM logic. Alternatively, consider making this parameter always defined or handle the initialization more explicitly to avoid potential confusion.
23-40
: Review naming consistency
While the outcome strings such as"cancel_run"
,"fail_run"
, and"retry"
are descriptive, consider aligning them further with your domain conventions (e.g., “canceled,” “failed,” “retrying”) for clarity. This is just a naming suggestion, but consistent domain terminology can make logs and error reporting more intuitive.
53-73
: Validate concurrency for OOM-based retry
When handling OOM errors, you query the database to determine if a bigger machine is available. In highly concurrent scenarios, another process might concurrently modify these values. If concurrency can occur, ensure that each task run is effectively locked or that you re-check machine availability post-update to avoid race conditions.
92-99
: UsefindUnique
for performance and clarity
When looking for a single record by primary key (id
in this case), usingfindUnique
instead offindFirst
can be more semantically appropriate and might yield a minor performance benefit.
122-176
: Refine error handling strategy and logging
TheretryOOMOnMachine
function logs an error to the console if the database query fails or parsing fails. In production, consider a structured logging approach or a logger instead ofconsole.error
, especially if you’re relying on logs for debugging. This will help you capture context more effectively and manage error levels in a production environment.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (21)
.zed/tasks.json
(1 hunks)apps/webapp/app/v3/services/alerts/deliverAlert.server.ts
(2 hunks)apps/webapp/app/v3/services/completeAttempt.server.ts
(2 hunks)internal-packages/run-engine/src/engine/index.ts
(2 hunks)internal-packages/run-engine/src/engine/retrying.ts
(1 hunks)internal-packages/run-engine/src/engine/tests/attemptFailures.test.ts
(1 hunks)internal-packages/run-engine/src/engine/tests/batchTriggerAndWait.test.ts
(1 hunks)internal-packages/run-engine/src/engine/tests/heartbeats.test.ts
(1 hunks)internal-packages/run-engine/src/engine/tests/trigger.test.ts
(0 hunks)internal-packages/testcontainers/src/setup.ts
(2 hunks)packages/cli-v3/src/executions/taskRunProcess.ts
(2 hunks)packages/core/package.json
(3 hunks)packages/core/src/v3/build/flags.test.ts
(1 hunks)packages/core/src/v3/build/flags.ts
(1 hunks)packages/core/src/v3/build/index.ts
(1 hunks)packages/core/src/v3/build/runtime.ts
(3 hunks)packages/core/src/v3/errors.ts
(5 hunks)packages/core/src/v3/machines/index.ts
(1 hunks)packages/core/src/v3/machines/max-old-space.test.ts
(1 hunks)packages/core/src/v3/schemas/common.ts
(1 hunks)packages/core/vitest.config.ts
(1 hunks)
💤 Files with no reviewable changes (1)
- internal-packages/run-engine/src/engine/tests/trigger.test.ts
🧰 Additional context used
🪛 Biome (1.9.4)
packages/core/src/v3/errors.ts
[error] 117-117: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
⏰ Context from checks skipped due to timeout of 90000ms (7)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
- GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
- GitHub Check: units / 🧪 Unit Tests
- GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
- GitHub Check: typecheck / typecheck
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (35)
packages/core/src/v3/build/index.ts (1)
5-5
: Export addition looks goodThis change properly exposes the functionality from flags.js through the index file, making it accessible to consumers of this module.
packages/core/vitest.config.ts (1)
5-5
: Good addition to test configurationAdding the
src/v3/**/*.test.ts
pattern to the Vitest configuration ensures that tests located in the source directory will be executed alongside those in the dedicated test directory. This aligns with the approach of co-locating tests with their implementation in the v3 directory.internal-packages/run-engine/src/engine/tests/heartbeats.test.ts (1)
8-8
: Fixed missing importGood catch adding the
describe
import, which was being used on line 12 but wasn't previously imported.internal-packages/run-engine/src/engine/tests/batchTriggerAndWait.test.ts (1)
8-8
: Fixed missing importGood catch adding the
describe
import, which was being used on line 13 but wasn't previously imported.packages/core/src/v3/build/flags.ts (1)
1-47
: Well-implemented flag deduplication utility with good documentationThe
dedupFlags
function is well-documented with clear examples and handles various flag formats correctly. The implementation efficiently processes the flags in reverse order to preserve the last occurrence of each flag.The approach of using a set to track seen keys and unshifting to preserve order is clean and effective. The function handles both flags with values (
--key=value
) and boolean flags (--verbose
) appropriately.packages/core/src/v3/build/flags.test.ts (1)
1-53
: Comprehensive test suite for flag deduplicationThe test suite is thorough and covers a wide range of scenarios including:
- Simple flag handling
- Multiple flags
- Duplicate flags
- Mixed flag types
- Various edge cases
This comprehensive testing will help ensure the robustness of the flag handling, which is critical for proper memory management in the run engine.
apps/webapp/app/v3/services/alerts/deliverAlert.server.ts (2)
16-16
: Updated OOM error detection importThe import has been updated to use the more specific
isOOMRunError
function from the core package instead of the previousisOOMError
function.
379-379
: Improved OOM error detectionThe code now uses the more specific
isOOMRunError
function instead of the previousisOOMError
. This provides better detection of out-of-memory conditions by checking for specific error patterns that indicate memory issues.packages/cli-v3/src/executions/taskRunProcess.ts (3)
22-22
: Added machine-specific memory management importThis new import brings in the function that handles calculating appropriate memory limits based on machine configurations.
121-123
: Enhanced memory management based on machine sizeAdded code to extract the payload from options and calculate the appropriate Node.js memory limit based on the machine configuration specified in the execution.
130-130
: Applying dynamic memory limits to Node.js processThe Node.js options now include the machine-specific memory limits calculated above, which helps prevent OOM errors by configuring the proper
--max-old-space-size
parameter.apps/webapp/app/v3/services/completeAttempt.server.ts (2)
14-14
: Updated OOM error detection importThe import has been updated to use the more specific
isOOMRunError
function from the core package instead of the previousisOOMError
function.
259-259
: Improved OOM error detectionThe code now uses the more specific
isOOMRunError
function instead of the previousisOOMError
. This provides better detection of out-of-memory conditions which is critical for the retry logic that may upgrade the machine size on OOM errors.packages/core/src/v3/errors.ts (4)
79-129
: New comprehensive OOM error detection functionThis new
isOOMRunError
function provides a robust way to detect Out of Memory errors by checking for:
- Specific OOM error codes
- Error messages containing SIGKILL signals
- SIGABRT signals with OOM-related stack traces
- ffmpeg killed with SIGKILL (common in memory-intensive operations)
- Manual OOM kills
The enhanced detection capabilities will improve the system's ability to identify and respond to memory issues appropriately.
🧰 Tools
🪛 Biome (1.9.4)
[error] 117-117: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
617-618
: Added SIGABRT signal detectionSIGABRT signals can also indicate memory issues, especially when combined with specific stack traces. Adding this check improves error detection capabilities.
644-647
: Enhanced error handling for SIGABRT signalsSIGABRT signals are now properly categorized as potential OOM errors in the error enhancer, ensuring consistent error handling throughout the codebase.
754-758
: Extended exception handling for SIGABRT signalsSimilar to the changes above, SIGABRT signals are now properly handled in the exception event enhancer, ensuring consistent error categorization across different error handling paths.
packages/core/src/v3/build/runtime.ts (2)
4-4
: New import is consistent with code usage
The newly introduceddedupFlags
import is properly referenced below. No issues found.
45-49
: Ensure optional parameter usage is well-tested
The function signature now optionally acceptsadditionalNodeOptions
; confirm that all callers either supply the argument or handle its absence. Unit tests would help prevent runtime errors (e.g., passing an unexpected type).internal-packages/testcontainers/src/setup.ts (1)
71-73
: Optional retry configurations
The addition ofretryOptions?: RetryOptions
aftermachineConfig
is a practical extension. Be sure to handle scenarios wheremachineConfig
is provided butretryOptions
remains undefined, and vice versa.packages/core/package.json (2)
62-62
: New export path acknowledged
Exporting./v3/machines
aligns with existing patterns. No issues spotted here.
681-690
: Additional export mapping
Paths for both ESM and CommonJS under./v3/machines
appear consistent. Ensuring these entries get tested in various bundler workflows (e.g., webpack, TS path resolution) would be beneficial.internal-packages/run-engine/src/engine/index.ts (3)
74-74
: New import for retry outcome logic
ImportingretryOutcomeFromCompletion
clarifies how retries are centrally handled. Ensure versioning or compatibility issues don’t arise if this method evolves.
2871-2898
: Centralized retry logic and minimal run retrieval
CallingretryOutcomeFromCompletion
is a clear way to consolidate retry handling, and fetching a “minimalRun” object for event emission is a clean approach. Verify that partial updates or concurrency locks don’t block subsequent changes.
2971-3004
: OOM handling and scheduled retries
The logic for detecting OOM errors and scheduling an immediate or delayed retry seems robust. Ensure that repeated OOMs eventually converge on a final failure to avoid infinite loops.packages/core/src/v3/machines/index.ts (2)
23-25
: Looks solid.
Implementation is straightforward and aligns with the function description.
55-57
: Helper function is concise.
Implementation is correct and ensures single spacing.internal-packages/run-engine/src/engine/tests/attemptFailures.test.ts (3)
171-280
: All looks good.
The test for no remaining retries is well-structured and verifies expected status transitions.
282-394
: Robust test coverage.
Covers unretriable errors thoroughly; no major concerns noted.
502-661
: Thorough coverage of OOM retry logic.
The test effectively verifies scaling up to a larger machine after failing.packages/core/src/v3/machines/max-old-space.test.ts (5)
16-19
: Handles undefined input well.
Ensures defaults are applied whenNODE_OPTIONS
is absent.
21-24
: Correct fallback for empty string.
Verifies a typical edge case successfully.
26-29
: Good test for preserving flags.
Ensures existing flags are retained and new flags are appended.
39-45
: Multiple flags scenario is well-covered.
Demonstrates robust handling of duplicates.
47-50
: Extra spaces are normalized.
Properly validated spacing logic.
@@ -308,7 +308,7 @@ export const TaskRunExecution = z.object({ | |||
organization: TaskRunExecutionOrganization, | |||
project: TaskRunExecutionProject, | |||
batch: TaskRunExecutionBatch.optional(), | |||
machine: MachinePreset.optional(), | |||
machine: MachinePreset, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Verification agent
🧩 Analysis chain
Breaking schema change: machine
property now required
You've changed the machine
property in the TaskRunExecution
schema from optional to required. This is a breaking change that aligns with the PR's focus on improving memory management and OOM retrying.
Make sure all instances of TaskRunExecution
creation in the codebase include the machine
property to avoid validation errors.
🏁 Script executed:
#!/bin/bash
# Check if there are any places creating TaskRunExecution objects that might be missing the machine property
rg -A 3 -B 3 "TaskRunExecution" --type typescript | grep -v "machine:"
Length of output: 103
Below is an updated shell script to re-run the verification using a filename filter instead of the unsupported --type typescript
. Please run this script to check for any instances where the TaskRunExecution
object might be created without the required machine
property:
Breaking Change: Mandatory 'machine' Property for TaskRunExecution Schema
- The
machine
property in theTaskRunExecution
schema (seepackages/core/src/v3/schemas/common.ts
, line 311) is now required. - This is a breaking change meant to improve memory management and OOM retrying.
- Attention: Ensure that every instance of creating a
TaskRunExecution
object throughout the codebase includes themachine
property to avoid validation errors.
Please re-run the updated script above or manually verify the affected parts of the codebase to confirm that all TaskRunExecution
creations provide the machine
property.
🏁 Script executed:
#!/bin/bash
# Re-run the search for TaskRunExecution usages in TS files,
# filtering on the "*.ts" file extension.
rg -A 3 -B 3 "TaskRunExecution" -g "*.ts" | grep -v "machine:"
Length of output: 65719
Action Required: Update TaskRunExecution creations with the required machine
property
The schema in packages/core/src/v3/schemas/common.ts now mandates that every TaskRunExecution object includes a valid machine
property. Our re-run search across the codebase revealed many references where TaskRunExecution objects are constructed without an obvious assignment for machine
. For example, in:
- apps/webapp/app/v3/services/createTaskRunAttempt.server.ts – the object literal assigned to
const execution: TaskRunExecution = { ... }
does not appear to include amachine
property. - apps/webapp/app/v3/services/completeAttempt.server.ts and apps/webapp/app/v3/services/resumeAttempt.server.ts – similar patterns are visible where TaskRunExecution is referenced without the required
machine
field.
Please ensure that all locations where a TaskRunExecution object is created are updated to include the machine
property (with a valid MachinePreset value) to prevent validation errors and potential runtime issues.
--max-old-space-size
on TaskRunProcess based on the machine size. So this now works for dev runs as well as prod runs.Summary by CodeRabbit
New Features
Bug Fixes
Refactor