fix(s3-deployment): optimize memory usage for large files #34020

Draft · wants to merge 12 commits into main
Conversation

scorbiere (Contributor)

Issue # (if applicable)

Closes #34002.

Reason for this change

The original fix for issue #22661 (PR #33698) introduced a regression where the S3 deployment Lambda would read entire files into memory to check if they're JSON. This approach works fine for small files but causes Lambda timeouts and memory issues with large files (10MB+). This is particularly problematic for customers deploying large assets to S3 buckets.

Description of changes

The S3 deployment Lambda handler was reading entire files into memory to check if they're JSON, causing timeouts and memory issues with large files (10MB+).

This change improves memory efficiency by:

  1. Only reading the entire file when at least one marker's value contains double quotes
  2. Adding an early return when there are no markers to replace
  3. Only performing the special JSON parsing and replacement when both conditions are met (see the sketch after this list):
     a) At least one marker's value contains double quotes
     b) The file content is valid JSON (determined after reading the file)
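
For illustration, the decision flow might look like the following minimal sketch; function and variable names are assumptions for illustration, not the exact handler code:

```python
import json
import os

def replace_markers(filename, markers):
    # early return: nothing to replace
    if not markers:
        return

    # the JSON-aware path is only needed when a marker value contains '"'
    needs_json_handling = any('"' in value for value in markers.values())

    outfile = filename + '.new'
    if not needs_json_handling:
        # stream line by line; memory stays flat regardless of file size
        with open(filename, 'rb') as fi, open(outfile, 'wb') as fo:
            for line in fi:
                for token, value in markers.items():
                    line = line.replace(token.encode(), value.encode())
                fo.write(line)
    else:
        # only now read the whole file, and only special-case it if it parses as JSON
        with open(filename, 'rb') as fi:
            content = fi.read()
        try:
            json.loads(content)
            is_json = True
        except ValueError:
            is_json = False
        for token, value in markers.items():
            if is_json:
                # json.dumps yields a safely escaped JSON string literal;
                # strip its surrounding quotes to get the escaped value
                value = json.dumps(value)[1:-1]
            content = content.replace(token.encode(), value.encode())
        with open(outfile, 'wb') as fo:
            fo.write(content)

    os.replace(outfile, filename)
```

The streaming branch holds at most one line in memory at a time, which matches the flat memory profile reported below for files whose markers contain no double quotes.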

Performance testing shows:

  • Standard operations stay under 32MB memory usage
  • Even complex JSON files with double quotes in the markers stay under 256MB

Performance metrics show significant improvements:

  • Files of any size, both JSON and text, are processed efficiently with minimal memory overhead, as long as no marker's value contains double quotes
  • Complex JSON files with double quotes in markers still require memory but stay within Lambda limits

Additionally, this change:

  • Adds documentation to the README about memory requirements for JSON files with double quotes in markers, helping users understand when they might need to increase the Lambda memory limit (see the sketch below)
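
As a sketch of what that README guidance amounts to in practice, the existing memoryLimit property on BucketDeployment (shown here via the Python binding as memory_limit) can be raised for deployments whose markers contain double quotes. Construct and bucket names are placeholders:

```python
from aws_cdk import aws_s3_deployment as s3deploy

# Hypothetical stack snippet: 'self' and 'destination_bucket' are assumed
# to exist in the surrounding stack. Marker values containing double quotes
# take the JSON-aware path, which may need more memory for large files.
s3deploy.BucketDeployment(
    self, "DeployLargeJsonAssets",
    sources=[s3deploy.Source.asset("./assets")],
    destination_bucket=destination_bucket,
    memory_limit=512,  # MiB; the BucketDeployment default is 128
)
```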

Describe any new or updated permissions being added

No new or updated IAM permissions are required for this change.

Description of how you validated changes

  • Created a local stack test that reproduces the issue with large files
  • Implemented local testing to verify the fix with both small and large files
  • Added memory limit assertions to ensure memory usage stays within acceptable bounds (illustrated in the sketch below)
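
For illustration, a minimal in-process version of such an assertion could use tracemalloc; the actual tests run inside Docker with their own instrumentation, and replace_markers plus the 32MB bound are taken from this PR's description:

```python
import tracemalloc

def assert_peak_memory(path, markers, limit_bytes=32 * 1024 * 1024):
    # measure Python-level peak allocations while markers are replaced
    tracemalloc.start()
    try:
        replace_markers(path, markers)  # handler function under test
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    assert peak < limit_bytes, f"peak {peak} bytes exceeded limit {limit_bytes}"
```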

Performance testing was conducted with isolated test runs to ensure accurate measurements:

| Test Case | File Size | Execution Time | Memory Usage |
| --- | --- | --- | --- |
| Small JSON file | 1.01 KB | 0.0003 seconds | 0 KB (no increase) |
| Large JSON file | 10,842.58 KB (~10.6 MB) | 0.0397 seconds | 0 KB (no increase) |
| Complex JSON file | 28,468.06 KB (~28 MB) | 0.2210 seconds | 0 KB (no increase) |
| Complex JSON with double quotes in markers | 28,468.06 KB (~28 MB) | 1.1423 seconds | 208,512 KB (~204 MB) |
| Small text file | 0.96 KB | 0.0003 seconds | 0 KB (no increase) |
| Large text file | 10,919.16 KB (~10.7 MB) | 0.0389 seconds | 0 KB (no increase) |

All tests passed with memory limits of:

  • 32MB for standard operations
  • 256MB for complex JSON with double quotes in markers

The results confirm that our optimization only uses significant memory when processing JSON files with double quote markers, which is the specific case that requires special handling.

Checklist


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

The S3 deployment Lambda handler was reading entire files into memory to check
if they're JSON, causing timeouts and memory issues with large files (10MB+).

This change improves memory efficiency by:
1. Using file extension to determine if a file is JSON instead of reading its content
2. Only loading the entire file for JSON processing when necessary
3. Maintaining special handling for JSON files with double quotes in markers

Performance testing shows:
- Standard operations stay under 32MB memory usage
- Even complex JSON files with double quotes stay under 256MB
- Processing time is comparable to the previous implementation
aws-cdk-automation requested a review from a team on April 2, 2025 17:26
github-actions bot added the labels bug (This issue is a bug.) and p0 on Apr 2, 2025
mergify bot added the label contribution/core (This is a PR that came from AWS.) on Apr 2, 2025
@aws-cdk-automation (Collaborator) left a comment:
(This review is outdated)


codecov bot commented Apr 2, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.98%. Comparing base (74cbe27) to head (c34c6b9).
Report is 27 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #34020   +/-   ##
=======================================
  Coverage   83.98%   83.98%           
=======================================
  Files         120      120           
  Lines        6976     6976           
  Branches     1178     1178           
=======================================
  Hits         5859     5859           
  Misses       1005     1005           
  Partials      112      112           
| Flag | Coverage Δ |
| --- | --- |
| suite.unit | 83.98% <ø> (ø) |

Flags with carried forward coverage won't be shown.

| Components | Coverage Δ |
| --- | --- |
| packages/aws-cdk | ∅ <ø> (∅) |
| packages/aws-cdk-lib/core | 83.98% <ø> (ø) |

@scorbiere (Contributor, Author)

Exemption Request: The issue involves memory usage with large files (10MB+), and adding an integ test would require adding such files to the repository. Instead, I've implemented comprehensive local testing with Docker-based, isolated test runs that measure memory usage across various file types and sizes. This approach provides more targeted performance metrics than a standard integration test could, as it includes specific memory instrumentation that captures the exact behavior being fixed. The detailed performance metrics included in this PR demonstrate the effectiveness of the solution without needing to commit large test files to the repository.

@aws-cdk-automation aws-cdk-automation added the pr-linter/exemption-requested The contributor has requested an exemption to the PR Linter feedback. label Apr 2, 2025
# build the test image (either plain or, for a clean run, without cache)
$DOCKER_CMD build .
$DOCKER_CMD build --no-cache -t s3-deployment-test .

# run the containerized tests
$DOCKER_CMD run --rm s3-deployment-test
A Contributor commented:

How and where/when does this test get run?

@scorbiere (Contributor, Author) replied:

They are run manually by the developer when testing changes to the custom resource's code.

@rehos commented on Apr 3, 2025:

I also ran into the memory issue, and we don't use any markers. I had almost opened a PR of my own when I saw this one.

I'd like to suggest moving the check for empty markers from replace_markers to extract_and_replace_markers:

import os
from zipfile import ZipFile

# extract archive and replace markers in output files
def extract_and_replace_markers(archive, contents_dir, markers):
    with ZipFile(archive, "r") as zip:
        zip.extractall(contents_dir)

        # replace markers for this source, but only if there are any
        if markers:
            for file in zip.namelist():
                file_path = os.path.join(contents_dir, file)
                if os.path.isdir(file_path):
                    continue
                replace_markers(file_path, markers)

…memory usage

This change addresses issue aws#34002 where the S3 deployment Lambda function experiences memory issues with large JSON files. The fix:

- Adds a new `jsonAwareSourceProcessing` boolean property to BucketDeploymentProps
- Implements efficient marker detection using line-by-line processing
- Optimizes memory usage by only loading entire files when necessary
- Updates tests to verify both processing modes
- Updates documentation with usage examples and memory considerations

By default, the property is set to false for backward compatibility.
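
Per this commit message, the new flag lives on BucketDeploymentProps and defaults to false. A minimal sketch of opting in (Python binding shown; the snake_case spelling and the surrounding stack code are assumptions, and since later commits in this PR move the toggle to the source level, the final API may differ):

```python
from aws_cdk import aws_s3_deployment as s3deploy

s3deploy.BucketDeployment(
    self, "DeployWithJsonAwareProcessing",
    sources=[s3deploy.Source.asset("./assets")],
    destination_bucket=bucket,  # assumed to be defined elsewhere in the stack
    json_aware_source_processing=True,  # proposed flag; defaults to False
)
```
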
@aws-cdk-automation aws-cdk-automation dismissed their stale review April 7, 2025 19:31

✅ Updated pull request passes all PRLinter validations. Dismissing previous PRLinter review.

@scorbiere (Contributor, Author)

@rehos

I would like to suggest to move the check for no markers from replace_markers to extract_and_replace_markers

That's a valid suggestion. However, I want to be able to test the changes without having to create a zip file, so I will keep the check at the beginning of replace_markers.

chunk += ` "name": "Item ${lineNum}",\n`;
chunk += ` "hash": "${crypto.createHash('sha256').update(lineNum.toString()).digest('hex')}",\n`;
chunk += ' "properties": {\n';
chunk += ` "description": "${lineContent.replace(/"/g, '\\"')}",\n`;

Check failure (Code scanning / CodeQL): Incomplete string escaping or encoding (High, test)

This does not escape backslash characters in the input.
chunk += ` }${bytesWritten + chunk.length + 6 < totalBytes ? ',\n' : '\n'}`;
} else {
// Simple items for the rest
chunk += ` { "id": "${lineNum}", "value": "${lineContent.replace(/"/g, '\\"')}" }${bytesWritten + chunk.length + 6 < totalBytes ? ',\n' : '\n'}`;

Check failure (Code scanning / CodeQL): Incomplete string escaping or encoding (High, test)

This does not escape backslash characters in the input.
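
CodeQL's point is that escaping quotes alone is incomplete: a backslash already present in the input survives unescaped and can corrupt the generated JSON. A small illustration of the distinction, shown in Python here even though the flagged generator code is JavaScript:

```python
import json

raw = 'ends with a backslash \\ and a "quote"'

# incomplete: only quotes are escaped; the raw backslash leaks through
bad = raw.replace('"', '\\"')

# complete: escape backslashes first, then quotes...
good = raw.replace('\\', '\\\\').replace('"', '\\"')

# ...or let the JSON encoder handle every case (it adds the outer quotes)
assert json.dumps(raw) == '"' + good + '"'
```
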
scorbiere and others added 7 commits April 10, 2025 09:24
We are now using a safe JSON value for marker replacement and avoid loading the complete JSON file into memory.

This change also adds a way to trigger the new processing when creating the source data. This limits the processing to the asset level instead of the deployment level.
@aws-cdk-automation (Collaborator)

AWS CodeBuild CI Report

  • CodeBuild project: AutoBuildv2Project1C6BFA3F-wQm2hXv2jqQv
  • Commit ID: c34c6b9
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

Labels: bug (This issue is a bug.) · contribution/core (This is a PR that came from AWS.) · p0 · pr-linter/exemption-requested (The contributor has requested an exemption to the PR Linter feedback.)

Successfully merging this pull request may close these issues:

  • (s3-deployment): times out with large files

5 participants