S3TransferManager got stuck while downloading multiple directories #3850


Closed
caroline202210 opened this issue Mar 21, 2023 · 23 comments
Labels
bug This issue is a bug. closed-for-staleness p1 This is a high priority issue response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days.

Comments

@caroline202210

Describe the bug

Hi team,

Our service uses S3TransferManager to download multiple directories from an S3 bucket. Progress stalls, with threads stuck on the following stack trace:

"sdk-async-response-2-7" #345 daemon prio=5 os_prio=0 cpu=1504829.76ms elapsed=1613.05s tid=0x00007ff554188800 nid=0x2851 runnable  [0x00007ff546d76000]
   java.lang.Thread.State: RUNNABLE
        at software.amazon.awssdk.transfer.s3.internal.AsyncBufferingSubscriber.flushBufferIfNeeded(AsyncBufferingSubscriber.java:104)
        at software.amazon.awssdk.transfer.s3.internal.AsyncBufferingSubscriber.onComplete(AsyncBufferingSubscriber.java:150)
        at software.amazon.awssdk.utils.async.DelegatingSubscriber.onComplete(DelegatingSubscriber.java:47)
        at software.amazon.awssdk.core.pagination.async.PaginationSubscription.completeSubscription(PaginationSubscription.java:81)
        at software.amazon.awssdk.core.internal.pagination.async.ItemsSubscription.handleRequests(ItemsSubscription.java:51)
        at software.amazon.awssdk.core.internal.pagination.async.ItemsSubscription.sendNextElement(ItemsSubscription.java:104)
...

All 8 CPU cores are running the code above, so the CPU usage is about 800%.

The threads above might be preventing other threads from running. For example, the following RUNNABLE thread barely got a chance to run, since its CPU time is only 0.25 ms:

"udp-epoll-6" #48 daemon prio=5 os_prio=0 cpu=0.25ms elapsed=2117.01s tid=0x00007ff68e112000 nid=0xfbe runnable  [0x00007ff5d598e000]
   java.lang.Thread.State: RUNNABLE
        at io.micrometer.shaded.io.netty.channel.epoll.Native.epollWait(Native Method)
        at io.micrometer.shaded.io.netty.channel.epoll.Native.epollWait(Native.java:209)
        at io.micrometer.shaded.io.netty.channel.epoll.Native.epollWait(Native.java:202)
        at io.micrometer.shaded.io.netty.channel.epoll.EpollEventLoop.epollWaitNoTimerChange(EpollEventLoop.java:306)
        at io.micrometer.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:363)
        at io.micrometer.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
        at io.micrometer.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.micrometer.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run([email protected]/Thread.java:834)

Expected Behavior

Downloads of multiple directories (>50) complete successfully with reasonable CPU usage.

Current Behavior

See above.

Reproduction Steps

This is not a transient issue; we can reproduce it with the following steps:

  1. Have multiple directories (>50) under a bucket.
  2. Within each directory, there are multiple S3 objects (10-30).
  3. Some S3 objects are large; the biggest file is more than 10 GB (which triggered the same issue described in issue 3394 when we were using a PREVIEW version).
  4. Create an S3TransferManager:
S3TransferManager transferManager = S3TransferManager.create();
  5. For each directory, download using the following code:
    @VisibleForTesting
    boolean downloadS3Directory(String farmName, String key, String destination, String globalState) {
        DirectoryDownload directoryDownload = s3TransferManager.downloadDirectory(
                DownloadDirectoryRequest.builder()
                        .destination(Paths.get(destination))
                        .bucket(s3Properties.getMapsBucket())
                        .listObjectsV2RequestTransformer(request -> request.prefix(key))
                        .downloadFileRequestTransformer(request -> request.addTransferListener(LoggingTransferListener.create()))
                        .build());
        return waitForDirectoryDownloadComplete(directoryDownload, farmName, globalState);
    }

    @VisibleForTesting
    boolean waitForDirectoryDownloadComplete(DirectoryDownload directoryDownload, String farmName, String globalState) {
        CompletedDirectoryDownload completedDirectoryDownload = directoryDownload.completionFuture().join();
        int failedTransferSize = completedDirectoryDownload.failedTransfers().size();
        if (failedTransferSize > 0) {
            log.error("Farm {} BfsStateValue {} failed transfer size {}", farmName, globalState,
                    failedTransferSize);
            completedDirectoryDownload.failedTransfers().forEach(fail ->
                    log.error("Farm {} BfsStateValue {} Object {} failed to transfer",
                            farmName, globalState, fail, fail.exception()));
            return false;
        }
        return true;
    }
  6. Run the code on an EC2 instance.

Possible Solution

Looking at AsyncBufferingSubscriber, it appears that if numRequestsInFlight is greater than 0 when an ON_COMPLETE event is at the head of the queue, the following code spins in an infinite while loop: handleCompleteEvent() returns without polling the event, so peek() keeps returning the same ON_COMPLETE. If all cores are busy-spinning in this loop, other threads are likely starved, so numRequestsInFlight is never decremented.

    private void flushBufferIfNeeded() {
        if (isDelivering.compareAndSet(false, true)) {
            try {
                Optional<StoringSubscriber.Event<T>> next = storingSubscriber.peek();
                while (numRequestsInFlight.get() < maxConcurrentExecutions) {
                    if (!next.isPresent()) {
                        subscription.request(1);
                        break;
                    }

                    switch (next.get().type()) {
                        case ON_COMPLETE:
                            handleCompleteEvent();
                            break;
                        case ON_ERROR:
                            handleError(next.get().runtimeError());
                            break;
                        case ON_NEXT:
                            handleOnNext(next.get().value());
                            break;
                        default:
                            handleError(new IllegalStateException("Unknown stored type: " + next.get().type()));
                            break;
                    }

                    next = storingSubscriber.peek();
                }
            } finally {
                isDelivering.set(false);
            }
        }
    }

    private void handleCompleteEvent() {
        if (numRequestsInFlight.get() == 0) {
            returnFuture.complete(null);
            storingSubscriber.poll();
        }
    }

Additional Information/Context

Version:

  implementation 'software.amazon.awssdk:s3-transfer-manager:2.20.26'
  implementation group: 'software.amazon.awssdk', name: 'aws-crt-client', version: '2.20.26'

EC2 instance type: c5.2xlarge

AWS Java SDK version used

NA

JDK version used

openjdk version "11.0.10" 2021-01-19
OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.10+9)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 11.0.10+9, mixed mode)

Operating System and version

Linux ip-10-16-81-190 4.14.305-155.531.amzn1.x86_64 #1 SMP Tue Feb 14 10:36:14 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

@caroline202210 caroline202210 added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Mar 21, 2023
@zoewangg
Contributor

zoewangg commented Mar 22, 2023

Hi @caroline202210, thank you for reporting the issue! Unfortunately, I was not able to reproduce it. I have created a potential fix, #3852; would you be able to build that branch and verify it? You can run mvn clean install -pl :s3-transfer-manager -P quick -am and then replace the version with the snapshot version.

Edit: we found some issues and are fixing them. We will let you know once the PR is finalized.

@zoewangg zoewangg removed the needs-triage This issue or PR still needs to be triaged. label Mar 22, 2023
@caroline202210
Author

Hi @zoewangg, thanks for your reply. This issue has become a blocker for our release to the production environment. Could you please provide an estimated time for resolving it? Thanks in advance.

@zoewangg
Contributor

We don't have a timeline at the moment, but we are actively working on it. Will provide an update once we have more information.

@caroline202210
Author

Hi @zoewangg, thanks for your reply. I was wondering if there is a workaround you'd like to recommend during this time?

@debora-ito debora-ito added the p1 This is a high priority issue label Mar 27, 2023
@zoewangg
Contributor

I think I can reproduce the issue now. Can you try lowering targetThroughputInGbps? The default is 10.

https://github.com/aws/aws-sdk-java-v2/blob/master/services/s3/src/main/java/software/amazon/awssdk/services/s3/S3CrtAsyncClientBuilder.java#L108
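A minimal sketch of that workaround, assuming the CRT-based S3 client is used under the Transfer Manager. The builder methods below are from the public S3AsyncClient.crtBuilder() API; the class name and the value 1.0 are illustrative only, not a recommendation:

```java
import software.amazon.awssdk.services.s3.S3AsyncClient;
import software.amazon.awssdk.transfer.s3.S3TransferManager;

class LowThroughputTransferManager {
    // Sketch: lower the CRT client's target throughput so the SDK
    // computes fewer connections (1.0 Gbps is illustrative).
    static S3TransferManager create() {
        S3AsyncClient s3AsyncClient = S3AsyncClient.crtBuilder()
                .targetThroughputInGbps(1.0)
                .build();
        return S3TransferManager.builder()
                .s3Client(s3AsyncClient)
                .build();
    }
}
```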

@caroline202210
Author

Lowering targetThroughputInGbps to 1? Is that the suggested workaround for the time being? May I know a bit more about how it could bypass this problem? Thanks.

@zoewangg
Contributor

To be clear, it may or may not work for you. In my testing, the request seemed to succeed after I lowered the value. Basically, the SDK calculates the optimal number of connections needed to reach the target throughput; if you reduce it, the SDK will try to establish fewer connections, using fewer resources. Note that we are still investigating the issue.

@caroline202210
Author

Hi, I tried lowering targetThroughputInGbps to 1 Gbps, but it didn't work. Threads still got stuck.

@zoewangg
Contributor

Did you also see an NPE from S3MetaRequestResponseHandlerNativeAdapter?

Exception in thread "Thread-8" java.lang.NullPointerException: Cannot read the array length because "array" is null
	at software.amazon.awssdk.crt.s3.S3MetaRequestResponseHandlerNativeAdapter.onResponseBody(S3MetaRequestResponseHandlerNativeAdapter.java:19)

If so, I think we now know the root cause of why it got stuck. As a workaround, you could try setting a timer to cancel the request after a certain amount of time, e.g. future.get(xxx, xxx).
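A generic sketch of that timeout workaround using only java.util.concurrent. The helper name joinWithTimeout is illustrative; in the Transfer Manager case, the future passed in would be directoryDownload.completionFuture():

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public final class TimeoutJoin {
    // Wait for a future, cancelling it if it does not complete in time,
    // so a stuck transfer does not block the caller forever.
    public static <T> T joinWithTimeout(CompletableFuture<T> future,
                                        long timeout, TimeUnit unit)
            throws InterruptedException, ExecutionException, TimeoutException {
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            future.cancel(true); // best-effort cancellation of the stuck work
            throw e;
        }
    }
}
```

The choice of timeout depends on the expected download duration; too small a value would cancel healthy transfers.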

@caroline202210
Author

We got an NPE from S3MetaRequestResponseHandlerNativeAdapter when we used the PREVIEW version, but we haven't seen it in the current version. Could you describe the suspected root cause in more detail? I had assumed it was the infinite loop mentioned in the description.
We could set a timer while waiting. What retry policy would you suggest when it times out? Also, the issue is not transient in our environment, so I'm afraid the requests would fail every time we retry.

@zoewangg
Contributor

zoewangg commented Mar 28, 2023

Hmm, did you see any errors in the logs? I agree that the infinite loop is likely the cause of the high CPU utilization. I suspect the reason it never broke out of the loop (and the request got stuck) is that the SDK did not receive a terminal event, such as a complete or error event, from the upstream CRT class. In my test case, since the NPE was not handled properly in CRT, the onFinished callback was never invoked, so the SDK did not know the request had failed.

I'm not sure why the request got stuck in your case if you did not see any errors; I'll look into it a bit more. Can you enable SDK logs (software.amazon.awssdk.transfer) and CRT logs? To see how to enable CRT logs, switch to the AwsCrtAsyncHttpClient tab under the "Verbose wire logging" section of this page: https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/logging-slf4j.html
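For the CRT side, a minimal sketch, assuming the Log utility that ships with the aws-crt dependency; the log level is illustrative:

```java
import software.amazon.awssdk.crt.Log;

class CrtLogSetup {
    // Sketch: route CRT logs to stdout; call this once before any
    // CRT-based S3 client is created. The Debug level is illustrative.
    static void enableCrtLogging() {
        Log.initLoggingToStdout(Log.LogLevel.Debug);
    }
}
```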

@zoewangg
Contributor

I see you have LoggingTransferListener enabled. Did all requests get stuck (not printing out progress), or just a couple of requests?

@caroline202210
Author

Sure, I will try to get more logs today. My reasoning is that since the CPUs are all occupied, the actual downloading of objects is essentially paused, so no onFinished callback would be invoked from CRT.

I see you have LoggingTransferListener enabled, did all requests get stuck (not printing out progress) or just a couple of requests?
No progress was printed out on the EC2 instances.

@caroline202210
Author

Our service tries downloading from many directories (could be more than 50) once triggered, and I think all requests got stuck. I am not sure whether version 2.20.26 still has the NPE we encountered with the PREVIEW version, because the NPE is a transient problem that only happens when downloading big files. The problem we have now is that the process is completely blocked, and we have not reached the point where the NPE may occur.

@zoewangg
Contributor

I see. I'd recommend slowing down the request rate (for example, queuing the requests) or trying larger instances. I've identified a couple of potential improvements and will create a PR. I'm still not able to reproduce the exact issue you are seeing, so I'll try with higher concurrency.
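One way to sketch that request-rate limiting is a generic bounded-concurrency gate using only java.util.concurrent. The class name and permit count are illustrative; in the Transfer Manager case, each task would be one downloadDirectory call plus its wait for completion:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

public final class BoundedRunner {
    private final Semaphore permits;

    public BoundedRunner(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    // Block until a permit is free, run the task, then release the permit,
    // so at most maxConcurrent tasks execute at once.
    public <T> T run(Callable<T> task) throws Exception {
        permits.acquire();
        try {
            return task.call();
        } finally {
            permits.release();
        }
    }
}
```

Callers on different threads would share one BoundedRunner instance so the permit count caps the number of directory downloads in flight.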

@zoewangg
Contributor

zoewangg commented Mar 29, 2023

I think your theory about thread starvation is valid. I created #3867 to fix it. With this change, the thread will stop polling from the request queue if the upstream has finished while there are still in-flight requests; the queue will eventually be flushed once they finish. (Edit: we ended up refactoring it because the buffering and queuing logic is not really needed.) I tested it locally, and it now works for my test code that sends 10 downloadDirectory requests (each directory containing 10 1 GB objects). If you would like to test it, you can build the branch zoewang/tmFixDownloadDirectory locally using mvn clean install -pl :s3-transfer-manager -P quick -am

@caroline202210
Author

I tried to build my project with this patch but got the error below. I also tried to build the whole project with mvn clean install -P quick -am but got a build failure as well.

Execution failed for task ':compileJava'.

Could not resolve all files for configuration ':compileClasspath'.
Could not resolve software.amazon.awssdk:s3-transfer-manager:2.20.34-SNAPSHOT.
Required by:
project :
> Could not resolve software.amazon.awssdk:s3-transfer-manager:2.20.34-SNAPSHOT.
> Could not parse POM /Users/carli/.m2/repository/software/amazon/awssdk/s3-transfer-manager/2.20.34-SNAPSHOT/s3-transfer-manager-2.20.34-SNAPSHOT.pom
> Could not find software.amazon.awssdk:bom-internal:2.20.34-SNAPSHOT.
Searched in the following locations:
- file:/Users/carli/.m2/repository/software/amazon/awssdk/bom-internal/2.20.34-SNAPSHOT/maven-metadata.xml
- file:/Users/carli/.m2/repository/software/amazon/awssdk/bom-internal/2.20.34-SNAPSHOT/bom-internal-2.20.34-SNAPSHOT.pom
- https://artylab.expedia.biz/public-maven-virtual/software/amazon/awssdk/bom-internal/2.20.34-SNAPSHOT/maven-metadata.xml
- https://artylab.expedia.biz/public-maven-virtual/software/amazon/awssdk/bom-internal/2.20.34-SNAPSHOT/bom-internal-2.20.34-SNAPSHOT.pom
- https://plugins.gradle.org/m2/software/amazon/awssdk/bom-internal/2.20.34-SNAPSHOT/maven-metadata.xml
- https://plugins.gradle.org/m2/software/amazon/awssdk/bom-internal/2.20.34-SNAPSHOT/bom-internal-2.20.34-SNAPSHOT.pom
- https://repo.maven.apache.org/maven2/software/amazon/awssdk/bom-internal/2.20.34-SNAPSHOT/maven-metadata.xml
- https://repo.maven.apache.org/maven2/software/amazon/awssdk/bom-internal/2.20.34-SNAPSHOT/bom-internal-2.20.34-SNAPSHOT.pom

@zoewangg
Contributor

Ah, can you try mvn clean install -pl :bom-internal first?

@caroline202210
Author

I did local testing, and it seems the high CPU usage problem has been solved. However, since this is not a release version, it is difficult for me to deploy it to our real testing environment for further testing. May I ask when this fix will be included in a release version?

@zoewangg
Contributor

Good to hear. The PR has been merged and will be released today.

@zoewangg
Contributor

The fix has been released. Could you try with the latest version 2.20.36?

@zoewangg zoewangg added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. label Mar 30, 2023
@caroline202210
Author

I have tested 2.20.36 in our lab environment; CPU usage looked fine and the copy was successful. I will monitor it over the coming weekend and push it to production next week if everything is fine. Thanks.

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. label Mar 31, 2023
@zoewangg zoewangg added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. label Mar 31, 2023
@github-actions

github-actions bot commented Apr 5, 2023

It looks like this issue has not been active for more than five days. In the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please add a comment to prevent automatic closure, or if the issue is already closed please feel free to reopen it.

@github-actions github-actions bot added closing-soon This issue will close in 4 days unless further comments are made. closed-for-staleness and removed closing-soon This issue will close in 4 days unless further comments are made. labels Apr 5, 2023
@github-actions github-actions bot closed this as completed Apr 7, 2023