S3TransferManager got stuck while downloading multiple directories #3850
Comments
Hi @caroline202210, thank you for reporting the issue! Unfortunately, I was not able to reproduce it.
Hi @zoewangg, thanks for your reply. This issue is blocking our version release to the production environment. Could you please provide an estimated time for resolving it? Thanks in advance.
We don't have a timeline at the moment, but we are actively working on it. We will provide an update once we have more information.
Hi @zoewangg, thanks for your reply. Is there a workaround you would recommend in the meantime?
I think I can reproduce the issue now. Can you try lowering
Lowering targetThroughputInGbps to 1? Is that a workaround for the time being? Could you explain a bit more about how it would bypass this problem? Thanks.
To be clear, it may or may not work for you. From my testing, the request seemed to succeed after I lowered the value. Basically, the SDK calculates the optimal number of connections needed to reach the target throughput; if you reduce the target, the SDK tries to establish fewer connections and uses fewer resources. Note that we are still investigating the issue.
Hi, I tried lowering
Did you also see NPE from S3MetaRequestResponseHandlerNativeAdapter?
If so, I think we now know the root cause of why it got stuck. As a workaround, you could try setting a timer to cancel the request after a certain amount of time
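A minimal sketch of the timer workaround mentioned above, using only the JDK. It assumes the transfer exposes a `CompletableFuture` (the Transfer Manager's transfer objects provide one via `completionFuture()`); the class name `CancelAfterTimeout` and the helper `cancelAfter` are hypothetical, and whether cancelling the future actually aborts the underlying request depends on the SDK version in use.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class CancelAfterTimeout {

    // Hypothetical helper: cancels `future` if it has not completed within the timeout.
    static <T> CompletableFuture<T> cancelAfter(CompletableFuture<T> future,
                                                long timeout, TimeUnit unit,
                                                ScheduledExecutorService scheduler) {
        ScheduledFuture<?> timer =
            scheduler.schedule(() -> { future.cancel(true); }, timeout, unit);
        // Stop the timer as soon as the transfer finishes on its own.
        future.whenComplete((result, error) -> timer.cancel(false));
        return future;
    }

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Stand-in for a stuck transfer: a future that nobody ever completes.
        CompletableFuture<String> stuck = new CompletableFuture<>();
        cancelAfter(stuck, 200, TimeUnit.MILLISECONDS, scheduler);
        try {
            stuck.join();
        } catch (CancellationException e) {
            System.out.println("transfer cancelled after timeout");
        } finally {
            scheduler.shutdownNow();
        }
    }
}
```

The timer runs on its own scheduler thread, so it does not rely on the SDK delivering a terminal event. On a fully saturated machine the timer thread may itself be delayed, so the timeout is a lower bound rather than an exact deadline.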
We got an NPE from S3MetaRequestResponseHandlerNativeAdapter when we used the PREVIEW version, and we have not seen it yet in the current version. Could you describe the suspected root cause in more detail? I thought it was due to the infinite loop mentioned in the description.
Hmm, did you see any errors in the logs? I agree that the infinite loop is likely the cause of the high CPU utilization. I suspect the reason it never broke out of the loop (and the request got stuck) is that the SDK did not receive a terminal event, such as a complete or error event, from the upstream CRT class. In my test case, since the NPE was not handled properly in CRT, the onFinished callback was never invoked, so the SDK did not know the request had failed. I'm not sure why the request got stuck in your case if you did not see any errors. I'll look into it a bit more. Can you enable SDK logs (
I see you have LoggingTransferListener enabled. Did all requests get stuck (not printing out progress), or just a couple of requests?
Sure, I will try to get more logs today. My reasoning is that since all the CPUs are occupied, the actual downloading of objects is essentially paused, so no onFinished callback would be invoked from CRT.
Our service tries downloading from many directories (possibly more than 50) once triggered. I think all requests get stuck. I am not sure whether version 2.20.26 still has the NPE we encountered with the PREVIEW version, because the NPE is a transient problem that only happens when downloading big files. The problem we have now is that the process is completely blocked, and we have not reached the part where the NPE may occur.
I see. I'd recommend slowing down the request rate (for example, by queuing the requests) or trying larger instances. I've identified a couple of potential improvements and will create a PR. I'm still not able to reproduce the exact issue you were seeing, and I'll try with higher concurrency.
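One way to queue the requests, sketched with JDK classes only: a semaphore caps how many transfers are in flight, and each finished transfer frees a slot for the next. The class `BoundedTransfers`, its `submit` method, and the simulated download in `runDemo` are all hypothetical stand-ins; in a real service the supplier would start an actual directory download and return its completion future.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class BoundedTransfers {
    private final Semaphore permits;

    BoundedTransfers(int maxInFlight) {
        this.permits = new Semaphore(maxInFlight);
    }

    // Blocks the caller until a slot is free, then starts the transfer.
    <T> CompletableFuture<T> submit(Supplier<CompletableFuture<T>> startTransfer) {
        permits.acquireUninterruptibly();
        CompletableFuture<T> transfer;
        try {
            transfer = startTransfer.get();
        } catch (RuntimeException e) {
            permits.release(); // starting failed, so give the slot back
            throw e;
        }
        // Free the slot whether the transfer succeeded, failed, or was cancelled.
        return transfer.whenComplete((result, error) -> permits.release());
    }

    // Simulates `directories` downloads and returns the peak number running at once.
    static int runDemo(int maxInFlight, int directories) {
        BoundedTransfers limiter = new BoundedTransfers(maxInFlight);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        List<CompletableFuture<String>> all = new ArrayList<>();
        for (int i = 0; i < directories; i++) {
            String dir = "dir-" + i;
            Supplier<CompletableFuture<String>> start = () -> CompletableFuture.supplyAsync(() -> {
                peak.accumulateAndGet(inFlight.incrementAndGet(), Math::max);
                try {
                    Thread.sleep(10); // placeholder for the real download work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                inFlight.decrementAndGet();
                return dir;
            });
            all.add(limiter.submit(start));
        }
        CompletableFuture.allOf(all.toArray(new CompletableFuture[0])).join();
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("peak concurrent transfers: " + runDemo(4, 50));
    }
}
```

Because `submit` blocks the calling thread while it waits for a slot, it should be called from an application thread and not from a thread that completes the transfers, otherwise the limiter could deadlock.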
Your theory about thread starvation seems valid. I created #3867 to fix it.
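To illustrate the starvation theory, here is a rough sketch of the general alternative to spinning on an in-flight counter: instead of looping while the counter is above zero, the last request to finish signals completion. This is not the SDK's code and not necessarily what #3867 changes; `InFlightTracker` and all of its method names are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

public class InFlightTracker {
    private final AtomicInteger inFlight = new AtomicInteger();
    private volatile boolean upstreamDone;
    private final CompletableFuture<Void> allDone = new CompletableFuture<>();

    void onRequestStarted() {
        inFlight.incrementAndGet();
    }

    // Called when one request finishes; the last one to finish signals completion.
    void onRequestFinished() {
        if (inFlight.decrementAndGet() == 0 && upstreamDone) {
            allDone.complete(null);
        }
    }

    // Called once when the upstream publisher reports completion.
    // Anti-pattern avoided here: while (inFlight.get() > 0) { } which pins a core
    // and can keep the threads that would decrement the counter from running.
    void onUpstreamComplete() {
        upstreamDone = true;
        if (inFlight.get() == 0) {
            allDone.complete(null);
        }
    }

    CompletableFuture<Void> completion() {
        return allDone;
    }

    public static void main(String[] args) {
        InFlightTracker tracker = new InFlightTracker();
        tracker.onRequestStarted();
        tracker.onRequestStarted();
        tracker.onUpstreamComplete();                      // two requests still in flight
        System.out.println(tracker.completion().isDone()); // false
        tracker.onRequestFinished();
        tracker.onRequestFinished();                       // last one completes the future
        System.out.println(tracker.completion().isDone()); // true
    }
}
```

No thread waits in a loop in this version, so a saturated machine slows completion down but cannot prevent the counter from reaching zero. The sketch assumes no new request starts after the upstream completion signal.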
I tried to build my project with this patch but got this error below. I also tried to build the whole project by `Execution failed for task ':compileJava'.
Ah, can you try
I did local testing and it seems that the problem of high CPU usage has been solved. However, since this is not a release version, it is difficult for me to deploy it to our real testing environment for further testing. May I ask when this fix will be included in one of your release versions?
Good to hear. The PR has been merged and will be released today.
The fix has been released. Could you try with the latest version
I have tested the
It looks like this issue has not been active for more than five days. In the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please add a comment to prevent automatic closure, or if the issue is already closed please feel free to reopen it.
Describe the bug
Hi team,
Our service uses S3TransferManager to download multiple directories from an S3 bucket. The process got stuck with the following stack trace:
All 8 CPU cores are running the code above, so CPU usage is about 800%.
The threads above might prevent other threads from running. For example, the following RUNNABLE thread did not get a chance to run; its CPU time is only 0.25 ms:
Expected Behavior
Download multiple directories (>50) successfully with reasonable CPU usage.
Current Behavior
See above.
Reproduction Steps
The issue is not transient; we can reproduce it using the following steps:
directories under a bucket (>50).
directory (10-30)
Possible Solution
From AsyncBufferingSubscriber, it looks like if numRequestsInFlight is larger than 0 and the event is ON_COMPLETE, the following code will be in an infinite while loop. If all cores are running the following code, it would probably starve other threads, so numRequestsInFlight will never be decremented.
Additional Information/Context
Version:
EC2 instance type: c5.2xlarge
AWS Java SDK version used
NA
JDK version used
openjdk version "11.0.10" 2021-01-19 OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.10+9) OpenJDK 64-Bit Server VM AdoptOpenJDK (build 11.0.10+9, mixed mode)
Operating System and version
Linux ip-10-16-81-190 4.14.305-155.531.amzn1.x86_64 #1 SMP Tue Feb 14 10:36:14 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux