-
Notifications
You must be signed in to change notification settings - Fork 9k
YARN-10058. Handle uncaught exception for async-scheduling threads to prevent scheduler hangs #7129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@shameersss1 @slfan1989 Could you help to review this PR please? Thanks. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
@TaoYang526 I'm sorry I've been busy this past week. I'll take a look at the PR you submitted over the weekend. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking this up quickly.
The changes look good to me +1
…o prevent scheduler hangs.
Thanks @shameersss1 for the review. I'll merge and then force-push to trigger testing for the patch. |
808208b
to
edf8db8
Compare
🎊 +1 overall
This message was automatically generated. |
@slfan1989 - Could you please review this as well ? (This is a minor change) |
🎊 +1 overall
This message was automatically generated. |
@slfan1989 Gentle reminder for review. |
@TaoYang526 - What could be the real life cases when the Async thread could crash ? Just curious to know |
@shameersss1 Async-thread may crash when runtime exceptions like NPE are thrown inside the scheduling process, this PR can guarantee that RM will have a chance to be recovered in those scenarios, otherwise RM will hang without obvious error (so that user have to dig ERROR info from large volume of RM logs). There's no known cases for me, but there may still be unresolved issues or new problems introduced in the future. |
@Hexiaoqiao Could you please help to review this PR at your convenience? Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. +1 from my side. Will commit if no more comment while two workdays later. Thanks.
LGTM +1. |
@TaoYang526 Can we fix the checkstyle issues?
|
@slfan1989 Sure, thank for reminding that! |
b548f32
to
806cdde
Compare
Removed the unused import for UT. |
+1 |
@TaoYang526 Could you please help review this PR(#7138) ? While I have confidence in @shameersss1 code quality, I'm not very familiar with the capacity scheduler, so I'm hesitant to draw any conclusions. I know @brumi1024 is familiar with this, and I’ve already invited him to help with the review. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
Committed to trunk. Thanks @TaoYang526 for your contribution and @shameersss1 @slfan1989 for your reviews. |
Description of PR
Handle uncaught exception for async-scheduling threads to prevent scheduler hangs.
How was this patch tested?
Includes 2 unit tests:
(1) HA-disabled: RM should be shutdown after async-scheduling thread exit.
(2) HA-enabled: RM1 should be transitioned from active to standby after async-scheduling thread exit, and this RM should work once it is activated again.
For code changes:
LICENSE
,LICENSE-binary
,NOTICE-binary
files?