-
Notifications
You must be signed in to change notification settings - Fork 10
feat: adding graceful shutdown for druid process #23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: adding graceful shutdown for druid process #23
Conversation
@sydefz Thank you for your PR. We will take a look at it and get back to you. |
hi @dougtoppin any updates on this? |
@sydefz Sorry about the delay. We have been working on some other tasks. We will get to this soon. |
@@ -109,6 +109,8 @@ waitForProcess() { | |||
else | |||
echo "The new node is up. Stopping old node..." | |||
$SUPERVISORCTL_CMD stop $process_name | |||
# wait gracefulShutdownTimeout for 30 seconds | |||
sleep 30 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we pass in stopwaitsecs
from node config into this script instead of hardcoding it here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that's a good idea, eventually we should avoid hard code, make it configurable and have the flexibility to handle shorter or longer graceful shutdown periods.
I guess the priority and purpose of this PR is to ensure we don't throw 500s during deployment, I'm happy to extend this to make it flexible later.
@@ -58,6 +58,7 @@ user=${USER_NAME} | |||
autorestart=true | |||
redirect_stderr=true | |||
stdout_logfile=/var/log/supervisor/historical.log | |||
stopwaitsecs=30 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we put this in common_user_data
so we don't have to repeat it for each individual service?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the supervisord config for each druid component is generated in individual user_data. They are slightly different, bringing these customisations into common_user_data would defeat the purpose of having common
.
I'd keep it like this for now.
thanks for the submission @sydefz |
Description of changes:
Sometimes we can observe after stopping historical the host gets terminated immediately, this causes the in-flight requests to fail and results in 500s.
This PR adds a 30s sleep after the historical/query/master stop, so it has enough time to gracefully shutdown the process before terminating the host. Also updates the supervisord config to wait for 30s before sending the KILL signal.
I chose 30s because the default druid.server.http.gracefulShutdownTimeout is PT30s (https://druid.apache.org/docs/latest/configuration/#historical-query-configs)
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.