webserver's healthcheck monitors and diagnoses slow callbacks as unhealthy #1406

Merged
merged 5 commits into from
Mar 26, 2020

Member

@pcrespov pcrespov commented Mar 25, 2020

What do these changes do?

Introduces diagnostics in the webserver that monitor slow callbacks. Under certain conditions () the diagnostics can determine that the service is unhealthy, and swarm will then restart it automatically.

  • servicelib.monitor_slow_callbacks: hooks the event loop's handler and registers an incident whenever a callback is slow
  • webserver: new diagnostics module that keeps track of incidents
  • webserver: the healthcheck now returns 503 when the service is overloaded due to long-delayed callbacks. Swarm will then restart the webserver.
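
For readers unfamiliar with the approach, the three pieces above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the names `SlowCallback`, `IncidentsRegistry`, `assess_health` and `demo`, the thresholds, and the "maximum delay exceeds a limit" health rule are all assumptions made for the example. It also relies on patching the private `asyncio.events.Handle._run`, which works with the default asyncio loop in CPython but is not a public API.

```python
import asyncio
import asyncio.events
import time
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque


@dataclass
class SlowCallback:
    """One recorded incident (hypothetical shape, not the PR's class)."""

    name: str
    delay_secs: float


class IncidentsRegistry:
    """Keeps only the most recent incidents so memory stays bounded."""

    def __init__(self, max_size: int = 10) -> None:
        self._items: Deque[SlowCallback] = deque(maxlen=max_size)

    def append(self, incident: SlowCallback) -> None:
        self._items.append(incident)

    def __len__(self) -> int:
        return len(self._items)

    def max_delay(self) -> float:
        return max((item.delay_secs for item in self._items), default=0.0)


def monitor_slow_callbacks(
    registry: IncidentsRegistry, slow_duration_secs: float
) -> Callable[[], None]:
    """Time every event-loop callback by wrapping the private Handle._run.

    Returns a function that undoes the patch.
    """
    original_run = asyncio.events.Handle._run

    def timed_run(self):
        start = time.monotonic()
        try:
            return original_run(self)
        finally:
            elapsed = time.monotonic() - start
            if elapsed >= slow_duration_secs:
                registry.append(SlowCallback(repr(self._callback), elapsed))

    asyncio.events.Handle._run = timed_run

    def restore() -> None:
        asyncio.events.Handle._run = original_run

    return restore


def assess_health(registry: IncidentsRegistry, max_delay_allowed_secs: float) -> int:
    """Assumed rule: answer 503 once any recorded delay exceeds the limit."""
    return 503 if registry.max_delay() > max_delay_allowed_secs else 200


async def demo() -> int:
    """Block the loop on purpose and return the resulting health status."""
    registry = IncidentsRegistry()
    restore = monitor_slow_callbacks(registry, slow_duration_secs=0.05)
    try:
        # time.sleep blocks the whole event loop for 0.2 s
        asyncio.get_running_loop().call_soon(time.sleep, 0.2)
        await asyncio.sleep(0.3)
    finally:
        restore()
    return assess_health(registry, max_delay_allowed_secs=0.1)
```

Calling `asyncio.run(demo())` exercises the whole path: the blocking `time.sleep` is recorded as an incident, and the health rule then yields 503. In a real service the status would be returned by the healthcheck route rather than by a helper function.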

Related issue number

Platform Stability
Split from PR #1401

How to test

cd packages/service-library
make install-dev
make tests

cd ../../services/web/server
make install-dev
make tests-unit

Manual test: a slowdown was enforced artificially, and the diagnostics flagged the service as unhealthy. Swarm reacted by restarting it.
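
Swarm only sees the outcome of the container's health check command, where exit code 0 means healthy and 1 means unhealthy. A probe that turns the 503 into a failing exit code might look like the sketch below. The function name `probe`, the timeout and the URL in the usage comment are assumptions for illustration, not taken from this PR.

```python
import sys
import urllib.error
import urllib.request


def probe(url: str, timeout_secs: float = 2.0) -> int:
    """Return 0 if the endpoint answers with a 2xx status, otherwise 1."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_secs) as response:
            return 0 if 200 <= response.status < 300 else 1
    except (urllib.error.URLError, OSError):
        # urlopen raises HTTPError (a URLError subclass) for 503;
        # refused connections and timeouts also end up here.
        return 1


# Possible use as a health check command (hypothetical URL):
#   sys.exit(probe("http://localhost:8080/v0/"))
```

Treating timeouts and connection errors as unhealthy is a deliberate choice here: a webserver whose event loop is blocked may not answer at all, which should count the same as an explicit 503.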

Checklist

  • Did you change any service's API? Then make sure to bundle document and upgrade version (make openapi-specs, git commit ... and then make version-*)
  • Unit tests for the changes exist
  • Runs in the swarm
  • Documentation reflects the changes
  • New module? Add your github username to .github/CODEOWNERS

@codecov
codecov bot commented Mar 25, 2020

Codecov Report

Merging #1406 into master will increase coverage by 0.99%.
The diff coverage is 93.75%.

@@            Coverage Diff             @@
##           master    #1406      +/-   ##
==========================================
+ Coverage   69.76%   70.76%   +0.99%     
==========================================
  Files         222      225       +3     
  Lines        8818     8913      +95     
  Branches      968      979      +11     
==========================================
+ Hits         6152     6307     +155     
+ Misses       2383     2324      -59     
+ Partials      283      282       -1

Flag                Coverage Δ
#integrationtests   57.16% <48.83%> (-0.09%) ⬇️
#unittests          65.33% <93.75%> (-2.01%) ⬇️

Impacted Files                                          Coverage Δ
...ver/src/simcore_service_webserver/rest_handlers.py   87.5% <100%> (+3.12%) ⬆️
...erver/src/simcore_service_webserver/application.py   93.65% <100%> (+0.2%) ⬆️
...e-library/src/servicelib/monitor_slow_callbacks.py   100% <100%> (ø)
...ckages/service-library/src/servicelib/incidents.py   88.23% <88.23%> (ø)
...erver/src/simcore_service_webserver/diagnostics.py   93.75% <93.75%> (ø)
.../director/src/simcore_service_director/producer.py   67.01% <0%> (+0.25%) ⬆️
...core-sdk/src/simcore_sdk/node_ports/_items_list.py   90.69% <0%> (+2.32%) ⬆️
... and 7 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d2b3e6b...95ec007.

@pcrespov pcrespov self-assigned this Mar 25, 2020
@pcrespov pcrespov added this to the Dim Sum milestone Mar 25, 2020
@pcrespov pcrespov marked this pull request as ready for review March 25, 2020 12:50
@pcrespov pcrespov changed the title from "WIP: webserver's healthcheck monitors and diagnoses slow callbacks as unhealthy" to "webserver's healthcheck monitors and diagnoses slow callbacks as unhealthy" Mar 25, 2020
Member

@odeimaiz odeimaiz left a comment

Yeah, no more every second day manual restarting!

Member

@sanderegg sanderegg left a comment

awesome! just an empty test to clean away

@pcrespov pcrespov added the a:webserver issue related to the webserver service label Mar 26, 2020
@pcrespov pcrespov merged commit 282174b into ITISFoundation:master Mar 26, 2020
@pcrespov pcrespov deleted the enh-healthcheck branch March 26, 2020 11:22
@sanderegg sanderegg linked an issue Apr 6, 2020 that may be closed by this pull request