Test services resilience #1039

Closed
2 tasks
pcrespov opened this issue Aug 28, 2019 · 2 comments
Assignees
Labels
a:infra+ops maintenance of infrastructure or operations (discussed in retro) a:services-library issues on packages/service-libs t:enhancement Improvement or request on an existing feature t:maintenance Some planned maintenance work

Comments

@pcrespov
Member

pcrespov commented Aug 28, 2019

Found some situations in which a service suddenly crashes and the swarm constantly restarts it, so it enters an endless stop-restart loop.

This happens either because the code is broken (normally caught by pylint, but not always) or because of a faulty design.

The fact is: all services should guarantee a certain level of resilience.

  • the only situation in which it stops is when the configuration is wrong?
  • does not crash upon startup (ensure app is running after startup and container does not restart)
  • if a dependent backend service is down, it should NOT stop but rather disable the affected functionality
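
A minimal sketch of the third requirement (disable the affected functionality instead of stopping), not taken from the osparc code base. `Service`, `DependencyUnavailable` and `handle` are hypothetical names chosen for illustration, and the 503 status is an assumption about how a disabled feature could be reported:

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


class DependencyUnavailable(Exception):
    """Raised when a feature needs a backend that is currently down."""


@dataclass
class Service:
    # Maps a dependency name (hypothetical, e.g. "storage") to its last known state.
    available: Dict[str, bool] = field(default_factory=dict)

    def mark(self, dependency: str, is_up: bool) -> None:
        """Record the result of the latest health probe for a dependency."""
        self.available[dependency] = is_up

    def require(self, dependency: str) -> None:
        """Fail only the current request; dependencies never probed count as down."""
        if not self.available.get(dependency, False):
            raise DependencyUnavailable(dependency)


def handle(service: Service, dependency: Optional[str]) -> Tuple[int, str]:
    """Return (status, body) for a request instead of letting the process die."""
    try:
        if dependency is not None:
            service.require(dependency)
        return 200, "ok"
    except DependencyUnavailable as err:
        return 503, f"{err} is unavailable, feature disabled"
```

With this shape, a request that needs a dependency marked as down gets a 503, while requests that need nothing else keep returning 200 and the container keeps running.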

Examples

  • If specs validation fails, the webserver will retry ... as if it had failed to connect to apihub!
  • If minio/s3 access is not available, services should not collapse but just report an error
    - storage GET /locations/?user_id ... gets an error in db (connection drops) and
  • it shall be possible to start a service from the CLI without any other service being up
  • if redis fails to start, the webserver fails
  • The migration service takes some time to upgrade. The webserver and the other services should be resilient to failure until the migration is complete! This happens e.g. in the e2e tests, where the migration takes some time to start ... and in the meantime the webserver has failed and restarted. Notice that startup cannot wait forever because we also want a fast startup ... to achieve zero downtime!
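
One way to reconcile "do not crash while the migration is still running" with "startup cannot wait forever" is a bounded wait: poll the dependency with exponential backoff up to a deadline, then let the caller decide to start in a degraded mode. The sketch below is only an illustration under that assumption; `wait_for_dependency` and all of its parameters are hypothetical, and `sleep`/`clock` are injectable only so the helper can be tested without real waiting:

```python
import time
from typing import Callable


def wait_for_dependency(
    is_ready: Callable[[], bool],
    timeout_s: float = 30.0,
    first_delay_s: float = 0.5,
    max_delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll is_ready() with exponential backoff until it succeeds or the deadline passes.

    Returns True if the dependency came up and False on timeout. It does not
    raise on probe errors, so the caller can start degraded rather than crash.
    """
    deadline = clock() + timeout_s
    delay = first_delay_s
    while True:
        try:
            if is_ready():
                return True
        except Exception:
            # A failing probe (e.g. connection refused) counts as "not ready yet".
            pass
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay_s)
```

A service could call this once per dependency at startup and, on `False`, mark the related functionality as disabled instead of exiting, which avoids the swarm restart loop described above.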

TODO

Related with PBI

  • Software stability
@pcrespov pcrespov added this to the Brindisi milestone Aug 28, 2019
@pcrespov pcrespov self-assigned this Aug 28, 2019
@pcrespov
Member Author

We can use the scripts in ITISFoundation/osparc-op to start stacks with failures or to stress test the system

@pcrespov pcrespov added a:infra+ops maintenance of infrastructure or operations (discussed in retro) a:services-library issues on packages/service-libs labels Sep 19, 2019
@sanderegg sanderegg self-assigned this Oct 21, 2019
@pcrespov pcrespov modified the milestones: Brindisi, Fourecks or XXXX Oct 25, 2019
@sanderegg sanderegg removed their assignment Feb 18, 2020
@pcrespov pcrespov removed this from the Fourecks or XXXX milestone Jul 10, 2020
@pcrespov pcrespov added the bug buggy, it does not work as expected label Jul 10, 2020
@pcrespov
Member Author

pcrespov commented Feb 10, 2021

Case: A connection dropped unexpectedly

In this test:

  • front-end hits run pipeline
  • webserver -> (start pipeline) -> director_v2
  • the client in the webserver raised aiohttp.client_exceptions.ServerDisconnectedError, which was translated into a 500
  • backend responded with 500
  • the front-end did not handle it

-> the client needs a retry policy
-> the front-end should handle it, e.g. by simply asking the user to retry.
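
The client retry policy mentioned above could look roughly like the following sketch. It is not the webserver's actual client code: `call_with_retry` and its parameters are hypothetical, and `ServerDisconnected` is a local stand-in so the example runs without aiohttp installed. In the real client one would presumably pass `aiohttp.client_exceptions.ServerDisconnectedError` in `retry_on`:

```python
import asyncio
from typing import Awaitable, Callable, Tuple, Type, TypeVar

T = TypeVar("T")


class ServerDisconnected(Exception):
    """Local stand-in for aiohttp.client_exceptions.ServerDisconnectedError."""


async def call_with_retry(
    request: Callable[[], Awaitable[T]],
    retry_on: Tuple[Type[BaseException], ...] = (ServerDisconnected,),
    attempts: int = 3,
    delay_s: float = 0.2,
) -> T:
    """Await request() up to `attempts` times, retrying only on `retry_on` errors."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return await request()
        except retry_on:
            if attempt == attempts:
                # Out of attempts: re-raise so the caller can map it to an error response.
                raise
            # Linear backoff between attempts.
            await asyncio.sleep(delay_s * attempt)
    raise AssertionError("unreachable")
```

Retrying is only safe if the retried call is idempotent. Whether "start pipeline" in director_v2 is idempotent is not stated in this issue, so that would need to be checked before applying such a policy to it.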

@pcrespov pcrespov added t:enhancement Improvement or request on an existing feature t:maintenance Some planned maintenance work and removed bug buggy, it does not work as expected labels Jun 9, 2021
@pcrespov pcrespov closed this as not planned Won't fix, can't repro, duplicate, stale Apr 22, 2025

2 participants