Test services resilience #1039

pcrespov · 2019-08-28T09:47:37Z

Found some situations in which a service suddenly crashes and the swarm constantly restarts it ... so it enters an endless stop-restart loop.

This happens either because the code is broken (normally caught by pylint, but not always) or because of a faulty design.

The fact is: all services should guarantee a certain level of resilience.

the only situation in which a service stops is when its configuration is wrong?
it does not crash upon startup (ensure the app is running after startup and the container does not restart)
if a dependent backend service is down, it should NOT stop but rather disable the affected functionality
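
A minimal sketch of the last point (disable the affected functionality instead of stopping), assuming a toy service. `Service`, `BackendUnavailable`, the feature names and the 503 reply are hypothetical and only illustrate the idea; this is not the actual code of any service in the stack.

```python
import asyncio


class BackendUnavailable(Exception):
    """Hypothetical error raised when a dependent backend cannot be reached."""


class Service:
    """Toy service that keeps running while optional backends are down."""

    def __init__(self, connectors):
        # connectors: feature name -> coroutine function that connects to its backend
        self._connectors = connectors
        self.disabled = {}  # feature name -> reason it was disabled

    async def startup(self):
        for feature, connect in self._connectors.items():
            try:
                await connect()
            except BackendUnavailable as err:
                # Do NOT crash: record the failure and disable only this feature
                self.disabled[feature] = str(err)

    def handle(self, feature):
        # A disabled feature answers with an explicit error instead of crashing
        if feature in self.disabled:
            return 503, f"{feature} unavailable: {self.disabled[feature]}"
        return 200, f"{feature} ok"


async def connect_db():
    return None  # pretend the database is reachable


async def connect_s3():
    raise BackendUnavailable("minio/s3 not reachable")


service = Service({"projects": connect_db, "storage": connect_s3})
asyncio.run(service.startup())
print(service.handle("projects"))
print(service.handle("storage"))
```

The service starts even though one backend is down, and only requests that need that backend get an error.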

Examples

If specs validation fails, the webserver will retry ... as if it had failed to connect to apihub!
If minio/s3 access is not available, services should not collapse but just report an error
~~- storage GET /locations/?user_id ... gets an error in db (connection drops) and~~
it shall be possible to start a service from the CLI without any other service being up
if redis fails to start, the webserver fails
The migration service takes some time to upgrade. The webserver and other services should be resilient to failure until the migration is complete! This happens e.g. in the e2e tests, where migration takes some time to start ... and in the meantime the webserver has failed and restarted. Notice that startup cannot wait eternally because we also want a fast startup ... to achieve zero downtime!
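
Two of the examples above (do not retry on a specs validation failure, and wait for migration but not eternally) could be sketched as a bounded retry. This is only an illustration: `wait_until_ready`, `ConfigError`, `NotReadyError` and all the default timings are hypothetical names and values, not something taken from the code base.

```python
import time


class ConfigError(Exception):
    """Hypothetical non-retryable failure, e.g. specs validation failed."""


class NotReadyError(Exception):
    """Hypothetical transient failure, e.g. migration has not finished yet."""


def wait_until_ready(check, deadline_s=5.0, first_delay_s=0.1, max_delay_s=1.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call check() until it succeeds; return the number of attempts needed."""
    start = clock()
    delay = first_delay_s
    attempts = 0
    while True:
        attempts += 1
        try:
            check()
            return attempts
        except ConfigError:
            # Retrying cannot fix a wrong configuration: fail right away
            raise
        except NotReadyError as err:
            # Give up if the next wait would go past the startup deadline
            if clock() - start + delay > deadline_s:
                raise TimeoutError(f"not ready after {attempts} attempts") from err
            sleep(delay)
            delay = min(delay * 2, max_delay_s)  # exponential backoff, capped


calls = {"n": 0}


def migration_check():
    # Pretend the migration completes just before the third check
    calls["n"] += 1
    if calls["n"] < 3:
        raise NotReadyError("migration still running")


attempts = wait_until_ready(migration_check, sleep=lambda _seconds: None)
print(attempts)
```

The deadline keeps startup fast, while the backoff avoids hammering a dependency that is still coming up.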

TODO

define resilience guarantee levels (analogous to exception guarantees)
create generic tests. Should involve starting a service within its container (see ideas on /docker-makefile-x-ops)
add to cookiecutters
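
A rough idea of what a generic "does not crash upon startup" test could look like. This sketch uses a plain subprocess as a stand-in for a container, so `stays_up`, the grace period and both demo commands are assumptions; a real test would start the service within its container instead.

```python
import subprocess
import sys


def stays_up(cmd, grace_s=2.0):
    """Start cmd and report whether it is still running after grace_s seconds."""
    proc = subprocess.Popen(cmd)
    try:
        try:
            proc.wait(timeout=grace_s)
            return False  # exited within the grace period: it would be restarted
        except subprocess.TimeoutExpired:
            return True  # still running after the grace period
    finally:
        # Always clean up the child process
        if proc.poll() is None:
            proc.terminate()
            proc.wait()


# Stand-ins for a healthy service and for one that crashes at startup
healthy = [sys.executable, "-c", "import time; time.sleep(30)"]
crashing = [sys.executable, "-c", "raise SystemExit(1)"]

healthy_ok = stays_up(healthy)
crashing_ok = stays_up(crashing, grace_s=10.0)
print(healthy_ok, crashing_ok)
```

The same check could be parametrized over all services and over which dependencies are deliberately left down.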

Related to PBI

Software stability


pcrespov · 2019-08-29T19:12:24Z

Can use scripts in ITISFoundation/osparc-op to start stacks with failures or stress test the system

pcrespov · 2021-02-10T19:46:26Z

Case: A connection dropped unexpectedly

In this test:

front-end hits run pipeline
webserver -> (start pipeline) -> director_v2
the client in the webserver raised aiohttp.client_exceptions.ServerDisconnectedError, which was translated into a 500
backend responded with 500
the front-end did not handle it

-> client retry policy
-> front-end should handle it, since it can simply ask the user to retry.
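
A possible shape for the client retry policy, as a sketch only. The `ServerDisconnectedError` class below is a stand-in for aiohttp.client_exceptions.ServerDisconnectedError so that the example runs without aiohttp; `call_with_retry`, `make_request`, the number of attempts and the choice of replying 503 instead of 500 are assumptions, not the actual webserver code.

```python
import asyncio


class ServerDisconnectedError(Exception):
    """Stand-in for aiohttp.client_exceptions.ServerDisconnectedError."""


async def call_with_retry(request, attempts=3, delay_s=0.0):
    """Retry request() on dropped connections; return (status, payload)."""
    for attempt in range(1, attempts + 1):
        try:
            return 200, await request()
        except ServerDisconnectedError:
            if attempt == attempts:
                # 503 signals a temporary failure the front-end can ask to retry
                return 503, "director_v2 unavailable, please retry"
            await asyncio.sleep(delay_s * attempt)  # linear backoff between tries


def make_request(failures):
    """Build a fake request that drops the connection `failures` times."""
    state = {"left": failures}

    async def request():
        if state["left"] > 0:
            state["left"] -= 1
            raise ServerDisconnectedError("connection dropped")
        return "pipeline started"

    return request


recovered = asyncio.run(call_with_retry(make_request(1)))
gave_up = asyncio.run(call_with_retry(make_request(5)))
print(recovered)
print(gave_up)
```

Note that blindly retrying is only safe if the request is idempotent; whether "start pipeline" qualifies would have to be checked first.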

pcrespov added this to the Brindisi milestone Aug 28, 2019

pcrespov self-assigned this Aug 28, 2019

pcrespov mentioned this issue Sep 19, 2019

Webserver: avoid retry if specs validation failure #370

Closed

pcrespov added a:infra+ops maintenance of infrastructure or operations (discussed in retro) a:services-library issues on packages/service-libs labels Sep 19, 2019

sanderegg self-assigned this Oct 21, 2019

pcrespov modified the milestones: Brindisi, Fourecks or XXXX Oct 25, 2019

sanderegg removed their assignment Feb 18, 2020

pcrespov removed this from the Fourecks or XXXX milestone Jul 10, 2020

pcrespov added the bug buggy, it does not work as expected label Jul 10, 2020

GitHK mentioned this issue Feb 11, 2021

Run button api 500 reply #2146

Closed

sanderegg mentioned this issue Feb 25, 2021

platform stability #1426

Closed

pcrespov added t:enhancement Improvement or request on an existing feature t:maintenance Some planned maintenance work and removed bug buggy, it does not work as expected labels Jun 9, 2021

pcrespov mentioned this issue Nov 16, 2022

Define policy upon redis failure (an in general for other services failures) #3572

Open

pcrespov closed this as not planned Apr 22, 2025
