Deployments field tests before the [big] release #505

Closed
9 tasks done
YuryHrytsuk opened this issue Jan 11, 2024 · 9 comments
Comments

@YuryHrytsuk
Collaborator

YuryHrytsuk commented Jan 11, 2024

Actions

  • simcore traefik --> increase replicas
  • review the sizing of the manager EC2 machines
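
A minimal sketch of the scaling step, assuming traefik runs as a swarm service named simcore_traefik (the real stack/service name may differ):

```bash
# Scale traefik up so it survives the loss of one node
docker service scale simcore_traefik=3

# Verify the replicas actually land on different nodes
docker service ps simcore_traefik --format '{{.Node}}\t{{.CurrentState}}'
```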

Tests

  • critical manager node --> take down and test
  • simcore node --> take down and test
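
Both tests can be run soft or hard; a sketch with placeholder node/instance names:

```bash
# Soft variant: drain the node so swarm reschedules its tasks gracefully
docker node update --availability drain <node-to-kill>

# Watch where its tasks get rescheduled
docker node ps <node-to-kill>

# Hard variant: stop the EC2 instance underneath the node
aws ec2 stop-instances --instance-ids <instance-id>
```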

OPS tests

  • make sure that applying a Postgres backup works
  • make sure that applying an AWS snapshot works
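
A hedged sketch of the Postgres check, assuming the backups are pg_dump custom-format archives (host, user, database and file names are all placeholders):

```bash
# Restore the dump into a scratch database and sanity-check it
createdb -h <host> -U <user> restore_test
pg_restore -h <host> -U <user> -d restore_test /backups/simcore.dump
psql -h <host> -U <user> -d restore_test -c '\dt'   # did the tables come back?
```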

To observe

  • What happens if Rabbit is down for 10 minutes
  • What happens if Postgres is down for 10 minutes
  • What happens if Redis is down for 10 minutes
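
Each outage can be simulated by scaling the service to zero for 10 minutes; e.g. for Rabbit (the service name simcore_rabbit is an assumption):

```bash
# Take rabbit down for 10 minutes, then bring it back
docker service scale simcore_rabbit=0
sleep 600
docker service scale simcore_rabbit=1
```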

Insights

  • director-v2 --> not restartable
  • director-v2 --> one instance
  • storage --> one instance

@mrnicegyu11 @sanderegg @matusdrobuliak66

@YuryHrytsuk YuryHrytsuk added this to the This is Sparta! milestone Jan 15, 2024
@YuryHrytsuk
Collaborator Author

The tests are scheduled for Mon (Jan. 29) from 8:00 AM till 12:00 PM

@matusdrobuliak66
Contributor

Idea:

  • Keep a spare, fully prepared node in the cluster (in drain mode), so that when one node breaks we can swiftly introduce a healthy one
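
A sketch of the spare-node idea with the docker CLI (the node name is a placeholder):

```bash
# Park the spare node in drain mode: it is in the swarm but receives no tasks
docker node update --availability drain spare-node-1

# When another node breaks, activate the spare; swarm reschedules onto it
docker node update --availability active spare-node-1
```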

@YuryHrytsuk
Collaborator Author

YuryHrytsuk commented Jan 29, 2024

Observations

  • When Rabbit is down, we return 404 (the webserver is continuously restarting)
  • When Redis is down, the webpage stays up; we only show the Red Cloud
  • When Postgres is down, we show a black screen

Recovering

  • After Rabbit was down: automatic recovery (no action required)
  • After Redis was down: automatic recovery (no action required)
  • After Postgres was down:
    • The state becomes unclean and requires cleanup (i.e. removing hanging sidecars, restarting storage and director (v2?))
    • So it requires maintenance (a sketch of this cleanup follows below)
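
The maintenance could be scripted as a forced redeploy of the affected services; a sketch assuming stack-prefixed service names and a dy-sidecar naming convention for the hanging sidecars (both are assumptions):

```bash
# Force-restart the services left in an unclean state
docker service update --force simcore_storage
docker service update --force simcore_director-v2

# Remove hanging dynamic sidecar services, if any are left behind
docker service ls --filter name=dy-sidecar --quiet | xargs -r docker service rm
```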

@YuryHrytsuk
Collaborator Author

YuryHrytsuk commented Feb 2, 2024

Field Test: Simcore Node is down

Preparation

  1. Prepare an additional (2nd) Simcore node so that the swarm cluster has 2 Simcore nodes (for HA)
  2. Redeploy the simcore stack to distribute services across the 2 simcore nodes
  3. Stop one of the AWS instances (a simcore node)
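
The three steps map roughly onto the following commands (label key/value, stack file, stack name and instance id are all placeholders):

```bash
# 1. Label the new node so simcore services may be placed on it
docker node update --label-add simcore=true simcore-node-2

# 2. Redeploy the stack so services spread across both simcore nodes
docker stack deploy -c simcore.yml simcore

# 3. Simulate the failure by stopping one of the instances
aws ec2 stop-instances --instance-ids <instance-id>
```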

First field test attempt

  • It didn't go well: many services would not start on the second (field test) simcore node due to resource constraints, even though this node has the same resources
    • While recovering the cluster, it turned out that some services would not start on the usual simcore node either (again resource constraints), which is weird
  • The Simcore node on NIH STAG also runs rediscommander, jaeger and prometheuscadvisor
    • Q: do these services need data migration when moving to another node?

Second field test attempt

  • All went well. There was a short downtime for some services but, as a user, I didn't notice disruptions
    • All services migrated from the broken node to the working one. HA worked well
    • Of course, it all depends on which services are migrating. E.g. if director-v2 or api-server migrate from the "broken" node to the working node, there will be downtime

Conclusions

  • Consider whether running jaeger, rediscommander, ... (services with docker node labels other than simcore) is good practice
    • Once a node is down and these services need to migrate to a new node, we will lose data. How critical is this data?
  • We need >1 replicas for every simcore service to smooth over situations when one of the simcore nodes is down (as sketched below)
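
A sketch of the replica bump for one service; in practice this belongs in the deploy: section of the stack file (the service name is an assumption):

```bash
# Run two replicas so the service survives the loss of one simcore node
docker service update --replicas 2 simcore_webserver
```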

@YuryHrytsuk
Collaborator Author

YuryHrytsuk commented Feb 2, 2024

Field Test: Manager Node is down

Preparation

  1. Add 2 manager nodes with labels manager, ops, traefik, dasksidecar, dynamicsidecar
  2. Increase traefik replicas to 3
  3. Copy traefik certs
  4. Update AWS LB Target Group manually
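
A sketch of steps 1-2 in swarm terms (node names are placeholders; the cert copy and the Target Group update remain manual):

```bash
# Promote the new nodes and attach the required placement labels
docker node promote manager-2 manager-3
for n in manager-2 manager-3; do
  docker node update \
    --label-add manager=true --label-add ops=true --label-add traefik=true \
    --label-add dasksidecar=true --label-add dynamicsidecar=true "$n"
done

# Run traefik on all three managers
docker service scale simcore_traefik=3
```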

Observations

  • There are ~1.5 minutes of Bad Gateway because of the LB healthcheck mechanism
  • We lose the Appmotion stack, Prometheus, Graylog, Portainer, Rabbit, Redis, Adminer --> no node labels
  • The webserver hangs because of rabbit --> the platform is not operable
  • The platform didn't recover automatically because rabbit/redis were down

Recovering

  • Manually add the rabbit, redis, appmotiondb labels to one of the surviving nodes (see the sketch below)
  • Appmotion DB manual data migration?
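
The label fix-up is a single command per node (the node name is a placeholder):

```bash
# Re-home the stateful services by moving their labels to a surviving node
docker node update \
  --label-add rabbit=true --label-add redis=true --label-add appmotiondb=true \
  surviving-manager-1
```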

Conclusions

  • The platform will not recover automatically if a manager with the rabbit / redis / appmotiondb node labels goes down
  • The AWS LB needs ~1.5 min to realize that a node is broken, so we get Bad Gateway for 1.5 minutes (I don't know why no route redistribution happens)
  • The platform becomes operable again after adding the rabbit/redis labels to some other node.

@YuryHrytsuk
Collaborator Author

YuryHrytsuk commented Feb 2, 2024

Things to consider to increase platform robustness and recoverability

  1. Should the simcore node have only the simcore label and run only services that can easily move to another node?
    • e.g. no rabbit / redis data-dependent services shall run on it
    • how sensitive are rabbit / redis to being restarted on another node?
  2. We need to run services with >1 replicas, otherwise node HA may not help in case of node disruptions (node goes down)
    • e.g. director-v2 / api-server will be migrating from a broken node to a working node and this will cause downtime
  3. The AWS LB healthcheck mechanism takes >1.5 min to consider a node unhealthy. Isn't that too long?
    • Reconsider the AWS LB Target Group healthcheck settings (see the sketch after this list)
    • The docker swarm dispatcher_heartbeat_period is set to 60s, so a healthcheck on the order of 90s might be fine
  4. We should have at least 2 nodes available for every service we run. If one node goes down, the 2nd one should be able to accommodate all the services.
    • For example, 1 of 2 simcore nodes goes down. Will the 2nd node have enough resources to accommodate all simcore services?
  5. When a node goes down, the other node (which is there for HA) must have all the docker node labels required by the placement constraints of the services. At the moment this cannot happen, since some [data-dependent] services have to be bound to a single node (e.g. Portainer)
  6. Appmotion DB is bound to a node
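
For item 3, the Target Group healthcheck can be tightened with the aws CLI; a sketch with a placeholder ARN and values that should be weighed against the 60s dispatcher_heartbeat_period mentioned above:

```bash
# Fail over faster: 10s interval, 2 consecutive failures => ~20s to unhealthy
aws elbv2 modify-target-group \
  --target-group-arn <target-group-arn> \
  --health-check-interval-seconds 10 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 2
```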

@YuryHrytsuk
Collaborator Author

YuryHrytsuk commented Feb 2, 2024

EC2 machines Revision

NIH PROD:

  • Manager 1,2,3 --> Overprovisioned
  • OPS 1,2 --> Overprovisioned
  • Simcore 1,2 --> Overprovisioned
  • GPU --> No recommendations
  • CPU --> OK

NIH STAG:

ZMT:

  • Enable AWS Compute Optimizer [action required] --> Done

fyi: @matusdrobuliak66

@YuryHrytsuk
Collaborator Author

AWS Snapshot [postgres] restore:

  • it works
  • while restoring, you need to provide all the settings of the NEW RDS instance that you want to restore the snapshot to
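
For reference, a sketch of the restore call; every identifier below is a placeholder, and the instance class, subnet group and security groups must be re-specified because they are not carried inside the snapshot:

```bash
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier simcore-pg-restored \
  --db-snapshot-identifier <snapshot-id> \
  --db-instance-class <instance-class> \
  --db-subnet-group-name <subnet-group> \
  --vpc-security-group-ids <security-group-id>
```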
