Deployments field tests before the [big] release #505

Closed
9 tasks done
YuryHrytsuk opened this issue Jan 11, 2024 · 9 comments
Comments

@YuryHrytsuk
Collaborator

YuryHrytsuk commented Jan 11, 2024

Actions

  • simcore traefik --> increase replicas
  • review the sizing of the manager EC2 machines
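
A minimal sketch of the scaling step, assuming traefik runs as a swarm service named simcore_traefik (the real stack/service name may differ):

```bash
# Scale traefik up so it survives the loss of one node
docker service scale simcore_traefik=3

# Verify the replicas actually land on different nodes
docker service ps simcore_traefik --format '{{.Node}}\t{{.CurrentState}}'
```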

Tests

  • critical manager node --> take down and test
  • simcore node --> take down and test
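
Both tests can be run soft or hard; a sketch with placeholder node/instance names:

```bash
# Soft variant: drain the node so swarm reschedules its tasks gracefully
docker node update --availability drain <node-to-kill>

# Watch where its tasks get rescheduled
docker node ps <node-to-kill>

# Hard variant: stop the EC2 instance underneath the node
aws ec2 stop-instances --instance-ids <instance-id>
```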

OPS tests

  • make sure that applying a Postgres backup works
  • make sure that applying an AWS snapshot works
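
A hedged sketch of the Postgres check, assuming the backups are pg_dump custom-format archives (host, user, database and file names are all placeholders):

```bash
# Restore the dump into a scratch database and sanity-check it
createdb -h <host> -U <user> restore_test
pg_restore -h <host> -U <user> -d restore_test /backups/simcore.dump
psql -h <host> -U <user> -d restore_test -c '\dt'   # did the tables come back?
```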

To observe

  • What happens if Rabbit is down for 10 minutes
  • What happens if Postgres is down for 10 minutes
  • What happens if Redis is down for 10 minutes
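
Each outage can be simulated by scaling the service to zero for 10 minutes; e.g. for Rabbit (the service name simcore_rabbit is an assumption):

```bash
# Take rabbit down for 10 minutes, then bring it back
docker service scale simcore_rabbit=0
sleep 600
docker service scale simcore_rabbit=1
```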

Insights

  • director-v2 --> not restartable
  • director-v2 --> one instance
  • storage --> one instance

@mrnicegyu11 @sanderegg @matusdrobuliak66

@YuryHrytsuk YuryHrytsuk added this to the This is Sparta! milestone Jan 15, 2024
@YuryHrytsuk
Collaborator Author

The tests are scheduled for Mon (Jan. 29) from 8:00 AM till 12:00 PM

@matusdrobuliak66
Contributor

Idea:

  • Keep a spare, fully prepared node in the cluster (in drain mode), so that when one node breaks we can swiftly introduce a healthy one
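
A sketch of the spare-node idea with the docker CLI (the node name is a placeholder):

```bash
# Park the spare node in drain mode: it is in the swarm but receives no tasks
docker node update --availability drain spare-node-1

# When another node breaks, activate the spare; swarm reschedules onto it
docker node update --availability active spare-node-1
```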

@YuryHrytsuk
Collaborator Author

YuryHrytsuk commented Jan 29, 2024

Observations

  • When Rabbit is down, we return 404 (the webserver is continuously restarting)
  • When Redis is down, the webpage stays up; we only show the Red Cloud
  • When Postgres is down, we show a black screen

Recovering

  • After Rabbit was down: automatic recovery (no action required)
  • After Redis was down: automatic recovery (no action required)
  • After Postgres was down:
    • The state becomes unclean and requires cleanup (i.e. removing hanging sidecars, restarting storage and director (v2?))
    • So it requires maintenance (a sketch of this cleanup follows below)
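
The maintenance could be scripted as a forced redeploy of the affected services; a sketch assuming stack-prefixed service names and a dy-sidecar naming convention for the hanging sidecars (both are assumptions):

```bash
# Force-restart the services left in an unclean state
docker service update --force simcore_storage
docker service update --force simcore_director-v2

# Remove hanging dynamic sidecar services, if any are left behind
docker service ls --filter name=dy-sidecar --quiet | xargs -r docker service rm
```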

@YuryHrytsuk
Collaborator Author

YuryHrytsuk commented Feb 2, 2024

Field Test: Simcore Node is down

Preparation

  1. Prepare an additional (2nd) Simcore node so that the swarm cluster has 2 Simcore nodes (for HA)
  2. Redeploy the simcore stack to distribute services across the 2 simcore nodes
  3. Stop one of the AWS instances (a simcore node)
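
The three steps map roughly onto the following commands (label key/value, stack file, stack name and instance id are all placeholders):

```bash
# 1. Label the new node so simcore services may be placed on it
docker node update --label-add simcore=true simcore-node-2

# 2. Redeploy the stack so services spread across both simcore nodes
docker stack deploy -c simcore.yml simcore

# 3. Simulate the failure by stopping one of the instances
aws ec2 stop-instances --instance-ids <instance-id>
```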

First field test attempt

  • It didn't go well: many services would not start on the second (field test) simcore node due to resource constraints, even though this node has the same resources
    • While recovering the cluster, it turned out that some services would not start on the usual simcore node either (again resource constraints), which is weird
  • The Simcore node on NIH STAG also runs rediscommander, jaeger and prometheuscadvisor
    • Q: do these services need data migration when moving to another node?

Second field test attempt

  • All went well. There was a short downtime for some services but, as a user, I didn't notice disruptions
    • All services migrated from the broken node to the working one. HA worked well
    • Of course, it all depends on which services are migrating. E.g. if director-v2 or api-server migrate from the "broken" node to the working node, there will be downtime

Conclusions

  • Consider whether running jaeger, rediscommander, ... (services with docker node labels other than simcore) is good practice
    • Once a node is down and these services need to migrate to a new node, we will lose data. How critical is this data?
  • We need >1 replicas for every simcore service to smooth over situations when one of the simcore nodes is down (as sketched below)
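
A sketch of the replica bump for one service; in practice this belongs in the deploy: section of the stack file (the service name is an assumption):

```bash
# Run two replicas so the service survives the loss of one simcore node
docker service update --replicas 2 simcore_webserver
```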

@YuryHrytsuk
Collaborator Author

YuryHrytsuk commented Feb 2, 2024

Field Test: Manager Node is down

Preparation

  1. Add 2 manager nodes with labels manager, ops, traefik, dasksidecar, dynamicsidecar
  2. Increase traefik replicas to 3
  3. Copy traefik certs
  4. Update AWS LB Target Group manually
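
A sketch of steps 1-2 in swarm terms (node names are placeholders; the cert copy and the Target Group update remain manual):

```bash
# Promote the new nodes and attach the required placement labels
docker node promote manager-2 manager-3
for n in manager-2 manager-3; do
  docker node update \
    --label-add manager=true --label-add ops=true --label-add traefik=true \
    --label-add dasksidecar=true --label-add dynamicsidecar=true "$n"
done

# Run traefik on all three managers
docker service scale simcore_traefik=3
```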

Observations

  • There are ~1.5 minutes of Bad Gateway because of the LB healthcheck mechanism
  • We lose the Appmotion stack, Prometheus, Graylog, Portainer, Rabbit, Redis, Adminer --> no node labels
  • The webserver hangs because of rabbit --> the platform is not operable
  • The platform didn't recover automatically because rabbit/redis were down

Recovering

  • Manually add the rabbit, redis, appmotiondb labels to one of the surviving nodes (see the sketch below)
  • Appmotion DB manual data migration?
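
The label fix-up is a single command per node (the node name is a placeholder):

```bash
# Re-home the stateful services by moving their labels to a surviving node
docker node update \
  --label-add rabbit=true --label-add redis=true --label-add appmotiondb=true \
  surviving-manager-1
```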

Conclusions

  • The platform will not recover automatically if a manager with the rabbit / redis / appmotiondb node labels goes down
  • The AWS LB needs ~1.5 min to realize that a node is broken, so we get Bad Gateway for 1.5 minutes (I don't know why no route redistribution happens)
  • The platform becomes operable again after adding the rabbit/redis labels to some other node.

@YuryHrytsuk
Collaborator Author

YuryHrytsuk commented Feb 2, 2024

Things to consider to increase platform robustness and recoverability

  1. Should the simcore node have only the simcore label and run only services that can easily move to another node?
    • e.g. no rabbit / redis data-dependent services shall run on it
    • how sensitive are rabbit / redis to being restarted on another node?
  2. We need to run services with >1 replicas, otherwise node HA may not help in case of node disruptions (node goes down)
    • e.g. director-v2 / api-server will be migrating from a broken node to a working node and this will cause downtime
  3. The AWS LB healthcheck mechanism takes >1.5 min to consider a node unhealthy. Isn't that too long?
    • Reconsider the AWS LB Target Group healthcheck settings (see the sketch after this list)
    • The docker swarm dispatcher_heartbeat_period is set to 60s, so a healthcheck on the order of 90s might be fine
  4. We should have at least 2 nodes available for every service we run. If one node goes down, the 2nd one should be able to accommodate all the services.
    • For example, 1 of 2 simcore nodes goes down. Will the 2nd node have enough resources to accommodate all simcore services?
  5. When a node goes down, the other node (which is there for HA) must have all the docker node labels required by the placement constraints of the services. At the moment this cannot happen, since some [data-dependent] services have to be bound to a single node (e.g. Portainer)
  6. Appmotion DB is bound to a node
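
For item 3, the Target Group healthcheck can be tightened with the aws CLI; a sketch with a placeholder ARN and values that should be weighed against the 60s dispatcher_heartbeat_period mentioned above:

```bash
# Fail over faster: 10s interval, 2 consecutive failures => ~20s to unhealthy
aws elbv2 modify-target-group \
  --target-group-arn <target-group-arn> \
  --health-check-interval-seconds 10 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 2
```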

@YuryHrytsuk
Collaborator Author

YuryHrytsuk commented Feb 2, 2024

EC2 machines Revision

NIH PROD:

  • Manager 1,2,3 --> Overprovisioned
  • OPS 1,2 --> Overprovisioned
  • Simcore 1,2 --> Overprovisioned
  • GPU --> No recommendations
  • CPU --> OK

NIH STAG:

ZMT:

  • Enable AWS Compute Optimizer [action required] --> Done

fyi: @matusdrobuliak66

@YuryHrytsuk
Collaborator Author

AWS Snapshot [postgres] restore:

  • it works
  • while restoring, you need to provide all the settings of the NEW RDS instance that you want to restore the snapshot to
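
For reference, a sketch of the restore call; every identifier below is a placeholder, and the instance class, subnet group and security groups must be re-specified because they are not carried inside the snapshot:

```bash
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier simcore-pg-restored \
  --db-snapshot-identifier <snapshot-id> \
  --db-instance-class <instance-class> \
  --db-subnet-group-name <subnet-group> \
  --vpc-security-group-ids <security-group-id>
```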
