sim4life.io - WP4: Computational backend #950

Open
mguidon opened this issue May 9, 2023 · 12 comments
Assignees
Labels
PO issue Created by Product owners s4l:web sim4life product in osparc.io

Comments

@mguidon
Member

mguidon commented May 9, 2023

Description

With the latest version of sim4life.io, we are introducing an improved computational backend that provides reliable and efficient job scheduling. Moving forward, all solver jobs will be scheduled through it, letting users choose the hardware on which their jobs run and inspect and operate on the job queue (subject to sufficient permissions).

This robust backend will be capable of handling hundreds of concurrent jobs, so that even the busiest periods do not disrupt the service.

Furthermore, the backend functionality will also be made available through the API, allowing for integration with external systems (e.g. the sim4life desktop application) and further expanding the possibilities for users.

## Tasks
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/4643
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/4530
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/4525
- [ ] https://github.com/ITISFoundation/osparc-issues/issues/921
- [ ] https://github.com/ITISFoundation/osparc-issues/issues/982
- [ ] https://github.com/ITISFoundation/osparc-issues/issues/617
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/3999
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5094
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/4524
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5073
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5074
- [ ] https://github.com/ITISFoundation/osparc-issues/issues/1196
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5000
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5293
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5436
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/4880
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5336
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/4526
### Enchanted Odyssey
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5493
### Schoggilebe
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5497
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5437
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5294
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5339
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5290
- [ ] https://github.com/ITISFoundation/osparc-issues/issues/1277
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5403
### This is Sparta!
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5218
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/4727
- [ ] https://github.com/ITISFoundation/osparc-issues/issues/1218
- [ ] https://github.com/ITISFoundation/osparc-issues/issues/1219
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5251
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5203
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5237
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5261
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5264
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5149
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5252
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5287
### Kobayashi Maru
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5087
- [ ] https://github.com/ITISFoundation/osparc-issues/issues/1181
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5071
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5024
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5101
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5129
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5146
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5108
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5120
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5141
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5147
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5155
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5162
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5164
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5163
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5165
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5167
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5195
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5204
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5201
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5076
### 7Peaks
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/4159
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/4958
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/4781
- [ ] https://github.com/ITISFoundation/osparc-issues/issues/621
- [x] Preferences: add preferences for max number of concurrent jobs
- [ ] https://github.com/ITISFoundation/osparc-issues/issues/1180
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/4999
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/4975
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5008
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5010
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5013
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5042
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5054
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5018
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5025
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5031
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5026
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5032
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5066
### Microhistory
- [ ] https://github.com/ITISFoundation/osparc-issues/issues/1034
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/4915
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/4930
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/3209
### Quilmes
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/4517
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/4756
- [ ] https://github.com/ITISFoundation/osparc-issues/issues/1126
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/4376
### Sundae
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/4429
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/4153
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/4523
### Baklava
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/4637
@mguidon mguidon added PO issue Created by Product owners s4l:web sim4life product in osparc.io labels May 9, 2023
@mguidon mguidon mentioned this issue May 9, 2023
@mguidon mguidon changed the title sim4life.io - Computational backend sim4life.io - WP4: Computational backend May 9, 2023
@pcrespov pcrespov added this to the Pastel de Nata milestone May 12, 2023
@sanderegg
Member

sanderegg commented May 14, 2023

Goal for sprint Pastel de Nata

  • progress on AppTeam Std Simulations, ideally run CF use-case
  • refactoring on computational backend, progress on separating PublicAPI calls from webserver load, return solver progress
  • progress on Public API missing entrypoints, and bug fixes
  • if possible progress on personalized resource limits

@mguidon
Member Author

mguidon commented Jul 6, 2023

Update Watermelon

Done:

Ongoing:

  • Robustness improvements/refactoring

@sanderegg
Member

sanderegg commented Aug 9, 2023

Update Sundae

Done:

  • bugfixes #4153
  • connection of computational backend to resource usage tracking service #4523
  • new clusters keeper service to automatically create computational clusters in AWS #4591

Ongoing:

@sanderegg
Member

sanderegg commented Sep 6, 2023

The diagram below shows the overall architecture of the on-demand clusters.
Some important points here are:

  • the computational clusters are created per user/wallet
  • in case of maintenance in simcore, these clusters shall be able to continue running independently

Image

@sanderegg sanderegg modified the milestones: Baklava, the nameless Sep 18, 2023
@sanderegg
Member

sanderegg commented Oct 31, 2023

Update Microhistory

Done and working

  • A separate cluster is created on demand in AWS for each user/wallet combination,
  • Each cluster has a primary machine (t2.micro) on which a stack containing the dask-scheduler, autoscaling, redis and dask-sidecar services is started; dask-sidecar only runs on worker machines,
  • the autoscaling service creates 1 worker machine (g4dn.xlarge),
  • Only computational services that use a pricing unit defined as a g4dn.xlarge machine can run,
  • a computational service uses all the resources provided by the machine (a bit less than 16 GB / 4 CPUs)

--> Running a computational service should work for one service at a time, provided it is set up to use the g4dn.xlarge machine type. There is no upscaling of machines, so parallel jobs have to wait in line (if multiple isolve jobs are sent, they are executed one after the other).
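
As a rough illustration of the behaviour described above, here is a minimal, self-contained Python sketch. It is not the osparc-simcore implementation: `Cluster`, `ClusterRegistry`, and their methods are hypothetical names, and the real system uses dask and AWS rather than an in-memory queue. It only models three points from this update: one cluster per user/wallet pair, a single worker of a fixed machine type, and jobs executed one after the other.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List, Tuple


@dataclass
class Cluster:
    """Hypothetical stand-in for one on-demand cluster with a single worker."""

    instance_type: str = "g4dn.xlarge"
    queue: Deque[str] = field(default_factory=deque)

    def submit(self, job_name: str, required_instance_type: str) -> bool:
        # Only jobs whose pricing unit maps to the worker's machine type are accepted.
        if required_instance_type != self.instance_type:
            return False
        self.queue.append(job_name)
        return True

    def run_all(self) -> List[str]:
        # One worker and no upscaling: jobs finish strictly in submission order.
        finished = []
        while self.queue:
            finished.append(self.queue.popleft())
        return finished


class ClusterRegistry:
    """Keeps one cluster per (user_id, wallet_id) pair, created on first use."""

    def __init__(self) -> None:
        self._clusters: Dict[Tuple[int, int], Cluster] = {}

    def get_or_create(self, user_id: int, wallet_id: int) -> Cluster:
        key = (user_id, wallet_id)
        if key not in self._clusters:
            self._clusters[key] = Cluster()
        return self._clusters[key]
```

In this toy model, two isolve jobs submitted to the same user/wallet cluster are returned by `run_all()` in the order they were submitted, and a job that asks for a different machine type is rejected rather than queued.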

should work in 3 weeks

  • Cluster shall create correct machine based on plan (not just g4dn.xlarge), so potentially better machine fit/performance,
  • identify computational child jobs (for example started from s4l) and show them in UI
  • maybe upscaling of separate cluster (needs discussions on how to do it, it has influence on costs, etc)

should not be available in 3 weeks

  • upscaling?
  • optimisations

@sanderegg
Member

sanderegg commented Nov 28, 2023

Update 7peaks

Summary

It is now possible to run computational services on their required AWS instance types. In addition, the logs of child computational jobs now show up in the logs of the parent service (e.g. sim4life/jupyterlab starting a computational job).
Upscaling is still not implemented.

Done

Ongoing

  • bugfixing
  • improvements on user feedback (cluster status, number of machines, etc...)

@sanderegg
Member

sanderegg commented Jan 7, 2024

Update Kobayashi Maru

Summary

  • bugfixes:
    • handling of on-demand computational clusters (timeouts, reported states)
    • concurrent computing of tasks
  • monitoring & manual interventions:
    • CLI tool to monitor on-demand computational clusters and dynamic service machines
    • partially clear jobs in a specific cluster
    • allow tracing of created machines via tags on EC2 instances
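
To illustrate the tag-based tracing mentioned in the last point, the sketch below builds a tag specification in the list-of-`Key`/`Value` shape that the EC2 API accepts when launching instances (for example via the `TagSpecifications` parameter of boto3's `run_instances`). The function name and the tag keys (`user_id`, `wallet_id`, `role`) are assumptions made for illustration; the thread does not say which tags are actually set.

```python
from typing import Dict, List


def build_tag_specifications(user_id: int, wallet_id: int, role: str) -> List[Dict]:
    """Return EC2-style tag specifications for one instance.

    The tag keys used here are hypothetical examples, not the real ones.
    """
    tags = {
        "user_id": str(user_id),
        "wallet_id": str(wallet_id),
        "role": role,  # e.g. "primary" or "worker" (assumed values)
    }
    return [
        {
            "ResourceType": "instance",
            # EC2 expects each tag as a {"Key": ..., "Value": ...} mapping.
            "Tags": [{"Key": k, "Value": v} for k, v in sorted(tags.items())],
        }
    ]
```

With tags like these, a monitoring tool can filter instances by user or wallet to find out which machines belong to which cluster.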

Done ✅

  • various fixes for GPU-based computational services on multi-GPU machines
  • migration of sleepers test to Playwright framework to have more reliable and more flexible E2E testing and compatibility with on-demand computational clusters
  • various fixes regarding invalid state reported by the computational clusters
  • added a timeout for clusters that do not respond for more than 10 minutes
  • improvement of response time when retrieving the computational clusters state via Public API
  • new CLI-based monitoring tool to check current state of auto-scaled EC2 instances and their running states
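
The 10-minute timeout for non-responding clusters can be pictured with the small sketch below. It assumes the backend records the time of the last successful response from each cluster; the function and constant names are hypothetical and the actual check in osparc-simcore may look quite different.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Threshold taken from the update above; the constant name is an assumption.
CLUSTER_RESPONSE_TIMEOUT = timedelta(minutes=10)


def is_cluster_unresponsive(
    last_response: datetime, now: Optional[datetime] = None
) -> bool:
    """Return True if the cluster has been silent for more than the timeout."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_response > CLUSTER_RESPONSE_TIMEOUT
```

Passing `now` explicitly keeps the check deterministic in tests; in normal use it defaults to the current UTC time.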

Problematic issues (being worked on) 🚧

Open Features 🚧

@sanderegg sanderegg modified the milestones: Kobayashi Maru, This is Sparta! Jan 11, 2024
@sanderegg
Member

sanderegg commented Jan 30, 2024

@bisgaard-itis bisgaard-itis removed this from the This is Sparta! milestone Feb 19, 2024
9 participants