
Parsl slurm executor allocation: Starts a GPU node before it is needed #94


Closed
jluethi opened this issue Jul 5, 2022 · 4 comments

jluethi commented Jul 5, 2022

When running a Parsl workflow that contains both GPU and CPU tasks, Parsl allocates both node types from the beginning, even though the GPU node isn't needed for a while.

I'm testing the uzh_9x8 example. It runs on a single CPU node for all the Zarr parsing, illumination correction & MIP. The labeling task should run on the GPU.

Both nodes are already allocated at the start of processing:
[Screenshot from 2022-07-05 at 10:53 showing both nodes allocated]

Given the limited availability of GPU nodes on the Pelkmans lab cluster (and the generally much higher cost per hour of GPU nodes compared to CPU nodes), we should make sure to fix this, especially if we eventually want to use multiple GPU nodes.

(Just making sure this issue exists so we don't forget it. Let's make the labeling (#64) work first, of course, and worry about this afterwards.)

jluethi added the parsl label Jul 5, 2022

jluethi commented Jul 5, 2022

Also, even more problematic: the current Parsl setup always allocates a GPU node for the full run duration, even for workflows that don't require one.

For example, I'm running the FMI 2-well test set at the moment. As soon as it starts, a CPU and a GPU node are allocated. Once the Zarr file is initialized, a second CPU node is started to process the second well in parallel. So Parsl scales up nodes only when necessary, but it always seems to keep at least one node of each type running, which is quite problematic for our use cases.

[Screenshot from 2022-07-05 at 16:54 showing the allocated nodes]

(Node 1 is a CPU node, node 37 a GPU node)
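
One way this behavior can arise (a minimal sketch, not necessarily the actual Fractal configuration; labels, partitions and block counts are placeholders) is a per-executor SlurmProvider with min_blocks=1, which tells Parsl never to scale an executor below one running block:

```python
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import SlurmProvider

# Hypothetical two-executor setup: with init_blocks=1 and min_blocks=1,
# one CPU block and one GPU block are requested as soon as the workflow
# starts, and neither executor is ever scaled below one block, even if
# no GPU task is ever submitted.
config = Config(
    executors=[
        HighThroughputExecutor(
            label="cpu",
            provider=SlurmProvider(
                partition="main",   # placeholder partition name
                init_blocks=1,
                min_blocks=1,
                max_blocks=2,
            ),
        ),
        HighThroughputExecutor(
            label="gpu",
            provider=SlurmProvider(
                partition="gpu",    # placeholder partition name
                init_blocks=1,
                min_blocks=1,
                max_blocks=1,
                scheduler_options="#SBATCH --gres=gpu:1",
            ),
        ),
    ],
)
```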


tcompa commented Jul 6, 2022

I partly fixed this issue with 6253476 (and ca18b6e), but we may want to look into it a bit further (or at least wait until we have used the new version a few times, to see if anything weird happens).

The fix uses Parsl's automatic scaling strategy, set in Config, with an idle timeout of 60 seconds. Both jobs are still submitted at the beginning of the workflow, but the GPU one should disappear after 60 seconds of idleness and only be re-activated when needed. This is currently being tested.
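
For reference, a minimal sketch of the kind of Config described above, assuming two HighThroughputExecutor/SlurmProvider pairs (labels, partitions and block counts are placeholders; the actual change is in 6253476 and ca18b6e):

```python
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import SlurmProvider

config = Config(
    executors=[
        HighThroughputExecutor(
            label="cpu",
            provider=SlurmProvider(
                partition="main",   # placeholder partition name
                init_blocks=1,
                min_blocks=0,       # allow scaling down to zero blocks
                max_blocks=2,
            ),
        ),
        HighThroughputExecutor(
            label="gpu",
            provider=SlurmProvider(
                partition="gpu",    # placeholder partition name
                init_blocks=1,
                min_blocks=0,       # an idle GPU block can be released
                max_blocks=1,
                scheduler_options="#SBATCH --gres=gpu:1",
            ),
        ),
    ],
    strategy="simple",   # Parsl's automatic scaling strategy
    max_idletime=60.0,   # release a block after 60 seconds of idleness
)
```

With min_blocks=0 and strategy="simple", Parsl cancels a block that has been idle for longer than max_idletime and requests a new one only when matching tasks are queued.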


jluethi commented Jul 6, 2022

In this case, what would happen if no GPU node is available at the start of the submission? We could test this by defining a new executor with specifications that aren't available on the cluster (e.g. 1 TB of RAM).
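
A sketch of what such a deliberately unsatisfiable test executor could look like, assuming SlurmProvider's mem_per_node (in GB) is used to request more memory than any node offers (label and partition are placeholders):

```python
from parsl.executors import HighThroughputExecutor
from parsl.providers import SlurmProvider

# Test executor whose SLURM job can never be scheduled on this cluster:
# requesting ~1 TB of RAM per node should leave the block pending,
# mimicking the "no GPU node available" situation.
unavailable_executor = HighThroughputExecutor(
    label="unavailable-test",
    provider=SlurmProvider(
        partition="main",     # placeholder partition name
        mem_per_node=1000,    # GB per node, far above what the cluster offers
        init_blocks=1,
        min_blocks=0,
        max_blocks=1,
    ),
)
```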

jluethi self-assigned this Jul 12, 2022

jluethi commented Aug 4, 2022

I think we can close this for now. Parsl opens the GPU executor, but closes it after one minute, as @tcompa described. And it doesn't interfere with a pure-CPU pipeline run while no GPU nodes are available (the GPU node is requested for a minute or so, but the request is cancelled if it isn't fulfilled).
Using this setup for the last month hasn't caused any issues, and the resulting GPU node usage is minimal.

jluethi closed this as completed Aug 4, 2022
Repository owner moved this from TODO to Done in Fractal Project Management Aug 4, 2022
jluethi moved this from Done to Done Archive in Fractal Project Management Oct 5, 2022