
Parsl slurm executor allocation: Starts a GPU node before it is needed #94


Closed
jluethi opened this issue Jul 5, 2022 · 4 comments

jluethi commented Jul 5, 2022

When running a Parsl workflow that contains both GPU and CPU tasks, Parsl allocates both node types from the beginning, even though the GPU node isn't needed for a while.

I'm testing the uzh_9x8 example. It runs on a single CPU node for all the Zarr parsing, illumination correction & MIP. The labeling task should run on the GPU.

Both nodes are already allocated at the start of processing:
[Screenshot from 2022-07-05 at 10:53 showing both nodes allocated]

Given the limited availability of GPU nodes on the Pelkmans lab cluster (and the generally much higher cost per hour of GPU nodes compared to CPU nodes), we should make sure to fix this, especially if we eventually want to use multiple GPU nodes.

(Just making sure this issue exists so we don't forget it. Let's make the labeling (#64) work first, of course, and worry about this afterwards.)

jluethi added the parsl label Jul 5, 2022

jluethi commented Jul 5, 2022

Also, even more problematic: the current Parsl setup always allocates a GPU node for the full run duration, even for workflows that don't require one.

For example, I'm running the FMI 2-well test set at the moment. As soon as it starts, a CPU and a GPU node are allocated. Once the Zarr file is initialized, a second CPU node is started to process the second well in parallel. So Parsl scales up nodes only when necessary, but it always seems to keep at least one node of each type running, which is quite problematic for our use cases.

[Screenshot from 2022-07-05 at 16:54 showing the allocated nodes]

(Node 1 is a CPU node, node 37 a GPU node)
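
One way this behavior can arise (a minimal sketch, not necessarily the actual Fractal configuration; labels, partitions and block counts are placeholders) is a per-executor SlurmProvider with min_blocks=1, which tells Parsl never to scale an executor below one running block:

```python
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import SlurmProvider

# Hypothetical two-executor setup: with init_blocks=1 and min_blocks=1,
# one CPU block and one GPU block are requested as soon as the workflow
# starts, and neither executor is ever scaled below one block, even if
# no GPU task is ever submitted.
config = Config(
    executors=[
        HighThroughputExecutor(
            label="cpu",
            provider=SlurmProvider(
                partition="main",   # placeholder partition name
                init_blocks=1,
                min_blocks=1,
                max_blocks=2,
            ),
        ),
        HighThroughputExecutor(
            label="gpu",
            provider=SlurmProvider(
                partition="gpu",    # placeholder partition name
                init_blocks=1,
                min_blocks=1,
                max_blocks=1,
                scheduler_options="#SBATCH --gres=gpu:1",
            ),
        ),
    ],
)
```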


tcompa commented Jul 6, 2022

I partly fixed this issue with 6253476 (and ca18b6e), but we may want to look into it a bit further (or at least wait until we have used the new version a few times, to see if anything weird happens).

The fix uses Parsl's automatic scaling strategy, set in Config, with an idle timeout of 60 seconds. Both jobs are still submitted at the beginning of the workflow, but the GPU one should disappear after 60 seconds of idleness and only be re-activated when needed. This is currently being tested.
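
For reference, a minimal sketch of the kind of Config described above, assuming two HighThroughputExecutor/SlurmProvider pairs (labels, partitions and block counts are placeholders; the actual change is in 6253476 and ca18b6e):

```python
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import SlurmProvider

config = Config(
    executors=[
        HighThroughputExecutor(
            label="cpu",
            provider=SlurmProvider(
                partition="main",   # placeholder partition name
                init_blocks=1,
                min_blocks=0,       # allow scaling down to zero blocks
                max_blocks=2,
            ),
        ),
        HighThroughputExecutor(
            label="gpu",
            provider=SlurmProvider(
                partition="gpu",    # placeholder partition name
                init_blocks=1,
                min_blocks=0,       # an idle GPU block can be released
                max_blocks=1,
                scheduler_options="#SBATCH --gres=gpu:1",
            ),
        ),
    ],
    strategy="simple",   # Parsl's automatic scaling strategy
    max_idletime=60.0,   # release a block after 60 seconds of idleness
)
```

With min_blocks=0 and strategy="simple", Parsl cancels a block that has been idle for longer than max_idletime and requests a new one only when matching tasks are queued.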


jluethi commented Jul 6, 2022

In this case, what would happen if no GPU node is available at the start of the submission? We could test this by defining a new executor with specifications that aren't available on the cluster (e.g. 1 TB of RAM).
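
A sketch of what such a deliberately unsatisfiable test executor could look like, assuming SlurmProvider's mem_per_node (in GB) is used to request more memory than any node offers (label and partition are placeholders):

```python
from parsl.executors import HighThroughputExecutor
from parsl.providers import SlurmProvider

# Test executor whose SLURM job can never be scheduled on this cluster:
# requesting ~1 TB of RAM per node should leave the block pending,
# mimicking the "no GPU node available" situation.
unavailable_executor = HighThroughputExecutor(
    label="unavailable-test",
    provider=SlurmProvider(
        partition="main",     # placeholder partition name
        mem_per_node=1000,    # GB per node, far above what the cluster offers
        init_blocks=1,
        min_blocks=0,
        max_blocks=1,
    ),
)
```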

jluethi self-assigned this Jul 12, 2022

jluethi commented Aug 4, 2022

I think we can close this for now. Parsl opens the GPU executor, but closes it after one minute, as @tcompa described. And it doesn't interfere with a pure-CPU pipeline run while no GPU nodes are available (the GPU node is requested for a minute or so, but the request is cancelled if it isn't fulfilled).
Using this setup for the last month hasn't caused any issues, and the resulting GPU node usage is minimal.

jluethi closed this as completed Aug 4, 2022
Repository owner moved this from TODO to Done in Fractal Project Management Aug 4, 2022
jluethi moved this from Done to Done Archive in Fractal Project Management Oct 5, 2022