Parsl slurm executor allocation: Starts a GPU node before it is needed #94
Comments
Also, even more problematic: the current Parsl setup always allocates a GPU node for the full run duration, even for workflows that don't require one. For example, I'm running the FMI 2 well test set at the moment. As soon as it starts, a CPU and a GPU node are allocated. Once the zarr file is initialized, it starts a second CPU node to process the second well in parallel. So Parsl does scale nodes up only when necessary, but it always seems to keep at least one node of each type running, which is quite problematic for our use cases. (Node 1 is a CPU node, node 37 a GPU node.)
I partly fixed this issue with 6253476 (and ca18b6e), but we may want to look at it a bit further (or at least wait until we have used the new version a few times, to see if anything weird happens). The fix uses Parsl's automatic scaling strategy in the Config, with an idle timeout of 60 seconds. This means that both jobs are submitted at the beginning of the workflow, but the GPU one should disappear after 60 seconds and only be re-activated when needed. This is currently being tested.
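For context, a minimal sketch of what such a Config could look like (the partition names, block counts, and the `--gres` option are placeholders, not the exact values in the commits above):

```python
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import SlurmProvider

config = Config(
    executors=[
        HighThroughputExecutor(
            label="cpu",
            provider=SlurmProvider(
                partition="main",   # placeholder partition name
                init_blocks=1,
                min_blocks=0,       # allow scaling down to zero CPU nodes
                max_blocks=4,
            ),
        ),
        HighThroughputExecutor(
            label="gpu",
            provider=SlurmProvider(
                partition="gpu",    # placeholder partition name
                scheduler_options="#SBATCH --gres=gpu:1",
                init_blocks=1,
                min_blocks=0,       # idle GPU blocks can be released entirely
                max_blocks=1,
            ),
        ),
    ],
    strategy="simple",   # Parsl's automatic scaling strategy
    max_idletime=60,     # release blocks that have been idle for 60 seconds
)
```

With `min_blocks=0` the GPU block can be released completely while no GPU tasks are queued, which should give the behaviour described above (GPU job submitted at startup, cancelled after the idle timeout, re-submitted when a GPU task arrives).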
In this case, what would happen if no GPU node is available at the start of the submission? We could test this by defining a new executor with specifications that aren't available on the cluster (e.g. 1 TB of RAM).
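One way to set up such a test (a sketch only; the executor label, partition name, and `--mem` directive are illustrative):

```python
from parsl.executors import HighThroughputExecutor
from parsl.providers import SlurmProvider

# Hypothetical executor whose resource request (1 TB of RAM) cannot be satisfied
# on the cluster, to check that a pending, never-granted block does not stall
# the rest of the workflow.
unavailable_executor = HighThroughputExecutor(
    label="huge_mem",
    provider=SlurmProvider(
        partition="main",                         # placeholder partition name
        scheduler_options="#SBATCH --mem=1000G",  # deliberately unsatisfiable request
        init_blocks=1,
        min_blocks=0,
        max_blocks=1,
    ),
)
```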
I think we can close this for now. Parsl opens the GPU executor, but closes it after 1 min as @tcompa described. And it doesn't interfere with a pure CPU pipeline if it's run while no GPU nodes are available (the GPU node is requested for a minute or so, but the request is cancelled if it can't be satisfied).
When running a Parsl workflow that contains tasks that need to run on the GPU as well as tasks that only need the CPU, Parsl allocates both node types from the beginning, even though the GPU node isn't needed for a while.
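For reference, a minimal sketch of how tasks can be pinned to separate executors via app labels (the labels `cpu` and `gpu` and the function bodies are illustrative, not the actual fractal task definitions):

```python
from parsl import python_app

# Executor labels "cpu" and "gpu" are placeholders; they must match the labels
# given to the executors in the Parsl Config.
@python_app(executors=["cpu"])
def illumination_correction(component):
    # CPU-only preprocessing step (illustrative body)
    return f"corrected {component}"

@python_app(executors=["gpu"])
def image_labeling(component):
    # GPU-based labeling step (illustrative body)
    return f"labeled {component}"
```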
I'm testing the uzh_9x8 example. It runs on a single CPU node for all the Zarr parsing, illumination correction & MIP. The labeling task should run on the GPU.
It already allocates both nodes at the start of processing.

Given the limited availability of GPU nodes on the Pelkmans lab cluster (and the generally much higher per-hour cost of GPU nodes compared to CPU nodes), we should make sure to fix this, especially if we want to run workflows that use multiple GPU nodes.
(Just making sure this issue is recorded so we don't forget it. Let's make the labeling (#64) work first, of course, and worry about this afterwards.)