
Task performances within/without parsl #57


Closed · tcompa opened this issue May 26, 2022 · 31 comments


tcompa commented May 26, 2022

This is a discussion issue to track our evolving understanding of what's going on within parsl, and of how it manages the resources assigned to each task. More on this later.


jluethi commented May 26, 2022

As we discussed, I think this will be especially interesting when scaling to larger workloads, e.g. the 23-well dataset I've been using to look at scaling. Input data for this is here: /data/active/jluethi/20200810-CardiomyocyteDifferentiation14/Cycle1/images_renamed
Each well has 9x8 sites, in case you want to run it for some scaling tests as well; it took about 4-5 hours on 23 nodes of the Pelkmans lab cluster using the Luigi scheduler.


jluethi commented May 27, 2022

I ran the 23-well dataset and it went through in a similar timeframe with the parsl scheduler.

I used the following parameters in parsl config:

max_workers = 2
nodes_per_block = 1  # This implies that a block corresponds to a node
max_blocks = 12  # Maximum number of blocks (=nodes) that parsl can use
cores_per_node = 8
mem_per_node_GB = 60

Now, what it actually uses resource-wise was a bit confusing:
[screenshot: Bildschirmfoto 2022-05-26 um 20 58 02]

I would expect it to need 12 nodes, running 2 workers per node.
It does create 12 jobs, but the resource allocation isn't intuitive to me:

  1. Even though cores_per_node = 8, it uses 16 or 32 cores on each node (the maximum available on that node) => is cores_per_node not actually being enforced?
  2. It uses the 60 GB of memory I specified on each node, so it looks like that parameter is working as expected.
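
For reference, the parameters above map onto a full parsl configuration roughly as follows. This is a minimal, untested sketch; the partition name and walltime are placeholders, and only the parameter names of HighThroughputExecutor and SlurmProvider are taken from parsl itself.

from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import SlurmProvider

config = Config(
    executors=[
        HighThroughputExecutor(
            label="slurm_htex",
            max_workers=2,                # workers per block
            provider=SlurmProvider(
                partition="main",         # placeholder partition name
                nodes_per_block=1,        # a block corresponds to a node
                max_blocks=12,            # maximum number of blocks (=nodes)
                cores_per_node=8,
                mem_per_node=60,          # in GB
                walltime="10:00:00",      # placeholder
            ),
        )
    ]
)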


tcompa commented May 30, 2022

Concerning your point 1 (cores_per_node not being enforced), this should be due to the True default on this argument from SlurmProvider:

exclusive (bool (Default = True)) – Requests nodes which are not shared with other running jobs.

In my experience so far, a single-well yokogawa_to_zarr task rarely exceeds 500-600% CPU, so I think one could set

max_workers = 1
cores_per_node = 8
exclusive = False

I've just added the exclusive=False argument in commit 4653b8c, but haven't tested it yet.

The expected behavior in this case is that parsl submits 23 jobs, each one requiring 8 cores.


jluethi commented May 30, 2022

Hey @tcompa, sounds promising. Are you running tests for that?
And in that case: would there be a difference between running max_workers = 1 with cores_per_node=8 and max_workers = 4 with cores_per_node=32, in terms of how the 32-core nodes get used? (given we scale memory the same way)

Could be a nice way to tune the number of SLURM jobs being created, depending on how we want to use a cluster :)


tcompa commented May 30, 2022

I agree: in principle your two options should be equivalent. On our current cluster (with mixed 16- and 32-core machines) there could be some issues (if the large machines are busy), but in general I agree.

I'm not running tests yet, as I first want to see one of the "big" runs reaching the end. Also, I'm observing some non-ideal behavior in the monitoring (very low CPU usage) that I want to clarify, and that IMHO has higher priority than parallelization. More on this later.

(btw: perhaps a smaller dataset would be useful for testing.. 23 wells is fine, but maybe with fewer sites per well?)


jluethi commented May 30, 2022

I agree! I've been thinking about creating a smaller subset for these tests. We have the 4-well, 2x2-site test set already.

I could create a subset that is 10 wells, 5x5 sites? Or what size would you aim for as an intermediate test?

Also, my big run with 23 wells & 9 pyramid levels successfully finished on Friday and demos very nicely :)


tcompa commented May 31, 2022

We are already seeing that the parallelization over wells basically works (up to some details..), so there's no need for many wells (~10 seems fine).

As for the number of sites: we already had a test with wells of 6x6 sites, but I suspect that something was different there, because it took quite a short time (<10 minutes) compared to the wells of 9x8 sites (2-3 hours for the yokogawa_to_zarr part). So 5x5 sites seems OK.

By the way: is there a clear reason for such a speed difference? The number of sites only doubles (36 -> 72), but the time it takes grows significantly more. This calls for yet another test dataset: could we have one with just one large well of 9x8 sites? This way we can compare it directly with the 6x6 single-well case.

Briefly, it would be great to have two new datasets:

  • 10 wells of 5x5 sites.
  • 1 well of 9x8 sites.


tcompa commented May 31, 2022

Concerning your point 1 (cores_per_node not being enforced), this should be due to the True default on this argument from SlurmProvider:

exclusive (bool (Default = True)) – Requests nodes which are not shared with other running jobs.

In my experience so far, a single-well yokogawa_to_zarr task rarely exceeds 500-600% CPU, so I think one could set

max_workers = 1
cores_per_node = 8
exclusive = False

I've just added the exclusive=False argument in commit 4653b8c, but haven't tested it yet.

The expected behavior in this case is that parsl submits 23 jobs, each one requiring 8 cores.

Quick note: what I wrote in that comment was wrong.
The max_workers parameter refers to a block, while cores_per_node refers to a node. If we have more than one block active on a node (note that this is not forbidden by the nodes_per_block=1 argument, which only sets the maximum number of nodes per block, not the maximum number of blocks per node), and we set cores_per_node=8, then this limits the number of cores that can be used by all workers on that node combined (not the number of cores that can be used by each worker).
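
If the goal is a budget per worker rather than a cap per node, one option (untested here; the values are illustrative) is the executor-level cores_per_worker parameter, which parsl uses to decide how many workers fit on a node:

from parsl.executors import HighThroughputExecutor
from parsl.providers import SlurmProvider

executor = HighThroughputExecutor(
    max_workers=1,           # at most one worker per block
    cores_per_worker=8,      # parsl budgets 8 cores when placing each worker
    provider=SlurmProvider(
        nodes_per_block=1,
        cores_per_node=8,    # cores requested from SLURM for each block
        mem_per_node=60,     # in GB
        exclusive=False,     # do not request exclusive nodes from SLURM
    ),
)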


jluethi commented May 31, 2022

Great, I will create these test cases.
On the cores_per_node question: The explanation makes sense, so with exclusive=False, that should then mean that with cores_per_node=8, we use 8 cores and split them up between the number of workers (blocks), right?

If I submit something like cores_per_node=16, memory to 60GB, would it run 2 of those jobs on a single 32 core node? (would be neat if it works that way)


tcompa commented May 31, 2022

On the cores_per_node question: The explanation makes sense, so with exclusive=False, that should then mean that with cores_per_node=8, we use 8 cores and split them up between the number of workers (blocks), right?

If I submit something like cores_per_node=16, memory to 60GB, would it run 2 of those jobs on a single 32 core node? (would be neat if it works that way)

These are my expectations, but only tests will tell ;)


jluethi commented Jun 13, 2022

@tcompa I created the two new test sets we wanted, to better test Parsl behavior & resource scaling:

The single-well, 9x8 sites test set: /data/active/fractal/3D/PelkmansLab/CardiacMultiplexing/Cycle1_9x8_singleWell
The 10-well, 5x5 site test set: /data/active/fractal/3D/PelkmansLab/CardiacMultiplexing/Cycle1_5x5_10wells

Let's run these tests and check:

  1. Can a single well with 9x8 vs. 5x5 sites be processed with similar amounts of memory usage (full data never should be in memory at any given point in time, so should be possible, no?)
  2. How much CPU is used for the processing and how does it differ between 9x8 & 5x5?
  3. How well does the 10-well test case parallelize? Can we use nodes optimally to process that?


tcompa commented Jun 13, 2022

Thanks for the new datasets!

I'm running a first bunch of tests with this fractal_config.py

# Parameters of parsl.executors.HighThroughputExecutor
max_workers = 1  # This is the maximum number of workers per block
# Parameters of parsl.providers.SlurmProvider
nodes_per_block = 1  # This implies that a block corresponds to a node
max_blocks = 15  # Maximum number of blocks (=nodes) that parsl can use
exclusive = False
cores_per_node = 16
mem_per_node_GB = 60
[...]

and with cores_per_worker=8 in HighThroughputExecutor. Note that the latter is not yet present on GitHub, but I think it doesn't play a role at the moment (it is probably just a nicer way to set the maximum number of tasks per node, in a way that works for nodes with either 16 or 32 CPUs.. soon to be verified). In any case, it is a parameter for parsl's heuristic and doesn't enter the SLURM configuration.

Tests include the yokogawa_to_zarr + maximum_intensity_projection tasks (with coarsening_xy=3, coarsening_z=1, num_levels=5). Any of us can look at the monitoring via

parsl-visualize -d sqlite:////data/active/fractal/Monitoring/20220613_tests_uzh_data/monitoring.db
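
(For reference: a monitoring database like this one can be produced by adding parsl's MonitoringHub to the configuration. The sketch below is untested and the sqlite path is a placeholder; parsl-visualize then reads the resulting file.)

from parsl.addresses import address_by_hostname
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.monitoring.monitoring import MonitoringHub

config = Config(
    executors=[HighThroughputExecutor(label="htex")],  # provider as in fractal_config.py
    monitoring=MonitoringHub(
        hub_address=address_by_hostname(),
        resource_monitoring_interval=10,  # seconds between resource samples
        logging_endpoint="sqlite:///monitoring.db",  # placeholder path
    ),
)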

I'm reporting here the total duration of the workflows (as shown on the monitoring homepage) and the details of the first yokogawa_to_zarr task (which is often also the only one, apart from the 10-well case). Here we go:

GLOBAL DURATION:

[screenshot: Screenshot from 2022-06-13 14-53-10]

FIRST/ONLY YOKOGAWA_TO_ZARR:

[screenshot: Screenshot from 2022-06-13 14-36-01]
[screenshot: Screenshot from 2022-06-13 14-36-08]
[screenshot: Screenshot from 2022-06-13 14-36-04]
[screenshot: Screenshot from 2022-06-13 14-36-06]

Partial answer to @jluethi's questions:

  1. Can a single well with 9x8 vs. 5x5 sites be processed with similar amounts of memory usage (full data never should be in memory at any given point in time, so should be possible, no?)
    In these tests, yes!
  2. How much CPU is used for the processing and how does it differ between 9x8 & 5x5?
    Not so clear in these tests. The 5x5 seems a bit more optimized (that is, it has a larger CPU usage). To be understood!
  3. How well does the 10-well test case parallelize? Can we use nodes optimally to process that?
    It does scale as expected across several nodes: one task (= one well) per 16 cores, for a total of five 32-CPU nodes. But there is some bad performance in the multithreading of each task. Many of them take more than 10 minutes for yokogawa_to_zarr, while the entire workflow with the larger 6x6 wells takes less than 5 minutes! And this goes together with very mild CPU usage, up to a certain point where it rapidly increases. Something very similar (actually worse) happens for the MIP task.
    What is happening in these (long) transients? What is preventing the tasks from using more resources? I/O bottlenecks? Concurrent execution of dask tasks? OPENBLAS_NUM_THREADS=1? To be understood! (See the sketch at the end of this comment for one way to pin the thread counts.)

For now I'll keep to this (rough) level of detail, but over the next few days I can try to make some more quantitative summaries (also including the illumination-correction task). The single-well tests are quite convincing, but something is definitely odd in the multi-well one.
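
As a side note on the thread-count question above: if BLAS/OpenMP oversubscription turns out to be part of the problem, thread limits can be exported before each worker starts via the provider's worker_init hook. A hedged sketch (not what was used in these runs; the chosen variables and values are just an example):

from parsl.providers import SlurmProvider

provider = SlurmProvider(
    nodes_per_block=1,
    cores_per_node=16,
    mem_per_node=60,
    exclusive=False,
    worker_init=(
        "export OPENBLAS_NUM_THREADS=1\n"
        "export OMP_NUM_THREADS=1\n"
        "export MKL_NUM_THREADS=1\n"
    ),
)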


tcompa commented Jun 13, 2022

Important information to add to the previous benchmark: the number of Z levels.

Workflow                Number of Z levels
uzh_1_well_2x2_sites    10
uzh_1_well_6x6_sites    19
uzh_1_well_9x8_sites    19
uzh_10_well_5x5_sites   29

tcompa added a commit that referenced this issue Jun 13, 2022

jluethi commented Jun 13, 2022

Memory usage is looking very promising indeed!
The thing that confuses me about those runtimes: why does the 10-well case take 33 minutes (it's minutes, right?), while the 1-well, 6x6 case only takes 4:48? If things are parallelized optimally, the 10-well case should be about as fast as the 6x6 case (because each well only contains 5x5 sites instead of 6x6, and the wells should run in parallel).

Regarding Z planes: uzh_10_well_5x5_sites actually varies in the number of Z planes per well (an interesting test of whether we handle that correctly, see https://github.com/fractal-analytics-platform/mwe_fractal/issues/42). They go up to 42 for some wells. This could explain a bit of the slowdown vs. the 6x6 case, but not all of it.

I created a new subset (just a set of soft-linked files, not full copies) of the 10 well case though that only contains the first 19 Z planes per well, so that we can compare the wells better and don't have confusions in the comparison because of the varying number of Z planes. That decreases the number of files from 24900 to 14250. New test case is available here: /data/active/fractal/3D/PelkmansLab/CardiacMultiplexing/Cycle1_5x5_10wells_constantZ.
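
For reference, a hypothetical sketch of how such a soft-linked subset can be built. It assumes the Z index appears in the file names as "Z01".."Z42" followed by the channel field; the pattern would need to be adjusted to the actual naming scheme.

import glob
import os

SRC = "/data/active/fractal/3D/PelkmansLab/CardiacMultiplexing/Cycle1_5x5_10wells"
DST = "/data/active/fractal/3D/PelkmansLab/CardiacMultiplexing/Cycle1_5x5_10wells_constantZ"

os.makedirs(DST, exist_ok=True)
# Soft-link only the images of the first 19 Z planes
for z in range(1, 20):
    for src_file in glob.glob(os.path.join(SRC, f"*Z{z:02d}C*.png")):
        os.symlink(src_file, os.path.join(DST, os.path.basename(src_file)))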

I wonder whether access to those files is being slowed down because there are so many files in the raw data folder (or whether there are calls to list all available files that cause the slow-down when we scale up the experiment size)


tcompa commented Jun 14, 2022

I wonder whether access to those files is being slowed down because there are so many files in the raw data folder (or whether there are calls to list all available files that cause the slow-down when we scale up the experiment size)

Quick answer about the listing:

$ time python -c 'import glob; files = glob.glob("/data/active/fractal/3D/PelkmansLab/CardiacMultiplexing/Cycle1_5x5_10wells/*.png"); print(f"Number of files: {len(files)}")'
Number of files: 24900

real	0m2.832s
user	0m0.060s
sys	0m0.020s

Within a task, I don't think the listing should be the bottleneck.

tcompa added a commit that referenced this issue Jun 14, 2022

tcompa commented Jun 14, 2022

Another quick comment: with the new 10_5x5 dataset with a homogeneous number of Z planes (thanks @jluethi), the total runtime goes down from 33 to 20 minutes. Still not acceptable (it should be comparable to the 5 minutes of the 6x6 wells, since this is embarrassingly parallel!), but at least it goes in the right direction ;)

Note: take all these numbers with a grain of salt for now. I'm not sure they're fully robust (e.g. they may depend on which nodes are used). More systematic (and possibly somewhat reproducible) tests are planned, also at the single-script level (that is, without parsl managing submissions).


jluethi commented Jun 14, 2022

That is good to hear! So now we know what we want: Figure out how we get it from 20 min to 5 min, because the datasets are comparable :)

Given that the jobs also don't use that much CPU power: is it possible that the whole thing is IO-limited somewhere, i.e. that our storage just doesn't handle parallel read access well enough and there is a bottleneck somewhere? How would we test that?
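
One rough way to get a first answer (a hypothetical micro-benchmark, not a substitute for a proper tool such as IOR or fio): write N files in parallel on the share and compare the per-writer time and aggregate throughput with the single-writer case. The target directory below is an assumption; a full test would also repeat this from several nodes at once.

import os
import time
from concurrent.futures import ProcessPoolExecutor

TARGET_DIR = "/data/active/fractal/tests/io_benchmark"  # assumed scratch location
N_WRITERS = 10        # set to 1 for the baseline run
FILE_SIZE_MB = 500    # per-file size in MB (example value)
CHUNK = b"\0" * (1024 * 1024)  # 1 MiB of zeros

def write_one(i: int) -> float:
    """Write one file of FILE_SIZE_MB and return the elapsed time."""
    path = os.path.join(TARGET_DIR, f"writer_{i}.bin")
    t0 = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(FILE_SIZE_MB):
            f.write(CHUNK)
        f.flush()
        os.fsync(f.fileno())
    return time.perf_counter() - t0

if __name__ == "__main__":
    os.makedirs(TARGET_DIR, exist_ok=True)
    t0 = time.perf_counter()
    with ProcessPoolExecutor(max_workers=N_WRITERS) as pool:
        per_writer = list(pool.map(write_one, range(N_WRITERS)))
    total = time.perf_counter() - t0
    print("per-writer times (s):", [round(t, 1) for t in per_writer])
    print(f"aggregate throughput: {N_WRITERS * FILE_SIZE_MB / total:.1f} MB/s")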


tcompa commented Jun 15, 2022

By mistake, I had left the debugging=True flags in all the tests I ran. Needless to say, this makes the timings unreliable, and possibly also interferes with parallelization.. New runs (without debugging) look much better, but I'll try to understand a few things before reporting here.

That's great news!

More on that probably tomorrow.


tcompa commented Jun 15, 2022

(The 10-well example runs through in 17 minutes.. but there are a few other things I changed at the same time, and I'll need to check them one by one.)


tcompa commented Jun 16, 2022

I was too optimistic about the importance of debugging, too bad.

Still, I found one useful piece of information: running the MIP task on one of our 5x5 wells takes 30 seconds, when it's done directly via SLURM (without parsl). Yes, this is to be compared to roughly 10 minutes when it runs through parsl!

This means that the issue at hand is much better defined than it was earlier: parsl is adding a 20x slow-down for (some) tasks. The reason is still unknown, but I think we are better defining the scope of the problem.


tcompa commented Jun 21, 2022

TL;DR
Nothing new, I'm just logging some reference runs.

Here is the global CPU usage for example_uzh_10_well_5x5_sites.sh at the current version (c49baa7). There are 10 wells with 19 Z planes, each one with 5x5 sites. The workflow includes image parsing, illumination correction and MIP. Memory is under control (peak is 20G of global usage).
We observe the usual annoying behavior (low activity for a long-ish time, and then very rapid execution with high CPU usage).

[screenshot: Screenshot from 2022-06-21 09-51-20]

tcompa changed the title from "How does parsl handle resources?" to "Task performances within/without parsl" on Jun 21, 2022

tcompa commented Jun 21, 2022

Here's another reference run.

Same workflow as in #57 (comment), but for FMI data with 2 wells, 5x4 sites, 84 Z layers, 4 channels. The global issue is still there (a long time with low activity, and then high-CPU-usage peaks), but for this data it seems much less severe (that is, in the low-activity intervals we almost never go below a global 500% CPU usage, with only two tasks). I think that the dip at t≈2200 s is related to a slowdown of all activities on the Pelkmans lab cluster, which does happen from time to time.

Notice the high parallelization reached at some points during the yokogawa_to_zarr task, with each task almost saturating its 1600% CPU limit (I verified this by looking at the single-task monitoring). Also notice that the MIP tasks (starting around time=2900 s) don't show any waiting time: they reach >700% each after a few seconds, and they run in 3 minutes.

Question for our future selves: why does this dataset appear less problematic? Could it be due to the large number of Z planes, or is it a coincidence?

[screenshot: Screenshot from 2022-06-21 11-57-27]


tcompa commented Jun 21, 2022

One last bit of information, again as a reference for future tests.

For the test in #57 (comment), I re-ran the MIP part alone in a SLURM job (without parsl, and with --cpus-per-task=16), via this script:

#!/bin/bash
OLDZARR=/data/active/fractal/tests/Temporary_data_UZH_10_well_5x5_sites/20200812-CardiomyocyteDifferentiation14-Cycle1.zarr
NEWZARR=/data/active/fractal/tests/Temporary_data_UZH_10_well_5x5_sites/20200812-CardiomyocyteDifferentiation14-Cycle1_mip.zarr

rm -rf $NEWZARR

WELLS="B/09 B/11 C/08 C/10 D/09 D/11 E/08 E/10 F/09 F/11"

poetry run python ../../fractal/tasks/replicate_zarr_structure.py \
                  -zo $OLDZARR \
                  -zn $NEWZARR

echo "[MIP] START"
date
for WELL in $WELLS; do
    echo $WELL
    time poetry run python ../../fractal/tasks/maximum_intensity_projection.py \
                           -z ${OLDZARR}/${WELL}/0 \
                           -cxy 2
done
date

Note that wells are analyzed one after the other, and not in parallel.

This test ran in ~13 minutes, that is, ~80 seconds per well (on average). The same MIP processing of the 10 wells takes 248 s per well (on average) when running through parsl. Thus it seems that parsl is adding a 3x slowdown, for this specific task and dataset.


tcompa commented Jun 28, 2022

Adding some more pieces to the puzzle
(note: I'll add more info in this issue in the coming days, just to report results of tests.. when I reach some important understanding I'll make it clear)

TL;DR

  1. Running a single large (9x8) well has a CPU profile which looks good.
  2. Running the same workflow with python_app replaced by bash_app (in the time-consuming tasks) shows identical performance, but I'd like to repeat the same comparison in one of the "bad" cases.

MORE DETAILS

  • I am running a workflow which includes yokogawa_to_zarr, illumination_correction and maximum_intensity_projection, for a dataset with a single well, 72 sites (in a single-FOV scheme), 3 channels, 19 Z planes. The zarr file (before MIP) takes ~16G.
  • The workflow has no parallelization of different task (because it's a single well), but only multithreading inside tasks.
  • Context: the cluster is essentially empty (just one interactive sbatch session running), and there is >1T of free space on our share. I don't know whether this matters.
  • The workflow runs in 22 minutes, roughly corresponding to 13 + 7.5 + 1 for the three tasks.
  • The CPU profile shows nothing weird -- see below. The smallest usage is still ~400% (at the beginning of yokogawa_to_zarr), and peaks reach 1200% (all with a single task). The MIP task is especially fast in this example (~1 minute), and it certainly doesn't have any low-CPU interval.
  • Memory is under control, with a peak of 3.4 G.
  • I re-ran the same workflow by replacing the main python_apps with equivalent bash_apps, and nothing changed.
  • Perspective 1: What was causing "bad" performance in the tests described earlier in this issue? The first candidate is the simultaneous execution of several tasks.
  • Perspective 2: The timing is comparable to the one running on 10 small wells (5x5 sites, meaning 1/3 of the 9x8 case), which is very counterintuitive. Either parallelization over wells introduces a bottleneck, or maybe processing "small" wells is suboptimal.
  • Perspective 3: Let's keep in mind that it looks like parsl is introducing a slowdown as compared to plain SLURM execution (Task performances within/without parsl #57 (comment)). The current test does not address this point.

[plot: CPU profile (newplot)]


tcompa commented Jun 28, 2022

Got it!

TL;DR
For our use cases, "embarrassingly parallel tasks scale linearly" should be replaced with something like "embarrassingly parallel tasks scale linearly if they are not competing for IO resources" (and in fact they are).

Consider the test with 10 5x5 wells. In #57 (comment) I ran it through SLURM directly (no parsl), with the wells running one after the other, and MIP tasks were taking ~80 seconds per well. The same tasks would take much longer when running in parallel (over wells) within parsl, which pointed at some parsl issue.

Now I tested the same MIP task within SLURM, but running 10 jobs (one per well) at the same time. Performance is in fact the same as in the parsl case: each SLURM job takes about 4 minutes. This corresponds to the 3x slowdown already observed when using parsl in the comment above (for this same example), and we should conclude that parsl is not adding any relevant friction.
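
For future reference, a hedged sketch of how such a concurrent run can be reproduced with a SLURM job array, reusing the paths and well list from the sequential script above (the sbatch options are placeholders and would need to match the cluster):

#!/bin/bash
#SBATCH --cpus-per-task=16
#SBATCH --mem=60G
#SBATCH --array=0-9

OLDZARR=/data/active/fractal/tests/Temporary_data_UZH_10_well_5x5_sites/20200812-CardiomyocyteDifferentiation14-Cycle1.zarr
WELLS=(B/09 B/11 C/08 C/10 D/09 D/11 E/08 E/10 F/09 F/11)
WELL=${WELLS[$SLURM_ARRAY_TASK_ID]}

# One MIP task per array element, so all 10 wells run at the same time
time poetry run python ../../fractal/tasks/maximum_intensity_projection.py \
                       -z ${OLDZARR}/${WELL}/0 \
                       -cxy 2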

The obvious candidate for this bottleneck is IO. Each one of the 10 tasks is writing data (about 0.5G per well, in the MIP task), and it seems that they are saturating some maximum IO capacity. This is not something we can avoid easily, and we may have to live with it up to a certain level.

Possible next moves:

  1. We try to understand in detail (with Francesco) what aspect of the filesystem is the limiting factor, to see whether we can improve things somehow. This is admittedly very vague, as I'm no expert on this topic. I wouldn't pursue this direction unless someone with a bit more expertise chimes in.
  2. We start experimenting with different kinds of zarr storage, and test how performance changes (there will be some overhead for the compression, but there could be an advantage if simultaneous writing into a compressed zarr storage works better than simultaneous writing of many small files.. to be tested, see the sketch after this list). This was already on the radar, e.g. in https://github.com/fractal-analytics-platform/mwe_fractal/issues/59#issuecomment-1141957836, and I think now is the right moment to try.
  3. We can also accept the current status (at least for the moment)! The 23-well 9x8-site workflow (including yokogawa_to_zarr, illumination correction and MIP) ran in 4 hours, which is not so bad.
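
Regarding option 2, a minimal sketch of writing a compressed zarr array with zarr-python and numcodecs, as a starting point for such experiments (the array shape, chunking and compressor settings are arbitrary placeholders):

import numpy as np
import zarr
from numcodecs import Blosc

# Placeholder layout: one chunk per Z plane of a 2160x2560 image
compressor = Blosc(cname="zstd", clevel=3, shuffle=Blosc.BITSHUFFLE)
z = zarr.open(
    "test_compressed.zarr",
    mode="w",
    shape=(19, 2160, 2560),
    chunks=(1, 2160, 2560),
    dtype="uint16",
    compressor=compressor,
)
z[:] = np.random.randint(0, 2**16, size=z.shape, dtype="uint16")
print("uncompressed bytes:", z.nbytes, "stored bytes:", z.nbytes_stored)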


jluethi commented Jun 28, 2022

Great that we know this! As feared, the limits are IO then. It could well be that the current share setup at the Pelkmans lab is quite suboptimal for this and hits IO limits quickly. If there are very low-hanging fruits to improve this, that's certainly interesting (going in the direction of 1). I agree though, we shouldn't dig too deep into filesystem optimizations if we can avoid it.

Regarding 2: I think it's very interesting to pursue this direction. We should also keep object storage approaches in mind here. It would be interesting to see how this scales with a system like local S3 buckets, as that may be where zarr files go long term.
Also, about the zipped zarr files: it was my understanding that we can't modify zipped zarr stores once they are written. If that is true, then this may create bottlenecks. But it's worth having a closer look, especially at the question of whether it's faster to write to them (though I could imagine it stays at the same speed, who knows).

On 3: Yes, I think we have decent enough performance so that we don't need to stop our other work. We can keep testing with the current setup. But general scaling beyond the 23 well test case is a big interest for Fractal, so I think we take this as baseline performance and keep working on improving it :)


tcompa commented Jun 28, 2022

Great that we know this! As feared, the limits are IO then. It could well be that the current share setup at the Pelkmans lab is quite suboptimal for this and hits IO limits quickly. If there are very low-hanging fruits to improve this, that's certainly interesting (going in the direction of 1). I agree though, we shouldn't dig too deep into filesystem optimizations if we can avoid it.

In fact, I wouldn't mind seeing the monitoring of some workflow running at FMI. Perhaps @gusqgm could run such a test? For instance, the UZH dataset with 10 5x5 wells is one we have looked at several times. If transferring the data is easy, then modifying the example script should only take a few minutes.


jluethi commented Jun 28, 2022

Yes, we have some of the test data on FMI servers already; I will need to check for the 10-well test case. But that will also be an interesting comparison! :)


tcompa commented Jun 29, 2022

In case we need a more precise and systematic benchmark of the UZH filesystem, we could try this: https://github.com/deggio/cephfs_bench/blob/main/run_ior.sh (but first I would look at the monitoring of a Fractal run).


tcompa commented Jul 4, 2022

One last bit of information, again as a reference for future tests.

For the test in #57 (comment), I re-ran the MIP part alone in a SLURM job (without parsl, and with --cpus-per-task=16), via this script:

#!/bin/bash
OLDZARR=/data/active/fractal/tests/Temporary_data_UZH_10_well_5x5_sites/20200812-CardiomyocyteDifferentiation14-Cycle1.zarr
NEWZARR=/data/active/fractal/tests/Temporary_data_UZH_10_well_5x5_sites/20200812-CardiomyocyteDifferentiation14-Cycle1_mip.zarr

rm -rf $NEWZARR

WELLS="B/09 B/11 C/08 C/10 D/09 D/11 E/08 E/10 F/09 F/11"

poetry run python ../../fractal/tasks/replicate_zarr_structure.py \
                  -zo $OLDZARR \
                  -zn $NEWZARR

echo "[MIP] START"
date
for WELL in $WELLS; do
    echo $WELL
    time poetry run python ../../fractal/tasks/maximum_intensity_projection.py \
                           -z ${OLDZARR}/${WELL}/0 \
                           -cxy 2
done
date

Note that wells are analyzed one after the other, and not in parallel.

This test ran in ~13 minutes, that is, ~80 seconds per well (on average). The same MIP processing of the 10 wells takes 248 s per well (on average) when running through parsl. Thus it seems that parsl is adding a 3x slowdown, for this specific task and dataset.

This should be re-tested in view of #92.


jluethi commented Jul 27, 2022

The conclusion here is that the limiting factor is disk IO on the Pelkmans lab shares, so Fractal is running as fast as it reasonably can for those tasks.

Thus, let's close this issue related to parsl performance and open new issues regarding IO when they become more urgent.
