Extend/improve copy-zarr task #279

Open
tcompa opened this issue Jan 4, 2023 · 5 comments
Labels
flexibility Support more workflow-execution use cases Priority Important, but not the highest priority

Comments

@tcompa
Collaborator

tcompa commented Jan 4, 2023

EDIT: I'm revamping this somewhat old discussion, based on last week's meetings. The new comments start from #279 (comment).


As per a discussion with @gusqgm this morning: the idea is a task that copies a subset of a Zarr, something like:

def copy_zarr_subset(input_zarr, output_zarr, a_list_of_filters):
    pass
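As a starting point, here is a minimal runnable sketch of what such a task could do, under the assumption that the filter is simply a list of well paths (e.g. "B/03") and that the OME-Zarr is a plain directory store; all names here are illustrative, not the actual task signature:

```python
import shutil
from pathlib import Path


def copy_zarr_subset(input_zarr, output_zarr, wells):
    """Copy the top-level metadata plus the selected wells of a plate
    OME-Zarr. `wells` is a list of relative well paths such as "B/03";
    every other well is skipped. The Zarr is treated as a directory
    store, so this is a plain file copy (data included)."""
    src, dst = Path(input_zarr), Path(output_zarr)
    dst.mkdir(parents=True, exist_ok=True)
    # Copy plate-level metadata files (e.g. .zattrs, .zgroup).
    for meta in src.glob(".z*"):
        shutil.copy2(meta, dst / meta.name)
    # Copy only the requested wells.
    for well in wells:
        shutil.copytree(src / well, dst / well, dirs_exist_ok=True)
```

A real implementation would also rewrite the plate metadata so that it only lists the copied wells; that step is omitted here.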
@gusqgm

gusqgm commented Jan 4, 2023

Thank you @tcompa for adding this!

The main contextual example, in my opinion, is the scenario where a user creates a new workflow from scratch and needs to test several parameters of one or more tasks. Instead of doing this over the entire .zarr data, the user can, at the desired point in the workflow, generate a .zarr file with a subset of the data and use it for all required tests.

This .zarr file structure can be short-lived, i.e. created and used only for testing purposes and discarded once the workflow is finalized and ready to run on full datasets. It could also be used to share data among collaborators, along with workflows, for example. However, we would need to enforce some information being added to this Zarr file so that it is not confused with its parent dataset; maybe adding '_subset' as a suffix to the name?
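A tiny sketch of that naming convention (the '_subset' suffix is only the proposal above, not a decided scheme):

```python
from pathlib import Path


def subset_name(zarr_path):
    """Derive a '_subset'-suffixed name so the copy is not confused
    with its parent dataset, e.g. plate.zarr -> plate_subset.zarr."""
    p = Path(zarr_path)
    return p.with_name(f"{p.stem}_subset{p.suffix}")
```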

Also, the largest drawback is that it requires the user to be vigilant and avoid keeping multiple unnecessary partial copies of the same data. The copy_zarr_subset task could run either at the beginning, i.e. creating the partial copy of the data before running a task for testing, or at the end on the output, once the task being tested has run over a partial part of the main dataset. I assume the second option is safer for avoiding multiple identical copies of the data, but it could be more cumbersome to implement, since it would require creating a partial input for the task which is not a separate .zarr on its own. What do you think?

I will think of more points as well.

@tcompa tcompa added the Priority Important, but not the highest priority label Jan 11, 2023
@jluethi
Collaborator

jluethi commented Jan 17, 2023

To quickly summarize my comments from the call:
I think it's a good idea to have this as part of "allow users to experiment with parameters". And it's an area of improved flexibility we can work on before figuring out the whole question of a dataset's history.

Regarding cleanup, number of copies etc.: I suggest we create a tmp folder inside the output folder for such intermediary OME-Zarr files. Let's not get fancy about sharing or cleanup at the start; let's just keep them in their own space. The major goal: allow users to test some parameters, check them on a small subset of the output, and then adapt their workflows accordingly.
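That layout could look like the following sketch, where the `tmp` folder name and helper are assumptions for illustration:

```python
from pathlib import Path


def tmp_subset_path(output_folder, zarr_name):
    """Place intermediary OME-Zarr copies in their own tmp/ space
    inside the output folder, e.g. <output>/tmp/plate.zarr, so that
    cleanup later is a single directory removal."""
    tmp_dir = Path(output_folder) / "tmp"
    tmp_dir.mkdir(parents=True, exist_ok=True)
    return tmp_dir / zarr_name
```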


Another question: Should this be a task?

Technically, maybe. But it plays a very different role in a typical user story and flow. A user may have a workflow of existing tasks but want to try, e.g., a few different parameters for the cellpose task on a single FOV.
Now, this user could define an additional workflow that goes "copy OME-Zarr subset, then run the cellpose task". Maybe that's how we build it under the hood.
But the user should eventually be able to say: I have this workflow, let me only run it on FOV 7 as a parameter test, and get that output to check (keeping it in a separate file is a good idea).

@jluethi
Collaborator

jluethi commented Jan 30, 2023

Just to note this down before I forget:
When we get to the topic of running on subsets, it's certainly nice to be able to run on a subset of an existing Zarr file. A big use case, though, is processing only a subset of the available data: I have a folder with 100k images and want to test on 2-3 FOVs whether my processing pipeline makes sense, before I convert images to OME-Zarr for the first time. This couldn't be achieved by copying a subset of the Zarr file, because the Zarr file doesn't exist yet, and parsing data into OME-Zarr is typically the slowest part.

(we can decide to only cover this later, having this flexibility for all later steps is great. Let's just be aware that we probably also want to support this user flow above)

@tcompa tcompa changed the title Discussion: do we need a copy-zarr-subset task Extend/improve copy-zarr task Jul 17, 2023
@tcompa
Collaborator Author

tcompa commented Jul 17, 2023

> Just to note this down before I forget: When we get to the topic of running on subsets, it's certainly nice to be able to run on a subset of an existing Zarr file. A big use case is only processing a subset of the available data though, i.e. I have a folder with 100k images, I want to test on 2-3 FOVs whether my processing pipeline makes sense, before I convert images to OME-Zarr for the first time. This couldn't be achieved with a copy of a subset of the Zarr file, because the Zarr file doesn't exist yet and parsing data into OME-Zarr typically is the slowest part.
>
> (we can decide to only cover this later, having this flexibility for all later steps is great. Let's just be aware that we probably also want to support this user flow above)

This has already been covered by the image-glob-pattern argument of the zarr-creation tasks, so the current issue only concerns the situation where we already have an OME-Zarr.
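The idea behind such a glob-pattern filter can be illustrated with `fnmatch`; the actual parameter name and matching logic live in the zarr-creation tasks, so this is only a sketch:

```python
from fnmatch import fnmatch


def select_images(filenames, patterns):
    """Keep only images matching at least one glob pattern, so that
    conversion can run on e.g. 2-3 FOVs instead of 100k images."""
    return [f for f in filenames if any(fnmatch(f, p) for p in patterns)]
```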

@tcompa
Collaborator Author

tcompa commented Jul 17, 2023

Based on last week's meetings, it seems that an improved version of the copy-ome-zarr task could be a nice starting point for the "let me work on an experimental branch of my workflow" use case, even if this use case is not yet fully defined on the server/web side.

Some of the proposed new features:

  1. The task should offer the option to copy data as well, on top of the OME-Zarr structure and metadata.
  2. For the moment we should also maintain the option of not copying any data, since it's how MIP works (hopefully this will change in the future).
  3. The task should offer the option to only select a subset of the OME-Zarr components - see below.
  4. Copying data should all happen in the main task, even though in principle it could be parallelized over wells. There are multiple reasons for this:
    • Compound prepare&fill tasks are not intuitive when building a workflow; let's reduce their use as much as possible.
    • When selecting a subset of the OME-Zarr data, it would be complex to let the server build the appropriate component list.
    • Copying a small array should still be a reasonably fast operation, and we can verify that it gets somewhat faster with increased CPU requirements. Copying a large array is not something we should ever encourage, so we don't need to optimize that use case.
  5. The writing of updated metadata will then need to be aligned with fractal-server#792 ("Replace parallelization_level with more structured options of what to run").

Concerning the subset-filter, here are some possibilities (sorted by increasing complexity):
V0: select a single well, or a list of wells
V1: select the same ROI from all wells (TBD what to do if it does not exist)
V2: same as V1, but handling edge cases
V3: select a specific ROI from each well
V4: select N ROIs from M wells => into individual OME-Zarrs
V5: select N ROIs from M wells => into the same OME-Zarr
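The V0/V1 variants could be expressed with a small filter object; the names here are hypothetical, and treating a ROI name as a path component is a simplification (in OME-Zarr, ROIs are table entries rather than paths):

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SubsetFilter:
    """V0: a list of wells; V1: additionally the same ROI in every well."""
    wells: List[str] = field(default_factory=list)
    roi: Optional[str] = None

    def components(self) -> List[str]:
        """Relative plate components to copy for this filter."""
        if self.roi is None:
            return list(self.wells)  # V0: copy whole wells
        # V1: same ROI from all wells (path form is a simplification)
        return [f"{well}/{self.roi}" for well in self.wells]
```

V2 would add handling for wells where the ROI does not exist; V3-V5 would generalize the `roi` field to a per-well mapping.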

@tcompa tcompa added the july2023 Maintenance work planned for July 2023 label Jul 17, 2023
@tcompa tcompa added flexibility Support more workflow-execution use cases and removed july2023 Maintenance work planned for July 2023 labels Sep 15, 2023