Extend/improve copy-zarr task #279

Open
tcompa opened this issue Jan 4, 2023 · 5 comments
Labels
flexibility Support more workflow-execution use cases Priority Important, but not the highest priority

Comments

@tcompa
Collaborator

tcompa commented Jan 4, 2023

EDIT: I'm revamping this somewhat old discussion, based on last week's meetings. The new comments start from #279 (comment).


As per a discussion with @gusqgm this morning: the idea is a task that copies a subset of a Zarr, something like:

def copy_zarr_subset(input_zarr, output_zarr, a_list_of_filters):
    pass
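As a starting point, here is a minimal runnable sketch of what such a task could do, under the assumption that the filter is simply a list of well paths (e.g. "B/03") and that the OME-Zarr is a plain directory store; all names here are illustrative, not the actual task signature:

```python
import shutil
from pathlib import Path


def copy_zarr_subset(input_zarr, output_zarr, wells):
    """Copy the top-level metadata plus the selected wells of a plate
    OME-Zarr. `wells` is a list of relative well paths such as "B/03";
    every other well is skipped. The Zarr is treated as a directory
    store, so this is a plain file copy (data included)."""
    src, dst = Path(input_zarr), Path(output_zarr)
    dst.mkdir(parents=True, exist_ok=True)
    # Copy plate-level metadata files (e.g. .zattrs, .zgroup).
    for meta in src.glob(".z*"):
        shutil.copy2(meta, dst / meta.name)
    # Copy only the requested wells.
    for well in wells:
        shutil.copytree(src / well, dst / well, dirs_exist_ok=True)
```

A real implementation would also rewrite the plate metadata so that it only lists the copied wells; that step is omitted here.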
@gusqgm

gusqgm commented Jan 4, 2023

Thank you @tcompa for adding this!

The main contextual example, in my opinion, is the scenario where a user creates a new workflow from scratch and needs to test several parameters of one or more tasks. Instead of doing this over the entire .zarr data, the user can, at the desired point in the workflow, generate a .zarr file with a subset of the data and use it for all required tests.

This .zarr file structure can be short-lived, i.e. created and used only for testing purposes and discarded once the workflow is finalized and ready to run on full datasets. It could also be used to share data among collaborators, along with workflows, for example. However, we would need to enforce some information being added to this Zarr file so that it is not confused with its parent dataset; maybe adding '_subset' as a suffix to the name?
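A tiny sketch of that naming convention (the '_subset' suffix is only the proposal above, not a decided scheme):

```python
from pathlib import Path


def subset_name(zarr_path):
    """Derive a '_subset'-suffixed name so the copy is not confused
    with its parent dataset, e.g. plate.zarr -> plate_subset.zarr."""
    p = Path(zarr_path)
    return p.with_name(f"{p.stem}_subset{p.suffix}")
```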

Also, the largest drawback is that it requires the user to be vigilant and avoid keeping multiple unnecessary partial copies of the same data. The copy_zarr_subset task could run either at the beginning, i.e. creating the partial copy of the data before running a task for testing, or at the end on the output, once the task being tested has run over a partial part of the main dataset. I assume the second option is safer for avoiding multiple identical copies of the data, but it could be more cumbersome to implement, since it would require creating a partial input for the task which is not a separate .zarr on its own. What do you think?

I will think of more points as well.

@tcompa tcompa added the Priority Important, but not the highest priority label Jan 11, 2023
@jluethi
Collaborator

jluethi commented Jan 17, 2023

To quickly summarize my comments from the call:
I think it's a good idea to have this as part of "allow users to experiment with parameters". And it's an area of improved flexibility we can work on before figuring out the whole question of a dataset's history.

Regarding cleanup, number of copies etc.: I suggest we create a tmp folder inside the output folder for such intermediary OME-Zarr files. Let's not get fancy about sharing or cleanup at the start; let's just keep them in their own space. The major goal: allow users to test some parameters, check them on a small subset of the output, and then adapt their workflows accordingly.
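That layout could look like the following sketch, where the `tmp` folder name and helper are assumptions for illustration:

```python
from pathlib import Path


def tmp_subset_path(output_folder, zarr_name):
    """Place intermediary OME-Zarr copies in their own tmp/ space
    inside the output folder, e.g. <output>/tmp/plate.zarr, so that
    cleanup later is a single directory removal."""
    tmp_dir = Path(output_folder) / "tmp"
    tmp_dir.mkdir(parents=True, exist_ok=True)
    return tmp_dir / zarr_name
```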


Another question: Should this be a task?

Technically, maybe. But it plays a very different role in a typical user story and flow. A user may have a workflow of existing tasks but want to try, e.g., a few different parameters for the cellpose task on a single FOV.
Now, this user could define an additional workflow that goes "copy OME-Zarr subset, then run the cellpose task". Maybe that's how we build it under the hood.
But the user should eventually be able to say: I have this workflow, let me only run it on FOV 7 as a parameter test, and get that output to check (keeping it in a separate file is a good idea).

@jluethi
Collaborator

jluethi commented Jan 30, 2023

Just to note this down before I forget:
When we get to the topic of running on subsets, it's certainly nice to be able to run on a subset of an existing Zarr file. A big use case, though, is processing only a subset of the available data: I have a folder with 100k images and want to test on 2-3 FOVs whether my processing pipeline makes sense, before I convert images to OME-Zarr for the first time. This couldn't be achieved by copying a subset of the Zarr file, because the Zarr file doesn't exist yet, and parsing data into OME-Zarr is typically the slowest part.

(we can decide to only cover this later, having this flexibility for all later steps is great. Let's just be aware that we probably also want to support this user flow above)

@tcompa tcompa changed the title Discussion: do we need a copy-zarr-subset task Extend/improve copy-zarr task Jul 17, 2023
@tcompa
Collaborator Author

tcompa commented Jul 17, 2023

> Just to note this down before I forget: When we get to the topic of running on subsets, it's certainly nice to be able to run on a subset of an existing Zarr file. A big use case is only processing a subset of the available data though, i.e. I have a folder with 100k images, I want to test on 2-3 FOVs whether my processing pipeline makes sense, before I convert images to OME-Zarr for the first time. This couldn't be achieved with a copy of a subset of the Zarr file, because the Zarr file doesn't exist yet and parsing data into OME-Zarr typically is the slowest part.
>
> (we can decide to only cover this later, having this flexibility for all later steps is great. Let's just be aware that we probably also want to support this user flow above)

This has already been covered by the image-glob-pattern argument of the zarr-creation tasks, so the current issue only concerns the situation where we already have an OME-Zarr.
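The idea behind such a glob-pattern filter can be illustrated with `fnmatch`; the actual parameter name and matching logic live in the zarr-creation tasks, so this is only a sketch:

```python
from fnmatch import fnmatch


def select_images(filenames, patterns):
    """Keep only images matching at least one glob pattern, so that
    conversion can run on e.g. 2-3 FOVs instead of 100k images."""
    return [f for f in filenames if any(fnmatch(f, p) for p in patterns)]
```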

@tcompa
Collaborator Author

tcompa commented Jul 17, 2023

Based on last week's meetings, it seems that an improved version of the copy-ome-zarr task could be a nice starting point for the "let me work on an experimental branch of my workflow" use case, even if this use case is not yet fully defined on the server/web side.

Some of the proposed new features:

  1. The task should offer the option to copy data as well, on top of the OME-Zarr structure and metadata.
  2. For the moment we should also maintain the option of not copying any data, since it's how MIP works (hopefully this will change in the future).
  3. The task should offer the option to only select a subset of the OME-Zarr components - see below.
  4. Copying data should all happen in the main task, even though in principle it could be parallelized over wells. There are multiple reasons for this:
    • Compound prepare&fill tasks are not intuitive when building a workflow; let's reduce their use as much as possible.
    • When selecting a subset of the OME-Zarr data, it would be complex to let the server build the appropriate component list.
    • Copying a small array should still be a reasonably fast operation, and we can verify that it gets somewhat faster with increased CPU requirements. Copying a large array is not something we should ever encourage, so we don't need to optimize that use case.
  5. The writing of updated metadata will then need to be aligned with fractal-server#792 ("Replace parallelization_level with more structured options of what to run").

Concerning the subset-filter, here are some possibilities (sorted by increasing complexity):
V0: select a single well, or a list of wells
V1: select the same ROI from all wells (TBD what to do if it does not exist)
V2: same as V1, but handling edge cases
V3: select a specific ROI from each well
V4: select N ROIs from M wells => into individual OME-Zarrs
V5: select N ROIs from M wells => into the same OME-Zarr
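The V0/V1 variants could be expressed with a small filter object; the names here are hypothetical, and treating a ROI name as a path component is a simplification (in OME-Zarr, ROIs are table entries rather than paths):

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SubsetFilter:
    """V0: a list of wells; V1: additionally the same ROI in every well."""
    wells: List[str] = field(default_factory=list)
    roi: Optional[str] = None

    def components(self) -> List[str]:
        """Relative plate components to copy for this filter."""
        if self.roi is None:
            return list(self.wells)  # V0: copy whole wells
        # V1: same ROI from all wells (path form is a simplification)
        return [f"{well}/{self.roi}" for well in self.wells]
```

V2 would add handling for wells where the ROI does not exist; V3-V5 would generalize the `roi` field to a per-well mapping.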

@tcompa tcompa added the july2023 Maintenance work planned for July 2023 label Jul 17, 2023
@tcompa tcompa added flexibility Support more workflow-execution use cases and removed july2023 Maintenance work planned for July 2023 labels Sep 15, 2023