Fractal default arguments for tasks #177

Closed
tcompa opened this issue Nov 4, 2022 · 13 comments
Labels
High Priority (Current Priorities & Blocking Issues)

Comments

tcompa commented Nov 4, 2022

NOTE: this is not an urgent issue, but one that we will need to tackle at some point


Currently fractal requires a set of standard arguments for each task, on top of the task-specific ones:

from pathlib import Path
from typing import Any, Dict, Optional, Sequence

def some_task(
    *,
    input_paths: Sequence[Path],  # required
    output_path: Path,            # required
    metadata: Dict[str, Any],     # required
    component: str,               # required for parallel tasks
    some_other_arg_1: str,
    some_other_arg_2: Optional[int] = None,
):
    ...

While the metadata and component arguments will likely stay, we need to re-evaluate the assumptions behind the I/O-path arguments. Briefly, some comments on how things work at the moment:

  1. Each task goes from N>=1 input paths to a single output path. Is that always the case? For the moment we never really use N>1 inputs (apart, maybe, from create_zarr_structure), but we kept the option there especially in view of multiplexing.
  2. I/O paths typically look like /some/path/*.extension. We often neglect the extension part. Is it always required? Is it sometimes redundant or wrong? (Spoiler: yes, sometimes it is irrelevant.) Can we drop it?
  3. Tasks in a workflow form a chain, where the N-th input is always the (N-1)-th output (see the sketch below). Are there use cases where this is not sufficiently flexible?
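
For concreteness, a minimal sketch of the chaining rule in point 3, assuming a hypothetical runner; run_chain and make_output_path are invented names for illustration, not Fractal code:

from pathlib import Path
from typing import Any, Callable, Dict, Sequence

def make_output_path(previous_output: Path, step_index: int) -> Path:
    # Invented naming scheme, only to make the example self-contained.
    return previous_output.parent / f"step_{step_index}"

def run_chain(
    tasks: Sequence[Callable[..., None]],
    first_input: Path,
    metadata: Dict[str, Any],
) -> Path:
    current = first_input
    for step_index, task in enumerate(tasks):
        output_path = make_output_path(current, step_index)
        task(input_paths=[current], output_path=output_path, metadata=metadata)
        current = output_path  # the N-th task reads what the (N-1)-th task wrote
    return current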

An obvious possibility is to remove the input_paths and output_path arguments from the fractal-required ones, and to shift their definition more towards the user. I'm not (yet) suggesting we do it, but we should explore this approach and assess pros/cons.

jluethi commented Nov 4, 2022

Good points, let's evaluate this! Two quick thoughts:

Tasks in a workflow form a chain, where the N-th input is always the (N-1)-th output. Are there use cases where this is not sufficiently flexible?

Let's look at this also in the setting of more flexible workflows, resubmissions and running on partial datasets. Maybe we'll need some of the flexibility we barely use for that.

An obvious possibility is to remove the input_paths and output_path arguments from the fractal-required ones, and to shift their definition more towards the user. I'm not (yet) suggesting we do it, but we should explore this approach and assess pros/cons.

I'd like to maintain the tasks as something that can be run outside of Fractal => the tasks do somehow need to get input & output paths

tcompa commented Nov 4, 2022

I'd like to maintain the tasks as something that can be run outside of Fractal => the tasks do somehow need to get input & output paths

To clarify: what I meant was not to remove the arguments, but to have them handled in the same way as the other non-fractal-specific arguments. The behavior for someone who runs a task outside Fractal would be identical; the only difference would be for someone who runs the tasks through Fractal.

tcompa commented Nov 23, 2022

Related to this part of the specs:

* [ ] S8. There are relevant warnings/barriers that stop users from rerunning individual tasks when this would lead to issues (e.g. rerunning tasks that overwrite images without rerunning the original parsing of the data)
* [ ] S9. As a user, I can control at submission whether the existing outputs will be overwritten or new outputs will be created

A possible way forward is to include a default overwrite=False argument in all tasks. In this way, the server could easily identify "dangerous" tasks and raise a warning or an error.
Note that only a few tasks would actually do something with this overwrite parameter; for the others it would simply be irrelevant.
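
A minimal sketch of the idea; the task name, its body, and the server-side check below are illustrative assumptions, not existing Fractal code:

import inspect
from pathlib import Path
from typing import Any, Dict, Sequence

def some_parallel_task(
    *,
    input_paths: Sequence[Path],
    output_path: Path,
    metadata: Dict[str, Any],
    component: str,
    overwrite: bool = False,
):
    # Refuse to overwrite existing data unless the user explicitly opted in.
    if output_path.exists() and not overwrite:
        raise FileExistsError(f"{output_path} already exists and overwrite=False")
    ...

def task_accepts_overwrite(task) -> bool:
    # The server could flag "dangerous" tasks simply by checking whether
    # their signature exposes the overwrite keyword.
    return "overwrite" in inspect.signature(task).parameters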

jluethi commented Nov 23, 2022

Yes, I think such a flag is generally a good idea for tasks. Whether a task should overwrite existing data or not is quite a top-level question, and it may influence the handling quite a bit. Some tasks already have this, like illumination correction.

To address S8:
The typical behavior is that we want to run illumination correction ONCE on the images. That first run does not need to raise any warnings. What should not be possible is to easily rerun the same illumination correction task (without having rerun the parsing).
Thus, I'm not sure whether the overwrite flag solves those issues.

@jluethi added the Priority (Important, but not the highest priority) and High Priority (Current Priorities & Blocking Issues) labels and removed the Priority (Important, but not the highest priority) label on Feb 13, 2023

tcompa commented Feb 16, 2023

Some discussion points for the upcoming call:

Tasks-side:

  1. Should we remove the pydantic layer from tasks?
  2. Known problem: extensions are often useless. I think we can simplify that part.

Server/task-integration:

  1. Known problem: some tasks require two inputs: the empty zarr structure + the actual input data (either images or another zarr). This is currently handled with some workarounds (using Dataset.metadata), which is not ideal.
  2. MIP workflows go 3D -> 2D... how can I get back to the 3D data?

Out of scope for today (I'd say):

  1. Handling point S8 above. This is related to the broad topic "can we store some status information inside the ome-ngff metadata?"

jluethi commented Feb 16, 2023

On 1.
We keep pydantic for the moment to get input validation.
But:

  1. Remove paths from everywhere
  2. Remove default values from TaskArguments => via Optional, then .dict(exclude_none=True) (see the sketch below)
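
A minimal sketch of point 2, assuming a pydantic (v1-style) arguments model; the field names are just examples taken from elsewhere in this thread:

from typing import Optional

from pydantic import BaseModel

class TaskArguments(BaseModel):
    # No task defaults are hard-coded here: fields are Optional and default to None.
    num_levels: Optional[int] = None
    coarsening_xy: Optional[int] = None

args = TaskArguments(num_levels=5)
# Dropping the unset (None) fields means the task's own signature defaults apply.
kwargs = args.dict(exclude_none=True)  # {"num_levels": 5}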

tcompa commented Feb 16, 2023

On 2:

  • Remove glob_pattern from input_paths and output_path, and include it as an additional task argument when needed (namely for the ome-zarr creation and yokogawa-to-zarr conversion); see the sketch after this list.
  • More general glob_patterns to select only a few images? Interesting, let's open another issue.
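
A rough illustration of the first bullet (signature only; everything except glob_pattern follows the argument examples earlier in this thread, and the default value is just a placeholder):

from pathlib import Path
from typing import Any, Dict, Sequence

def create_ome_zarr(
    *,
    input_paths: Sequence[Path],  # e.g. [Path("/some/path/")], without the *.png part
    output_path: Path,
    metadata: Dict[str, Any],
    glob_pattern: str = "*.png",  # the extension moves into an explicit task argument
):
    ...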

jluethi commented Feb 16, 2023

Also on 2: potentially add the glob_pattern to the metadata, to pass it between those tasks.

tcompa commented Feb 16, 2023

(reserved-keyword arguments: input_paths, output_path, metadata, component)

  1. When the task acts on a zarr array, the input_paths argument should look like ["/some/path/to/a.zarr"] (and not ["/some/path/to/*.zarr"], nor ["/some/path/to/"]).
  2. Can we relax the reserved-keyword constraint, and allow a user to specify input_paths as an argument of a WorkflowTask?
  3. Current choice: some tasks (e.g. illumination correction, or possibly ROI selection from a different (2D) zarr) may still take part of their input as separate arguments (e.g. the correction-matrix paths, or the path to the 2D zarr with the ROIs). This means there is no urgent need to automate this argument definition on the server side: they can remain separate arguments, and do not need to enter one of the reserved-keyword arguments.
  4. Should we add a Dataset.resource.info additional attribute? That could be (1) multiplexing cycle label ("monday", "tuesday", ..), or (2) something in {"3D", "3D-illumination-corrected", "2D", ...}.
    • How to update it dynamically during workflow execution (where DB is not known)
    • Related to this: how do we update metadata after a task is over? Probably we should enforce an addition-only policy, with no replacements. This is partly mitigated by points 1 and 2, at least for the problem of doing 3D->2D->3D workflows.
    • Related to this: let's make sure that the info information is transferred back to the DB at the end of execution (if it was modified)
  5. Concerning compound tasks: they are logical units of two tasks (a non-parallel preliminary task and a parallel one) that need to share some data (possibly also parameters, but Dataset.metadata is a reasonable place for those). Note that the shared data are mostly a few paths, not the actual image data. The sharing can happen on disk (possibly in a "temporary" location) or through the metadata inputs/outputs. The current preference is to use special key-value pairs in the metadata dictionary (see the sketch below), so that the workflow & task definitions remain the same and the compound task would "only" exist as an additional endpoint.
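
A minimal sketch of that metadata-based sharing; the function names, the key name, and the example path are invented for illustration:

from typing import Any, Dict, List

def compound_init_task(*, metadata: Dict[str, Any], **kwargs) -> Dict[str, Any]:
    # The non-parallel half produces a few paths (e.g. newly created zarr
    # components) and returns them so that they can be merged into Dataset.metadata.
    shared_paths: List[str] = ["/some/path/plate.zarr/B/03/0"]  # example value
    return {"compound_shared_paths": shared_paths}  # invented key name

def compound_parallel_task(*, metadata: Dict[str, Any], component: str, **kwargs):
    # The parallel half reads the shared paths back from the metadata.
    shared_paths = metadata["compound_shared_paths"]
    ...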

jluethi commented Feb 16, 2023

@tcompa The only thing I can think of to add at the moment:
How do we pass the glob pattern (/ *.ext) between the create-ome-zarr and the yokogawa-to-ome-zarr tasks? Should it go into the metadata dictionary? => part of the discussion here: #299

tcompa commented Feb 16, 2023

Should it go into the metadata dictionary?

Yes, I would say the first implementation follows the standard behavior we use for num_levels or coarsening_xy: the new argument (glob_pattern) is added to the metadata, and the downstream task reads it from there.
Later on we can decide to include this parameter in the specific set of key-value pairs shared between the two tasks, as in #299.
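
A minimal sketch of that flow; how a task's return value ends up in the metadata, and the exact key name, are assumptions here:

from typing import Any, Dict

def create_ome_zarr(*, metadata: Dict[str, Any], **kwargs) -> Dict[str, Any]:
    # The parsing task records the pattern in the metadata it returns,
    # just like num_levels and coarsening_xy.
    return {"glob_pattern": "*.png", "num_levels": 5, "coarsening_xy": 2}

def yokogawa_to_ome_zarr(*, metadata: Dict[str, Any], component: str, **kwargs):
    glob_pattern = metadata["glob_pattern"]  # the downstream task reads it back
    # ... use glob_pattern to select which raw images to convert ...
    ...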

jluethi commented Feb 16, 2023

Makes sense.

One thing I just realized, though: currently, to the user, the '.png' or '.tif' is not a task parameter, but something that is defined when adding a resource => can the task just read it from the resource? Or how does it go from the resource to the task? Does fractal-server combine it with the base folder string and just provide that combination?
