Skip to content

OME-Zarr parsing has non-deterministic assignments of channels per well #61

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jluethi opened this issue Jun 1, 2022 · 30 comments
Closed
Labels
bug Something isn't working High Priority Current Priorities & Blocking Issues

Comments

@jluethi
Copy link
Collaborator

jluethi commented Jun 1, 2022

The current OME-Zarr parsing has no fixed mapping between the channel name in the original filename (or any channel metadata we may want to include) and the channel number it assigns when parsed to an OME-Zarr file. That leads to the bug where images assigned to channel 1 in well B03 may be from a different channel than images assigned to channel 1 in well C06.

Because channels are displayed as layers in napari, that then leads to a layer containing different information depending on the well, which we need to fix.
See here: The same channel is DAPI (nuclear marker) vs. Lamin B1 (nuclear envelope marker), depending on the well in the 23 well dataset (=> one looks like solid, round objects, the other is more pronounced at the rim of the object).
Bildschirmfoto 2022-06-01 um 11 57 08
Bildschirmfoto 2022-06-01 um 11 56 45

What we would like to achieve:

  1. A given channel (e.g. C01) is always parsed to the same channel number in the OME-Zarr file
  2. This can get somewhat more complicated down the road: We may have multiple so-called cycles all having channels C01, C02 etc. And users may (but rarely do) acquire multiple C01s in the same cycle, with varying acquisition numbers. E.g. A01_C01, A02_C01 etc.
  3. We should parse additional metadata for channels (see also here: Increase metadata parsing to OME-Zarr & update to v0.4 (or v0.5) ome-ngff standard #44). I'm currently looking into things like contrast-limits, colormaps and channel naming to see where we would need to place them in the OME-Zarr file and which actually get processed by OME-Zarr so far.
@tcompa
Copy link
Collaborator

tcompa commented Jun 15, 2022

As or our last call today:

With 6d176a7, I only scratched the issue of inconsistent channel assignment. To be honest I don't even know if it is enough, but anyway it is obviously not robust.

By now, let's proceed this way:

  1. The user should provide a mapping from channel indices to their names (by filling a simple template).
  2. Upon reading the image files, we parse the channel index from the filenames and we check that it belongs to the user's table.
  3. This index is used everywhere in Fractal, while the positional index (whether a given channel is the first or second or third.. for a certain well) becomes totally irrelevant.
  4. The channel names are stored in the metadata -- see Increase metadata parsing to OME-Zarr & update to v0.4 (or v0.5) ome-ngff standard #44 (comment).

Example

The user provides this table

1 -> DAPI
2 -> AAA
3 -> BBB
4 -> CCC
5 -> DDD

Images in a given well have filenames ending with C01.png, C03.png or C04.png. Images in a different well have filenames ending with C05.png only. Upon loading images in the two different wells, we obtain that the zarr subfolder for the first well includes channels 1,3,4, while the subfolder for the second well includes only channel 5.

Notes:

  1. The relevant "channel number" is the one appearing in the user's table, while its enumeration index for a given well (first, second, third, ..) must never be used.
  2. The channel list (which is a typical input of most tasks) becomes well-dependent.

@tcompa
Copy link
Collaborator

tcompa commented Jun 15, 2022

(@jluethi: no need that you run the 23-well example with the current code, since there's a lot of things that will have to change anyway)

@tcompa
Copy link
Collaborator

tcompa commented Jun 15, 2022

Quick comment upon thinking about this:

As soon as we allow a well to have channels {1,3,4}, for instance, we cannot any more simply rely on to_zarr to create the zarr structure at the channel level, because it would write them to channels {0,1,2} and loose the information we want to preserve.

This means that we should call to_zarr for each level, and the .zarray file should move down by one level within the zarr hierarchy (it would describe the 3D ZYX arrays, rather than the 4D CZYX ones).

Also, we should modify the pyramid creation to work with 3D arrays (which is a trivial change in #53), and obviously the create_zarr_structure+yokogawa_to_zarr tasks. Note that this is all doable.

The other option is that we keep the current behavior, where to_zarr saves channels {1,3,4} to subfolders {0,1,2}, and we store the information on the actual indices {1,3,4} somewhere else. Labels will be obviously stored in the omero metadata, but perhaps we would like to also save the indices to avoid repeating often a slightly-annoying procedure (reading the user's table, parsing the metadata file, matching the label to an index).

Briefly, the main options are

  1. Replace to_zarr with code that writes channels {1,3,4} to subfolders {1,3,4}.
  2. Keep channels {1,3,4} in subfolders {0,1,2}, and rely on metadata file (with slightly-annoying procedure) to map 0->1, 1->3, 2->4.
  3. Keep channels {1,3,4} in subfolders {0,1,2}, and also store the actual indices {1,3,4} somewhere (so that the metadata file is correct but not crucial).

IMPORTANT: This is quite a relevant change, so I'll wait for some feedback before proceeding.

@jluethi
Copy link
Collaborator Author

jluethi commented Jun 15, 2022

This is a very important discussion @tcompa , you are right.

This index is used everywhere in Fractal, while the positional index (whether a given channel is the first or second or third.. for a certain well) becomes totally irrelevant.

This is tricky. At least the current implementation of the ome-zarr-reader assumes that the folder names (e.g. {0,1,2}) are consistent between wells and it will put {0] of one well into the same napari layer as {0} of a different well.

I think it does get a bit simpler though, because we don't really care about the C01, C03, C04 part anymore if they are labeled and parsed consistently.

Here's my vision of this
The user provides a list of channel names:

A01_C01: "DAPI",
A02_C03: "nanog",
A03_C04: "Lamin"

Before parsing, we check through all the channels and decide a fixed mapping

A01_C01: "DAPI" => 0
A02_C03: "nanog" => 1
A03_C04: "Lamin" => 2

The Zarr file then always contains the channel under these folder numbers, every folder has these channels (lowest level of the folder tree displayed here:

.
├── 20200812-CardiomyocyteDifferentiation14-Cycle1.zarr
│   └── B
│       └── 03                       # COLUMN
│           ├── 0                     # Field of view  => OMERO metadata
│           │   ├── 0                 # Pyramid level
│           │   │   ├── 0             # DAPI
│           │   │   ├── 1             # nanog
│           │   │   └── 2             # Lamin
│           │   ├── 1
│           │   │   ├── 0             # DAPI
│           │   │   ├── 1             # nanog
│           │   │   └── 2             # Lamin
│           │   ├── 2
│           │   │   ├── 0
│           │   │   ├── 1
│           │   │   └── 2
│           │   ├── 3
│           │   │   ├── 0
│           │   │   ├── 1
│           │   │   └── 2
│           │   ├── 4
│           │   │   ├── 0
│           │   │   ├── 1
│           │   │   └── 2
│           │   ├── labels
│           │   │   └── label_image
│           │   └── tables
│           │       └── regions_table
│           ├── 1
│           │   ├── 0
│           │   │   ├── 0
│           │   │   ├── 1
│           │   │   └── 2
│           │   ├── 1
│           │   │   ├── 0
│           │   │   ├── 1
│           │   │   └── 2
│           │   ├── 2
│           │   │   ├── 0
│           │   │   ├── 1
│           │   │   └── 2
│           │   ├── 3
│           │   │   ├── 0
│           │   │   ├── 1
│           │   │   └── 2
│           │   ├── 4
│           │   │   ├── 0
│           │   │   ├── 1
│           │   │   └── 2
│           │   ├── labels
│           │   │   └── label_image
│           │   └── tables
│           │       └── regions_table
│           ├── 2
│           │   ├── 0
│           │   │   ├── 0
│           │   │   ├── 1
│           │   │   └── 2
│           │   ├── 1
│           │   │   ├── 0
│           │   │   ├── 1
│           │   │   └── 2
│           │   ├── 2
│           │   │   ├── 0
│           │   │   ├── 1
│           │   │   └── 2
│           │   ├── 3
│           │   │   ├── 0
│           │   │   ├── 1
│           │   │   └── 2
│           │   ├── 4
│           │   │   ├── 0
│           │   │   ├── 1
│           │   │   └── 2
│           │   ├── labels
│           │   │   └── label_image
│           │   └── tables
│           │       └── regions_table
│           └── 3
│               ├── 0
│               │   ├── 0
│               │   ├── 1
│               │   └── 2
│               ├── 1
│               │   ├── 0
│               │   ├── 1
│               │   └── 2
│               ├── 2
│               │   ├── 0
│               │   ├── 1
│               │   └── 2
│               ├── 3
│               │   ├── 0
│               │   ├── 1
│               │   └── 2
│               ├── 4
│               │   ├── 0
│               │   ├── 1
│               │   └── 2
│               ├── labels
│               │   └── label_image
│               └── tables
│                   └── regions_table

Both the name of the channel (e.g. DAPI) and the C01 (plus additional info) are saved into the OMERO metadata per field of view.

The number like C01 probably doesn't need to be accessed much (maybe for some fancy inference of what channel it is for illumination correction, but the user can also look that up based on the label (e.g. DAPI) they gave and there original label table.
The label (e.g. DAPI) needs to be in the metadata, but Fractal doesn't use it during computation.

As far as Fractal is concerned, it is handling channel 0, channel 1 or channel 2, as saved in Zarr. As long as the A01_C01 that the user names DAPI is always put into channel 0 during the initial parsing, we just rely on this from there on now.

Hope this explanation helps. If not, let's quickly have a zoom call (I'd be available just now or at 3pm again) :)

@jluethi
Copy link
Collaborator Author

jluethi commented Jun 15, 2022

Bit of added complexity: What if there are more channels? e.g. multiplexing, there is DAPI in every cycle. Or some channels are not imaged in all wells?

There should in my opinion still be a definitive mapping of channel inputs (e.g. A04_C05 in case of a new channel) to an index in Zarr, e.g. now A04_C05 => 3. And Zarr only creates the folders for which channels exist (would need to test with the reader plugin if that can handle missing channels per well).

If there was multiplexing done, the user may have 3 folders, each of them containing

A01_C01
A02_C03
A03_C04

In that case, we could map something like this:

Folder 1:
A01_C01: "1_DAPI"          => 0
A02_C03: "1_nanog"      => 1
A03_C04: "1_Lamin"      => 2

Folder 2:
A01_C01: "2_DAPI"      => 3
A02_C03: "2_Sox17"      => 4
A03_C04: "2_BMP4"      => 5

Folder 3:
A01_C01: "3_DAPI"        => 6
A02_C03: "3_TroponinT"  => 7
A03_C04: "3_GATA4"    => 8

@jluethi
Copy link
Collaborator Author

jluethi commented Jun 15, 2022

Handling missing channels is a bit of an optional use case, let's see if it's easy, otherwise we do it later

@tcompa
Copy link
Collaborator

tcompa commented Jun 15, 2022

After a quick Zoom call with @jluethi, here's a scheme for the first implementation.

Assumptions:

  1. All wells have the same channels (i.e. no wells with missing channels, at the moment).
  2. We only deal with a single folder (i.e. no multiplexing with three folders where A01_C01 maps to different channels). This means that the user's table needs to map a combination of fields A and C to a specific name, e.g. A01_C01 --> DAPI. In a future upgrade, this will be for a combination of folder+A+C.
  3. When a .zarray file mentions the presence of three channels, it doesn't matter whether they are in folders 0,1,2 or 19,23,55. To verify this: scramble names of some channel subfolders in a clean zarr file, and try to load it with from_zarr [more on this in next comment].
  4. Complementary to previous point: make sure that to_zarr acting within the channel folder does not lead to changes in the .zarray file. This is something I don't understand, at the moment. I think that repeating data[channel].to_zarr(PATH/TO/CHANNEL) for every channel will write a .zarray file (for 3D arrays) within each channel subfolder. This is wrong [more on this in next comment].
  5. All channels should be saved in the same zarr files (including those which share the same C01 suffix, but for different A values).
  6. Writing the channel label in the omero metadata is essential (for readability by the user), but irrelevant on the Fractal side. Of course we are still doing it ;)

Scheme:

  1. Filename parsing within create_zarr_structure needs to also read the A field (trivial but necessary change).
  2. Somewher (in create_zarr_structure, I guess), we should import the user's table and assign indices to all relevant channels.
  3. Current use of to_zarr needs to be replaced (everywhere!) with several to_zarr calls (one per channel), because we want to provide custom names to channel folders (19,23,55 instead of 0,1,2).
  4. Every time a list of channels (typically named chl_list) is given as an input for a task, this needs to be with the correct indexing (e.g. 19,23,55, and not 0,1,2), since it will be used to read the appropriate zarr subfolder. With this in-place, changes to tasks like maximum_intensity_projection will be minimal (if any).
  5. If a channel is missing in a certain well (not the current usecase, but something for the future), the corresponding folder won't be created.
  6. Both the original and user-provided channel labels (A01_C01 and DAPI) should be stored in the omero metadata.

Note that, at the moment, the pyramid-creation function remains the same (acting on 4D arrays), since it doesn't care about the channel names/indices.

@tcompa
Copy link
Collaborator

tcompa commented Jun 15, 2022

I think mulitple calls to to_zarr won't work for our goal.

See this example:

import shutil
import dask.array as da


# List of channel indices
list_channels = [1, 22, 33]

# Data to be stored (channel, Z, Y, X)
x = da.ones((3, 1, 200, 200), chunks=(1, 1, 100, 100))

# (1) Create standard zarr
x.to_zarr("a.zarr", dimension_separator="/")
y = da.from_zarr("a.zarr")
print(y.shape)

# (2) Create custom zarr, via shutil.move
x.to_zarr("b.zarr", dimension_separator="/")
for idx, channel in enumerate(list_channels):
    shutil.move(f"b.zarr/{idx}", f"b.zarr/{idx}_TMP")
for idx, channel in enumerate(list_channels):
    shutil.move(f"b.zarr/{idx}_TMP", f"b.zarr/{channel}")
y = da.from_zarr("b.zarr")
print(y.shape)

# (3) Create custom zarr, via multiple to_zarr calls
for idx, channel in enumerate(list_channels):
    x[idx].to_zarr("c.zarr", dimension_separator="/", component=f"{channel}")
y = da.from_zarr("c.zarr")
print(y.shape)

Option 1 works, of course, but it neglects the actual channel indices.
Option 2 works, at the cost of calling shutil here and there.
Option 3 fails, because it creates one .zarray file in each folder c.zarr/1, c.zarr/22, c.zarr/33, and then from_zarr cannot read the 4D array (although we could obviously make it read the 3D arrays with multiple calls to from_zarr).

I think the standard specs require us to keep the .zarray file above the 4D (or possibly 5D, when time index is there) array, so that at the moment option 2 seems the only viable one.

@jluethi
Copy link
Collaborator Author

jluethi commented Jun 16, 2022

Thanks for the explanation @tcompa.
Sounds like Option 2 will be the solution here. Will it be cleaner to call shutil during the writing (benefit: After that task, the Zarr file could already be opened & the metadata added later is just some sugar on top) or to do these operations in a second "wrap up" task?

Idea behind the wrap up task:
We could think about introducing a "wrap up" task at the end of the Zarr parsing, which could take care of these kind of corrections (move all the .zarray files, ensure the final .zarray file at the pyramid level is correct (the individual tasks writing jobs would potentially not know yet how many channels will eventually exist).
Such a task could then also handle more of the metadata parsing to make sure the OME-Zarr gets metadata filled in at the correct location.

@jluethi
Copy link
Collaborator Author

jluethi commented Jun 16, 2022

Going through part of the parsing, a wrap-up metadata parsing task may actually not be feasible, because some metadata needs to be related to the original filenames, thus probably needs to be parsed together with the file reading to be able to match them properly.

@tcompa
Copy link
Collaborator

tcompa commented Jun 16, 2022

Quick information: if channel subfolders are not like 0,1,2 (but rather 0,1,20), napari doesn't read the third channel.

@jluethi
Copy link
Collaborator Author

jluethi commented Jun 16, 2022

Ok. I'd suggest we implement the case were we have continuous channels 0 to n for now. And then worry about missing channels in a well if that use case arises :)

@tcompa
Copy link
Collaborator

tcompa commented Jun 16, 2022

I'll start working on this, in dev-channels.

@tcompa
Copy link
Collaborator

tcompa commented Jun 16, 2022

A first version is almost ready in dev-channels (see commits in https://github.com/fractal-analytics-platform/mwe_fractal/commits/dev-channels). At the moment I only updated:

  • create_zarr_structure.py
  • yokogawa_to_zarr.py
  • replicate_zarr_structure_mip.py
  • maximum_intensity_projection.py

@jluethi, could you provide a set of realistic labels?
I'm currently using /data/active/fractal/3D/PelkmansLab/CardiacMultiplexing/Cycle1_testSubset/, where channels are A01_C01, A01_C02 and A02_C03. At the moment I just harcoded some fake labels, and the omero metadata read

$ tail -n19 /data/active/fractal/tests/Temporary_data_UZH_1_well_2x2_sites_newchannels/20200812-CardiomyocyteDifferentiation14-Cycle1.zarr/B/03/0/.zattrs 
    "omero": [
        {
            "channels": [
                {
                    "label": "label_A01_C01"
                },
                {
                    "label": "label_A01_C02"
                },
                {
                    "label": "label_A02_C03"
                }
            ],
            "id": "TBD",
            "name": "TBD",
            "version": "TBD"
        }
    ]
}

@jluethi
Copy link
Collaborator Author

jluethi commented Jun 16, 2022

Hey @tcompa Here is some example omero metadata I wrote for this dataset:

    "omero":
      {
        "id": 1,
        "name": "TBD",
        "version": "0.4",
        "channels": [
            {
                "active": true,
                "coefficient": 1,
                "color": "00FFFF",
                "family": "linear",
                "inverted": false,
                "label": "DAPI",
                "window": {
                    "end": 600,
                    "max": 65535,
                    "min": 0,
                    "start": 110
                }
            },
            {
                "active": true,
                "coefficient": 1,
                "color": "FF00FF",
                "family": "linear",
                "inverted": false,
                "label": "nanog",
                "window": {
                    "end": 200,
                    "max": 65535,
                    "min": 0,
                    "start": 115
                }
            },
            {
                "active": true,
                "coefficient": 1,
                "color": "FFFF00",
                "family": "linear",
                "inverted": false,
                "label": "Lamin B1",
                "window": {
                    "end": 1000,
                    "max": 65535,
                    "min": 0,
                    "start": 115
                }
            }
        ]
      }

You can take over just the subset of the information that is relevant here, e.g. just the label, not the window and other parts.
(window is a very nice parameter we can parse later that defines default rescaling in the napari viewer, thus makes it appear nicer on initial loading. Combined with color that defines how the layers are actually displayed)

The matching would be:
A01_C01 => "DAPI"
A01_C02 => "nanog"
A02_C03 => "Lamin B1"

@tcompa
Copy link
Collaborator

tcompa commented Jun 16, 2022

Note to future self: true and false are JSON keywords, and writing them from our python script is a bit annoying. We'll start by ignoring inverted and active, and think about it later.

@tcompa
Copy link
Collaborator

tcompa commented Jun 16, 2022

Channel labels (written together with the other omero metadata) are read correctly within napari. That's a good check.

Screenshot from 2022-06-16 14-55-23

The next step I have on the list is to move from hard-coded channel information to something user-provided. And then a couple more tasks need to be updated to the new channel scheme.

@tcompa
Copy link
Collaborator

tcompa commented Jun 16, 2022

(this is all under the no-missing-channel assumption)

@jluethi
Copy link
Collaborator Author

jluethi commented Jun 16, 2022

On the Pelkmans lab side, we always worked under a "no missing channels" assumption in our processing in the past, so while we could think about making this more flexible at some point, it wouldn't be a step back for us.
I talked quickly with Nicole and it sounds like the Liberalis also don't have any common use cases where this assumption would break. (@gusqgm or does one come to mind for you?)

Thus, I would suggest we accept the "no missing channels" limitation and maybe even include a check for it for the moment such that the parsing would complain if channels are missing. If we ever need to support those use cases later, we can come back to this, but isn't a priority goal from my side.

@tcompa
Copy link
Collaborator

tcompa commented Jun 17, 2022

Update on the current status:

The user should provide a dictionary like this:

    dict_channels = {
        "A01_C01": dict(label="DAPI", colormap="00FFFF", start=110),
        "A01_C02": dict(label="nanog", colormap="FF00FF", start=115),
        "A02_C03": dict(label="Lamin B1", colormap="FFFF00", start=115),
    }

At the moment this dictionary should be complete, but then we can obviously set some default choices for colormap and start so that the minimal input will be like

    dict_channels = {
        "A01_C01": {"label": DAPI"},
        "A01_C02": {"label": "nanog"},
        "A02_C03": {"label": "Lamin B1"},
    }

The first task (create_zarr_structure, at the moment, or its multi-fov version) checks which channels are actually present in the dataset (all of them, in our current test, but in principle some of them may not be there) and produces a table like

actual_channels = [
  [0, 'A01_C01', 'DAPI']
  [1, 'A01_C02', 'nanog']
  [2, 'A02_C03', 'Lamin B1']
]

Let's not focus on the structure (first column is probably redundant, and maybe list of lists can be simplified), but on the fact that this table contains the list of all channels which are present in the dataset. The assumption is that they are also present for each well, and this is explicitly verified (throwing an exception otherwise).

From now on, all tasks receive actual_channels as an input (whenever needed). Note that often the important input is simply num_channels = len(actual_channels).

Channels are always sorted according to this order, in all wells, and upon writing the zarr files subfolders are named 0,1,2,...

A few things still need to be streamlined, but this implementation is almost ready. Does such a scheme seem reasonable? Comments are welcome :)

@jluethi
Copy link
Collaborator Author

jluethi commented Jun 17, 2022

The scheme looks very reasonable to me for parsing the data in.

The assumption is that they are also present for each well, and this is explicitly verified (throwing an exception otherwise).

That's great!

The only thing I'm not sure about is:

all tasks receive actual_channels as an input (whenever needed)

This information is part of the Zarr file once it's written, no? So there would be no need to pass it again and find places to save it in between (/ read it from file again).
num_channels is the number of folders per pyramid level, index 0 of each channel list are the folder names, index 2 is saved as part of OMERO metadata and we can think about saving index 1 as metadata as well (though no task should need that specific information, e.g. the illumination correction task will probably need slightly different metadata that we are parsing.

At the moment this dictionary should be complete, but then we can obviously set some default choices

Great! Making arguments optional eventually is a good idea, because we can e.g. create a task that infers start based on the image data.
Also, if we have start as an argument, let's also have end as an input argument here, because then the user fully specifies the display (start, end & colormap describe the display settings).

@tcompa
Copy link
Collaborator

tcompa commented Jun 17, 2022

The scheme looks very reasonable to me for parsing the data in.

The assumption is that they are also present for each well, and this is explicitly verified (throwing an exception otherwise).

Thanks. Then I'll move on and finalize it as much as possible, aiming for pushing to main on Monday.

The only thing I'm not sure about is:

all tasks receive actual_channels as an input (whenever needed)

This information is part of the Zarr file once it's written, no? So there would be no need to pass it again and find places to save it in between (/ read it from file again). num_channels is the number of folders per pyramid level, index 0 of each channel list are the folder names, index 2 is saved as part of OMERO metadata and we can think about saving index 1 as metadata as well (though no task should need that specific information, e.g. the illumination correction task will probably need slightly different metadata that we are parsing.

Agreed, this structure is highly redundant (it was the very first try). What is clearly not needed is:

  • The 0,1,2,.. indices, provided that either (A) there is a sorted list somewhere or (B) we already made sure that subfolder names are consistent across all wells. And (B) is exactly what we are achieving now.
  • The labels (DAPI, nanog, ..), which are stored in the omero metadata once and for all and which are irrelevant on the Fractal side from then on.

At the moment we still need the sorted list ['A01_C01', 'A01_C02', 'A02_C03'] in the yokogawa_to_zarr task (where we want to match images via their A/C fields). Note that this isn't needed for other processing tasks (like the MIP one), where only the number of channels matter (and I agree: if it is only the number of channels we can just retrieve it on-the-fly).

Two questions:

  1. @jluethi, do you propose to store the channel names (A01_C01, ...) somewhere in the Zarr file and then retrieve it from there when needed (e.g. in yokogawa_to_zarr)? Should it be in some custom file or should it be a custom key-value in the omero metadata (next to label, we could have our own channel_name)?
  2. What about illumination_correction? I think that we will need to match correction-matrix filenames with channel names (things like A01_C01), and then it would still be useful to have the information about ['A01_C01', 'A01_C02', 'A02_C03']` (either by having the list as a task input, or by storing this information somewhere in the zarr).

@jluethi
Copy link
Collaborator Author

jluethi commented Jun 17, 2022

At the moment we still need the sorted list ['A01_C01', 'A01_C02', 'A02_C03'] in the yokogawa_to_zarr task

How about for the moment, both the initialization task and the yokogawa_to_zarr task get the dict_channels dictionary? (or e.g. a path to a json file that stores that dictionary)
We can then think down the road on how exactly the user provides this information and that both tasks read it from there? Or think about having a wrapper-task / meta-task that performs multiple tasks like init & yokogawa_to_zarr & more metadata parsing that just gets this as an input :)

@jluethi, do you propose to store the channel names (A01_C01, ...) somewhere in the Zarr file and then retrieve it from there when needed (e.g. in yokogawa_to_zarr)? Should it be in some custom file or should it be a custom key-value in the omero metadata (next to label, we could have our own channel_name)?

Hmm, I don't know how the omero part deals with custom metadata keys, but that is another place we could put it. Not 100% sure how we'd save this information (i.e. A01_C01 may not be specific enough for all downstream uses of channel information) => could be a start, but not yet complete.

What about illumination_correction?

That is actually slightly more complicated. A01_C01 describes a specific channel for a specific experiment, but depending on some microscope settings & user inputs, A01_C01 in one experiment is not the same wavelength of light & filter set at the microscope as A01_C01 in a second experiment. So we can't directly match on that, we will need additional metadata for direct matching.
=> For the moment, I think the user should specify the match of illumination correction matrix to channel manually via some input, we can think about parsing additional metadata to create reasonable defaults, but we will never know 100% what the correct illumination profile would be for a given experiment.

@jluethi jluethi moved this from Ready to In Progress (current sprint) in Fractal Project Management Jun 17, 2022
@tcompa
Copy link
Collaborator

tcompa commented Jun 20, 2022

Current status (as of b3fd38e)

What is there:

  1. create_zarr_structure correctly reads a JSON file with a dictionary of the allowed channels and their properties (or the hard-coded version of this dictionary);
  2. create_zarr_structure iterates through folders to find plates (possibly with the same name, and then getting suffixes _1, _2, ..), but at the moment it does not support multiplexing (that is, it cannot combine different folders into a single plate);
  3. create_zarr_structure verifies that all plates have the same set of channels (that is, no missing channels) and that channels are present in the user-provided dictionary of allowed channels;
  4. yokogawa_to_zarr receives a list like ['A01_C01', 'A01_C02', 'A02_C03'] [@jluethi at the moment this version seems cleaner than your suggestion, since in that case we would need to replicate all the steps to identify channels and match them to the user-provided ones (which is already done in create_zarr_structure);
  5. maximum_illumination_correction doesn't require any channel-related argument, since it just reads the number of channels from the shape of the array.

What is not there:

  • We should provide some heuristics for the default colormap/start/end parameters for each channel;
  • We may want to store the list ['A01_C01', 'A01_C02', 'A02_C03'] somewhere, to be used whenever necessary within tasks (still to be decided where to store it);
  • create_zarr_structure_multifov should have the same features as in create_zarr_structure --> to do;
  • yokogawa_to_zarr_multifov should have the same structure as in yokogawa_to_zarr --> to do;
  • fractal_cmd currently doesn't support applying yokogawa_to_zarr (or its multifov version) to plates coming from different folders --> to do;

Just for completeness, illumination_correction requires independent updates, which (for the moment) will make the channel disambiguation of little relevance - since we are planning to provide very explicit information about the correction-matrix paths.

For me this version is good enough to be merged into main.

tcompa added a commit that referenced this issue Jun 20, 2022
Deterministic channel assignment across wells/plates [ref #61]
@tcompa
Copy link
Collaborator

tcompa commented Jun 20, 2022

With 386569c, a minimal working example is available in examples/example_uzh_1_well_2x2_sites.sh (notice the updated wf_params_uzh_10_well_5x5_sites.json file, and the new wf_params_uzh_10_well_5x5_sites_channels.json file).

@jluethi
Copy link
Collaborator Author

jluethi commented Jun 20, 2022

@tcompa I ran the 2x2 test case and it ran through successfully and added the Omero metadata. I then tried the 10wel, 5x5 site test case. It also runs through, but doesn't save any omero metadata to the .zattrs. Any idea what could be going wrong? I gave them the same channels.json file, so that shouldn't make a difference. I have pushed it to main (there was no wf_params_uzh_10_well_5x5_sites_channels.json so far, so I used the wf_params_uzh_1_well_2x2_sites_channels.json)

tcompa added a commit that referenced this issue Jun 21, 2022
@tcompa
Copy link
Collaborator

tcompa commented Jun 21, 2022

@tcompa I ran the 2x2 test case and it ran through successfully and added the Omero metadata. I then tried the 10wel, 5x5 site test case. It also runs through, but doesn't save any omero metadata to the .zattrs. Any idea what could be going wrong? I gave them the same channels.json file, so that shouldn't make a difference. I have pushed it to main (there was no wf_params_uzh_10_well_5x5_sites_channels.json so far, so I used the wf_params_uzh_1_well_2x2_sites_channels.json)

Thanks for testing! It should work now.

EDIT: here's the final MIP, with napari correctly reading channel labels:

Screenshot from 2022-06-21 09-54-41

tcompa added a commit that referenced this issue Jun 21, 2022
@jluethi
Copy link
Collaborator Author

jluethi commented Jun 21, 2022

@tcompa Yes, also works on my side now. Really cool to see the channels parsed this way and rescaled nicely from the start!
Let's close the issue with this, the problem of non-deterministic channel assignment has been solved.

We can refactor later to decide how we pass channel lists in detail, how we handle multiplexing with this and refactor how the inputs are originally provided (i.e. do we stay with one json file per task or start with a combined json file setup)

@jluethi jluethi closed this as completed Jun 21, 2022
Repository owner moved this from In Progress (current sprint) to Done in Fractal Project Management Jun 21, 2022
@tcompa
Copy link
Collaborator

tcompa commented Jun 21, 2022

Here is well B/07 of FMI plate 210305NAR005AAN (after the MIP), correctly showing the channel labels:

Screenshot from 2022-06-21 13-45-16

@tcompa
Copy link
Collaborator

tcompa commented Jul 27, 2022

Note to future self: true and false are JSON keywords, and writing them from our python script is a bit annoying. We'll start by ignoring inverted and active, and think about it later.

This was actually much simpler than we thought back then, see

import json
d = {1: True}
with open("/tmp/out.json", "w") as f:
    json.dump(d, f)

leads to

$ cat /tmp/out.json 
{"1": true}

We were probably confused by other things, since this is actually trivial (as it should be).

@jluethi jluethi moved this from Done to Done Archive in Fractal Project Management Oct 5, 2022
jacopo-exact pushed a commit that referenced this issue Nov 16, 2022
dev to main (require fractal-tasks-core 0.1.4)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working High Priority Current Priorities & Blocking Issues
Projects
None yet
Development

No branches or pull requests

2 participants