This repository was archived by the owner on Sep 11, 2023. It is now read-only.

"Big new design" for nowcasting_dataset #213

Closed
20 of 38 tasks
JackKelly opened this issue Oct 8, 2021 · 34 comments · Fixed by #278 or #307

@JackKelly
Member

JackKelly commented Oct 8, 2021

Detailed Description

Over the last few weeks, we've come up with a bunch of ideas for how to simplify nowcasting_dataset.

This issue exists to keep track of all of those "new design" issues, to discuss how these individual issues hang together into a single coherent design, and to discuss how to implement this new design in a sequence of easy-to-digest chunks :)

The plan:

In terms of sequencing this work, I'm now thinking I'll do something along the lines of this:

First, do some stand-alone "preparatory" work. Specifically:

Implement and write tests for some of the functions in the draft design in the GitHub comment below. (for now, these functions won't actually be called in the code):

Then, the big one (where the code size will hopefully shrink a lot!):

Remove:

  • All of dataset/datamodule.py
  • dataset.datasets.NowcastingDataset
  • dataset.datasets.worker_init_fn()
  • All the batch_to_dataset and dataset_to_batch stuff
  • All the to_numpy and from_numpy stuff (assuming we can go straight from xr.Dataset to torch.Tensor)

Related issues

A bit more context

I think I've made our lives far harder than they need to be by trying to support two different use-cases:

  1. Loading training data on-the-fly from multiple Zarr files during ML training; and
  2. Pre-preparing batches.

I think we can make life way easier by dropping support for use-case 1 :)

Here's the broad proposal that it'd be great to discuss:

We drop support for loading data directly from Zarr on-the-fly during ML training (which we haven't done for months, and - now that we're using large NWP images - it would be far too slow). nowcasting_dataset becomes laser-focused on pre-preparing batches (just as we use it now).

This allows us to completely rip out PyTorch from nowcasting_dataset (#86); and enables each "modality" to stay in a single data type throughout nowcasting_dataset (#209), e.g. satellite data stays in an xr.Dataset. Each modality would be processed concurrently in separate processes, and would be output into different directories (e.g. train/satellite/ and train/nwp/) (#202).

Inspired by and making use of @peterdudfield's Pydantic PR (#195), we'd have a formal schema for the data structures in nowcasting_dataset (#211).

The ultimate aim is to simplify the code (I'm lazy!), whilst keeping all the useful functionality, and making the code easier to extend & maintain 🙂

Of course, we'll still use a pytorch dataloader to load the pre-prepared batches off disk into an ML model. But that's fine, and should work in a very similar (maybe identical?) fashion to how it works now 🙂
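For reference, here is a minimal sketch of what that dataloader side could look like. This is not the real nowcasting_dataloader code; the directory layout, the filename pattern and the "data" variable name are assumptions for illustration only.

# Rough sketch (assumptions: one NetCDF file per batch per modality, named e.g.
# train/satellite/000000.nc, with a numeric variable called "data").
from pathlib import Path

import torch
import xarray as xr
from torch.utils.data import Dataset


class PrePreparedBatchDataset(Dataset):
    """Each __getitem__ returns one whole pre-prepared batch (one NetCDF file per modality)."""

    def __init__(self, data_path: Path, modalities: list[str]):
        self.data_path = Path(data_path)
        self.modalities = modalities
        # Assume every modality directory holds the same number of batch files.
        self.n_batches = len(list((self.data_path / modalities[0]).glob("*.nc")))

    def __len__(self) -> int:
        return self.n_batches

    def __getitem__(self, batch_idx: int) -> dict[str, torch.Tensor]:
        batch = {}
        for modality in self.modalities:
            filename = self.data_path / modality / f"{batch_idx:06d}.nc"
            dataset = xr.load_dataset(filename)  # Eagerly load and close the file.
            # Go straight from xarray to a torch tensor (assumes a numeric dtype).
            batch[modality] = torch.from_numpy(dataset["data"].values)
        return batch


# Because each file already holds a full batch, the DataLoader would use batch_size=None:
# train_dataset = PrePreparedBatchDataset("train/", modalities=["satellite", "nwp"])
# train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=None, num_workers=4)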

I certainly can't claim to have thought this all through properly! And everything's up for discussion, of course!

@JackKelly JackKelly added the enhancement (New feature or request) and refactoring labels Oct 8, 2021
@JackKelly
Member Author

JackKelly commented Oct 8, 2021

In terms of the order in which to do this work... here's some initial thoughts (very much up for discussion!)

  1. Merge Issue/166 batch pydantic #195 into main DONE
  2. Simplify the calculation of available datetimes across all DataSources #204 [newly added]
  3. Tweak the way DataSources are represented in the Configuration model #217 [newly added]
  4. Use independent processes for each "modality" #202
  5. Remove PyTorch from the code #86
  6. Can we simplify the code by always keeping the data in one data type (e.g. xr.DataArray) per modality? #209
  7. Machine-readable schema & validator for xarray.Dataset #211

The thinking being that we can't keep each modality in a single datatype until we've removed pytorch (because pytorch requires us to move data from xarray to numpy to tensors); which we can't do until we've removed the pytorch dataloader; and the best way to remove the pytorch dataloader is to implement #202 :) Does that sound roughly right?!

@peterdudfield
Contributor

Great work @JackKelly putting this all together. I think this will be super useful for making a really solid base for the ML problems we are trying to solve.

I'm slightly biased by 1., but yeah, I would do 1. first. #195
2. #202 is then a good next logical step. I would be tempted to split this into two tasks:
A. Change the current save-to-NetCDF code to save several files. B. Change the core code so that it can run in different processes. Doing A first will allow us to remake datasets more easily, and just copy the ones that are the same from previous versions.
3. & 4. Perhaps these can be done at the same time. I feel like doing #209 first gives us a quicker win, and makes things simpler.
5. Then #211 will be a nice bonus on the end.

I'll just share my thoughts on how #209 can be done so that data ends up as numpy / tensor values:
[Screenshot (2021-10-09): design sketch of how #209 could be done]

@jacobbieker
Member

Yeah, great work! I also think the outline here for moving forward makes sense. For #86, I think that can mostly come as we refactor the code: we remove PyTorch as we see it. I think I agree with @peterdudfield that we can probably do 3 and 4 at the same time.

With the focus on removing the on-the-fly loading, would we be moving the PyTorch dataloader that does currently load the prepared batches to somewhere else, like nowcasting_utils? Or would we be leaving it in here? I'd generally think keeping it in nowcasting_dataset makes it easier to keep it tied to how the data is being created, but then we would still have some dependency on PyTorch in this repo.

@peterdudfield
Contributor

With the focus on removing the on-the-fly loading, would we be moving the PyTorch dataloader that does currently load the prepared batches to somewhere else, like nowcasting_utils? Or would we be leaving it in here?

I like the idea of leaving it here too; somehow it keeps it wrapped up nicely. I wonder if we could make it an optional requirement. That means people don't have to install PyTorch, but they do need to install it to use the data loader. And it keeps everything here.
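For illustration, PyTorch could be declared as an optional dependency via a setuptools "extra". A minimal sketch (the extra name and dependency lists are illustrative, not the project's actual setup.py):

from setuptools import find_packages, setup

setup(
    name="nowcasting_dataset",
    packages=find_packages(),
    install_requires=["numpy", "pandas", "xarray"],  # core dependencies only
    extras_require={
        # Only needed by users who want the PyTorch dataloader:
        "torch": ["torch"],
    },
)

Users who want the dataloader would then install with pip install "nowcasting_dataset[torch]"; everyone else gets a PyTorch-free install.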

@jacobbieker
Member

Yeah, I think an optional requirement would be the way to go with it

@JackKelly
Member Author

JackKelly commented Oct 11, 2021

Here's a draft design sketch for the end-goal, once all the issues listed above are done:

To recap: one of the main aims is to significantly simplify nowcasting_dataset (maybe reducing the total lines of code to something like half the current size!). Some specific features of this new design:

  • Data for each modality flows through nowcasting_dataset completely independently of other modalities. As such, the data can stay in one data type throughout the code (e.g. xr.Dataset), and the data never needs to leave its DataSource.
  • Each modality will be saved to a different directory, concurrently

This makes it easy to:

  • run each data source concurrently.
  • run multiple processes per data source (by splitting positions_of_each_batch into, say, quarters and sending each quarter to a different process; see the sketch after this list).
  • start processing at any batch number (e.g. if processing crashed part-way-through and needs to re-start).
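As a rough illustration of that second bullet, here is a minimal sketch (not real nowcasting_dataset code; the function name and the number of processes are made up) of splitting the batch positions across several worker processes for a single data source:

from concurrent import futures

import numpy as np

N_PROCESSES_PER_DATA_SOURCE = 4


def prepare_batches_in_parallel(data_source, positions_of_each_batch, dst_path):
    # Split on the batch index (level 0 of the MultiIndex) so each chunk contains whole batches.
    batch_indexes = positions_of_each_batch.index.get_level_values(0).unique()
    chunks_of_batch_indexes = np.array_split(batch_indexes, N_PROCESSES_PER_DATA_SOURCE)

    with futures.ProcessPoolExecutor() as executor:
        for chunk_of_batch_indexes in chunks_of_batch_indexes:
            positions_for_chunk = positions_of_each_batch.loc[chunk_of_batch_indexes]
            executor.submit(
                data_source.prepare_batches, positions_for_chunk, dst_path, overwrite=False)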

Once this is implemented, I think we may be able to literally delete several large chunks of code:

  • All of dataset/datamodule.py
  • dataset.datasets.NowcastingDataset
  • dataset.datasets.worker_init_fn()
  • All the batch_to_dataset and dataset_to_batch stuff
  • All the to_numpy and from_numpy stuff (assuming we can go straight from xr.Dataset to torch.Tensor)

prepare_ml_data.py rough sketch

# Instantiate the `DataSource` objects requested by the user at the command line and/or config YAML file.
# The first data_source will be used to define the geospatial location of each example.
# We maybe want to make a few small changes to the Configuration model:
# - include a list of DataSources.
# - maybe split the config up so each data source has its own section, so that entire
#   section can be passed into the DataSource object.  e.g. there would be an `NWP` section, 
#   with fields zarr_path, image_size, and channels.
data_sources = get_data_sources(config)

# If necessary, pre-compute three `positions_of_each_example.csv` files:
# One file for `train/`, one for `test/`, and one for `validation/`.
# This CSV file will have four columns: example_number, t0_datetime, center_x, center_y.
# (We might want to use a binary file rather than CSV.  
# But CSV is nice because it's super-easy to inspect manually).
# (These `positions_of_each_example.csv` files could be re-used when we prepare different versions of
# the pre-prepared batches, to minimise variation between different datasets).
filenames_of_positions_of_each_example: dict[SplitName, Pathy] = get_filenames_of_positions_of_each_example(
    config.output_data.filepath)

if not exists(filenames_of_positions_of_each_example[SplitName.TRAIN]):
    compute_and_save_positions_of_each_example_of_each_split(
        data_sources,  # data_sources[0] defines the geospatial location of each example.
        split_method=config.process.split_method,
        n_examples_per_split=config.process.n_examples_per_split,  # (exact config field name TBD)
        dst_path=config.output_data.filepath)

# Generate the batches.  Something like this:
for split_name, filename_of_positions_of_each_example in filenames_of_positions_of_each_example.items():
    dst_path = config.output_data.filepath / split_name
    positions_of_each_example = pd.read_csv(filename_of_positions_of_each_example)
    positions_for_each_batch = group_positions_of_each_example_into_batches(
        positions_of_each_example, config.process.batch_size)

    with futures.ProcessPoolExecutor() as executor:
        for data_source in data_sources:
            executor.submit(data_source.prepare_batches, positions_for_each_batch, dst_path, overwrite=False)

validate_ml_data()  # ?

compute_and_save_positions_of_each_example_of_each_split() rough sketch

def compute_and_save_positions_of_each_example_of_each_split(
    data_sources: list[DataSource],  # data_sources[0] defines the geospatial location of each example.
    split_method: str,
    n_examples_per_split: dict[SplitName, int],
    dst_path: Pathy
) -> None:
    # Get the intersection of all available t0_datetimes.  Currently done by NowcastingDataModule._get_datetimes():
    # github.com/openclimatefix/nowcasting_dataset/blob/main/nowcasting_dataset/dataset/datamodule.py#L364
    t0_datetimes_for_all_data_sources = [data_source.get_t0_datetimes() for data_source in data_sources]
    intersection_of_t0_datetimes = nd_time.intersection_of_datetimeindexes(t0_datetimes_for_all_data_sources)
    
    # Split t0_datetimes into train, test and validation sets (being careful to ensure each group is
    # at least `total_seq_duration` apart).  Currently done by NowcastingDataModule._split_data():
    # github.com/openclimatefix/nowcasting_dataset/blob/main/nowcasting_dataset/dataset/datamodule.py#L315
    t0_datetimes_per_split: dict[SplitName, pd.DatetimeIndex] = split_datetimes(
        intersection_of_t0_datetimes, method=split_method)
    
    for split_name, t0_datetimes_for_split in t0_datetimes_per_split.items():
        n_examples = n_examples_per_split[split_name]
        positions = compute_positions_of_each_example(t0_datetimes_for_split, data_sources[0], n_examples)
        filename = dst_path / split_name / 'positions_of_each_example.csv'
        positions.to_csv(filename)


def compute_positions_of_each_example(
    t0_datetimes_for_split: pd.DatetimeIndex, 
    data_source: DataSource, 
    n_examples: int
) -> pd.DataFrame:
    
    positions = pd.DataFrame(index=range(n_examples), columns=['t0_datetime', 'x_center', 'y_center'])
    sampled_t0_datetimes = np.random.choice(t0_datetimes_for_split, size=n_examples)
    for example_i, t0_datetime in enumerate(sampled_t0_datetimes):
        x_center, y_center = data_source.get_geo_location_for_t0_datetime(t0_datetime)
        positions.iloc[example_i] = dict(t0_datetime=t0_datetime, x_center=x_center, y_center=y_center)
    
    return positions
    

DataSource.prepare_batches() rough sketch

def prepare_batches(
    self, 
    positions_of_each_batch: pd.DataFrame, 
    dst_path: Pathy,
    overwrite: bool = False,
    temp_path: Optional[Pathy] = None,
    upload_every_n_batches: Optional[int] = None
) -> None:
    """
    Args:
        positions_of_each_batch: DataFrame with MultiIndex of the batch_index and the example_index.
            The columns are: t0_datetime, x_center, y_center
        dst_path: The path to save to.  Can start with '/' (for local filesystem) or 's3://' or 'gs://'.
        overwrite: If True then overwrite existing batches on disk.
    """
    
    self.open()  # Need to open Zarr stores in each process.
    
    dst_path = dst_path / self.name  # Where `name` is a string like 'NWP' or 'satellite'.
    if not overwrite:
        indices_of_existing_batches = self._find_existing_batches(dst_path)
        _LOG.info(f"Found {len(indices_of_existing_batches)} existing batches in {dst_path}.")
        positions_of_each_batch = positions_of_each_batch.drop(indices_of_existing_batches)

    for _batch_index, positions_for_batch in positions_of_each_batch.groupby(level=0):
        batch = self.get_batch(positions_for_batch)
        # TODO: Logic to save upload_every_n_batches to temp_path, then upload.  But only if on the cloud!
        self.save(batch, dst_path)

Comments more than welcome!

@JackKelly
Member Author

JackKelly commented Oct 11, 2021

(UPDATE: Plan moved to top of this thread!)

Does that sound OK, @peterdudfield ?

@peterdudfield
Contributor

Think that sounds good. I would move the test writing from 2. to 1.

@JackKelly
Member Author

I've got the afternoon free for coding so I'm going to make a start on the plan in the comment above... Unless that will clash with your work, @peterdudfield ?

@JackKelly JackKelly self-assigned this Oct 12, 2021
@peterdudfield
Contributor

go for it

@peterdudfield
Contributor

Just to share, this is my nice dream of how the ML models use the data

[Screenshot (2021-10-12): diagram of how the ML models would use the data]

JackKelly added a commit that referenced this issue Oct 12, 2021
…es(). Change README so it reflects the fact that we're no longer supporting on-the-fly loading. Tests pass. Still need to simplify DataModule._get_datetimes() and remove fill_30_minutes_timestamps_to_5_minutes(). #204  #213
@JackKelly
Member Author

I love the idea of re-using stuff!

But, a few thoughts and questions about doing x = Batch(**x) inside forward(x):

  1. Does converting x from a dict to a Batch help make our code more readable? My expectation is that, within forward(x), x will be a Dict[str, Tensor]. So, as is the case now, we'll be able to write human-readable code like x['satellite'] to get the satellite data.
  2. I'd be a bit worried that this will force PyTorch to move the data from GPU memory (when x is a dict[str, Tensor]), to CPU system memory (when x is a Batch and is being validated by Pydantic), and back to GPU memory for training. That may slow training down because the PCIe bus is way slower than the GPU's memory bus.
  3. Pydantic's validation logic may slow down the training loop. I'd advocate for doing all our data validation when we prepare the training batches. Then, during the training loop, we should just trust that the data has already been validated 🙂 . I agree that's perhaps not as belt-and-braces as validating the data multiple times, but hopefully it's sufficient to validate the data once (when we pre-prepare the ML batches).

What do you think?
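To make the pattern under discussion concrete, here is a purely illustrative sketch of x = Batch(**x) inside forward(). The Batch fields and the model are made up for this example; this is not the real nowcasting_dataset Batch.

import pydantic
import torch


class Batch(pydantic.BaseModel):
    satellite: torch.Tensor
    nwp: torch.Tensor

    class Config:
        # pydantic can't validate torch.Tensor natively, so just isinstance-check it.
        arbitrary_types_allowed = True


class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(in_features=16, out_features=1)

    def forward(self, x: dict[str, torch.Tensor]) -> torch.Tensor:
        x = Batch(**x)  # Checks the field names (and runs any validators) on every forward pass.
        # Attribute access (x.satellite) instead of string keys (x['satellite']):
        return self.linear(x.satellite)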

@peterdudfield
Contributor

  1. Ah, that's where our thoughts seem to be different. I think we should try and move away from creating 'Dicts' where we don't know the entry fields, hence use a 'Pydantic' object. But maybe this is being overly ambitious.
  2. Yeah, agreed, we don't want to be doing that; we'd probably have to test that.
  3. we can turn off validation pretty easily.
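On point 3, for reference: pydantic (v1) can skip validation entirely via construct(). A minimal sketch, using a made-up model:

import pydantic


class ExamplePosition(pydantic.BaseModel):
    t0_datetime_utc: str
    x_center: float
    y_center: float


# Normal construction - runs validation:
validated = ExamplePosition(t0_datetime_utc="2021-10-12T12:00", x_center=1.0, y_center=2.0)

# construct() skips validation completely - it trusts the caller, so it's much cheaper.
# This could be used in the training loop once batches have been validated at preparation time.
unvalidated = ExamplePosition.construct(t0_datetime_utc="2021-10-12T12:00", x_center=1.0, y_center=2.0)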

@JackKelly
Member Author

I think we should try and move away from creating 'Dicts' where we don't know the entry fields, hence use a 'Pydantic' object.

Ah, yes, I agree - that would be a nice thing to be able to do, and is a really good idea! (as long as we can make sure Pydantic doesn't slow down the training loop 🙂 )

@peterdudfield
Contributor

I'll try and put together a test

@JackKelly
Member Author

That would be really useful, thanks loads! 🙂

I'm sure you've thought about this already but, just for the sake of being overly cautious: in order to show a measurable latency, the test probably needs to use batches which are of similar size to our "real" batches, and probably needs to use an ML model which is similarly sized to our "real" ML models. And it will probably need to train on a bunch of batches (a single batch might be cached in weird ways).


@peterdudfield
Contributor

peterdudfield commented Oct 12, 2021

https://github.com/openclimatefix/nowcasting_dataset/blob/peter/explore/notebooks/2021-10/2021-10-12/pydantic_ml_test.py

Not using pydantic is slightly faster (but significantly so, according to a t-test).

I ran this code 10 times:
3 times the mean durations, over 50 batches, were the same;
7 times the mean durations, over 50 batches, were different - Dict always faster.

Dict was faster by about ~0.1% of the running time.


So it's slower, but it's a debate about how much slowdown is OK in exchange for easier readability of the code.
My feeling is the small difference in times would be a small percentage for a big model, and the slightly better readability is worth it.

@JackKelly
Member Author

Sure, sounds good, thanks loads for doing the test! That's super-useful. (Yay for empirically testing stuff!) Let's go for it!

@peterdudfield
Contributor

Here's a useful method to convert an xr.Dataset to a dictionary of torch tensors
[Screenshot (2021-10-13): code snippet converting an xr.Dataset to a dictionary of torch tensors]
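Since the screenshot doesn't survive in text form, here is a rough sketch of that kind of conversion (not necessarily the exact method shown; it assumes every data variable is numeric):

import torch
import xarray as xr


def xr_dataset_to_torch(dataset: xr.Dataset) -> dict[str, torch.Tensor]:
    """Convert every data variable of an xr.Dataset into a torch.Tensor."""
    return {
        name: torch.from_numpy(data_array.values)
        for name, data_array in dataset.data_vars.items()
    }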

@peterdudfield
Contributor

peterdudfield commented Oct 13, 2021

** move to #209

Yea I will branch off.

Yeah, I think in 'main':

  1. I will rename Satellite to SatelliteML,
  2. Strip out unnecessary 'to_numpy' code, and maybe do some tidying
  3. Make Satellite as described above.
  4. Change current 'Batch' to 'BatchML', and make 'Batch'

My feeling is this can be done in parallel to the plan comment in #213

I think pydantic validation runs whenever something is changed, i.e. to make sure the user is doing correct things with the object.

@JackKelly
Member Author

JackKelly commented Oct 13, 2021

Cool beans, all sounds great! Thank you! I'm genuinely excited to see nowcasting_dataset really mature and become well-designed and easy to modify - thank you for all your work!

(edit: I'll copy this comment to #209!)

@jacobbieker
Member

Just because I got somewhat lost on this, and it relates to how I make openclimatefix/nowcasting_dataloader#2 work, are the time coordinates stored as datetimes in the prepared batches? Just ints? Same with the x and y coordinates?

@peterdudfield
Contributor


Yeah, I think that's the safest way to deal with them at the moment

@jacobbieker
Member

jacobbieker commented Oct 13, 2021


As just ints? And if so, are they just seconds since epoch?

@JackKelly
Member Author

JackKelly commented Oct 13, 2021

Hehe, yeah, we should document precisely what's going to go into the NetCDF files... I'll start a new issue to create that document (UPDATE: #227)... (the shapes etc. will be described in the pydantic models... but it might be more human-readable to also document the contents of the NetCDF files in a human-readable file somewhere, too... If only - as Peter mentioned the other day - to help define the interface between nowcasting_dataset and nowcasting_dataloader)

As just ints? And if so, are they just seconds since epoch?

In main at the moment, I think all the datetimes are stored as ints (seconds since the epoch (1970)) in the NetCDF files.

Once all the stuff in this issue is implemented, I think the NetCDF files will have "proper" datetimes in them. And, IIUC, those datetimes will get converted to ints by nowcasting_dataloader (because I don't think tensors can hold datetimes?) before the data gets into PyTorch. Is that right, @peterdudfield ?
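For reference, converting between datetime64 values and int seconds since the Unix epoch is straightforward in pandas; a minimal sketch:

import pandas as pd

datetimes = pd.DatetimeIndex(["2021-10-13 09:40", "2021-10-13 09:45"])

# datetime64[ns] -> int seconds since 1970-01-01 (as currently stored in the NetCDF files):
seconds_since_epoch = datetimes.astype("int64") // 10**9

# ...and back to "proper" datetimes:
recovered_datetimes = pd.to_datetime(seconds_since_epoch, unit="s")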

@jacobbieker
Member

Yeah, tensors can't hold datetimes, but having them as datetimes makes it a bit easier to compute the time-of-day, day-of-year, etc. features for the positional encodings; otherwise, I'd probably end up converting them back to datetimes to do so.
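A rough sketch of the kind of datetime-derived features being described (not the actual nowcasting_dataloader code):

import numpy as np
import pandas as pd


def cyclical_datetime_features(datetimes: pd.DatetimeIndex) -> dict[str, np.ndarray]:
    """Compute sin/cos time-of-day and day-of-year features for positional encodings."""
    seconds_into_day = (
        datetimes.hour * 3600 + datetimes.minute * 60 + datetimes.second
    ).to_numpy()
    day_of_year = datetimes.dayofyear.to_numpy()
    return {
        "time_of_day_sin": np.sin(2 * np.pi * seconds_into_day / (24 * 3600)),
        "time_of_day_cos": np.cos(2 * np.pi * seconds_into_day / (24 * 3600)),
        "day_of_year_sin": np.sin(2 * np.pi * day_of_year / 366),
        "day_of_year_cos": np.cos(2 * np.pi * day_of_year / 366),
    }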

@JackKelly
Member Author

If you can wait a few days, hopefully we'll have modified nowcasting_dataset to put "proper" datetimes into the NetCDF files :)

@jacobbieker
Member

Yeah! Sounds good, I'll get it working assuming that

@flowirtz flowirtz moved this to In Progress in Nowcasting Oct 15, 2021
Repository owner moved this from In Progress to Done in Nowcasting Oct 25, 2021
@JackKelly JackKelly reopened this Oct 26, 2021
Repository owner moved this from Done to In Progress in Nowcasting Oct 26, 2021
This was linked to pull requests Oct 28, 2021
@JackKelly JackKelly mentioned this issue Oct 29, 2021 (30 tasks)
Repository owner moved this from In Progress to Done in Nowcasting Nov 2, 2021