This repository was archived by the owner on Sep 11, 2023. It is now read-only.

"Big new design" for nowcasting_dataset #213

Closed
20 of 38 tasks
JackKelly opened this issue Oct 8, 2021 · 34 comments · Fixed by #278 or #307

@JackKelly
Member

JackKelly commented Oct 8, 2021

Detailed Description

Over the last few weeks, we've come up with a bunch of ideas for how to simplify nowcasting_dataset.

This issue exists to keep track of all of those "new design" issues, to discuss how these individual issues hang together into a single coherent design, and to discuss how to implement this new design in a sequence of easy-to-digest chunks :)

The plan:

In terms of sequencing this work, I'm now thinking I'll do something along the lines of this:

First, do some stand-alone "preparatory" work. Specifically:

Implement and write tests for some of the functions in the draft design in the GitHub comment below. (for now, these functions won't actually be called in the code):

Then, the big one (where the code size will hopefully shrink a lot!):

Remove:

  • All of dataset/datamodule.py
  • dataset.datasets.NowcastingDataset
  • dataset.datasets.worker_init_fn()
  • All the batch_to_dataset and dataset_to_batch stuff
  • All the to_numpy and from_numpy stuff (assuming we can go straight from xr.Dataset to torch.Tensor)

Related issues

A bit more context

I think I've made our lives far harder than they need to be by trying to support two different use-cases:

  1. Loading training data on-the-fly from multiple Zarr files during ML training; and
  2. Pre-preparing batches.

I think we can make life way easier by dropping support for use-case 1 :)

Here's the broad proposal that it'd be great to discuss:

We drop support for loading data directly from Zarr on-the-fly during ML training (which we haven't done for months, and - now that we're using large NWP images - it would be far too slow). nowcasting_dataset becomes laser-focused on pre-preparing batches (just as we use it now).

This allows us to completely rip out PyTorch from nowcasting_dataset (#86); and enables each "modality" to stay in a single data type throughout nowcasting_dataset (#209), e.g. satellite data stays in an xr.Dataset. Each modality would be processed concurrently in separate processes, and would be output into different directories (e.g. train/satellite/ and train/nwp/) (#202).

Inspired by and making use of @peterdudfield's Pydantic PR (#195), we'd have a formal schema for the data structures in nowcasting_dataset (#211).

The ultimate aim is to simplify the code (I'm lazy!), whilst keeping all the useful functionality, and making the code easier to extend & maintain 🙂

Of course, we'll still use a pytorch dataloader to load the pre-prepared batches off disk into an ML model. But that's fine, and should work in a very similar (maybe identical?) fashion to how it works now 🙂
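For reference, here is a minimal sketch of what that dataloader side could look like. This is not the real nowcasting_dataloader code; the directory layout, the filename pattern and the "data" variable name are assumptions for illustration only.

# Rough sketch (assumptions: one NetCDF file per batch per modality, named e.g.
# train/satellite/000000.nc, with a numeric variable called "data").
from pathlib import Path

import torch
import xarray as xr
from torch.utils.data import Dataset


class PrePreparedBatchDataset(Dataset):
    """Each __getitem__ returns one whole pre-prepared batch (one NetCDF file per modality)."""

    def __init__(self, data_path: Path, modalities: list[str]):
        self.data_path = Path(data_path)
        self.modalities = modalities
        # Assume every modality directory holds the same number of batch files.
        self.n_batches = len(list((self.data_path / modalities[0]).glob("*.nc")))

    def __len__(self) -> int:
        return self.n_batches

    def __getitem__(self, batch_idx: int) -> dict[str, torch.Tensor]:
        batch = {}
        for modality in self.modalities:
            filename = self.data_path / modality / f"{batch_idx:06d}.nc"
            dataset = xr.load_dataset(filename)  # Eagerly load and close the file.
            # Go straight from xarray to a torch tensor (assumes a numeric dtype).
            batch[modality] = torch.from_numpy(dataset["data"].values)
        return batch


# Because each file already holds a full batch, the DataLoader would use batch_size=None:
# train_dataset = PrePreparedBatchDataset("train/", modalities=["satellite", "nwp"])
# train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=None, num_workers=4)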

I certainly can't claim to have thought this all through properly! And everything's up for discussion, of course!

@JackKelly JackKelly added the enhancement (New feature or request) and refactoring labels Oct 8, 2021
@JackKelly
Member Author

JackKelly commented Oct 8, 2021

In terms of the order in which to do this work... here's some initial thoughts (very much up for discussion!)

  1. Merge Issue/166 batch pydantic #195 into main DONE
  2. Simplify the calculation of available datetimes across all DataSources #204 [newly added]
  3. Tweak the way DataSources are represented in the Configuration model #217 [newly added]
  4. Use independent processes for each "modality" #202
  5. Remove PyTorch from the code #86
  6. Can we simplify the code by always keeping the data in one data type (e.g. xr.DataArray) per modality? #209
  7. Machine-readable schema & validator for xarray.Dataset #211

The thinking being that we can't keep each modality in a single datatype until we've removed pytorch (because pytorch requires us to move data from xarray to numpy to tensors); which we can't do until we've removed the pytorch dataloader; and the best way to remove the pytorch dataloader is to implement #202 :) Does that sound roughly right?!

@peterdudfield
Contributor

Great work @JackKelly putting this all together. I think this will be super useful for making a really solid base for the ML problems we are trying to solve.

I'm slightly biased by 1., but yeah, I would do 1. first. #195
2. #202 is then a good next logical step. I would be tempted to split this into two tasks:
A. Change the current save-to-NetCDF code to save several files. B. Change the core code so that it can run in different processes. Doing A first will allow us to remake datasets more easily, and just copy the ones that are the same from previous versions.
3. & 4. Perhaps these can be done at the same time. I feel like doing #209 first gives us a quicker win, and makes things simpler.
5. Then #211 will be a nice bonus on the end.

I'll just share my thoughts on how #209 can be done so that data ends up as numpy / tensor values:
[Screenshot (2021-10-09): design sketch of how #209 could be done]

@jacobbieker
Member

Yeah, great work! I also think the outline here for moving forward makes sense. For #86, I think that can mostly come as we refactor the code: we remove PyTorch as we see it. I think I agree with @peterdudfield that we can probably do 3 and 4 at the same time.

With the focus on removing the on-the-fly loading, would we be moving the PyTorch dataloader that does currently load the prepared batches to somewhere else, like nowcasting_utils? Or would we be leaving it in here? I'd generally think keeping it in nowcasting_dataset makes it easier to keep it tied to how the data is being created, but then we would still have some dependency on PyTorch in this repo.

@peterdudfield
Contributor

With the focus on removing the on-the-fly loading, would we be moving the PyTorch dataloader that does currently load the prepared batches to somewhere else, like nowcasting_utils? Or would we be leaving it in here?

I like the idea of leaving it here too; somehow it keeps it wrapped up nicely. I wonder if we could make it an optional requirement. That means people don't have to install PyTorch, but they do need to install it to use the data loader. And it keeps everything here.
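For illustration, PyTorch could be declared as an optional dependency via a setuptools "extra". A minimal sketch (the extra name and dependency lists are illustrative, not the project's actual setup.py):

from setuptools import find_packages, setup

setup(
    name="nowcasting_dataset",
    packages=find_packages(),
    install_requires=["numpy", "pandas", "xarray"],  # core dependencies only
    extras_require={
        # Only needed by users who want the PyTorch dataloader:
        "torch": ["torch"],
    },
)

Users who want the dataloader would then install with pip install "nowcasting_dataset[torch]"; everyone else gets a PyTorch-free install.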

@jacobbieker
Member

Yeah, I think an optional requirement would be the way to go with it

@JackKelly
Member Author

JackKelly commented Oct 11, 2021

Here's a draft design sketch for the end-goal, once all the issues listed above are done:

To recap: one of the main aims is to significantly simplify nowcasting_dataset (maybe reducing the total lines of code to something like half the current size!). Some specific features of this new design:

  • Data for each modality flows through nowcasting_dataset completely independently of other modalities. As such, the data can stay in one data type throughout the code (e.g. xr.Dataset), and the data never needs to leave its DataSource.
  • Each modality will be saved to a different directory, concurrently

This makes it easy to:

  • run each data source concurrently.
  • run multiple processes per data source (by splitting positions_of_each_batch into, say, quarters and sending each quarter to a different process; see the sketch after this list).
  • start processing at any batch number (e.g. if processing crashed part-way-through and needs to re-start).
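As a rough illustration of that second bullet, here is a minimal sketch (not real nowcasting_dataset code; the function name and the number of processes are made up) of splitting the batch positions across several worker processes for a single data source:

from concurrent import futures

import numpy as np

N_PROCESSES_PER_DATA_SOURCE = 4


def prepare_batches_in_parallel(data_source, positions_of_each_batch, dst_path):
    # Split on the batch index (level 0 of the MultiIndex) so each chunk contains whole batches.
    batch_indexes = positions_of_each_batch.index.get_level_values(0).unique()
    chunks_of_batch_indexes = np.array_split(batch_indexes, N_PROCESSES_PER_DATA_SOURCE)

    with futures.ProcessPoolExecutor() as executor:
        for chunk_of_batch_indexes in chunks_of_batch_indexes:
            positions_for_chunk = positions_of_each_batch.loc[chunk_of_batch_indexes]
            executor.submit(
                data_source.prepare_batches, positions_for_chunk, dst_path, overwrite=False)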

Once this is implemented, I think we may be able to literally delete several large chunks of code:

  • All of dataset/datamodule.py
  • dataset.datasets.NowcastingDataset
  • dataset.datasets.worker_init_fn()
  • All the batch_to_dataset and dataset_to_batch stuff
  • All the to_numpy and from_numpy stuff (assuming we can go straight from xr.Dataset to torch.Tensor)

prepare_ml_data.py rough sketch

# Instantiate the `DataSource` objects requested by the user at the command line and/or config YAML file.
# The first data_source will be used to define the geospatial location of each example.
# We maybe want to make a few small changes to the Configuration model:
# - include a list of DataSources.
# - maybe split the config up so each data source has its own section, so that entire
#   section can be passed into the DataSource object.  e.g. there would be an `NWP` section, 
#   with fields zarr_path, image_size, and channels.
data_sources = get_data_sources(config)

# If necessary, pre-compute three `positions_of_each_example.csv` files:
# One file for `train/`, one for `test/`, and one for `validation/`.
# This CSV file will have four columns: example_number, t0_datetime, center_x, center_y.
# (We might want to use a binary file rather than CSV.  
# But CSV is nice because it's super-easy to inspect manually).
# (These `positions_of_each_example.csv` files could be re-used when we prepare different versions of
# the pre-prepared batches, to minimise variation between different datasets).
filenames_of_positions_of_each_example: dict[SplitName, Pathy] = get_filenames_of_positions_of_each_example(
    config.output_data.filepath)

if not exists(filenames_of_positions_of_each_example[SplitName.TRAIN]):
    compute_and_save_positions_of_each_example_of_each_split(
        data_sources,  # data_sources[0] defines the geospatial location of each example.
        split_method=config.process.split_method,
        n_examples_per_split=config.process.n_examples_per_split,  # (exact config field name TBD)
        dst_path=config.output_data.filepath)

# Generate the batches.  Something like this:
for split_name, filename_of_positions_of_each_example in filenames_of_positions_of_each_example.items():
    dst_path = config.output_data.filepath / split_name
    positions_of_each_example = pd.read_csv(filename_of_positions_of_each_example)
    positions_for_each_batch = group_positions_of_each_example_into_batches(
        positions_of_each_example, config.process.batch_size)

    with futures.ProcessPoolExecutor() as executor:
        for data_source in data_sources:
            executor.submit(data_source.prepare_batches, positions_for_each_batch, dst_path, overwrite=False)

validate_ml_data()  # ?

compute_and_save_positions_of_each_example_of_each_split() rough sketch

def compute_and_save_positions_of_each_example_of_each_split(
    data_sources: list[DataSource],  # data_sources[0] defines the geospatial location of each example.
    split_method: str,
    n_examples_per_split: dict[SplitName, int],
    dst_path: Pathy
) -> None:
    # Get the intersection of all available t0_datetimes.  Currently done by NowcastingDataModule._get_datetimes():
    # github.com/openclimatefix/nowcasting_dataset/blob/main/nowcasting_dataset/dataset/datamodule.py#L364
    t0_datetimes_for_all_data_sources = [data_source.get_t0_datetimes() for data_source in data_sources]
    intersection_of_t0_datetimes = nd_time.intersection_of_datetimeindexes(t0_datetimes_for_all_data_sources)
    
    # Split t0_datetimes into train, test and validation sets (being careful to ensure each group is
    # at least `total_seq_duration` apart).  Currently done by NowcastingDataModule._split_data():
    # github.com/openclimatefix/nowcasting_dataset/blob/main/nowcasting_dataset/dataset/datamodule.py#L315
    t0_datetimes_per_split: dict[SplitName, pd.DatetimeIndex] = split_datetimes(
        intersection_of_t0_datetimes, method=split_method)
    
    for split_name, t0_datetimes_for_split in t0_datetimes_per_split.items():
        n_examples = n_examples_per_split[split_name]
        positions = compute_positions_of_each_example(t0_datetimes_for_split, data_sources[0], n_examples)
        filename = dst_path / split_name / 'positions_of_each_example.csv'
        positions.to_csv(filename)


def compute_positions_of_each_example(
    t0_datetimes_for_split: pd.DatetimeIndex, 
    data_source: DataSource, 
    n_examples: int
) -> pd.DataFrame:
    
    positions = pd.DataFrame(index=range(n_examples), columns=['t0_datetime', 'x_center', 'y_center'])
    sampled_t0_datetimes = np.random.choice(t0_datetimes_for_split, size=n_examples)
    for example_i, t0_datetime in enumerate(sampled_t0_datetimes):
        x_center, y_center = data_source.get_geo_location_for_t0_datetime(t0_datetime)
        positions.iloc[example_i] = dict(t0_datetime=t0_datetime, x_center=x_center, y_center=y_center)
    
    return positions
    

DataSource.prepare_batches() rough sketch

def prepare_batches(
    self, 
    positions_of_each_batch: pd.DataFrame, 
    dst_path: Pathy,
    overwrite: bool = False,
    temp_path: Optional[Pathy] = None,
    upload_every_n_batches: Optional[int] = None
) -> None:
    """
    Args:
        positions_of_each_batch: DataFrame with MultiIndex of the batch_index and the example_index.
            The columns are: t0_datetime, x_center, y_center
        dst_path: The path to save to.  Can start with '/' (for local filesystem) or 's3://' or 'gs://'.
        overwrite: If True then overwrite existing batches on disk.
    """
    
    self.open()  # Need to open Zarr stores in each process.
    
    dst_path = dst_path / self.name  # Where `name` is a string like 'NWP' or 'satellite'.
    if not overwrite:
        indices_of_existing_batches = self._find_existing_batches(dst_path)
        _LOG.info(f"Found {len(indices_of_existing_batches)} existing batches in {dst_path}.")
        positions_of_each_batch = positions_of_each_batch.drop(indices_of_existing_batches)

    for _batch_index, positions_for_batch in positions_of_each_batch.groupby(level=0):
        batch = self.get_batch(positions_for_batch)
        # TODO: Logic to save upload_every_n_batches to temp_path, then upload.  But only if on the cloud!
        self.save(batch, dst_path)

Comments more than welcome!

@JackKelly
Member Author

JackKelly commented Oct 11, 2021

(UPDATE: Plan moved to top of this thread!)

Does that sound OK, @peterdudfield ?

@peterdudfield
Contributor

Think that sounds good. I would move the test writing from 2. to 1.

@JackKelly
Member Author

I've got the afternoon free for coding so I'm going to make a start on the plan in the comment above... Unless that will clash with your work, @peterdudfield ?

@JackKelly JackKelly self-assigned this Oct 12, 2021
@peterdudfield
Contributor

go for it

@peterdudfield
Contributor

Just to share, this is my nice dream of how the ML models use the data

[Screenshot (2021-10-12): diagram of how the ML models would use the data]

JackKelly added a commit that referenced this issue Oct 12, 2021
…es(). Change README so it reflects the fact that we're no longer supporting on-the-fly loading. Tests pass. Still need to simplify DataModule._get_datetimes() and remove fill_30_minutes_timestamps_to_5_minutes(). #204  #213
@JackKelly
Member Author

I love the idea of re-using stuff!

But, a few thoughts and questions about doing x = Batch(**x) inside forward(x):

  1. Does converting x from a dict to a Batch help make our code more readable? My expectation is that, within forward(x), x will be a Dict[str, Tensor]. So, as is the case now, we'll be able to write human-readable code like x['satellite'] to get the satellite data.
  2. I'd be a bit worried that this will force PyTorch to move the data from GPU memory (when x is a dict[str, Tensor]), to CPU system memory (when x is a Batch and is being validated by Pydantic), and back to GPU memory for training. That may slow training down because the PCIe bus is way slower than the GPU's memory bus.
  3. Pydantic's validation logic may slow down the training loop. I'd advocate for doing all our data validation when we prepare the training batches. Then, during the training loop, we should just trust that the data has already been validated 🙂 . I agree that's perhaps not as belt-and-braces as validating the data multiple times, but hopefully it's sufficient to validate the data once (when we pre-prepare the ML batches).

What do you think?
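To make the pattern under discussion concrete, here is a purely illustrative sketch of x = Batch(**x) inside forward(). The Batch fields and the model are made up for this example; this is not the real nowcasting_dataset Batch.

import pydantic
import torch


class Batch(pydantic.BaseModel):
    satellite: torch.Tensor
    nwp: torch.Tensor

    class Config:
        # pydantic can't validate torch.Tensor natively, so just isinstance-check it.
        arbitrary_types_allowed = True


class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(in_features=16, out_features=1)

    def forward(self, x: dict[str, torch.Tensor]) -> torch.Tensor:
        x = Batch(**x)  # Checks the field names (and runs any validators) on every forward pass.
        # Attribute access (x.satellite) instead of string keys (x['satellite']):
        return self.linear(x.satellite)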

@peterdudfield
Contributor

  1. Ah, that's where our thoughts seem to be different. I think we should try and move away from creating 'Dicts' where we don't know the entry fields, hence use a 'Pydantic' object. But maybe this is being overly ambitious.
  2. Yeah, agreed, we don't want to be doing that; we'd probably have to test that.
  3. we can turn off validation pretty easily.
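On point 3, for reference: pydantic (v1) can skip validation entirely via construct(). A minimal sketch, using a made-up model:

import pydantic


class ExamplePosition(pydantic.BaseModel):
    t0_datetime_utc: str
    x_center: float
    y_center: float


# Normal construction - runs validation:
validated = ExamplePosition(t0_datetime_utc="2021-10-12T12:00", x_center=1.0, y_center=2.0)

# construct() skips validation completely - it trusts the caller, so it's much cheaper.
# This could be used in the training loop once batches have been validated at preparation time.
unvalidated = ExamplePosition.construct(t0_datetime_utc="2021-10-12T12:00", x_center=1.0, y_center=2.0)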

@JackKelly
Member Author

I think we should try and move away from creating 'Dicts' where we don't know the entry fields, hence use a 'Pydantic' object.

Ah, yes, I agree - that would be a nice thing to be able to do, and is a really good idea! (as long as we can make sure Pydantic doesn't slow down the training loop 🙂 )

@peterdudfield
Contributor

I'll try and put together a test

@JackKelly
Member Author

That would be really useful, thanks loads! 🙂

I'm sure you've thought about this already but, just for the sake of being overly cautious: in order to show a measurable latency, the test probably needs to use batches which are of similar size to our "real" batches, and probably needs to use an ML model which is similarly sized to our "real" ML models. And it will probably need to train on a bunch of batches (a single batch might be cached in weird ways).


@peterdudfield
Contributor

peterdudfield commented Oct 12, 2021

https://github.com/openclimatefix/nowcasting_dataset/blob/peter/explore/notebooks/2021-10/2021-10-12/pydantic_ml_test.py

Not using pydantic is slightly faster (but significantly so, according to a t-test).

I ran this code 10 times:
3 times the mean durations, over 50 batches, were the same;
7 times the mean durations, over 50 batches, were different - Dict always faster.

Dict was faster by about ~0.1% of the running time.


So it's slower, but it's a debate about how much slowdown is OK in exchange for easier readability of the code.
My feeling is the small difference in times would be a small percentage for a big model, and the slightly better readability is worth it.

@JackKelly
Member Author

Sure, sounds good, thanks loads for doing the test! That's super-useful. (Yay for empirically testing stuff!) Let's go for it!

@peterdudfield
Contributor

Here's a useful method to convert an xr.Dataset to a dictionary of torch tensors
[Screenshot (2021-10-13): code snippet converting an xr.Dataset to a dictionary of torch tensors]
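Since the screenshot doesn't survive in text form, here is a rough sketch of that kind of conversion (not necessarily the exact method shown; it assumes every data variable is numeric):

import torch
import xarray as xr


def xr_dataset_to_torch(dataset: xr.Dataset) -> dict[str, torch.Tensor]:
    """Convert every data variable of an xr.Dataset into a torch.Tensor."""
    return {
        name: torch.from_numpy(data_array.values)
        for name, data_array in dataset.data_vars.items()
    }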

@peterdudfield
Contributor

peterdudfield commented Oct 13, 2021

** move to #209

Yea I will branch off.

Yeah, I think in 'main':

  1. I will rename Satellite to SatelliteML,
  2. Strip out unnecessary 'to_numpy' code, and maybe do some tidying
  3. Make Satellite as described above.
  4. Change current 'Batch' to 'BatchML', and make 'Batch'

My feeling is this can be done in parallel to the plan comment in #213

I think pydantic validation runs whenever something is changed, i.e. to make sure the user is doing correct things with the object.

@JackKelly
Member Author

JackKelly commented Oct 13, 2021

Cool beans, all sounds great! Thank you! I'm genuinely excited to see nowcasting_dataset really mature and become well-designed and easy to modify - thank you for all your work!

(edit: I'll copy this comment to #209!)

@jacobbieker
Member

Just because I got somewhat lost on this, and it relates to how I make openclimatefix/nowcasting_dataloader#2 work, are the time coordinates stored as datetimes in the prepared batches? Just ints? Same with the x and y coordinates?

@peterdudfield
Contributor


Yeah, I think that's the safest way to deal with them at the moment

@jacobbieker
Member

jacobbieker commented Oct 13, 2021


As just ints? And if so, are they just seconds since epoch?

@JackKelly
Member Author

JackKelly commented Oct 13, 2021

Hehe, yeah, we should document precisely what's going to go into the NetCDF files... I'll start a new issue to create that document (UPDATE: #227)... (the shapes etc. will be described in the pydantic models... but it might be more human-readable to also document the contents of the NetCDF files in a human-readable file somewhere, too... If only - as Peter mentioned the other day - to help define the interface between nowcasting_dataset and nowcasting_dataloader)

As just ints? And if so, are they just seconds since epoch?

In main at the moment, I think all the datetimes are stored as ints (seconds since the epoch (1970)) in the NetCDF files.

Once all the stuff in this issue is implemented, I think the NetCDF files will have "proper" datetimes in them. And, IIUC, those datetimes will get converted to ints by nowcasting_dataloader (because I don't think tensors can hold datetimes?) before the data gets into PyTorch. Is that right, @peterdudfield ?
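For reference, converting between datetime64 values and int seconds since the Unix epoch is straightforward in pandas; a minimal sketch:

import pandas as pd

datetimes = pd.DatetimeIndex(["2021-10-13 09:40", "2021-10-13 09:45"])

# datetime64[ns] -> int seconds since 1970-01-01 (as currently stored in the NetCDF files):
seconds_since_epoch = datetimes.astype("int64") // 10**9

# ...and back to "proper" datetimes:
recovered_datetimes = pd.to_datetime(seconds_since_epoch, unit="s")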

@jacobbieker
Member

Yeah, tensors can't hold datetimes, but having them as datetimes makes it a bit easier to compute the time-of-day, day-of-year, etc. features for the positional encodings; otherwise, I'd probably end up converting them back to datetimes to do so.
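A rough sketch of the kind of datetime-derived features being described (not the actual nowcasting_dataloader code):

import numpy as np
import pandas as pd


def cyclical_datetime_features(datetimes: pd.DatetimeIndex) -> dict[str, np.ndarray]:
    """Compute sin/cos time-of-day and day-of-year features for positional encodings."""
    seconds_into_day = (
        datetimes.hour * 3600 + datetimes.minute * 60 + datetimes.second
    ).to_numpy()
    day_of_year = datetimes.dayofyear.to_numpy()
    return {
        "time_of_day_sin": np.sin(2 * np.pi * seconds_into_day / (24 * 3600)),
        "time_of_day_cos": np.cos(2 * np.pi * seconds_into_day / (24 * 3600)),
        "day_of_year_sin": np.sin(2 * np.pi * day_of_year / 366),
        "day_of_year_cos": np.cos(2 * np.pi * day_of_year / 366),
    }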

@JackKelly
Member Author

If you can wait a few days, hopefully we'll have modified nowcasting_dataset to put "proper" datetimes into the NetCDF files :)

@jacobbieker
Member

Yeah! Sounds good, I'll get it working assuming that

@flowirtz flowirtz moved this to In Progress in Nowcasting Oct 15, 2021
Repository owner moved this from In Progress to Done in Nowcasting Oct 25, 2021
@JackKelly JackKelly reopened this Oct 26, 2021
Repository owner moved this from Done to In Progress in Nowcasting Oct 26, 2021
This was linked to pull requests Oct 28, 2021
@JackKelly JackKelly mentioned this issue Oct 29, 2021 (30 tasks)
Repository owner moved this from In Progress to Done in Nowcasting Nov 2, 2021