This repository was archived by the owner on Sep 11, 2023. It is now read-only.

Big new design Part 2 :) #307

Merged
merged 39 commits into from
Nov 2, 2021

Conversation

JackKelly
Member

@JackKelly JackKelly commented Oct 29, 2021

Pull Request

Description

This is quite a big PR, sorry, because it's plugging together the new code!

Most of the file changes are there to get the code to pass the linter! (Sorry, I really should've done that in a separate PR; I hadn't understood how much stricter the new linter config would be!)

Broadly implements an updated version of the design first sketched out in #213 (comment)

Also implements / fixes some other issues that were blocking this PR.

The bulk of this PR does several things (sorry that not all of them are strictly related to implementing issue #213!):

  • Implement a new Manager class which looks after multiple DataSources
  • Pre-prepare spatial_and_temporal_locations_of_each_example.csv files.
  • Read in the spatial_and_temporal_locations_of_each_example.csv files and start separate processes for each DataSource to prepare batches.
  • Implement Manager.create_batches() and Manager._get_first_batches_to_create()
  • Fix some of the linter errors raised by our new (and very good!) pre-commit checks (in retrospect, I should've fixed these as a separate PR, sorry!)
  • Simplify filesystem.utils.get_maximum_batch_id(), and use glob instead of ls to be sure that we're searching *.nc instead of *.
  • Simplify utils.get_netcdf_filename(batch_idx)
  • Should filenames of batches start with leading zeros, like 00001.nc? #308
  • Rename filesystem.utils.make_folder() to makedirs() to be consistent with fsspec. And simplify.
  • Remove all of dataset/datamodule.py
  • Remove all of dataset/datasets.py
  • Remove all the n_timesteps_per_batch stuff. Done in commit f896a5e
  • Get prepare_ml_data.py running!
  • Implement DataSource.check_directories() and call it from DataSource.__init__ (Implement DataSource.check_paths_exist() for all DataSources. #306)
  • Update tests.
  • Remove batch_to_dataset and dataset_to_batch
  • Search through the code for TODOs associated with issues associated with this big design change.
  • Close all associated issues (there are quite a few!)
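To illustrate the two simplified helpers from the list above, here is a hypothetical sketch. The real functions live in filesystem/utils.py and utils.py and go through fsspec (so they also work on cloud buckets); this version uses pathlib on a local filesystem purely for illustration, and the 5-digit zero-padding width is just the one proposed in #308:

```python
from pathlib import Path


def get_netcdf_filename(batch_idx: int) -> str:
    """Zero-pad batch indices so filenames sort lexicographically, e.g. 1 -> '00001.nc'."""
    return f"{batch_idx:05d}.nc"


def get_maximum_batch_id(directory: Path) -> int:
    """Return the highest batch index among *.nc files in `directory`, or -1 if none.

    Globbing for `*.nc` (rather than listing everything) means stray files such as
    `spatial_and_temporal_locations_of_each_example.csv` are ignored.
    """
    batch_ids = [int(path.stem) for path in directory.glob("*.nc")]
    return max(batch_ids, default=-1)
```

The `-1` sentinel lets callers resume batch creation with `get_maximum_batch_id(...) + 1` whether or not any batches exist yet.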

How Has This Been Tested?

  • No
  • Yes

Checklist:

  • My code follows OCF's coding style guidelines
  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked my code and corrected any misspellings

@JackKelly JackKelly self-assigned this Oct 29, 2021
@JackKelly JackKelly changed the base branch from main to jack/big-new-design October 29, 2021 10:24
@JackKelly JackKelly linked an issue Oct 29, 2021 that may be closed by this pull request
38 tasks
@JackKelly JackKelly added the enhancement New feature or request label Oct 29, 2021
@JackKelly
Member Author

Hi @jacobbieker, OK, I think this new PR is just about ready for review. It's still definitely a draft, and there are a bunch of things still to do (please see the tick-list at the top of this PR conversation), but please do take a look and comment on the broad shape of the thing 🙂

Member

@jacobbieker jacobbieker left a comment


I like how it looks, I think the design is good!

zipped = list(zip(t0_datetimes, x_locations, y_locations))
batch_size = len(t0_datetimes)

with futures.ThreadPoolExecutor(max_workers=batch_size) as executor:
Member


Nice!
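The snippet being praised here can be sketched as a self-contained example. `get_example` below is a hypothetical stand-in for the real DataSource.get_example(), which loads satellite / PV / NWP data for one (t0, x, y) triple:

```python
from concurrent import futures


def get_example(t0_datetime, x_location, y_location) -> dict:
    # Hypothetical stand-in for DataSource.get_example(); in the real code
    # this reads data for one (time, location) pair from disk or the network.
    return {"t0": t0_datetime, "x": x_location, "y": y_location}


def get_batch(t0_datetimes, x_locations, y_locations) -> list:
    """Fetch all examples of a batch concurrently, one thread per example."""
    zipped = list(zip(t0_datetimes, x_locations, y_locations))
    batch_size = len(t0_datetimes)

    # One worker per example: loading examples is I/O-bound, so threads
    # overlap the waiting for data rather than the computation.
    with futures.ThreadPoolExecutor(max_workers=batch_size) as executor:
        example_futures = [executor.submit(get_example, *args) for args in zipped]
        # Collecting results in submission order keeps examples aligned
        # with the input t0_datetimes.
        return [future.result() for future in example_futures]
```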

Contributor

@peterdudfield peterdudfield left a comment


Great work @JackKelly.

You can really see the fruits of your labour when lots of things simplify nicely.

I've added quite a few minor comments, and the only major comment is https://github.com/openclimatefix/nowcasting_dataset/pull/307/files#r740808191, but perhaps I'm missing something there

I like that you left a few TODOs in there that aren't too critical. It's good to break up the PR with these.

@JackKelly JackKelly mentioned this pull request Nov 2, 2021
4 tasks
@JackKelly
Member Author

OK! I'm going to go ahead and merge this now. All the tests pass, and prepare_ml_data.py runs on my little laptop-like machine (pulling data from leonardo over a 1-gig LAN). It's not as fast as I'd like, but I suspect that's because it's running on a little machine and pulling data over the LAN! I'll benchmark on leonardo when leonardo isn't quite so busy 🙂 If it's still too slow then I'll implement #311. And if that still isn't fast enough, I'll re-implement the sample-twice-from-each-disk-load thing.

I need to fix #325 but that shouldn't stop the script from being used.
