-
-
Notifications
You must be signed in to change notification settings - Fork 6
Conversation
…n prepare_ml_batches.py. Renamed DataSourceList to Manager. Started fleshing out Manager class.
…f_each_example.csv for each split
Hi @jacobbieker OK, I think this new PR is just about ready for review. It's still definitely a draft, and there are a bunch of things still to do (please see the tick-list at the top of this PR conversation) but please do take a look at comment on the broad shape of the thing 🙂 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like how it looks, I think the design is good!
zipped = list(zip(t0_datetimes, x_locations, y_locations)) | ||
batch_size = len(t0_datetimes) | ||
|
||
with futures.ThreadPoolExecutor(max_workers=batch_size) as executor: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work @JackKelly.
You can really see the fruits of your labour when lots of things simplify nicely.
I've added quite a few minor comments, and the only major comment is https://github.com/openclimatefix/nowcasting_dataset/pull/307/files#r740808191, but perhaps I'm missing something there
Ill that you left a few TODOs in there which are too critical. Good to break up the PR with these
OK! I'm going to go ahead and merge this now. All the tests pass and I need to fix #325 but that shouldn't stop the script from being used. |
Pull Request
Description
This is quite a big PR, sorry, because it's plugging together the new code!
The majority of the file changes are to get the changes to pass the linter! (Sorry, I really should've done that in a separate PR, if I'd understood how much more strict the new linter config would be!)
Broadly implements an updated version of the design first sketched out in #213 (comment)
Also implements / fixes some other issues which were blocking this PR:
load_solar_pv_data_from_gcs()
should usefsspec
and hence be able to load data from any compute environment #286split()
function #299DataSourceList
intoManager
; and maintain DataSources in adict
instead of a list? #298prepare_ml_data.py
#171The main bulk of this PR does several things (sorry that not all of these are strictly related to implementing issue #213!)
Manager
class which looks after multipleDataSources
spatial_and_temporal_locations_of_each_example.csv
files.spatial_and_temporal_locations_of_each_example.csv
files and start separate processes for eachDataSource
to prepare batches.Manager.create_batches()
andManager._get_first_batches_to_create()
filesystem.utils.get_maximum_batch_id()
, and useglob
instead ofls
to be sure that we're searching*.nc
instead of*
.utils.get_netcdf_filename(batch_idx)
00001.nc
? #308filesystem.utils.make_folder()
tomakedirs()
to be consistent with fsspec. And simplify.dataset/datamodule.py
dataset/datasets.py
n_timesteps_per_batch
stuff. Done in commit f896a5eprepare_ml_data.py
running!DataSource.check_directories()
and call fromDataSource.__init__
ImplementDataSource.check_paths_exist()
for all DataSources. #306batch_to_dataset
anddataset_to_batch
How Has This Been Tested?
Checklist: