[ENH] EXPERIMENTAL PR: D1 and D2 layer for v2 refactor #1811


Open · wants to merge 10 commits into main

Conversation

phoeenniixx

Description

This PR implements the basic skeleton of the D1 and D2 layers for v2.

@phoeenniixx (Author)

FYI @fkiraly @agobbifbk, I have removed the imputation, as I found only one existing mechanism related to handling missing data: constant_fill_strategy, which handles only missing timesteps and does not handle NaNs.
Also see: https://pytorch-forecasting.readthedocs.io/en/latest/faq.html#creating-datasets

According to the FAQ, NaNs should be handled beforehand.
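For reference, a minimal sketch of what handling NaNs beforehand could look like on the user side; the frame and column names are illustrative, not part of the PR:

```python
# Minimal sketch, assuming a long-format frame with one value column:
# forward-fill within each group, then back-fill any leading NaNs, before
# the data ever reaches the dataset. Column names are illustrative.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "group": ["a"] * 5 + ["b"] * 5,
    "time_idx": list(range(5)) * 2,
    "value": [1.0, np.nan, 3.0, np.nan, 5.0,
              2.0, 2.5, np.nan, 4.0, 4.5],
})
df["value"] = df.groupby("group")["value"].transform(lambda s: s.ffill().bfill())
```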


codecov bot commented Apr 7, 2025

Codecov Report

Attention: Patch coverage is 96.76113% with 8 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@4cfa1b8). Learn more about missing BASE report.

Files with missing lines                    Patch %   Lines
pytorch_forecasting/data/data_module.py    95.85%    7 Missing ⚠️
pytorch_forecasting/data/timeseries.py     98.71%    1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1811   +/-   ##
=======================================
  Coverage        ?   87.21%           
=======================================
  Files           ?       47           
  Lines           ?     5553           
  Branches        ?        0           
=======================================
  Hits            ?     4843           
  Misses          ?      710           
  Partials        ?        0           
Flag     Coverage Δ
cpu      87.21% <96.76%> (?)
pytest   87.21% <96.76%> (?)


@agobbifbk

Hello @phoeenniixx! I will put some comments here:

  • The D2 layer accepts a list of possible D1 layers. We need to decide whether we need an abstract class that is extended whenever we want to create a new D1 layer (my first guess), or whether it is sufficient to allow the input time_series_dataset to be a list of possible D1 layers.
  • The D2 layer should also manage the temporal split. For the sake of simplicity, at this stage we can assume a percentage temporal split per group: for each group I take x% for training and y% for validation (see the sketch after this list). The current split is done only by groups, which means you are assuming a GLOBAL FORECASTER. I think we need to elaborate on this. If is_global_forecaster is True, then your split is valid, but the temporal split is missing (I think it makes sense to do it per group). If it is False, we don't want to split by groups (so all groups will be used in train) and we need to add the temporal split as well. Moreover, in this latter case, we want to add the group id as a known categorical variable so the models can better distinguish the samples.
  • The class _ProcessedEncoderDecoderDataset copies the initial dataset 3 times (train, validation and test). This may cause OOM problems in some cases; maybe there is a clever way to use the initial D1 layer just once. Any idea?
  • I think the _create_windows function is the correct place to check whether data is valid (does not contain NaNs). You are asking for data, so maybe it is better to check whether the ranges you are creating are valid. In this case, check the encoder and decoder parts separately, so that in inference, if there are NaNs in future unknown variables, the procedure does not discard the window (if you need clarification on that, please let me know).
  • _create_windows should work differently in prediction, because in principle we may have some NaNs (target).
  • I thought we decided to remove the NaN fills in __getitem__, or did I miss something?
  • Why does _ProcessedEncoderDecoderDataset expect an EncoderDecoderTimeSeriesDataModule type?
  • In the x key of __getitem__ of _ProcessedEncoderDecoderDataset we may also want access to past target values; I think we can add something like target_past and take the correct past variables.
  • Is target_scale something for back-compatibility? What is the purpose of this key?
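Regarding the per-group temporal split in the second bullet, a hedged sketch of what it could look like; `df`, the column names, and the split fractions are assumptions, not the PR's actual API:

```python
# Hedged sketch of a percentage temporal split per group: for each group,
# the first x% of timesteps go to train, the next y% to validation, and
# the remainder to test. `df` and the column names are assumptions.
import pandas as pd

def temporal_split(df, group_col="group", time_col="time_idx",
                   train_frac=0.7, val_frac=0.15):
    train_parts, val_parts, test_parts = [], [], []
    for _, g in df.sort_values(time_col).groupby(group_col):
        n_train = int(len(g) * train_frac)
        n_val = int(len(g) * val_frac)
        train_parts.append(g.iloc[:n_train])
        val_parts.append(g.iloc[n_train:n_train + n_val])
        test_parts.append(g.iloc[n_train + n_val:])
    return (pd.concat(train_parts), pd.concat(val_parts),
            pd.concat(test_parts))
```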

We are almost done (the label encoder is still missing in the D1 layer). May I suggest creating a dummy pandas data frame (with at least two groups, along the lines below) and checking that the code behaves as expected?
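Something like this, for instance (all names illustrative):

```python
# A dummy frame with two groups, a time index, one target and one known
# covariate; column names are illustrative.
import numpy as np
import pandas as pd

n = 30
dummy = pd.DataFrame({
    "group": ["A"] * n + ["B"] * n,
    "time_idx": list(range(n)) * 2,
    "target": np.random.randn(2 * n).cumsum(),
    "known_cov": np.tile(np.arange(n) % 7, 2),  # e.g. a day-of-week feature
})
```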

If you have any blocker or question, please feel free to ask!

@phoeenniixx (Author) commented Apr 14, 2025

Thanks for the comments @agobbifbk, I will reply to all of them shortly; here are some first replies:

I thought we decided to remove the NaN fills in __getitem__, or did I miss something?

I have removed the imputation from the D2 layer as discussed. The only NaN fills left are in the D1 layer's __getitem__, which is imo required when we concatenate the past and future data: we create a new data point x_merged there, and any gap between the past and future data needs to be addressed before converting to tensors. Should I remove that as well? Maybe the solution could be the is_valid check like you suggested?
Just to be clear: here x_merged is created as a 2D array filled with NaNs only, and the data is then written into the array; wherever a value is missing, the already-present NaN survives.
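For illustration, the pattern described above in miniature (shapes are made up):

```python
# The merged window starts out as all-NaN; past and future values are then
# written in, so any gap between them keeps its NaN. Shapes are made up.
import numpy as np

total_len, n_features = 10, 3
x_merged = np.full((total_len, n_features), np.nan)

past = np.ones((6, n_features))    # 6 observed past steps
future = np.ones((3, n_features))  # 3 known future steps

x_merged[:6] = past
x_merged[7:10] = future            # row 6 stays NaN: the gap survives
```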

Is target_scale something for back-compatibility? What is the purpose of this key?

Yes, the __init__ of the data module for now contains the params copied from the previous version of TimeSeriesDataset, so that we have an idea of what we still need to add in order to finally reach back-compatibility.

Why does _ProcessedEncoderDecoderDataset expect an EncoderDecoderTimeSeriesDataModule type?

Because we agreed on calling _preprocess_data inside __getitem__ and not outside this class; to access it we need the EncoderDecoderTimeSeriesDataModule. We cannot use super() here, as that works only when inheriting, but _ProcessedEncoderDecoderDataset is a nested class, not a subclass. Is there any other way without passing EncoderDecoderTimeSeriesDataModule? Not a Python expert here, so I may have missed something. Should I make _preprocess_data a standalone function outside the class? That way we would not need to pass EncoderDecoderTimeSeriesDataModule to the nested class, but I am not sure we should do that.
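For illustration, a minimal sketch of the reference-passing pattern under discussion; the method bodies are placeholders, only the class names mirror the PR:

```python
# Sketch: the nested dataset keeps a reference to the data module that
# created it and calls _preprocess_data lazily in __getitem__.
class EncoderDecoderTimeSeriesDataModule:
    def _preprocess_data(self, window):
        ...  # placeholder for the real preprocessing

    class _ProcessedEncoderDecoderDataset:
        def __init__(self, data_module, windows):
            # any object exposing _preprocess_data would do, so the type
            # hint on the real class could be loosened to a Protocol
            self.data_module = data_module
            self.windows = windows

        def __len__(self):
            return len(self.windows)

        def __getitem__(self, idx):
            return self.data_module._preprocess_data(self.windows[idx])

    def train_dataset(self, windows):
        # the outer instance passes itself in; no inheritance needed
        return self._ProcessedEncoderDecoderDataset(self, windows)
```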

@agobbifbk

I have removed the imputation from the D2 layer as discussed. The only NaN fills left are in the D1 layer's __getitem__, which is imo required when we concatenate the past and future data: we create a new data point x_merged there, and any gap between the past and future data needs to be addressed before converting to tensors. Should I remove that as well? Maybe the solution could be the is_valid check like you suggested?

In my opinion, imputing is something unrelated to the D1 layer. Or, if you want, you can add a flag (default False, with a warning when it is triggered) and just skip the samples that contain NaNs, along the lines sketched below.
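A minimal sketch of that flag-based behaviour, assuming windows are (start, end) row ranges into a 2D array (names hypothetical):

```python
# Flag-based NaN handling: by default, windows containing NaNs are skipped
# with a warning instead of being imputed. Names are hypothetical.
import warnings
import numpy as np

def filter_windows(windows, data, skip_nan=True):
    """windows: list of (start, end) row ranges into `data` (a 2D array)."""
    if not skip_nan:
        return windows
    valid = []
    for start, end in windows:
        if np.isnan(data[start:end]).any():
            warnings.warn(f"window ({start}, {end}) contains NaNs; skipping")
        else:
            valid.append((start, end))
    return valid
```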

Because we agreed on calling _preprocess_data inside __getitem__ and not outside this class; to access it we need the EncoderDecoderTimeSeriesDataModule. We cannot use super() here, as that works only when inheriting, but _ProcessedEncoderDecoderDataset is a nested class, not a subclass. Is there any other way without passing EncoderDecoderTimeSeriesDataModule? Should I make _preprocess_data a standalone function outside the class?

I see, but in this way you are copying the same object 3 times, where only the windows change. In my opinion, with minimal effort you can use the same __getitem__ for the three DataLoaders without duplicating possibly large dataframes, e.g. along the lines below.
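A hedged sketch of that idea: one D1 object shared by three lightweight window views, so only the index lists differ (names illustrative):

```python
# One D1 layer shared across train/val/test: each split is only a list of
# window specs, the underlying data is never copied. Names are illustrative.
from torch.utils.data import Dataset

class WindowedDataset(Dataset):
    def __init__(self, d1_layer, windows):
        self.d1_layer = d1_layer  # shared reference, not a copy
        self.windows = windows    # per-split window specs

    def __len__(self):
        return len(self.windows)

    def __getitem__(self, idx):
        # delegate to the shared D1 layer for the actual data access
        return self.d1_layer[self.windows[idx]]

# train_ds = WindowedDataset(d1, train_windows)
# val_ds   = WindowedDataset(d1, val_windows)
# test_ds  = WindowedDataset(d1, test_windows)
```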

Happy to continue the discussion also during the technical meeting; if needed, we can arrange one in the coming days!

@phoeenniixx (Author)

Happy to continue the discussion also during the technical meeting; if needed, we can arrange one in the coming days!

Yes, please! If possible, we can meet at least once this week before my exams start, because I think I need a discussion to understand all your points, and I will be able to learn new aspects that will help me carry on the work.

@fkiraly (Collaborator) left a comment

My main comment: I think we need to start adding tests now, otherwise the development work is getting too complex.

If you have time before the exams, that is what I would focus on: add tests in all three PRs, and ensure they are stacked (keep them updated).

@phoeenniixx (Author)

Hi @fkiraly, @agobbifbk, I have added the tests for the D1 and D2 layers (still to make the changes suggested by @agobbifbk). While working on the tests, I realized that the len we return for TimeSeries is more problematic than I thought: for now I am returning 1 group per __getitem__, and this can even lead to a val_dataset of length 0, as there might not be enough groups. So I think we should add windows to the D1 layer as well, but that leads to another question:
Should one window contain only one group, or can it contain any number of groups? Right now I am masking and getting the group; should the windows be added after this, or should I create the windows without masking, directly on the raw data?

Also, is there any other (better) way than using windows here?

(For now I have commented out the tests checking these train-val splits.)
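For illustration, a hedged sketch of window-based indexing where __len__ counts (group, start) pairs rather than groups (names and shapes hypothetical):

```python
# __len__ counts (group, start) window pairs rather than groups, so even a
# dataset with few groups yields enough samples for a validation split.
class WindowIndex:
    def __init__(self, group_lengths, window_size):
        # one entry per admissible (group, start) pair
        self.index = [
            (group, start)
            for group, length in group_lengths.items()
            for start in range(length - window_size + 1)
        ]

    def __len__(self):
        return len(self.index)

# {"A": 30, "B": 12} with window_size=10 gives 21 + 3 = 24 windows,
# instead of a length of 2 when counting groups
```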
