[ENH] EXPERIMENTAL PR: D1 and D2 layer for v2 refactor #1811


Open · wants to merge 10 commits into main

Conversation

phoeenniixx

Description

This PR implements the basic skeleton of the D1 and D2 layers for v2.

@phoeenniixx (Author)

FYI @fkiraly @agobbifbk, I have removed the imputation, as I found only one existing mechanism related to handling missing data: constant_fill_strategy, which handles only missing timesteps and does not handle NaNs.
Also see: https://pytorch-forecasting.readthedocs.io/en/latest/faq.html#creating-datasets

According to the FAQ, NaNs should be handled beforehand.
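For reference, a minimal sketch of what handling NaNs beforehand could look like on the user side; the frame and column names are illustrative, not part of the PR:

```python
# Minimal sketch, assuming a long-format frame with one value column:
# forward-fill within each group, then back-fill any leading NaNs, before
# the data ever reaches the dataset. Column names are illustrative.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "group": ["a"] * 5 + ["b"] * 5,
    "time_idx": list(range(5)) * 2,
    "value": [1.0, np.nan, 3.0, np.nan, 5.0,
              2.0, 2.5, np.nan, 4.0, 4.5],
})
df["value"] = df.groupby("group")["value"].transform(lambda s: s.ffill().bfill())
```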


codecov bot commented Apr 7, 2025

Codecov Report

Attention: Patch coverage is 96.76113% with 8 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@4cfa1b8). Learn more about missing BASE report.

Files with missing lines                    Patch %   Lines
pytorch_forecasting/data/data_module.py    95.85%    7 Missing ⚠️
pytorch_forecasting/data/timeseries.py     98.71%    1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1811   +/-   ##
=======================================
  Coverage        ?   87.21%           
=======================================
  Files           ?       47           
  Lines           ?     5553           
  Branches        ?        0           
=======================================
  Hits            ?     4843           
  Misses          ?      710           
  Partials        ?        0           
Flag     Coverage Δ
cpu      87.21% <96.76%> (?)
pytest   87.21% <96.76%> (?)


@agobbifbk

Hello @phoeenniixx! I will put some comments here:

  • The D2 layer accepts a list of possible D1 layers. We need to decide whether we need an abstract class that is extended whenever we want to create a new D1 layer (my first guess), or whether it is sufficient to allow the input time_series_dataset to be a list of possible D1 layers.
  • The D2 layer should also manage the temporal split. For the sake of simplicity, at this stage we can assume a percentage temporal split per group: for each group I take x% for training and y% for validation (see the sketch after this list). The current split is done only by groups, which means you are assuming a GLOBAL FORECASTER. I think we need to elaborate on this. If is_global_forecaster is True, then your split is valid, but the temporal split is missing (I think it makes sense to do it per group). If it is False, we don't want to split by groups (so all groups will be used in train) and we need to add the temporal split as well. Moreover, in this latter case, we want to add the group id as a known categorical variable so the models can better distinguish the samples.
  • The class _ProcessedEncoderDecoderDataset copies the initial dataset 3 times (train, validation and test). This may cause OOM problems in some cases; maybe there is a clever way to use the initial D1 layer just once. Any idea?
  • I think the _create_windows function is the correct place to check whether data is valid (does not contain NaNs). You are asking for data, so maybe it is better to check whether the ranges you are creating are valid. In this case, check the encoder and decoder parts separately, so that in inference, if there are NaNs in future unknown variables, the procedure does not discard the window (if you need clarification on that, please let me know).
  • _create_windows should work differently in prediction, because in principle we may have some NaNs (target).
  • I thought we decided to remove the NaN fills in __getitem__, or did I miss something?
  • Why does _ProcessedEncoderDecoderDataset expect an EncoderDecoderTimeSeriesDataModule type?
  • In the x key of __getitem__ of _ProcessedEncoderDecoderDataset we may also want access to past target values; I think we can add something like target_past and take the correct past variables.
  • Is target_scale something for back-compatibility? What is the purpose of this key?
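Regarding the per-group temporal split in the second bullet, a hedged sketch of what it could look like; `df`, the column names, and the split fractions are assumptions, not the PR's actual API:

```python
# Hedged sketch of a percentage temporal split per group: for each group,
# the first x% of timesteps go to train, the next y% to validation, and
# the remainder to test. `df` and the column names are assumptions.
import pandas as pd

def temporal_split(df, group_col="group", time_col="time_idx",
                   train_frac=0.7, val_frac=0.15):
    train_parts, val_parts, test_parts = [], [], []
    for _, g in df.sort_values(time_col).groupby(group_col):
        n_train = int(len(g) * train_frac)
        n_val = int(len(g) * val_frac)
        train_parts.append(g.iloc[:n_train])
        val_parts.append(g.iloc[n_train:n_train + n_val])
        test_parts.append(g.iloc[n_train + n_val:])
    return (pd.concat(train_parts), pd.concat(val_parts),
            pd.concat(test_parts))
```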

We are almost done (the label encoder is still missing in the D1 layer). May I suggest creating a dummy pandas data frame (with at least two groups, along the lines below) and checking that the code behaves as expected?
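Something like this, for instance (all names illustrative):

```python
# A dummy frame with two groups, a time index, one target and one known
# covariate; column names are illustrative.
import numpy as np
import pandas as pd

n = 30
dummy = pd.DataFrame({
    "group": ["A"] * n + ["B"] * n,
    "time_idx": list(range(n)) * 2,
    "target": np.random.randn(2 * n).cumsum(),
    "known_cov": np.tile(np.arange(n) % 7, 2),  # e.g. a day-of-week feature
})
```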

If you have any blocker or question, please feel free to ask!

@phoeenniixx (Author) commented Apr 14, 2025

Thanks for the comments @agobbifbk, I will reply to all of them shortly; here are some first replies:

I thought we decided to remove the NaN fills in __getitem__, or did I miss something?

I have removed the imputation from the D2 layer as discussed. The only NaN fills left are in the D1 layer's __getitem__, which is imo required when we concatenate the past and future data: we create a new data point x_merged there, and any gap between the past and future data needs to be addressed before converting to tensors. Should I remove that as well? Maybe the solution could be the is_valid check like you suggested?
Just to be clear: here x_merged is created as a 2D array filled with NaNs only, and the data is then written into the array; wherever a value is missing, the already-present NaN survives.
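For illustration, the pattern described above in miniature (shapes are made up):

```python
# The merged window starts out as all-NaN; past and future values are then
# written in, so any gap between them keeps its NaN. Shapes are made up.
import numpy as np

total_len, n_features = 10, 3
x_merged = np.full((total_len, n_features), np.nan)

past = np.ones((6, n_features))    # 6 observed past steps
future = np.ones((3, n_features))  # 3 known future steps

x_merged[:6] = past
x_merged[7:10] = future            # row 6 stays NaN: the gap survives
```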

Is target_scale something for back-compatibility? What is the purpose of this key?

Yes, the __init__ of the data module for now contains the params copied from the previous version of TimeSeriesDataset, so that we have an idea of what we still need to add in order to finally reach back-compatibility.

Why does _ProcessedEncoderDecoderDataset expect an EncoderDecoderTimeSeriesDataModule type?

Because we agreed on calling _preprocess_data inside __getitem__ and not outside this class; to access it we need the EncoderDecoderTimeSeriesDataModule. We cannot use super() here, as that works only when inheriting, but _ProcessedEncoderDecoderDataset is a nested class, not a subclass. Is there any other way without passing EncoderDecoderTimeSeriesDataModule? Not a Python expert here, so I may have missed something. Should I make _preprocess_data a standalone function outside the class? That way we would not need to pass EncoderDecoderTimeSeriesDataModule to the nested class, but I am not sure we should do that.
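For illustration, a minimal sketch of the reference-passing pattern under discussion; the method bodies are placeholders, only the class names mirror the PR:

```python
# Sketch: the nested dataset keeps a reference to the data module that
# created it and calls _preprocess_data lazily in __getitem__.
class EncoderDecoderTimeSeriesDataModule:
    def _preprocess_data(self, window):
        ...  # placeholder for the real preprocessing

    class _ProcessedEncoderDecoderDataset:
        def __init__(self, data_module, windows):
            # any object exposing _preprocess_data would do, so the type
            # hint on the real class could be loosened to a Protocol
            self.data_module = data_module
            self.windows = windows

        def __len__(self):
            return len(self.windows)

        def __getitem__(self, idx):
            return self.data_module._preprocess_data(self.windows[idx])

    def train_dataset(self, windows):
        # the outer instance passes itself in; no inheritance needed
        return self._ProcessedEncoderDecoderDataset(self, windows)
```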

@agobbifbk

I have removed the imputation from the D2 layer as discussed. The only NaN fills left are in the D1 layer's __getitem__, which is imo required when we concatenate the past and future data: we create a new data point x_merged there, and any gap between the past and future data needs to be addressed before converting to tensors. Should I remove that as well? Maybe the solution could be the is_valid check like you suggested?

In my opinion, imputing is something unrelated to the D1 layer. Or, if you want, you can add a flag (default False, with a warning when it is triggered) and just skip the samples that contain NaNs, along the lines sketched below.
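A minimal sketch of that flag-based behaviour, assuming windows are (start, end) row ranges into a 2D array (names hypothetical):

```python
# Flag-based NaN handling: by default, windows containing NaNs are skipped
# with a warning instead of being imputed. Names are hypothetical.
import warnings
import numpy as np

def filter_windows(windows, data, skip_nan=True):
    """windows: list of (start, end) row ranges into `data` (a 2D array)."""
    if not skip_nan:
        return windows
    valid = []
    for start, end in windows:
        if np.isnan(data[start:end]).any():
            warnings.warn(f"window ({start}, {end}) contains NaNs; skipping")
        else:
            valid.append((start, end))
    return valid
```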

Because we agreed on calling _preprocess_data inside __getitem__ and not outside this class; to access it we need the EncoderDecoderTimeSeriesDataModule. We cannot use super() here, as that works only when inheriting, but _ProcessedEncoderDecoderDataset is a nested class, not a subclass. Is there any other way without passing EncoderDecoderTimeSeriesDataModule? Should I make _preprocess_data a standalone function outside the class?

I see, but in this way you are copying the same object 3 times, where only the windows change. In my opinion, with minimal effort you can use the same __getitem__ for the three DataLoaders without duplicating possibly large dataframes, e.g. along the lines below.
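A hedged sketch of that idea: one D1 object shared by three lightweight window views, so only the index lists differ (names illustrative):

```python
# One D1 layer shared across train/val/test: each split is only a list of
# window specs, the underlying data is never copied. Names are illustrative.
from torch.utils.data import Dataset

class WindowedDataset(Dataset):
    def __init__(self, d1_layer, windows):
        self.d1_layer = d1_layer  # shared reference, not a copy
        self.windows = windows    # per-split window specs

    def __len__(self):
        return len(self.windows)

    def __getitem__(self, idx):
        # delegate to the shared D1 layer for the actual data access
        return self.d1_layer[self.windows[idx]]

# train_ds = WindowedDataset(d1, train_windows)
# val_ds   = WindowedDataset(d1, val_windows)
# test_ds  = WindowedDataset(d1, test_windows)
```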

Happy to continue the discussion also during the technical meeting; if needed, we can arrange one in the coming days!

@phoeenniixx (Author)

Happy to continue the discussion also during the technical meeting; if needed, we can arrange one in the coming days!

Yes, please! If possible, we can meet at least once this week before my exams start, because I think I need a discussion to understand all your points, and I will be able to learn new aspects that will help me carry on the work.

@fkiraly (Collaborator) left a comment

My main comment: I think we need to start adding tests now, otherwise the development work is getting too complex.

If you have time before the exams, that is what I would focus on: add tests in all three PRs, and ensure they are stacked (keep them updated).

@phoeenniixx (Author)

Hi @fkiraly, @agobbifbk, I have added the tests for the D1 and D2 layers (still to make the changes suggested by @agobbifbk). While working on the tests, I realized that the len we return for TimeSeries is more problematic than I thought: for now I am returning 1 group per __getitem__, and this can even lead to a val_dataset of length 0, as there might not be enough groups. So I think we should add windows to the D1 layer as well, but that leads to another question:
Should one window contain only one group, or can it contain any number of groups? Right now I am masking and getting the group; should the windows be added after this, or should I create the windows without masking, directly on the raw data?

Also, is there any other (better) way than using windows here?

(For now I have commented out the tests checking these train-val splits.)
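For illustration, a hedged sketch of window-based indexing where __len__ counts (group, start) pairs rather than groups (names and shapes hypothetical):

```python
# __len__ counts (group, start) window pairs rather than groups, so even a
# dataset with few groups yields enough samples for a validation split.
class WindowIndex:
    def __init__(self, group_lengths, window_size):
        # one entry per admissible (group, start) pair
        self.index = [
            (group, start)
            for group, length in group_lengths.items()
            for start in range(length - window_size + 1)
        ]

    def __len__(self):
        return len(self.index)

# {"A": 30, "B": 12} with window_size=10 gives 21 + 3 = 24 windows,
# instead of a length of 2 when counting groups
```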
