
[ENH] Experimental PR- New Dataset Version-B #1791


Closed
wants to merge 15 commits

Conversation

phoeenniixx
Contributor

Description

This PR tries to implement another version of the TimeSeries dataset and data module, where future_data is merged with the existing x rather than stored separately, and one more tensor, cutoff, is added to record the present time. It uses a time_mask in the data module to split the data into past and future with the help of cutoff.
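A minimal sketch of the intended split, assuming cutoff holds the index of the last past time step (the function and variable names here are illustrative, not the PR's exact API):

import torch

def split_past_future(x: torch.Tensor, cutoff: int):
    # x: (time_steps, n_features); cutoff: index of the last "present" step
    time_idx = torch.arange(x.size(0))
    time_mask = time_idx <= cutoff   # True for past, False for future
    return x[time_mask], x[~time_mask]

x = torch.randn(30, 4)                         # 30 time steps, 4 features
past, future = split_past_future(x, cutoff=23)
print(past.shape, future.shape)                # (24, 4) and (6, 4)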

@phoeenniixx
Contributor Author

Hi @fkiraly, @agobbifbk, the basic vignette is here: https://colab.research.google.com/drive/1LS0JFIzHZ2_EbzY19l1Yuqyr9lTN1Jj8?usp=sharing

Please check whether it works as intended. I still haven't fully understood how future exogenous data is handled in dsipts, so if there is any discrepancy, please let me know.

@phoeenniixx phoeenniixx changed the title [ENH] Experimental PR- New Dataset Version-2 [ENH] Experimental PR- New Dataset Version-B Mar 20, 2025

codecov bot commented Mar 20, 2025

Codecov Report

Attention: Patch coverage is 80.97166% with 47 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@e87230b).

Files with missing lines                  | Patch % | Lines
pytorch_forecasting/data/data_modules.py | 81.17%  | 32 Missing ⚠️
pytorch_forecasting/data/timeseries.py   | 80.51%  | 15 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1791   +/-   ##
=======================================
  Coverage        ?   86.49%           
=======================================
  Files           ?       47           
  Lines           ?     5548           
  Branches        ?        0           
=======================================
  Hits            ?     4799           
  Misses          ?      749           
  Partials        ?        0           
Flag    | Coverage Δ
cpu     | 86.49% <80.97%> (?)
pytest  | 86.49% <80.97%> (?)


@agobbifbk

Great work, thank you!
Here are some comments:

  1. "x": torch.tensor(data[self.feature_cols].values), in the d1 layer: if data contains non-numerical features, this will raise an error during tensor parsing. One option is to add some sort of LabelEncoder in the init part of the d1 layer (see the sketch after this list)

  2. in the encoder layer, _preprocess_data method: it seems you are filling missing values with the mean. In my opinion this is something the user should be able to disable. Moreover, for categorical variables (encoded as integers at this level, see point 1), the mean is not a valid fill value

  3. in _create_windows we probably need to check whether the data is valid (in case we don't want to fill NaNs, see point 2)

  4. scalers is defined but not used (remember to fit it on the training data only)

  5. train_val_test_split splits only across time series and does not take time into account. This is certainly something we want (especially for a global forecaster), but in my opinion a temporal split is more important for real applications

  6. _preprocess_data generates all the samples and keeps them in memory. This is certainly something we want to support (most of the time the samples fit in memory), but it means we somewhat lose the power of a generic d1 layer. With this d1-layer definition your approach definitely works, but I would add an in-memory flag and explore what we can gain by loading data lazily via the __getitem__ function of the d1 layer

  7. this part of the code:

train_dataloader = data_module.train_dataloader()
sample_batch = next(iter(train_dataloader))

x, y = sample_batch
print(f"Encoder continuous shape: {x['encoder_cont'].shape}")
print(f"Decoder continuous shape: {x['decoder_cont'].shape}")
print(f"Target shape: {y.shape}")

encoder_input_size = x['encoder_cont'].shape[-1]
decoder_input_size = x['decoder_cont'].shape[-1]

model = TimeSeriesLightningModel(encoder_input_size=encoder_input_size, decoder_input_size=decoder_input_size)

still has the issue with the dimensions. One backup idea is to put all the sizes in the d2-layer metadata, add the dictionary as a keyword argument to the model's init, and create an instance of it using:

model = TimeSeriesLightningModel(**data_module['metadata']['sizes'])

until we define how to better link the d2 and model layers.
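For point 1, a minimal sketch of what such an encoder in the d1 layer's init could look like (the class name and structure here are hypothetical):

import pandas as pd
import torch
from sklearn.preprocessing import LabelEncoder

class D1Dataset:
    def __init__(self, data: pd.DataFrame, feature_cols):
        data = data.copy()
        self.encoders = {}
        # Encode non-numeric feature columns to integer codes so that the
        # tensor conversion below does not fail on strings or categories.
        for col in feature_cols:
            if not pd.api.types.is_numeric_dtype(data[col]):
                enc = LabelEncoder()
                data[col] = enc.fit_transform(data[col])
                self.encoders[col] = enc
        self.x = torch.tensor(data[feature_cols].values, dtype=torch.float32)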

Really looking forward to seeing the code after these changes!

@phoeenniixx
Contributor Author

Thanks for the comments @agobbifbk!

"x": torch.tensor(data[self.feature_cols].values), in the d1 layer: if data contains non numerical features this will raise an error during the tensor parser. One option can be to add in the init part of the d1 layer some sort of LabelEncoder

There were some doubts about where to put the label encoders last time, I guess. @fkiraly, can you please comment on whether those doubts were addressed?

in _create_windows we probably need to check whether the data is valid (in case we don't want to fill NaNs, see point 2)

Thanks, I will add it.

scalers is defined but not used (remember to fit it on the training data only)

This implementation was just a test to see if everything works; I will add them in subsequent commits.

still has the issue with the dimensions. One backup idea is to put all the sizes in the d2-layer metadata, add the dictionary as a keyword argument to the model's init, and create an instance of it using:

Please look at #1805, where I have tried to use metadata for version A. Could you check whether that works?

I will add your suggestions in the next commits, thanks!

@phoeenniixx
Contributor Author

Hi @fkiraly, I have tried to create the metadata in the d2 layer without calling __getitem__; please have a look at it.
I have also addressed some of @agobbifbk's comments, which we can discuss in the next tech sessions.

For the record:
We are creating a metadata property in the d2 layer because we don't want the user to initialise the model with information they have already provided.
So instead of initialising the model this way:

model = TFT(
    loss=nn.MSELoss(),
    logging_metrics=[MAE(), SMAPE()],
    optimizer="adam",
    optimizer_params={"lr": 1e-3},
    lr_scheduler="reduce_lr_on_plateau",
    lr_scheduler_params={"mode": "min", "factor": 0.1, "patience": 10},
    hidden_size=64,
    num_layers=2,
    attention_head_size=4,
    dropout=0.1,
    cont_feature_size=encoder_cont,
    cat_feature_size=encoder_cat,
    static_cat_feature_size=static_categorical_features,
    static_cont_feature_size=static_continuous_features,
    max_encoder_length=max_encoder_length,
    max_prediction_length=max_prediction_length,
)

the user can initialise the model in this way:

model = TFT(
    loss=nn.MSELoss(),
    logging_metrics=[MAE(), SMAPE()],
    optimizer="adam",
    optimizer_params={"lr": 1e-3},
    lr_scheduler="reduce_lr_on_plateau",
    lr_scheduler_params={"mode": "min", "factor": 0.1, "patience": 10},
    hidden_size=64,
    num_layers=2,
    attention_head_size=4,
    dropout=0.1,
    metadata=data_module.metadata,
)

as all this information can be inferred from the data the user has already provided, and this will make the interface simpler as well
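A minimal sketch of the model side under this scheme (the key names mirror the first example above; the exact metadata schema is still under discussion):

from lightning.pytorch import LightningModule

class TFT(LightningModule):
    def __init__(self, loss, metadata: dict, hidden_size: int = 64, **kwargs):
        super().__init__()
        self.loss = loss
        # All sizes come from the data module's metadata instead of being
        # re-typed by the user at model construction.
        self.cont_feature_size = metadata["encoder_cont"]
        self.cat_feature_size = metadata["encoder_cat"]
        self.max_encoder_length = metadata["max_encoder_length"]
        self.max_prediction_length = metadata["max_prediction_length"]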

Collaborator

fkiraly left a comment


Review of the metadata generation, as requested:

  • great! It seems we can generate the entire D2-specific metadata from the D1 metadata and inputs, right?
  • for "cleanness" of the logic, I would suggest moving the entire logic for that into a method _prepare_metadata.

Some suggestions regarding documentation:

  • The method _prepare_metadata should optimally also have a docstring that lists the output metadata fields generated and how; since it is a private method, it is primarily explained to a developer.
  • metadata already has this for the user, although I would go closer to the numpydoc style, with a properly populated Returns section (see the sketch below).
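A possible shape for that docstring, in numpydoc style (the field names are illustrative, not the final schema):

def _prepare_metadata(self):
    """Prepare the D2-level metadata dict from D1 metadata and inputs.

    Returns
    -------
    dict
        Metadata with the following entries:

        * ``encoder_cont`` : int
            Number of continuous encoder features, taken from D1 metadata.
        * ``max_encoder_length`` : int
            As passed to the data module at construction.
    """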

@phoeenniixx
Contributor Author

we can generate the entire D2-specific metadata from the D1 metadata and inputs, right?

Yes, "almost" all keys are taken from the D1 metadata; some keys like max_encoder_length etc. (the last 4 keys) are passed into the datamodule directly, so those are taken from there.

for "cleanness" of the logic, I would suggest moving the entire logic for that into a method _prepare_metadata.

Sure, then I will call this function in the property.

@fkiraly
Collaborator

fkiraly commented Apr 4, 2025

Sure, then I will call this function in the property.

That is one way - I was thinking about pre-populating at __init__, or caching after the first call.
What is the computational overhead? Minimal, I guess?

@phoeenniixx
Contributor Author

I was thinking about pre-populating at init.

That would mean the function is called in __init__, which IMO is not necessary, as it is needed only when the model is initialised. So we should compute it only when we actually need it. Or do we need it somewhere else as well, so that it should be available right after initialisation?

caching after the first call.

This, I think, is the best way; I will do this for now.

Even if we need it anywhere else as well, it can easily be obtained by calling the property; and from the user's perspective it always has to be called as datamodule.metadata whenever it is needed, so it would not matter whether it is computed in __init__ or in the property.
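A minimal sketch of the cached variant, assuming a _prepare_metadata method as discussed above (the class name here is hypothetical):

from functools import cached_property

class EncoderDecoderDataModule:
    @cached_property
    def metadata(self) -> dict:
        # Computed lazily on first access and cached afterwards: no work
        # in __init__, and repeated accesses are free.
        return self._prepare_metadata()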

@phoeenniixx
Contributor Author

We are finally moving ahead with this version of the implementation, and the final PRs of the prototype are listed below:
#1811, #1812, #1813 (final PRs with a basic model implementation and vignette)
We are closing this PR and moving on with the above PRs.
FYI @fkiraly, @PranavBhatP, @agobbifbk

@github-project-automation github-project-automation bot moved this from PR under review to Done in Dec 2024 - Mar 2025 mentee projects May 13, 2025
@phoeenniixx phoeenniixx deleted the newDataset-v2 branch May 13, 2025 14:01