
[ENH] Experimental PR- New Dataset Version-B #1791


Closed
wants to merge 15 commits

Conversation

phoeenniixx
Contributor

Description

This PR tries to implement another version of the TimeSeries dataset and data module, where future_data is merged with the existing x rather than stored separately, and one more tensor, cutoff, is added to record the present time. It uses a time_mask in the data module to split the data into past and future with the help of cutoff.
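A minimal sketch of the intended split, assuming cutoff holds the index of the last past time step (the function and variable names here are illustrative, not the PR's exact API):

import torch

def split_past_future(x: torch.Tensor, cutoff: int):
    # x: (time_steps, n_features); cutoff: index of the last "present" step
    time_idx = torch.arange(x.size(0))
    time_mask = time_idx <= cutoff   # True for past, False for future
    return x[time_mask], x[~time_mask]

x = torch.randn(30, 4)                         # 30 time steps, 4 features
past, future = split_past_future(x, cutoff=23)
print(past.shape, future.shape)                # (24, 4) and (6, 4)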

@phoeenniixx
Contributor Author

Hi @fkiraly, @agobbifbk, the basic vignette is here: https://colab.research.google.com/drive/1LS0JFIzHZ2_EbzY19l1Yuqyr9lTN1Jj8?usp=sharing

Please check whether it works as intended. I still haven't fully understood how future exogenous data is handled in dsipts, so if there is any discrepancy, please let me know.

@phoeenniixx phoeenniixx changed the title [ENH] Experimental PR- New Dataset Version-2 [ENH] Experimental PR- New Dataset Version-B Mar 20, 2025

codecov bot commented Mar 20, 2025

Codecov Report

Attention: Patch coverage is 80.97166% with 47 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@e87230b).

Files with missing lines                  | Patch % | Lines
pytorch_forecasting/data/data_modules.py | 81.17%  | 32 Missing ⚠️
pytorch_forecasting/data/timeseries.py   | 80.51%  | 15 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1791   +/-   ##
=======================================
  Coverage        ?   86.49%           
=======================================
  Files           ?       47           
  Lines           ?     5548           
  Branches        ?        0           
=======================================
  Hits            ?     4799           
  Misses          ?      749           
  Partials        ?        0           
Flag    | Coverage Δ
cpu     | 86.49% <80.97%> (?)
pytest  | 86.49% <80.97%> (?)


@agobbifbk

Great work, thank you!
Here are some comments:

  1. "x": torch.tensor(data[self.feature_cols].values), in the d1 layer: if data contains non-numerical features, this will raise an error during tensor parsing. One option is to add some sort of LabelEncoder in the init part of the d1 layer (see the sketch after this list)

  2. in the encoder layer, _preprocess_data method: it seems you are filling missing values with the mean. In my opinion this is something the user should be able to disable. Moreover, for categorical variables (encoded as integers at this level, see point 1), the mean is not a valid fill value

  3. in _create_windows we probably need to check whether the data is valid (in case we don't want to fill NaNs, see point 2)

  4. scalers is defined but not used (remember to fit it on the training data only)

  5. train_val_test_split splits only across time series and does not take time into account. This is certainly something we want (especially for a global forecaster), but in my opinion a temporal split is more important for real applications

  6. _preprocess_data generates all the samples and keeps them in memory. This is certainly something we want to support (most of the time the samples fit in memory), but it means we somewhat lose the power of a generic d1 layer. With this d1-layer definition your approach definitely works, but I would add an in-memory flag and explore what we can gain by loading data lazily via the __getitem__ function of the d1 layer

  7. this part of the code:

train_dataloader = data_module.train_dataloader()
sample_batch = next(iter(train_dataloader))

x, y = sample_batch
print(f"Encoder continuous shape: {x['encoder_cont'].shape}")
print(f"Decoder continuous shape: {x['decoder_cont'].shape}")
print(f"Target shape: {y.shape}")

encoder_input_size = x['encoder_cont'].shape[-1]
decoder_input_size = x['decoder_cont'].shape[-1]

model = TimeSeriesLightningModel(encoder_input_size=encoder_input_size, decoder_input_size=decoder_input_size)

still has the issue with the dimensions. One backup idea is to put all the sizes in the d2-layer metadata, add the dictionary as a keyword argument to the model's init, and create an instance of it using:

model = TimeSeriesLightningModel(**data_module['metadata']['sizes'])

until we define how to better link the d2 and model layers.
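For point 1, a minimal sketch of what such an encoder in the d1 layer's init could look like (the class name and structure here are hypothetical):

import pandas as pd
import torch
from sklearn.preprocessing import LabelEncoder

class D1Dataset:
    def __init__(self, data: pd.DataFrame, feature_cols):
        data = data.copy()
        self.encoders = {}
        # Encode non-numeric feature columns to integer codes so that the
        # tensor conversion below does not fail on strings or categories.
        for col in feature_cols:
            if not pd.api.types.is_numeric_dtype(data[col]):
                enc = LabelEncoder()
                data[col] = enc.fit_transform(data[col])
                self.encoders[col] = enc
        self.x = torch.tensor(data[feature_cols].values, dtype=torch.float32)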

Really looking forward to seeing the code after these changes!

@phoeenniixx
Contributor Author

Thanks for the comments @agobbifbk!

"x": torch.tensor(data[self.feature_cols].values), in the d1 layer: if data contains non numerical features this will raise an error during the tensor parser. One option can be to add in the init part of the d1 layer some sort of LabelEncoder

There were some doubts about where to put the label encoders last time, I guess. @fkiraly, can you please comment on whether those doubts were addressed?

in _create_windows we probably need to check whether the data is valid (in case we don't want to fill NaNs, see point 2)

Thanks, I will add it.

scalers is defined but not used (remember to fit it on the training data only)

This implementation was just a test to see if everything works; I will add them in subsequent commits.

still has the issue with the dimensions. One backup idea is to put all the sizes in the d2-layer metadata, add the dictionary as a keyword argument to the model's init, and create an instance of it using:

Please look at #1805, where I have tried to use metadata for version A. Could you check whether that works?

I will add your suggestions in the next commits, thanks!

@phoeenniixx
Contributor Author

Hi @fkiraly, I have tried to create the metadata in the d2 layer without calling __getitem__; please have a look at it.
I have also addressed some of @agobbifbk's comments, which we can discuss in the next tech sessions.

For the record:
We are creating a metadata property in the d2 layer because we don't want the user to initialise the model with information they have already provided.
So instead of initialising the model this way:

model = TFT(
    loss=nn.MSELoss(),
    logging_metrics=[MAE(), SMAPE()],
    optimizer="adam",
    optimizer_params={"lr": 1e-3},
    lr_scheduler="reduce_lr_on_plateau",
    lr_scheduler_params={"mode": "min", "factor": 0.1, "patience": 10},
    hidden_size=64,
    num_layers=2,
    attention_head_size=4,
    dropout=0.1,
    cont_feature_size=encoder_cont,
    cat_feature_size=encoder_cat,
    static_cat_feature_size=static_categorical_features,
    static_cont_feature_size=static_continuous_features,
    max_encoder_length=max_encoder_length,
    max_prediction_length=max_prediction_length,
)

the user can initialise the model in this way:

model = TFT(
    loss=nn.MSELoss(),
    logging_metrics=[MAE(), SMAPE()],
    optimizer="adam",
    optimizer_params={"lr": 1e-3},
    lr_scheduler="reduce_lr_on_plateau",
    lr_scheduler_params={"mode": "min", "factor": 0.1, "patience": 10},
    hidden_size=64,
    num_layers=2,
    attention_head_size=4,
    dropout=0.1,
    metadata=data_module.metadata,
)

as all this information can be inferred from the data the user has already provided, and this will make the interface simpler as well
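A minimal sketch of the model side under this scheme (the key names mirror the first example above; the exact metadata schema is still under discussion):

from lightning.pytorch import LightningModule

class TFT(LightningModule):
    def __init__(self, loss, metadata: dict, hidden_size: int = 64, **kwargs):
        super().__init__()
        self.loss = loss
        # All sizes come from the data module's metadata instead of being
        # re-typed by the user at model construction.
        self.cont_feature_size = metadata["encoder_cont"]
        self.cat_feature_size = metadata["encoder_cat"]
        self.max_encoder_length = metadata["max_encoder_length"]
        self.max_prediction_length = metadata["max_prediction_length"]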

Collaborator

fkiraly left a comment


Review of the metadata generation, as requested:

  • great! It seems we can generate the entire D2-specific metadata from the D1 metadata and inputs, right?
  • for "cleanness" of the logic, I would suggest moving the entire logic for that into a method _prepare_metadata.

Some suggestions regarding documentation:

  • The method _prepare_metadata should optimally also have a docstring that lists the output metadata fields generated and how; since it is a private method, it is primarily explained to a developer.
  • metadata already has this for the user, although I would go closer to the numpydoc style, with a properly populated Returns section (see the sketch below).
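A possible shape for that docstring, in numpydoc style (the field names are illustrative, not the final schema):

def _prepare_metadata(self):
    """Prepare the D2-level metadata dict from D1 metadata and inputs.

    Returns
    -------
    dict
        Metadata with the following entries:

        * ``encoder_cont`` : int
            Number of continuous encoder features, taken from D1 metadata.
        * ``max_encoder_length`` : int
            As passed to the data module at construction.
    """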

@phoeenniixx
Contributor Author

we can generate the entire D2-specific metadata from the D1 metadata and inputs, right?

Yes, "almost" all keys are taken from the D1 metadata; some keys like max_encoder_length etc. (the last 4 keys) are passed into the datamodule directly, so those are taken from there.

for "cleanness" of the logic, I would suggest moving the entire logic for that into a method _prepare_metadata.

Sure, then I will call this function in the property.

@fkiraly
Collaborator

fkiraly commented Apr 4, 2025

Sure, then I will call this function in the property.

That is one way - I was thinking about pre-populating at __init__, or caching after the first call.
What is the computational overhead? Minimal, I guess?

@phoeenniixx
Contributor Author

I was thinking about pre-populating at init.

That would mean the function is called in __init__, which IMO is not necessary, as it is needed only when the model is initialised. So we should compute it only when we actually need it. Or do we need it somewhere else as well, so that it should be available right after initialisation?

caching after the first call.

This, I think, is the best way; I will do this for now.

Even if we need it anywhere else as well, it can easily be obtained by calling the property; and from the user's perspective it always has to be called as datamodule.metadata whenever it is needed, so it would not matter whether it is computed in __init__ or in the property.
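A minimal sketch of the cached variant, assuming a _prepare_metadata method as discussed above (the class name here is hypothetical):

from functools import cached_property

class EncoderDecoderDataModule:
    @cached_property
    def metadata(self) -> dict:
        # Computed lazily on first access and cached afterwards: no work
        # in __init__, and repeated accesses are free.
        return self._prepare_metadata()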

@phoeenniixx
Contributor Author

We are finally moving ahead with this version of the implementation, and the final PRs of the prototype are listed below:
#1811, #1812, #1813 (final PRs with a basic model implementation and vignette)
We are closing this PR and moving on with the above PRs.
FYI @fkiraly, @PranavBhatP, @agobbifbk

@github-project-automation github-project-automation bot moved this from PR under review to Done in Dec 2024 - Mar 2025 mentee projects May 13, 2025
@phoeenniixx phoeenniixx deleted the newDataset-v2 branch May 13, 2025 14:01