
De-transformation of samplers, filters #933


Open
TomFinley opened this issue Sep 17, 2018 · 1 comment
Labels
code-sanitation: Code consistency, maintainability, and best practices, moreso than any public API.
P3: Doc bugs, questions, minor issues, etc.

Comments

@TomFinley
Contributor

As we transition the code from being exclusively for a tool to being more appropriate for an API, one of the most crucial parts of the work is the one summarized in #581, where we take the concept of IDataTransform and split it into three concepts: IEstimator, ITransformer, and IDataView. Currently, an IDataTransform fills all three roles (it is both the transforming model and an IDataView), which leads to a great deal of confusion when using this as an API.

The working assumption is that most things that are IDataTransform will transition to being this triad of estimator, transformer, and data view. There are, however, some exceptions that we probably do not want to fully convert:

  • All row filter transforms (skip, take, NA filter),
  • The shuffle transform,
  • The bootstrap sampler transform.

Currently they are IDataTransform, because everything that transforms data in this fashion is an IDataTransform. However, this means that each of them has a data model associated with it, and that model is serialized right alongside every other transform.

People have historically found this confusing. For example, people want to train and test based on the same dataset, so they apply the bootstrap sampling transform in their transform list, but then the same sampling is applied to their test set, so the results are all screwed up. Or they want to train on only some of the data, so they apply the Take filter, but then their test set evaluation happens only over the first however many rows. There are lots of examples like this that I've seen over the years. My belief is that, generally, having this sort of row-wise filtering/sampling be part of the data pipeline really does more harm than good.
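To make the Take failure mode concrete, here is a minimal toy sketch in Python. It is not ML.NET code: `take`, `scale`, and `apply_pipeline` are hypothetical names made up for illustration, and the "pipeline" is just a list of functions that gets replayed over whatever data it is handed.

```python
# Toy illustration only; none of these names are ML.NET APIs.

def take(n):
    """Row filter that keeps only the first n rows."""
    return lambda rows: rows[:n]


def scale(factor):
    """An ordinary per-row transform."""
    return lambda rows: [r * factor for r in rows]


def apply_pipeline(steps, rows):
    """Replay every step of the pipeline, in order, over the given rows."""
    for step in steps:
        rows = step(rows)
    return rows


# The user only meant to *train* on the first 3 rows...
pipeline = [take(3), scale(10)]

train_rows = list(range(100))
test_rows = list(range(100, 120))

trained_on = apply_pipeline(pipeline, train_rows)

# ...but the same pipeline is later replayed over the test set, so the
# evaluation silently covers only 3 of the 20 test rows.
evaluated_on = apply_pipeline(pipeline, test_rows)
```

Because the filter is just another step in the pipeline, nothing distinguishes "sample my training data" from "part of the model", which is the confusion described above.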

Now that we have the estimator/transformer/data triad of #581, we can expose these operations exclusively as IDataViews, not as full-blown ITransformer implementors (where someone might make the mistake of serializing them to a data pipeline).
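A rough sketch of that separation, again as a Python toy rather than the actual ML.NET design: `Scale`, `Model`, `saved_steps`, and `take` are all hypothetical names. The point is only that the row filter is a plain operation on data and never becomes part of the object that gets saved.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Scale:
    """Hypothetical transformer: a per-row operation that belongs in the model."""
    factor: float

    def transform(self, rows: Sequence[float]) -> List[float]:
        return [r * self.factor for r in rows]


class Model:
    """Hypothetical chain of transformers; only these steps are ever saved."""

    def __init__(self, transformers):
        self.transformers = list(transformers)

    def transform(self, rows):
        for t in self.transformers:
            rows = t.transform(rows)
        return list(rows)

    def saved_steps(self):
        # Stand-in for serialization: the names of the steps that would be saved.
        return [type(t).__name__ for t in self.transformers]


def take(rows, n):
    """Row filter as a plain data operation: data in, data out, nothing to save."""
    return list(rows[:n])


all_rows = [float(i) for i in range(100)]
test_rows = [float(i) for i in range(100, 120)]

train_view = take(all_rows, 3)      # applied to the data, outside the model
model = Model([Scale(10.0)])

scored_train = model.transform(train_view)
scored_test = model.transform(test_rows)   # all 20 test rows are scored
```

Here the filter cannot leak into evaluation, because the saved model has no record that it was ever applied.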

It does technically represent a loss of capability, but I am not aware of ever having seen a valid use case where any of these entities was used as a data-model component deliberately, and it is very difficult for me to imagine a case where people would want to do so. Every use case I have ever seen has been accidental and ultimately deeply harmful to the integrity of the user's experiments.

We will also need to decide what to do when we deserialize existing models, that is, when we deserialize what had been an IDataTransform into the new ITransformer. My own preference would be that these transforms be replaced with a no-op transformer, since again I've never seen anything valid done with them.
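One way the no-op replacement could look, sketched in Python under loud assumptions: the saved format (a list of `(name, args)` pairs), the names in `LEGACY_ROW_FILTERS`, and the `NoOp`, `Scale`, `load_model`, and `run` helpers are all hypothetical and do not reflect the real ML.NET model format or loader.

```python
# Hypothetical names for the legacy row filters/samplers found in old models.
LEGACY_ROW_FILTERS = {"Skip", "Take", "NAFilter", "Shuffle", "Bootstrap"}


class NoOp:
    """Transformer that passes rows through unchanged."""

    def transform(self, rows):
        return list(rows)


class Scale:
    """Hypothetical ordinary transformer that is loaded as usual."""

    def __init__(self, factor):
        self.factor = factor

    def transform(self, rows):
        return [r * self.factor for r in rows]


# Hypothetical registry mapping saved step names to constructors.
REGISTRY = {"Scale": lambda args: Scale(**args)}


def load_model(saved_steps):
    """Rebuild the chain, swapping any legacy row filter for a no-op."""
    steps = []
    for name, args in saved_steps:
        if name in LEGACY_ROW_FILTERS:
            steps.append(NoOp())          # drop the filtering behavior on load
        else:
            steps.append(REGISTRY[name](args))
    return steps


def run(steps, rows):
    for step in steps:
        rows = step.transform(rows)
    return rows


old_model = [("Take", {"n": 3}), ("Scale", {"factor": 10})]
loaded = load_model(old_model)
scored = run(loaded, [1, 2, 3, 4, 5])     # all five rows survive the old Take
```

Keeping a placeholder in the chain, rather than deleting the step, preserves the number and order of steps in the loaded model; whether that matters would depend on how the real loader addresses them.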

Incidentally, this will also mean less work as we perform the conversion.

@Ivanidzo4ka
Contributor

@TomFinley I agree we need to remove IDataTransform from our code and rewrite the current data operation transforms, but I don't see how this issue can still belong to Project 13, which is API cleaning. @rogancarr did a great job of putting them in the proper places in the catalog, and publicly they are in good shape.

So, if you don't mind, I would remove this from Project 13 and 0.11.

@antoniovs1029 added the code-sanitation and P3 labels on Jan 10, 2020