De-transformation of samplers, filters #933

TomFinley · 2018-09-17T22:49:22Z

As we transition the code from being exclusively for a tool to being more appropriate for an API, one of the most crucial pieces of work is the one summarized in #581, where we take the concept of IDataTransform and split it into three concepts: IEstimator/ITransformer/IDataView. Currently, an IDataTransform fills all three roles (it is both the transforming model and an IDataView), which leads to a great deal of confusion when using this as an API.

The working assumption is that most things that are IDataTransform will transition to being this triad of estimator, transformer, and data view. There are, however, some probably desirable exceptions that we do not want to fully convert:

All row filter transforms (skip, take, NA filter),
The shuffle transform,
The bootstrap sampler transform.

Currently they are IDataTransform, because everything that transforms data in this fashion is an IDataTransform. However this means that there is a data model associated with it, and it is serialized just alongside every other transform.

People have historically found this confusing. For example, people want to train and test based on the same dataset, so they apply bootstrap sampling transform in their transform list -- but then the same is done to their test set so the results are all screwed up. Or, they want to train on only some of it, so they apply the Take filter -- but then their test set evaluation happens only over the first however many. There are lots of examples like this that I've seen over the years. My belief is that generally this sort of row-wise filtering/sampling being part of the data pipeline really does more harm than good.
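The Take example above can be reproduced with a toy sketch. This is plain Python, not the ML.NET API: `take`, `scale`, and `apply_pipeline` are hypothetical names invented for illustration, and a "pipeline" here is simply a list of functions over rows.

```python
from typing import Callable, List

Row = dict
Step = Callable[[List[Row]], List[Row]]


def take(n: int) -> Step:
    """Row filter (hypothetical): keep only the first n rows."""
    return lambda rows: rows[:n]


def scale(column: str, factor: float) -> Step:
    """Ordinary column transform (hypothetical): one output row per input row."""
    return lambda rows: [{**r, column: r[column] * factor} for r in rows]


def apply_pipeline(steps: List[Step], rows: List[Row]) -> List[Row]:
    """Run every step in order, feeding each step the previous step's output."""
    for step in steps:
        rows = step(rows)
    return rows


train = [{"x": float(i)} for i in range(100)]
test = [{"x": float(i)} for i in range(50)]

# The user only wanted to train on the first 10 rows...
pipeline = [take(10), scale("x", 0.5)]
train_out = apply_pipeline(pipeline, train)

# ...but the same pipeline is later replayed on the test set,
# so evaluation silently covers 10 rows instead of 50.
test_out = apply_pipeline(pipeline, test)
print(len(train_out), len(test_out))  # prints: 10 10
```

Nothing fails or warns in this sketch; the test set is simply truncated, which is why the mistake tends to go unnoticed.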

Now that we have the estimator/transformer/data triad of #581, we can implement these operations exclusively as IDataViews, not as full-blown ITransformer implementors (where someone might make the mistake of serializing them into a data pipeline).
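A rough sketch of that separation, again in toy Python rather than ML.NET: row-level operations act on the data only, while the object that would be serialized holds nothing but row-preserving transforms. `Model`, `take_rows`, and `halve_x` are hypothetical names, loosely standing in for a transformer chain and a data-view-only operation.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Row = dict


@dataclass
class Model:
    """Hypothetical stand-in for a transformer chain; only this would be saved."""
    steps: List[Callable[[Row], Row]] = field(default_factory=list)

    def transform(self, rows: List[Row]) -> List[Row]:
        # Each step maps one row to one row, so the row count never changes.
        out = rows
        for step in self.steps:
            out = [step(r) for r in out]
        return out


def take_rows(rows: List[Row], n: int) -> List[Row]:
    """Data-only operation: returns a new view of the data, never part of Model."""
    return rows[:n]


def halve_x(row: Row) -> Row:
    return {**row, "x": row["x"] * 0.5}


train = [{"x": float(i)} for i in range(100)]
test = [{"x": float(i)} for i in range(50)]

model = Model(steps=[halve_x])

# The subsetting is applied once, to the training data, and is not remembered.
train_out = model.transform(take_rows(train, 10))

# The test set passes through the same model untouched in size.
test_out = model.transform(test)
print(len(train_out), len(test_out))  # prints: 10 50
```

Because `Model` only accepts row-to-row functions, there is no way to place a filter or sampler inside it by accident, which is the property the proposal is after.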

It does technically represent a loss of capability, but I am not aware that I have ever seen a valid use case where any of these entities was used as a data-model component deliberately, and it is very difficult for me to imagine a case where people would want to do so. Every use case I have ever seen has been accidental and ultimately deeply harmful to the integrity of the user's experiments.

We will also need to decide what to do when we deserialize existing models, that is, when what had been an IDataTransform is loaded as the new ITransformer. My own preference would be that these components be replaced with a no-op transformer, since again I've never seen anything valid done with them.
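One way to picture the no-op replacement, as a toy Python sketch under stated assumptions: the JSON layout, the `kind` names in `LEGACY_ROW_OPS`, and the `LOADERS` table are all invented for illustration and do not reflect the actual ML.NET model format.

```python
import json
from typing import Callable, Dict, List

Row = dict
Step = Callable[[List[Row]], List[Row]]

# Hypothetical identifiers for the legacy row-level components.
LEGACY_ROW_OPS = {"Take", "Skip", "NAFilter", "Shuffle", "BootstrapSample"}


def identity(rows: List[Row]) -> List[Row]:
    """No-op transformer: passes every row through unchanged."""
    return rows


def make_scale(args: dict) -> Step:
    column, factor = args["column"], args["factor"]
    return lambda rows: [{**r, column: r[column] * factor} for r in rows]


# Hypothetical registry of loaders for components that are still real transformers.
LOADERS: Dict[str, Callable[[dict], Step]] = {"Scale": make_scale}


def load_pipeline(serialized: str) -> List[Step]:
    """Rebuild a pipeline, substituting a no-op for any legacy row operation."""
    steps: List[Step] = []
    for entry in json.loads(serialized):
        if entry["kind"] in LEGACY_ROW_OPS:
            steps.append(identity)
        else:
            steps.append(LOADERS[entry["kind"]](entry["args"]))
    return steps


# An old saved model that accidentally captured a Take filter.
saved = json.dumps([
    {"kind": "Take", "args": {"count": 10}},
    {"kind": "Scale", "args": {"column": "x", "factor": 0.5}},
])

rows = [{"x": float(i)} for i in range(50)]
for step in load_pipeline(saved):
    rows = step(rows)
print(len(rows))  # prints: 50
```

Keeping a placeholder in the list, rather than dropping the entry, preserves the position of the remaining components, which may matter if anything refers to them by index.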

This will also incidentally mean less work as we perform the conversion.

Ivanidzo4ka · 2019-02-15T22:04:24Z

@TomFinley I agree we need to remove IDataTransform from our code and rewrite the current data operation transforms, but I don't see that this issue still belongs in Project 13, which is API cleaning. @rogancarr did a great job of putting them in the proper places in the catalog, and publicly they are in good shape.

So if you don't mind I would remove this from Project 13 and 0.11.

TomFinley mentioned this issue Oct 2, 2018

Error due to ShuffleTransform in pipeline. #1106

Closed

Zruty0 mentioned this issue Nov 7, 2018

Add support for caching and filtering #1568

Closed

TomFinley mentioned this issue Nov 29, 2018

Add a Filter for text-based columns #1763

Closed

TomFinley mentioned this issue Jan 29, 2019

Add a Functional.Tests project that doesn't have InternalsVisibleTo #2306

Closed

rogancarr mentioned this issue Feb 2, 2019

Add BootstrapSamplingTransform to DataOperationsCatalog #2384

Closed

antoniovs1029 added the labels code-sanitation (Code consistency, maintainability, and best practices, moreso than any public API.) and P3 (Doc bugs, questions, minor issues, etc.) Jan 10, 2020
