De-transformation of samplers, filters #933
Labels: code-sanitation (code consistency, maintainability, and best practices, more so than any public API); P3 (doc bugs, questions, minor issues, etc.)
As we transition the code from being exclusively for a tool to being more appropriate for an API, one of the most crucial parts of the work is that summarized in #581, where we take the concept of `IDataTransform` and split it into three concepts: `IEstimator`/`ITransformer`/`IDataView`. Currently, an `IDataTransform` fills all three roles (it is both the transforming model and an `IDataView`), which leads to a great deal of confusion when using this as an API.

The working assumption is that most things that are `IDataTransform` will transition to being this triad of estimator, transformer, and data view. There are, however, some probably desirable exceptions that we do not want to fully convert: the samplers and filters.

Currently they are `IDataTransform`, because everything that transforms data in this fashion is an `IDataTransform`. However, this means that there is a data model associated with each of them, and it is serialized right alongside every other transform.

People have historically found this confusing. For example, people want to train and test on the same dataset, so they put a bootstrap sampling transform in their transform list, but then the same sampling is applied to their test set, so the results are all screwed up. Or they want to train on only some of the data, so they apply the `Take` filter, but then their test set evaluation happens only over the first however many rows. There are lots of examples like this that I've seen over the years. My belief is that, generally, having this sort of row-wise filtering/sampling be part of the data pipeline really does more harm than good.

Now that we have the estimator/transformer/data triad of #581, we can expose these operations exclusively as `IDataView`s, not as fully blown `ITransformer` implementors (where someone might make the mistake of serializing them to a data pipeline).

It does technically represent a loss of capability, but I am not aware that I have ever seen a valid use case where any of these entities was used as a data-model component deliberately, and it is very difficult for me to imagine a case where people would want to do so. Every use case I have ever seen has been accidental and ultimately deeply harmful to the integrity of the user's experiments.
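
To make the distinction concrete, here is a minimal, language-neutral sketch in Python rather than the project's own code. `TakeView`, `ScaleTransformer`, and `apply_pipeline` are hypothetical stand-ins invented for illustration, not the actual `IDataView` or `ITransformer` types. The idea it tries to show is that a row limit applied as a view over the training data never becomes part of the pipeline that is later applied to the test set.

```python
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, float]


class TakeView:
    """Hypothetical data view that exposes only the first `count` rows.

    It wraps a data source and has no transform() method, so it cannot
    be placed in (or serialized with) a pipeline of transformers.
    """

    def __init__(self, source: Iterable[Row], count: int) -> None:
        self._source = source
        self._count = count

    def __iter__(self) -> Iterator[Row]:
        for index, row in enumerate(self._source):
            if index >= self._count:
                return
            yield row


class ScaleTransformer:
    """Hypothetical row-to-row transformer: one output row per input row."""

    def __init__(self, factor: float) -> None:
        self.factor = factor

    def transform(self, data: Iterable[Row]) -> Iterator[Row]:
        for row in data:
            yield {**row, "x": row["x"] * self.factor}


def apply_pipeline(pipeline: List[ScaleTransformer], data: Iterable[Row]) -> List[Row]:
    """Run every transformer in order and materialize the resulting rows."""
    for step in pipeline:
        data = step.transform(data)
    return list(data)


all_rows = [{"x": float(i)} for i in range(10)]
test_rows = [{"x": float(i)} for i in range(100, 105)]

# The row limit lives on the training data, not in the pipeline.
train_view = TakeView(all_rows, 3)
pipeline = [ScaleTransformer(2.0)]

train_out = apply_pipeline(pipeline, train_view)
test_out = apply_pipeline(pipeline, test_rows)

print(len(train_out), len(test_out))
```

Because the pipeline in this sketch holds only row-to-row steps, evaluating on `test_rows` covers every test row, which is the behavior users in the examples above expected.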
We will also need to decide what to do when we deserialize these models, that is, when we deserialize what had been an `IDataTransform` into the new `ITransformer`. My own preference would be that they be replaced with a no-op transformer, since again I've never seen anything valid done with them.

This will also incidentally mean less work as we perform the conversion.
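
As a rough illustration of the no-op preference, the following Python sketch shows one way a loader could substitute a pass-through step for legacy sampler and filter entries. Everything here is an assumption made for the example: the saved-step dictionaries, the kind names in `LEGACY_ROW_FILTER_KINDS`, and the `load_pipeline` function are hypothetical and do not reflect the real model format or loader.

```python
from typing import Dict, Iterable, List

Row = Dict[str, float]


class NoOpTransformer:
    """Hypothetical pass-through step: returns its input unchanged."""

    def transform(self, data: Iterable[Row]) -> Iterable[Row]:
        return data


class ScaleTransformer:
    """Hypothetical row-to-row transformer used as an ordinary step."""

    def __init__(self, factor: float) -> None:
        self.factor = factor

    def transform(self, data: Iterable[Row]) -> Iterable[Row]:
        return [{**row, "x": row["x"] * self.factor} for row in data]


# Hypothetical kind names for legacy row-wise samplers and filters.
LEGACY_ROW_FILTER_KINDS = {"take", "bootstrap_sample"}


def load_pipeline(saved_steps: List[dict]) -> list:
    """Rebuild a pipeline, replacing legacy samplers/filters with no-ops."""
    pipeline = []
    for step in saved_steps:
        kind = step["kind"]
        if kind in LEGACY_ROW_FILTER_KINDS:
            pipeline.append(NoOpTransformer())
        elif kind == "scale":
            pipeline.append(ScaleTransformer(step["factor"]))
        else:
            raise ValueError(f"unknown step kind: {kind}")
    return pipeline


# An old model that had a Take-style filter serialized into it.
saved = [{"kind": "take", "count": 3}, {"kind": "scale", "factor": 2.0}]
pipeline = load_pipeline(saved)

data: Iterable[Row] = [{"x": float(i)} for i in range(5)]
for step in pipeline:
    data = step.transform(data)
scored = list(data)

print(len(scored))
```

In this sketch the old model still loads and keeps its step count, but the row filter no longer drops any rows when the model is applied.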