Skip to content

There are two transforms with the Friendly Name "Term Transform" #214

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
sfilipi opened this issue May 23, 2018 · 11 comments
Closed

There are two transforms with the Friendly Name "Term Transform" #214

sfilipi opened this issue May 23, 2018 · 11 comments
Assignees
Labels
bug Something isn't working good first issue Good for newcomers up-for-grabs A good issue to fix if you are trying to contribute to the project
Milestone

Comments

@sfilipi
Copy link
Member

sfilipi commented May 23, 2018

In the list of entry points there are two identical transforms, one with the name field: "Transforms.TextToKeyConverter" and the other with the name field: "Transforms.Dictionarizer".
Their Friendly Name filed is the same: "Term Transform".

This will be confusing for systems interfacing with ml.net through the entry points; ml.net should not present the same entry point choice more than once.

@shauheen shauheen added the bug Something isn't working label May 23, 2018
@shauheen shauheen added this to the 0518 milestone May 23, 2018
@shauheen shauheen added good first issue Good for newcomers up-for-grabs A good issue to fix if you are trying to contribute to the project labels May 23, 2018
@TomFinley
Copy link
Contributor

What is the proposed final name? Are we happy with either "Dictionarizer" or "TextToKey"? The first seems a bit funky (at least to my eye), but the second is not descriptive since obviously this can be applied to more than just text. So I prefer the first but only because being basically meaningless is preferable to being flat-out misleading and wrong.

@justinormont
Copy link
Contributor

justinormont commented May 23, 2018

Yeah, I was about to say, "TextToKeyConverter" is a bit wrong as it can take in about anything and convert to Key, but @TomFinley beat me to it.

I don't like either one, really. Do we want to expose the concept of keys? If so, "ToKey" is rather terse and hopefully descriptive. I'd say "AnythingToKey", but I don't know that every type, for instance the Picture type, can be converted to a Key type (and the future will bring more types).

To capture the primary use case, "LabelConverter" may be suitable. Scikit calls this a "LabelEncoder": http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

I maybe wrong but Tensorflow seems to not have a specialized transform for this, and uses the categorical featurizers:

categorical_column_with_hash_bucket(...): Represents sparse feature where ids are set by hashing.
categorical_column_with_identity(...): A _CategoricalColumn that returns identity values.
categorical_column_with_vocabulary_file(...): A _CategoricalColumn with a vocabulary file.
categorical_column_with_vocabulary_list(...): A _CategoricalColumn with in-memory vocabulary

@TomFinley
Copy link
Contributor

mmm. ToKey... I actually like that name. ToKey. Super simple, fairly descriptive in five characters. Of course, hash is also a "to key." But it's still better than the original name "term."

We sort of have to expose them. They're basically the same things as factors in R, and as far as I can tell you simply can't get around the fact that enumerations into sets is a fairly central concept of ML. Of course whether we actually wind up using "key" in the name, I don't know.

LabelEncoder might be fine, indeed term's first name was "auto-label", but I worry somewhat about what will happen when someone uses the text metatransform, inspects the pipeline, and one of the first things there is them "label-encoding" their feature inputs. :)

@shauheen shauheen modified the milestones: 0518, 0618 May 29, 2018
@shauheen shauheen modified the milestones: 0618, 0718 Jul 5, 2018
@shauheen shauheen removed this from the 0718 milestone Aug 4, 2018
@jwood803
Copy link
Contributor

@TomFinley, if it's ok I'd like to work on this. Was ToKey decided to be the name to change one of the Term Transform's to?

@shauheen shauheen added this to the 0918 milestone Aug 31, 2018
@artidoro artidoro self-assigned this Sep 7, 2018
@TomFinley
Copy link
Contributor

Hi @jwood803 I apologize, I just saw this now!!

I'm not sure it was decided. @justinormont likes it (or at least so I presume from the fact that he suggested it), I like it, but other people that might have opinions on this matter have not weighed in -- the ones I can think of are @Zruty0 , @GalOshri , @eerhardt , @KrzysztofCwalina ...

@jwood803
Copy link
Contributor

jwood803 commented Sep 8, 2018

No worries, @TomFinley. 😄

I can go ahead and mess with it to get a PR out and we can go from there if that sounds like a plan.

@TomFinley
Copy link
Contributor

TomFinley commented Sep 9, 2018

Hi @jwood803 , mmmmaaaaybe? I would just hate for this to happen, then everyone shoots it down. Some history, we named it term four years ago, and the name stuck, not because anyone liked the name per se or thought it was so great, but because in those intervening four years every time someone suggested a new name for it, someone hated it even worse. (E.g., "dictionarizer..") My prior at this point is to assume everyone will hate any renaming to term.

Let's try this. I'm going to force discussion on the issue by naming my pigsty extension method in #870 to ToKey (while still the underlying classes have Term in their name). If it passes, then we'll consider that @justinormont's suggestion stands, and then the things named Term and whatnot can be renamed.

@jwood803
Copy link
Contributor

jwood803 commented Sep 9, 2018

Sounds great, @TomFinley! I'll be on the lookout for the discussion. Thanks! 😄

@TomFinley
Copy link
Contributor

Hi @jwood803 I think everyone agrees FYI (or at least, they let the PR go in), so let's possibly count on there being no particular disagreement on the point.

@justinormont
Copy link
Contributor

@jwood803, @TomFinley : Correct, I like ToKey for being terse and hopefully descriptive.

@jwood803
Copy link
Contributor

@TomFinley @justinormont Awesome! Thanks for the update. I'll start messing with this. Thanks!

@shauheen shauheen modified the milestone: 0918 Oct 1, 2018
@shauheen shauheen closed this as completed Oct 1, 2018
@ghost ghost locked as resolved and limited conversation to collaborators Mar 30, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working good first issue Good for newcomers up-for-grabs A good issue to fix if you are trying to contribute to the project
Projects
None yet
Development

No branches or pull requests

6 participants