Any way to keep track of DataFrame variable names after DataFrameMapper transformation? #7

Closed
dfd opened this issue Dec 17, 2013 · 6 comments

@dfd

dfd commented Dec 17, 2013

I'm thinking specifically of a categorical variable expanded into multiple binary variables. One of the biggest pain points in the sklearn/pandas ecosystem compared to R is keeping track of which numpy columns correspond to which categorical variables. What I'm picturing is a list of variable names that corresponds to the output of the DataFrameMapper. Is there an easy way to get this today? If not, it could make for a nice enhancement.
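To make the pain point concrete, here is a minimal sketch (the column values are made up for illustration) showing how sklearn's LabelBinarizer turns one categorical column into a bare numpy array whose columns have no names, only numeric indexes:

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# A categorical column with three levels; after binarization the
# result is a plain numpy array with one 0/1 column per level.
colors = ['red', 'green', 'blue', 'green']

lb = LabelBinarizer()
binarized = lb.fit_transform(colors)

print(binarized.shape)  # one row per sample, one column per level
print(lb.classes_)      # the level each output column stands for
```

The only record of what each output column means is the transformer's own `classes_` attribute; nothing travels with the array itself.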

@paulgb
Collaborator

paulgb commented Dec 17, 2013

I like this idea. One issue I can see is that some transformations are not 1-to-1 in terms of columns, e.g. a PCA transformation from 100 columns down to 10. Thoughts?

@dfd
Author

dfd commented Dec 17, 2013

That occurred to me too. There's no concise way to describe a PCA transformation, so maybe just name the output columns PCA1, PCA2, etc.?
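The fallback naming scheme could be generated straight from the fitted transformer. A minimal sketch, assuming sklearn's PCA (the `pca_N` name format is just one possible convention):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# PCA output columns carry no intrinsic meaning, so generated
# placeholder names are the best available option.
pca = PCA(n_components=3).fit(X)
names = ['pca_%d' % (i + 1) for i in range(pca.n_components_)]
print(names)  # ['pca_1', 'pca_2', 'pca_3']
```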

@dfd
Author

dfd commented Dec 17, 2013

One more suggestion would be to include the value of the variable in the name when transforming the categorical variables. So if a variable called 'homeowner' has values 'Y' and 'N' and the mapper represents 'Y' as 1, then perhaps call the new variable 'homeowner_Y'. This would be particularly useful for variables with many levels. R does something similar if you run a model with categorical data.
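This naming scheme can be derived from a fitted LabelBinarizer. A minimal sketch using the 'homeowner' example above (the `homeowner_Y` name is constructed here for illustration, not produced by sklearn itself):

```python
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
encoded = lb.fit_transform(['Y', 'N', 'Y'])

# For a two-level variable sklearn emits a single 0/1 column whose
# positive class is the last entry of classes_, giving 'homeowner_Y'.
name = 'homeowner_' + lb.classes_[-1]
print(name)                        # homeowner_Y
print(encoded.ravel().tolist())    # [1, 0, 1]
```

For variables with more than two levels, `classes_` lists one level per output column, so the same `variable_level` pattern extends naturally.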

@scclemens

+1 on this
I think the analogous attribute on a DictVectorizer is .feature_names_
If I'm missing a way to get this, let me know?
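For reference, DictVectorizer does keep exactly this kind of bookkeeping. A small sketch (the record fields are invented for illustration):

```python
from sklearn.feature_extraction import DictVectorizer

records = [{'homeowner': 'Y', 'age': 30},
           {'homeowner': 'N', 'age': 45}]

dv = DictVectorizer(sparse=False)
X = dv.fit_transform(records)

# DictVectorizer keeps one name per output column, embedding the
# category value for string features (e.g. 'homeowner=Y').
print(dv.feature_names_)
```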

@paulgb
Collaborator

paulgb commented Jan 21, 2014

In R the data structure that gets passed around is a data.frame which has column names. In sklearn the data structure that gets passed around is a numpy array with just numeric indexes. Unfortunately I don't see any way around this limitation.

The logic of how columns in the input of a transformer map to columns in the output depends on the transformer; there's no consistent way to probe a trained transformer object for the meaning of its output columns. OneHotEncoder stores the column meanings in .feature_indices_, LabelBinarizer in .classes_, and TfidfVectorizer in .vocabulary_. I'm extremely hesitant to bake logic into sklearn-pandas to extract the column meanings for every different transformer, because it would break the modularity of sklearn, which is what makes it so useful. In R these transformers would be required to return a unique name for each column (assuming they return a data.frame); unfortunately that isn't the case with sklearn.
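To illustrate why per-transformer logic would be needed, here is a hypothetical helper (the name `output_names` and the attribute-sniffing approach are inventions for this sketch, not part of sklearn or sklearn-pandas) that only covers two of the many attribute conventions:

```python
from sklearn.preprocessing import LabelBinarizer

# Hypothetical helper: each transformer exposes its column
# meanings under a different attribute, so every case needs
# its own branch -- and there is no general fallback.
def output_names(transformer):
    if hasattr(transformer, 'classes_'):     # e.g. LabelBinarizer
        return list(transformer.classes_)
    if hasattr(transformer, 'vocabulary_'):  # e.g. TfidfVectorizer
        vocab = transformer.vocabulary_
        return sorted(vocab, key=vocab.get)  # order by column index
    return None                              # unknown transformer

lb = LabelBinarizer().fit(['a', 'b', 'c'])
print(output_names(lb))  # ['a', 'b', 'c']
```

Every new transformer type would require another branch, which is exactly the maintenance burden described above.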

@paulgb
Collaborator

paulgb commented Jan 25, 2014

I'm closing this as I don't see a solution that will work on top of sklearn.
