Any way to keep track of DataFrame variable names after DataFrameMapper transformation? #7

Closed
dfd opened this issue Dec 17, 2013 · 6 comments

@dfd

dfd commented Dec 17, 2013

I'm thinking specifically of a categorical variable expanded into multiple binary variables. One of the biggest pain points in the sklearn/pandas ecosystem compared to R is keeping track of which numpy columns correspond to which categorical variables. What I'm picturing is a list of variable names that corresponds to the output of the DataFrameMapper. Is there an easy way to get this today? If not, it could make for a nice enhancement.
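To make the pain point concrete, here is a minimal sketch (the column values are made up for illustration) showing how sklearn's LabelBinarizer turns one categorical column into a bare numpy array whose columns have no names, only numeric indexes:

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# A categorical column with three levels; after binarization the
# result is a plain numpy array with one 0/1 column per level.
colors = ['red', 'green', 'blue', 'green']

lb = LabelBinarizer()
binarized = lb.fit_transform(colors)

print(binarized.shape)  # one row per sample, one column per level
print(lb.classes_)      # the level each output column stands for
```

The only record of what each output column means is the transformer's own `classes_` attribute; nothing travels with the array itself.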

@paulgb
Collaborator

paulgb commented Dec 17, 2013

I like this idea. One issue I can see is that some transformations are not 1-to-1 in terms of columns, e.g. a PCA transformation from 100 columns down to 10. Thoughts?

@dfd
Author

dfd commented Dec 17, 2013

That occurred to me too. There's no concise way to describe a PCA transformation, so maybe just name the output columns PCA1, PCA2, etc.?
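The fallback naming scheme could be generated straight from the fitted transformer. A minimal sketch, assuming sklearn's PCA (the `pca_N` name format is just one possible convention):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# PCA output columns carry no intrinsic meaning, so generated
# placeholder names are the best available option.
pca = PCA(n_components=3).fit(X)
names = ['pca_%d' % (i + 1) for i in range(pca.n_components_)]
print(names)  # ['pca_1', 'pca_2', 'pca_3']
```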

@dfd
Author

dfd commented Dec 17, 2013

One more suggestion would be to include the value of the variable in the name when transforming the categorical variables. So if a variable called 'homeowner' has values 'Y' and 'N' and the mapper represents 'Y' as 1, then perhaps call the new variable 'homeowner_Y'. This would be particularly useful for variables with many levels. R does something similar if you run a model with categorical data.
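This naming scheme can be derived from a fitted LabelBinarizer. A minimal sketch using the 'homeowner' example above (the `homeowner_Y` name is constructed here for illustration, not produced by sklearn itself):

```python
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
encoded = lb.fit_transform(['Y', 'N', 'Y'])

# For a two-level variable sklearn emits a single 0/1 column whose
# positive class is the last entry of classes_, giving 'homeowner_Y'.
name = 'homeowner_' + lb.classes_[-1]
print(name)                        # homeowner_Y
print(encoded.ravel().tolist())    # [1, 0, 1]
```

For variables with more than two levels, `classes_` lists one level per output column, so the same `variable_level` pattern extends naturally.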

@scclemens

+1 on this
I think the analogous attribute on a DictVectorizer is .feature_names_
If I'm missing a way to get this, let me know?
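For reference, DictVectorizer does keep exactly this kind of bookkeeping. A small sketch (the record fields are invented for illustration):

```python
from sklearn.feature_extraction import DictVectorizer

records = [{'homeowner': 'Y', 'age': 30},
           {'homeowner': 'N', 'age': 45}]

dv = DictVectorizer(sparse=False)
X = dv.fit_transform(records)

# DictVectorizer keeps one name per output column, embedding the
# category value for string features (e.g. 'homeowner=Y').
print(dv.feature_names_)
```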

@paulgb
Collaborator

paulgb commented Jan 21, 2014

In R the data structure that gets passed around is a data.frame which has column names. In sklearn the data structure that gets passed around is a numpy array with just numeric indexes. Unfortunately I don't see any way around this limitation.

The logic of how columns in the input of a transformer map to columns in the output depends on the transformer; there's no consistent way to probe a trained transformer object for the meaning of its output columns. OneHotEncoder stores the column meanings in .feature_indices_, LabelBinarizer in .classes_, and TfidfVectorizer in .vocabulary_. I'm extremely hesitant to bake logic into sklearn-pandas to extract the column meanings for every different transformer, because it would break the modularity of sklearn, which is what makes it so useful. In R these transformers would be required to return a unique name for each column (assuming they return a data.frame); unfortunately that isn't the case with sklearn.
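To illustrate why per-transformer logic would be needed, here is a hypothetical helper (the name `output_names` and the attribute-sniffing approach are inventions for this sketch, not part of sklearn or sklearn-pandas) that only covers two of the many attribute conventions:

```python
from sklearn.preprocessing import LabelBinarizer

# Hypothetical helper: each transformer exposes its column
# meanings under a different attribute, so every case needs
# its own branch -- and there is no general fallback.
def output_names(transformer):
    if hasattr(transformer, 'classes_'):     # e.g. LabelBinarizer
        return list(transformer.classes_)
    if hasattr(transformer, 'vocabulary_'):  # e.g. TfidfVectorizer
        vocab = transformer.vocabulary_
        return sorted(vocab, key=vocab.get)  # order by column index
    return None                              # unknown transformer

lb = LabelBinarizer().fit(['a', 'b', 'c'])
print(output_names(lb))  # ['a', 'b', 'c']
```

Every new transformer type would require another branch, which is exactly the maintenance burden described above.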

@paulgb
Collaborator

paulgb commented Jan 25, 2014

I'm closing this as I don't see a solution that will work on top of sklearn.
