Any way to keep track of DataFrame variable names after DataFrameMapper transformation? #7
I like this idea. One issue I can see is that some transformations are not 1-to-1 in terms of columns, e.g. a PCA transformation from 100 columns to 10. Thoughts?
That occurred to me too. There's no concise way to describe a PCA transformation, so maybe just name the output columns PCA1, PCA2, etc.?
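A minimal sketch of that naming scheme, in case it helps the discussion. `component_names` is a hypothetical helper made up for illustration, not something sklearn or sklearn-pandas provides:

```python
def component_names(prefix, n_components):
    """Return numbered names such as PCA1, PCA2, ... for a many-to-few transform.

    Hypothetical helper for illustration only.
    """
    return [f"{prefix}{i}" for i in range(1, n_components + 1)]


# 100 input columns reduced to 10 components still get 10 readable names.
pca_names = component_names("PCA", 10)
print(pca_names[0], pca_names[-1])  # PCA1 PCA10
```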
One more suggestion would be to include the value of the variable in the name when transforming the categorical variables. So if a variable called 'homeowner' has values 'Y' and 'N' and the mapper represents 'Y' as 1, then perhaps call the new variable 'homeowner_Y'. This would be particularly useful for variables with many levels. R does something similar if you run a model with categorical data.
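To make the 'homeowner_Y' idea concrete, here is a rough pure-Python sketch. `one_hot_with_names` is a hypothetical helper, not part of sklearn or sklearn-pandas, and it assumes one indicator column per level rather than a single 0/1 column:

```python
def one_hot_with_names(column_name, values):
    """Encode a categorical column as 0/1 indicator columns with readable names.

    Hypothetical helper for illustration only.
    """
    levels = sorted(set(values))  # sorted so the column order is reproducible
    names = [f"{column_name}_{level}" for level in levels]
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return names, rows


names, rows = one_hot_with_names("homeowner", ["Y", "N", "Y"])
print(names)  # ['homeowner_N', 'homeowner_Y']
print(rows)   # [[0, 1], [1, 0], [0, 1]]
```

Dropping one level (keeping only 'homeowner_Y') would match the single-column encoding described above; the sketch keeps both levels for simplicity.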
+1 on this.
In R the data structure that gets passed around is a data.frame which has column names. In sklearn the data structure that gets passed around is a numpy array with just numeric indexes. Unfortunately I don't see any way around this limitation. The logic of how columns in the input of a transformer map to columns in the output depends on the transformer; there's no consistent way to probe a trained transformer object for the meaning of its output columns. OneHotEncoder stores the column meanings in
I'm closing this as I don't see a solution that will work on top of sklearn.
I'm thinking specifically of mapping a categorical variable to multiple binary variables. One of the biggest pain points in the sklearn pandas ecosystem compared to R is keeping track of which numpy columns correspond to which categorical variables. I guess what I'm picturing is perhaps a list of variable names that corresponds to the output of the DataFrameMapper. Is there an easy way to get this today? If not, it could make for a nice enhancement.
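Roughly what I have in mind, as a sketch only. Nothing here is existing DataFrameMapper API: `output_names`, `passthrough`, `one_hot`, and the 'age' column are all made up for illustration. Each input column is paired with a function that says what its output columns are called, and the results are concatenated in order:

```python
def passthrough(column):
    """A 1-to-1 transformation keeps the input column's name."""
    return [column]


def one_hot(levels):
    """Build a namer for a categorical column with the given known levels."""
    def namer(column):
        return [f"{column}_{level}" for level in levels]
    return namer


def output_names(features):
    """Flatten (input_column, namer) pairs into one list of output names.

    Position i in the result names column i of the transformed numpy array,
    assuming the outputs are stacked in the same order as `features`.
    """
    names = []
    for column, namer in features:
        names.extend(namer(column))
    return names


features = [("age", passthrough), ("homeowner", one_hot(["N", "Y"]))]
print(output_names(features))  # ['age', 'homeowner_N', 'homeowner_Y']
```

The catch is that every transformer would need its own namer, since the input-to-output column mapping depends on the transformer.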