Sparse vs. Dense Encoding #34
Comments
I'm not sure we can do this without breaking anything. Apparently there is a scipy function to hstack sparse matrices (http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.hstack.html). I've tested that it hstacks sparse+dense or dense+sparse correctly, but it fails for dense+dense. However, it is not documented to work for a mix of sparse and dense, even though my tests show that it does. Anyhow, I think we could go with something similar to what sklearn does.
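A quick sketch of the behavior described above: `scipy.sparse.hstack` accepts a mixed list of sparse and dense blocks, converting the dense ones and returning a sparse result. The toy matrices here are illustrative, not from the original tests.

```python
# Demonstrate scipy.sparse.hstack on a sparse + dense mix.
import numpy as np
from scipy import sparse

sp = sparse.csr_matrix(np.array([[1, 0], [0, 2]]))  # sparse block
dn = np.array([[3], [4]])                           # dense block

stacked = sparse.hstack([sp, dn])  # mixed input, sparse output
print(stacked.shape)               # (2, 3)
print(stacked.toarray())
```

The result stays sparse, so no dense copy of the full feature matrix is materialized.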
Could you submit a pull request with the code to do so for sklearn_pandas?
@calpaterson ping! Can you review this? I think this could break for users who were expecting the results of the mapper to always be dense. We could add an option to control this. Additionally, we might want to add some documentation about this in the README so people are aware of what happens if any of the extracted features is sparse.
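The concern about breaking dense-output users could be addressed along these lines. This is a hypothetical sketch, not the sklearn_pandas implementation: `stack_features` and its `sparse_output` flag are invented names illustrating how a mapper might preserve the old dense behavior by default.

```python
# Hypothetical sketch: stack extracted feature blocks, returning a sparse
# matrix only when the (assumed) sparse_output flag is set.
import numpy as np
from scipy import sparse

def stack_features(blocks, sparse_output=False):
    """Horizontally stack feature blocks (hypothetical helper).

    blocks: list of 2-D numpy arrays and/or scipy sparse matrices.
    sparse_output: if True and any block is sparse, keep the result sparse.
    """
    if sparse_output and any(sparse.issparse(b) for b in blocks):
        return sparse.hstack(blocks).tocsr()
    # Densify everything to preserve the historical dense behavior.
    dense = [b.toarray() if sparse.issparse(b) else b for b in blocks]
    return np.hstack(dense)

# Usage: mixed sparse/dense blocks, both output modes.
blocks = [sparse.csr_matrix(np.eye(2)), np.array([[1.0], [2.0]])]
dense_result = stack_features(blocks)                       # ndarray, old behavior
sparse_result = stack_features(blocks, sparse_output=True)  # csr_matrix
```

Defaulting the flag to dense output would avoid silently changing results for existing users, while still requiring a version bump to signal the new option.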
I'm OK with breaking backward compatibility so long as we signal it with the versioning. Presumably we would go to 0.1.0 for this change. I'm currently reading up on sparse representations and hstack to convince myself that PR #36 deals with all situations.
@dukebody: thanks for taking care of it!
I run a pipeline to extract text features.
This works fine and is nicer than the approach described in [1].
However, the former results in a dense encoding (which is intractable for text). Are you planning to change that?
[1] http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html
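To illustrate why a dense encoding is intractable for text: sklearn's text vectorizers return scipy sparse matrices precisely because the vocabulary can have tens of thousands of columns, almost all zero per document. The tiny corpus below is illustrative only.

```python
# Text vectorizers produce sparse output; densifying it (as a dense-only
# mapper effectively does via np.hstack) materializes every zero.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["spam ham", "ham eggs", "spam spam spam"]
X = TfidfVectorizer().fit_transform(docs)  # scipy sparse matrix

dense = X.toarray()  # dense copy: fine here, prohibitive for real corpora
print(X.shape)       # (3, 3): 3 documents, 3 vocabulary terms
```

With a realistic vocabulary of, say, 100k terms and 1M documents, the dense copy alone would need on the order of 800 GB, which is why keeping the sparse representation end-to-end matters.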