Sparse vs. Dense Encoding #34

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
BigCrunsh opened this issue Aug 10, 2015 · 4 comments

@BigCrunsh

I run a pipeline to extract text features as follows.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn_pandas import DataFrameMapper

pipeline = Pipeline([
    ('text', DataFrameMapper([
        ('description', CountVectorizer())
    ]))
])

This is working fine and is nicer than the approach described in [1]:

pipeline = Pipeline([
    ('text', Pipeline([
        # ItemSelector is the custom column-selecting transformer from [1]
        ('selector', ItemSelector(key='description')),
        ('bow', CountVectorizer()),
    ]))
])

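(For context: ItemSelector isn't a scikit-learn built-in; it's the small custom transformer defined in [1], roughly along these lines.)

from sklearn.base import BaseEstimator, TransformerMixin

class ItemSelector(BaseEstimator, TransformerMixin):
    """Select a single column/key from dict-like or DataFrame input."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]
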
However, the former returns a dense matrix, which is intractable for high-dimensional text features. Are you planning to change that?
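To make the problem concrete, here is a minimal sketch (mine, not from the discussion) of the scale involved: CountVectorizer itself returns a scipy sparse matrix, and densifying it materialises n_docs x n_vocab cells at once.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the quick brown fox", "jumps over the lazy dog"] * 1000
X = CountVectorizer().fit_transform(docs)

print(type(X))   # scipy.sparse CSR matrix: stores only the non-zero counts
print(X.shape)   # (2000, 8) here; real corpora easily have 10^5+ columns
dense = X.toarray()  # materialises n_docs x n_vocab int64 cells at once;
                     # e.g. 10^6 docs x 10^5 terms would be ~800 GB dense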

[1] http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html

@dukebody
Collaborator

I'm not sure we can do this without breaking anything.

Apparently there is a scipy function to hstack sparse matrices (http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.hstack.html). In my tests it stacks sparse+dense and dense+sparse correctly, but it fails for dense+dense. However, the documentation doesn't say it supports a mix of sparse and dense inputs, even though my tests show that it works.
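For a quick check of the mixed case described above (my own sketch; the scipy docs don't promise this behaviour):

import numpy as np
from scipy import sparse

bow = sparse.csr_matrix(np.array([[1, 0, 2], [0, 3, 0]]))  # sparse text features
num = np.array([[0.5], [1.5]])                             # dense numeric column

X = sparse.hstack([bow, num]).tocsr()  # the dense block is converted to sparse
print(X.shape)      # (2, 4)
print(X.toarray())  # [[1. 0. 2. 0.5], [0. 3. 0. 1.5]]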

Anyhow, I think we could go with something similar to what sklearn's FeatureUnion does:

    # Needs: import numpy as np; from scipy import sparse
    if any(sparse.issparse(f) for f in Xs):
        # at least one block is sparse: stack sparsely, convert to CSR
        Xs = sparse.hstack(Xs).tocsr()
    else:
        # all blocks are dense: plain numpy hstack keeps a dense result
        Xs = np.hstack(Xs)
    return Xs

Could you submit a pull request with the code to do so for sklearn_pandas?

@dukebody
Collaborator

@calpaterson ping! Can you review this?

I think this could break for users who expect the mapper's output to always be dense. We could add a sparse_output=False kwarg to the mapper so the previous behaviour is preserved. However, the case might be too rare to justify the overhead of an extra optional parameter; I'm not sure it's worth it.
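A sketch of what that flag could look like (the helper name and placement here are my assumptions, not an agreed API):

import numpy as np
from scipy import sparse

def stack_features(extracted, sparse_output=False):
    """Hstack per-column outputs; densify unless sparse output is requested."""
    if any(sparse.issparse(f) for f in extracted):
        stacked = sparse.hstack(extracted).tocsr()
        # sparse_output=False preserves the previous always-dense behaviour
        return stacked if sparse_output else stacked.toarray()
    return np.hstack(extracted)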

Additionally, we might want to add some documentation about this to the README so people are aware of what happens when any of the extracted features is sparse.

@calpaterson
Collaborator

I'm OK with breaking backward compatibility as long as we signal it with the versioning. Presumably we would go to 0.1.0 for this change. I'm currently reading up on sparse representations and hstack to convince myself that PR #36 deals with all the cases.

@BigCrunsh
Author

@dukebody: thanks for taking care of it!
