Sparse vs. Dense Encoding #34

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
BigCrunsh opened this issue Aug 10, 2015 · 4 comments

@BigCrunsh

I run a pipeline to extract text features as follows.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn_pandas import DataFrameMapper

pipeline = Pipeline([
    ('text', DataFrameMapper([
        ('description', CountVectorizer())
    ]))
])

This is working fine and is nicer than the approach described in [1]:

pipeline = Pipeline([
    ('text', Pipeline([
        # ItemSelector is the custom column-selecting transformer from [1]
        ('selector', ItemSelector(key='description')),
        ('bow', CountVectorizer()),
    ]))
])

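(For context: ItemSelector isn't a scikit-learn built-in; it's the small custom transformer defined in [1], roughly along these lines.)

from sklearn.base import BaseEstimator, TransformerMixin

class ItemSelector(BaseEstimator, TransformerMixin):
    """Select a single column/key from dict-like or DataFrame input."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]
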
However, the former returns a dense matrix, which is intractable for high-dimensional text features. Are you planning to change that?
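To make the problem concrete, here is a minimal sketch (mine, not from the discussion) of the scale involved: CountVectorizer itself returns a scipy sparse matrix, and densifying it materialises n_docs x n_vocab cells at once.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the quick brown fox", "jumps over the lazy dog"] * 1000
X = CountVectorizer().fit_transform(docs)

print(type(X))   # scipy.sparse CSR matrix: stores only the non-zero counts
print(X.shape)   # (2000, 8) here; real corpora easily have 10^5+ columns
dense = X.toarray()  # materialises n_docs x n_vocab int64 cells at once;
                     # e.g. 10^6 docs x 10^5 terms would be ~800 GB dense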

[1] http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html

@dukebody
Collaborator

I'm not sure we can do this without breaking anything.

Apparently there is a scipy function to hstack sparse matrices (http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.hstack.html). In my tests it stacks sparse+dense and dense+sparse correctly, but it fails for dense+dense. However, the documentation doesn't say it supports a mix of sparse and dense inputs, even though my tests show that it works.
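For a quick check of the mixed case described above (my own sketch; the scipy docs don't promise this behaviour):

import numpy as np
from scipy import sparse

bow = sparse.csr_matrix(np.array([[1, 0, 2], [0, 3, 0]]))  # sparse text features
num = np.array([[0.5], [1.5]])                             # dense numeric column

X = sparse.hstack([bow, num]).tocsr()  # the dense block is converted to sparse
print(X.shape)      # (2, 4)
print(X.toarray())  # [[1. 0. 2. 0.5], [0. 3. 0. 1.5]]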

Anyhow, I think we could go with something similar to what sklearn's FeatureUnion does:

    # Needs: import numpy as np; from scipy import sparse
    if any(sparse.issparse(f) for f in Xs):
        # at least one block is sparse: stack sparsely, convert to CSR
        Xs = sparse.hstack(Xs).tocsr()
    else:
        # all blocks are dense: plain numpy hstack keeps a dense result
        Xs = np.hstack(Xs)
    return Xs

Could you submit a pull request with the code to do so for sklearn_pandas?

@dukebody
Collaborator

@calpaterson ping! Can you review this?

I think this could break for users who expect the mapper's output to always be dense. We could add a sparse_output=False kwarg to the mapper so the previous behaviour is preserved. However, the case might be too rare to justify the overhead of an extra optional parameter; I'm not sure it's worth it.
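A sketch of what that flag could look like (the helper name and placement here are my assumptions, not an agreed API):

import numpy as np
from scipy import sparse

def stack_features(extracted, sparse_output=False):
    """Hstack per-column outputs; densify unless sparse output is requested."""
    if any(sparse.issparse(f) for f in extracted):
        stacked = sparse.hstack(extracted).tocsr()
        # sparse_output=False preserves the previous always-dense behaviour
        return stacked if sparse_output else stacked.toarray()
    return np.hstack(extracted)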

Additionally, we might want to add some documentation about this to the README so people are aware of what happens when any of the extracted features is sparse.

@calpaterson
Collaborator

I'm OK with breaking backward compatibility as long as we signal it with the versioning. Presumably we would go to 0.1.0 for this change. I'm currently reading up on sparse representations and hstack to convince myself that PR #36 deals with all the cases.

@BigCrunsh
Author

@dukebody: thanks for taking care of it!
