Add a Filter for text-based columns #1763

CESARDELATORRE · 2018-11-28T23:12:43Z

AFAIK, the new filter APIs can only target numeric columns, not text-based columns.
As of v0.8 we have:
1) FilterByColumn()

IDataView trainingDataView = mlContext.Data.FilterByColumn(baseTrainingDataView, "FareAmount", lowerBound: 1, upperBound: 150);

This is a very convenient filter, but it only works for NUMERIC values. I'm currently using it in the sample code snippet.

2) FilterByKeyColumnFraction()
Good for hashed values.

But suppose we want to filter by a text-based column. For example, I might want to remove the rows where a text-based column has no value (this might be doable by first transforming to numeric values, so that a missing value becomes NaN and gets filtered, but that requires an additional, not straightforward step), or to remove specific rows where a categorical value equals "some text". I think we cannot do that yet.


TomFinley · 2018-11-29T07:46:26Z

This is fine, but I wonder if we can go even a step further and make something generally applicable to many things rather than just something very, very specific to text.

So: data pipelines must be serializable, and before ML.NET was open sourced, filters were part of data pipelines. Pursuant to #933, our position is that filters should no longer be part of data pipelines, which is why the old filter functionality is not exposed as IEstimator/ITransformer but as plain functions on IDataView, e.g.:

machinelearning/src/Microsoft.ML.Data/DataLoadSave/DataOperations.cs

Line 57 in 533e186

    
           public IDataView FilterByColumn(IDataView input, string columnName, double lowerBound = double.NegativeInfinity, double upperBound = double.PositiveInfinity)

This implies that filters are no longer required to be serializable, so we could probably simplify a lot of code by deleting practically all of the existing filters and replacing them with some sort of bool-evaluating delegate akin to a LINQ .Where. (We can't serialize delegates, but since we no longer care about that, that's fine.)

This means this code here:

machinelearning/test/Microsoft.ML.Tests/RangeFilterTests.cs

Line 30 in 533e186

var data1 = ML.Data.FilterByColumn(data, "Floats", upperBound: 2.8);

Could potentially just be:

var data1 = ML.Data.KeepWhere(data, "Floats", (float v) => v < 2.8);

Your specific example (assuming that this is in some text column called MyAwesomeText), might be something akin to:

var data1 = ML.Data.KeepWhere(data, "MyAwesomeText", (ReadOnlyMemory<char> v) => v.Length > 0);

Whether we want to have these existing methods as a convenience, or add the convenience you suggest, I don't really know.
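To make the delegate idea concrete, here is a minimal, self-contained sketch of a LINQ-style `KeepWhere` over in-memory rows. It is not ML.NET code: the `FilterSketch` class, the dictionary-based row representation, and this `KeepWhere` signature are all assumptions made for illustration, and text is stored as plain `string` for brevity rather than `ReadOnlyMemory<char>`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical illustration only; none of these types are ML.NET API.
// A "row" is modeled as a map from column name to a boxed value.
public static class FilterSketch
{
    // Keeps the rows whose value in `columnName` satisfies `predicate`.
    // Like LINQ's Where, the result is evaluated lazily.
    public static IEnumerable<IReadOnlyDictionary<string, object>> KeepWhere<T>(
        IEnumerable<IReadOnlyDictionary<string, object>> rows,
        string columnName,
        Func<T, bool> predicate)
    {
        // The cast throws InvalidCastException if the column's runtime type is not T.
        return rows.Where(row => predicate((T)row[columnName]));
    }

    // Made-up sample data using the column names from the discussion above.
    public static List<IReadOnlyDictionary<string, object>> SampleRows()
    {
        return new List<IReadOnlyDictionary<string, object>>
        {
            new Dictionary<string, object> { ["Floats"] = 1.5f, ["MyAwesomeText"] = "hello" },
            new Dictionary<string, object> { ["Floats"] = 3.0f, ["MyAwesomeText"] = "" },
            new Dictionary<string, object> { ["Floats"] = 2.0f, ["MyAwesomeText"] = "some text" },
            new Dictionary<string, object> { ["Floats"] = 4.0f, ["MyAwesomeText"] = "world" },
        };
    }

    public static void Main()
    {
        var rows = SampleRows();

        // Numeric range filter, the analogue of FilterByColumn(..., upperBound: 2.8).
        var small = KeepWhere(rows, "Floats", (float v) => v < 2.8f).ToList();

        // Drop rows where the text column has no value.
        var nonEmpty = KeepWhere(rows, "MyAwesomeText", (string v) => !string.IsNullOrEmpty(v)).ToList();

        // Drop rows where the text column equals a specific value.
        var notSomeText = KeepWhere(rows, "MyAwesomeText", (string v) => v != "some text").ToList();

        Console.WriteLine($"{small.Count} {nonEmpty.Count} {notSomeText.Count}"); // prints "2 3 3"
    }
}
```

One predicate-taking method covers the numeric case and both text cases from the original request, which is the simplification being suggested. The cost, as noted above, is that a delegate cannot be serialized.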

CESARDELATORRE added the enhancement New feature or request label Nov 28, 2018

glebuk added the API Issues pertaining the friendly API label Jan 18, 2019

TomFinley mentioned this issue Feb 19, 2019

Lockdown Microsoft.ML.Data Dataview folder. #2608

Merged

codemzs closed this as completed Jun 30, 2019

ghost locked as resolved and limited conversation to collaborators Mar 26, 2022
