Jupyter + ML.NET | DataFrame | Suggestion on fetching columns #5684

aslotte · 2019-09-26T16:24:07Z

When using the DataFrame object, the current way to retrieve a column is through the indexer, df["ColumnName"].

It would be very neat to instead be able to get a column as a property, e.g. df.ColumnName. This may be possible to do by creating a generic version of a DataFrame, e.g.

var df = new DataFrame().ReadCsv(filePath)

I understand that the ReadCsv method is currently static, so I'm not sure whether this breaks a paradigm.

eerhardt · 2019-09-26T19:42:32Z

cc @pgovind - thoughts?

eerhardt · 2019-09-26T23:04:18Z

This may be possible to do by creating a generic version of a DataFrame

What generic would you imagine here?

public class Customer
{
    public string FirstName;
    public string LastName;
}

DataFrame<Customer> df = DataFrame<Customer>.ReadCsv(customerDataPath);
df.FirstName   // this gets me the FirstName column?

Is something like that what you are envisioning?

aslotte · 2019-09-26T23:16:06Z

@eerhardt - yep exactly, something along those lines.
The DataFrame class can certainly be a copy-implementation of Pandas in .NET, but I think it would be really cool if we could make it a bit more .NET-friendly :)

Given that many of the people who will be using ML.NET and DataFrame come from a .NET background, it would be so neat if we could leverage generics, LINQ, and other C# features to work with the DataFrame.
I certainly can't answer whether that is possible from a performance standpoint though. I would be more than happy to try to help out as time permits.

pgovind · 2019-09-27T01:15:45Z

Interesting. There is a good chance that the DataFrame API could eventually converge on something like this. This is the second time this has come up; the first time is here. The DataFrame type sits at the intersection of data scientists and engineers, so there has always been the question of how strongly typed the API should be. The reasons I haven't made the APIs very strongly typed yet are:
a) I was targeting a natural DataFrame + Jupyter experience in .NET and we didn't have IntelliSense working in Jupyter yet. I assume this will happen in the future. At that point, there'd be a stronger case for this IMO.
b) I wanted to be conservative with the API initially. For example, the datasets I've seen for most ML tasks seem to have many columns. In a notebook, defining a schema such as Customer, but with, say, 15 columns, didn't seem natural to me. My impression was that users might be turned off by all the code they'd have to type (especially without IntelliSense/code completion) just to read in a CSV file. Without code completion, I figured that a weakly typed API was the way to go, since it makes the code shorter and the intent clearer.

Having said all that, there are places where I realllllly wish I had more type information: the Merge/Join APIs or indexing (like you mention here), for example. It'd also help with AppendRow and, I'm sure, in other places. I'll keep this in mind as we go along. A generic DataFrame derived from a base DataFrame would certainly fit in nicely with .NET for Spark scenarios.

Thoughts? Feedback? We definitely don't want the API to feel alien to .NET developers!

aslotte · 2019-09-27T10:32:09Z

Thank you for your detailed and long answer @pgovind!
That all makes a lot of sense, and I can certainly see why the DataFrame is built as it is today.

With that said, it would be awesome to have the option for both. That may not be on the roadmap yet, and that's okay :) I was thinking about it last night, and it may be possible to achieve it without asking the user to specify the entire schema (which I agree can be a bit ugly).

We should be able to infer the schema based on the first row (if such a row exists) and, in theory, create a new type from that and the inferred data types using reflection: https://docs.microsoft.com/en-us/dotnet/api/system.reflection.emit.typebuilder.createtype?view=netframework-4.8
I haven't tried that myself, so not sure if it actually would work.
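
A minimal sketch of the inference half of that idea, not tested against DataFrame itself. It only maps column names to types guessed from the first data row and does not emit a type with TypeBuilder. The names SchemaInference, InferType and InferSchema are hypothetical, and the naive comma split assumes a CSV without quoted fields:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical helper, not part of Microsoft.Data.Analysis.
public static class SchemaInference
{
    // Picks int, then double, then falls back to string.
    public static Type InferType(string value)
    {
        if (int.TryParse(value, NumberStyles.Integer, CultureInfo.InvariantCulture, out _))
            return typeof(int);
        if (double.TryParse(value, NumberStyles.Float, CultureInfo.InvariantCulture, out _))
            return typeof(double);
        return typeof(string);
    }

    // Maps each header name to the type inferred from the first data row.
    // Assumption: fields contain no quoted commas.
    public static Dictionary<string, Type> InferSchema(string headerLine, string firstDataLine)
    {
        string[] names = headerLine.Split(',');
        string[] values = firstDataLine.Split(',');
        if (names.Length != values.Length)
            throw new ArgumentException("Header and first row have different column counts.");

        return names
            .Zip(values, (name, value) =>
                new KeyValuePair<string, Type>(name.Trim(), InferType(value.Trim())))
            .ToDictionary(pair => pair.Key, pair => pair.Value);
    }
}
```

A single row can mislead the inference (a column whose first value happens to be "3" but later holds "3.5"), so a real implementation would probably sample more rows.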

When it comes to being ".NET friendly", I think my biggest ask would be to enable LINQ queries on the columns or rows, to filter, select and project data.
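
As a rough illustration of that ask, the sketch below assumes the rows could somehow be exposed as an IEnumerable<Customer>, which is an assumption and not something DataFrame is confirmed to offer. Customer is the class from earlier in the thread; CustomerQueries and FullNamesWithLastName are hypothetical names:

```csharp
using System.Collections.Generic;
using System.Linq;

public class Customer
{
    public string FirstName;
    public string LastName;
}

// Hypothetical helper showing filter + projection with standard LINQ operators.
public static class CustomerQueries
{
    // Keeps rows with the given last name, then projects each to a full-name string.
    public static List<string> FullNamesWithLastName(IEnumerable<Customer> rows, string lastName) =>
        rows.Where(c => c.LastName == lastName)
            .Select(c => c.FirstName + " " + c.LastName)
            .ToList();
}
```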
Thank you for all your contributions to this framework!

jonsequitur · 2019-09-27T15:28:39Z

We should be able to infer the schema based on the first row (if such a row exists) and, in theory, create a new type from that and the inferred data types using reflection

@aslotte In the .NET notebook we wouldn't need to use reflection. We're already compiling code submissions, so we could, for example, introduce a magic command that generates and compiles this on demand, after which the generated type would be available for the duration of the notebook session. It might look something like this:

%%compile-dataframe --data c:\housing.csv --type MyData
MyData df = MyData.ReadCsv(@"c:\housing.csv");
IntColumn populationCol = df.population;

aslotte · 2019-09-27T15:36:32Z

That's a great idea @jonsequitur

pgovind · 2020-01-27T23:58:16Z

@jonsequitur: I missed this comment somehow. How would I go about implementing your suggestion? I'd like to prototype it. Maybe there'd be a method to generate MyData in the DataFrame library, and the magic command would call this method (from where, though)?

jonsequitur · 2020-01-28T16:19:35Z

@pgovind This would be a good fit for the extensibility story for dotnet-interactive. Since we don't have it documented yet, a quick chat might be the best way to get you started.

jonsequitur · 2020-10-22T17:28:24Z

FYI, you can try this out by installing the Microsoft.DotNet.Interactive.ExtensionLab package in a .NET Interactive notebook:

eerhardt · 2020-10-22T19:08:48Z

@jonsequitur - any thoughts on contributing that directly to the Microsoft.Data.Analysis package? Then users don't need to install both?

https://github.com/dotnet/corefxlab/tree/master/src/Microsoft.Data.Analysis.Interactive

aslotte changed the title ~~Jupyter + ML.NET | DataFrame | Fetching columns~~ Jupyter + ML.NET | DataFrame | Suggestion on fetching columns Sep 26, 2019

eerhardt transferred this issue from dotnet/machinelearning Sep 26, 2019

pgovind transferred this issue from dotnet/corefxlab Mar 6, 2021

pgovind added the Microsoft.Data.Analysis All DataFrame related issues and PRs label Mar 6, 2021
