Jupyter + ML.NET | DataFrame | Suggestion on fetching columns #5684

Open
aslotte opened this issue Sep 26, 2019 · 11 comments
Labels
Microsoft.Data.Analysis All DataFrame related issues and PRs

Comments

@aslotte
Contributor

aslotte commented Sep 26, 2019

When using the DataFrame object, the current way to retrieve a column is through the string indexer, df["ColumnName"].

It would be very neat to instead be able to access a column as a property, i.e. df.ColumnName. This may be possible to do by creating a generic version of a DataFrame, e.g.

var df = new DataFrame().ReadCsv(filePath)

I understand that the ReadCsv method is currently static, so I'm not sure if this breaks a paradigm.

@aslotte aslotte changed the title from "Jupyter + ML.NET | DataFrame | Fetching columns" to "Jupyter + ML.NET | DataFrame | Suggestion on fetching columns" Sep 26, 2019
@eerhardt eerhardt transferred this issue from dotnet/machinelearning Sep 26, 2019
@eerhardt
Member

cc @pgovind - thoughts?

@eerhardt
Member

This may be possible to do by creating a generic version of a DataFrame

What generic would you imagine here?

public class Customer
{
    public string FirstName;
    public string LastName;
}

DataFrame<Customer> df = DataFrame<Customer>.ReadCsv(customerDataPath);
df.FirstName   // this gets me the FirstName column?

Is something like that what you are envisioning?

@aslotte
Contributor Author

aslotte commented Sep 26, 2019

@eerhardt - yep exactly, something along those lines.
The DataFrame class can certainly be a copy-implementation of Pandas in .NET, but I think it would be really cool if we could make it a bit more .NET friendly :)

Given that a large share of the people who will be using ML.NET and DataFrame come from a .NET background, it would be so neat if we could leverage generics, LINQ and other C# syntax to work with the DataFrame.
I certainly can't answer whether that is possible from a performance standpoint though. I would be more than happy to try to help out as time permits.

@pgovind

pgovind commented Sep 27, 2019

Interesting. There is a good chance that the DataFrame API could eventually converge to something like this. This is the second time this has come up. First time is here. The DataFrame type seems to be in a space that is an intersection of data scientists and engineers, so there's always been this constant question of how strongly typed the API should be. The reasons I haven't made the APIs very strongly typed yet are:
a) I was targeting a natural DataFrame + Jupyter experience in .NET and we didn't have IntelliSense working in Jupyter yet. I assume this will happen in the future. At that point, there'd be a stronger case for this IMO.
b) I wanted to be conservative with the API initially. For example, the datasets I've seen for most ML tasks seem to have many columns. In a notebook, defining a schema such as Customer but with, say, 15 columns didn't seem natural to me. My impression was that users might be turned off by all the code they'd have to type (especially without IntelliSense/code completion) just to read in a CSV file. Without code completion, I figured that a weakly typed API was the way to go since it made code shorter and the intent clearer.

Having said all that, there are places where I realllllly wish I had more type information: the Merge/Join APIs or indexing (like you mention here), for example. It'd also help with AppendRow and, I'm sure, in other places. I'll keep this in mind as we go along. A generic DataFrame derived from a base DataFrame will fit in nicely for sure with .NET for Spark scenarios.

Thoughts? Feedback? We definitely don't want the API to feel alien to .NET developers!

@aslotte
Contributor Author

aslotte commented Sep 27, 2019

Thank you for your detailed and long answer @pgovind!
That all makes a lot of sense, and I can certainly see why the DataFrame is built as it is today.

With that said, it would be awesome to have the option for both. That may not be on the roadmap yet, and that's okay :) I was thinking about it last night, and it may be possible to achieve it without asking the user to specify the entire schema (which I agree can be a bit ugly).

We should be able to infer the schema based on the first row (if such a row exists) and, in theory, create a new type from the column names and the inferred data types using reflection: https://docs.microsoft.com/en-us/dotnet/api/system.reflection.emit.typebuilder.createtype?view=netframework-4.8
I haven't tried that myself, so I'm not sure if it would actually work.
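A minimal sketch of that idea, assuming the header and first data row are already available as string arrays. SchemaInference, InferType, BuildRowType and the sample values are hypothetical names made up for illustration; only the System.Reflection.Emit calls are real .NET APIs. The type guess (int, then double, else string) is deliberately naive.

```csharp
using System;
using System.Globalization;
using System.Reflection;
using System.Reflection.Emit;

// Hypothetical header and first data row of a CSV file.
string[] header = { "FirstName", "Age", "Balance" };
string[] firstRow = { "Ada", "36", "1024.50" };

Type rowType = SchemaInference.BuildRowType("CustomerRow", header, firstRow);
foreach (FieldInfo field in rowType.GetFields())
{
    Console.WriteLine($"{field.Name}: {field.FieldType.Name}");
}

static class SchemaInference
{
    // Guess a CLR type from a single sample value: int, then double, else string.
    public static Type InferType(string value)
    {
        if (int.TryParse(value, NumberStyles.Integer, CultureInfo.InvariantCulture, out _))
            return typeof(int);
        if (double.TryParse(value, NumberStyles.Float, CultureInfo.InvariantCulture, out _))
            return typeof(double);
        return typeof(string);
    }

    // Emit a class at run time with one public field per column.
    public static Type BuildRowType(string typeName, string[] header, string[] firstRow)
    {
        if (header.Length != firstRow.Length)
            throw new ArgumentException("Header and first row must have the same number of columns.");

        AssemblyBuilder assembly = AssemblyBuilder.DefineDynamicAssembly(
            new AssemblyName("InferredSchemas"), AssemblyBuilderAccess.Run);
        ModuleBuilder module = assembly.DefineDynamicModule("InferredSchemas");
        TypeBuilder builder = module.DefineType(
            typeName, TypeAttributes.Public | TypeAttributes.Class);

        for (int i = 0; i < header.Length; i++)
        {
            builder.DefineField(header[i], InferType(firstRow[i]), FieldAttributes.Public);
        }
        return builder.CreateType();
    }
}
```

One limitation of this route: a type emitted at run time is not known to the compiler, so its members can only be reached through reflection or dynamic, and df.FirstName would get no compile-time checking or code completion.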

When it comes to being ".NET friendly", I think my biggest ask would be to enable LINQ queries on the columns or rows, to filter, select and project data.
Thank you for all your contributions to this framework!
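To make the ask concrete, here is a sketch of the kind of query meant, run over a plain in-memory list rather than a real DataFrame. CustomerRow, CustomerQueries and the sample data are hypothetical; whether DataFrame rows could be exposed in this shape is exactly the open question in this thread.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical rows, standing in for what a typed DataFrame might expose.
var rows = new List<CustomerRow>
{
    new CustomerRow("Ada", "Lovelace", 36),
    new CustomerRow("Grace", "Hopper", 85),
    new CustomerRow("Alan", "Turing", 41),
};

foreach (string name in CustomerQueries.FullNamesAtLeast(rows, 40))
{
    Console.WriteLine(name);
}

// One strongly typed row; each property corresponds to a column.
record CustomerRow(string FirstName, string LastName, int Age);

static class CustomerQueries
{
    // Filter on one column, order by another, and project to a new shape.
    public static List<string> FullNamesAtLeast(IEnumerable<CustomerRow> rows, int minAge)
    {
        return rows
            .Where(r => r.Age >= minAge)
            .OrderBy(r => r.LastName)
            .Select(r => $"{r.FirstName} {r.LastName}")
            .ToList();
    }
}
```

Row-by-row enumeration like this is convenient but works against a column-oriented layout, which is where the performance question raised earlier comes in.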

@jonsequitur

We should be able to infer the schema based on the first row (if such a row exists) and, in theory, create a new type from the column names and the inferred data types using reflection

@aslotte In the .NET notebook we wouldn't need to use reflection. We're already compiling code submissions, so we could, for example, introduce a magic command that generates and compiles this on demand, after which the generated type would be available for the duration of the notebook session. It might look something like this:

%%compile-dataframe --data c:\housing.csv --type MyData
MyData df = MyData.ReadCsv(@"c:\housing.csv");
IntColumn populationCol = df.population;
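The generation half of such a command could be sketched roughly as follows, assuming the column names have already been read from the CSV header. WrapperGenerator and its methods are hypothetical, and the emitted class is only one possible shape: it wraps a DataFrame and forwards each property to the string indexer mentioned at the top of this issue. The sketch only produces source text; compiling it in the notebook session is not shown.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical column names, as they might appear in a CSV header.
string source = WrapperGenerator.Generate("MyData", new[] { "population", "median income" });
Console.WriteLine(source);

static class WrapperGenerator
{
    // Turn a column name into a usable C# identifier (C# keywords are not handled here).
    public static string ToIdentifier(string columnName)
    {
        char[] chars = columnName.Select(c => char.IsLetterOrDigit(c) ? c : '_').ToArray();
        string id = new string(chars);
        return id.Length == 0 || char.IsDigit(id[0]) ? "_" + id : id;
    }

    // Build the source text of a wrapper class with one property per column.
    public static string Generate(string typeName, string[] columnNames)
    {
        var sb = new StringBuilder();
        sb.AppendLine($"public class {typeName}");
        sb.AppendLine("{");
        sb.AppendLine("    private readonly DataFrame _frame;");
        sb.AppendLine($"    public {typeName}(DataFrame frame) => _frame = frame;");
        foreach (string name in columnNames)
        {
            // Escape the column name so it is a valid C# string literal.
            string escaped = name.Replace("\\", "\\\\").Replace("\"", "\\\"");
            sb.AppendLine($"    public DataFrameColumn {ToIdentifier(name)} => _frame[\"{escaped}\"];");
        }
        sb.AppendLine("}");
        return sb.ToString();
    }
}
```

Because the class exists as source before it is compiled, its properties are visible to the compiler and to code completion, which a type created through reflection at run time would not be.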

@aslotte
Contributor Author

aslotte commented Sep 27, 2019

That's a great idea @jonsequitur

@pgovind

pgovind commented Jan 27, 2020

@jonsequitur: I missed this comment somehow. How would I go about implementing your suggestion? I'd like to prototype it. Maybe there'd be a method to generate MyData in the DataFrame library and the magic command would call this method (from where though)?

@jonsequitur

@pgovind This would be a good fit for the extensibility story for dotnet-interactive. Since we don't have it documented yet, a quick chat might be the best way to get you started.

@jonsequitur

FYI, you can try this out by installing the Microsoft.DotNet.Interactive.ExtensionLab package in a .NET Interactive notebook:

linqify

@eerhardt
Member

@jonsequitur - any thoughts on contributing that directly to the Microsoft.Data.Analysis package? Then users wouldn't need to install both.

https://github.com/dotnet/corefxlab/tree/master/src/Microsoft.Data.Analysis.Interactive

@pgovind pgovind transferred this issue from dotnet/corefxlab Mar 6, 2021
@pgovind pgovind added the Microsoft.Data.Analysis All DataFrame related issues and PRs label Mar 6, 2021
4 participants