Jupyter + ML.NET | DataFrame | Suggestion on fetching columns #5684
Comments
cc @pgovind - thoughts?
What generic would you imagine here?
public class Customer
{
    public string FirstName;
    public string LastName;
}
DataFrame<Customer> df = DataFrame<Customer>.ReadCsv(customerDataPath);
df.FirstName // this gets me the FirstName column?
Is something like that what you are envisioning?
@eerhardt - yep exactly, something along those lines. Given that a large part of the crowd using ML.NET and DataFrame comes from a .NET background, it would be so neat if we could leverage generics, LINQ and other C# syntax to work with the DataFrame.
Interesting. There is a good chance that the DataFrame API could eventually converge to something like this. This is the second time this has come up. First time is here. The DataFrame type seems to be in a space that is an intersection of data scientists and engineers, so there's always been this constant question of how strongly typed the API should be. The reasons I haven't made the APIs very strongly typed yet are: Having said all that, there are places where I realllllly wish I had more type information. The Thoughts? Feedback? We definitely don't want the API to feel alien to .NET developers!
Thank you for your detailed and long answer @pgovind! With that said, it would be awesome to have the option for both. That may not be on the roadmap yet, and that's okay :) I was thinking about it last night, and it may be possible to achieve it without asking the user to specify the entire schema (which I agree can be a bit ugly). We should be able to infer the schema from the first row (if such a row exists), and in theory create a new type from the inferred data types using reflection: https://docs.microsoft.com/en-us/dotnet/api/system.reflection.emit.typebuilder.createtype?view=netframework-4.8 When it comes to being ".NET friendly", I think my biggest ask would be to enable LINQ queries on the columns or rows, to filter, select and project data.
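For readers following along, here is a minimal sketch of what the Reflection.Emit idea above could look like. It is only an illustration under stated assumptions: `RowTypeSketch`, `InferType` and `BuildRowType` are hypothetical names, not part of Microsoft.Data.Analysis, and the inference rule (int, then double, else string) is a deliberately naive placeholder.

```csharp
using System;
using System.Globalization;
using System.Reflection;
using System.Reflection.Emit;

public static class RowTypeSketch
{
    // Hypothetical helper: guess a CLR type from a single CSV cell.
    public static Type InferType(string cell)
    {
        if (int.TryParse(cell, NumberStyles.Integer, CultureInfo.InvariantCulture, out _))
            return typeof(int);
        if (double.TryParse(cell, NumberStyles.Float, CultureInfo.InvariantCulture, out _))
            return typeof(double);
        return typeof(string);
    }

    // Hypothetical helper: emit a class with one public field per column,
    // typed according to the values found in the first data row.
    public static Type BuildRowType(string typeName, string[] header, string[] firstRow)
    {
        AssemblyBuilder assembly = AssemblyBuilder.DefineDynamicAssembly(
            new AssemblyName("InferredRows"), AssemblyBuilderAccess.Run);
        ModuleBuilder module = assembly.DefineDynamicModule("Main");
        TypeBuilder builder = module.DefineType(
            typeName, TypeAttributes.Public | TypeAttributes.Class);

        for (int i = 0; i < header.Length; i++)
        {
            builder.DefineField(header[i], InferType(firstRow[i]), FieldAttributes.Public);
        }

        // CreateType finalizes the definition and returns a usable runtime type.
        return builder.CreateType();
    }
}
```

Calling `RowTypeSketch.BuildRowType("Customer", new[] { "FirstName", "Age" }, new[] { "Ada", "36" })` would yield a type with a `string FirstName` field and an `int Age` field. One caveat with this approach: the type only exists at runtime, so the compiler cannot see its members. Code would have to reach the fields through reflection or `dynamic` rather than through a compile-time checked `df.FirstName`.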
@aslotte In the .NET notebook we wouldn't need to use reflection. We're already compiling code submissions, so we could, for example, introduce a magic command that generates and compiles this on demand, after which the generated type would be available for the duration of the notebook session. It might look something like this:
%%compile-dataframe --data c:\housing.csv --type MyData
MyData df = MyData.ReadCsv(@"c:\housing.csv");
IntColumn populationCol = df.population;
That's a great idea @jonsequitur
@jonsequitur: I missed this comment somehow. How would I go about implementing your suggestion? I'd like to prototype it. Maybe there'd be a method to generate
@pgovind This would be a good fit for the extensibility story for
FYI, you can try this out by installing the
@jonsequitur - any thoughts on contributing that directly to the Microsoft.Data.Analysis package? Then users don't need to install both? https://github.com/dotnet/corefxlab/tree/master/src/Microsoft.Data.Analysis.Interactive
When using the DataFrame object, the current way to retrieve a column is through the indexer, df["ColumnName"].
It would be very neat to instead be able to access a column as a property, i.e. df.ColumnName. This may be possible to do by creating a generic version of DataFrame, e.g.
var df = new DataFrame().ReadCsv(filePath)
I understand that the ReadCsv method is currently static, so I'm not sure whether this breaks a paradigm.
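To make the `df.ColumnName` request concrete, here is a minimal sketch of one way property-style access can be layered over name-based lookup in plain C#, using `System.Dynamic.DynamicObject`. This is not how Microsoft.Data.Analysis works and not the generic approach proposed above: `DynamicFrame` is a hypothetical stand-in that stores columns as arrays in a dictionary.

```csharp
using System;
using System.Collections.Generic;
using System.Dynamic;

// Hypothetical stand-in for a data frame: named columns kept in a dictionary.
public class DynamicFrame : DynamicObject
{
    private readonly Dictionary<string, Array> _columns = new Dictionary<string, Array>();

    // Name-based access, mirroring the df["ColumnName"] style.
    public Array this[string name]
    {
        get => _columns[name];
        set => _columns[name] = value;
    }

    // Invoked by the runtime binder for `frame.SomeName` when the
    // variable is declared as `dynamic`; the member name is the column name.
    public override bool TryGetMember(GetMemberBinder binder, out object result)
    {
        bool found = _columns.TryGetValue(binder.Name, out Array column);
        result = column;
        return found;
    }
}
```

With this in place, after `frame["FirstName"] = new[] { "Ada", "Alan" };` and `dynamic df = frame;`, the expression `df.FirstName` resolves to the same array as `frame["FirstName"]`. The trade-off is that `dynamic` gives up compile-time checking and IntelliSense, so a misspelled column name only fails at runtime. A generic or generated type would be needed to catch that at compile time.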