DataFrame feature suggestions #5670
Comments
Thanks for the detailed feedback! This is what we were looking for with the preview package. About I'll open an issue for DataFrameColumn to support I agree than Apologies for the F# experience. I just haven't had time to optimize that experience yet. I wonder if @eiriktsarpalis might be interested in looking at this and giving us his thoughts? |
I'm not particularly knowledgeable on dataframe libraries, however it would seem that the main UX issue particular to F# in the code sample above is nullable interop. By design, F# does not permit implicit conversions, which is what makes the corresponding C# code sample work. So here are a few possible workarounds for the issue:
In my limited time playing with the library, I've noticed other issues related to implicit conversion, for instance indexing that uses longs for rows and ints for columns. From the F# perspective this adds incidental complexity that shouldn't be a source of concern for users with data science backgrounds. Referencing the Deedle project since it has been mentioned. |
Regarding F# specific stuff: One area that should improve is improved interop to nullable-typed values here: dotnet/fsharp#7989 Basically, anywhere you have to do We may want to consider finding a way to make functions like double? x = 12.0;
var y = Math.Exp(x); That's because |
Regarding this: bdf2.["UnstdPost"] :?> PrimitiveDataFrameColumn<float> The use of |
I might be missing something, but corresponding C# code still needs to be downcast: ((PrimitiveDataFrameColumn<double>)bdf2["UnstdPost"]).ApplyElementwise((x, i) => x + i); |
Yeah in that case I think there's some API work needed if the scenario ( |
Yes this use of inheritance and subtyping in the user-facing API design looks wrong to me |
With dotnet/corefxlab#2807, we shouldn't lose type information anymore after At the moment, all cc @eerhardt for thoughts? |
I'm not sure I fully understand the API design issue here. Can someone explain it to me? Deedle appears to solve this problem using a generic method - Another thing that could help here is https://github.com/dotnet/corefxlab/issues/2732 - generating strongly-typed
I think this makes sense - if we can make it work correctly. One thought would be to take a |
You're right that it's effectively the same thing, however there's an element of discoverability in using a method that does it for you. In the simple casting scenario, you need to know that there's a type called |
This is hugely important. Hierarchies are undiscoverable and baroque. There's also the fact that modifying, removing, deprecating, or overloading a method in the future is really easy, while once you have a hierarchy you're stuck with it forever, and it is very brittle. |
I’ve had time to check some of the APIs now and this is my feedback. Sorry if I have missed something that already exists. I just tested this yesterday so please forgive any mistakes.
Before saying anything else, I would just like to say that I think this whole initiative is awesome! So happy to see these ideas coming to dotnet. Great job.
Quick background on me is that I work in the finance industry dealing heavily with time series data of all kinds. Currently we use a mix of dotnet, python, R and MATLAB. My favored dotnet language for data analysis is F# (which will be reflected below).
F# Interface
Is there an F# tailored interface?
I didn’t see one and while it’s possible to use all features the differences between F# and C# really stand out for some operations. Example (grid approximation of posterior distribution):
I don’t think this code is that nice from an F# perspective. I would hope that some of these quirks can be done away with either by tailoring an interface for F# or by making some other adjustments, discussed below.
Extend the concept of an index
In other dataframe solutions the concept of an index column takes a central role. Usually this is an integer or a datetime. The index then enables easy joins with new time series, as well as other operations. One example for time series data is resampling. That is, given data on a millisecond basis, I may want to resample that data to seconds and perform a custom aggregation in doing so (see pandas resample).
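To make the resampling request concrete, here is a minimal sketch in plain Python of what "resample to seconds with a custom aggregation" means. The function name `resample_to_seconds` and the tick data are hypothetical and exist only for illustration; this is not the pandas or Microsoft.Data.Analysis API.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def resample_to_seconds(samples, aggregate):
    """Group (timestamp, value) pairs into one-second buckets and aggregate each bucket."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Truncating the sub-second part gives the bucket the sample belongs to.
        buckets[timestamp.replace(microsecond=0)].append(value)
    return {second: aggregate(values) for second, values in sorted(buckets.items())}


start = datetime(2020, 1, 1, 9, 0, 0)
# Hypothetical tick data: one observation every 250 ms over two seconds.
ticks = [(start + timedelta(milliseconds=250 * i), float(i)) for i in range(8)]

# Custom aggregations: the last observation and the mean of each second.
# "Last" relies on the input already being in chronological order.
last_per_second = resample_to_seconds(ticks, lambda values: values[-1])
mean_per_second = resample_to_seconds(ticks, lambda values: sum(values) / len(values))
```

The key point is that the aggregation is a user-supplied function applied per index bucket, which is only possible when the index carries time information.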
In the .NET implementation, there is an index but it's always integer based and you can't supply it when creating a series (data frame column). This makes it harder than it has to be to quickly put together a dataframe from disparate data sources. Requiring the length of all columns to be the same is not good enough for production use and inhibits productivity.
See pandas or deedle:
index
series
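As a rough illustration of why a user-supplied index removes the equal-length requirement, the sketch below aligns two series of different lengths on their index labels. The helper `align` and the sample data are hypothetical; pandas and Deedle do this alignment internally when a frame is built from indexed series.

```python
from datetime import date


def align(columns):
    """Outer-join named series (dicts keyed by index label); missing cells become None."""
    index = sorted(set().union(*(series.keys() for series in columns.values())))
    return {
        label: {name: series.get(label) for name, series in columns.items()}
        for label in index
    }


# Two series from different sources, with different lengths and date coverage.
prices = {date(2020, 1, 1): 100.0, date(2020, 1, 2): 101.5, date(2020, 1, 3): 99.0}
volumes = {date(2020, 1, 2): 1200, date(2020, 1, 4): 900}

frame = align({"price": prices, "volume": volumes})
```

Rows that exist in only one source are kept, with the other column marked as missing, so nothing has to be padded or trimmed by hand before the frame is built.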
Missing values treatment
On the dataframe there is a DropNulls operation but not when working on individual columns?
From my previous code example, I would have been happy to drop all nulls from the column and then call Apply with my custom function, without having to deal with Nullable. This would have given me a new column where I have my index info (datetime) together with my new values. Then I would assign that to a new column in my dataframe. For the indices where I am missing values, the dataframe would just know that.
Currently that’s not possible (?) and it makes anything non-trivial a hassle.
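The workflow described above can be sketched in plain Python as follows. The helpers `drop_nulls` and `apply` are hypothetical stand-ins, not existing Microsoft.Data.Analysis methods; the sketch assumes each column keeps its index labels so the result can be assigned back by label.

```python
import math
from datetime import date


def drop_nulls(series):
    """Return a new series without missing values, keeping the index labels."""
    return {label: value for label, value in series.items() if value is not None}


def apply(series, func):
    """Apply func to every value; func never sees a missing value."""
    return {label: func(value) for label, value in series.items()}


raw = {date(2020, 1, 1): 0.0, date(2020, 1, 2): None, date(2020, 1, 3): 1.0}

# The custom function works on plain floats, with no nullable handling.
transformed = apply(drop_nulls(raw), math.exp)

# Assigning back by label: labels absent from the result are simply missing.
new_column = {label: transformed.get(label) for label in raw}
```

Because the transformed column still carries its index, the frame can work out where values are missing without the user's function ever handling nulls.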
Time series operations
The dataframe comes from the world of time series analysis in different forms. I think the design and implementation should recognize and honour that. Otherwise I don’t see the point as that’s where practically all applications lie.
This means out-of-the-box support for standard calculations such as moving averages. Much of this can of course be done in a third-party library but at least the necessary concepts have to exist. As I see it this is primarily what’s called “window”-functionality. In deedle and pandas it’s possible to perform windowed calculations. Either a moving window of a fixed size or an expanding window that adds a new row for each iteration. This is really useful for smoothing data and the expanding functionality is very powerful for making sure that all computations are done in a chronologically consistent way (no benefit of hindsight).
See pandas or deedle:
windowing
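To pin down the two window types, here is a minimal plain-Python sketch of a fixed-size moving window and an expanding window. The function names are hypothetical; pandas exposes comparable behavior through `rolling` and `expanding`.

```python
def moving_windows(values, size):
    """Yield each full window of `size` consecutive values, oldest first."""
    return [values[i - size + 1 : i + 1] for i in range(size - 1, len(values))]


def expanding_windows(values):
    """Yield windows that grow by one observation per step, starting at the first."""
    return [values[: i + 1] for i in range(len(values))]


def mean(window):
    return sum(window) / len(window)


prices = [10.0, 12.0, 11.0, 13.0]

# Moving average over the two most recent observations.
moving_avg = [mean(window) for window in moving_windows(prices, 2)]

# Expanding average: each step sees only data available up to that point.
expanding_avg = [mean(window) for window in expanding_windows(prices)]
```

Each expanding window contains only observations up to the current row, which is what keeps the computation chronologically consistent.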
Summary
Great initiative. Please improve the F# experience, introduce the concept of an index (usually datetime-based), and put time series analysis center stage.