
DataFrame feature suggestions #5670

Open · Tracked by #6144
MikaelUmaN opened this issue Jan 11, 2020 · 11 comments
Labels
Microsoft.Data.Analysis All DataFrame related issues and PRs

Comments

@MikaelUmaN

I’ve had time to check some of the APIs now and this is my feedback. Sorry if I have missed something that already exists. I just tested this yesterday so please forgive any mistakes.

Before saying anything else, I would just like to say that I think this whole initiative is awesome! So happy to see these ideas coming to dotnet. Great job.

Quick background on me is that I work in the finance industry dealing heavily with time series data of all kinds. Currently we use a mix of dotnet, python, R and MATLAB. My favored dotnet language for data analysis is F# (which will be reflected below).

F# Interface

Is there an F# tailored interface?

I didn’t see one, and while it’s possible to use all the features, the differences between F# and C# really stand out for some operations. Example (grid approximation of a posterior distribution):

let l2 = ps1 |> Array.map (fun p -> Binomial.PMFLn(p, n2, k2))
let p2 = ps1 |> Array.map (fun p -> ContinuousUniform.PDFLn(0., 1., p))
let l2Col = PrimitiveDataFrameColumn("Likelihood", l2)
let p2Col = PrimitiveDataFrameColumn("Prior", p2)
let bdf2 = DataFrame(l2Col, p2Col)

// The unstandardized likelihood.
bdf2.["UnstdPostLn"] <- bdf2.["Likelihood"] + bdf2.["Prior"]

// What I really want to do is the equivalent of pandas "assign" operation. I want to create a new column based on existing columns
// in a non-trivial way. The only alternative I found was to clone and then apply elementwise.
bdf2.["UnstdPost"] <- bdf2.["UnstdPostLn"].Clone()

// Here, type information is lost so I have to cast. Then I have to work with nullable which is a pain.
// F# has good support for a lot of nullable operators but no support for when you want to apply functions like exp.
(bdf2.["UnstdPost"] :?> PrimitiveDataFrameColumn<float>).ApplyElementwise(fun (x: Nullable<float>) i -> Nullable(exp x.Value))

// Normalizing constant.
let evidence2 = bdf2.["UnstdPost"].Sum() :?> float
bdf2.["StdPostLn"] <- bdf2.["UnstdPostLn"] - log evidence2

// Final, standardized posterior approximation. Same issues as before:
// the column has to be cloned first, then mutated elementwise.
bdf2.["StdPost"] <- bdf2.["StdPostLn"].Clone()
(bdf2.["StdPost"] :?> PrimitiveDataFrameColumn<float>).ApplyElementwise(fun x i -> Nullable(exp x.Value))

I don’t think this code is that nice from an F# perspective. I would hope that some of these quirks can be done away with either by tailoring an interface for F# or by making some other adjustments, discussed below.
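For comparison, the pandas `assign` operation mentioned in the comments above derives new columns from existing ones in one step, with no cloning, downcasting, or nullable handling. A minimal sketch with illustrative data standing in for the log-likelihood and log-prior columns:

```python
import numpy as np
import pandas as pd

# Illustrative values; in the original example these come from PMF/PDF evaluations.
df = pd.DataFrame({"Likelihood": [-1.0, -2.0, -0.5], "Prior": [0.0, 0.0, 0.0]})

# assign evaluates keyword arguments in order, so later columns can
# reference earlier ones via a lambda over the intermediate frame.
df = df.assign(
    UnstdPostLn=lambda d: d["Likelihood"] + d["Prior"],
    UnstdPost=lambda d: np.exp(d["UnstdPostLn"]),
)
```

This is the shape of API the comment is asking for: a new column defined as a function of existing columns, with element types preserved throughout.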

Extend the concept of an index

In other dataframe solutions the concept of an index column takes a central role. Usually it is an integer or a datetime. This enables easy joins as well as time series and other operations. One example for time series data is resampling: given data on a millisecond basis, I may want to resample it to seconds and perform a custom aggregation in doing so (see pandas resample).

In the .NET implementation there is an index, but it’s always integer based and you can’t supply it when creating a series (data frame column). This makes it harder than it needs to be to quickly put together a dataframe from disparate data sources. Requiring all columns to have the same length is not good enough for production use and inhibits productivity.

See pandas or deedle: index, series.
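As a concrete reference point, pandas lets you attach a datetime index when constructing a series and then resample it, e.g. from milliseconds to seconds with an aggregation of your choice. A minimal sketch with made-up timestamps and values:

```python
import pandas as pd

# Millisecond-spaced observations, with the datetime index supplied at construction.
idx = pd.date_range("2020-01-11 09:30:00", periods=4, freq="250ms")
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# Resample into one-second bins and aggregate each bin (here: sum).
per_second = s.resample("1s").sum()
```

The index is what makes this one line: the resampler groups rows by their timestamps rather than by positional offsets.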

Missing values treatment

On the DataFrame there is a DropNulls operation, but there doesn’t seem to be one on individual columns?

From my previous code example, what I could have been OK with would have been to drop all nulls from the column and then call Apply with my custom function, not having to deal with Nullable. This would have given me a new column where I have my index info (datetime) together with my new values. Then I would assign that to a new column in my dataframe. For the indices where I am missing values, the dataframe would just know that.

Currently that’s not possible (?) and it makes anything non-trivial a hassle.
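For reference, this is exactly the workflow pandas supports: drop nulls on a single column, apply a plain function with no nullable wrapper, then assign the result back and let index alignment restore the missing positions. A minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"UnstdPostLn": [-1.0, None, -3.0]})

# Drop nulls on the column alone, then apply an ordinary function.
clean = df["UnstdPostLn"].dropna().apply(np.exp)

# Assignment aligns on the index: rows 0 and 2 receive values, row 1 stays NaN.
df["UnstdPost"] = clean
```

The key piece is that `clean` carries its index along, so the dataframe "just knows" which rows are missing, as the comment above describes.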

Time series operations

The dataframe comes from the world of time series analysis, in its different forms. I think the design and implementation should recognize and honour that; otherwise I don’t see the point, as that’s where practically all applications lie.

This means out-of-the-box support for standard calculations such as moving averages. Much of this can of course be done in a third-party library but at least the necessary concepts have to exist. As I see it this is primarily what’s called “window”-functionality. In deedle and pandas it’s possible to perform windowed calculations. Either a moving window of a fixed size or an expanding window that adds a new row for each iteration. This is really useful for smoothing data and the expanding functionality is very powerful for making sure that all computations are done in a chronologically consistent way (no benefit of hindsight).

See pandas or deedle:

windowing
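The two window shapes described above look like this in pandas, as a minimal sketch:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

# Fixed-size moving window: mean over the last 2 observations.
# First row is NaN because the window is not yet full.
moving = s.rolling(window=2).mean()

# Expanding window: each row sees only data up to and including itself,
# which keeps computations chronologically consistent (no hindsight).
expanding = s.expanding().mean()
```

`rolling` covers the smoothing use case (moving averages) and `expanding` covers the "no benefit of hindsight" use case, so these two primitives would cover most of what the comment asks for.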

Summary

Great initiative. Please improve the F# experience, introduce the concept of an index (usually datetime-based), and put time series analysis center stage.

@pgovind

pgovind commented Jan 17, 2020

Thanks for the detailed feedback! This is what we were looking for with the preview package.

About ApplyElementwise: we recently merged dotnet/corefxlab#2807. You should now be able to make new columns (same or a different type) from existing ones :)

I'll open an issue for DataFrameColumn to support DropNulls.

I agree that DataFrame and DataFrameColumn need to understand what an index is. I plan to introduce windowing and/or rolling APIs first and then tackle time series support. It’s definitely in our plan for DataFrame.

Apologies for the F# experience. I just haven't had time to optimize that experience yet. I wonder if @eiriktsarpalis might be interested in looking at this and giving us his thoughts?

@eiriktsarpalis
Member

I'm not particularly knowledgeable on dataframe libraries, however it would seem that the main UX issue particular to F# in the code sample above is nullable interop. By design, F# does not permit implicit conversions, which is what makes the corresponding C# code sample work. So here are a few possible workarounds for the issue:

  • Support some form of implicit conversions in future F# language versions. I find this unlikely given technical constraints (it complicates type inference) and in the best case scenario you'd probably have to wait until the next major release. Pinging @dsyme and @cartermp who are more qualified to comment on this.
  • Consider adding columnar types for non-nullable primitives. I'm not familiar with the design decisions that led to nullable being the default, but I'm assuming it's related to the dynamically typed nature of Python?
  • Use ugly hacks that enable consuming implicit conversions in F# (not recommended).

In my limited time playing with the library, I've noticed other issues related to implicit conversion, for instance indexing using longs for rows and ints for columns. From the F# perspective this adds incidental complexity that shouldn't be a source of concern for users with data science backgrounds.

Referencing the deedle project since it has been mentioned.

@cartermp

cartermp commented Jan 21, 2020

Regarding F# specific stuff:

One area that should improve is interop with nullable-typed values: dotnet/fsharp#7989

Basically, anywhere you currently have to write Nullable x, where x is a value type and the target type is a nullable value type, the compiler will do the implicit conversion for you. It's a safe, non-breaking change and should make using this library easier.

We may want to consider finding a way to make functions like exp apply to appropriate nullable value types, but it's not quite so simple. The following C# code also does not compile:

double? x = 12.0;
var y = Math.Exp(x); // does not compile: Math.Exp has no nullable overload

That's because Math.Exp doesn't have a nullable overload, so you need to write x.Value. If this kind of thing were supported in F#, we'd need to think pretty carefully about what the behavior is when something is null, for every math function in FSharp.Core. So perhaps there's something to be done, but I don't think it's clear what.

@cartermp

cartermp commented Jan 21, 2020

Regarding this:

bdf2.["UnstdPost"] :?> PrimitiveDataFrameColumn<float>

The use of :?> is usually quite rare, since downcasting isn't typically needed in normal .NET or F# programming. Needing to downcast to a derived type feels like a smell in either the API design or some issue related to F# type inference.

@eiriktsarpalis
Member

Needing to downcast to a derived type feels like a smell in either the API design or some issue related to F# type inference.

I might be missing something, but the corresponding C# code still needs a downcast:

((PrimitiveDataFrameColumn<double>)bdf2["UnstdPost"]).ApplyElementwise((x, i) => x + i);

@cartermp

Yeah in that case I think there's some API work needed if the scenario (.ApplyElementwise on a data frame) is considered common or normal.

@dsyme
Contributor

dsyme commented Jan 21, 2020

Needing to downcast to a derived type feels like a smell in either the API design

Yes this use of inheritance and subtyping in the user-facing API design looks wrong to me

@pgovind

pgovind commented Jan 21, 2020

With dotnet/corefxlab#2807, we shouldn't lose type information anymore after Apply, so running into a need to downcast is less likely. However I don't see a design that avoids a downcast in all cases. The reason ApplyElementwise doesn't exist on DataFrameColumn is because of boxing concerns. DataFrameColumn is weakly typed, but PrimitiveDataFrameColumn has a T in it. We could consider adding a virtual ApplyElementwise method on DataFrameColumn and hiding it with a new ApplyElementwise on PrimitiveDataFrameColumn? However, statements like df[colA] + df[colB] will still return a weakly typed DataFrameColumn because we don't know the return type until runtime.

At the moment, all PrimitiveDataFrameColumns have a T : unmanaged constraint, so they can hold nullable values. To support non-nullable-only values in columns, we'd have to add a new NonNullablePrimitiveDataFrameColumn&lt;T&gt; where T : notnull, but that comes with the issue that notnull also allows reference types. I don't think we have a constraint that says "non-nullable value types only". I don't think we have a better option here other than dotnet/fsharp#7989.

cc @eerhardt for thoughts?

@eerhardt
Member

Needing to downcast to a derived type feels like a smell in either the API design

Yes this use of inheritance and subtyping in the user-facing API design looks wrong to me

I'm not sure I fully understand the API design issue here. Can someone explain it to me? bdf2["UnstdPost"] can't return a strong type - it's not possible for that to return a double-typed column.

Deedle appears to solve this problem using a generic method - titanic.GetColumn<bool>("Survived"). This seems like an alternative way of casting, it just isn't using a cast operator in the API. Would adding a similar API to Microsoft.Data.Analysis.DataFrame help here?

Another thing that could help here is https://github.com/dotnet/corefxlab/issues/2732 - generating strongly-typed DataFrame types, so you could write bdf2.UnstdPost and it would return a double-typed column.

We could consider adding a virtual ApplyElementwise method on DataFrameColumn and hiding it with a new ApplyElementwise on PrimitiveDataFrameColumn?

I think this makes sense - if we can make it work correctly. One thought would be to take a Func<T, T> as the input on the base DataFrameColumn, and if the T type doesn't match the actual column type, either do a conversion or throw. When it does match, we shouldn't need to do any boxing.

@eiriktsarpalis
Member

Deedle appears to solve this problem using a generic method - titanic.GetColumn&lt;bool&gt;("Survived"). This seems like an alternative way of casting, it just isn't using a cast operator in the API. Would adding a similar API to Microsoft.Data.Analysis.DataFrame help here?

You're right that it's effectively the same thing, however there's an element of discoverability in using a method that does it for you. In the simple casting scenario, you need to know that there's a type called PrimitiveDataFrameColumn<T> and that the column that you're interested in is really an instance of that type. OTOH a method can be quickly discovered via intellisense and the caller need only supply the type of the column as a generic parameter.

@dsyme
Contributor

dsyme commented Jan 23, 2020

You're right that it's effectively the same thing, however there's an element of discoverability in using a method that does it for you.

This is hugely important. Hierarchies are undiscoverable and baroque.

There's also the fact that modifying/removing/deprecating/overloading a method in the future is really easy, while once you have a hierarchy you're stuck with it forever, and it is very brittle.

@pgovind pgovind transferred this issue from dotnet/corefxlab Mar 6, 2021
@pgovind pgovind added the Microsoft.Data.Analysis All DataFrame related issues and PRs label Mar 6, 2021