Merge output variables with input dataset #217

tomwhite · 2020-08-31T10:53:17Z

This is an initial attempt to implement #103 for count allele functions. Does this look like the right direction?

eric-czech · 2020-08-31T19:03:01Z

Thanks @tomwhite, that squares well with what I took away from the issue discussion.

One problem, though, is that I can see this being a frustrating experience for users:

import xarray as xr
import dask.array as da

# I load a dataset from somewhere
ds = xr.Dataset(dict(x=xr.DataArray(da.random.random(100))))

# This is what sgkit functions do (adds a `y` variable in this case)
def fn(ds):
    new_ds = xr.Dataset(dict(y=xr.DataArray(da.random.random(100))))
    return ds.merge(new_ds)

# First run
ds = fn(ds)  # No eager evaluation happens here

# Second run
ds = fn(ds)  # `y` already exists, so xarray forces a compute to compare the old and new values
# > MergeError: conflicting values for variable 'y' on objects to be combined. You can skip this check by specifying compat='override'.

I think every function should either call ds.merge(new_ds, compat='override') (which is curiously undocumented) or include some conditional logic for overwriting existing variables, perhaps by dropping the original variables by default and issuing a warning, as @timothymillar suggested in https://github.com/pystatgen/sgkit/issues/103#issuecomment-673720012.
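
For reference, here is a minimal sketch of the two merge behaviours discussed above. It uses small NumPy-backed datasets rather than dask, and the names `old` and `new` are illustrative only. As far as I understand xarray's documented semantics, compat='override' skips the comparison and keeps the variable from the dataset that `merge` is called on, so the call order decides which `y` survives:

```python
import numpy as np
import xarray as xr

# Illustrative datasets: `old` already has a `y`, `new` carries a different one.
old = xr.Dataset({"x": ("dim_0", np.arange(3)), "y": ("dim_0", np.zeros(3))})
new = xr.Dataset({"y": ("dim_0", np.ones(3))})

# compat="no_conflicts" (the default at the time of this discussion) compares
# the two `y` arrays and fails because their values differ.
try:
    old.merge(new, compat="no_conflicts")
    raised = False
except xr.MergeError:
    raised = True

# compat="override" skips the comparison; the variable from the dataset that
# merge() is called on is the one that is kept.
kept_old = old.merge(new, compat="override")
kept_new = new.merge(old, compat="override")
```

If that reading is right, calling `ds.merge(new_ds, compat='override')` would silently keep the stale `y`, so a helper that wants new variables to win would need to call merge on the new dataset instead.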

jeromekelleher · 2020-09-01T11:06:00Z

LGTM also. I think there'll be refinements we need to make as we get more experience, but this is the right basic "shape" of how things are done. As such, I'd say we merge ASAP (modulo addressing @eric-czech's points) and start building on it.

tomwhite · 2020-09-01T11:08:25Z

Thanks @eric-czech, that's a useful case to consider. I have extracted a merge_datasets function to encapsulate this behaviour (new variables overwrite old ones, and issue a warning). How does this version look?
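
To make the described behaviour concrete, here is a rough sketch of what such a helper could look like. This is not the code from the PR: the function body, parameter names, and warning message are assumptions, and `MergeWarning` is defined locally here only so the example is self-contained.

```python
import warnings

import numpy as np
import xarray as xr


class MergeWarning(UserWarning):
    """Hypothetical warning issued when a merge replaces input variables."""


def merge_datasets(input_ds: xr.Dataset, output_ds: xr.Dataset) -> xr.Dataset:
    """Merge `output_ds` into `input_ds`, letting new variables win (sketch)."""
    # Names present in both datasets are the ones that will be replaced.
    clobbered = sorted(
        set(map(str, input_ds.data_vars)) & set(map(str, output_ds.data_vars))
    )
    if clobbered:
        warnings.warn(
            "Variables in the input dataset will be replaced: "
            + ", ".join(clobbered),
            MergeWarning,
        )
    # compat="override" skips the value comparison and keeps the variable from
    # the dataset merge() is called on, so it is called on the output.
    return output_ds.merge(input_ds, compat="override")


# Usage: the second call replaces `y` and warns instead of raising MergeError.
ds = xr.Dataset({"x": ("dim_0", np.arange(3))})
first = merge_datasets(ds, xr.Dataset({"y": ("dim_0", np.zeros(3))}))
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    second = merge_datasets(first, xr.Dataset({"y": ("dim_0", np.ones(3))}))
```

Because no values are compared, the same approach should avoid the forced compute for dask-backed variables as well.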

tomwhite · 2020-09-01T11:11:47Z

@jeromekelleher I agree we want to get the general API approach established sooner rather than later. This change will impact #100 and #102 for example.

eric-czech · 2020-09-01T13:03:30Z

> I have extracted a merge_datasets function to encapsulate this behaviour (new variables overwrite old ones, and issue a warning). How does this version look?

Perfect, thank you.

tomwhite added 2 commits August 31, 2020 17:21

Count allele functions should return datasets not arrays.

ae8171c

Add merge=True to count allele functions.

fdd7b62

tomwhite force-pushed the return-dataset branch from 6b8d74a to fdd7b62 Compare August 31, 2020 16:22

tomwhite mentioned this pull request Aug 31, 2020

Summary stats #102

Merged

Issue a MergeWarning in the case when input variables are overwritten.

be502ec

Removed unused variable in test

0e4cefe

tomwhite marked this pull request as ready for review September 1, 2020 13:50

jeromekelleher approved these changes Sep 1, 2020

eric-czech approved these changes Sep 1, 2020

tomwhite merged commit ff5a0a0 into sgkit-dev:master Sep 2, 2020

tomwhite deleted the return-dataset branch September 2, 2020 08:13

tomwhite mentioned this pull request Sep 3, 2020

[WIP] Popgen stats #100

Merged

hammer mentioned this pull request Sep 3, 2020

Append output variables from functions to input dataset #103

Closed
