Skip to content
This repository was archived by the owner on Oct 15, 2020. It is now read-only.

Pysnptools reader implementation #1

Merged
merged 15 commits into from
Jul 7, 2020

Conversation

eric-czech
Copy link
Collaborator

This adds a PLINK reader as a dask.array.from_array wrapper with a pysnptools.Bed delegate. Simulated data for testing was generated by Hail with a small degree of missingness injected (10%).

One question I had for you in doing this @tomwhite was why you wanted to use zero-terminated bytes for allele strings (i.e. dtype "S"). I'm not familiar with the advantages of that but I did see that the numpy docs recommend not using it. I have no idea why they say that, but I'm curious what implications it has. If nothing else, I thought this behavior would be confusing/unexpected for most users:

> x = ds['variant/alleles'][:5].values
> x
array([[b'T', b'C'],
       [b'G', b'A'],
       [b'T', b'C'],
       [b'A', b'G'],
       [b'A', b'G']], dtype='|S1')
> x[0, 0] == 'T'
False

I'm all for it if it's a lot more efficient though.

Copy link
Collaborator

@tomwhite tomwhite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great @eric-czech.

Regarding the allele string encoding, using dtype "S1" was @alimanfoo's suggestion in https://discourse.smadstatgen.org/t/xarray-conventions-for-variation-data/44/8. This uses 1 byte per allele, compared to 4 if we used "U1". We can mitigate the user confusion by documenting this well.

variant_contig_names = list(variant_contig_names.compute())
# Otherwise index by unique name where index will correspond
# to lexsort on names, i.e. if contigs are 'chr1', 'chr2',
# ..., 'chr10' then 'chr10' comes before 'chr2'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might it be better to index in the order that the names appear? Harder to implement, but likely to correspond to the order of other contig indexes (e.g. VCF).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about that when I saw your comment in your initial changes. Personally, I thought it would make more sense to say that using the contig name index is a better practice (we'll need to support that well anyhow). Referring to them directly by index when not already encoded as ints seemed like it would invite errors. In other words, I thought this was reasonable:

ds.sel(variant=ds.contig == ds.attrs['contigs'].index('chr1'))

I don't feel strongly about it though if you think access by order-of-appearance index will be important.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've seen problems with sorting contig names in the past (in BAM/VCF headers), so I thought ordering by appearance would be making slightly fewer assumptions. (There's a neat way of doing it here BTW: https://stackoverflow.com/a/12926989).

Doesn't your ds.sel work whatever the order of contigs - i.e. lex sorted or oder-of-appearance sorted?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea the ds.sel using .attrs['contigs'] should work regardless so I think we should push ourselves and users towards it.

One advantage of the order-of-appearance indexing though would be that contig indexes in an Xarray index wouldn't cause a re-ordering relative to the original dataset. I can't think of any access patterns where sequential access across contigs is important (e.g. range queries), so the order of them shouldn't matter, but reordering contigs should lead to Dask needlessly running big slices on the other non-indexed arrays.

I'll change it but if the contig index isn't going to be an internal detail, it makes me wonder if we should use 1-based indexing. I think most geneticists would find 0-based contig codes problematic.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's using Python indexing, so 0-based makes sense I think?

Agreed on the other points.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed in 717d307

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ds.sel(variant=ds.contig == ds.attrs['contigs'].index('chr1'))

Btw I'm cool with this but it will be a bit obscure for new users. Is there any way we could make some convenience or shortcut?

Copy link
Collaborator Author

@eric-czech eric-czech Jul 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few options that come to mind:

  • Using an accessor:
ds.genetics.sel_contig(slice('chr1'))
  • Accessor specific to functions on attributes:
ds.sel(variant=ds['variant/contig'] == ds.meta.contig_index('chr1'))
  • Use contig index directly:
ds.sel(variant=ds['variant/contig'] == 0) # for chr1

I think it makes more sense to focus on the situation with an index though, which is a bit more clear on it's own IMO:

contig_index = ds.attrs['contigs'].index('chr1')
contig_range = slice(contig_index, contig_index)
# or contig_range = slice(0, 0) for chr1
bp_pos_range = slice(0, 1000000)
ds.set_index(variant=('variant/contig', 'variant/pos')).sel(variant=(contig_range, bp_pos_range))

Any thoughts/preferences on those?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @eric-czech for thinking through options. Moved this discussion over to https://github.com/pystatgen/sgkit/issues/25 to give it some more space.

@eric-czech
Copy link
Collaborator Author

Regarding the allele string encoding, using dtype "S1" was @alimanfoo's suggestion in https://discourse.smadstatgen.org/t/xarray-conventions-for-variation-data/44/8. This uses 1 byte per allele, compared to 4 if we used "U1". We can mitigate the user confusion by documenting this well.

Ah got it. 👍

@tomwhite
Copy link
Collaborator

tomwhite commented Jul 7, 2020

Thanks for making all those changes - they look good to me. I think it would be good to have order-by-appearance for the contig index.

@eric-czech
Copy link
Collaborator Author

Sure, I added and used a function for it here.

@eric-czech eric-czech merged commit 78da13a into sgkit-dev:master Jul 7, 2020
@alimanfoo
Copy link

Very nice!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants