Pysnptools reader implementation #1

eric-czech · 2020-07-06T14:08:28Z

This adds a PLINK reader as a dask.array.from_array wrapper with a pysnptools.Bed delegate. Simulated data for testing was generated by Hail with a small degree of missingness injected (10%).

One question I had for you in doing this @tomwhite was why you wanted to use zero-terminated bytes for allele strings (i.e. dtype "S"). I'm not familiar with the advantages of that but I did see that the numpy docs recommend not using it. I have no idea why they say that, but I'm curious what implications it has. If nothing else, I thought this behavior would be confusing/unexpected for most users:

> x = ds['variant/alleles'][:5].values
> x
array([[b'T', b'C'],
       [b'G', b'A'],
       [b'T', b'C'],
       [b'A', b'G'],
       [b'A', b'G']], dtype='|S1')
> x[0, 0] == 'T'
False

I'm all for it if it's a lot more efficient though.

tomwhite

Looks great @eric-czech.

Regarding the allele string encoding, using dtype "S1" was @alimanfoo's suggestion in https://discourse.smadstatgen.org/t/xarray-conventions-for-variation-data/44/8. This uses 1 byte per allele, compared to 4 if we used "U1". We can mitigate the user confusion by documenting this well.

sgkit_plink/pysnptools.py

setup.cfg

tomwhite · 2020-07-06T14:39:35Z

sgkit_plink/pysnptools.py

+        variant_contig_names = list(variant_contig_names.compute())
+    # Otherwise index by unique name where index will correspond
+    # to lexsort on names, i.e. if contigs are 'chr1', 'chr2',
+    # ..., 'chr10' then 'chr10' comes before 'chr2'


Might it be better to index in the order that the names appear? Harder to implement, but likely to correspond to the order of other contig indexes (e.g. VCF).

I was thinking about that when I saw your comment in your initial changes. Personally, I thought it would make more sense to say that using the contig name index is a better practice (we'll need to support that well anyhow). Referring to them directly by index when not already encoded as ints seemed like it would invite errors. In other words, I thought this was reasonable:

ds.sel(variant=ds.contig == ds.attrs['contigs'].index('chr1'))

I don't feel strongly about it though if you think access by order-of-appearance index will be important.

I've seen problems with sorting contig names in the past (in BAM/VCF headers), so I thought ordering by appearance would be making slightly fewer assumptions. (There's a neat way of doing it here BTW: https://stackoverflow.com/a/12926989).

Doesn't your ds.sel work whatever the order of contigs - i.e. lex sorted or oder-of-appearance sorted?

Yea the ds.sel using .attrs['contigs'] should work regardless so I think we should push ourselves and users towards it.

One advantage of the order-of-appearance indexing though would be that contig indexes in an Xarray index wouldn't cause a re-ordering relative to the original dataset. I can't think of any access patterns where sequential access across contigs is important (e.g. range queries), so the order of them shouldn't matter, but reordering contigs should lead to Dask needlessly running big slices on the other non-indexed arrays.

I'll change it but if the contig index isn't going to be an internal detail, it makes me wonder if we should use 1-based indexing. I think most geneticists would find 0-based contig codes problematic.

It's using Python indexing, so 0-based makes sense I think?

Agreed on the other points.

Changed in 717d307

ds.sel(variant=ds.contig == ds.attrs['contigs'].index('chr1'))

Btw I'm cool with this but it will be a bit obscure for new users. Is there any way we could make some convenience or shortcut?

A few options that come to mind:

Using an accessor:

ds.genetics.sel_contig(slice('chr1'))

Accessor specific to functions on attributes:

ds.sel(variant=ds['variant/contig'] == ds.meta.contig_index('chr1'))

Use contig index directly:

ds.sel(variant=ds['variant/contig'] == 0) # for chr1

I think it makes more sense to focus on the situation with an index though, which is a bit more clear on it's own IMO:

contig_index = ds.attrs['contigs'].index('chr1') contig_range = slice(contig_index, contig_index) # or contig_range = slice(0, 0) for chr1 bp_pos_range = slice(0, 1000000) ds.set_index(variant=('variant/contig', 'variant/pos')).sel(variant=(contig_range, bp_pos_range))

Any thoughts/preferences on those?

Thanks @eric-czech for thinking through options. Moved this discussion over to https://github.com/pystatgen/sgkit/issues/25 to give it some more space.

sgkit_plink/pysnptools.py

sgkit_plink/tests/data/plink_sim_10s_100v_10pmiss.bim

sgkit_plink/tests/test_pysnptools.py

eric-czech · 2020-07-06T17:10:47Z

Regarding the allele string encoding, using dtype "S1" was @alimanfoo's suggestion in https://discourse.smadstatgen.org/t/xarray-conventions-for-variation-data/44/8. This uses 1 byte per allele, compared to 4 if we used "U1". We can mitigate the user confusion by documenting this well.

Ah got it. 👍

tomwhite · 2020-07-07T08:39:42Z

Thanks for making all those changes - they look good to me. I think it would be good to have order-by-appearance for the contig index.

eric-czech · 2020-07-07T12:37:58Z

Sure, I added and used a function for it here.

alimanfoo · 2020-07-08T08:40:49Z

Very nice!

rs-gh-sa and others added 11 commits July 1, 2020 16:35

Tooling copy

b99334d

Adding initial reader code

68236fc

Test file

5702c43

Reader updates

0834ab5

Cleaning up field type handling and bed reader slicing

3d49a31

Adding tests

cbada36

Move to branch

0573ce7

Adding reader code+tests

f4b3161

Adding sgkit dependency

066f5ec

Fix action for pytest

33c2d3f

Adding reader docstring

0f3ba2f

tomwhite suggested changes Jul 6, 2020

View reviewed changes

tomwhite reviewed Jul 6, 2020

View reviewed changes

sgkit_plink/tests/test_pysnptools.py Show resolved Hide resolved

eric-czech added 3 commits July 6, 2020 17:36

Cleanup and additional test for call values (sgkit-dev#1)

ee71708

Adding plink pedigree field docs and removing redundant sample id

ca71748

Update pedigree field docs

9971491

Package imports update, contig indexing update

717d307

tomwhite approved these changes Jul 7, 2020

View reviewed changes

eric-czech merged commit 78da13a into sgkit-dev:master Jul 7, 2020

alimanfoo mentioned this pull request Jul 8, 2020

Genome region query API sgkit-dev/sgkit#25

Open

eric-czech mentioned this pull request Jul 24, 2020

Requirements for UKB GWAS sgkit-dev/sgkit#67

Open

11 tasks

CarlKCarlK mentioned this pull request Aug 5, 2020

Pysnptools Bed API Usage #21

Closed

Pysnptools reader implementation #1

Pysnptools reader implementation #1

Uh oh!

Conversation

eric-czech commented Jul 6, 2020

Uh oh!

tomwhite left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

eric-czech Jul 8, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

eric-czech commented Jul 6, 2020

Uh oh!

tomwhite commented Jul 7, 2020

Uh oh!

eric-czech commented Jul 7, 2020

Uh oh!

alimanfoo commented Jul 8, 2020

Uh oh!

Uh oh!

eric-czech Jul 8, 2020 •

edited

Loading