-
Notifications
You must be signed in to change notification settings - Fork 4
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great @eric-czech.
Regarding the allele string encoding, using dtype "S1" was @alimanfoo's suggestion in https://discourse.smadstatgen.org/t/xarray-conventions-for-variation-data/44/8. This uses 1 byte per allele, compared to 4 if we used "U1". We can mitigate the user confusion by documenting this well.
sgkit_plink/pysnptools.py
Outdated
variant_contig_names = list(variant_contig_names.compute()) | ||
# Otherwise index by unique name where index will correspond | ||
# to lexsort on names, i.e. if contigs are 'chr1', 'chr2', | ||
# ..., 'chr10' then 'chr10' comes before 'chr2' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might it be better to index in the order that the names appear? Harder to implement, but likely to correspond to the order of other contig indexes (e.g. VCF).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about that when I saw your comment in your initial changes. Personally, I thought it would make more sense to say that using the contig name index is a better practice (we'll need to support that well anyhow). Referring to them directly by index when not already encoded as ints seemed like it would invite errors. In other words, I thought this was reasonable:
ds.sel(variant=ds.contig == ds.attrs['contigs'].index('chr1'))
I don't feel strongly about it though if you think access by order-of-appearance index will be important.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've seen problems with sorting contig names in the past (in BAM/VCF headers), so I thought ordering by appearance would be making slightly fewer assumptions. (There's a neat way of doing it here BTW: https://stackoverflow.com/a/12926989).
Doesn't your ds.sel
work whatever the order of contigs
- i.e. lex sorted or oder-of-appearance sorted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea the ds.sel
using .attrs['contigs']
should work regardless so I think we should push ourselves and users towards it.
One advantage of the order-of-appearance indexing though would be that contig indexes in an Xarray index wouldn't cause a re-ordering relative to the original dataset. I can't think of any access patterns where sequential access across contigs is important (e.g. range queries), so the order of them shouldn't matter, but reordering contigs should lead to Dask needlessly running big slices on the other non-indexed arrays.
I'll change it but if the contig index isn't going to be an internal detail, it makes me wonder if we should use 1-based indexing. I think most geneticists would find 0-based contig codes problematic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's using Python indexing, so 0-based makes sense I think?
Agreed on the other points.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed in 717d307
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ds.sel(variant=ds.contig == ds.attrs['contigs'].index('chr1'))
Btw I'm cool with this but it will be a bit obscure for new users. Is there any way we could make some convenience or shortcut?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few options that come to mind:
- Using an accessor:
ds.genetics.sel_contig(slice('chr1'))
- Accessor specific to functions on attributes:
ds.sel(variant=ds['variant/contig'] == ds.meta.contig_index('chr1'))
- Use contig index directly:
ds.sel(variant=ds['variant/contig'] == 0) # for chr1
I think it makes more sense to focus on the situation with an index though, which is a bit more clear on it's own IMO:
contig_index = ds.attrs['contigs'].index('chr1')
contig_range = slice(contig_index, contig_index)
# or contig_range = slice(0, 0) for chr1
bp_pos_range = slice(0, 1000000)
ds.set_index(variant=('variant/contig', 'variant/pos')).sel(variant=(contig_range, bp_pos_range))
Any thoughts/preferences on those?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @eric-czech for thinking through options. Moved this discussion over to https://github.com/pystatgen/sgkit/issues/25 to give it some more space.
Ah got it. 👍 |
Thanks for making all those changes - they look good to me. I think it would be good to have order-by-appearance for the contig index. |
Sure, I added and used a function for it here. |
Very nice! |
This adds a PLINK reader as a
dask.array.from_array
wrapper with apysnptools.Bed
delegate. Simulated data for testing was generated by Hail with a small degree of missingness injected (10%).One question I had for you in doing this @tomwhite was why you wanted to use zero-terminated bytes for allele strings (i.e. dtype "S"). I'm not familiar with the advantages of that but I did see that the numpy docs recommend not using it. I have no idea why they say that, but I'm curious what implications it has. If nothing else, I thought this behavior would be confusing/unexpected for most users:
I'm all for it if it's a lot more efficient though.