
PCA #227


Closed
wants to merge 15 commits into from

Conversation


@jerowe jerowe commented Sep 2, 2020

I couldn't push to pystatgen/sgkit so I'm reopening this PR here.

Continues - https://github.com/pystatgen/sgkit/pull/123

Linked Issue - https://github.com/pystatgen/sgkit/issues/95

@jerowe jerowe requested a review from eric-czech September 2, 2020 11:31
@jerowe jerowe self-assigned this Sep 2, 2020
@jerowe jerowe changed the base branch from feature/pca to master September 2, 2020 11:34
@jerowe jerowe marked this pull request as draft September 2, 2020 11:37
Author

jerowe commented Sep 2, 2020

Throwing this back in here to show that we still care.

dask/dask-ml#731

https://github.com/pystatgen/sgkit/blob/6dcf446583bc52de5e15bae65f4aa63fcc5bb9b7/sgkit/tests/test_decomposition.py#L239

I have some tests demonstrating that dask-ml PCA fails based on the shape of the data.

@eric-czech
Collaborator

> I have some tests demonstrating that dask-ml PCA fails based on the shape of the data.

There are two problems I see with dask-ml PCA so far:

  1. Sign determinacy support is broken when an array is wider than it is tall (utils.svd_flip breaks on rectangular matrix dask/dask-ml#732)
  2. Dask deterministic svd doesn't support column chunking (Support short-fat array chunking in svd dask/dask#6590)

Both of those are problems we should solve regardless of whether or not we subclass, vendor in, or omit code in sgkit. It's not looking like they'll be hard to fix upstream at the moment.
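For context on problem 1, the sign-determinacy fix is about making SVD output deterministic. A minimal NumPy sketch of what an `svd_flip`-style correction does (an illustration only, not dask-ml's actual implementation):

```python
import numpy as np

def svd_flip_sketch(u, vt):
    # Flip signs so the largest-magnitude entry of each left singular
    # vector is positive, making the decomposition deterministic.
    max_rows = np.argmax(np.abs(u), axis=0)
    signs = np.sign(u[max_rows, np.arange(u.shape[1])])
    return u * signs, vt * signs[:, np.newaxis]

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 10))  # wider than tall: the "short-fat" case the thread discusses
u, s, vt = np.linalg.svd(x, full_matrices=False)
u2, vt2 = svd_flip_sketch(u, vt)
# The sign flip cancels between u and vt, so the reconstruction is unchanged.
assert np.allclose(u2 @ np.diag(s) @ vt2, x)
```

The flip multiplies `u` on the right and `vt` on the left by the same diagonal of ±1, so the product (and hence the decomposition's validity) is preserved.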

To me, the simplest way to support PCA is to:

  1. Remove the dependency on dask-ml added in this PR
  2. Remove everything but the PattersonScaler (and have it subclass from scikit-learn instead)
  3. Add documentation and/or notebooks that show how to do PCA with these other tools
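As a rough illustration of item 2, a `PattersonScaler` subclassing scikit-learn directly might look like the sketch below. It assumes the Patterson et al. (2006) normalization, dividing centered genotype calls by sqrt(p(1 - p)) with p the estimated allele frequency; this is not sgkit's actual code.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class PattersonScaler(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: center genotype calls (0/1/2) per variant and
    scale by sqrt(p * (1 - p)), where p = mean / 2 estimates allele frequency."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        p = self.mean_ / 2.0
        self.scale_ = np.sqrt(p * (1.0 - p))
        return self

    def transform(self, X):
        return (np.asarray(X, dtype=float) - self.mean_) / self.scale_

X = np.array([[0, 1], [1, 2], [2, 1]], dtype=float)
Xs = PattersonScaler().fit_transform(X)
assert np.allclose(Xs.mean(axis=0), 0.0)  # columns are centered
```

Subclassing `BaseEstimator`/`TransformerMixin` gives `fit_transform` and pipeline compatibility for free, which is the point of leaning on scikit-learn rather than vendoring code.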

I can imagine that we might eventually find some reason to want to do PCA differently than how dask-ml and dask currently support it, but I can't see any reason why those things would only apply to genetics. Let me know if you disagree @jeromekelleher / @jerowe.

I guess I wouldn't be too surprised if we one day vendor in a bunch of external code like this, though I'm struggling to find a reason why it's necessary now.

Author

jerowe commented Sep 3, 2020

I'll go with whatever the vote is, but my personal vote is that we should support PCA. The GenotypePCA class is 74 lines of code, 9 of which are a call to super in the init function so we keep the same defaults as dask-ml's PCA. The rest is just ported docstrings so it's obvious to users where the differences are. I'd consider PCA to be core baseline functionality myself.

We can still subclass dask-ml PCA and just change the _fit to transpose the data the way scikit-allel does it. The problems in the dask-ml PCA lie where they run the svd_flip function.
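The subclass-and-transpose idea can be sketched as below, with scikit-learn's PCA standing in for dask-ml's and `fit_transform` overridden as a simplification of overriding `_fit` (hypothetical, not the PR's code):

```python
import numpy as np
from sklearn.decomposition import PCA

class GenotypePCA(PCA):
    """Hypothetical sketch: accept a (variants, samples) matrix, as
    scikit-allel does, and transpose to (samples, variants) before
    delegating to the parent estimator."""

    def fit_transform(self, X, y=None):
        return super().fit_transform(np.asarray(X, dtype=float).T)

X = np.random.default_rng(1).integers(0, 3, size=(100, 10))  # variants x samples
scores = GenotypePCA(n_components=2).fit_transform(X)
assert scores.shape == (10, 2)  # one row of scores per sample
```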

Collaborator

eric-czech commented Sep 3, 2020

> We can still subclass dask-ml PCA and just change the _fit to transpose the data the way scikit-allel does it. The problems in the dask-ml PCA lie where they run the svd_flip function.

Transposing the input and supporting sign determinacy aren't specific to our domain though. Fixing svd_flip (dask/dask-ml#732) won't take much beyond a little time to figure out where dask/dask#3576 (comment) stands. I put in dask/dask#6591 yesterday to try to address the deeper svd issue with column chunking (unrelated to svd_flip).

If we would like to wrap this all up for users now, then this interface would be consistent with the rest of the API:

```python
from sklearn.base import BaseEstimator as Estimator

def pca(ds: Dataset, n_components: int, est: Optional[Estimator] = None, ...) -> Tuple[Dataset, Estimator]:
    # Make sure allele count is present
    if est is None:
        est = ...  # Create and fit scaling / PCA pipeline estimator (sklearn or Dask-ML, depending on chunking)
    # Attach results to Dataset
    return ds, est
```

Then we could wrap the utilities rather than extending them in a way that involves copy+pasting so much code in order to work around domain-agnostic upstream limitations.

EDIT

I can also see value in instead starting with a function like `def pca(ds: Dataset, n_components: int, ...) -> Dataset` (with no notion of a stateful underlying model) that covers the large majority of PCA use cases, and then adding a guide on how to build pipelines for doing more complex things. Again though, I don't yet see any reason why that function would need to use any custom overrides of dask-ml preprocessing or PCA classes.
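That stateless variant might look roughly like the following, with a plain NumPy array standing in for the sgkit Dataset and scikit-learn's PCA for the estimator (a sketch under those assumptions, not a proposed implementation):

```python
import numpy as np
from sklearn.decomposition import PCA

def pca(calls: np.ndarray, n_components: int):
    """Hypothetical sketch of pca(ds, n_components): fit a PCA and
    return the per-sample projections alongside the fitted estimator."""
    est = PCA(n_components=n_components)
    scores = est.fit_transform(calls)
    return scores, est

rng = np.random.default_rng(42)
calls = rng.integers(0, 3, size=(20, 50)).astype(float)  # samples x variants
scores, est = pca(calls, n_components=2)
assert scores.shape == (20, 2)
```

A guide could then show how to swap in dask-ml's PCA for chunked arrays, or compose this with a scaler in a pipeline for the more complex cases.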

@jerowe jerowe removed their assignment Sep 3, 2020
@eric-czech eric-czech mentioned this pull request Sep 16, 2020
Contributor

hammer commented Sep 17, 2020

Closing this in favor of #262

@hammer hammer closed this Sep 17, 2020