
PCA #227


Closed
wants to merge 15 commits into from

Conversation


@jerowe jerowe commented Sep 2, 2020

I couldn't push to pystatgen/sgkit so I'm reopening this PR here.

Continues - https://github.com/pystatgen/sgkit/pull/123

Linked Issue - https://github.com/pystatgen/sgkit/issues/95

@jerowe jerowe requested a review from eric-czech September 2, 2020 11:31
@jerowe jerowe self-assigned this Sep 2, 2020
@jerowe jerowe changed the base branch from feature/pca to master September 2, 2020 11:34
@jerowe jerowe marked this pull request as draft September 2, 2020 11:37
Author

jerowe commented Sep 2, 2020

Throwing this back in here to show that we still care.

dask/dask-ml#731

https://github.com/pystatgen/sgkit/blob/6dcf446583bc52de5e15bae65f4aa63fcc5bb9b7/sgkit/tests/test_decomposition.py#L239

I have some tests demonstrating that dask-ml PCA fails based on the shape of the data.

@eric-czech
Collaborator

> I have some tests demonstrating that dask-ml PCA fails based on the shape of the data.

There are two problems I see with dask-ml PCA so far:

  1. Sign determinacy support is broken when an array is wider than it is tall (utils.svd_flip breaks on rectangular matrix dask/dask-ml#732)
  2. Dask deterministic svd doesn't support column chunking (Support short-fat array chunking in svd dask/dask#6590)

Both of those are problems we should solve regardless of whether or not we subclass, vendor in, or omit code in sgkit. It's not looking like they'll be hard to fix upstream at the moment.
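For context on problem 1, the sign-determinacy fix is about making SVD output deterministic. A minimal NumPy sketch of what an `svd_flip`-style correction does (an illustration only, not dask-ml's actual implementation):

```python
import numpy as np

def svd_flip_sketch(u, vt):
    # Flip signs so the largest-magnitude entry of each left singular
    # vector is positive, making the decomposition deterministic.
    max_rows = np.argmax(np.abs(u), axis=0)
    signs = np.sign(u[max_rows, np.arange(u.shape[1])])
    return u * signs, vt * signs[:, np.newaxis]

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 10))  # wider than tall: the "short-fat" case the thread discusses
u, s, vt = np.linalg.svd(x, full_matrices=False)
u2, vt2 = svd_flip_sketch(u, vt)
# The sign flip cancels between u and vt, so the reconstruction is unchanged.
assert np.allclose(u2 @ np.diag(s) @ vt2, x)
```

The flip multiplies `u` on the right and `vt` on the left by the same diagonal of ±1, so the product (and hence the decomposition's validity) is preserved.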

To me, the simplest way to support PCA is to:

  1. Remove the dependency on dask-ml added in this PR
  2. Remove everything but the PattersonScaler (and have it subclass from scikit-learn instead)
  3. Add documentation and/or notebooks that show how to do PCA with these other tools
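As a rough illustration of item 2, a `PattersonScaler` subclassing scikit-learn directly might look like the sketch below. It assumes the Patterson et al. (2006) normalization, dividing centered genotype calls by sqrt(p(1 - p)) with p the estimated allele frequency; this is not sgkit's actual code.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class PattersonScaler(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: center genotype calls (0/1/2) per variant and
    scale by sqrt(p * (1 - p)), where p = mean / 2 estimates allele frequency."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        p = self.mean_ / 2.0
        self.scale_ = np.sqrt(p * (1.0 - p))
        return self

    def transform(self, X):
        return (np.asarray(X, dtype=float) - self.mean_) / self.scale_

X = np.array([[0, 1], [1, 2], [2, 1]], dtype=float)
Xs = PattersonScaler().fit_transform(X)
assert np.allclose(Xs.mean(axis=0), 0.0)  # columns are centered
```

Subclassing `BaseEstimator`/`TransformerMixin` gives `fit_transform` and pipeline compatibility for free, which is the point of leaning on scikit-learn rather than vendoring code.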

I can imagine that we might eventually find some reason to want to do PCA differently than how dask-ml and dask currently support it, but I can't see any reason why those things would only apply to genetics. Let me know if you disagree @jeromekelleher / @jerowe.

I guess I wouldn't be too surprised if we one day vendor in a bunch of external code like this, though I'm struggling to find a reason why it's necessary now.

Author

jerowe commented Sep 3, 2020

I'll go with whatever the vote is, but my personal vote is that we should support PCA. The GenotypePCA class is 74 lines of code, 9 of which are a call to super in the init function so we keep the same defaults as dask-ml's PCA. The rest is just ported docstrings so it's obvious to users where the differences are. I'd consider PCA to be core baseline functionality myself.

We can still subclass dask-ml PCA and just change the _fit to transpose the data the way scikit-allel does it. The problems in the dask-ml PCA lie where they run the svd_flip function.
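The subclass-and-transpose idea can be sketched as below, with scikit-learn's PCA standing in for dask-ml's and `fit_transform` overridden as a simplification of overriding `_fit` (hypothetical, not the PR's code):

```python
import numpy as np
from sklearn.decomposition import PCA

class GenotypePCA(PCA):
    """Hypothetical sketch: accept a (variants, samples) matrix, as
    scikit-allel does, and transpose to (samples, variants) before
    delegating to the parent estimator."""

    def fit_transform(self, X, y=None):
        return super().fit_transform(np.asarray(X, dtype=float).T)

X = np.random.default_rng(1).integers(0, 3, size=(100, 10))  # variants x samples
scores = GenotypePCA(n_components=2).fit_transform(X)
assert scores.shape == (10, 2)  # one row of scores per sample
```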

Collaborator

eric-czech commented Sep 3, 2020

> We can still subclass dask-ml PCA and just change the _fit to transpose the data the way scikit-allel does it. The problems in the dask-ml PCA lie where they run the svd_flip function.

Transposing the input and supporting sign determinacy aren't specific to our domain though. Fixing svd_flip (dask/dask-ml#732) won't take much beyond a little time to figure out where dask/dask#3576 (comment) stands. I put in dask/dask#6591 yesterday to try to address the deeper svd issue with column chunking (unrelated to svd_flip).

If we would like to wrap this all up for users now, then this interface would be consistent with the rest of the API:

```python
from sklearn.base import BaseEstimator as Estimator

def pca(ds: Dataset, n_components: int, est: Optional[Estimator] = None, ...) -> Tuple[Dataset, Estimator]:
    # Make sure allele count is present
    if est is None:
        est = ...  # Create and fit scaling / PCA pipeline estimator (sklearn or Dask-ML, depending on chunking)
    # Attach results to Dataset
    return ds, est
```

Then we could wrap the utilities rather than extending them in a way that involves copy+pasting so much code in order to work around domain-agnostic upstream limitations.

EDIT

I can also see value in instead starting with a function like `def pca(ds: Dataset, n_components: int, ...) -> Dataset` (with no notion of a stateful underlying model) that covers the large majority of PCA use cases, and then adding a guide on how to build pipelines for doing more complex things. Again though, I don't yet see any reason why that function would need to use any custom overrides of dask-ml preprocessing or PCA classes.
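That stateless variant might look roughly like the following, with a plain NumPy array standing in for the sgkit Dataset and scikit-learn's PCA for the estimator (a sketch under those assumptions, not a proposed implementation):

```python
import numpy as np
from sklearn.decomposition import PCA

def pca(calls: np.ndarray, n_components: int):
    """Hypothetical sketch of pca(ds, n_components): fit a PCA and
    return the per-sample projections alongside the fitted estimator."""
    est = PCA(n_components=n_components)
    scores = est.fit_transform(calls)
    return scores, est

rng = np.random.default_rng(42)
calls = rng.integers(0, 3, size=(20, 50)).astype(float)  # samples x variants
scores, est = pca(calls, n_components=2)
assert scores.shape == (20, 2)
```

A guide could then show how to swap in dask-ml's PCA for chunked arrays, or compose this with a scaler in a pipeline for the more complex cases.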

@jerowe jerowe removed their assignment Sep 3, 2020
@eric-czech eric-czech mentioned this pull request Sep 16, 2020
Contributor

hammer commented Sep 17, 2020

Closing this in favor of #262

@hammer hammer closed this Sep 17, 2020