Skip to content

PCA #123

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 9 commits into from
Closed

PCA #123

wants to merge 9 commits into from

Conversation

jerowe
Copy link

@jerowe jerowe commented Aug 19, 2020

https://github.com/pystatgen/sgkit/issues/95

Moving scikit-allel PCA functions from:

And adopting conventions from SKLearn so everyone can use sklearn pipelines.

@jerowe jerowe self-assigned this Aug 19, 2020
@jerowe jerowe linked an issue Aug 19, 2020 that may be closed by this pull request
@jerowe jerowe marked this pull request as draft August 19, 2020 12:23
@jerowe
Copy link
Author

jerowe commented Aug 19, 2020

I see that the linter is unhappy with me and I will go follow the contributing docs. https://pystatgen.github.io/sgkit/contributing.html

@jerowe jerowe requested review from eric-czech and tomwhite August 20, 2020 13:11
return random_state.randint(low, high)


class GenotypePCA(sklearn.decomposition.PCA): # type: ignore
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was the rationale for this to be able to use super._get_solver()? Does the dask-ml version override that logic or does it also use the scikit-learn function?



class GenotypePCA(sklearn.decomposition.PCA): # type: ignore
"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A description here would be good, possibly with some appeal to the relationship with scikit-allele given that there is a "differences from scikit-allel" section.

)

n_components = self.n_components
if solver in {"full"}:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this if statement not be collapsed with the if statement above?



class PattersonScaler(TransformerMixin, BaseEstimator): # type: ignore
"""New Patterson Scaler with SKLearn API
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better IMO if this was a description of what patterson scaling means and I don't think it makes sense to say that it's "new" in some way.

Also nit: can you use "scikit-learn" or "sklearn" when referring to it? As far as I know "SKLearn" isn't a common casing.

return transformed

def fit_transform(self, gn: ArrayLike, y: Optional[ArrayLike] = None) -> ArrayLike:
# TODO Raise an Error if this is not a dask array
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fwiw I've found a few functions in the Dask array API that will choke if not provided a Dask array explicitly but most would dispatch to the underlying array type successfully (e.g. da.sqrt). If you find otherwise, then a da.asarray would be better than an error.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment is mostly resolved by using dask-ml as the base class for the preprocessors and pca. They have type checking in their functions.

return self.transform(gn)


class CenterScaler(TransformerMixin, BaseEstimator): # type: ignore
Copy link
Collaborator

@eric-czech eric-czech Aug 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can drop this unless there's some reason the scikit-learn StandardScaler doesn't work using with_std=False. The Dask-ML StandardScaler also has that option as a fallback. I don't see the ploidy arg being used anywhere so did you see any other reason to port this over?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. It's only ported over for the sake of niceness and keeping names consistent. Should I remove?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, remove please.

self.copy: bool = copy
self.ploidy: int = ploidy

def _reset(self) -> None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scikit-learn has features for this sort of thing, namely clone. That would work for this since the stateful parameters are suffixed by _ so you can drop this function.

"""

# Reset internal state before fitting
self._reset()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is necessary if we go the clone-via-sklearn approach.

pca.fit(genotypes)


def test_patterson_scaler_genotype_pca_sklearn_pipeline():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Good idea.

@eric-czech
Copy link
Collaborator

Hey @jerowe thanks for your work on this. Sorry if I missed this somewhere along the way, but what was the rationale behind not using Dask-ML PCA?

@jerowe
Copy link
Author

jerowe commented Aug 26, 2020

@eric-czech thanks for the comments!

Hey @jerowe thanks for your work on this. Sorry if I missed this somewhere along the way, but what was the rationale behind not using Dask-ML PCA?

The dask pca produces slightly different results. Once you plot them you can see that the relationships are about the same and things cluster about the same, but still they are slightly different.

Edit Ok I'm taking a closer look at the dask-ml code and I'll write some tests to see if I can get any parameters to make them the same.

Thanks again!

@jerowe jerowe marked this pull request as ready for review August 27, 2020 09:29
@jerowe
Copy link
Author

jerowe commented Aug 27, 2020

@eric-czech thanks again for the helpful comments.

I update the PR to use dask-ml as a base class for the preprocessors and PCA. This removes the need for more type checking and is more in line with what's actually happening in the code.

I also added in testing against scikit-alle and added proper doc strings.

@jerowe
Copy link
Author

jerowe commented Aug 27, 2020

@hammer I don't have an auto-merge tag but github is still telling me I can merge. Is that the way it's supposed to be or did something go sideways. ;-)

@jerowe jerowe changed the title WIP PCA PCA Aug 27, 2020
@eric-czech
Copy link
Collaborator

I update the PR to use dask-ml as a base class for the preprocessors and PCA. This removes the need for more type checking and is more in line with what's actually happening in the code.

@jerowe what is the difference between the current GenotypePCA and using dask-ml PCA directly?

@jerowe
Copy link
Author

jerowe commented Aug 27, 2020

@jerowe what is the difference between the current GenotypePCA and using dask-ml PCA directly?

@eric-czech they give slightly different results. Once you plot them they look very similar, but I couldn't get them similar enough to get the tests against scikit-allel to pass.

@jerowe
Copy link
Author

jerowe commented Aug 27, 2020

For reference here's the scikit-allel PCA and here's the dask-ml PCA.

@eric-czech
Copy link
Collaborator

@eric-czech they give slightly different results. Once you plot them they look very similar, but I couldn't get them similar enough to get the tests against scikit-allel to pass.

As far as I can tell, the GenotypePCA code seems to be more or less a copy+paste of the dask-ml code with some minor differences that don't seem to be related to properties of genetic data we have to address. Let me know if I'm missing something there.

I think we should either:

  1. Remove GenotypePCA class unless it adds important features on top of dask-ml PCA. I know @alimanfoo has mentioned some nuances with this, though all of the ones I can recall are related to preprocessing and not the PCA itself (e.g. a variant with all het calls results in nans in results due to no variance in the counts).
  2. Keep GenotypePCA but make whatever functionality it adds clear in the docs.

@hammer
Copy link
Contributor

hammer commented Aug 27, 2020

@hammer I don't have an auto-merge tag but github is still telling me I can merge. Is that the way it's supposed to be or did something go sideways. ;-)

Thanks @jerowe I've updated the repo permissions to only allow committers to have push rights. Let me know if that impacts you negatively...

@jerowe
Copy link
Author

jerowe commented Aug 27, 2020

@hammer no that's fine. I prefer having a review anyways. I was just wondering.

@jerowe
Copy link
Author

jerowe commented Aug 27, 2020

@eric-czech @alimanfoo I work on sgkit Wednesdays and Thursdays, and it's the end of the day for me today. So if you have anything to add here and I don't respond right away I'm not ignoring you. ;-)

@eric-czech
Copy link
Collaborator

eric-czech commented Aug 27, 2020

Thanks @jerowe have a good weekend!

I ran a quick comparison between the scikit-allel, scikit-learn, and dask-ml PCA implementations to get a better feel for the differences btw: https://gist.github.com/eric-czech/0fc7b73146913fe232b6e72adee80ff6.

I think they are all equivalent within any meaningful margin of error, though I did hit a bug that I had to fix in the source code related to using dask-ml PCA with short-fat rather than tall-skinny arrays. Other than that, it seems reasonable to assume that based on dataset size users could choose freely between the scikit-learn, dask-ml tsqr, and dask-ml randomized implementations.

Dask-ml and scikit-learn also force sign determinacy so that would be a nice benefit. The bug I found was related to that, so I'll file it and see if there is any fundamental limitation with that code for short-fat arrays.

EDIT

Actually, I take that back. I'm not sure if the dask-ml PCA will work for short-fat arrays despite the fact that da.linalg.svd does. I'll ask: dask/dask-ml#731. If that's true then the in-memory use case and the fully chunked use case could be supported well, but the non-randomized, chunked-in-only-one-dimension use case would be problematic.

@jerowe
Copy link
Author

jerowe commented Sep 2, 2020

@eric-czech I see what you mean now with the dask-ml being the same. I think I was testing against skinny arrays and it was failing, so I just moved on. (It does fail on skinny arrays)

Now the question is what do we want to do with the PCA. I see 3 options.

  1. Leave the GenotypePCA class out entirely and just say use DaskML
  2. Put in a GenotypePCA class that has some doc strings and calls super
    def __init__(
        self,
        n_components: int = 10,
        copy: bool = True,
        whiten: bool = False,
        svd_solver: str = "full",
        tol: float = 0.0,
        iterated_power: int = 0,
        random_state: Optional[int] = None,
    ):
        super().__init__(
            n_components=n_components,
            copy=copy,
            whiten=whiten,
            svd_solver=svd_solver,
            tol=tol,
            iterated_power=iterated_power,
            random_state=random_state,
        )

Which is then called as :

pca = GenotypePCA(n_components=10)
x_R = pca.fit_transform(gn.T)
  1. Include a GenotypePCA with super that also transposes the genotypes
Above plus

def _fit(self, gn: ArrayLike) -> Tuple[ArrayLike, ArrayLike, ArrayLike]:
    super().fit(gn.T)

def transform(self, gn: ArrayLike) -> ArrayLike:
    super().transform(gn.T)

Which is then called as

pca = GenotypePCA(n_components=10)
x_R = pca.fit_transform(gn)

Thoughts?

@jerowe
Copy link
Author

jerowe commented Sep 2, 2020

Alright I guess I didn't push that code.

ERROR: Permission to pystatgen/sgkit.git denied to jerowe.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

Should I fork and then open a PR from that fork?

Edit - Ok that worked so I'll just work from a forked sgkit, which I probably should have been doing anyways. ;-)

@jerowe jerowe mentioned this pull request Sep 2, 2020
Closed
@jerowe jerowe closed this Sep 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PCA User Story
4 participants