PCA #123

jerowe · 2020-08-19T12:20:50Z

https://github.com/pystatgen/sgkit/issues/95

Moving scikit-allel PCA functions from:

And adopting conventions from scikit-learn so everyone can use sklearn pipelines.

jerowe · 2020-08-19T12:40:14Z

I see that the linter is unhappy with me and I will go follow the contributing docs. https://pystatgen.github.io/sgkit/contributing.html

eric-czech · 2020-08-21T14:49:52Z

sgkit/stats/decomposition.py

+    return random_state.randint(low, high)
+
+
+class GenotypePCA(sklearn.decomposition.PCA):  # type: ignore


Was the rationale for this to be able to use super._get_solver()? Does the dask-ml version override that logic or does it also use the scikit-learn function?

eric-czech · 2020-08-21T14:51:13Z

sgkit/stats/decomposition.py

+
+
+class GenotypePCA(sklearn.decomposition.PCA):  # type: ignore
+    """


A description here would be good, possibly with some appeal to the relationship with scikit-allel given that there is a "differences from scikit-allel" section.

eric-czech · 2020-08-21T15:01:22Z

sgkit/stats/decomposition.py

+            )
+
+        n_components = self.n_components
+        if solver in {"full"}:


Can this if statement not be collapsed with the if statement above?

eric-czech · 2020-08-21T15:04:14Z

sgkit/stats/preprocessing.py

+
+
+class PattersonScaler(TransformerMixin, BaseEstimator):  # type: ignore
+    """New Patterson Scaler with SKLearn API


It would be better IMO if this was a description of what Patterson scaling means, and I don't think it makes sense to say that it's "new" in some way.

Also nit: can you use "scikit-learn" or "sklearn" when referring to it? As far as I know "SKLearn" isn't a common casing.
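For reference, a minimal NumPy sketch of what Patterson scaling does. It assumes the scikit-allel convention of a (variants, samples) matrix of alternate allele counts and the usual formula: center each variant on its mean count, then divide by sqrt(p * (1 - p)), where p is the mean count divided by the ploidy. The function name `patterson_scale` is hypothetical and is not the API proposed in this PR.

```python
import numpy as np


def patterson_scale(gn, ploidy=2):
    """Center and scale allele counts per variant (hypothetical helper)."""
    gn = np.asarray(gn, dtype=float)
    # Mean alternate allele count for each variant (row)
    mean = gn.mean(axis=1, keepdims=True)
    # Estimated allele frequency for each variant
    p = mean / ploidy
    # Scaling factor derived from the estimated allele frequency
    std = np.sqrt(p * (1 - p))
    return (gn - mean) / std


# rows = variants, columns = samples
gn = np.array([[0, 1, 2, 1],
               [0, 0, 1, 1],
               [2, 2, 1, 1]])
scaled = patterson_scale(gn)
```

Note that a monomorphic variant (p equal to 0 or 1) gives a zero scaling factor in this sketch, so such variants would need to be filtered out beforehand.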

eric-czech · 2020-08-21T15:09:27Z

sgkit/stats/preprocessing.py

+        return transformed
+
+    def fit_transform(self, gn: ArrayLike, y: Optional[ArrayLike] = None) -> ArrayLike:
+        # TODO Raise an Error if this is not a dask array


Fwiw I've found a few functions in the Dask array API that will choke if not provided a Dask array explicitly but most would dispatch to the underlying array type successfully (e.g. da.sqrt). If you find otherwise, then a da.asarray would be better than an error.

I think this comment is mostly resolved by using dask-ml as the base class for the preprocessors and pca. They have type checking in their functions.

eric-czech · 2020-08-21T15:12:45Z

sgkit/stats/preprocessing.py

+        return self.transform(gn)
+
+
+class CenterScaler(TransformerMixin, BaseEstimator):  # type: ignore


I think we can drop this unless there's some reason the scikit-learn StandardScaler doesn't work using with_std=False. The Dask-ML StandardScaler also has that option as a fallback. I don't see the ploidy arg being used anywhere so did you see any other reason to port this over?

You're right. It's only ported over for the sake of niceness and keeping names consistent. Should I remove?

Yep, remove please.
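A small sketch of the equivalence discussed above: scikit-learn's StandardScaler with with_std=False only subtracts the per-feature mean. The data here is random and purely illustrative. scikit-learn expects (samples, features), so a (variants, samples) genotype matrix would presumably need transposing first.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# rows = samples, columns = features (illustrative random allele counts)
X = rng.integers(0, 3, size=(6, 4)).astype(float)

# Centering only: no division by the standard deviation
centered = StandardScaler(with_std=False).fit_transform(X)

# The same result computed by hand
manual = X - X.mean(axis=0)
```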

eric-czech · 2020-08-21T15:25:24Z

sgkit/stats/preprocessing.py

+        self.copy: bool = copy
+        self.ploidy: int = ploidy
+
+    def _reset(self) -> None:


Scikit-learn has features for this sort of thing, namely clone. That would work for this since the stateful parameters are suffixed by _ so you can drop this function.
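A minimal sketch of the clone approach, using a toy transformer rather than the classes in this PR (`MeanCenterer` is a hypothetical name). sklearn.base.clone builds a new estimator from the constructor parameters only, so fitted attributes ending in _ are not carried over.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone


class MeanCenterer(TransformerMixin, BaseEstimator):
    """Toy transformer used only to illustrate clone (hypothetical)."""

    def __init__(self, copy=True):
        self.copy = copy

    def fit(self, X, y=None):
        # Fitted state is stored with a trailing underscore
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        return np.asarray(X, dtype=float) - self.mean_


fitted = MeanCenterer(copy=False).fit(np.array([[1.0, 2.0], [3.0, 4.0]]))

# clone keeps the constructor parameters and drops the fitted state
fresh = clone(fitted)
```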

eric-czech · 2020-08-21T15:26:23Z

sgkit/stats/preprocessing.py

+        """
+
+        # Reset internal state before fitting
+        self._reset()


I don't think this is necessary if we go the clone-via-sklearn approach.

eric-czech · 2020-08-21T15:28:10Z

sgkit/tests/test_decomposition.py

+        pca.fit(genotypes)
+
+
+def test_patterson_scaler_genotype_pca_sklearn_pipeline():


Nice! Good idea.

eric-czech · 2020-08-21T15:46:24Z

Hey @jerowe thanks for your work on this. Sorry if I missed this somewhere along the way, but what was the rationale behind not using Dask-ML PCA?

jerowe · 2020-08-26T07:54:02Z

@eric-czech thanks for the comments!

> Hey @jerowe thanks for your work on this. Sorry if I missed this somewhere along the way, but what was the rationale behind not using Dask-ML PCA?

The dask-ml PCA produces slightly different results. Once you plot them you can see that the relationships and the clustering are about the same, but the values still differ slightly.

Edit: OK, I'm taking a closer look at the dask-ml code and I'll write some tests to see whether any parameters make the results match.

Thanks again!

jerowe · 2020-08-27T09:31:41Z

@eric-czech thanks again for the helpful comments.

I updated the PR to use dask-ml as a base class for the preprocessors and PCA. This removes the need for more type checking and is more in line with what's actually happening in the code.

I also added tests against scikit-allel and proper docstrings.

jerowe · 2020-08-27T09:32:49Z

@hammer I don't have an auto-merge tag but GitHub is still telling me I can merge. Is that the way it's supposed to be, or did something go sideways? ;-)

eric-czech · 2020-08-27T10:12:52Z

> I updated the PR to use dask-ml as a base class for the preprocessors and PCA. This removes the need for more type checking and is more in line with what's actually happening in the code.

@jerowe what is the difference between the current GenotypePCA and using dask-ml PCA directly?

jerowe · 2020-08-27T10:27:59Z

> @jerowe what is the difference between the current GenotypePCA and using dask-ml PCA directly?

@eric-czech they give slightly different results. Once you plot them they look very similar, but I couldn't get them similar enough to get the tests against scikit-allel to pass.

jerowe · 2020-08-27T10:34:49Z

For reference here's the scikit-allel PCA and here's the dask-ml PCA.

eric-czech · 2020-08-27T10:58:51Z

> @eric-czech they give slightly different results. Once you plot them they look very similar, but I couldn't get them similar enough to get the tests against scikit-allel to pass.

As far as I can tell, the GenotypePCA code seems to be more or less a copy+paste of the dask-ml code with some minor differences that don't seem to be related to properties of genetic data we have to address. Let me know if I'm missing something there.

I think we should either:

1. Remove the GenotypePCA class unless it adds important features on top of dask-ml PCA. I know @alimanfoo has mentioned some nuances with this, though all of the ones I can recall are related to preprocessing and not the PCA itself (e.g. a variant with all het calls results in nans in results due to no variance in the counts).
2. Keep GenotypePCA but make whatever functionality it adds clear in the docs.
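The preprocessing nuance mentioned in option 1 can be reproduced in a few lines of NumPy. This sketch assumes plain unit-variance scaling per variant, which may not be exactly what any of the libraries here do: a variant where every call is het has zero variance, so centering followed by division gives 0/0.

```python
import numpy as np

# rows = variants, columns = samples; the second variant is het in every sample
gn = np.array([[0.0, 1.0, 2.0, 1.0],
               [1.0, 1.0, 1.0, 1.0]])

mean = gn.mean(axis=1, keepdims=True)
std = gn.std(axis=1, keepdims=True)

# Silence the expected 0/0 warning for the zero-variance variant
with np.errstate(invalid="ignore", divide="ignore"):
    scaled = (gn - mean) / std

# One possible remedy: drop zero-variance variants before scaling or PCA
keep = std[:, 0] > 0
filtered = scaled[keep]
```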

hammer · 2020-08-27T11:56:14Z

> @hammer I don't have an auto-merge tag but github is still telling me I can merge. Is that the way it's supposed to be or did something go sideways. ;-)

Thanks @jerowe I've updated the repo permissions to only allow committers to have push rights. Let me know if that impacts you negatively...

jerowe · 2020-08-27T14:53:23Z

@hammer no that's fine. I prefer having a review anyways. I was just wondering.

jerowe · 2020-08-27T16:30:36Z

@eric-czech @alimanfoo I work on sgkit Wednesdays and Thursdays, and it's the end of the day for me today. So if you have anything to add here and I don't respond right away I'm not ignoring you. ;-)

eric-czech · 2020-08-27T18:13:09Z

Thanks @jerowe have a good weekend!

I ran a quick comparison between the scikit-allel, scikit-learn, and dask-ml PCA implementations to get a better feel for the differences btw: https://gist.github.com/eric-czech/0fc7b73146913fe232b6e72adee80ff6.

I think they are all equivalent within any meaningful margin of error, though I did hit a bug that I had to fix in the source code related to using dask-ml PCA with short-fat rather than tall-skinny arrays. Other than that, it seems reasonable to assume that based on dataset size users could choose freely between the scikit-learn, dask-ml tsqr, and dask-ml randomized implementations.

Dask-ml and scikit-learn also force sign determinacy so that would be a nice benefit. The bug I found was related to that, so I'll file it and see if there is any fundamental limitation with that code for short-fat arrays.

EDIT

Actually, I take that back. I'm not sure if the dask-ml PCA will work for short-fat arrays despite the fact that da.linalg.svd does. I'll ask: dask/dask-ml#731. If that's true then the in-memory use case and the fully chunked use case could be supported well, but the non-randomized, chunked-in-only-one-dimension use case would be problematic.
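On the sign determinacy point above: singular vectors are only defined up to sign, so two correct SVDs can return mirrored components. The sketch below shows one convention for fixing the sign (make the largest-magnitude entry of each left singular vector positive), which is the idea behind sklearn.utils.extmath.svd_flip as I understand it. It is plain NumPy on an in-memory short-fat array with a hypothetical helper name, and says nothing about how dask-ml behaves on such arrays.

```python
import numpy as np


def flip_signs(u, vt):
    """Make the largest-magnitude entry of each column of u positive (hypothetical helper)."""
    # Row index of the largest absolute value in each column of u
    max_abs_rows = np.argmax(np.abs(u), axis=0)
    signs = np.sign(u[max_abs_rows, np.arange(u.shape[1])])
    # Flip each column of u and the matching row of vt together
    return u * signs, vt * signs[:, np.newaxis]


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 20))  # short-fat: fewer rows than columns
x = x - x.mean(axis=0)        # center the columns, as PCA does

u, s, vt = np.linalg.svd(x, full_matrices=False)

u1, vt1 = flip_signs(u, vt)
u2, vt2 = flip_signs(-u, -vt)  # the same decomposition with every sign reversed
```

Both calls return the same factors, and the product of the flipped factors still reconstructs x.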

jerowe · 2020-09-02T08:46:20Z

@eric-czech I see what you mean now about dask-ml being the same. I think I was testing against skinny arrays and it was failing, so I just moved on. (It does fail on skinny arrays.)

Now the question is what do we want to do with the PCA. I see 3 options.

1. Leave the GenotypePCA class out entirely and just say to use dask-ml.
2. Put in a GenotypePCA class that has some docstrings and calls super:

    def __init__(
        self,
        n_components: int = 10,
        copy: bool = True,
        whiten: bool = False,
        svd_solver: str = "full",
        tol: float = 0.0,
        iterated_power: int = 0,
        random_state: Optional[int] = None,
    ):
        super().__init__(
            n_components=n_components,
            copy=copy,
            whiten=whiten,
            svd_solver=svd_solver,
            tol=tol,
            iterated_power=iterated_power,
            random_state=random_state,
        )

Which is then called as:

    pca = GenotypePCA(n_components=10)
    x_R = pca.fit_transform(gn.T)

3. Include a GenotypePCA with super that also transposes the genotypes. This is option 2 plus:

    def _fit(self, gn: ArrayLike) -> Tuple[ArrayLike, ArrayLike, ArrayLike]:
        # Transpose so that samples become rows, then delegate to the base class
        return super()._fit(gn.T)

    def transform(self, gn: ArrayLike) -> ArrayLike:
        # Apply the same transposition before projecting
        return super().transform(gn.T)

Which is then called as:

    pca = GenotypePCA(n_components=10)
    x_R = pca.fit_transform(gn)

Thoughts?
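To make option 3 concrete, here is a runnable sketch that uses scikit-learn's PCA as a stand-in for the dask-ml base class (an assumption made so the example runs in memory). `TransposingPCA` is a hypothetical name. The public methods are overridden here instead of `_fit`, so the sketch does not depend on private internals.

```python
import numpy as np
from sklearn.decomposition import PCA


class TransposingPCA(PCA):
    """Accepts (variants, samples) input and transposes it for the base class (hypothetical)."""

    def fit(self, gn, y=None):
        return super().fit(np.asarray(gn).T, y)

    def transform(self, gn):
        return super().transform(np.asarray(gn).T)

    def fit_transform(self, gn, y=None):
        return super().fit_transform(np.asarray(gn).T, y)


rng = np.random.default_rng(0)
# rows = variants, columns = samples (illustrative random allele counts)
gn = rng.integers(0, 3, size=(50, 8)).astype(float)

pca = TransposingPCA(n_components=2, svd_solver="full")
coords = pca.fit_transform(gn)

# Option 2 style call on the plain base class, for comparison
expected = PCA(n_components=2, svd_solver="full").fit_transform(gn.T)
```

The trade-off is the usual one: option 3 saves callers a transpose, but the class no longer follows the (samples, features) convention that other steps in a scikit-learn pipeline expect.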

jerowe · 2020-09-02T11:25:43Z

Alright I guess I didn't push that code.

ERROR: Permission to pystatgen/sgkit.git denied to jerowe.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

Should I fork and then open a PR from that fork?

Edit - Ok that worked so I'll just work from a forked sgkit, which I probably should have been doing anyways. ;-)

first pass PCA

34c34e2

jerowe self-assigned this Aug 19, 2020

jerowe linked an issue Aug 19, 2020 that may be closed by this pull request

PCA User Story #95

Closed

jerowe marked this pull request as draft August 19, 2020 12:23

adding tests

4a10af8

jerowe requested review from eric-czech and tomwhite August 20, 2020 13:11