PCA #123
Conversation
I see that the linter is unhappy with me and I will go follow the contributing docs: https://pystatgen.github.io/sgkit/contributing.html
sgkit/stats/decomposition.py (outdated)

    return random_state.randint(low, high)

    class GenotypePCA(sklearn.decomposition.PCA):  # type: ignore
Was the rationale for this to be able to use super()._get_solver()? Does the dask-ml version override that logic, or does it also use the scikit-learn function?
    class GenotypePCA(sklearn.decomposition.PCA):  # type: ignore
        """
A description here would be good, possibly with some appeal to the relationship with scikit-allel, given that there is a "differences from scikit-allel" section.
    )

    n_components = self.n_components
    if solver in {"full"}:
Can this if statement not be collapsed with the if statement above?
sgkit/stats/preprocessing.py (outdated)

    class PattersonScaler(TransformerMixin, BaseEstimator):  # type: ignore
        """New Patterson Scaler with SKLearn API
It would be better IMO if this were a description of what Patterson scaling means, and I don't think it makes sense to say that it's "new" in some way.
Also, nit: can you use "scikit-learn" or "sklearn" when referring to it? As far as I know, "SKLearn" isn't a common casing.
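For context, a minimal sketch of what Patterson scaling means (per Patterson et al. 2006): center each variant by its mean genotype, then divide by the expected binomial standard deviation under Hardy-Weinberg equilibrium. The function and variable names below are illustrative, not the PR's actual API:

    import dask.array as da

    def patterson_scale(gn, ploidy=2):
        # gn: (variants x samples) array of alternate allele counts
        mean = gn.mean(axis=1, keepdims=True)  # per-variant mean genotype
        p = mean / ploidy                      # estimated allele frequency
        scale = da.sqrt(p * (1 - p))           # expected std dev per variant
        return (gn - mean) / scale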
sgkit/stats/preprocessing.py (outdated)

    return transformed

    def fit_transform(self, gn: ArrayLike, y: Optional[ArrayLike] = None) -> ArrayLike:
        # TODO Raise an Error if this is not a dask array
Fwiw I've found a few functions in the Dask array API that will choke if not provided a Dask array explicitly, but most would dispatch to the underlying array type successfully (e.g. da.sqrt). If you find otherwise, then a da.asarray would be better than an error.
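A quick illustration of the dispatch behavior being described (da.sqrt and da.asarray are real Dask functions; the toy data is just an example):

    import numpy as np
    import dask.array as da

    x = np.array([1.0, 4.0, 9.0])

    # Dask's wrapped ufuncs dispatch on the input type: given a NumPy
    # array, da.sqrt computes eagerly and returns a NumPy array.
    da.sqrt(x)               # -> array([1., 2., 3.])

    # For code paths that do need a Dask array, coercion is cheaper
    # than raising; asarray is a no-op for inputs that already are one.
    dx = da.asarray(x)
    da.sqrt(dx).compute()    # -> array([1., 2., 3.])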
I think this comment is mostly resolved by using dask-ml as the base class for the preprocessors and PCA. They have type checking in their functions.
sgkit/stats/preprocessing.py (outdated)

    return self.transform(gn)

    class CenterScaler(TransformerMixin, BaseEstimator):  # type: ignore
I think we can drop this unless there's some reason the scikit-learn StandardScaler doesn't work using with_std=False. The Dask-ML StandardScaler also has that option as a fallback. I don't see the ploidy arg being used anywhere, so did you see any other reason to port this over?
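For concreteness, this is the off-the-shelf behavior being suggested as a replacement; dask_ml.preprocessing.StandardScaler accepts the same flags:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    gn = np.array([[0.0, 1.0, 2.0],
                   [2.0, 2.0, 0.0]])

    # with_std=False centers each column by its mean but skips the
    # division by the standard deviation, i.e. a pure "CenterScaler".
    scaler = StandardScaler(with_mean=True, with_std=False)
    scaler.fit_transform(gn)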
You're right. It's only ported over for the sake of niceness and keeping names consistent. Should I remove?
Yep, remove please.
sgkit/stats/preprocessing.py (outdated)

    self.copy: bool = copy
    self.ploidy: int = ploidy

    def _reset(self) -> None:
Scikit-learn has features for this sort of thing, namely clone. That would work for this since the stateful parameters are suffixed by _, so you can drop this function.
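A minimal sketch of the convention being referenced (ToyScaler is a hypothetical estimator, not sgkit code): sklearn.base.clone copies constructor parameters and drops fitted attributes, which by scikit-learn convention end in an underscore.

    import numpy as np
    from sklearn.base import BaseEstimator, clone

    class ToyScaler(BaseEstimator):
        def __init__(self, ploidy: int = 2):
            self.ploidy = ploidy         # constructor param: stateless

        def fit(self, X):
            self.mean_ = X.mean(axis=0)  # fitted state: trailing underscore
            return self

    fitted = ToyScaler().fit(np.ones((3, 2)))
    fresh = clone(fitted)                # keeps ploidy, drops mean_
    assert fresh.ploidy == 2 and not hasattr(fresh, "mean_")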
sgkit/stats/preprocessing.py (outdated)

    """

    # Reset internal state before fitting
    self._reset()
I don't think this is necessary if we go the clone-via-sklearn approach.
sgkit/tests/test_decomposition.py (outdated)

    pca.fit(genotypes)

    def test_patterson_scaler_genotype_pca_sklearn_pipeline():
Nice! Good idea.
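For readers outside the diff, the idea being praised is that both estimators follow the scikit-learn API and therefore compose in a standard Pipeline. A rough sketch; the constructor arguments and import paths are assumed from this PR's file names, not a final API:

    import dask.array as da
    from sklearn.pipeline import Pipeline
    from sgkit.stats.preprocessing import PattersonScaler
    from sgkit.stats.decomposition import GenotypePCA

    # Random (variants x samples) diploid alternate allele counts.
    genotypes = da.random.randint(0, 3, size=(1000, 50), chunks=(500, 50))

    est = Pipeline([
        ("scaler", PattersonScaler()),
        ("pca", GenotypePCA(n_components=10)),
    ])
    X_r = est.fit_transform(genotypes)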
Hey @jerowe thanks for your work on this. Sorry if I missed this somewhere along the way, but what was the rationale behind not using Dask-ML PCA?
@eric-czech thanks for the comments!
The dask-ml PCA produces slightly different results. Once you plot them you can see that the relationships are about the same and things cluster about the same, but the values are still slightly different. Edit: OK, I'm taking a closer look at the dask-ml code and I'll write some tests to see if I can get any parameters to make them the same. Thanks again!
@eric-czech thanks again for the helpful comments. I updated the PR to use dask-ml as a base class for the preprocessors and PCA. This removes the need for more type checking and is more in line with what's actually happening in the code. I also added testing against scikit-allel and added proper docstrings.
@hammer I don't have an …
@jerowe what is the difference between the current GenotypePCA and using dask-ml PCA directly?
@eric-czech they give slightly different results. Once you plot them they look very similar, but I couldn't get them similar enough to get the tests against scikit-allel to pass.
For reference, here's the scikit-allel PCA and here's the dask-ml PCA.
As far as I can tell, the GenotypePCA code seems to be more or less a copy+paste of the dask-ml code with some minor differences that don't seem to be related to properties of genetic data we have to address. Let me know if I'm missing something there. I think we should either:
@hammer no that's fine. I prefer having a review anyways. I was just wondering.
@eric-czech @alimanfoo I work on sgkit Wednesdays and Thursdays, and it's the end of the day for me today. So if you have anything to add here and I don't respond right away, I'm not ignoring you. ;-)
Thanks @jerowe, have a good weekend! I ran a quick comparison between the scikit-allel, scikit-learn, and dask-ml PCA implementations to get a better feel for the differences, btw: https://gist.github.com/eric-czech/0fc7b73146913fe232b6e72adee80ff6. I think they are all equivalent within any meaningful margin of error, though I did hit a bug that I had to fix in the source code related to using dask-ml PCA with short-fat rather than tall-skinny arrays. Other than that, it seems reasonable to assume that, based on dataset size, users could choose freely between the scikit-learn, dask-ml tsqr, and dask-ml randomized implementations. Dask-ml and scikit-learn also force sign determinacy, so that would be a nice benefit. The bug I found was related to that, so I'll file it and see if there is any fundamental limitation with that code for short-fat arrays. Edit: Actually, I take that back. I'm not sure if the dask-ml PCA will work for short-fat arrays despite the fact that …
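On the sign-determinacy point: scikit-learn normalizes component signs with svd_flip (dask-ml reuses the same idea), which is why cross-implementation comparisons should be done up to sign. A small illustration; the random matrix is just example data:

    import numpy as np
    from sklearn.utils.extmath import svd_flip

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 5))

    # Singular vectors are only defined up to sign; svd_flip fixes a
    # convention (largest-magnitude loading made positive) so repeated
    # or cross-library decompositions agree.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U, Vt = svd_flip(U, Vt)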
@eric-czech I see what you mean now with the dask-ml being the same. I think I was testing against skinny arrays and it was failing, so I just moved on. (It does fail on skinny arrays.) Now the question is what we want to do with the PCA. I see 3 options.
Which is then called as:

Which is then called as:

Thoughts?
Alright I guess I didn't push that code.
Should I fork and then open a PR from that fork? Edit: OK, that worked, so I'll just work from a forked sgkit, which I probably should have been doing anyway. ;-)
https://github.com/pystatgen/sgkit/issues/95
Moving scikit-allel PCA functions from:
And adopting conventions from scikit-learn so everyone can use sklearn pipelines.