[MRG] Add fast kernel classifier/regressor #13

Merged: 34 commits merged into scikit-learn-contrib:master on Aug 2, 2019

Conversation

@Alex7Li Alex7Li commented May 21, 2019

This pull request was originally for sklearn, but it was decided to move it to sklearn-extra (scikit-learn/scikit-learn#11694). We add the EigenPro method, also known as FastKernel.

Alex7Li commented May 22, 2019

Oh wow, I was struggling for so long because of a mysterious error about NaNs in a matrix that only appeared when I pushed the code here, but it magically disappeared when I added the fastfood import back (I had removed it because my code wouldn't run in my environment with that line). Not sure why that happened, but it seems like it works now and the code is ready.

Edit: wait, what? I swear I saw a green checkmark! Time to go back to debugging.

@NicolasHug

Thanks for the PR!

I do appreciate the code being well-documented.

Here are a few general comments after a first glance, in order of importance:

  • The current tests are mostly high-level, functional tests. We will also need some low-level tests, e.g. individually testing the private helpers (if relevant, I haven't checked in too much detail). We will also need tests for input checking (see e.g. here) and tests making sure the default values behave as expected.
  • I think it would be more appropriate to have a base class and 2 child classes, one for the regressor and one for the classifier (see the sketch after this list). This is how we do things in scikit-learn; you can take inspiration e.g. from the GBDT code.
  • Some of the attributes (typically Q_, V_) should not be attributes. If they're only useful at fit time, they should be destroyed when fit returns; pass them along and return them from your helper methods. Same for random_state_.
  • The doc for the estimators isn't generated.
  • Please add a short description of each test, indicating what it checks and why you would expect a given result. Also, I feel like a lot of them could be factorized via parametrization.
  • Please document the attributes. Public attributes should end with _, while private ones should start with _. It seems like a lot of the current public attributes should actually be private.
  • Please check your code style with flake8 or an equivalent tool.
  • The benchmark part of the user guide is redundant with the examples. It would be helpful to expand the first part a bit and give more details on the algorithm.
  • Please only use lower case for variable names (e.g. FKR_prediction should just be fkr_prediction; sometimes predictions is enough depending on the context).
  • The examples take some time to run. Maybe keep one and convert the others to benchmarks?
  • CI is still failing.
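
A minimal sketch of the base class / child classes structure being suggested; the class names, parameters, and helper below are hypothetical illustrations, not the PR's actual API:

    from abc import ABCMeta
    from sklearn.base import BaseEstimator, ClassifierMixin, RegressorMixin

    class _BaseEigenPro(BaseEstimator, metaclass=ABCMeta):
        """Shared EigenPro fitting logic; not instantiated directly."""

        def __init__(self, n_components=160, random_state=None):
            self.n_components = n_components
            self.random_state = random_state

        def _raw_fit(self, X, y):
            # The shared solver would live here. Fit-time temporaries
            # (e.g. Q, V) stay local variables that are passed along and
            # returned by helpers instead of being stored as attributes.
            ...

    class EigenProRegressor(RegressorMixin, _BaseEigenPro):
        def fit(self, X, y):
            self._raw_fit(X, y)
            return self

    class EigenProClassifier(ClassifierMixin, _BaseEigenPro):
        def fit(self, X, y):
            self._raw_fit(X, y)
            return self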

@NicolasHug

Also, for variables that aren't attributes, and if there isn't a more obvious verbose alternative, it'd be great if you followed the paper's notation (e.g. E and \Lambda as the result of the SVD).
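
For example, a hypothetical fragment (not the PR's code) illustrating that naming convention for an eigendecomposition:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(5, 5))
    W = A @ A.T  # a symmetric (Hermitian) matrix

    # Follow the paper's notation: eigenvalues Lambda, eigenvectors E
    Lambda, E = np.linalg.eigh(W)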

rth commented Jun 3, 2019

Thanks for this PR @Alex7Li. Looking forward to being able to use it!

I'll review in more detail once the above comments are addressed. Also, please add the new estimators to the list here so that check_estimator is run on them.
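
For reference, a minimal sketch of running the common estimator checks; the import path and class names are taken from later comments in this thread, so treat them as assumptions:

    from sklearn.utils.estimator_checks import check_estimator
    from sklearn_extra.kernel_methods import EigenProClassifier, EigenProRegressor

    # Runs scikit-learn's API-compliance checks against each estimator
    check_estimator(EigenProClassifier())
    check_estimator(EigenProRegressor())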

@GaelVaroquaux

Playing with this code in my lab, we are struggling with the number of eigenvalues used to approximate the kernel: setting it to a large value throws an error on certain datasets, so we have to set it to a moderate value (160, as in the paper), which is suboptimal for some datasets. It ends up being a dataset-specific parameter, which is a usability concern.

Cc @pcerda

@amueller

@GaelVaroquaux what's the error? Can you provide code to reproduce?

GaelVaroquaux commented Jun 17, 2019 via email

pcerda commented Jun 18, 2019

@GaelVaroquaux what's the error?
NaN
Can you provide code to reproduce?
Ping @pcerda

The problem occurs at line 177 of fast_kernel.py, when calculating the highest n_components eigenvalues:

S, V = sp.linalg.eigh(W, eigvals=(m - n_components, m - 1))
where n_components=1000 and m=1601
with the following Hermitian matrix W:
https://www.dropbox.com/s/xqcxhnx1p06u76o/W.npy?dl=1

With n_components=160 I do not have this problem.
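
A minimal reproduction sketch based on the details above (W.npy from the link, m = 1601, n_components = 1000); note that in newer SciPy versions the eigvals keyword has been replaced by subset_by_index:

    import numpy as np
    import scipy.linalg

    W = np.load("W.npy")   # the Hermitian matrix linked above, shape (1601, 1601)
    m = W.shape[0]
    n_components = 1000    # reported to fail; n_components = 160 works

    # Same call as line 177 of fast_kernel.py: largest n_components eigenpairs
    S, V = scipy.linalg.eigh(W, eigvals=(m - n_components, m - 1))
    print(np.isnan(S).any(), np.isnan(V).any())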

@amueller

@pcerda can you share the full code and data? If the data is private, can you share a synthetic or open example that reproduces the issue?

@amueller

Hm, once scikit-learn/scikit-learn#14381 is merged we'll be able to get more helpful error messages. Though I guess not on the CI, because it's not released yet?

This seems to be quite helpful according to initial experiments so I'd love to see this merged.

Alex7Li commented Jul 23, 2019

I think it's ready to be merged right now; I explained above why I think it's impossible to fix the CI errors. Is there something else I should do?

@amueller

Oh, you're right. Sorry, I lost track. I didn't review this in detail, but I think it's in a state we can iterate from, which I think is the point of scikit-learn-extra.
Maybe remove the underscore in the class names; I don't think that's PEP 8: https://www.python.org/dev/peps/pep-0008/#class-names

@amueller

@NicolasHug you had looked at this before, do you want to do another round?

Alex7Li commented Jul 23, 2019

Oh hey it passed the tests cause the version changed!

rth commented Jul 23, 2019

Oh hey it passed the tests cause the version changed!

Great! Yeah we dropped support for scikit-learn <0.20 recently for this repo.

@rth rth left a comment

Thanks for your PR @Alex7Li!

I did a quick pass on the code. See comments below.

As a major point, I find the names FKREigenPro and FKCEigenPro very difficult to parse. Can we change them to either FastKernelRegressor / FastKernelClassifier or EigenProRegressor / EigenProClassifier?

Also please add,

.. toctree::
     :numbered:
  
     modules/fast_kernel.rst


to user_guide.rst so it is linked.

I have not read the paper, but the results look compelling:

[benchmark results figure]

Unless there are other major concerns, I would also be +1 for merging once the (minor) comments below are addressed.

@rth rth left a comment

Also please rename examples/fast_kernel/plot_mnist.py to examples/plot_fast_kernel_mnist.py.

I'm concerned that static benchmark plots could get outdated, but running them for each commit is also not ideal. Please include the code you used to make those figures, maybe under benchmarks/bench_*.

Maybe we should put the code under sklearn_extra/kernel_methods/_fast_kernel.py, so that it can be imported from sklearn_extra.kernel_methods directly.

Alex7Li commented Jul 27, 2019

Still working on the second post (I want to rerun all the examples on our server), but I made all the changes from the first post. I changed the class names to EigenProRegression/Classification, and also changed the main file name from fast_kernel.py to eigenpro.py, just to get rid of that alias. I'm not sure why it should be moved to a kernel_methods folder exactly; how does that make it easier to import?

Alex7Li commented Jul 27, 2019

OK, I have finished updating the code.

@rth rth left a comment

Thanks @Alex7Li!

I'm not sure why it should be moved to a kernel_methods folder exactly, how does that make it easier to import?

To follow the scikit-learn approach, where we have general topics (e.g. clustering, linear_model, etc.) under which individual estimators are imported. I'm not sure what the best section name for this estimator is, but it's probably not eigenpro. If you create, say, sklearn_extra/kernel_methods/_eigenpro.py and then import these estimators in sklearn_extra/kernel_methods/__init__.py, the actual import path users will see is going to be sklearn_extra.kernel_methods.EigenProClassifier. Again, proposals for a better section name than kernel_methods are welcome. The same section name would then need to be used for doc/modules/<section-name>.rst.
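
Concretely, a sketch of the layout being described (the class names used here follow the example above and should be treated as assumptions):

    # sklearn_extra/kernel_methods/__init__.py
    from ._eigenpro import EigenProClassifier, EigenProRegressor

    __all__ = ["EigenProClassifier", "EigenProRegressor"]

so that users can then write:

    from sklearn_extra.kernel_methods import EigenProClassifier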

I haven't been able to run the benchmarks on a laptop with 8GB RAM due to memory errors, but that seems fine.

Also please resolve conflicts (and re-run black).

Apart from that +1 for merging and opening follow-up issues.

@amueller

I'm +1 for merging as well.
@Alex7Li it would be awesome if you could stick around a bit to help with the implementation. I think this looks very promising but it's likely that we'll run into issues (because that always happens ;).

Alex7Li commented Aug 2, 2019

Thanks Roman for explaining; I moved the code and finished the merge. I also tried to fix the memory error issue: I think the estimator was just choosing a batch size that was quite large.
I will stick around after this is finished, though I'm not sure exactly what that entails. Browsing through the issues and seeing if there is anything related to this code is what I'm imagining now.
Also, I'm having a bit of trouble passing the checks; it seems like CI is failing while trying to download packages. I don't think anything I changed should affect that, though?

rth commented Aug 2, 2019

The problem is with conda failing to downgrade from 4.7 to 4.6. I can't push to this PR to fix it (as you are using the master branch of your fork instead of a separate branch). CI was green before the last few minor commits, and the tests pass locally for me, so let's just merge this. I'll make a follow-up PR to fix Travis.

Thanks again @Alex7Li!

I will stick around after this is finished, though I'm not sure exactly what that entails. Browsing through the issues and seeing if there is anything related to this code is what I'm imagining now.

Yeah, mostly that. Also, if you think of anything that in your opinion could be improved (in this code or other code in scikit-learn-extra), don't hesitate to open issues.

@rth rth merged commit ef3f570 into scikit-learn-contrib:master Aug 2, 2019

rth commented Aug 2, 2019

@Alex7Li Also note that because you were using the master branch of your fork, if you want to sync it with the upstream master, you would need to git reset --hard to some commit deep in the history before your changes, and then rebase on top of upstream/master to get the merged changes.

For future PRs it's better to create a new branch off master.

@amueller

I'm a bit surprised by the current behavior: the gamma parameter is completely ignored, despite the documentation.
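
A quick way to observe this, as a sketch; the import path and the gamma and random_state parameters are assumptions based on the discussion above rather than a confirmed API:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn_extra.kernel_methods import EigenProClassifier

    X, y = make_classification(n_samples=200, random_state=0)

    pred_a = EigenProClassifier(gamma=0.01, random_state=0).fit(X, y).predict(X)
    pred_b = EigenProClassifier(gamma=100.0, random_state=0).fit(X, y).predict(X)

    # If gamma were actually used, such different bandwidths would
    # normally produce different predictions.
    print(np.array_equal(pred_a, pred_b))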
