[MRG] Add fast kernel classifier/regressor #13

Merged: 34 commits merged into scikit-learn-contrib:master on Aug 2, 2019

Conversation

@Alex7Li Alex7Li commented May 21, 2019

This pull request was originally for sklearn, but it was decided to move it to sklearn-extra (scikit-learn/scikit-learn#11694). We add the EigenPro method, also known as FastKernel.

Alex7Li commented May 22, 2019

Oh wow, I was struggling for so long because of a mysterious error about NaNs in a matrix that only appeared when I pushed the code here, but it magically disappeared when I added the fastfood import back (I had removed it because my code wouldn't run in my environment with that line). Not sure why that happened, but it seems like it works now and the code is ready.

Edit: wait, what? I swear I saw a green checkmark! Time to go back to debugging.

@NicolasHug

Thanks for the PR!

I do appreciate the code being well-documented.

Here are a few general comments after a first glance, in order of importance:

  • The current tests are mostly high-level, functional tests. We will also need some low-level tests, e.g. individually testing the private helpers (if relevant, I haven't checked in too much detail). We will also need tests for input checking (see e.g. here) and tests making sure the default values behave as expected.
  • I think it would be more appropriate to have a base class and 2 child classes, one for the regressor and one for the classifier (see the sketch after this list). This is how we do things in scikit-learn; you can take inspiration e.g. from the GBDT code.
  • Some of the attributes (typically Q_, V_) should not be attributes. If they're only useful at fit time, they should be destroyed when fit returns; pass them along and return them from your helper methods. Same for random_state_.
  • The doc for the estimators isn't generated.
  • Please add a short description of each test, indicating what it checks and why you would expect a given result. Also, I feel like a lot of them could be factorized via parametrization.
  • Please document the attributes. Public attributes should end with _, while private ones should start with _. It seems like a lot of the current public attributes should actually be private.
  • Please check your code style with flake8 or an equivalent tool.
  • The benchmark part of the user guide is redundant with the examples. It would be helpful to expand the first part a bit and give more details on the algorithm.
  • Please only use lower case for variable names (e.g. FKR_prediction should just be fkr_prediction; sometimes predictions is enough depending on the context).
  • The examples take some time to run. Maybe keep one and convert the others to benchmarks?
  • CI is still failing.
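
A minimal sketch of the base class / child classes structure being suggested; the class names, parameters, and helper below are hypothetical illustrations, not the PR's actual API:

    from abc import ABCMeta
    from sklearn.base import BaseEstimator, ClassifierMixin, RegressorMixin

    class _BaseEigenPro(BaseEstimator, metaclass=ABCMeta):
        """Shared EigenPro fitting logic; not instantiated directly."""

        def __init__(self, n_components=160, random_state=None):
            self.n_components = n_components
            self.random_state = random_state

        def _raw_fit(self, X, y):
            # The shared solver would live here. Fit-time temporaries
            # (e.g. Q, V) stay local variables that are passed along and
            # returned by helpers instead of being stored as attributes.
            ...

    class EigenProRegressor(RegressorMixin, _BaseEigenPro):
        def fit(self, X, y):
            self._raw_fit(X, y)
            return self

    class EigenProClassifier(ClassifierMixin, _BaseEigenPro):
        def fit(self, X, y):
            self._raw_fit(X, y)
            return self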

@NicolasHug

Also, for variables that aren't attributes, and if there isn't a more obvious verbose alternative, it'd be great if you followed the paper's notation (e.g. E and \Lambda as the result of the SVD).
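
For example, a hypothetical fragment (not the PR's code) illustrating that naming convention for an eigendecomposition:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(5, 5))
    W = A @ A.T  # a symmetric (Hermitian) matrix

    # Follow the paper's notation: eigenvalues Lambda, eigenvectors E
    Lambda, E = np.linalg.eigh(W)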

rth commented Jun 3, 2019

Thanks for this PR @Alex7Li. Looking forward to being able to use it!

I'll review in more detail once the above comments are addressed. Also, please add the new estimators to the list here so that check_estimator is run on them.
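
For reference, a minimal sketch of running the common estimator checks; the import path and class names are taken from later comments in this thread, so treat them as assumptions:

    from sklearn.utils.estimator_checks import check_estimator
    from sklearn_extra.kernel_methods import EigenProClassifier, EigenProRegressor

    # Runs scikit-learn's API-compliance checks against each estimator
    check_estimator(EigenProClassifier())
    check_estimator(EigenProRegressor())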

@GaelVaroquaux

Playing with this code in my lab, we are struggling with the number of eigenvalues used to approximate the kernel: setting it to a large value throws an error on certain datasets, so we have to set it to a moderate value (160, as in the paper), which is suboptimal for some datasets. It ends up being a dataset-specific parameter, which is a usability concern.

Cc @pcerda

@amueller

@GaelVaroquaux what's the error? Can you provide code to reproduce?

GaelVaroquaux commented Jun 17, 2019 via email

pcerda commented Jun 18, 2019

@GaelVaroquaux what's the error?
NaN
Can you provide code to reproduce?
Ping @pcerda

The problem occurs at line 177 of fast_kernel.py, when calculating the highest n_components eigenvalues:

S, V = sp.linalg.eigh(W, eigvals=(m - n_components, m - 1))
where n_components=1000 and m=1601
with the following Hermitian matrix W:
https://www.dropbox.com/s/xqcxhnx1p06u76o/W.npy?dl=1

With n_components=160 I do not have this problem.
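
A minimal reproduction sketch based on the details above (W.npy from the link, m = 1601, n_components = 1000); note that in newer SciPy versions the eigvals keyword has been replaced by subset_by_index:

    import numpy as np
    import scipy.linalg

    W = np.load("W.npy")   # the Hermitian matrix linked above, shape (1601, 1601)
    m = W.shape[0]
    n_components = 1000    # reported to fail; n_components = 160 works

    # Same call as line 177 of fast_kernel.py: largest n_components eigenpairs
    S, V = scipy.linalg.eigh(W, eigvals=(m - n_components, m - 1))
    print(np.isnan(S).any(), np.isnan(V).any())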

@amueller

@pcerda can you share the full code and data? If the data is private, can you share a synthetic or open example that reproduces the issue?

@amueller

Hm, once scikit-learn/scikit-learn#14381 is merged we'll be able to get more helpful error messages. Though I guess not on the CI, because it's not released yet?

This seems to be quite helpful according to initial experiments so I'd love to see this merged.

Alex7Li commented Jul 23, 2019

I think it's ready to be merged right now; I explained above why I think it's impossible to fix the CI errors. Is there something else I should do?

@amueller

Oh, you're right. Sorry, I lost track. I didn't review this in detail, but I think it's in a state we can iterate from, which I think is the point of scikit-learn-extra.
Maybe remove the underscore in the class names; I don't think that's PEP 8: https://www.python.org/dev/peps/pep-0008/#class-names

@amueller

@NicolasHug you had looked at this before, do you want to do another round?

Alex7Li commented Jul 23, 2019

Oh hey it passed the tests cause the version changed!

rth commented Jul 23, 2019

Oh hey it passed the tests cause the version changed!

Great! Yeah we dropped support for scikit-learn <0.20 recently for this repo.

@rth rth left a comment

Thanks for your PR @Alex7Li!

I did a quick pass on the code. See comments below.

As a major point, I find the names FKREigenPro and FKCEigenPro very difficult to parse. Can we change them to either FastKernelRegressor / FastKernelClassifier or EigenProRegressor / EigenProClassifier?

Also please add,

.. toctree::
     :numbered:
  
     modules/fast_kernel.rst


to user_guide.rst so it is linked.

I have not read the paper, but the results look compelling:

[benchmark results figure]

Unless there are other major concerns, I would also be +1 for merging once the (minor) comments below are addressed.

@rth rth left a comment

Also please rename examples/fast_kernel/plot_mnist.py to examples/plot_fast_kernel_mnist.py.

I'm concerned that static benchmark plots could get outdated, but running them for each commit is also not ideal. Please include the code you used to make those figures, maybe under benchmarks/bench_*.

Maybe we should put the code under sklearn_extra/kernel_methods/_fast_kernel.py, so that it can be imported from sklearn_extra.kernel_methods directly.

Alex7Li commented Jul 27, 2019

Still working on the second post (I want to rerun all the examples on our server), but I made all the changes from the first post. I changed the class names to EigenProRegression/Classification, and also changed the main file name from fast_kernel.py to eigenpro.py, just to get rid of that alias. I'm not sure why it should be moved to a kernel_methods folder exactly; how does that make it easier to import?

Alex7Li commented Jul 27, 2019

OK, I have finished updating the code.

@rth rth left a comment

Thanks @Alex7Li!

I'm not sure why it should be moved to a kernel_methods folder exactly, how does that make it easier to import?

To follow the scikit-learn approach, where we have general topics (e.g. clustering, linear_model, etc.) under which individual estimators are imported. I'm not sure what the best section name for this estimator is, but it's probably not eigenpro. If you create, say, sklearn_extra/kernel_methods/_eigenpro.py and then import these estimators in sklearn_extra/kernel_methods/__init__.py, the actual import path users will see is going to be sklearn_extra.kernel_methods.EigenProClassifier. Again, proposals for a better section name than kernel_methods are welcome. The same section name would then need to be used for doc/modules/<section-name>.rst.
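
Concretely, a sketch of the layout being described (the class names used here follow the example above and should be treated as assumptions):

    # sklearn_extra/kernel_methods/__init__.py
    from ._eigenpro import EigenProClassifier, EigenProRegressor

    __all__ = ["EigenProClassifier", "EigenProRegressor"]

so that users can then write:

    from sklearn_extra.kernel_methods import EigenProClassifier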

I haven't been able to run the benchmarks on a laptop with 8GB RAM due to memory errors, but that seems fine.

Also please resolve conflicts (and re-run black).

Apart from that +1 for merging and opening follow-up issues.

@amueller

I'm +1 for merging as well.
@Alex7Li it would be awesome if you could stick around a bit to help with the implementation. I think this looks very promising but it's likely that we'll run into issues (because that always happens ;).

Alex7Li commented Aug 2, 2019

Thanks Roman for explaining; I moved the code and finished the merge. I also tried to fix the memory error issue: I think the estimator was just choosing a batch size that was quite large.
I will stick around after this is finished, though I'm not sure exactly what that entails. Browsing through the issues and seeing if there is anything related to this code is what I'm imagining now.
Also, I'm having a bit of trouble passing the checks; it seems like CI is failing while trying to download packages. I don't think anything I changed should affect that, though?

rth commented Aug 2, 2019

The problem is with conda failing to downgrade from 4.7 to 4.6. I can't push to this PR to fix it (as you are using the master branch of your fork instead of a separate branch). CI was green before the last few minor commits, and the tests pass locally for me, so let's just merge this. I'll make a follow-up PR to fix Travis.

Thanks again @Alex7Li!

I will stick around after this is finished, though I'm not sure exactly what that entails. Browsing through the issues and seeing if there is anything related to this code is what I'm imagining now.

Yeah, mostly that. Also, if you think of anything that in your opinion could be improved (in this code or other code in scikit-learn-extra), don't hesitate to open issues.

@rth rth merged commit ef3f570 into scikit-learn-contrib:master Aug 2, 2019

rth commented Aug 2, 2019

@Alex7Li Also note that because you were using the master branch of your fork, if you want to sync it with the upstream master, you would need to git reset --hard to some commit deep in the history before your changes, and then rebase on top of upstream/master to get the merged changes.

For future PRs it's better to create a new branch off master.

@amueller

I'm a bit surprised by the current behavior: the gamma parameter is completely ignored, despite the documentation.
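
A quick way to observe this, as a sketch; the import path and the gamma and random_state parameters are assumptions based on the discussion above rather than a confirmed API:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn_extra.kernel_methods import EigenProClassifier

    X, y = make_classification(n_samples=200, random_state=0)

    pred_a = EigenProClassifier(gamma=0.01, random_state=0).fit(X, y).predict(X)
    pred_b = EigenProClassifier(gamma=100.0, random_state=0).fit(X, y).predict(X)

    # If gamma were actually used, such different bandwidths would
    # normally produce different predictions.
    print(np.array_equal(pred_a, pred_b))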
