
[MRG] Add fast kernel classifier/regressor (see #11039) #11694


Closed
wants to merge 63 commits

Conversation

EigenPro

This pull request implements the feature in issue #11039

ToDo:

  1. Fix several unittest failures
  2. Add a user guide page
  3. Add an example

@jnothman
Member

jnothman commented Jul 29, 2018 via email

@amueller
Member

amueller commented Aug 8, 2018

maybe it's a tabs vs spaces thing? Sorry we still haven't released, but we'll look at this soon.

There are a bunch of errors related to

AttributeError: 'module' object has no attribute 'multi_dot'

Please check which NumPy version multi_dot was added in; we might need a backport.
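A minimal sketch of the kind of guard that could address this, assuming the failures come from numpy.linalg.multi_dot (which only exists in NumPy >= 1.9); the fallback helper below is illustrative, not code from this PR.

```python
# Hypothetical backport guard for numpy.linalg.multi_dot (NumPy >= 1.9).
# Older NumPy versions raise the AttributeError quoted above.
import numpy as np

try:
    from numpy.linalg import multi_dot
except ImportError:
    def multi_dot(arrays):
        # Fallback: chain plain dot products, ignoring the optimal
        # parenthesization that multi_dot would compute.
        result = arrays[0]
        for a in arrays[1:]:
            result = np.dot(result, a)
        return result
```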

@sklearn-lgtm

This pull request introduces 4 alerts when merging 46488c3 into 53069c2 - view on LGTM.com

new alerts:

  • 1 for Comparison using is when operands support __eq__
  • 1 for Unused local variable
  • 1 for Unused import
  • 1 for Implicit string concatenation in a list

Comment posted by LGTM.com

@amueller
Member

amueller commented Nov 5, 2018

How long do the examples run?
Also, it might be interesting to add this to some of the comparisons in the benchmarks folder.
Having some comparisons with SVC and kernel approximation would be nice.

@EigenPro
Author

EigenPro commented Nov 5, 2018

We did compare the fast kernel method with SVC in three examples (MNIST, noisy MNIST, and synthetic). Training SVC on one dataset (noisy MNIST) can take as long as 27 minutes. The fast kernel method normally completes training in 10–25% of the time used by SVC. It also shows consistently better test accuracy and a nearly 10x speedup over SVC at prediction time. Note that we ran these experiments on a server with one Intel Xeon E5-1620 CPU (4 cores).

@amueller as to the kernel approximation, do you mean kernel ridge regression? For now we have implemented a kernel classifier and a kernel regressor. Notably, our fast kernel regression and kernel ridge regression (without the ridge penalty :) converge to the same optimal solution.

See results for noisy mnist here:
https://github.com/scikit-learn/scikit-learn/blob/b8e32885dbfd06a534be8d4c2a5c16233188a688/doc/images/fast_kernel_noisy_mnist.png

@amueller
Member

I meant using either sklearn.kernel_approximation.Nystroem or sklearn.kernel_approximation.RBFSampler (which implements Rahimi and Recht) and then RidgeClassifier.
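For reference, a minimal sketch of the baseline being suggested here, using scikit-learn's existing APIs; the dataset and hyperparameters (gamma, n_components, alpha) are placeholders, not tuned values from this discussion.

```python
# Kernel-approximation baselines: Nystroem or RBFSampler (random Fourier
# features, Rahimi & Recht) followed by a linear RidgeClassifier.
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem, RBFSampler
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

nystroem_ridge = make_pipeline(
    Nystroem(kernel="rbf", gamma=0.1, n_components=300, random_state=0),
    RidgeClassifier(alpha=1.0),
)
rff_ridge = make_pipeline(
    RBFSampler(gamma=0.1, n_components=300, random_state=0),
    RidgeClassifier(alpha=1.0),
)

for name, model in [("Nystroem + Ridge", nystroem_ridge),
                    ("RBFSampler + Ridge", rff_ridge)]:
    print(name, model.fit(X, y).score(X, y))
```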

@EigenPro
Author

Sorry for the late update. We have compared sklearn.kernel_approximation.Nystroem followed by sklearn.linear_model.RidgeClassifier against our method (run for 1 epoch) on the full MNIST dataset. The results can be seen in the attached file. We can add this test to the source if needed.

mnist-nystrom-epro.pdf

@GaelVaroquaux
Member

I am sorry, but the paper behind the method is cited 23 times on Google Scholar. This is far below our citation criterion for inclusion (https://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms).

Hence, this method cannot be contributed to scikit-learn. It should be contributed as a package in scikit-learn contrib.

@EigenPro
Author

@GaelVaroquaux: I see your concern. So would you consider the other criterion for inclusion (https://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms)?

"A technique that provides a clear-cut improvement on a widely-used method will also be considered for inclusion."

The improvement in this MRG is clear-cut. It is a preconditioned iterative method that is theoretically guaranteed to improve performance (see an arXiv version of the paper here: https://arxiv.org/abs/1703.10622). We also show strong empirical evidence on various datasets (in both the paper and this MRG).

The methods we aim to improve are kernel machines (e.g., SVM and kernel regression), which are widely used.
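To make the preconditioning idea described above concrete, here is a toy, self-contained sketch (not the PR's implementation): damp the top-k eigendirections of the kernel matrix so that a plain Richardson/gradient iteration on K a = y can safely use a much larger step size. The data, kernel width, and k are arbitrary illustrative choices.

```python
# Toy eigen-preconditioned Richardson iteration for kernel least squares.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(0)
X = rng.randn(300, 5)
y = np.sin(X[:, 0]) + 0.1 * rng.randn(300)

K = rbf_kernel(X, X, gamma=0.05)
n, k = K.shape[0], 10

eigvals, eigvecs = np.linalg.eigh(K)           # eigenvalues in ascending order
top_vals, top_vecs = eigvals[-k:], eigvecs[:, -k:]
tail_max = eigvals[-k - 1]                     # largest eigenvalue outside the top k

def run(n_iter, step, precondition):
    a = np.zeros(n)
    for _ in range(n_iter):
        g = K @ a - y                          # gradient direction for the residual
        if precondition:
            # Shrink g along the top eigendirections so that the effective
            # spectrum of the iteration is bounded by tail_max.
            c = top_vecs.T @ g
            g = g - top_vecs @ ((1.0 - tail_max / top_vals) * c)
        a -= step * g
    return np.linalg.norm(K @ a - y) / np.linalg.norm(y)

print("plain, step 1/lambda_max   :", run(200, 1.0 / eigvals[-1], False))
print("preconditioned, 1/tail_max :", run(200, 1.0 / tail_max, True))
```

With the preconditioner the safe step size grows from roughly 1/λ_1 to 1/λ_{k+1}, which is where the speed-up in this sketch comes from.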

@GaelVaroquaux
Member

The sentence about providing a clear-cut improvement would be for a method that has feature parity, for instance doing the same thing but faster.

I see two problems with considering the inclusion of EigenPro:

  • First, it exposes us to many similar requests. There are at least several dozen papers a year that contribute a specific improvement to an established method and end up weakly used and weakly cited. To ensure the future of scikit-learn, we need to focus on a small number of methods; we just cannot address them all. This is why we focus on well-cited papers: it is an indication that there is strong interest, albeit an imperfect one. Of course, there is a chicken-and-egg problem: papers get well cited when they have an easily accessible implementation, as in scikit-learn. This is why we created scikit-learn-contrib.

  • EigenPro is not merely a fast solver for a classic problem. It optimizes a squared-loss classifier, which is not one of the most popular losses for classification. Reading the paper, it seems to me that the optimization strategy also introduces an implicit regularization via the optimization. Hence, it corresponds to a new learning problem, the popularity of which needs to be established.

@EigenPro
Author

EigenPro commented Mar 6, 2019

@GaelVaroquaux: After some discussions on your last reply, we would like to clarify a few points in regard to our method, EigenPro, to show that it is a clear-cut improvement over existing solvers for kernel regression.

EigenPro is a fast solver for the classical problem of kernel regression, which is quite central to machine learning and statistics. Kernel regression is also implemented in scikit-learn
(https://scikit-learn.org/stable/modules/kernel_ridge.html). The EigenPro solution is mathematically equivalent to that of the original regression problem, but the algorithm is much faster due to preconditioning. At this time we are not aware of any other method with a comparable speed-up. Thus, we believe it is a "clear-cut improvement" over a classical and widely used method.

The use of the square loss for classification has a long history as well (see, e.g., http://cbcl.mit.edu/publications/ps/rlsc.pdf for a discussion, references, and experimental results). The square loss typically performs as well as or better than the hinge loss in terms of test error (see the reference above or, for example, Tables 1 and 2 in http://www.jmlr.org/proceedings/papers/v51/que16-supp.pdf). It is not clear that the hinge loss has any systematic advantage over the square loss for classification in kernel methods; perhaps it is used primarily for historical reasons. While the cross-entropy (logistic) loss is commonly used with neural networks, it is not a typical choice for kernel machines.
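As a concrete illustration of the square-loss classifier discussed above, here is a minimal sketch using scikit-learn's existing estimators: kernel ridge regression fit on ±1-encoded labels, with the sign of the prediction used as the class (the classical RLSC recipe), compared against a hinge-loss SVC. The dataset and hyperparameters are placeholders, not results from this thread.

```python
# Square-loss kernel classification (RLSC-style) vs. hinge-loss SVC.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
y_pm = 2 * y - 1                    # encode the two classes as -1 / +1
X_tr, X_te, y_tr, y_te = train_test_split(X, y_pm, random_state=0)

# Square loss: regress the +/-1 labels with kernel ridge, classify by sign.
krr = KernelRidge(kernel="rbf", gamma=0.1, alpha=1e-3).fit(X_tr, y_tr)
acc_square = np.mean(np.sign(krr.predict(X_te)) == y_te)

# Hinge loss: standard kernel SVM with the same kernel.
svm = SVC(kernel="rbf", gamma=0.1, C=1.0).fit(X_tr, y_tr)
acc_hinge = svm.score(X_te, y_te)

print("square loss accuracy:", acc_square)
print("hinge loss accuracy :", acc_hinge)
```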

@agramfort
Member

a few remarks:

  • we cannot create a new module for a single estimator.
  • if it's an improvement to an existing estimator, it should be exposed as an option (like a "solver" parameter) of that estimator.

my 2c

@GaelVaroquaux
Member

If it's a new faster solver, it must solve the same exact mathematical problem, as checked by convergence tests.

Reading the paper, I had the impression that the problem solved by EigenPro has additional regularization effects.
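A sketch of the kind of convergence test meant here: when run to convergence on the same kernel matrix and the same (possibly vanishing) ridge penalty, a new solver should reproduce KernelRidge's dual coefficients. The plain gradient (Richardson) iteration below is a generic stand-in for the new solver, not the PR's code.

```python
# Convergence check: iterative solver vs. KernelRidge's closed-form solution.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = np.sin(X[:, 0]) + 0.1 * rng.randn(200)

gamma, alpha = 0.5, 1.0
K = rbf_kernel(X, X, gamma=gamma)
A = K + alpha * np.eye(len(X))      # KernelRidge solves A @ dual_coef = y

# Reference solution from scikit-learn.
ref = KernelRidge(kernel="rbf", gamma=gamma, alpha=alpha).fit(X, y).dual_coef_

# Stand-in iterative solver: plain gradient iteration on A @ a = y.
a = np.zeros(len(X))
step = 1.0 / np.linalg.eigvalsh(A)[-1]
for _ in range(2000):
    a -= step * (A @ a - y)

np.testing.assert_allclose(a, ref, atol=1e-6)
print("iterative solver matches KernelRidge")
```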

@amueller
Member

@GaelVaroquaux you mentioned a related paper that you considered more mature, can you remind me what that was?

@GaelVaroquaux
Member

GaelVaroquaux commented Apr 23, 2019 via email

@amueller
Member

amueller commented Aug 6, 2019

Closing as merged into scikit-learn-extra.

@amueller amueller closed this Aug 6, 2019