[MRG] Add fast kernel classifier/regressor (see #11039) #11694
Conversation
Please ping in a few weeks, as we are focusing on releasing 0.20.
Maybe it's a tabs vs. spaces thing? Sorry we still haven't released, but we'll look at this soon. There are a bunch of errors related to
Please check in which version it was added; we might need a backport.
This pull request introduces 4 alerts when merging 46488c3 into 53069c2 - view on LGTM.com. New alerts:
Comment posted by LGTM.com
How long do the examples run?
We did compare the fast kernel method with SVC on three examples (MNIST, noisy MNIST, and synthetic data). Training SVC on one dataset (noisy MNIST) can take as long as 27 minutes. The fast kernel method normally completes training in 10-25% of the time used by SVC. It also shows consistently better test accuracy and a nearly 10x speedup over SVC at prediction time. Note that we ran these experiments on a server with one Intel Xeon E5-1620 CPU (4 cores). @amueller as to the kernel approximation, do you mean kernel ridge regression? For now we have implemented a kernel classifier and a kernel regressor. Notably, our fast kernel regression and kernel ridge regression (without the ridge :) converge to the same optimal solution. See results for noisy MNIST here:
I meant using either sklearn.kernel_approximation.Nystroem or sklearn.kernel_approximation.RBFSampler (which implements Rahimi and Recht) and then RidgeClassifier.
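For readers following along, a minimal sketch of this suggested baseline might look as follows. The class names are the real scikit-learn APIs mentioned above; the dataset and parameter values are illustrative assumptions, not the benchmark actually run in this thread.

```python
# Hypothetical sketch of the suggested baseline: an explicit kernel
# approximation (Nystroem, or RBFSampler for Rahimi-Recht features)
# followed by a linear ridge classifier. Parameters are illustrative.
from sklearn.datasets import load_digits
from sklearn.kernel_approximation import Nystroem  # or RBFSampler
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Map inputs into an approximate RBF feature space, then fit a ridge
# classifier (squared loss, one-vs-rest) on the transformed features.
clf = make_pipeline(
    Nystroem(kernel="rbf", gamma=0.02, n_components=300, random_state=0),
    RidgeClassifier(alpha=1.0),
)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```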
Sorry for the late update. We have compared sklearn.kernel_approximation.Nystroem followed by sklearn.linear_model.RidgeClassifier against our method (run for 1 epoch) on the full MNIST. The result can be seen in the attached file. We can add this test to the source if needed.
I am sorry, but the paper behind the method is cited 23 times on Google Scholar. This is far below our citation criterion for inclusion (https://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms). Hence, this method cannot be contributed to scikit-learn. It should be contributed as a package in scikit-learn-contrib.
@GaelVaroquaux: I see your concern. Would you then consider the other criterion for inclusion (https://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms): "A technique that provides a clear-cut improvement on a widely-used method will also be considered for inclusion."? The improvement in this MRG is clear-cut. It is a preconditioned iterative method that is theoretically guaranteed to improve performance (see an arXiv version of the paper here: https://arxiv.org/abs/1703.10622). We also show strong empirical evidence on various datasets (in both the paper and this MRG). The methods we aim to improve are kernel machines (e.g., SVM and kernel regression), which are widely used.
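To illustrate the general idea of a preconditioned iteration being claimed here, a toy sketch follows. This is not the paper's exact EigenPro algorithm (which is stochastic and operates in the RKHS; see the linked arXiv paper); it only shows why rescaling the top eigendirections of the kernel matrix allows a much larger step size, so a fixed iteration budget drives the residual down much further. The data and the choice k=20 are made up for illustration.

```python
# Toy eigen-preconditioned fixed-point iteration for the kernel system
# K w = y, compared with the plain iteration. Illustrative only; not the
# exact EigenPro algorithm from the paper.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(0)
X = rng.randn(300, 10)
y = rng.randn(300)
K = rbf_kernel(X, X, gamma=0.05)

eigvals, eigvecs = np.linalg.eigh(K)   # ascending order
lam = eigvals[::-1]                    # descending eigenvalues
Q = eigvecs[:, ::-1]
k = 20

# Preconditioner that damps the top-k eigendirections so the effective
# spectrum is capped at lam[k], permitting a step size of 1/lam[k]
# instead of 1/lam[0].
P = np.eye(len(X)) - Q[:, :k] @ np.diag(1.0 - lam[k] / lam[:k]) @ Q[:, :k].T

def richardson(step, precond, n_iter=200):
    """Fixed-step iteration w <- w + step * precond @ (y - K @ w)."""
    w = np.zeros_like(y)
    for _ in range(n_iter):
        w += step * (precond @ (y - K @ w))
    return np.linalg.norm(K @ w - y)

# Neither residual reaches zero (tiny tail eigenvalues barely move);
# the point is the relative improvement from preconditioning.
print("plain residual:        ", richardson(1.0 / lam[0], np.eye(len(X))))
print("preconditioned residual:", richardson(1.0 / lam[k], P))
```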
The sentence about providing a clear-cut improvement would be for a method that has feature parity, for instance doing the same thing but faster. I see two problems with considering the inclusion of EigenPro:
@GaelVaroquaux: After some discussion of your last reply, we would like to clarify a few points regarding our method, EigenPro, to show that it is a clear-cut improvement over existing solvers for kernel regression. EigenPro is a fast solver for the classical problem of kernel regression, which is quite central to machine learning and statistics; kernel regression is also implemented in scikit-learn. The use of the square loss for classification has a long history as well (see, e.g., http://cbcl.mit.edu/publications/ps/rlsc.pdf for a discussion, references, and experimental results). The square loss typically performs as well as or better than the hinge loss in terms of test error (see the reference above or, for example, Tables 1 and 2 in http://www.jmlr.org/proceedings/papers/v51/que16-supp.pdf). It is not clear that the hinge loss has any systematic advantage over the square loss for classification in kernel methods; perhaps it is used primarily for historical reasons. While the cross-entropy (logistic) loss is commonly used with neural networks, it is not a typical choice for kernel machines.
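For reference, the exact-solver baseline under discussion, kernel ridge regression, is already available in scikit-learn. A minimal usage sketch, with synthetic data and hyperparameters chosen purely for illustration:

```python
# Minimal sketch of kernel ridge regression in scikit-learn, i.e. the
# kind of kernel regression solver an iterative method aims to speed up.
# Data and hyperparameters below are illustrative assumptions.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.randn(200)

# Direct solve of (K + alpha * I) w = y with an RBF kernel.
model = KernelRidge(kernel="rbf", alpha=1e-3, gamma=0.5)
model.fit(X, y)
print(model.predict(X[:5]))
```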
a few remarks:
My 2c.
If it's a new, faster solver, it must solve the exact same mathematical problem, as checked by convergence tests. Reading the paper, I had the impression that the problem solved by EigenPro has additional regularization effects.
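To make "checked by convergence tests" concrete, here is a hedged sketch: an iterative solver for the kernel ridge system should reach the same solution as a direct solve of that system, up to a tolerance. A plain conjugate gradient loop stands in for any iterative kernel solver; the data and tolerances are illustrative assumptions.

```python
# Convergence-test sketch: the iterative solution of (K + lambda*I) w = y
# must agree with the direct solve. CG is a stand-in for any iterative
# solver; data and tolerances are illustrative assumptions.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel


def conjugate_gradient(A, b, n_iter=2000, tol=1e-10):
    """Minimal conjugate gradient for a symmetric positive-definite A."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs_old = r @ r
    for _ in range(n_iter):
        Ap = A @ p
        step = rs_old / (p @ Ap)
        x += step * p
        r -= step * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x


rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = rng.randn(100)
lam = 1e-2  # ridge regularization

K = rbf_kernel(X, X, gamma=0.1)
A = K + lam * np.eye(len(X))

w_direct = np.linalg.solve(A, y)   # reference: direct solve
w_iter = conjugate_gradient(A, y)  # iterative solve of the same system

assert np.allclose(w_iter, w_direct, atol=1e-6), "solvers disagree"
print("iterative and direct solutions agree")
```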
@GaelVaroquaux you mentioned a related paper that you considered more mature, can you remind me what that was? |
> @GaelVaroquaux you mentioned a related paper that you considered more mature, can you remind me what that was?

Shalev-Shwartz S, Singer Y, Srebro N, Cotter A. Pegasos: primal estimated sub-gradient solver for SVM. Mathematical Programming. 2011;127(1):3-30.
Closing as merged to scikit-learn-extra.
This pull request implements the feature in issue #11039
ToDo: