
[MRG] Add fast kernel classifier/regressor #13


Merged

merged 34 commits into from Aug 2, 2019

Changes from 29 commits (34 commits)
fe03f0c
Moved eigenpro files to sklearn-extra
Alex7Li May 21, 2019
2a82a86
removed idea files
Alex7Li May 21, 2019
e42e447
Import from sklearn_extra not sklearn
Alex7Li May 21, 2019
850d6ec
Updated website to work with EigenPro
Alex7Li May 22, 2019
ac61cc1
Modify examples
Alex7Li May 22, 2019
9f7ef75
Update examples
Alex7Li May 22, 2019
462c053
Trying to find out why examples are crashing
Alex7Li May 22, 2019
4d3da41
removed Y squared parameter from the call to euclidean distances
Alex7Li May 22, 2019
4430b99
Fixed documentation spaces
Alex7Li May 22, 2019
bfe5513
Changed documentation as required and modified broken test
Alex7Li May 22, 2019
0db284f
Removed synthetic example
Alex7Li May 22, 2019
c359051
Undo accidental deletion of line from _fastfood
Alex7Li May 22, 2019
7ed1796
Updated tests and refactored eigenpro to add a base class
Alex7Li Jun 29, 2019
82de694
Added lines to help diagnose error, reduced size of plot_mnist.py
Alex7Li Jul 2, 2019
93fd120
Merge branch 'master' into master
Alex7Li Jul 2, 2019
db77d84
Fix plot mnist using wrong permutation number and remove print statements
Alex7Li Jul 2, 2019
df4afa4
Convert to float64 before doing conversion
Alex7Li Jul 2, 2019
e90c646
Convert to float64 for computing eigenvalues
Alex7Li Jul 2, 2019
9a92110
Convert to float64 for computing eigenvalues, try to conform to all code
Alex7Li Jul 2, 2019
3f37b36
Merge branch 'master' of https://github.com/Alex7Li/scikit-learn-extra
Alex7Li Jul 2, 2019
289115f
Apparently the doc tests want different docs again horray now watch as a
Alex7Li Jul 2, 2019
cc842ac
Reformated Files using black and implement try-except for error in
Alex7Li Jul 4, 2019
587cb7f
Change gaussian to rbf and update docs
Alex7Li Jul 10, 2019
d9d3f93
Fixed lint issue
Alex7Li Jul 10, 2019
7504399
Renamed classes and some variables, edited and added documentation fo…
Alex7Li Jul 23, 2019
fed866f
Renaming and addressing issues from Roman
Alex7Li Jul 27, 2019
5a338af
Renaming and addressing issues from Roman
Alex7Li Jul 27, 2019
8da6edc
Added code for static tests and ran the code to produce new images.
Alex7Li Jul 27, 2019
9b2f64e
Updated eigenpro.rst to give accurate descriptions of the new graphs.
Alex7Li Jul 27, 2019
2282f10
Fix merge conflicts, move files to kernrl_methods folder, and attempt to
Alex7Li Aug 1, 2019
6fef3d2
Removed extra file
Alex7Li Aug 1, 2019
ccd34a1
Trying to commit again to see if CI builds are still there
Alex7Li Aug 1, 2019
9d6b033
Using old commit that worked previously to see if the problem is me
Alex7Li Aug 1, 2019
24cbaca
Back to current version, seems like the problem is me or the merge, not
Alex7Li Aug 1, 2019
4 changes: 4 additions & 0 deletions .gitignore
@@ -67,3 +67,7 @@ doc/generated/

# PyBuilder
target/

# Pycharm
.idea
venv/
Empty file added benchmarks/__init__.py
Empty file.
Empty file added benchmarks/_bench/__init__.py
Empty file.
115 changes: 115 additions & 0 deletions benchmarks/_bench/eigenpro_plot_mnist.py
@@ -0,0 +1,115 @@
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
from time import time

from sklearn_extra.eigenpro import EigenProClassifier
from sklearn.svm import SVC
from sklearn.datasets import fetch_openml

rng = np.random.RandomState(1)

# Generate sample data from mnist
mnist = fetch_openml("mnist_784")
mnist.data = mnist.data / 255.0
print("Data has loaded")

p = rng.permutation(60000)
x_train = mnist.data[p]
y_train = np.int32(mnist.target[p])
x_test = mnist.data[60000:]
y_test = np.int32(mnist.target[60000:])

# Run tests comparing eig to svc
eig_fit_times = []
eig_pred_times = []
eig_err = []
svc_fit_times = []
svc_pred_times = []
svc_err = []

train_sizes = [500, 1000, 2000, 5000, 10000, 20000, 40000, 60000]

bandwidth = 5.0

# Fit models to data
for train_size in train_sizes:
    for name, estimator in [
        (
            "EigenPro",
            EigenProClassifier(
                n_epoch=2, bandwidth=bandwidth, random_state=rng
            ),
        ),
        (
            "SupportVector",
            SVC(
                C=5, gamma=1.0 / (2 * bandwidth * bandwidth), random_state=rng
            ),
        ),
    ]:
        stime = time()
        estimator.fit(x_train[:train_size], y_train[:train_size])
        fit_t = time() - stime

        stime = time()
        y_pred_test = estimator.predict(x_test)
        pred_t = time() - stime

        err = 100.0 * np.sum(y_pred_test != y_test) / len(y_test)
        if name == "EigenPro":
            eig_fit_times.append(fit_t)
            eig_pred_times.append(pred_t)
            eig_err.append(err)
        else:
            svc_fit_times.append(fit_t)
            svc_pred_times.append(pred_t)
            svc_err.append(err)
        print(
            "%s Classification with %i training samples in %0.2f seconds. "
            "Test error %.4f" % (name, train_size, fit_t + pred_t, err)
        )

# set up grid for figures
fig = plt.figure(num=None, figsize=(6, 4), dpi=160)
ax = plt.subplot2grid((2, 2), (0, 0), rowspan=2)
train_size_labels = ["500", "1k", "2k", "5k", "10k", "20k", "40k", "60k"]

# Graph fit(train) time
ax.get_xaxis().set_major_formatter(matplotlib.ticker.ScalarFormatter())
ax.plot(train_sizes, svc_fit_times, "o--", color="g", label="SVC")
ax.plot(train_sizes, eig_fit_times, "o-", color="r", label="EigenPro")
ax.set_xscale("log")
ax.set_yscale("log", nonposy="clip")
ax.set_xlabel("train size")
ax.set_ylabel("time (seconds)")
ax.legend()
ax.set_title("Train set")
ax.set_xticks(train_sizes)
ax.set_xticks([], minor=True)
ax.set_xticklabels(train_size_labels)

# Graph prediction(test) time
ax = plt.subplot2grid((2, 2), (0, 1), rowspan=1)
ax.plot(train_sizes, eig_pred_times, "o-", color="r")
ax.plot(train_sizes, svc_pred_times, "o--", color="g")
ax.set_xscale("log")
ax.set_yscale("log", nonposy="clip")
ax.set_ylabel("time (seconds)")
ax.set_title("Test set")
ax.set_xticks(train_sizes)
ax.set_xticks([], minor=True)
ax.set_xticklabels(train_size_labels)

# Graph training error
ax = plt.subplot2grid((2, 2), (1, 1), rowspan=1)
ax.plot(train_sizes, eig_err, "o-", color="r")
ax.plot(train_sizes, svc_err, "o-", color="g")
ax.set_xscale("log")
ax.set_xticks(train_sizes)
ax.set_xticklabels(train_size_labels)
ax.set_xticks([], minor=True)
ax.set_xlabel("train size")
ax.set_ylabel("classification error %")
plt.tight_layout()
plt.show()
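The SVC baseline in this benchmark is matched to the EigenPro bandwidth by converting it to an RBF `gamma`. A minimal standalone sketch of that equivalence (the helper name `bandwidth_to_gamma` is made up for illustration; only the `gamma = 1 / (2 * bandwidth**2)` relation comes from the benchmark):

```python
import numpy as np


def bandwidth_to_gamma(bandwidth):
    # The RBF kernel k(x, z) = exp(-gamma * ||x - z||^2) equals the
    # Gaussian kernel exp(-||x - z||^2 / (2 * bandwidth^2))
    # exactly when gamma = 1 / (2 * bandwidth^2).
    return 1.0 / (2.0 * bandwidth * bandwidth)


x = np.array([1.0, 2.0])
z = np.array([2.0, 0.0])
bandwidth = 5.0
gamma = bandwidth_to_gamma(bandwidth)

sq_dist = np.sum((x - z) ** 2)
k_rbf = np.exp(-gamma * sq_dist)
k_gauss = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
assert np.isclose(k_rbf, k_gauss)
```

With `bandwidth = 5.0` this reproduces the `gamma=1.0 / (2 * bandwidth * bandwidth)` passed to `SVC` above.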
113 changes: 113 additions & 0 deletions benchmarks/_bench/eigenpro_plot_noisy_mnist.py
@@ -0,0 +1,113 @@
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
from time import time

from sklearn.datasets import fetch_openml
from sklearn_extra.eigenpro import EigenProClassifier
from sklearn.svm import SVC

rng = np.random.RandomState(1)

# Generate sample data from mnist
mnist = fetch_openml("mnist_784")
mnist.data = mnist.data / 255.0

p = rng.permutation(60000)
x_train = mnist.data[p][:60000]
y_train = np.int32(mnist.target[p][:60000])
x_test = mnist.data[60000:]
y_test = np.int32(mnist.target[60000:])

# randomize 20% of labels
p = rng.choice(len(y_train), np.int32(len(y_train) * 0.2), False)
y_train[p] = rng.choice(10, np.int32(len(y_train) * 0.2))
p = rng.choice(len(y_test), np.int32(len(y_test) * 0.2), False)
y_test[p] = rng.choice(10, np.int32(len(y_test) * 0.2))

# Run tests comparing fkc to svc
eig_fit_times = []
eig_pred_times = []
eig_err = []
svc_fit_times = []
svc_pred_times = []
svc_err = []

train_sizes = [500, 1000, 2000, 5000, 10000, 20000, 40000, 60000]

bandwidth = 5.0
# Fit models to data
for train_size in train_sizes:
    for name, estimator in [
        (
            "EigenPro",
            EigenProClassifier(
                n_epoch=2, bandwidth=bandwidth, random_state=rng
            ),
        ),
        ("SupportVector", SVC(C=5, gamma=1.0 / (2 * bandwidth * bandwidth))),
    ]:
        stime = time()
        estimator.fit(x_train[:train_size], y_train[:train_size])
        fit_t = time() - stime

        stime = time()
        y_pred_test = estimator.predict(x_test)
        pred_t = time() - stime
        err = 100.0 * np.sum(y_pred_test != y_test) / len(y_test)
        if name == "EigenPro":
            eig_fit_times.append(fit_t)
            eig_pred_times.append(pred_t)
            eig_err.append(err)
        else:
            svc_fit_times.append(fit_t)
            svc_pred_times.append(pred_t)
            svc_err.append(err)
        print(
            "%s Classification with %i training samples in %0.2f seconds. "
            "Test error %.4f" % (name, train_size, fit_t + pred_t, err)
        )

# set up grid for figures
fig = plt.figure(num=None, figsize=(6, 4), dpi=160)
ax = plt.subplot2grid((2, 2), (0, 0), rowspan=2)
train_size_labels = ["500", "1k", "2k", "5k", "10k", "20k", "40k", "60k"]

# Graph fit(train) time
ax.get_xaxis().set_major_formatter(matplotlib.ticker.ScalarFormatter())
ax.plot(train_sizes, svc_fit_times, "o--", color="g", label="SVC")
ax.plot(train_sizes, eig_fit_times, "o-", color="r", label="EigenPro")
ax.set_xscale("log")
ax.set_yscale("log", nonposy="clip")
ax.set_xlabel("train size")
ax.set_ylabel("time (seconds)")
ax.legend()
ax.set_title("Train set")
ax.set_xticks(train_sizes)
ax.set_xticks([], minor=True)
ax.set_xticklabels(train_size_labels)

# Graph prediction(test) time
ax = plt.subplot2grid((2, 2), (0, 1), rowspan=1)
ax.plot(train_sizes, eig_pred_times, "o-", color="r")
ax.plot(train_sizes, svc_pred_times, "o--", color="g")
ax.set_xscale("log")
ax.set_yscale("log", nonposy="clip")
ax.set_ylabel("time (seconds)")
ax.set_title("Test set")
ax.set_xticks(train_sizes)
ax.set_xticks([], minor=True)
ax.set_xticklabels(train_size_labels)

# Graph training error
ax = plt.subplot2grid((2, 2), (1, 1), rowspan=1)
ax.plot(train_sizes, eig_err, "o-", color="r")
ax.plot(train_sizes, svc_err, "o-", color="g")
ax.set_xscale("log")
ax.set_xticks(train_sizes)
ax.set_xticklabels(train_size_labels)
ax.set_xticks([], minor=True)
ax.set_xlabel("train size")
ax.set_ylabel("classification error %")
plt.tight_layout()
plt.show()
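The noisy variant corrupts 20% of the labels with two `rng.choice` calls: one picks distinct positions without replacement, the other resamples those labels uniformly from 0-9 (so a resampled label can coincide with the original by chance, and strictly fewer than 20% of labels actually change). A standalone sketch of that recipe on toy sizes, not the benchmark's data:

```python
import numpy as np

rng = np.random.RandomState(1)
y = rng.choice(10, 1000)  # 1000 toy labels in 0..9
y_noisy = y.copy()

n_flip = np.int32(len(y) * 0.2)  # 20% of the labels
idx = rng.choice(len(y), n_flip, False)  # distinct positions, no replacement
y_noisy[idx] = rng.choice(10, n_flip)    # uniform resample; may repeat old label

# At most 20% of entries differ; ~1 in 10 resampled labels lands
# on the original value, so the expected changed fraction is ~18%.
frac_changed = np.mean(y != y_noisy)
assert 0.0 < frac_changed <= 0.2
```

This is why the noisy benchmark's effective label-noise rate is slightly below the nominal 20%.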
117 changes: 117 additions & 0 deletions benchmarks/_bench/eigenpro_plot_synthetic.py
@@ -0,0 +1,117 @@
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
from time import time

from sklearn.datasets import make_classification
from sklearn_extra.eigenpro import EigenProClassifier
from sklearn.svm import SVC

rng = np.random.RandomState(1)

max_size = 50000
test_size = 10000

# Get data for testing

x, y = make_classification(
    n_samples=max_size + test_size,
    n_features=400,
    n_informative=6,
    random_state=rng,
)

x_train = x[:max_size]
y_train = y[:max_size]
x_test = x[max_size:]
y_test = y[max_size:]

eig_fit_times = []
eig_pred_times = []
eig_err = []
svc_fit_times = []
svc_pred_times = []
svc_err = []

train_sizes = [2000, 5000, 10000, 20000, 50000]

bandwidth = 10.0
for train_size in train_sizes:
    for name, estimator in [
        (
            "EigenPro",
            EigenProClassifier(
                n_epoch=3,
                bandwidth=bandwidth,
                n_components=30,
                subsample_size=1000,
                random_state=rng,
            ),
        ),
        ("SupportVector", SVC(C=5, gamma=1.0 / (2 * bandwidth * bandwidth))),
    ]:
        stime = time()
        estimator.fit(x_train[:train_size], y_train[:train_size])
        fit_t = time() - stime

        stime = time()
        y_pred_test = estimator.predict(x_test)
        pred_t = time() - stime

        err = 100.0 * np.sum(y_pred_test != y_test) / len(y_test)
        if name == "EigenPro":
            eig_fit_times.append(fit_t)
            eig_pred_times.append(pred_t)
            eig_err.append(err)
        else:
            svc_fit_times.append(fit_t)
            svc_pred_times.append(pred_t)
            svc_err.append(err)
        print(
            "%s Classification with %i training samples in %0.2f seconds."
            % (name, train_size, fit_t + pred_t)
        )

# set up grid for figures
fig = plt.figure(num=None, figsize=(6, 4), dpi=160)
ax = plt.subplot2grid((2, 2), (0, 0), rowspan=2)
train_size_labels = [str(s) for s in train_sizes]

# Graph fit(train) time
ax.plot(train_sizes, svc_fit_times, "o--", color="g", label="SVC")
ax.plot(train_sizes, eig_fit_times, "o-", color="r", label="FKC (EigenPro)")
ax.set_xscale("log")
ax.set_yscale("log", nonposy="clip")
ax.set_xlabel("train size")
ax.set_ylabel("time (seconds)")

ax.legend()
ax.set_title("Train set")
ax.set_xticks(train_sizes)
ax.set_xticklabels(train_size_labels)
ax.set_xticks([], minor=True)
ax.get_xaxis().set_major_formatter(matplotlib.ticker.ScalarFormatter())

# Graph prediction(test) time
ax = plt.subplot2grid((2, 2), (0, 1), rowspan=1)
ax.plot(train_sizes, eig_pred_times, "o-", color="r")
ax.plot(train_sizes, svc_pred_times, "o--", color="g")
ax.set_xscale("log")
ax.set_yscale("log", nonposy="clip")
ax.set_ylabel("time (seconds)")
ax.set_title("Test set")
ax.set_xticks([])
ax.set_xticks([], minor=True)

# Graph training error
ax = plt.subplot2grid((2, 2), (1, 1), rowspan=1)
ax.plot(train_sizes, eig_err, "o-", color="r")
ax.plot(train_sizes, svc_err, "o-", color="g")
ax.set_xscale("log")
ax.set_xticks(train_sizes)
ax.set_xticklabels(train_size_labels)
ax.set_xticks([], minor=True)
ax.set_xlabel("train size")
ax.set_ylabel("classification error %")
plt.tight_layout()
plt.show()
17 changes: 17 additions & 0 deletions doc/api.rst
@@ -12,3 +12,20 @@ Kernel approximation
   :template: class.rst

   kernel_approximation.Fastfood

EigenPro
========

.. currentmodule:: doc

.. toctree::

   modules/eigenpro

.. currentmodule:: sklearn_extra

.. autosummary::
   :toctree: generated/
   :template: class.rst

   eigenpro.EigenProRegressor
   eigenpro.EigenProClassifier
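The estimators documented above implement EigenPro iteration: kernel SGD preconditioned with the top eigenvectors of the (subsampled) kernel matrix, which damps the dominant spectral directions so that a much larger step size is stable. A rough pure-NumPy sketch of that eigendecomposition step; the sizes, variable names, and damping form `lambda_{k+1} / lambda_i` here are illustrative of the idea, not the `sklearn_extra` implementation or API:

```python
import numpy as np

# Toy data; sizes are made up for illustration.
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
bandwidth = 5.0

# Gaussian (RBF) kernel matrix on the training points.
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2.0 * bandwidth ** 2))

# Top-k eigenpairs of K. EigenPro-style preconditioning shrinks the
# gradient along these dominant eigenvectors.
k = 10
eigvals, eigvecs = np.linalg.eigh(K)  # ascending order
top_vals = eigvals[-k:][::-1]         # k largest, descending
top_vecs = eigvecs[:, -k:][:, ::-1]

# Damping applied along the i-th top eigenvector: lambda_{k+1} / lambda_i.
# Clip the tail eigenvalue at 0 to guard against round-off negatives.
tail = max(eigvals[-(k + 1)], 0.0)
damping = tail / top_vals
assert np.all((damping >= 0.0) & (damping <= 1.0))
```

Since the eigenvalues of a kernel matrix typically decay fast, most of the damping factors are far below 1, which is what allows the larger step size.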
Binary file added doc/images/eigenpro_mnist.png
Binary file added doc/images/eigenpro_mnist_noisy.png
Binary file added doc/images/eigenpro_synthetic.png