# Gradient Boosting Machines
***

*Image source: brucecompany.com*
## Introduction
[Gradient boosting](https://en.wikipedia.org/wiki/Gradient_boosting) is a
machine learning technique for regression and classification problems, which
produces a prediction model in the form of an ensemble of weak prediction
models, typically decision trees. It builds the model in a stage-wise fashion,
as other boosting methods do, and generalizes them by allowing optimization
of an arbitrary differentiable loss function.
It is recommended that you read through the accompanying [Classification and
Regression Trees Tutorial](decision-trees.ipynb) for an overview of decision
trees.
## History
Boosting is one of the most powerful learning ideas introduced in the last
twenty years. It was originally designed for classification problems, but it can
be extended to regression as well. The motivation for boosting was a procedure
that combines the outputs of many "weak" classifiers to produce a powerful
"committee." A weak classifier (e.g. decision tree) is one whose error rate is
only slightly better than random guessing.
[AdaBoost](https://en.wikipedia.org/wiki/AdaBoost), short for "Adaptive
Boosting", is a machine learning meta-algorithm formulated by [Yoav
Freund](https://en.wikipedia.org/wiki/Yoav_Freund) and [Robert
Schapire](https://en.wikipedia.org/wiki/Robert_Schapire) in 1996, which is now
considered to be a special case of Gradient Boosting. There are [some
differences](http://stats.stackexchange.com/questions/164233/intuitive-explanations-of-differences-between-gradient-boosting-trees-gbm-ad)
between the AdaBoost algorithm and modern Gradient Boosting. In the AdaBoost
algorithm, the "shortcomings" of existing weak learners are identified by
high-weight data points, whereas in Gradient Boosting the shortcomings are
identified by gradients.
The idea of gradient boosting originated in the observation by Leo Breiman that
boosting can be interpreted as an optimization algorithm on a suitable cost
function. Explicit regression gradient boosting algorithms were subsequently
developed by Jerome H. Friedman, simultaneously with the more general functional
gradient boosting perspective of Llew Mason, Jonathan Baxter, Peter Bartlett and
Marcus Frean. The latter two papers introduced the abstract view of boosting
algorithms as iterative functional gradient descent algorithms. That is,
algorithms that optimize a cost function over function space by iteratively
choosing a function (weak hypothesis) that points in the negative gradient
direction. This functional gradient view of boosting has led to the development
of boosting algorithms in many areas of machine learning and statistics beyond
regression and classification.
In general, in terms of model performance, we have the following hierarchy:
$$\text{Boosting} > \text{Random Forest} > \text{Bagging} > \text{Single Tree}$$
## Gradient Boosting
As noted in the introduction, [gradient
boosting](https://en.wikipedia.org/wiki/Gradient_boosting) produces a
prediction model in the form of an ensemble of weak prediction models,
typically decision trees. It builds the model in a stage-wise fashion and
generalizes other boosting methods by allowing optimization of an arbitrary
differentiable loss function.
The purpose of boosting is to sequentially apply a weak classification
algorithm to repeatedly modified versions of the data, thereby producing a
sequence of weak classifiers $G_m(x)$, $m = 1, 2, \ldots, M$.
## Stagewise Additive Modeling
Boosting builds an additive model:
$$F(x) = \sum_{m=1}^M \beta_m b(x; \gamma_m)$$
where $b(x; \gamma_m)$ is a tree and $\gamma_m$ parameterizes its splits. With
boosting, the parameters $(\beta_m, \gamma_m)$ are fit in a *stagewise*
fashion: each new term is fit while all previous terms are held fixed, as
sketched below. This slows the fitting process down, so the model overfits
less quickly.
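Concretely, at stage $m$ the new basis function and its coefficient are chosen
to best reduce the loss given the current model $F_{m-1}$, and the model is
then updated additively:

$$(\beta_m, \gamma_m) = \arg\min_{\beta, \gamma} \sum_{i=1}^N
L\big(y_i, \, F_{m-1}(x_i) + \beta \, b(x_i; \gamma)\big),
\qquad F_m(x) = F_{m-1}(x) + \beta_m b(x; \gamma_m)$$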
## AdaBoost
- AdaBoost builds an additive logistic regression model by stagewise fitting.
- AdaBoost uses an exponential loss function of the form $L(y, F(x)) =
\exp(-yF(x))$, similar to the negative binomial log-likelihood loss.
- The principal attraction of the exponential loss in the context of additive
modeling is computational; it leads to a simple, modular reweighting scheme.
- Instead of fitting trees to residuals, the special form of the exponential
loss function in AdaBoost leads to fitting trees to weighted versions of the
original data, as illustrated in the sketch below.
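The reweighting idea can be seen in a minimal, self-contained sketch of the
AdaBoost.M1 loop, using `rpart` stumps as the weak learner on a toy dataset.
This is only an illustration (not the `gbm` package's implementation), and the
toy data and parameter choices are arbitrary assumptions:

```r
# Minimal AdaBoost.M1 sketch: up-weight misclassified points at each stage
# and combine stumps into a weighted "committee". Labels are coded -1/+1.
library(rpart)

set.seed(1)
n <- 200
x1 <- runif(n); x2 <- runif(n)
y <- ifelse(x1 + x2 > 1, 1, -1)                  # toy 2-class problem
dat <- data.frame(x1 = x1, x2 = x2, y = factor(y))

M <- 25
w <- rep(1 / n, n)                               # start with uniform observation weights
score <- numeric(n)                              # running committee score
for (m in 1:M) {
  # Fit a stump to the *weighted* data
  stump <- rpart(y ~ x1 + x2, data = dat, weights = w,
                 control = rpart.control(maxdepth = 1, cp = 0, minsplit = 2))
  pred <- ifelse(predict(stump, dat, type = "class") == "1", 1, -1)
  err <- sum(w * (pred != y)) / sum(w)           # weighted error rate
  alpha <- log((1 - err) / err)                  # weight of this weak learner
  w <- w * exp(alpha * (pred != y))              # up-weight the mistakes
  w <- w / sum(w)
  score <- score + alpha * pred                  # update the committee
}
mean(sign(score) == y)                           # training accuracy of the committee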
## Gradient Boosting Algorithm
Friedman's Gradient Boosting Algorithm for a generic loss function, $L(y_i,
\gamma)$:

*(Algorithm figure not shown here. Source: Elements of Statistical Learning.)*
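The structure of the algorithm can be illustrated with a short, self-contained
sketch for squared-error loss, where the negative gradient is simply the
ordinary residual. This toy implementation is only for illustration (not any
package's code), and the simulated data and settings are assumptions:

```r
# Toy gradient boosting for regression with squared-error loss:
# at each stage, fit a small tree to the pseudo-residuals (negative gradient)
# and take a shrunken step.
library(rpart)

set.seed(1)
n <- 300
x <- runif(n, 0, 2 * pi)
y <- sin(x) + rnorm(n, sd = 0.3)

M <- 100                          # number of boosting iterations (trees)
lambda <- 0.1                     # learning rate (shrinkage)
F_hat <- rep(mean(y), n)          # step 1: initialize with a constant fit
trees <- vector("list", M)

for (m in 1:M) {
  r <- y - F_hat                  # step 2a: pseudo-residuals (negative gradient)
  trees[[m]] <- rpart(r ~ x, data = data.frame(x = x, r = r),
                      control = rpart.control(maxdepth = 2))  # step 2b: fit a small tree
  F_hat <- F_hat + lambda * predict(trees[[m]], data.frame(x = x))  # step 2c: update
}
mean((y - F_hat)^2)               # training MSE after M stages
```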
### Loss Functions and Gradients

*(Table of loss functions and gradients not shown here. Source: Elements of Statistical Learning.)*
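For reference, the negative gradients (pseudo-residuals) for a few common loss
functions are:

$$L(y, f) = \tfrac{1}{2}(y - f)^2 \;\Rightarrow\; -\frac{\partial L}{\partial f} = y - f \quad \text{(squared error)}$$

$$L(y, f) = |y - f| \;\Rightarrow\; -\frac{\partial L}{\partial f} = \operatorname{sign}(y - f) \quad \text{(absolute error)}$$

$$L(y, f) = -\big[y \log p + (1 - y)\log(1 - p)\big] \;\Rightarrow\; -\frac{\partial L}{\partial f} = y - p \quad \text{(Bernoulli deviance, } p = 1/(1 + e^{-f})\text{)}$$

where $f$ is the model output (the log-odds in the Bernoulli case).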
The optimal number of iterations, $T$, and the learning rate, $\lambda$,
depend on each other: a smaller learning rate generally requires more trees to
reach the same training loss, but tends to generalize better.
## Stochastic GBM
In [Stochastic Gradient
Boosting](https://statweb.stanford.edu/~jhf/ftp/stobst.pdf) (Friedman, 2002),
Friedman proposed a simple modification to the algorithm: before estimating
each gradient step (i.e., before fitting each new tree), a subsample of the
training data is drawn uniformly at random without replacement. He found that
this additional step greatly improved performance. Each of the R packages
covered below exposes this row-sampling fraction under a different argument
name, summarized in the sketch below.
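For reference, these are the same arguments used in the code examples later in
this tutorial, shown here only as a side-by-side summary:

```r
# Row subsampling (stochastic GBM) in the three packages covered below.
# A value of 0.5 means each tree is fit on a random half of the training rows.
# gbm:     gbm(...,       bag.fraction = 0.5)
# xgboost: xgb.train(params = list(..., subsample = 0.5), ...)
# h2o:     h2o.gbm(...,   sample_rate = 0.5)
```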
## Practical Tips
- It's more common to grow shorter trees ("shrubs" or "stumps") in GBM than it
is in Random Forest.
- It's useful to try a variety of column sample (and column sample per tree)
rates.
- Don't assume that the set of optimal tuning parameters for one implementation
of GBM will carry over and also be optimal in a different GBM implementation.
## Resources
- [Trevor Hastie - Gradient Boosting & Random Forests at H2O World 2014](https://www.youtube.com/watch?v=wPqtzj5VZus&index=16&list=PLNtMya54qvOFQhSZ4IKKXRbMkyLMn0caa) (YouTube)
- [Trevor Hastie - Data Science of GBM (2013)](http://www.slideshare.net/0xdata/gbm-27891077) (slides)
- [Mark Landry - Gradient Boosting Method and Random Forest at H2O World 2015](https://www.youtube.com/watch?v=9wn1f-30_ZY) (YouTube)
- [Peter Prettenhofer - Gradient Boosted Regression Trees in scikit-learn at PyData London 2014](https://www.youtube.com/watch?v=IXZKgIsZRm0) (YouTube)
- [Alexey Natekin and Alois Knoll - Gradient boosting machines, a tutorial](http://journal.frontiersin.org/article/10.3389/fnbot.2013.00021/full) (paper)
***
# GBM Software in R
This is not a comprehensive list of GBM software in R; below we detail three of
the most popular implementations:
[gbm](https://cran.r-project.org/web/packages/gbm/index.html),
[xgboost](https://cran.r-project.org/web/packages/xgboost/index.html) and
[h2o](https://cran.r-project.org/web/packages/h2o/index.html).
The [CRAN Machine Learning Task
View](https://cran.r-project.org/web/views/MachineLearning.html) lists the
following projects as well. The hinge loss is optimized by the boosting
implementation in package
[bst](https://cran.r-project.org/web/packages/bst/index.html). Package
[GAMBoost](https://cran.r-project.org/web/packages/GAMBoost/index.html) can be
used to fit generalized additive models by a boosting algorithm. An extensible
boosting framework for generalized linear, additive and nonparametric models is
available in package
[mboost](https://cran.r-project.org/web/packages/mboost/index.html).
Likelihood-based boosting for Cox models is implemented in
[CoxBoost](https://cran.r-project.org/web/packages/CoxBoost/index.html) and for
mixed models in
[GMMBoost](https://cran.r-project.org/web/packages/GMMBoost/index.html). GAMLSS
models can be fitted using boosting by
[gamboostLSS](https://cran.r-project.org/web/packages/gamboostLSS/index.html).
## gbm
Authors: Originally written by Greg Ridgeway, added to by various authors,
currently maintained by Harry Southworth
Backend: C++
The [gbm](https://github.com/gbm-developers/gbm) R package is an implementation
of extensions to Freund and Schapire's AdaBoost algorithm and Friedman's
gradient boosting machine. This is the original R implementation of GBM. A
presentation by Mark Landry is available [here](https://www.slideshare.net/mark_landry/gbm-package-in-r).
Features:
- Stochastic GBM.
- Supports up to 1024 factor levels.
- Supports classification and regression trees.
- Includes regression methods for:
- least squares
- absolute loss
- t-distribution loss
- quantile regression
- logistic
- multinomial logistic
- Poisson
- Cox proportional hazards partial likelihood
- AdaBoost exponential loss
- Huberized hinge loss
- Learning to Rank measures ([LambdaMart](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-2008-109.pdf))
- Out-of-bag estimator for the optimal number of iterations is provided.
- Easy to overfit since early stopping functionality is not automated in this
package.
- If internal cross-validation is used, this can be parallelized to all cores on
the machine.
- Currently undergoing a major refactoring & rewrite (and has been for some
time).
- GPL-2/3 License.
```{r n=19}
#install.packages("gbm")
#install.packages("cvAUC")
library(gbm)
library(cvAUC)
```
```{r n=20}
# Load 2-class HIGGS dataset
train <- read.csv("data/higgs_train_10k.csv")
test <- read.csv("data/higgs_test_5k.csv")
```
```{r n=21}
set.seed(1)
# Train a GBM with Bernoulli (logistic) loss on the binary response
model <- gbm(formula = response ~ .,
             distribution = "bernoulli",
             data = train,
             n.trees = 70,
             interaction.depth = 5,
             shrinkage = 0.3,
             bag.fraction = 0.5,
             train.fraction = 1.0,
             n.cores = NULL)  # will use all cores by default
```
```{r n=22}
print(model)
```
```{r n=23}
# Generate predictions on the test dataset
# (for distribution = "bernoulli", predict() returns values on the log-odds
# scale by default; AUC is rank-based, so this does not affect the result)
preds <- predict(model, newdata = test, n.trees = 70)
labels <- test[,"response"]

# Compute AUC on the test set
cvAUC::AUC(predictions = preds, labels = labels)
```
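Since early stopping is not automated in this package (as noted in the feature
list above), a common approach is to train more trees with a smaller shrinkage
and let `gbm.perf()` estimate the best iteration, here via internal
cross-validation. The following is only a sketch, reusing the same `train` and
`test` data frames; the specific settings (500 trees, shrinkage 0.05, 3 folds)
are illustrative assumptions:

```r
# Sketch: estimate a good number of trees instead of fixing n.trees = 70.
set.seed(1)
model_cv <- gbm(formula = response ~ .,
                distribution = "bernoulli",
                data = train,
                n.trees = 500,
                interaction.depth = 5,
                shrinkage = 0.05,
                bag.fraction = 0.5,
                cv.folds = 3,          # enables method = "cv" in gbm.perf()
                n.cores = NULL)

best_iter <- gbm.perf(model_cv, method = "cv")   # plots CV error, returns best iteration
preds_cv <- predict(model_cv, newdata = test, n.trees = best_iter)
cvAUC::AUC(predictions = preds_cv, labels = test[,"response"])
```

The out-of-bag estimate (`method = "OOB"`) is also available without
cross-validation, though the package documentation notes it tends to
underestimate the optimal number of iterations.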
## xgboost
Authors: Tianqi Chen, Tong He, Michael Benesty
Backend: C++
The [xgboost](https://cran.r-project.org/web/packages/xgboost/index.html) R
package provides an R API to "Extreme Gradient Boosting", which is an efficient
implementation of the gradient boosting framework. [Parameter tuning
guide](http://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/)
and more resources [here](https://github.com/dmlc/xgboost/tree/master/demo).
The xgboost package is quite popular on
[Kaggle](http://blog.kaggle.com/tag/xgboost/) for data mining competitions.
Features:
- Stochastic GBM with column and row sampling (per split and per tree) for
better generalization.
- Includes efficient linear model solver and tree learning algorithms.
- Parallel computation on a single machine.
- Supports various objective functions, including regression, classification and
ranking.
- The package is made to be extensible, so users can easily define their own
objective functions (see the sketch at the end of this subsection).
- Apache 2.0 License.
```{r n=24}
#install.packages("xgboost")
#install.packages("cvAUC")
library(xgboost)
library(Matrix)
library(cvAUC)
```
```{r n=25}
# Load 2-class HIGGS dataset
train <- read.csv("data/higgs_train_10k.csv")
test <- read.csv("data/higgs_test_5k.csv")
```
```{r n=26}
# Set seed because we column-sample
set.seed(1)
y <- "response"

# Build sparse model matrices and wrap them in xgboost's DMatrix containers
train.mx <- sparse.model.matrix(response ~ ., train)
test.mx <- sparse.model.matrix(response ~ ., test)
dtrain <- xgb.DMatrix(train.mx, label = train[,y])
dtest <- xgb.DMatrix(test.mx, label = test[,y])

train.gdbt <- xgb.train(params = list(objective = "binary:logistic",
                                      #num_class = 2,
                                      #eval_metric = "mlogloss",
                                      eta = 0.3,
                                      max_depth = 5,
                                      subsample = 1,
                                      colsample_bytree = 0.5),
                        data = dtrain,
                        nrounds = 70,
                        watchlist = list(train = dtrain, test = dtest))
```
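xgboost can also stop adding trees once a monitored metric stops improving. A
hedged sketch using `early_stopping_rounds` with the same watchlist follows;
the metric, patience and upper bound on `nrounds` are illustrative choices:

```r
# Sketch: stop early when the monitored AUC has not improved for 10 rounds.
# With multiple watchlist entries, the last one is the one monitored.
xgb_fit_es <- xgb.train(params = list(objective = "binary:logistic",
                                      eval_metric = "auc",
                                      eta = 0.3,
                                      max_depth = 5,
                                      colsample_bytree = 0.5),
                        data = dtrain,
                        nrounds = 500,
                        watchlist = list(train = dtrain, test = dtest),
                        early_stopping_rounds = 10,
                        maximize = TRUE)

xgb_fit_es$best_iteration   # round with the best monitored AUC
```

Note that monitoring the test set this way leaks information into the training
decision; in practice you would monitor a separate validation set and keep the
test set for final evaluation only.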
```{r n=29}
# Generate predictions on test dataset
preds <- predict(train.gdbt, newdata = dtest)
labels <- test[,y]
# Compute AUC on the test set
cvAUC::AUC(predictions = preds, labels = labels)
```
```{r n=30}
# Advanced functionality of xgboost
#install.packages("Ckmeans.1d.dp")
library(Ckmeans.1d.dp)

# Compute the feature importance matrix
# (feature names are taken from all columns of train except the first)
names <- dimnames(data.matrix(train[,-1]))[[2]]
importance_matrix <- xgb.importance(names, model = train.gdbt)

# Plot the top 10 features by importance
xgb.plot.importance(importance_matrix[1:10,])
```
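As an example of the extensibility mentioned in the feature list, xgboost lets
you pass a custom objective as an R function that returns the gradient and
Hessian of the loss. The sketch below re-implements logistic loss by hand; the
function name `logreg_obj` and the reuse of `dtrain` from above are illustrative
assumptions, not part of the package:

```r
# Sketch: a user-defined objective for xgb.train(). The function receives the
# current raw-score predictions and the training DMatrix, and must return the
# first and second derivatives of the loss for each observation.
logreg_obj <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  p <- 1 / (1 + exp(-preds))      # sigmoid of the raw score
  grad <- p - labels              # d(loss)/d(score)
  hess <- p * (1 - p)             # d^2(loss)/d(score)^2
  list(grad = grad, hess = hess)
}

# Same model shape as above, but trained with the hand-written objective
fit_custom <- xgb.train(params = list(eta = 0.3,
                                      max_depth = 5,
                                      colsample_bytree = 0.5),
                        data = dtrain,
                        nrounds = 70,
                        obj = logreg_obj)
```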
## h2o
Authors: [Arno Candel](https://www.linkedin.com/in/candel), [Cliff
Click](http://www.cliffc.org/blog/), H2O.ai contributors
Backend: Java
[H2O GBM Tuning guide by Arno Candel](https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/tutorials/gbm/gbmTuning.Rmd)
and [H2O GBM
Vignette](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/GBMBooklet.pdf).
Features:
- Distributed and parallelized computation on either a single node or a
multi-node cluster.
- Automatic early stopping based on convergence of user-specified metrics to a
user-specified relative tolerance.
- Stochastic GBM with column and row sampling (per split and per tree) for
better generalization.
- Support for exponential family distributions (Poisson, Gamma, Tweedie) and
loss functions such as quantile regression (including Laplace), in addition to
the binomial (Bernoulli), Gaussian and multinomial distributions.
- Grid search for hyperparameter optimization and model selection.
- Data-distributed, which means the entire dataset does not need to fit into
memory on a single node, hence scales to any size training set.
- Uses histogram approximations of continuous variables for speedup.
- Uses dynamic binning - bin limits are reset at each tree level based on the
split bins' min and max values discovered during the last pass.
- Uses squared error to determine optimal splits.
- Distributed implementation details outlined in a [blog
post](http://blog.h2o.ai/2013/10/building-distributed-gbm-h2o/) by Cliff Click.
- Unlimited factor levels.
- Multiclass trees (one for each class) built in parallel with each other.
- Apache 2.0 Licensed.
- Model export in plain Java code for deployment in production environments.
- GUI for training & model eval/viz (H2O Flow).
```{r n=67}
#install.packages("h2o")
library(h2o)
#h2o.shutdown(prompt = FALSE) #if required
h2o.init(nthreads = -1) #Start a local H2O cluster using nthreads = num available cores
```
```{r n=68}
# Load 2-class HIGGS dataset
train <- h2o.importFile("data/higgs_train_10k.csv")
test <- h2o.importFile("data/higgs_test_5k.csv")
print(dim(train))
print(dim(test))
```
```{r n=69}
# Identify the response column
y <- "response"
# Identify the predictor columns
x <- setdiff(names(train), y)
# Convert response to factor
train[,y] <- as.factor(train[,y])
test[,y] <- as.factor(test[,y])
```
```{r n=70}
# Train an H2O GBM model
model <- h2o.gbm(x = x,
                 y = y,
                 training_frame = train,
                 ntrees = 70,
                 learn_rate = 0.3,
                 sample_rate = 1.0,
                 max_depth = 5,
                 col_sample_rate_per_tree = 0.5,
                 seed = 1)
```
```{r n=71}
# Get model performance on a test set
perf <- h2o.performance(model, test)
print(perf)
```
```{r n=72}
# To retrieve individual metrics
h2o.auc(perf)
```
In our run, this returned a test set AUC of approximately 0.774 (0.7735663).
```{r n=73}
# Print confusion matrix
h2o.confusionMatrix(perf)
```
```{r n=74}
# Plot scoring history over time
plot(model)
```
```{r n=75}
# Retrieve feature importance
vi <- h2o.varimp(model)
vi[1:10,]
```
```{r n=76}
# Plot feature importance
barplot(vi$scaled_importance,
        names.arg = vi$variable,
        space = 1,
        las = 2,
        main = "Variable Importance: H2O GBM")
```
Note that all models, data and model metrics can be viewed via the [H2O Flow
GUI](http://127.0.0.1:54321/flow/index.html), which should already be running
since you started the H2O cluster with `h2o.init()`.
```{r n=77}
# Early stopping example
# Keep in mind that when you use early stopping, you should pass a validation set
# Since the validation set is used to determine the stopping point, a separate test set should be used for model eval
#fit <- h2o.gbm(x = x,
# y = y,
# training_frame = train,
# model_id = "gbm_fit3",
# validation_frame = valid, #only used if stopping_rounds > 0
# ntrees = 500,
# score_tree_interval = 5, #used for early stopping
# stopping_rounds = 3, #used for early stopping
# stopping_metric = "misclassification", #used for early stopping
# stopping_tolerance = 0.0005, #used for early stopping
# seed = 1)
```
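A runnable version of that idea is sketched below, assuming we carve a
validation frame out of the training data with `h2o.splitFrame()`; the split
ratio, number of trees and learning rate are illustrative choices, while the
stopping settings mirror the commented example above:

```r
# Sketch: early stopping against a held-out validation frame. The test frame
# is kept aside for final evaluation only.
splits <- h2o.splitFrame(train, ratios = 0.8, seed = 1)
train_es <- splits[[1]]
valid_es <- splits[[2]]

fit_early <- h2o.gbm(x = x,
                     y = y,
                     training_frame = train_es,
                     validation_frame = valid_es,
                     ntrees = 500,                       # upper bound; stopping decides the rest
                     learn_rate = 0.05,
                     score_tree_interval = 5,            # score every 5 trees
                     stopping_rounds = 3,                # stop if no improvement in 3 scoring rounds
                     stopping_metric = "misclassification",
                     stopping_tolerance = 0.0005,
                     seed = 1)

h2o.auc(h2o.performance(fit_early, test))                # final evaluation on the untouched test set
```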
```{r n=78}
# GBM hyperparameters
gbm_params <- list(learn_rate = seq(0.01, 0.1, 0.01),
                   max_depth = seq(2, 10, 1),
                   sample_rate = seq(0.5, 1.0, 0.1),
                   col_sample_rate = seq(0.1, 1.0, 0.1))
search_criteria <- list(strategy = "RandomDiscrete",
                        max_models = 20)

# Train and validate a random grid of GBMs
gbm_grid <- h2o.grid("gbm", x = x, y = y,
                     grid_id = "gbm_grid",
                     training_frame = train,
                     validation_frame = test,  # the test frame is only used to calculate metrics
                     ntrees = 70,
                     seed = 1,
                     hyper_params = gbm_params,
                     search_criteria = search_criteria)

gbm_gridperf <- h2o.getGrid(grid_id = "gbm_grid",
                            sort_by = "auc",
                            decreasing = TRUE)
print(gbm_gridperf)
```
The grid search helped a lot. The first model we trained only had a 0.774 test
set AUC, but the top GBM in our grid has a test set AUC of 0.786. More
information about grid search is available in the [H2O grid search R
tutorial](https://github.com/h2oai/h2o-tutorials/blob/master/h2o-open-tour-2016/chicago/grid-search-model-selection.R).
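To inspect or reuse the winning model, you can pull it out of the sorted grid
by its model id. A short sketch, assuming the sorted `gbm_gridperf` object from
the chunk above:

```r
# Sketch: grab the top model (highest validation AUC) from the sorted grid
# and confirm its performance on the test frame.
best_gbm_id <- gbm_gridperf@model_ids[[1]]
best_gbm <- h2o.getModel(best_gbm_id)

best_perf <- h2o.performance(best_gbm, test)
h2o.auc(best_perf)
```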
# References
[1] Friedman, Jerome H. "Greedy Function Approximation: A Gradient Boosting
Machine." Annals of Statistics 29 (2001), no. 5, 1189-1232.
doi:10.1214/aos/1013203451.
[http://projecteuclid.org/euclid.aos/1013203451](http://projecteuclid.org/euclid.aos/1013203451)