|
23 | 23 | "source": [
|
24 | 24 | "## Introduction\n",
|
25 | 25 | "\n",
|
26 |
| - "[Linear Models](https://en.wikipedia.org/wiki/Linear_regression) are one of the oldest and most well known statistical prediction algorithms which nowdays is often categorized as a \"machine learning algorithm.\" [Generalized Linear Models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) are are a framework for modeling a response variable $y$ that is bounded or discrete. Generalized linear models allow for an arbitrary link function $g$ that relates the mean of the response variable to the predictors, i.e. $E(y) = g(β′x)$. The link function is often related to the distribution of the response, and in particular it typically has the effect of transforming between, $(-\\infty ,\\infty )$, the range of the linear predictor, and the range of the response variable (e.g. $[0,1]$). [1]\n", |
| 26 | + "[Linear Models](https://en.wikipedia.org/wiki/Linear_regression) are one of the oldest and most well known statistical prediction algorithms which nowadays is often categorized as a \"machine learning algorithm.\" [Generalized Linear Models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) are are a framework for modeling a response variable $y$ that is bounded or discrete. Generalized linear models allow for an arbitrary link function $g$ that relates the mean of the response variable to the predictors, i.e. $E(y) = g(β′x)$. The link function is often related to the distribution of the response, and in particular it typically has the effect of transforming between, $(-\\infty ,\\infty )$, the range of the linear predictor, and the range of the response variable (e.g. $[0,1]$). [1]\n", |
27 | 27 | "\n",
|
28 | 28 | "Therefore, GLMs allow for response variables that have error distribution models other than a normal distribution. Some common examples of GLMs are:\n",
|
29 | 29 | "- [Poisson regression](https://en.wikipedia.org/wiki/Poisson_regression) for count data.\n",
|
|
51 | 51 | "\n",
|
52 | 52 | "$$RSS(\\beta) = \\sum_{i=1}^n (y_i - x_i^T\\beta)^2$$\n",
|
53 | 53 | "\n",
|
54 |
| - "$RSS(\\beta)$ is a quadradic function of the parameters, and hence its minimum always exists, but may not be unique. The solution is easiest to characterize in matrix notation:\n", |
| 54 | + "$RSS(\\beta)$ is a quadratic function of the parameters, and hence its minimum always exists, but may not be unique. The solution is easiest to characterize in matrix notation:\n", |
55 | 55 | "\n",
|
56 | 56 | "$$RSS(\\beta) = (\\boldsymbol{y} - \\boldsymbol{X}\\beta)^T(\\boldsymbol{y} - \\boldsymbol{X}\\beta)$$\n",
|
57 | 57 | "\n",
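| + "Setting the gradient to zero gives the familiar normal equations, $\\boldsymbol{X}^T\\boldsymbol{X}\\beta = \\boldsymbol{X}^T\\boldsymbol{y}$. As a minimal sketch (simulated data, no particular package assumed), the solution can be computed directly in R:\n", |
| + "\n", |
| + "```r\n", |
| + "set.seed(1)\n", |
| + "X <- cbind(1, matrix(rnorm(100 * 2), ncol = 2))  # design matrix with intercept\n", |
| + "y <- X %*% c(0.5, 2, -1) + rnorm(100)            # simulated response\n", |
| + "\n", |
| + "# Solve X'X beta = X'y without forming an explicit inverse\n", |
| + "beta_hat <- solve(t(X) %*% X, t(X) %*% y)\n", |
| + "```\n", |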
|
|
147 | 147 | "2. Compute the weighted [Gram matrix](https://en.wikipedia.org/wiki/Gramian_matrix) XT WX and XT z vector\n",
|
148 | 148 | "3. Decompose the Gram matrix ([Cholesky decomposition](https://en.wikipedia.org/wiki/Cholesky_decomposition)) and apply ADMM solver to solve the $\\ell_1$ penalized least squares problem.\n",
|
149 | 149 | "\n",
|
150 |
| - "In the [H2O GLM](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/GLMBooklet.pdf) implementation, steps 1 and 2 are performed distributively, and Step 3 is computed in parallel on a single node. The Gram matrix appraoch is very efficient for tall and narrow datasets when running lamnda search with a sparse solution. \n", |
| 150 | + "In the [H2O GLM](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/GLMBooklet.pdf) implementation, steps 1 and 2 are performed distributively, and Step 3 is computed in parallel on a single node. The Gram matrix approach is very efficient for tall and narrow datasets when running lambda search with a sparse solution. \n", |
151 | 151 | "\n",
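| + "As an illustration of steps 1 and 2 (a toy sketch only; the actual H2O and ADMM code paths differ), a single unpenalized IRLS step for logistic regression looks like this:\n", |
| + "\n", |
| + "```r\n", |
| + "# One IRLS step for logistic regression; X is the design matrix, y is 0/1\n", |
| + "irls_step <- function(X, y, beta) {\n", |
| + "  eta <- as.vector(X %*% beta)  # linear predictor\n", |
| + "  mu  <- 1 / (1 + exp(-eta))    # inverse logit link\n", |
| + "  w   <- mu * (1 - mu)          # IRLS weights\n", |
| + "  z   <- eta + (y - mu) / w     # working response\n", |
| + "  WX  <- X * w                  # rows of X scaled by the weights\n", |
| + "  solve(t(X) %*% WX, t(WX) %*% z)  # weighted normal equations\n", |
| + "}\n", |
| + "```\n", |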
|
152 | 152 | "\n",
|
153 | 153 | "### Cyclical Coordinate Descent\n",
|
154 | 154 | "\n",
|
155 |
| - "The IRLS method can also use cyclical coordinate descent in it's inner loop (as opposed to ADMM). The [glmnet](http://web.stanford.edu/~hastie/glmnet/glmnet_beta.html) package uses [cyclical coordinate descent](http://web.stanford.edu/~hastie/Papers/glmnet.pdf) which successively optimizes the objective function over each parameter with others fixed, and cycles repeatedly until convergence.\n", |
| 155 | + "The IRLS method can also use cyclical coordinate descent in its inner loop (as opposed to ADMM). The [glmnet](http://web.stanford.edu/~hastie/glmnet/glmnet_beta.html) package uses [cyclical coordinate descent](http://web.stanford.edu/~hastie/Papers/glmnet.pdf) which successively optimizes the objective function over each parameter with others fixed, and cycles repeatedly until convergence.\n", |
156 | 156 | "\n",
|
157 | 157 | "Cyclical coordinate descent methods are a natural approach for solving\n",
|
158 | 158 | "convex problems with $\\ell_1$ or $\\ell_2$ constraints, or mixtures of the two (elastic net). Each coordinate-descent step is fast, with an explicit formula for each coordinate-wise minimization. The method also exploits the sparsity of the model, spending much of its time evaluating only inner products for variables with non-zero coefficients.\n",
|
|
169 | 169 | "source": [
|
170 | 170 | "## Data Preprocessing\n",
|
171 | 171 | "\n",
|
172 |
| - "In order for the coefficients to be easily interpretable, the features must be centered and scaled (aka \"normalized\"). Many software packages will allow the direct input of categorical/factor columns in the training frame, however internally any categorical columns will be expaded into binary indicator variables. The caret package offers a handy utility function, [caret::dummyVars()](http://www.rdocumentation.org/packages/caret/functions/dummyVars), for dummy/indicator expansion if you need to do this manually.\n", |
| 172 | + "In order for the coefficients to be easily interpretable, the features must be centered and scaled (aka \"normalized\"). Many software packages will allow the direct input of categorical/factor columns in the training frame, however internally any categorical columns will be expanded into binary indicator variables. The caret package offers a handy utility function, [caret::dummyVars()](http://www.rdocumentation.org/packages/caret/functions/dummyVars), for dummy/indicator expansion if you need to do this manually.\n", |
173 | 173 | "\n",
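| + "For example, on a toy data frame (all names here are made up):\n", |
| + "\n", |
| + "```r\n", |
| + "library(caret)\n", |
| + "df <- data.frame(color = factor(c(\"red\", \"blue\", \"red\")), y = c(1, 0, 1))\n", |
| + "dv <- dummyVars(y ~ ., data = df)  # build the indicator expansion\n", |
| + "X  <- predict(dv, newdata = df)    # one binary column per factor level\n", |
| + "```\n", |
| + "\n", |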
|
174 | 174 | "Missing data will need to be imputed, otherwise in many GLM packages, those rows will simply be omitted from the training set at train time. For example, in the `stats::glm()` function there is an `na.action` argument which allows the user to do one of the three options:\n",
|
175 | 175 | "\n",
|
|
1002 | 1002 | "cell_type": "markdown",
|
1003 | 1003 | "metadata": {},
|
1004 | 1004 | "source": [
|
1005 |
| - "Ok, this looks much better. And we didn't have to deal with the missing factor levels! :-)" |
| 1005 | + "OK, this looks much better. And we didn't have to deal with the missing factor levels! :-)" |
1006 | 1006 | ]
|
1007 | 1007 | },
|
1008 | 1008 | {
|
|
1015 | 1015 | "\n",
|
1016 | 1016 | "Backend: Java\n",
|
1017 | 1017 | "\n",
|
1018 |
| - "The [h2o](https://cran.r-project.org/web/packages/h2o/index.html) package offers a data-distributed implementation of Generalized Linear Models. A \"data-distribtued\" version uses distributed data frames, so that the whole design matrix does not need to fit into memory at once. The h2o package fits both regularized and non-regularized GLMs. The implementation details are documented [here](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/GLMBooklet.pdf)." |
| 1018 | + "The [h2o](https://cran.r-project.org/web/packages/h2o/index.html) package offers a data-distributed implementation of Generalized Linear Models. A \"data-distributed\" version uses distributed data frames, so that the whole design matrix does not need to fit into memory at once. The h2o package fits both regularized and non-regularized GLMs. The implementation details are documented [here](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/GLMBooklet.pdf)." |
1019 | 1019 | ]
|
1020 | 1020 | },
|
1021 | 1021 | {
|
|
1434 | 1434 | "source": [
|
1435 | 1435 | "### speedglm\n",
|
1436 | 1436 | "\n",
|
1437 |
| - "Also worth metioning is the [speedglm](https://cran.r-project.org/web/packages/speedglm/index.html) package, which fits Linear and Generalized Linear Models to large data sets. This is particularly useful if R is linked against an optimized [BLAS](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms). For data sets of size greater of R memory, the fitting is performed by an iterative algorithm." |
| 1437 | + "Also worth mentioning is the [speedglm](https://cran.r-project.org/web/packages/speedglm/index.html) package, which fits Linear and Generalized Linear Models to large data sets. This is particularly useful if R is linked against an optimized [BLAS](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms). For data sets of size greater of R memory, the fitting is performed by an iterative algorithm." |
1438 | 1438 | ]
|
1439 | 1439 | },
|
1440 | 1440 | {
|
|
1443 | 1443 | "source": [
|
1444 | 1444 | "## Regularized GLM in R\n",
|
1445 | 1445 | "\n",
|
1446 |
| - "Ok, so let's assume that we have wide, sparse, collinear or big data. If your training set falls into any of those categories, it might be a good idea to use a regularlized GLM.\n", |
| 1446 | + "OK, so let's assume that we have wide, sparse, collinear or big data. If your training set falls into any of those categories, it might be a good idea to use a regularized GLM.\n", |
1447 | 1447 | "\n",
|
1448 | 1448 | "### glmnet\n",
|
1449 | 1449 | "\n",
|
|
1457 | 1457 | "\n",
|
1458 | 1458 | "- The code can handle sparse input-matrix formats, as well as range constraints on coefficients. \n",
|
1459 | 1459 | "- Glmnet also makes use of the strong rules for efficient restriction of the active set. \n",
|
1460 |
| - "- The core of Glmnet is a set of fortran subroutines, which make for very fast execution. \n", |
| 1460 | + "- The core of Glmnet is a set of FORTRAN subroutines, which make for very fast execution. \n", |
1461 | 1461 | "- The algorithms use coordinate descent with warm starts and active set iterations. \n",
|
1462 | 1462 | "- Supports the following distributions: `\"gaussian\",\"binomial\",\"poisson\",\"multinomial\",\"cox\",\"mgaussian\"`\n",
|
1463 | 1463 | "- Supports standardization and offsets.\n",
|
|