Commit 03dfd58
committed: various trivial typos
1 parent e0df899, commit 03dfd58

5 files changed, +34 -34 lines changed

README.md (+2 -2)

```diff
@@ -19,15 +19,15 @@ Here are some practical, related topics we will cover for each algorithm:
 
 - Dimensionality Issues
 - Sparsity
-- Normalizaion
+- Normalization
 - Categorical Data
 - Missing Data
 - Class Imbalance
 - Overfitting
 - Software
 - Scalability
 
-Instructions for how to install the neccessary software for this tutorial is available [here](tutorial-installation.md). Data for the tutorial can be downloaded by running `./data/get-data.sh` (requires **wget**).
+Instructions for how to install the necessary software for this tutorial is available [here](tutorial-installation.md). Data for the tutorial can be downloaded by running `./data/get-data.sh` (requires **wget**).
 
 ## Dimensionality Issues
 Certain algorithms don't scale well when there are millions of features. For example, decision trees require computing some sort of metric (to determine the splits) on all the feature values (or some fraction of the values as in Random Forest and Stochastic GBM). Therefore, computation time is linear in the number of features. Other algorithms, such as GLM, scale much better to high-dimensional (n << p) and wide data with appropriate regularization (e.g. Lasso, Elastic Net, Ridge).
```

decision-trees.ipynb (+10 -10)

```diff
@@ -14,7 +14,7 @@
 "* * *\n",
 "![Alt text](./images/dt.png \"Decision Tree\")\n",
 "* * *\n",
-"Decision Tree visualizaion by Tony Chu and Stephanie Yee."
+"Decision Tree visualization by Tony Chu and Stephanie Yee."
 ]
 },
 {
@@ -29,7 +29,7 @@
 "- **Classification Tree:** A decision tree that performs classification (predicts a categorical response).\n",
 "- **Regression Tree:** A decision tree that performs regression (predicts a numeric response).\n",
 "- **Split Point:** A split point occurs at each node of the tree where a decision is made (e.g. x > 7 vs. x &leq; 7).\n",
-"- **Terminal Node:** A node terminal node is a node which has no decendants (child nodes). Also called a \"leaf node.\""
+"- **Terminal Node:** A terminal node is a node which has no descendants (child nodes). Also called a \"leaf node.\""
 ]
 },
 {
@@ -39,7 +39,7 @@
 "## Properties of Trees\n",
 "\n",
 "- Can handle huge datasets.\n",
-"- Can handle *mixed* predictors implicitly -- numeric and categorial.\n",
+"- Can handle *mixed* predictors implicitly -- numeric and categorical.\n",
 "- Easily ignore redundant variables.\n",
 "- Handle missing data elegantly through *surrogate splits*.\n",
 "- Small trees are easy to interpret.\n",
@@ -64,7 +64,7 @@
 "- CART prunes trees using a cost-complexity model whose parameters are estimated by\n",
 "cross-validation; C4.5 uses a single-pass algorithm derived from binomial confidence\n",
 "limits.\n",
-"- With repsect to missing data, CART looks for surrogate tests that approximate the outcomes when the tested attribute has an unknown value, but C4.5 apportions the case probabilistically among the outcomes. \n",
+"- With respect to missing data, CART looks for surrogate tests that approximate the outcomes when the tested attribute has an unknown value, but C4.5 apportions the case probabilistically among the outcomes. \n",
 "\n",
 "\n",
 "Decision trees are formed by a collection of rules based on variables in the modeling data set:\n",
@@ -82,7 +82,7 @@
 "source": [
 "## Splitting Criterion & Best Split\n",
 "\n",
-"The original CART algorithm uses the Gini Impurity, where as ID3, C4.5 and C5.0 use Entropy or Information Gain (related to Entropy).\n",
+"The original CART algorithm uses the Gini Impurity, whereas ID3, C4.5 and C5.0 use Entropy or Information Gain (related to Entropy).\n",
 "\n",
 "### Gini Impurity\n",
 "\n",
@@ -106,7 +106,7 @@
 "Where,\n",
 "- $S$ is the current (data) set for which entropy is being calculated (changes every iteration of the ID3 algorithm)\n",
 "- $X$ is set of classes in $S$\n",
-"- $p(x)$ is the proportion of the number of elements in class $x$ to the number of elements in set $S$\n",
+"- $p(x)$ is the ratio of the number of elements in class $x$ to the number of elements in set $S$\n",
 "\n",
 "When $H(S)=0$, the set $S$ is perfectly classified (i.e. all elements in $S$ are of the same class).\n",
 "\n",
@@ -119,14 +119,14 @@
 "source": [
 "### Information gain\n",
 "\n",
-"Information gain $IG(A)$ is the measure of the difference in entropy from before to after the set $S$ is split on an attribute $A$. In other words, how much uncertainty in $S$ was reduced after splitting set $S$ on attribute $A$.\n",
+"Information gain $IG(A)$ is the measure of the difference in entropy from before to after the set $S$ is split on an attribute $A$: in other words, how much uncertainty in $S$ was reduced after splitting set $S$ on attribute $A$.\n",
 "\n",
 "$$ IG(A,S)=H(S)-\\sum _{{t\\in T}}p(t)H(t)$$\n",
 "\n",
 "Where,\n",
 "- $H(S)$ is the entropy of set $S$\n",
 "- $T$ is the set of subsets created from splitting set $S$ by attribute $A$ such that $S=\\bigcup _{{t\\in T}}t$\n",
-"- $p(t)$ is the proportion of the number of elements in $t$ to the number of elements in set $S$\n",
+"- $p(t)$ is the ratio of the number of elements in $t$ to the number of elements in set $S$\n",
 "- $H(t)$ is the entropy of subset $t$\n",
 "\n",
 "In ID3, information gain can be calculated (instead of entropy) for each remaining attribute. The attribute with the *largest* information gain is used to split the set $S$ on this iteration."
```
```diff
@@ -140,7 +140,7 @@
 "\n",
 "This is an example of a decision boundary in two dimensions of a (binary) classification tree. The black circle is the Bayes Optimal decision boundary and the blue square-ish boundary is learned by the classification tree.\n",
 "\n",
-"![Alt text](./images/boundary_dt.png \"Decision Tree Bounday\")\n",
+"![Alt text](./images/boundary_dt.png \"Decision Tree Boundary\")\n",
 "Source: Elements of Statistical Learning"
 ]
 },
@@ -177,7 +177,7 @@
 "source": [
 "# CART Software in R\n",
 "\n",
-"Since its more common in machine learning to use trees in an ensemble, we'll skip the code tutorial for CART in R. For reference, trees can be grown using the [rpart](https://cran.r-project.org/web/packages/rpart/index.html) package, among others."
+"Since it's more common in machine learning to use trees in an ensemble, we'll skip the code tutorial for CART in R. For reference, trees can be grown using the [rpart](https://cran.r-project.org/web/packages/rpart/index.html) package, among others."
 ]
 },
 {
```
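The hunk above notes that the notebook skips a CART code tutorial in R. For readers who want to see the referenced rpart interface anyway, here is a hedged sketch on R's built-in `iris` data (chosen purely for illustration; the `cp` values are arbitrary).

```r
# Illustrative only -- grow and prune a classification tree with rpart.
library(rpart)

fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(cp = 0.01, minsplit = 20))
printcp(fit)                      # cross-validated cost-complexity table
pruned <- prune(fit, cp = 0.02)   # prune at a chosen complexity parameter
predict(pruned, head(iris), type = "class")
```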

generalized-linear-models.ipynb (+10 -10)

```diff
@@ -23,7 +23,7 @@
 "source": [
 "## Introduction\n",
 "\n",
-"[Linear Models](https://en.wikipedia.org/wiki/Linear_regression) are one of the oldest and most well known statistical prediction algorithms which nowdays is often categorized as a \"machine learning algorithm.\" [Generalized Linear Models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) are are a framework for modeling a response variable $y$ that is bounded or discrete. Generalized linear models allow for an arbitrary link function $g$ that relates the mean of the response variable to the predictors, i.e. $E(y) = g(β′x)$. The link function is often related to the distribution of the response, and in particular it typically has the effect of transforming between, $(-\\infty ,\\infty )$, the range of the linear predictor, and the range of the response variable (e.g. $[0,1]$). [1]\n",
+"[Linear Models](https://en.wikipedia.org/wiki/Linear_regression) are one of the oldest and most well known statistical prediction algorithms which nowadays is often categorized as a \"machine learning algorithm.\" [Generalized Linear Models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) are are a framework for modeling a response variable $y$ that is bounded or discrete. Generalized linear models allow for an arbitrary link function $g$ that relates the mean of the response variable to the predictors, i.e. $E(y) = g(β′x)$. The link function is often related to the distribution of the response, and in particular it typically has the effect of transforming between, $(-\\infty ,\\infty )$, the range of the linear predictor, and the range of the response variable (e.g. $[0,1]$). [1]\n",
 "\n",
 "Therefore, GLMs allow for response variables that have error distribution models other than a normal distribution. Some common examples of GLMs are:\n",
 "- [Poisson regression](https://en.wikipedia.org/wiki/Poisson_regression) for count data.\n",
@@ -51,7 +51,7 @@
 "\n",
 "$$RSS(\\beta) = \\sum_{i=1}^n (y_i - x_i^T\\beta)^2$$\n",
 "\n",
-"$RSS(\\beta)$ is a quadradic function of the parameters, and hence its minimum always exists, but may not be unique. The solution is easiest to characterize in matrix notation:\n",
+"$RSS(\\beta)$ is a quadratic function of the parameters, and hence its minimum always exists, but may not be unique. The solution is easiest to characterize in matrix notation:\n",
 "\n",
 "$$RSS(\\beta) = (\\boldsymbol{y} - \\boldsymbol{X}\\beta)^T(\\boldsymbol{y} - \\boldsymbol{X}\\beta)$$\n",
 "\n",
```
```diff
@@ -147,12 +147,12 @@
 "2. Compute the weighted [Gram matrix](https://en.wikipedia.org/wiki/Gramian_matrix) XT WX and XT z vector\n",
 "3. Decompose the Gram matrix ([Cholesky decomposition](https://en.wikipedia.org/wiki/Cholesky_decomposition)) and apply ADMM solver to solve the $\\ell_1$ penalized least squares problem.\n",
 "\n",
-"In the [H2O GLM](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/GLMBooklet.pdf) implementation, steps 1 and 2 are performed distributively, and Step 3 is computed in parallel on a single node. The Gram matrix appraoch is very efficient for tall and narrow datasets when running lamnda search with a sparse solution. \n",
+"In the [H2O GLM](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/GLMBooklet.pdf) implementation, steps 1 and 2 are performed distributively, and Step 3 is computed in parallel on a single node. The Gram matrix approach is very efficient for tall and narrow datasets when running lambda search with a sparse solution. \n",
 "\n",
 "\n",
 "### Cyclical Coordinate Descent\n",
 "\n",
-"The IRLS method can also use cyclical coordinate descent in it's inner loop (as opposed to ADMM). The [glmnet](http://web.stanford.edu/~hastie/glmnet/glmnet_beta.html) package uses [cyclical coordinate descent](http://web.stanford.edu/~hastie/Papers/glmnet.pdf) which successively optimizes the objective function over each parameter with others fixed, and cycles repeatedly until convergence.\n",
+"The IRLS method can also use cyclical coordinate descent in its inner loop (as opposed to ADMM). The [glmnet](http://web.stanford.edu/~hastie/glmnet/glmnet_beta.html) package uses [cyclical coordinate descent](http://web.stanford.edu/~hastie/Papers/glmnet.pdf) which successively optimizes the objective function over each parameter with others fixed, and cycles repeatedly until convergence.\n",
 "\n",
 "Cyclical coordinate descent methods are a natural approach for solving\n",
 "convex problems with $\\ell_1$ or $\\ell_2$ constraints, or mixtures of the two (elastic net). Each coordinate-descent step is fast, with an explicit formula for each coordinate-wise minimization. The method also exploits the sparsity of the model, spending much of its time evaluating only inner products for variables with non-zero coefficients.\n",
```
```diff
@@ -169,7 +169,7 @@
 "source": [
 "## Data Preprocessing\n",
 "\n",
-"In order for the coefficients to be easily interpretable, the features must be centered and scaled (aka \"normalized\"). Many software packages will allow the direct input of categorical/factor columns in the training frame, however internally any categorical columns will be expaded into binary indicator variables. The caret package offers a handy utility function, [caret::dummyVars()](http://www.rdocumentation.org/packages/caret/functions/dummyVars), for dummy/indicator expansion if you need to do this manually.\n",
+"In order for the coefficients to be easily interpretable, the features must be centered and scaled (aka \"normalized\"). Many software packages will allow the direct input of categorical/factor columns in the training frame, however internally any categorical columns will be expanded into binary indicator variables. The caret package offers a handy utility function, [caret::dummyVars()](http://www.rdocumentation.org/packages/caret/functions/dummyVars), for dummy/indicator expansion if you need to do this manually.\n",
 "\n",
 "Missing data will need to be imputed, otherwise in many GLM packages, those rows will simply be omitted from the training set at train time. For example, in the `stats::glm()` function there is an `na.action` argument which allows the user to do one of the three options:\n",
 "\n",
```
```diff
@@ -1002,7 +1002,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Ok, this looks much better. And we didn't have to deal with the missing factor levels! :-)"
+"OK, this looks much better. And we didn't have to deal with the missing factor levels! :-)"
 ]
 },
 {
@@ -1015,7 +1015,7 @@
 "\n",
 "Backend: Java\n",
 "\n",
-"The [h2o](https://cran.r-project.org/web/packages/h2o/index.html) package offers a data-distributed implementation of Generalized Linear Models. A \"data-distribtued\" version uses distributed data frames, so that the whole design matrix does not need to fit into memory at once. The h2o package fits both regularized and non-regularized GLMs. The implementation details are documented [here](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/GLMBooklet.pdf)."
+"The [h2o](https://cran.r-project.org/web/packages/h2o/index.html) package offers a data-distributed implementation of Generalized Linear Models. A \"data-distributed\" version uses distributed data frames, so that the whole design matrix does not need to fit into memory at once. The h2o package fits both regularized and non-regularized GLMs. The implementation details are documented [here](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/GLMBooklet.pdf)."
 ]
 },
 {
```
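For readers who want to see roughly what the h2o fit referenced above looks like from R, here is a hedged sketch; the tiny simulated data set and the parameter choices (`alpha = 0.5`, `lambda_search = TRUE`) are stand-ins, not taken from the notebook.

```r
# Illustrative only: a data-distributed GLM fit with the h2o package.
library(h2o)
h2o.init(nthreads = -1)              # start (or connect to) a local H2O cluster

# Tiny stand-in data set, not the tutorial data
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
df$y <- as.factor(rbinom(100, 1, plogis(df$x1 - df$x2)))
train <- as.h2o(df)

fit <- h2o.glm(x = c("x1", "x2"), y = "y", training_frame = train,
               family = "binomial",  # match the family to the response
               alpha = 0.5, lambda_search = TRUE)
h2o.coef(fit)
```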
```diff
@@ -1434,7 +1434,7 @@
 "source": [
 "### speedglm\n",
 "\n",
-"Also worth metioning is the [speedglm](https://cran.r-project.org/web/packages/speedglm/index.html) package, which fits Linear and Generalized Linear Models to large data sets. This is particularly useful if R is linked against an optimized [BLAS](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms). For data sets of size greater of R memory, the fitting is performed by an iterative algorithm."
+"Also worth mentioning is the [speedglm](https://cran.r-project.org/web/packages/speedglm/index.html) package, which fits Linear and Generalized Linear Models to large data sets. This is particularly useful if R is linked against an optimized [BLAS](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms). For data sets of size greater of R memory, the fitting is performed by an iterative algorithm."
 ]
 },
 {
@@ -1443,7 +1443,7 @@
 "source": [
 "## Regularized GLM in R\n",
 "\n",
-"Ok, so let's assume that we have wide, sparse, collinear or big data. If your training set falls into any of those categories, it might be a good idea to use a regularlized GLM.\n",
+"OK, so let's assume that we have wide, sparse, collinear or big data. If your training set falls into any of those categories, it might be a good idea to use a regularized GLM.\n",
 "\n",
 "### glmnet\n",
 "\n",
@@ -1457,7 +1457,7 @@
 "\n",
 "- The code can handle sparse input-matrix formats, as well as range constraints on coefficients. \n",
 "- Glmnet also makes use of the strong rules for efficient restriction of the active set. \n",
-"- The core of Glmnet is a set of fortran subroutines, which make for very fast execution. \n",
+"- The core of Glmnet is a set of FORTRAN subroutines, which make for very fast execution. \n",
 "- The algorithms use coordinate descent with warm starts and active set iterations. \n",
 "- Supports the following distributions: `\"gaussian\",\"binomial\",\"poisson\",\"multinomial\",\"cox\",\"mgaussian\"`\n",
 "- Supports standardization and offsets.\n",
```
