Sample data set: Survey
Location Age AnnualSalary Opinion
________ ___ ____________ ___________
'US' 40 72000 'not liked'
'Asia' 25 48000 'very liked'
'Africa' 30 54000 'not liked'
'Asia' 35 61000 'not liked'
'Africa' 44 63777 'liked'
'US' 33 58000 'very liked'
'Asia' 37 52000 'not liked'
'Africa' 55 83000 'not liked'
Remove missing data
Replace missing data
- Calculate the mean, median, or mode (excluding missing values from the calculation)
- Replace with the most frequent value
- Predict the missing values with an algorithm
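The replacement strategies above can be sketched in plain Python. This is a minimal illustration, not a definitive implementation: the helper name `impute` and the `None` entries are assumptions made for the example (the Survey sample itself has no missing values).

```python
from statistics import mean, median, mode

def impute(values, strategy="mean"):
    """Replace None entries with a statistic computed from the observed values."""
    # Missing entries are excluded before the statistic is calculated.
    observed = [v for v in values if v is not None]
    statistic = {"mean": mean, "median": median, "mode": mode}[strategy]
    fill = statistic(observed)
    return [fill if v is None else v for v in values]

# Hypothetical copies of two Survey columns with some entries blanked out.
ages = [40, 25, None, 35, 44, 33, None, 55]
locations = ["US", "Asia", "Africa", "Asia", None, "US", "Asia", "Africa"]

ages_filled = impute(ages, "median")          # numeric column: median
locations_filled = impute(locations, "mode")  # categorical column: most frequent value
```

For a categorical column only the mode is meaningful, which is why "replace with the most frequent value" is the usual choice there.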
When the ranges of values of different variables differ greatly, these values need to be standardized into a fixed range.
e.g.: Age and AnnualSalary
Normalization:
- Rescale the features to a fixed range, typically [0, 1] or [−1, 1].
- Formula (for the range [0, 1]): $$ x_{new} =\cfrac{x - \min(x)}{\max(x) - \min(x)} $$
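As a small sketch of the min-max formula, the snippet below rescales the Age and AnnualSalary columns of the Survey sample to [0, 1]. The function name `min_max_normalize` is an assumption for illustration, and the code assumes the column is not constant (otherwise the denominator would be zero).

```python
def min_max_normalize(values):
    """Rescale values to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    # Assumes hi != lo; a constant column would divide by zero.
    return [(v - lo) / (hi - lo) for v in values]

ages = [40, 25, 30, 35, 44, 33, 37, 55]
salaries = [72000, 48000, 54000, 61000, 63777, 58000, 52000, 83000]

ages_scaled = min_max_normalize(ages)
salaries_scaled = min_max_normalize(salaries)
```

After rescaling, both columns share the same range, so neither dominates purely because of its units.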
Standardization
- Rescale data to have a mean of 0 and a standard deviation of 1.
- Formula: $$ x_{new} =\cfrac{x - \overline{x}}{\sigma} $$
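The standardization formula can be sketched the same way. The helper name `standardize` is hypothetical, and using the population standard deviation (`pstdev`) for $\sigma$ is an assumption, since the notes do not say whether the population or sample version is meant.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    return [(v - mu) / sigma for v in values]

ages = [40, 25, 30, 35, 44, 33, 37, 55]
ages_standardized = standardize(ages)
```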
An outlier is a data point that differs significantly from other observations.
Different libraries use different strategies to handle outliers, aiming either to remove the outliers or to fill them in with other values.
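As one possible illustration of both strategies, the sketch below uses the common 1.5 × IQR rule to detect outliers; this rule is an assumption chosen for the example, not a method the notes prescribe. The value 120 is a made-up outlier appended to the Age column, and `iqr_bounds` is a hypothetical helper.

```python
from statistics import median, quantiles

def iqr_bounds(values, k=1.5):
    """Return (low, high) limits; points outside them are treated as outliers."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

ages = [40, 25, 30, 35, 44, 33, 37, 55, 120]  # 120 is a made-up outlier
low, high = iqr_bounds(ages)

outliers = [a for a in ages if a < low or a > high]

# Strategy 1: remove the outliers.
removed = [a for a in ages if low <= a <= high]

# Strategy 2: fill the outliers in, here with the median of the remaining points.
fill = median(removed)
filled = [a if low <= a <= high else fill for a in ages]
```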
Categorical variables represent types of data which may be divided into groups.
e.g.: Location, Opinion
Label Encoding
Label encoding converts the data into numeric form.
e.g.:
not liked -> 0
liked -> 1
very liked -> 2
Location Age AnnualSalary Opinion
________ ___ ____________ ___________
'US' 40 72000 0
'Asia' 25 48000 2
'Africa' 30 54000 0
'Asia' 35 61000 0
'Africa' 44 63777 1
'US' 33 58000 2
'Asia' 37 52000 0
'Africa' 55 83000 0
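The mapping above can be applied with a plain dictionary. This is a minimal sketch; the variable names are chosen for illustration only.

```python
# Mapping from the notes: each Opinion label gets an integer code.
opinion_codes = {"not liked": 0, "liked": 1, "very liked": 2}

opinions = ["not liked", "very liked", "not liked", "not liked",
            "liked", "very liked", "not liked", "not liked"]

# Look up the code of every label in the column.
encoded = [opinion_codes[o] for o in opinions]
```

Writing the mapping by hand keeps the codes in the intended order (not liked < liked < very liked), which matters because Opinion is an ordered category.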
One Hot Encoding
One Hot Encoding converts categories that have no ordering relationship between them into dummy variables that take the value 0 or 1.
e.g.:
Age AnnualSalary Opinion Africa Asia US
___ ____________ ___________ ______ ____ ____
40 72000 0 0 0 1
25 48000 2 0 1 0
30 54000 0 1 0 0
35 61000 0 0 1 0
44 63777 1 1 0 0
33 58000 2 0 0 1
37 52000 0 0 1 0
55 83000 0 1 0 0
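A small sketch of how the dummy columns in the table can be produced. The function name `one_hot` is an assumption, and ordering the columns alphabetically is a choice made here to match the table.

```python
def one_hot(values):
    """Return the category names and one 0/1 row per input value."""
    categories = sorted(set(values))  # alphabetical column order
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

locations = ["US", "Asia", "Africa", "Asia", "Africa", "US", "Asia", "Africa"]
columns, dummies = one_hot(locations)
```

Each row contains exactly one 1, marking the category that row belongs to.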
Dummy variable trap: When using One Hot Encoding, some of the resulting attributes are highly correlated, meaning that one variable can be predicted from the others. e.g.: The data in the US column can be predicted from the two columns Africa and Asia (US is 1 exactly when both are 0). So removing one of the three columns is essential to avoid the dummy variable trap.
Age AnnualSalary Opinion Africa Asia
___ ____________ ___________ ______ ____
40 72000 0 0 0
25 48000 2 0 1
30 54000 0 1 0
35 61000 0 0 1
44 63777 1 1 0
33 58000 2 0 0
37 52000 0 0 1
55 83000 0 1 0
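The sketch below drops one dummy column and checks that the dropped column can still be recovered from the remaining ones. The helper name `one_hot_drop_last` is hypothetical; dropping the last column (US) simply mirrors the table above, and any one of the three could be dropped instead.

```python
def one_hot_drop_last(values):
    """One-hot encode, then drop the last category to avoid the dummy variable trap."""
    categories = sorted(set(values))
    kept = categories[:-1]  # drop the last category (here: US)
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return kept, rows

locations = ["US", "Asia", "Africa", "Asia", "Africa", "US", "Asia", "Africa"]
columns, dummies = one_hot_drop_last(locations)

# The dropped column carries no extra information: US = 1 - (Africa + Asia).
us_recovered = [1 - sum(row) for row in dummies]
```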
Simple linear regression is a statistical method for summarizing and studying the relationship between two continuous (quantitative) variables.
Formula model:
$$
y = b_{0} + b_{1}x
$$

Ordinary least squares method: The model minimizes the sum of the squared differences between the observed values and the corresponding values predicted by the linear function: $$ \min\left(\sum_{i}(y_{i} - \hat{y_{i}})^2\right) $$
Formulate the model: $$ b_{1} = \cfrac{\sum_{i}(x_{i} - \overline{x})(y_{i}-\overline{y})}{\sum_{i}(x_{i}-\overline{x})^2} $$
$$
b_{0} = \overline{y} - b_{1}\overline{x}
$$
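The closed-form OLS solution can be sketched directly: $b_{1}$ is the sum of cross-deviations divided by the sum of squared deviations of $x$, and $b_{0}$ follows from the means. The function name `fit_simple_ols` is an assumption, and regressing AnnualSalary on Age is only an illustrative choice of columns.

```python
from statistics import mean

def fit_simple_ols(xs, ys):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)  # squared deviations of x
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

ages = [40, 25, 30, 35, 44, 33, 37, 55]
salaries = [72000, 48000, 54000, 61000, 63777, 58000, 52000, 83000]

b0, b1 = fit_simple_ols(ages, salaries)
predicted = [b0 + b1 * x for x in ages]
```

A quick sanity check of a fit like this is that the residuals (observed minus predicted) sum to approximately zero.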
Gradient descent method: The model seeks to optimize and correct itself step by step, starting from an initial choice of $b_{0}$ and $b_{1}$.
Formulate the model:
1. Choose starting points for $b_{0}$ and $b_{1}$, the total loop limit, the minimum $error$ limit, and the learning rate. The standard values are:
   - $b_{0}$ = 0 (intercept)
   - $b_{1}$ = 1 (slope)
   - $error$ >= 0.001 (the model stops once the $error$ falls below this minimum)
   - total_loop <= 1000 (maximum number of loops before the model stops)
   - learning_rate = 0.1 (how strongly the model adjusts itself in each loop)
   - the starting point for the model would be: $y = 0 + 1x$
2. Calculate the predicted $\hat{y_{i}}$ by plugging the observed $x_{i}$ of each data point into the model.
3. Form the cost function, which equals the sum of the squared differences between the observed $y_{i}$ and the predicted $\hat{y_{i}}$: $$ cost = \sum_{i}(y_{i} - \hat{y_{i}})^2 $$
4. As the model needs to minimize the cost function, the next step is to calculate the partial derivative of the $cost$ with respect to $b_{0}$ and $b_{1}$: $$ \cfrac{\partial cost}{\partial b_{0}} = -2\sum_{i}(y_{i} - \hat{y_{i}}), \qquad \cfrac{\partial cost}{\partial b_{1}} = -2\sum_{i}x_{i}(y_{i} - \hat{y_{i}}) $$
5. Calculate the $error$ for each coefficient using the derivative, scaled by the learning_rate: $$ error_{b} = learning\_rate \cdot \cfrac{\partial cost}{\partial b} $$
6. Correct the coefficients and form a new model using the $error$: $$ b_{new} = b - error_{b} $$
7. Repeat steps 2-6 until the $error$ or the total_loop reaches its limit.
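The loop described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `gradient_descent` is hypothetical, the cost is taken to be the plain sum of squared differences, the $error$ is taken to be the learning rate times the partial derivative, and the toy data ($y = 1 + 2x$) is invented for the example.

```python
def gradient_descent(xs, ys, b0=0.0, b1=1.0, learning_rate=0.1,
                     min_error=0.001, total_loop=1000):
    """Fit y = b0 + b1 * x by repeatedly stepping against the cost gradient."""
    for _ in range(total_loop):
        # Predict with the current model and compare to the observations.
        residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
        # Partial derivatives of cost = sum(residual ** 2).
        d_b0 = -2 * sum(residuals)
        d_b1 = -2 * sum(x * r for x, r in zip(xs, residuals))
        # The adjustment ("error") is the derivative scaled by the learning rate.
        error_b0 = learning_rate * d_b0
        error_b1 = learning_rate * d_b1
        # Correct the coefficients to form the new model.
        b0 -= error_b0
        b1 -= error_b1
        # Stop once both adjustments are smaller than the minimum error.
        if max(abs(error_b0), abs(error_b1)) < min_error:
            break
    return b0, b1

# Toy data generated from y = 1 + 2x (made up for illustration).
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

b0, b1 = gradient_descent(xs, ys, learning_rate=0.01,
                          min_error=1e-6, total_loop=20000)
```

The call uses a smaller learning rate than the default because the unscaled sum-of-squares gradient is large for these inputs, and a rate of 0.1 would overshoot on this data. Scaling the features first, or averaging the cost over the number of points, are common ways to make a larger rate usable.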
Multiple linear regression (MLR) is a statistical technique that uses several explanatory variables to predict the outcome of a response variable.
Formula model:
$$
y = b_{0} + b_{1}x_{1} + b_{2}x_{2} + ... + b_{n}x_{n}
$$
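As a small sketch of how the formula is evaluated, the function below computes a prediction from a list of coefficients and one row of explanatory variables. The name `predict` and the coefficient values are hypothetical and are not fitted to the Survey data.

```python
def predict(coefficients, features):
    """Evaluate y = b0 + b1*x1 + ... + bn*xn for one row of features."""
    b0, *slopes = coefficients  # first entry is the intercept
    return b0 + sum(b * x for b, x in zip(slopes, features))

# Hypothetical coefficients [b0, b1, b2] and one row [x1, x2].
y_hat = predict([1.0, 2.0, 3.0], [4.0, 5.0])
```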