Sample data set: Survey
Location    Age    AnnualSalary    Opinion
________    ___    ____________    ____________
'US'        40     72000           'not liked'
'Asia'      25     48000           'very liked'
'Africa'    30     54000           'not liked'
'Asia'      35     61000           'not liked'
'Africa'    44     63777           'liked'
'US'        33     58000           'very liked'
'Asia'      37     52000           'not liked'
'US'        40     79000           'liked'
'Africa'    55     83000           'not liked'
'US'        35     67000           'liked'
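To try the commands in these notes, the sample table can be rebuilt by hand. This is a minimal sketch: the values are copied from the listing above, and the variable name survey_data is taken from the later examples.

```matlab
% Rebuild the sample survey table (values copied from the listing above).
Location = {'US';'Asia';'Africa';'Asia';'Africa';'US';'Asia';'US';'Africa';'US'};
Age = [40;25;30;35;44;33;37;40;55;35];
AnnualSalary = [72000;48000;54000;61000;63777;58000;52000;79000;83000;67000];
Opinion = {'not liked';'very liked';'not liked';'not liked';'liked'; ...
           'very liked';'not liked';'liked';'not liked';'liked'};

% table() uses the workspace variable names as column names.
survey_data = table(Location,Age,AnnualSalary,Opinion);
disp(survey_data)
```

Note that this sample has no missing values, so the missing-data commands leave it unchanged.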
Remove missing data:
data_set = rmmissing(table_data,dim,Name,Value)
survey_data = rmmissing(survey_data,1,'MinNumMissing',3)
% remove every row that contains at least 3 missing values
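The effect of 'MinNumMissing' is easiest to see on a tiny table. The sketch below uses a hypothetical three-row table T (not the survey data), where NaN and '' are the standard missing markers for numeric and cellstr variables.

```matlab
% Hypothetical table: row 2 has 3 missing values, row 3 has 1.
T = table({'US';'';'Asia'}, [40;NaN;NaN], [72000;NaN;48000], ...
    'VariableNames', {'Location','Age','AnnualSalary'});

% Default: drop any row with at least one missing value (rows 2 and 3).
kept_default = rmmissing(T);

% MinNumMissing = 3: drop only rows with 3 or more missing values (row 2).
kept_min3 = rmmissing(T, 1, 'MinNumMissing', 3);

disp(height(kept_default))
disp(height(kept_min3))
```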
Calculate mean, mode, median and fill in missing data:
fillmissing(A,'constant',v)
mean_Age = mean(survey_data.Age,'omitnan')
% omitnan to ignore NaN values in calculation
survey_data.Age = fillmissing(survey_data.Age,'constant',mean_Age)
Other fill methods: fillmissing(A,method), where method is e.g. 'previous', 'next', 'nearest' or 'linear'
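The heading also mentions the median and mode, which can be used the same way as the mean. A small sketch on a hypothetical Age column with two gaps (the values are made up for illustration):

```matlab
% Hypothetical column with two missing entries.
Age = [40; NaN; 30; 35; NaN; 35];

% Statistics computed on the non-missing values only.
median_Age = median(Age, 'omitnan');
mode_Age   = mode(Age(~isnan(Age)));

% Fill every gap with one constant ...
filled_median = fillmissing(Age, 'constant', median_Age);
filled_mode   = fillmissing(Age, 'constant', mode_Age);

% ... or with a method that looks at neighbouring values.
filled_prev   = fillmissing(Age, 'previous');  % copy the last valid value
filled_linear = fillmissing(Age, 'linear');    % interpolate between neighbours
```

A constant fill ignores the order of the rows, while 'previous' and 'linear' only make sense when the row order is meaningful.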
Normalization:
survey_data.Age = normalize(survey_data.Age,'range')
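The 'range' method rescales the data to the interval [0, 1] using (x - min) / (max - min). A short sketch that checks this against the built-in, using the Age values from the sample table:

```matlab
Age = [40;25;30;35;44;33;37;40;55;35];

% Min-max scaling written out by hand.
manual = (Age - min(Age)) ./ (max(Age) - min(Age));

% Same result from the built-in.
builtin = normalize(Age, 'range');

disp(max(abs(manual - builtin)))   % difference should be at rounding level
```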
Standardization:
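survey_data.Age = zscore(survey_data.Age)
% zscore needs the Statistics and Machine Learning Toolbox;
% normalize(survey_data.Age) gives the same result in base MATLAB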
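Standardization subtracts the mean and divides by the standard deviation, so the result has mean 0 and standard deviation 1. A sketch using the Age values from the sample table; normalize is used here instead of zscore so that no toolbox is assumed:

```matlab
Age = [40;25;30;35;44;33;37;40;55;35];

% z-score written out by hand (std uses the N-1 normalization by default).
manual = (Age - mean(Age)) ./ std(Age);

% normalize with no method argument defaults to 'zscore'.
builtin = normalize(Age);

disp([mean(builtin) std(builtin)])
```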
Remove outliers:
- Return a logical array of outliers:
isoutlier(A,method)
survey_data = survey_data(~isoutlier(survey_data.Age,'median'),:)
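With the 'median' method, a value counts as an outlier when it lies more than three scaled median absolute deviations (MAD) from the median. The sketch below reproduces that rule by hand on the Age values from the sample table; the variable names are only illustrative.

```matlab
Age = [40;25;30;35;44;33;37;40;55;35];

% Scaling constant that makes the MAD comparable to a standard deviation.
c = -1/(sqrt(2)*erfcinv(3/2));                  % about 1.4826
scaled_mad = c * median(abs(Age - median(Age)));

% Manual rule: more than 3 scaled MAD away from the median.
manual  = abs(Age - median(Age)) > 3*scaled_mad;
builtin = isoutlier(Age, 'median');

disp(Age(builtin))   % the values flagged as outliers
```

On this sample only the age 55 is flagged, so the filtering line above would drop one row.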
Fill in outliers:
Fill in method: filloutliers(A,fillmethod,findmethod)
survey_data.Age = filloutliers(survey_data.Age,'center','median')
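Filling keeps every row and only replaces the flagged values. A sketch on the Age values from the sample table, comparing two fill methods; 'clip' is shown as an alternative and is not used elsewhere in these notes.

```matlab
Age = [40;25;30;35;44;33;37;40;55;35];

% 'center' replaces each outlier with the center value of the detection
% method, which for 'median' is the median of the data.
filled_center = filloutliers(Age, 'center', 'median');

% 'clip' replaces each outlier with the nearest detection threshold.
filled_clip = filloutliers(Age, 'clip', 'median');

disp([Age filled_center filled_clip])
```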
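Encode categorical data:
Base MATLAB has no dedicated label-encoding or one-hot-encoding function (dummyvar and onehotencode ship in separate toolboxes), so we create our own: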
Label Encoding:
function new_variable = labelEncoding(variable,values_set,numbers)
    % Map each text value in values_set to the number at the same position
    % in numbers. Values not listed in values_set are encoded as 0.
    new_variable = zeros(numel(variable),1);
    for i = 1:numel(values_set)
        indices = ismember(variable,values_set{i});
        new_variable(indices) = numbers(i);
    end
end
survey_data.Opinion = labelEncoding(survey_data.Opinion,{'not liked','liked','very liked'},[0 1 2])
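For ordered labels like these, the same codes can also be obtained without a helper function by going through an ordinal categorical array. This is an alternative sketch, not the approach used in these notes; the four sample values are hypothetical.

```matlab
% Hypothetical sample of the Opinion column.
Opinion = {'not liked';'very liked';'not liked';'liked'};

% Category order defines the codes: 1, 2, 3 in the order listed.
levels = {'not liked','liked','very liked'};
opinion_cat = categorical(Opinion, levels, 'Ordinal', true);

% double() returns the category index; subtract 1 to start at 0.
codes = double(opinion_cat) - 1;
disp(codes')
```

Keeping the categorical array around has the side benefit that comparisons such as opinion_cat >= 'liked' work directly.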
One Hot Encoding:
function data = oneHotEncoding(data,variable)
    % Append one 0/1 column per unique value of variable to the table data.
    % Each new column is named after the value it represents.
    unique_values = unique(variable);
    dummy_variable = zeros(numel(variable),numel(unique_values));
    for i = 1:numel(unique_values)
        dummy_variable(:,i) = double(ismember(variable,unique_values{i}));
    end
    T = array2table(dummy_variable,'VariableNames',unique_values(:)');
    data = [data T];
end
survey_data = oneHotEncoding(survey_data,survey_data.Location)
Avoid dummy variable trap:
survey_data = survey_data(:,1:end-1)
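The trap arises because the dummy columns always sum to 1 in every row, so together with an intercept column they are linearly dependent. A sketch that shows this with a rank check, on a hypothetical six-row Location column:

```matlab
% Hypothetical Location column.
Location = {'US';'Asia';'Africa';'Asia';'Africa';'US'};

% Build the dummy matrix: one column per unique value.
[names,~,idx] = unique(Location);
D = double(idx == 1:numel(names));      % implicit expansion, R2016b or later

% With an intercept, all dummy columns together are rank deficient ...
X_full = [ones(numel(idx),1) D];
disp([size(X_full,2) rank(X_full)])        % 4 columns, rank 3

% ... and dropping one dummy column restores full column rank.
X_reduced = [ones(numel(idx),1) D(:,1:end-1)];
disp([size(X_reduced,2) rank(X_reduced)])  % 3 columns, rank 3
```

Dropping the last table column as above only works because oneHotEncoding appends the dummy columns at the end of the table.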