
Commit b7cd425

avi09 and trekhleb authored
Added kmeans clustering (#595)

* added kmeans

Co-authored-by: Oleksii Trekhleb <[email protected]>
1 parent 90ec1b7 commit b7cd425

File tree: 4 files changed, +167 −0 lines

Diff for: README.md (+1)

```diff
@@ -147,6 +147,7 @@ a set of rules that precisely define a sequence of operations.
   * **Machine Learning**
     * `B` [NanoNeuron](https://github.com/trekhleb/nano-neuron) - 7 simple JS functions that illustrate how machines can actually learn (forward/backward propagation)
     * `B` [k-NN](src/algorithms/ml/knn) - k-nearest neighbors classification algorithm
+    * `B` [k-Means](src/algorithms/ml/kmeans) - k-Means clustering algorithm
   * **Uncategorized**
     * `B` [Tower of Hanoi](src/algorithms/uncategorized/hanoi-tower)
     * `B` [Square Matrix Rotation](src/algorithms/uncategorized/square-matrix-rotation) - in-place algorithm
```

Diff for: src/algorithms/ml/kmeans/README.md (+32, new file)

# k-Means Algorithm

The **k-Means algorithm** is an unsupervised Machine Learning algorithm. It is a clustering algorithm that groups the sample data based on the similarity between the dimensions of the vectors.

In k-Means classification, the output is a set of classes assigned to each vector. Each cluster location is continuously optimized so that the clusters accurately represent their groups of points.

The idea is to calculate the similarity between each cluster location and each data vector, and reassign clusters based on it. [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance) is mostly used for this task.
![Euclidean distance between two points](https://upload.wikimedia.org/wikipedia/commons/5/55/Euclidean_distance_2d.svg)

_Image source: [Wikipedia](https://en.wikipedia.org/wiki/Euclidean_distance)_
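For two vectors, this distance is the square root of the summed squared per-dimension differences. A minimal JavaScript sketch (the implementation added by this commit also rounds the result to two decimal places; this sketch does not):

```javascript
// Euclidean distance between two equal-length vectors.
function euclideanDistance(x1, x2) {
  if (x1.length !== x2.length) {
    throw new Error('Inconsistent vector lengths');
  }
  let squaresTotal = 0;
  for (let i = 0; i < x1.length; i += 1) {
    squaresTotal += (x1[i] - x2[i]) ** 2;
  }
  return Math.sqrt(squaresTotal);
}

// A 3-4-5 right triangle: the distance from (0, 0) to (3, 4) is 5.
console.log(euclideanDistance([0, 0], [3, 4])); // → 5
```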
The algorithm is as follows:

1. Check for errors like invalid/inconsistent data
2. Initialize the k cluster locations with initial/random k points
3. Calculate the distance of each data point from each cluster center
4. Assign to each data point the label of the cluster at its minimum distance
5. Recalculate the centroid of each cluster based on the data points it contains
6. Repeat the steps above until the centroid locations stop changing
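The steps above can be condensed into a compact, self-contained sketch. This is an illustration, not the implementation added by this commit: it omits step 1's validation, seeds centroids with the first k points, and keeps labels in a separate array instead of appending them to the data set.

```javascript
// A compact k-Means sketch following the steps above.
function kMeansSketch(points, k) {
  const distance = (a, b) =>
    Math.sqrt(a.reduce((sum, ai, i) => sum + (ai - b[i]) ** 2, 0));

  // Step 2: seed the k centroids with the first k points.
  let centroids = points.slice(0, k).map((p) => [...p]);
  let labels = new Array(points.length).fill(-1);

  for (;;) {
    // Steps 3-4: label each point with the index of its nearest centroid.
    const next = points.map((p) => {
      let best = 0;
      for (let c = 1; c < k; c += 1) {
        if (distance(p, centroids[c]) < distance(p, centroids[best])) best = c;
      }
      return best;
    });

    // Step 6: stop once no label changed between iterations.
    if (next.every((label, i) => label === labels[i])) break;
    labels = next;

    // Step 5: move each centroid to the mean of its assigned points.
    centroids = centroids.map((c, ci) => {
      const members = points.filter((_, pi) => labels[pi] === ci);
      if (members.length === 0) return c; // keep an empty cluster in place
      return c.map((_, d) => members.reduce((s, m) => s + m[d], 0) / members.length);
    });
  }

  return labels;
}

const dataSet = [[1, 1], [6, 2], [3, 3], [4, 5], [9, 2], [2, 4], [8, 7]];
console.log(kMeansSketch(dataSet, 2)); // → [0, 1, 0, 1, 1, 0, 1]
```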
Here is a visualization of k-Means clustering for better understanding:

![k-Means convergence visualization](https://upload.wikimedia.org/wikipedia/commons/e/ea/K-means_convergence.gif)

_Image source: [Wikipedia](https://en.wikipedia.org/wiki/K-means_clustering)_

The centroids move continuously in order to create a better distinction between the different sets of data points. As we can see, after a few iterations the difference in centroids between iterations is quite low; for example, between iterations `13` and `14` the difference is small because the optimizer is only tuning boundary cases.
## References

- [k-Means clustering on Wikipedia](https://en.wikipedia.org/wiki/K-means_clustering)

Diff for: src/algorithms/ml/kmeans/__test__/kmeans.test.js (+36, new file)

```js
import kMeans from '../kmeans';

describe('kMeans', () => {
  it('should throw an error on invalid data', () => {
    expect(() => {
      kMeans();
    }).toThrowError('Either dataSet or labels or toClassify were not set');
  });

  it('should throw an error on inconsistent data', () => {
    expect(() => {
      kMeans([[1, 2], [1]], 2);
    }).toThrowError('Inconsistent vector lengths');
  });

  it('should assign each data point to its nearest cluster', () => {
    const dataSet = [[1, 1], [6, 2], [3, 3], [4, 5], [9, 2], [2, 4], [8, 7]];
    const k = 2;
    const expectedCluster = [0, 1, 0, 1, 1, 0, 1];
    expect(kMeans(dataSet, k)).toEqual(expectedCluster);
  });

  it('should find the clusters with equal distances', () => {
    const dataSet = [[0, 0], [1, 1], [2, 2]];
    const k = 3;
    const expectedCluster = [0, 1, 2];
    expect(kMeans(dataSet, k)).toEqual(expectedCluster);
  });

  it('should assign clusters in 3D space', () => {
    const dataSet = [[0, 0, 0], [0, 1, 0], [2, 0, 2]];
    const k = 2;
    const expectedCluster = [1, 1, 0];
    expect(kMeans(dataSet, k)).toEqual(expectedCluster);
  });
});
```

Diff for: src/algorithms/ml/kmeans/kmeans.js (+98, new file)

```js
/**
 * Calculates the Euclidean distance between two vectors.
 *
 * @param {number[]} x1
 * @param {number[]} x2
 * @returns {number}
 */
function euclideanDistance(x1, x2) {
  // Checking for errors.
  if (x1.length !== x2.length) {
    throw new Error('Inconsistent vector lengths');
  }
  // Calculate the Euclidean distance between the two vectors and return it.
  let squaresTotal = 0;
  for (let i = 0; i < x1.length; i += 1) {
    squaresTotal += (x1[i] - x2[i]) ** 2;
  }
  return Number(Math.sqrt(squaresTotal).toFixed(2));
}

/**
 * Clusters the data points using the k-Means algorithm.
 *
 * @param {number[][]} dataSetm - array of data points, i.e. [[0, 1], [3, 4], [5, 7]]
 * @param {number} k - number of clusters
 * @return {number[]} - the cluster index assigned to each data point
 */
export default function kMeans(
  dataSetm,
  k = 1,
) {
  const dataSet = dataSetm;
  if (!dataSet) {
    throw new Error('Either dataSet or labels or toClassify were not set');
  }

  // Starting the algorithm:
  // assign the k cluster locations equal to the locations of the initial k points.
  const clusterCenters = [];
  const nDim = dataSet[0].length;
  for (let i = 0; i < k; i += 1) {
    clusterCenters[clusterCenters.length] = Array.from(dataSet[i]);
  }

  // Continue the optimization until convergence:
  // centroids should not be moving once optimized.
  // Calculate the distance of each data vector from each cluster center,
  // then assign a cluster number to each data vector according to the minimum distance.
  let flag = true;
  while (flag) {
    flag = false;
    // Calculate and store the distance of each dataSet point from each cluster.
    for (let i = 0; i < dataSet.length; i += 1) {
      for (let n = 0; n < k; n += 1) {
        dataSet[i][nDim + n] = euclideanDistance(clusterCenters[n], dataSet[i].slice(0, nDim));
      }

      // Assign the cluster number to each dataSet point.
      const sliced = dataSet[i].slice(nDim, nDim + k);
      let minmDistCluster = Math.min(...sliced);
      for (let j = 0; j < sliced.length; j += 1) {
        if (minmDistCluster === sliced[j]) {
          minmDistCluster = j;
          break;
        }
      }

      // Keep iterating while any point receives a new cluster assignment.
      if (dataSet[i].length !== nDim + k + 1
        || dataSet[i][nDim + k] !== minmDistCluster) {
        flag = true;
        dataSet[i][nDim + k] = minmDistCluster;
      }
    }
    // Recalculate the cluster centroid values from all dimensions of the points under it.
    for (let i = 0; i < k; i += 1) {
      clusterCenters[i] = Array(nDim).fill(0);
      let classCount = 0;
      for (let j = 0; j < dataSet.length; j += 1) {
        if (dataSet[j][dataSet[j].length - 1] === i) {
          classCount += 1;
          for (let n = 0; n < nDim; n += 1) {
            clusterCenters[i][n] += dataSet[j][n];
          }
        }
      }
      for (let n = 0; n < nDim; n += 1) {
        clusterCenters[i][n] = Number((clusterCenters[i][n] / classCount).toFixed(2));
      }
    }
  }
  // Return the clusters assigned.
  const soln = [];
  for (let i = 0; i < dataSet.length; i += 1) {
    soln.push(dataSet[i][dataSet[i].length - 1]);
  }
  return soln;
}
```
