- Cluster analysis is a technique used in data analysis and machine learning to group similar objects or observations into clusters or subsets based on their characteristics or features.
- The goal of cluster analysis is to identify natural groupings or patterns in the data, without prior knowledge of the number or structure of the clusters.
- Put another way, the goal of clustering is to maximize the similarity of observations within a cluster and the dissimilarity between clusters.
Classification vs Clustering
- Classification predicts an output category given input data.
- Clustering groups data points together based on similarities among them and differences from others.
Euclidean Distance
Put simply, Euclidean distance is the length of the straight line segment joining two points, which is the shortest path between them.
Distance in 2D Space between two points A = (x1, y1) and B = (x2, y2):
\[ d(A,B) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \]
Distance in 3D Space between two points A = (x1, y1, z1) and B = (x2, y2, z2):
\[ d(A,B) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2} \]
Distance in Multi-Dimensions: If the coordinates of A are (a1, a2, …, an) and of B are (b1, b2, …, bn):
\[ d(A,B) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \cdots + (a_n - b_n)^2} \]
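The multi-dimensional formula covers the 2D and 3D cases as well, so a single function can serve all three. The following is a minimal sketch in plain Python; the function name `euclidean_distance` is chosen here for illustration and does not come from any particular library.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two points given as equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of coordinates")
    # Sum the squared coordinate differences, then take the square root.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

print(euclidean_distance((0, 0), (3, 4)))        # 2D: prints 5.0
print(euclidean_distance((1, 2, 3), (4, 6, 3)))  # 3D: prints 5.0
```

On Python 3.8 and later, the standard library's `math.dist(a, b)` computes the same quantity.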
Centroids
- A centroid is a point that represents the center or average position of a set of points or observations in a multi-dimensional space. It is often used in clustering algorithms, where the goal is to group similar points or observations into clusters based on their distance or similarity to a central point or centroid.
- In a two-dimensional space, the centroid's x-coordinate is the arithmetic mean of the points' x-coordinates, and its y-coordinate is the arithmetic mean of their y-coordinates. In higher-dimensional spaces, each coordinate of the centroid is the arithmetic mean of that coordinate over all the points in the set.
- The centroid is an important concept in clustering algorithms such as k-means, where the algorithm iteratively assigns each point to its nearest centroid and then moves each centroid to the mean of the points assigned to it. Each of these two steps reduces, or leaves unchanged, the sum of squared distances between the points and their assigned centroids. The final centroids represent the centers of the resulting clusters.
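The centroid definition and the assign-then-update loop described above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than a production implementation: the function names (`centroid`, `assign`, `kmeans`), the `max_iter` parameter, the choice to keep an empty cluster's centroid unchanged, and the sample data are all assumptions made for this example. It relies on `math.dist`, which requires Python 3.8 or later.

```python
import math

def centroid(points):
    """Coordinate-wise arithmetic mean of a non-empty list of points."""
    n = len(points)
    # zip(*points) groups the first coordinates, the second coordinates, etc.
    return tuple(sum(coords) / n for coords in zip(*points))

def assign(points, centroids):
    """For each point, return the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for p in points
    ]

def kmeans(points, initial_centroids, max_iter=100):
    """Alternate assignment and centroid updates until centroids stop moving."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == j]
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(centroid(members) if members else old)
        if new_centroids == centroids:
            break  # converged: no centroid changed
        centroids = new_centroids
    return centroids, assign(points, centroids)

# Hypothetical sample data: two well-separated groups of three points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, labels = kmeans(points, initial_centroids=[(0, 0), (10, 10)])
print(centers)
print(labels)  # prints [0, 0, 0, 1, 1, 1]
```

In this sketch the starting centroids are supplied by the caller. Practical k-means implementations usually pick them randomly or with a seeding heuristic, and the result can depend on that choice.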