9 Unsupervised machine learning

9.1 Clustering

Clustering is a broad set of techniques for finding subgroups of observations within a data set. When we cluster observations, we want observations in the same group to be similar and observations in different groups to be dissimilar. Because there is no response variable, clustering is an unsupervised method: it seeks relationships among the observations without being guided by an outcome. Clustering allows us to identify which observations are alike and to group them accordingly.

9.2 Distance measures

The classification of observations into groups requires a method for computing the distance or (dis)similarity between each pair of observations. The result of this computation is known as a dissimilarity or distance matrix.

The choice of distance measure is a critical step in clustering: it defines how the similarity of two elements (x, y) is calculated and it influences the shape of the resulting clusters. The classical distance measures are the Euclidean and Manhattan distances, which are defined as follows:

Euclidean distance:

\(d_{euc}(x,y) = \sqrt{\sum^n_{i=1}(x_i - y_i)^2}\)

Manhattan distance:

\(d_{man}(x,y) = \sum^n_{i=1}|x_i - y_i|\)

where \(x\) and \(y\) are two vectors of length \(n\).

For most common clustering software, the default distance measure is the Euclidean distance. However, depending on the type of data and the research question, other dissimilarity measures might be preferred, and you should be aware of the options.
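As a concrete illustration, both measures can be computed directly from their definitions, or with base R's dist() function, which returns the distance matrix described above. The toy vectors below are made up for this example.

```r
# Two toy observations of length n = 4 (made-up values)
x <- c(1, 2, 3, 4)
y <- c(2, 4, 1, 3)

# Euclidean distance: square root of the sum of squared differences
sqrt(sum((x - y)^2))
#> [1] 3.162278

# Manhattan distance: sum of absolute differences
sum(abs(x - y))
#> [1] 6

# The same measures via base R's dist(), which returns a distance matrix
dist(rbind(x, y), method = "euclidean")
dist(rbind(x, y), method = "manhattan")
```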

9.3 K-means clustering

K-means clustering is the most commonly used unsupervised machine learning algorithm for partitioning a given data set into a set of \(k\) groups (i.e., \(k\) clusters), where \(k\) represents the number of groups pre-specified by the researcher. It classifies objects into multiple groups (i.e., clusters), such that objects within the same cluster are as similar as possible (i.e., high intra-class similarity), whereas objects from different clusters are as dissimilar as possible (i.e., low inter-class similarity). In K-means clustering, each cluster is represented by its center (i.e., centroid), which corresponds to the mean of the points assigned to the cluster.

There are several k-means algorithms available. The standard algorithm is the Hartigan-Wong algorithm (1979), which defines the within-cluster variation as the sum of squared Euclidean distances between the items and the corresponding centroid:

\(W(C_k) = \sum_{x_i \in C_k}(x_i - \mu_k)^2\)

where:

\(x_i\) is a data point belonging to the cluster \(C_k\), and \(\mu_k\) is the mean value of the points assigned to the cluster \(C_k\).
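As a quick numeric illustration of \(W(C_k)\), the sketch below computes it for a small, made-up cluster of three two-dimensional points.

```r
# W(C_k) for one made-up cluster (rows = observations, columns = variables)
Ck <- rbind(c(1, 2),
            c(2, 1),
            c(3, 3))

mu_k <- colMeans(Ck)               # the centroid: per-variable means
W_Ck <- sum(sweep(Ck, 2, mu_k)^2)  # sum of squared distances to the centroid
W_Ck
#> [1] 4
```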

Each observation is assigned to a cluster such that the sum of squared (SS) distances from the observation to its assigned cluster center is minimized.

We can define the total within-cluster variation as follows:

\(SS_{total.within} = \sum^K_{k=1}W(C_k) = \sum^K_{k=1}\sum_{x_i \in C_k}(x_i - \mu_k)^2\)

where \(K\) is the number of clusters.

The total within-cluster sum of squares measures the compactness (i.e., goodness) of the clustering, and we want it to be as small as possible.
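Extending the per-cluster computation above, the following sketch computes \(SS_{total.within}\) for a made-up data set and a made-up cluster assignment.

```r
# Total within-cluster sum of squares for a made-up data set and
# a made-up assignment of each row to one of two clusters
X <- rbind(c(1, 2), c(2, 1), c(3, 3),   # assigned to cluster 1
           c(8, 9), c(9, 8))            # assigned to cluster 2
cluster <- c(1, 1, 1, 2, 2)

ss_total_within <- sum(sapply(unique(cluster), function(k) {
  Ck   <- X[cluster == k, , drop = FALSE]
  mu_k <- colMeans(Ck)
  sum(sweep(Ck, 2, mu_k)^2)             # W(C_k) for cluster k
}))
ss_total_within
#> [1] 5
```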

In general, the K-means algorithm proceeds as follows (a sketch in R follows the list):

  1. Specify the number of clusters (\(K\)) to be created.
  2. Randomly select \(K\) objects from the data set to serve as the initial cluster centers (means).
  3. Assign each observation to its closest centroid, based on the Euclidean distance between the observation and the centroid.
  4. For each of the \(K\) clusters, update the cluster centroid by calculating the new mean of all the data points in the cluster. The centroid of the \(k^{th}\) cluster is a vector of length \(p\) containing the means of all variables for the observations in that cluster, where \(p\) is the number of variables.
  5. Iteratively minimize the total within-cluster sum of squares (defined above) by repeating steps 3 and 4 until the cluster assignments stop changing or the maximum number of iterations is reached.
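To make the steps concrete, here is a minimal, illustrative translation of this loop into R. The function name simple_kmeans and its details are just for exposition; it is not a production implementation (for example, it does not handle empty clusters), and in practice you would call R's built-in kmeans() (Section 9.4).

```r
# A minimal, illustrative K-means loop following steps 1-5 above.
# X: numeric matrix (rows = observations); K: number of clusters.
simple_kmeans <- function(X, K, max_iter = 100) {
  # Step 2: pick K random observations as the initial centroids
  centroids  <- X[sample(nrow(X), K), , drop = FALSE]
  assignment <- integer(nrow(X))

  for (iter in seq_len(max_iter)) {
    # Step 3: assign each observation to its closest centroid (Euclidean).
    # dist() on the stacked matrix gives all pairwise distances; we keep
    # only the observation-to-centroid block.
    d <- as.matrix(dist(rbind(centroids, X)))[-(1:K), 1:K]
    new_assignment <- unname(apply(d, 1, which.min))

    # Step 5: stop once the cluster assignments no longer change
    if (identical(new_assignment, assignment)) break
    assignment <- new_assignment

    # Step 4: recompute each centroid as the mean of its assigned points
    # (empty clusters are not handled in this sketch)
    for (k in seq_len(K)) {
      centroids[k, ] <- colMeans(X[assignment == k, , drop = FALSE])
    }
  }
  list(cluster = assignment, centers = centroids)
}
```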

9.4 K-means clustering in R

To be completed later…

Check out the following website for some examples in R: https://uc-r.github.io/kmeans_clustering
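In the meantime, here is a minimal sketch of what the typical workflow looks like with base R's kmeans() function, using the built-in iris data purely for illustration.

```r
# K-means with base R's kmeans() on the built-in iris measurements
set.seed(123)                 # the random starts make results stochastic
X <- scale(iris[, 1:4])       # standardize the variables first

km <- kmeans(X, centers = 3, nstart = 25)  # 25 random starts, keep the best

head(km$cluster)              # cluster assignment for each observation
km$centers                    # the three centroids
km$tot.withinss               # total within-cluster sum of squares

# Compare the clusters against the known species labels
table(km$cluster, iris$Species)
```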