Clustering algorithms group together items that are similar to one another. Many industries can benefit from this kind of unsupervised learning: retailers want to group similar customers for targeted ad campaigns, biologists want to find plants that share similar characteristics, and more. We are going to explore whether it is appropriate to use clustering algorithms to group medical patients.
We are going to look at anonymized data from patients who have been diagnosed with heart disease. Patients with similar characteristics might respond to the same treatments, and doctors would benefit from learning about the outcomes of patients similar to those they are treating. The data we are analyzing comes from the V.A. Medical Center in Long Beach, CA.
Before beginning a project, it is important to get an idea of what the patient data looks like. In addition, the clustering algorithms used below require numeric data, so we need to check whether the patient data requires any transformations. You will also be brushing up on your base R skills along the way.
Code
# loading the data
heart_disease = read.csv("heart_disease_patients.csv")

# print the first ten rows of the data set
head(heart_disease, 10)
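A quick structural check, sketched here in base R, can confirm that every column is numeric before we cluster:

Code

# confirm that every column is numeric; the clustering below assumes it
sapply(heart_disease, class)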
It is important to conduct some exploratory data analysis to familiarize ourselves with the data before clustering. This will help us learn more about the variables and make an informed decision about whether we should scale the data. Because k-means and hierarchical clustering measure similarity between points using a distance formula, they can place extra emphasis on variables that have a larger scale and thus larger differences between points.
Exploratory data analysis helps us to understand the characteristics of the patients in the data. We need to get an idea of the value ranges of the variables and their distributions. This will also be helpful when we evaluate the clusters of patients from the algorithms. Are there more patients of one gender? What might an outlier look like?
Code
# evidence that the data should be scaled?
summary(heart_disease)
id age sex cp
Min. : 1.0 Min. :29.00 Min. :0.0000 Min. :1.000
1st Qu.: 76.5 1st Qu.:48.00 1st Qu.:0.0000 1st Qu.:3.000
Median :152.0 Median :56.00 Median :1.0000 Median :3.000
Mean :152.0 Mean :54.44 Mean :0.6799 Mean :3.158
3rd Qu.:227.5 3rd Qu.:61.00 3rd Qu.:1.0000 3rd Qu.:4.000
Max. :303.0 Max. :77.00 Max. :1.0000 Max. :4.000
trestbps chol fbs restecg
Min. : 94.0 Min. :126.0 Min. :0.0000 Min. :0.0000
1st Qu.:120.0 1st Qu.:211.0 1st Qu.:0.0000 1st Qu.:0.0000
Median :130.0 Median :241.0 Median :0.0000 Median :1.0000
Mean :131.7 Mean :246.7 Mean :0.1485 Mean :0.9901
3rd Qu.:140.0 3rd Qu.:275.0 3rd Qu.:0.0000 3rd Qu.:2.0000
Max. :200.0 Max. :564.0 Max. :1.0000 Max. :2.0000
thalach exang oldpeak slope
Min. : 71.0 Min. :0.0000 Min. :0.00 Min. :1.000
1st Qu.:133.5 1st Qu.:0.0000 1st Qu.:0.00 1st Qu.:1.000
Median :153.0 Median :0.0000 Median :0.80 Median :2.000
Mean :149.6 Mean :0.3267 Mean :1.04 Mean :1.601
3rd Qu.:166.0 3rd Qu.:1.0000 3rd Qu.:1.60 3rd Qu.:2.000
Max. :202.0 Max. :1.0000 Max. :6.20 Max. :3.000
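The summary above already hints at why scaling matters: chol spans hundreds of units while sex only takes the values 0 and 1. As a toy sketch (the two patient rows below are made up for illustration, not taken from the data), Euclidean distance is dominated by the larger-scale variable:

Code

# two hypothetical patients: 40 mg/dl apart in chol, opposite in sex
# the distance (~40.01) is driven almost entirely by chol
dist(rbind(c(chol = 240, sex = 0), c(chol = 280, sex = 1)))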
Code
# remove the id column since it carries no clinical information
heart_disease = heart_disease[ , !(names(heart_disease) %in% c('id'))]

# scaling the data (note that scale() returns a matrix)
scaled = scale(heart_disease)

# what does the data look like now?
summary(scaled)
age sex cp trestbps
Min. :-2.8145 Min. :-1.4549 Min. :-2.2481 Min. :-2.14149
1st Qu.:-0.7124 1st Qu.:-1.4549 1st Qu.:-0.1650 1st Qu.:-0.66420
Median : 0.1727 Median : 0.6851 Median :-0.1650 Median :-0.09601
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
3rd Qu.: 0.7259 3rd Qu.: 0.6851 3rd Qu.: 0.8765 3rd Qu.: 0.47218
Max. : 2.4961 Max. : 0.6851 Max. : 0.8765 Max. : 3.88132
chol fbs restecg thalach
Min. :-2.3310 Min. :-0.4169 Min. :-0.995103 Min. :-3.4364
1st Qu.:-0.6894 1st Qu.:-0.4169 1st Qu.:-0.995103 1st Qu.:-0.7041
Median :-0.1100 Median :-0.4169 Median : 0.009951 Median : 0.1483
Mean : 0.0000 Mean : 0.0000 Mean : 0.000000 Mean : 0.0000
3rd Qu.: 0.5467 3rd Qu.:-0.4169 3rd Qu.: 1.015005 3rd Qu.: 0.7166
Max. : 6.1283 Max. : 2.3905 Max. : 1.015005 Max. : 2.2904
exang oldpeak slope
Min. :-0.6955 Min. :-0.8954 Min. :-0.9747
1st Qu.:-0.6955 1st Qu.:-0.8954 1st Qu.:-0.9747
Median :-0.6955 Median :-0.2064 Median : 0.6480
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 1.4331 3rd Qu.: 0.4827 3rd Qu.: 0.6480
Max. : 1.4331 Max. : 4.4445 Max. : 2.2708
3. Let’s start grouping patients
Once we’ve figured out whether we need to modify the data and made any necessary changes, we can start the clustering process. For the k-means algorithm, the number of clusters must be selected in advance.
It is also important to make sure that your results are reproducible when conducting a statistical analysis. This means that when someone runs your code on the same data, they will get the same results as you reported. Therefore, if you’re conducting an analysis that has a random aspect, it is necessary to set a seed to ensure reproducibility.
Reproducibility is especially important since doctors will potentially be using our results to treat patients. It is vital that another analyst can see where the groups come from and be able to verify the results.
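As a minimal sketch of what setting a seed buys us (using the same seed value as the run below; run_a and run_b are throwaway names), re-seeding and repeating the call reproduces the assignments exactly:

Code

# the same seed yields identical cluster assignments on the same data
set.seed(10)
run_a = kmeans(scaled, centers = 5, nstart = 1)
set.seed(10)
run_b = kmeans(scaled, centers = 5, nstart = 1)
identical(run_a$cluster, run_b$cluster)  # TRUE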
Code
# set the seed so that results are reproducible
seed_val = 10
set.seed(seed_val)

# select a number of clusters
k = 5

# run the k-means algorithm
first_clust = kmeans(scaled, centers = k, nstart = 1)

# how many patients are in each group?
first_clust$size
[1] 66 43 88 61 45
4. Another round of k-means
Because the k-means algorithm initially selects the cluster centers by randomly selecting points, different iterations of the algorithm can result in different clusters being created. If the algorithm is truly grouping together similar observations (as opposed to clustering noise), then cluster assignments will be somewhat robust between different iterations of the algorithm.
With regard to the heart disease data, this would mean that the same patients would be grouped together even when the algorithm is initialized at different random points. If patients do not end up in similar clusters across runs of the algorithm, then the clustering method isn’t picking up on meaningful relationships between them.
We’re going to explore how the patients are grouped together with another iteration of the k-means algorithm. We will then be able to compare the resulting groups of patients.
Code
# set the seed
seed_val = 38
set.seed(seed_val)

# run the k-means algorithm
k = 5
second_clust = kmeans(scaled, k, nstart = 1)

# how many patients are in each group?
second_clust$size
[1] 65 43 61 46 88
5. Comparing patient clusters
It is important that the clusters resulting from the k-means algorithm are stable. Even though the algorithm begins by randomly initializing the cluster centers, if the k-means algorithm is the right choice for the data, then different initializations of the algorithm will result in similar clusters.
The clusters from different iterations may not be exactly the same, but the clusters should be roughly the same size and have similar distributions of variables. If there is a lot of change in clusters between different iterations of the algorithm, then k-means clustering is not a good choice for the data.
It is not possible to validate the clusters obtained from an algorithm against a ground truth, since there is no true labeling for patients. Thus, it is necessary to examine how the clusters change between different iterations of the algorithm. We’re going to use some visualizations to get an idea of the clusters’ stability. That way we can see how certain patient characteristics may have been used to group patients together.
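Before plotting, a simple cross-tabulation of the two runs gives a rough numeric read on stability. Cluster labels are arbitrary, so a stable result shows each row concentrated in a single column rather than along the diagonal; a sketch:

Code

# cross-tabulate assignments from the two k-means runs above
table(first_clust$cluster, second_clust$cluster)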
Code
# adding cluster assignments to the data
heart_disease['first_clust'] = first_clust$cluster
heart_disease['second_clust'] = second_clust$cluster

# load ggplot2
library(ggplot2)

# creating the plot of age and chol for the first clustering algorithm
plot_one = ggplot(heart_disease, aes(x = age, y = chol, color = as.factor(first_clust))) +
  geom_point()
plot_one
Code
# creating the plot of age and chol for the second clustering algorithm
plot_two = ggplot(heart_disease, aes(x = age, y = chol, color = as.factor(second_clust))) +
  geom_point()
plot_two
6. Hierarchical clustering: another clustering approach
An alternative to k-means clustering is hierarchical clustering. This method works well when the data has a nested structure. It is possible that the data from heart disease patients follows this type of structure. For example, if men are more likely to exhibit certain characteristics, those characteristics might be nested within the sex variable. Hierarchical clustering also does not require the number of clusters to be selected before running the algorithm.
Clusters can be selected by using the dendrogram. The dendrogram allows one to see how similar observations are to one another, and it is useful for selecting the number of clusters to group the data into. It is now time for us to see how hierarchical clustering groups the data.
Code
# executing hierarchical clustering with complete linkage
hier_clust_1 = hclust(dist(scaled), method = 'complete')

# printing the dendrogram
plot(hier_clust_1)
Code
# getting cluster assignments based on number of selected clusters
hc_1_assign <- cutree(hier_clust_1, 5)
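To see where the five-cluster cut falls on the tree, base R’s rect.hclust() can outline the clusters directly on the dendrogram; a minimal sketch:

Code

# redraw the dendrogram and box the five clusters selected above
plot(hier_clust_1)
rect.hclust(hier_clust_1, k = 5)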
7. Hierarchical clustering round two
In hierarchical clustering, there are multiple ways to measure the dissimilarity between clusters of observations. Complete linkage records the largest dissimilarity between any two points in the two clusters being compared. Single linkage, on the other hand, records the smallest dissimilarity between any two points in the clusters. Different linkages will result in different clusters being formed.
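As a toy sketch of how the linkages differ (the four 2-D points below are made up for illustration, not taken from the patient data), the same distance matrix can merge clusters at different heights under each rule:

Code

# four toy points: two tight vertical pairs, far apart horizontally
toy = matrix(c(0, 0,
               0, 1,
               5, 0,
               5, 1.5), ncol = 2, byrow = TRUE)
d = dist(toy)

# complete linkage merges at the largest pairwise gap between clusters,
# single linkage at the smallest
plot(hclust(d, method = 'complete'))
plot(hclust(d, method = 'single'))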
We want to explore different algorithms for grouping our heart disease patients. The smallest difference between any two patients might be the best measure of dissimilarity to minimize when merging clusters. It is always a good idea to explore different dissimilarity measures. Let’s implement hierarchical clustering using a new linkage function.
Code
# executing hierarchical clustering with single linkage
hier_clust_2 = hclust(dist(scaled), method = 'single')

# printing the dendrogram
plot(hier_clust_2)
Code
# getting cluster assignments based on number of selected clusters
hc_2_assign <- cutree(hier_clust_2, 5)
8. Comparing clustering results
The doctors are interested in grouping similar patients together in order to determine appropriate treatments. They therefore want clusters that contain more than a few patients, so that different treatment options can be compared. While it is possible for a patient to end up in a cluster by themselves, such a singleton cluster gives doctors no comparable patients whose outcomes could inform treatment.
As with the k-means algorithm, the way to evaluate the clusters is to investigate which patients are being grouped together. Are there patterns evident in the cluster assignments or do they seem to be groups of noise? We’re going to examine the clusters resulting from the two hierarchical algorithms.
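A quick first check, sketched with the assignments computed above, is to count how many patients land in each cluster under the two linkages; single linkage in particular tends to chain points into one large cluster plus a few stragglers:

Code

# cluster sizes under complete and single linkage
table(hc_1_assign)
table(hc_2_assign)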
Code
# adding assignments of chosen hierarchical linkage
heart_disease['hc_clust'] = hc_1_assign

# remove the 'sex', 'first_clust', and 'second_clust' variables
hd_simple = heart_disease[, !(names(heart_disease) %in% c('sex', 'first_clust', 'second_clust'))]

# getting mean and standard deviation summary statistics
clust_summary = do.call(data.frame,
                        aggregate(. ~ hc_clust, data = hd_simple,
                                  function(x) c(avg = mean(x), sd = sd(x))))
clust_summary
9. Visualizing the cluster contents
In addition to looking at the distributions of the variables in each hierarchical cluster, we will make visualizations to evaluate the algorithms. Even though the data has more than two dimensions, we can get an idea of how the data clusters by looking at a scatterplot of two variables. We want to look for patterns that appear in the data and see which patients get clustered together.
Code
# plotting age and chol
plot_one = ggplot(hd_simple, aes(x = age, y = chol, color = as.factor(hc_clust))) +
  geom_point()
plot_one
Code
# plotting oldpeak and trestbps
plot_two = ggplot(hd_simple, aes(x = oldpeak, y = trestbps, color = as.factor(hc_clust))) +
  geom_point()
plot_two
10. Conclusion
Now that we’ve tried out multiple clustering algorithms, it is necessary to determine if we think any of them will work for clustering our patients. For the k-means algorithm, it is imperative that similar clusters are produced for each iteration of the algorithm. We want to make sure that the algorithm is clustering signal as opposed to noise.
For the sake of the doctors, we also want multiple patients in each group so they can compare treatments. We have only done some preliminary work to explore the performance of the algorithms; it would be worth creating more visualizations and exploring how the algorithms group other variables. Based on the above analysis, are there any algorithms that you would want to investigate further for grouping patients? Remember that it is important that the k-means algorithm appears stable across multiple runs.
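As one final sketch (the seed values below are arbitrary), looping over a few seeds repeats the stability check from earlier; sorting the sizes makes runs comparable even though cluster labels are arbitrary:

Code

# compare sorted cluster sizes across several random initializations
for (s in c(10, 38, 76)) {
  set.seed(s)
  print(sort(kmeans(scaled, centers = 5, nstart = 1)$size))
}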