The nTARP Package

David Reeping

1. Introduction to the n-TARP clustering method

This R Notebook demonstrates the functionality of the nTARP package, which implements a high-dimensional clustering technique called Thresholding After Random Projections (n-TARP). The method is described in Yellamraju & Boutin (2018) and Tarun (2019). In many datasets, the number of variables (dimensionality) can be much larger than the number of observations, which can make traditional distance-based clustering approaches unreliable or prone to non-convergence. To address this, n-TARP repeatedly projects the high-dimensional dataset \(D\) into one-dimensional spaces \(D'\) and applies k-means to each projection to identify clusters (typically splitting the data into two clusters at each step).

Figure 1: The premise of n-TARP, projecting data from higher dimensions into 1-D using random projections to find the best clustering

The algorithm proceeds as follows:

Let \(x_1, \dots, x_m \in \mathbb{R}^p\) denote the observations in the dataset \(D\), and initialize three lists: \(C\) (to store the individual candidate cluster solutions), \(M\) (to store the cluster membership of each \(x_i\)), and \(W\) (to store the within sum of squares of each candidate solution).

  1. For each projection, \(i = 1, \dots, n\):

    1. Generate a random vector \(r_i \in \mathbb{R}^p\).

    2. Project each observation \(x_j\) onto \(r_i\) using the dot product to form the projected dataset \(D_i'\): \[ x_j^\top r_i \]

    3. Apply 2-means clustering to the projected values in \(D_i'\) to obtain two clusters. Store the cluster membership found in \(M\) and the kmeans object in \(C\).

    4. Compute the normalized within-cluster sum of squares \(W_i\) for this clustering and add it to the list \(W\). To find \(W_i\), we take the within sum of squares of the solution \(\text{WSS}_i\) and divide it by the variance of the projected data \(\sigma_{D_i'}^2\) multiplied by the sample size \(|D|\).

\[ W_i = \frac{\text{WSS}_i}{\sigma_{D_i'}^2 \cdot |D|} \]

  2. End loop.

  3. Identify and store the optimal random vector \(r_{i^*}\) associated with the smallest within sum of squares value \(W_{i^*}\). Set the optimal cluster solution \(C^*\) to the kmeans object associated with \(r_{i^*}\) and \(W_{i^*}\).

  4. Let \(m_1\) be the mean of cluster 1 and \(m_2\) the mean of cluster 2 in the optimal solution. If \(m_1 > m_2\), set the direction indicator \(d\) to 1; otherwise, set \(d\) to 2.

  5. Compute and store the threshold \(t\) that separates the two clusters along the selected projection. The threshold is the halfway point between the extreme values of the two clusters. Let \(D_{i^*}^{(1)}\) be the projected points belonging to cluster 1 and \(D_{i^*}^{(2)}\) the projected points belonging to cluster 2. Then:

\[ t = \begin{cases} \dfrac{\min(D_{i^*}^{(1)}) - \max(D_{i^*}^{(2)})}{2} + \max(D_{i^*}^{(2)}) & \text{if } d = 1 \\[1em] \dfrac{\min(D_{i^*}^{(2)}) - \max(D_{i^*}^{(1)})}{2} + \max(D_{i^*}^{(1)}) & \text{if } d = 2 \end{cases} \]
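The loop above can be sketched in a few lines of base R. This is only an illustrative re-implementation under our own variable names, not the package's internal code:

```r
# Illustrative base-R sketch of n-TARP: random 1-D projections,
# 2-means on each projection, keep the smallest normalized WSS.
set.seed(1)
X <- as.matrix(iris[, 1:4])
best <- list(W = Inf)
for (i in 1:100) {
  r <- rnorm(ncol(X))               # random projection vector r_i
  proj <- as.vector(X %*% r)        # 1-D projected dataset D_i'
  fit <- kmeans(proj, centers = 2)  # 2-means on the projection
  W_i <- fit$tot.withinss / (var(proj) * nrow(X))  # normalized WSS
  if (W_i < best$W) best <- list(W = W_i, r = r, fit = fit, proj = proj)
}
# Direction: the cluster with the larger mean; threshold: midpoint of the gap
d <- which.max(best$fit$centers)
other <- setdiff(1:2, d)
thr <- (min(best$proj[best$fit$cluster == d]) +
        max(best$proj[best$fit$cluster == other])) / 2
```

Note that the normalization here uses var(), whose \(n-1\) denominator differs only negligibly from \(\sigma_{D_i'}^2 \cdot |D|\) for reasonably large samples.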

A flowchart summarizing the process can be seen below in Figure 2.

Figure 2: How the nTARP process works

As you might gather from the loop at the core of the algorithm, several candidate solutions are generated and one is chosen based on the within sum of squares. Although n-TARP has been used sparingly, its applications have been notably diverse. In some cases, researchers have addressed the randomness of the algorithm by tracking cluster membership across iterations and aggregating results into composite profiles. In others, n-TARP has been applied as a conventional clustering tool to identify meaningful splits between participant groups.

A conceptually distinct use of n-TARP is as a classifier. After training, the algorithm yields two key parameters: (1) the optimal random projection associated with the smallest normalized within-cluster sum of squares and (2) the threshold separating clusters along that projection. In some applications, researchers have used n-TARP to classify individuals and then examine how the resulting patterns relate to external outcomes, extending the method beyond description into inferential analysis.

Together, these examples illustrate the flexibility of n-TARP, which is now accessible through this package.

1.1 Installing the package and running the core n-TARP function

Let’s load the package.

library(nTARP)

To illustrate how the method works, we will use the classic iris dataset, which contains three types of flowers. Our goal is to cluster the observations based on the four measured variables.

data(iris)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Note that there are 150 observations with three classes, 50 observations each.

table(iris$Species)
## 
##     setosa versicolor  virginica 
##         50         50         50

The primary function in the package is nTARP. To use it, the main input is the dataset to be clustered.

You must then specify the number of projections, which determines how many random vectors are used to project the data into one-dimensional spaces for clustering. A common starting value is 1,000 projections, increasing to 10,000 or more for datasets with larger sample sizes or higher dimensionality. There is generally no statistical disadvantage to increasing the number of projections—only additional computational time and resource use.

Finally, the withinss_threshold argument sets the maximum normalized within-cluster sum of squares that a candidate solution can have to be considered acceptable. Based on prior work (Yellamraju & Boutin, 2018, IEEE Transactions on Image Processing, 27(4), 1927–1938), a value of 0.36 or lower is often a reasonable benchmark for determining whether a solution contains sufficient structure.

set.seed(123532) #random seed for reproducibility
cluster_solution <- nTARP(data = iris[,1:4],
                          number_of_projections = 1000,
                          withinss_threshold = 0.36
                          )

1.2 Getting the optimal solution from the nTARP function

From here, there are several ways to explore the results. We can examine the outputs one by one.

The first output is the result of the kmeans function corresponding to the best clustering solution identified according to the within-cluster sum of squares criterion. This object contains the usual kmeans components: the cluster assignments, the cluster centers, the total and within-cluster sums of squares, and the cluster sizes.

If you would like to inspect this solution directly, you can access it as follows:

cluster_solution$OptimalSolution
## K-means clustering with 2 clusters of sizes 100, 50
## 
## Cluster means:
##        [,1]
## 1 -0.145632
## 2  3.473701
## 
## Clustering vector:
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 34.651463  9.664858
##  (between_SS / total_SS =  90.8 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
paste("The optimal within sum of squares was", round(cluster_solution$OptimalWithinss,2))
## [1] "The optimal within sum of squares was 0.09"

1.3 Using the results of nTARP as a classifier

The next outputs of the nTARP function are particularly useful for those who want to use nTARP as a classifier — that is, to classify new data points using the best random projection. The three relevant outputs are OptimalProjection, Threshold, and Direction.

- OptimalProjection is the random vector associated with the solution that has the smallest within-cluster sum of squares.
- Threshold specifies the boundary that separates cluster 1 from cluster 2 in the one-dimensional projected space.
- Direction indicates which cluster has the larger mean along the optimal projection.

To classify a new data point, you take its dot product with the OptimalProjection to obtain its 1-D projection, then compare this value to the Threshold:

- If the projected value is greater than the threshold, assign it to the cluster indicated by Direction.
- If it is less than or equal to the threshold, assign it to the other cluster.

This mechanism allows nTARP to be used as a simple linear classifier based on the high-dimensional projections.
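As a sketch, this rule can be wrapped in a small helper function. Note that classify_ntarp is our own hypothetical name, not a function exported by the package; it assumes the OptimalProjection, Threshold, and Direction outputs described above:

```r
# Hypothetical helper (not part of the nTARP package) applying the rule above
classify_ntarp <- function(new_data, solution) {
  proj <- as.vector(as.matrix(new_data) %*% solution$OptimalProjection)
  other <- ifelse(solution$Direction == 1, 2, 1)  # the smaller-mean cluster
  ifelse(proj > solution$Threshold, solution$Direction, other)
}
# e.g., classify_ntarp(iris[, 1:4], cluster_solution)
```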

print("This is the random vector associated with the best cluster.")
## [1] "This is the random vector associated with the best cluster."
cluster_solution$OptimalProjection
## [1]  0.3276245  0.9473305 -0.9327176 -0.2040798
print("This is the boundary where the clusters transition.")
## [1] "This is the boundary where the clusters transition."
cluster_solution$Threshold
## [1] 2.785788
print("The *direction* tells us which cluster has the larger mean, so in this case, points with values larger than the threshold should be assigned to cluster 2.")
## [1] "The *direction* tells us which cluster has the larger mean, so in this case, points with values larger than the threshold should be assigned to cluster 2."
cluster_solution$Direction
## [1] 2

For example, we can take the first row of the iris dataset and see how it would be assigned to a cluster using the nTARP classifier. By applying the OptimalProjection and comparing the projected value to the Threshold, we find that the first observation should be assigned to cluster 2. We can confirm this by checking the optimal solution in cluster_solution$OptimalSolution; indeed, the first observation was assigned to cluster 2!

observation_to_classify <- iris[1,1:4]
observation_after_projection <- sum(observation_to_classify * cluster_solution$OptimalProjection)
print("The resulting 1-D projection is...")
## [1] "The resulting 1-D projection is..."
observation_after_projection 
## [1] 3.639921
paste("Our direction, ", cluster_solution$Direction,", tells us that cluster 2 has a mean bigger than the threshold, so since the projected value ", round(observation_after_projection,2)," is larger than the threshold ",round(cluster_solution$Threshold,2)," we should assign it to cluster 2.", sep="")
## [1] "Our direction, 2, tells us that cluster 2 has a mean bigger than the threshold, so since the projected value 3.64 is larger than the threshold 2.79 we should assign it to cluster 2."

1.4 Checking the clusterability of the dataset

The next output allows us to assess the overall clusterability of the dataset. Since nTARP uses multiple random projections, you can examine the within-cluster sum of squares (withinss) for each candidate solution it attempted. As the number of projections increases, the distribution of withinss values typically approaches an approximately normal distribution. What you want to look for is the proportion of solutions with withinss below the threshold (commonly 0.36): the larger this proportion, the stronger the evidence that the dataset contains meaningful clusters.

hist(cluster_solution$AllWithinss)
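Since AllWithinss stores the normalized within sum of squares for every attempted projection, this proportion can be computed directly; a one-line sketch using the 0.36 benchmark:

```r
# Proportion of candidate solutions below the 0.36 benchmark
mean(cluster_solution$AllWithinss < 0.36)
```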

1.5 The last outputs of the nTARP function

The final outputs provide a list of all the clusters, the original data, and any ID numbers used to keep track of observations; the latter two are used by the functions that extend the bisecting nature of nTARP. If you want to pull out a specific cluster solution, you can do so using the Clusterings output. Note that the solutions are stored as pairs, so the first and second entries together form one solution (as shown in the next chunk). The function is written this way to make the other functionality easier to run; users are not expected to interact with these outputs directly, as they exist mainly for internal calculations.

one_feasible_solution <- list(cluster_solution$Clusterings[[1]],cluster_solution$Clusterings[[2]])
one_feasible_solution 
## [[1]]
##  [1]  51  52  53  54  55  56  57  59  61  63  64  66  68  69  70  72  73  74  75
## [20]  76  77  78  79  80  81  82  83  84  87  88  90  91  92  93  95  96  97  98
## [39] 100 103 104 105 106 108 109 110 111 112 113 117 118 119 120 121 123 124 125
## [58] 126 127 128 129 130 131 132 133 134 135 136 138 140 141 142 144 146 147 148
## 
## [[2]]
##  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19
## [20]  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38
## [39]  39  40  41  42  43  44  45  46  47  48  49  50  58  60  62  65  67  71  85
## [58]  86  89  94  99 101 102 107 114 115 116 122 137 139 143 145 149 150

It should also be noted that nTARP only keeps the solutions meeting the within sum of squares threshold established by the user. As can be seen by dividing the length of the list tracking the cluster solutions by two, there are fewer solutions than the number of projections.

length(cluster_solution$Clusterings)/2
## [1] 949

2. Extending nTARP to form multiple clusters

In most cases, it is desirable to form more than two clusters for a dataset. To support this, the nTARP package includes a function called nTARP_bisecting, which applies nTARP in a recursive bisecting fashion. This function repeatedly splits clusters until a minimum cluster size is reached.

The arguments are the same as for the standard nTARP function, with the addition of a new argument: minimum_cluster_size_percent. This argument provides a stopping rule to prevent clusters from being broken down into very small groups. The function will stop splitting a cluster if its size is smaller than the percentage specified. By default, this threshold is set to 20%.

set.seed(123532) #random seed for reproducibility
cluster_solution_bisecting <- nTARP_bisecting(data = iris[,1:4],
                                    number_of_projections = 1000,
                                    withinss_threshold = 0.36,
                                    minimum_cluster_size_percent = 20 #default value
                                    )
## Finding solution for cluster: 0
## Finding solution for cluster: 01
## Finding solution for cluster: 012
## Finding solution for cluster: 011
## Finding solution for cluster: 02

After running the nTARP_bisecting function, you will obtain three main outputs:

- A list of the individual nTARP solutions (as described in Section 1).
- The best clusters at each step, based on the within-cluster sum of squares.
- All within-cluster sum of squares values for each solution at each step.

The most useful output for users is the BestClusters entry. Examining this output provides information on cluster sizes, cluster labels, and the within-cluster sum of squares for each selected solution.

cluster_solution_bisecting$BestClusters
## $Cluster_0
## $Cluster_0$BestSolutionWSS
## [1] 0.09152546
## 
## $Cluster_0$Cluster1Size
## [1] 100
## 
## $Cluster_0$Cluster2Size
## [1] 50
## 
## $Cluster_0$ClustersForSolution
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1
## 
## $Cluster_0$LabeledClusters
## $Cluster_0$LabeledClusters$`Cluster 1`
##   [1]  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68
##  [19]  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86
##  [37]  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104
##  [55] 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122
##  [73] 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
##  [91] 141 142 143 144 145 146 147 148 149 150
## 
## $Cluster_0$LabeledClusters$`Cluster 2`
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
## 
## 
## 
## $Cluster_01
## $Cluster_01$BestSolutionWSS
## [1] 0.2009998
## 
## $Cluster_01$Cluster1Size
## [1] 48
## 
## $Cluster_01$Cluster2Size
## [1] 52
## 
## $Cluster_01$ClustersForSolution
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 
## $Cluster_01$LabeledClusters
## $Cluster_01$LabeledClusters$`Cluster 1`
##  [1]  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  70
## [20]  71  72  74  75  76  77  78  79  80  81  82  83  85  86  87  88  89  90  91
## [39]  92  93  94  95  96  97  98  99 100 134
## 
## $Cluster_01$LabeledClusters$`Cluster 2`
##  [1]  69  73  84 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116
## [20] 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 135 136
## [39] 137 138 139 140 141 142 143 144 145 146 147 148 149 150
## 
## 
## 
## $Cluster_012
## $Cluster_012$BestSolutionWSS
## [1] 0.2420616
## 
## $Cluster_012$Cluster1Size
## [1] 26
## 
## $Cluster_012$Cluster2Size
## [1] 26
## 
## $Cluster_012$ClustersForSolution
##  [1] 1 1 1 2 1 2 1 2 2 1 1 1 2 1 1 2 1 2 2 1 2 2 1 2 1 2 1 2 1 1 1 2 1 2 2 2 1 2
## [39] 2 1 1 2 2 2 1 2 2 2 1 1 2 1
## 
## $Cluster_012$LabeledClusters
## $Cluster_012$LabeledClusters$`Cluster 1`
##  [1]  69  73  84 102 104 107 108 109 111 112 114 117 120 122 124 126 127 128 130
## [20] 135 138 139 143 147 148 150
## 
## $Cluster_012$LabeledClusters$`Cluster 2`
##  [1] 101 103 105 106 110 113 115 116 118 119 121 123 125 129 131 132 133 136 137
## [20] 140 141 142 144 145 146 149
## 
## 
## 
## $Cluster_011
## $Cluster_011$BestSolutionWSS
## [1] 0.2148706
## 
## $Cluster_011$Cluster1Size
## [1] 31
## 
## $Cluster_011$Cluster2Size
## [1] 17
## 
## $Cluster_011$ClustersForSolution
##  [1] 1 1 1 2 1 1 1 2 1 2 2 1 2 1 2 1 1 1 2 1 2 1 1 1 1 1 1 2 2 2 2 1 1 1 2 1 2 1
## [39] 1 2 2 1 1 1 1 2 1 1
## 
## $Cluster_011$LabeledClusters
## $Cluster_011$LabeledClusters$`Cluster 1`
##  [1]  51  52  53  55  56  57  59  62  64  66  67  68  71  74  75  76  77  78  79
## [20]  85  86  87  89  91  92  95  96  97  98 100 134
## 
## $Cluster_011$LabeledClusters$`Cluster 2`
##  [1] 54 58 60 61 63 65 70 72 80 81 82 83 88 90 93 94 99
## 
## 
## 
## $Cluster_02
## $Cluster_02$BestSolutionWSS
## [1] 0.2419245
## 
## $Cluster_02$Cluster1Size
## [1] 22
## 
## $Cluster_02$Cluster2Size
## [1] 28
## 
## $Cluster_02$ClustersForSolution
##  [1] 1 1 2 2 2 2 2 1 2 1 1 2 1 2 1 2 2 2 1 2 1 2 2 2 2 1 2 1 1 2 1 1 2 2 1 1 1 2
## [39] 2 1 2 1 2 2 2 1 2 2 1 1
## 
## $Cluster_02$LabeledClusters
## $Cluster_02$LabeledClusters$`Cluster 1`
##  [1]  1  2  8 10 11 13 15 19 21 26 28 29 31 32 35 36 37 40 42 46 49 50
## 
## $Cluster_02$LabeledClusters$`Cluster 2`
##  [1]  3  4  5  6  7  9 12 14 16 17 18 20 22 23 24 25 27 30 33 34 38 39 41 43 44
## [26] 45 47 48

Ultimately, we want to obtain the final cluster labels for each observation. This can be done using the build_solution_from_labeled_clusters function, which returns a data frame detailing how each datapoint was clustered.

The function requires two inputs:

- The BestClusters output from the nTARP_bisecting function.
- A vector of IDs to identify each observation.

If you don’t have specific IDs to track observations, you can simply use 1:nrow(x), where x is your dataset. This assigns sequential numeric labels to each row.

labeled_clusters <- build_solution_from_labeled_clusters(nTARP_best_clusters= cluster_solution_bisecting$BestClusters, ids = 1:nrow(iris))
labeled_clusters[sample(1:150,10,replace = FALSE),] #We'll take a sampling of the observations
##      ID ClusterPath Cluster_0 Cluster_01 Cluster_011 Cluster_012 Cluster_02
## 3     3          22         2         NA          NA          NA          2
## 40   40          21         2         NA          NA          NA          1
## 83   83         112         1          1           2          NA         NA
## 38   38          22         2         NA          NA          NA          2
## 78   78         111         1          1           1          NA         NA
## 131 131         122         1          2          NA           2         NA
## 119 119         122         1          2          NA           2         NA
## 44   44          22         2         NA          NA          NA          2
## 9     9          22         2         NA          NA          NA          2
## 10   10          21         2         NA          NA          NA          1
##     FinalClusterID
## 3                6
## 40               5
## 83               2
## 38               6
## 78               1
## 131              4
## 119              4
## 44               6
## 9                6
## 10               5

When we inspect the result of build_solution_from_labeled_clusters, we can see that the data frame keeps track of which solutions each observation participated in using a numeric coding system.

For example, a code like Cluster_012 means:

- The observation was in cluster 1 during the first split.
- Then, when cluster 1 was subdivided, the observation ended up in cluster 2 of the next split.

Finally, these cluster path codes are translated into a single integer to record the final cluster assignment for each observation.

2.1 Trimming clusters

If you think that the resulting clusters are too small after reviewing the output of build_solution_from_labeled_clusters, we have a simple function called consolidate_clusters that can be used. Let’s see what we have here relative to the class labels we expected from this dataset:

table(labeled_clusters$FinalClusterID,iris$Species)
##    
##     setosa versicolor virginica
##   1      0         30         1
##   2      0         17         0
##   3      0          3        23
##   4      0          0        26
##   5     22          0         0
##   6     28          0         0

It appears that nTARP was able to cluster the three flower types reasonably well, but it split clusters too finely. To reconcile this, we can use the consolidate_clusters function.

The only inputs required are:

- The output from build_solution_from_labeled_clusters.
- The two clusters you want to combine.

The function returns a data frame with the clusters merged, and we can reassign the labeled_clusters variable with this updated result:

labeled_clusters <- consolidate_clusters(cluster_path_matrix = labeled_clusters,
                                         first_cluster_to_combine = 1,
                                         second_cluster_to_combine = 2)
table(labeled_clusters$FinalClusterID,iris$Species)
##    
##     setosa versicolor virginica
##   1      0          3        23
##   2      0          0        26
##   3     22          0         0
##   4     28          0         0
##   5      0         47         1

Comparing this table to the previous one, we can see that we lost one cluster as a result of the merging. We can repeat this two more times.

#Merge clusters two more times
labeled_clusters <- consolidate_clusters(cluster_path_matrix = labeled_clusters,
                                         first_cluster_to_combine = 1,
                                         second_cluster_to_combine = 2)
labeled_clusters <- consolidate_clusters(cluster_path_matrix = labeled_clusters,
                                         first_cluster_to_combine = 1,
                                         second_cluster_to_combine = 2)
table(labeled_clusters$FinalClusterID,iris$Species)
##    
##     setosa versicolor virginica
##   1      0         47         1
##   2      0          3        49
##   3     50          0         0

After this consolidation step, we find that the clustering separated the flower types with 97% accuracy.

2.2 Using a contextual variable to support clustering

Because of how nTARP operates, we are effectively taking multiple random “slices” of the same dataset and obtaining different candidate clusterings. At its core, cluster analysis is descriptive—we are building a useful representation of the structure in our data so that we can ultimately act on what we learn about subgroups.

But what if there is a variable you would like to use to guide the clustering process?

Instead of selecting the best solution based solely on the within-cluster sum of squares, we can modify the objective function to evaluate how well candidate clusterings separate observations with respect to a contextual variable. In other words, we can allow nTARP to search for structure that is meaningful relative to some external criterion.

More specifically, we can approach clustering in a way similar to a decision tree.

In the traditional CART algorithm, splits are chosen to maximize separation of predefined class labels by minimizing node “impurity” (e.g., Gini index). Similarly, in nTARP, we can evaluate candidate clusterings based on how well they separate observations according to a contextual variable, such as a known class label.

To evaluate the internal validity of a clustering solution and examine how well clusters differ on a given variable, we use an impurity metric known as the Gini index.

Originally developed to measure income inequality, the Gini index has since been widely adopted in machine learning, particularly in training decision tree models. In clustering contexts, it provides a measure of how homogeneous each cluster is with respect to a categorical variable.

The Gini index for a cluster \(t\), denoted as \(\text{Gini}(t)\), is defined as:

\[ \text{Gini}(t) = 1 - \sum_{i=1}^{c} p(i \mid t)^2 \]

where \(c\) is the number of categories of the variable and \(p(i \mid t)\) is the proportion of observations in cluster \(t\) belonging to category \(i\).

Lower Gini values indicate greater purity (i.e., more homogeneous clusters), while higher values indicate greater impurity.

To assess how well a clustering solution separates observations based on a particular variable, we compute the gain, defined as the reduction in impurity from the full dataset (parent node) to the set of clusters.

The gain is calculated as:

\[ \text{Gain} = \text{Gini}(\text{Parent}) - \sum_{t=1}^{g} \frac{n_t}{n} \, \text{Gini}(t) \]

where \(g\) is the number of clusters in the solution, \(n_t\) is the number of observations in cluster \(t\), and \(n\) is the total number of observations.

The gain represents the weighted reduction in impurity achieved by the clustering solution. Larger gain values indicate that the clusters more effectively separate observations on the chosen variable.

The maximum possible gain is equal to \(\text{Gini}(\text{Parent})\), which would occur if each cluster were perfectly pure with respect to the variable (i.e., all \(\text{Gini}(t) = 0\)).
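These quantities are straightforward to compute by hand. The helpers below are our own illustrative sketch, not functions from the package:

```r
# Gini impurity of a categorical label vector
gini <- function(labels) {
  p <- table(labels) / length(labels)   # class proportions p(i | t)
  1 - sum(p^2)
}

# Gain: parent impurity minus the size-weighted impurity of the clusters
gini_gain <- function(clusters, labels) {
  n <- length(labels)
  weighted <- sum(tapply(labels, clusters,
                         function(x) length(x) / n * gini(x)))
  gini(labels) - weighted
}

# A perfectly pure split of two balanced classes recovers the parent
# impurity: gini_gain(c(1, 1, 2, 2), c("a", "a", "b", "b")) is 0.5.
```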

To illustrate the process, Figure 3 presents a simplified flowchart of the bisecting nTARP algorithm, highlighting its general workflow without delving into the finer details.

Figure 3: How the nTARP process works when iteratively breaking down clusters with and without a contextual variable

Let’s see how this works for the iris dataset, where the goal is to cluster observations with respect to the Species variable. We use the exact same function as before, but now we add the contextual variable:

set.seed(123532) #random seed for reproducibility
cluster_solution_bisecting_contextual <- nTARP_bisecting(data = iris[,1:4],
                                                         number_of_projections = 1000,
                                                         withinss_threshold = 0.36,
                                                         minimum_cluster_size_percent = 20,
                                                         contextual_variable = iris$Species
                                                         )
## Finding solution for cluster: 0
## Finding solution for cluster: 01
## Finding solution for cluster: 011
labeled_clusters <- build_solution_from_labeled_clusters(nTARP_best_clusters= cluster_solution_bisecting_contextual$BestClusters, ids = 1:nrow(iris))

If it’s desirable to do so, you can append the contextual data to the output of build_solution_from_labeled_clusters using the optional argument contextual_variables_df. Here, we can just append the iris data as an example.

labeled_clusters_with_context <- build_solution_from_labeled_clusters(nTARP_best_clusters= cluster_solution_bisecting_contextual$BestClusters, ids = 1:nrow(iris), contextual_variables_df = iris)
head(labeled_clusters_with_context)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species ID ClusterPath
## 1          5.1         3.5          1.4         0.2  setosa  1           2
## 2          4.9         3.0          1.4         0.2  setosa  2           2
## 3          4.7         3.2          1.3         0.2  setosa  3           2
## 4          4.6         3.1          1.5         0.2  setosa  4           2
## 5          5.0         3.6          1.4         0.2  setosa  5           2
## 6          5.4         3.9          1.7         0.4  setosa  6           2
##   Cluster_0 Cluster_01 Cluster_011 FinalClusterID
## 1         2         NA          NA              4
## 2         2         NA          NA              4
## 3         2         NA          NA              4
## 4         2         NA          NA              4
## 5         2         NA          NA              4
## 6         2         NA          NA              4

Let’s take a look at the results…

table(labeled_clusters$FinalClusterID,iris$Species)
##    
##     setosa versicolor virginica
##   1      0          0        36
##   2      0          2        14
##   3      0         48         0
##   4     50          0         0

First, when using a contextual variable, you may notice that the nTARP function is less aggressive about splitting clusters unnecessarily. In fact, in both runs the first split already identified a cluster containing only a single class; with the contextual variable, the algorithm recognizes that further splitting of that cluster is unnecessary and stops appropriately.

Without a contextual variable, the function may continue splitting, potentially creating clusters that are too fine. In the author’s experience, over-splitting is generally safer than under-splitting, as the resulting clusters can always be merged later.

To clean up the solution, we can use the consolidate_clusters function. For example, we can merge clusters 1 and 2 to ensure that all virginica observations are grouped together:

labeled_clusters <- consolidate_clusters(cluster_path_matrix = labeled_clusters,
                                         first_cluster_to_combine = 1,
                                         second_cluster_to_combine = 2)
table(labeled_clusters$FinalClusterID,iris$Species)
##    
##     setosa versicolor virginica
##   1      0         48         0
##   2     50          0         0
##   3      0          2        50

You’ll notice that this contextual variable made our clustering even better, with 99% accuracy.

By recursively bisecting clusters and assessing cluster “quality” through metrics like normalized within-cluster sum of squares or Gini-based purity gains, nTARP can uncover structure in high-dimensional data while controlling for minimum cluster size. The approach is flexible, allowing post-hoc merging of clusters, and is particularly well-suited for applications where standard clustering algorithms struggle with sparse, high-dimensional, or noisy data.

3. Another high dimensional example

To close out the demonstration, we can apply nTARP to one more dataset: the wine dataset from the HDclassif package. It contains 178 observations of wines derived from three different cultivars grown in the same region of Italy. Each wine is described by 13 continuous chemical measurements, such as alcohol content, flavanoids, magnesium, color intensity, and proline, along with a class label indicating cultivar membership.

library(HDclassif)
## Warning: package 'HDclassif' was built under R version 4.5.2
## Loading required package: MASS
data(wine)
head(wine)
##   class    V1   V2   V3   V4  V5   V6   V7   V8   V9  V10  V11  V12  V13
## 1     1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
## 2     1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
## 3     1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
## 4     1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
## 5     1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93  735
## 6     1 14.20 1.76 2.45 15.2 112 3.27 3.39 0.34 1.97 6.75 1.05 2.85 1450
wine[,2:14] <- scale(wine[,2:14])

Like the iris dataset, the wine dataset contains three distinct classes. However, unlike the iris data—where the classes are perfectly balanced—the wine dataset is moderately imbalanced. The imbalance is not extreme, but it is worth noting, as class imbalance can influence both clustering behavior and how we interpret separation quality.

table(wine$class)
## 
##  1  2  3 
## 59 71 48

We’ll try the bisecting nTARP approach without a contextual variable first. We’ll also use more projections (100^2 = 10,000) to give the random search a better chance of finding good splits in 13 dimensions.

set.seed(123532) #random seed for reproducibility
cluster_solution_bisecting_wine <- nTARP_bisecting(data = wine[,2:14],
                                    number_of_projections = 100^2,
                                    withinss_threshold = 0.36,
                                    minimum_cluster_size_percent = 20 #default value
                                    )
## Finding solution for cluster: 0
## Finding solution for cluster: 02
## Finding solution for cluster: 022
## Finding solution for cluster: 021
## Finding solution for cluster: 01
labeled_clusters_wine <- build_solution_from_labeled_clusters(nTARP_best_clusters= cluster_solution_bisecting_wine$BestClusters, ids = 1:nrow(wine))
table(labeled_clusters_wine$FinalClusterID,wine$class)
##    
##      1  2  3
##   1  0  3 25
##   2  0  1 23
##   3  0 28  0
##   4  2 31  0
##   5 38  1  0
##   6 19  7  0

Examining the table, we see that clusters 1+2, 3+4, and 5+6 appear to be reasonable combinations.

#First of three merges: combine clusters 1 and 2
print("This is the result of combining clusters 1 and 2, during which 3-6 were relabeled to be 1-4 with the merged cluster being labeled 5.")
## [1] "This is the result of combining clusters 1 and 2, during which 3-6 were relabeled to be 1-4 with the merged cluster being labeled 5."
labeled_clusters_wine <- consolidate_clusters(cluster_path_matrix = labeled_clusters_wine,
                                         first_cluster_to_combine = 1,
                                         second_cluster_to_combine = 2)
table(labeled_clusters_wine$FinalClusterID,wine$class)
##    
##      1  2  3
##   1  0 28  0
##   2  2 31  0
##   3 38  1  0
##   4 19  7  0
##   5  0  4 48
#Merge 1 and 2 again
print("This is the result of combining clusters 3 and 4 (which were relabeled 1 and 2 in the last step). Now there are 4 clusters left.")
## [1] "This is the result of combining clusters 3 and 4 (which were relabeled 1 and 2 in the last step). Now there are 4 clusters left."
labeled_clusters_wine <- consolidate_clusters(cluster_path_matrix = labeled_clusters_wine,
                                         first_cluster_to_combine = 1,
                                         second_cluster_to_combine = 2)
table(labeled_clusters_wine$FinalClusterID,wine$class)
##    
##      1  2  3
##   1 38  1  0
##   2 19  7  0
##   3  0  4 48
##   4  2 59  0
#Merge 1 and 2 again

print("Finally, we combine clusters 1 and 2 (originally clusters 5 and 6), leaving 3 clusters at the end.")
## [1] "Finally, we combine clusters 1 and 2 (originally clusters 5 and 6), leaving 3 clusters at the end."
labeled_clusters_wine <- consolidate_clusters(cluster_path_matrix = labeled_clusters_wine,
                                         first_cluster_to_combine = 1,
                                         second_cluster_to_combine = 2)
table(labeled_clusters_wine$FinalClusterID,wine$class)
##    
##      1  2  3
##   1  0  4 48
##   2  2 59  0
##   3 57  8  0

Now that we’ve reduced the solution to three clusters, we get a majority-class accuracy of about 92% (164 of 178 observations). Let’s see if this improves with the contextual variable.

set.seed(123532) #random seed for reproducibility
cluster_solution_bisecting_wine <- nTARP_bisecting(data = wine[,2:14],
                                    number_of_projections = 100^2,
                                    withinss_threshold = 0.36,
                                    minimum_cluster_size_percent = 20,
                                    contextual_variable = wine$class
                                    )
## Finding solution for cluster: 0
## Finding solution for cluster: 02
## Finding solution for cluster: 021
## Finding solution for cluster: 01
labeled_clusters_wine <- build_solution_from_labeled_clusters(nTARP_best_clusters= cluster_solution_bisecting_wine$BestClusters, ids = 1:nrow(wine))
table(labeled_clusters_wine$FinalClusterID,wine$class)
##    
##      1  2  3
##   1 55  0  0
##   2  4  2  0
##   3  0  0 40
##   4  0  2  8
##   5  0 67  0

Examining the table, we see that clusters 1+2 and 3+4 appear to be reasonable combinations.

#First of two merges: combine clusters 1 and 2
print("This is the result of combining clusters 1 and 2, during which 3-5 were relabeled to be 1-3 with the merged cluster being labeled 4.")
## [1] "This is the result of combining clusters 1 and 2, during which 3-5 were relabeled to be 1-3 with the merged cluster being labeled 4."
labeled_clusters_wine <- consolidate_clusters(cluster_path_matrix = labeled_clusters_wine,
                                         first_cluster_to_combine = 1,
                                         second_cluster_to_combine = 2)
table(labeled_clusters_wine$FinalClusterID,wine$class)
##    
##      1  2  3
##   1  0  0 40
##   2  0  2  8
##   3  0 67  0
##   4 59  2  0
#Merge 1 and 2 again
print("Next, we combine clusters 1 and 2 (originally clusters 3 and 4), leaving 3 clusters at the end.")
## [1] "Next, we combine clusters 1 and 2 (originally clusters 3 and 4), leaving 3 clusters at the end."
labeled_clusters_wine <- consolidate_clusters(cluster_path_matrix = labeled_clusters_wine,
                                         first_cluster_to_combine = 1,
                                         second_cluster_to_combine = 2)
table(labeled_clusters_wine$FinalClusterID,wine$class)
##    
##      1  2  3
##   1  0 67  0
##   2 59  2  0
##   3  0  2 48

Now that we’ve reduced the solution to three clusters, the clustering aligns closely with the true wine cultivars, yielding an accuracy of about 98% (174 of 178 observations). This demonstrates that nTARP, especially when combined with post-processing (e.g., consolidate_clusters), can recover meaningful structure even in moderately imbalanced datasets.

Importantly, the contextual variable does not have to be a class label; it can be any categorical variable of interest. Using a contextual variable allows nTARP to guide the clustering toward patterns that are most relevant for a particular analytical or predictive goal, much like how decision trees split nodes to maximize purity with respect to a target variable.
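
To make the decision-tree analogy concrete, here is a small sketch of a Gini-based purity gain for a candidate binary split. The function names (`gini`, `purity_gain`) are illustrative helpers, not part of the nTARP package:

```r
# Gini impurity of a label vector: 1 minus the sum of squared class proportions.
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Purity gain of a candidate split: parent impurity minus the size-weighted
# average impurity of the two child clusters. Larger gain = purer split.
purity_gain <- function(labels, membership) {
  n <- length(labels)
  children <- split(labels, membership)
  child_impurity <- sum(sapply(children, function(l) (length(l) / n) * gini(l)))
  gini(labels) - child_impurity
}

# A split that perfectly separates two balanced classes achieves the
# maximum possible gain of 0.5 for this parent.
purity_gain(c("a", "a", "b", "b"), c(1, 1, 2, 2))   # 0.5
```

A contextual variable scored this way rewards splits whose children are dominated by a single category, which is why the bisecting runs above stopped earlier and produced cleaner clusters when `contextual_variable` was supplied.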