The nTARP Package

David Reeping

1. Introduction to the n-TARP clustering method

This R Notebook demonstrates the functionality of the nTARP package, which implements a high-dimensional clustering technique called Thresholding After Random Projections (n-TARP). The method is described in Yellamraju & Boutin (2018) and Tarun (2019). In many datasets, the number of variables (dimensionality) can be much larger than the number of observations, which can make traditional distance-based clustering approaches unreliable or prone to non-convergence. To address this, n-TARP repeatedly projects the high-dimensional dataset \(D\) into one-dimensional spaces \(D'\) and applies k-means to each projection to identify clusters (typically splitting the data into two clusters at each step).

Figure 1: The premise of n-TARP, projecting data from higher dimensions into 1-D using random projections to find the best clustering

The algorithm proceeds as follows:

Let \(x_1, \dots, x_m \in \mathbb{R}^p\) denote the observations in the dataset \(D\), and initialize three lists: \(C\) (to store the individual candidate cluster solutions), \(M\) (to store the cluster membership of each \(x_i\)), and \(W\) (to store the within sum of squares of each candidate solution).

  1. For each projection, \(i = 1, \dots, n\):

    1. Generate a random vector \(r_i \in \mathbb{R}^p\).

    2. Project each observation \(x_j\) onto \(r_i\) using the dot product to form the projected dataset \(D_i'\): \[ x_j^\top r_i \]

    3. Apply 2-means clustering to the projected values in \(D_i'\) to obtain two clusters. Store the cluster membership found in \(M\) and the kmeans object in \(C\).

    4. Compute the normalized within-cluster sum of squares \(W_i\) for this clustering and add it to the list \(W\). To find \(W_i\), we take the within sum of squares of the solution \(\text{WSS}_i\) and divide it by the variance of the projected data \(\sigma_{D_i'}^2\) multiplied by the sample size \(|D|\).

\[ W_i = \frac{\text{WSS}_i}{\sigma_{D_i'}^2 \cdot |D|} \]

  2. End loop.

  3. Identify and store the optimal random vector \(r_{i^*}\) associated with the smallest within sum of squares value \(W_{i^*}\). Set the optimal cluster solution \(C^*\) to the kmeans object associated with \(r_{i^*}\) and \(W_{i^*}\).

  4. Let \(m_1\) be the mean of cluster 1 and \(m_2\) the mean of cluster 2 in the optimal solution. If \(m_1 > m_2\), set the direction indicator \(d\) to 1; otherwise, set \(d\) to 2.

  5. Compute and store the threshold \(t\) that separates the two clusters along the selected projection. The threshold is the halfway point between the extreme values of the two clusters. Let \(D_{i^*}^{(1)}\) be the projected points belonging to cluster 1 and \(D_{i^*}^{(2)}\) the projected points belonging to cluster 2. Then:

\[ t = \begin{cases} \dfrac{\min(D_{i^*}^{(1)}) - \max(D_{i^*}^{(2)})}{2} + \max(D_{i^*}^{(2)}) & \text{if } d = 1 \\[1em] \dfrac{\min(D_{i^*}^{(2)}) - \max(D_{i^*}^{(1)})}{2} + \max(D_{i^*}^{(1)}) & \text{if } d = 2 \end{cases} \]
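The loop above can be sketched in a few lines of base R. This is only an illustrative re-implementation under our own variable names, not the package's internal code:

```r
# Illustrative base-R sketch of n-TARP: random 1-D projections,
# 2-means on each projection, keep the smallest normalized WSS.
set.seed(1)
X <- as.matrix(iris[, 1:4])
best <- list(W = Inf)
for (i in 1:100) {
  r <- rnorm(ncol(X))               # random projection vector r_i
  proj <- as.vector(X %*% r)        # 1-D projected dataset D_i'
  fit <- kmeans(proj, centers = 2)  # 2-means on the projection
  W_i <- fit$tot.withinss / (var(proj) * nrow(X))  # normalized WSS
  if (W_i < best$W) best <- list(W = W_i, r = r, fit = fit, proj = proj)
}
# Direction: the cluster with the larger mean; threshold: midpoint of the gap
d <- which.max(best$fit$centers)
other <- setdiff(1:2, d)
thr <- (min(best$proj[best$fit$cluster == d]) +
        max(best$proj[best$fit$cluster == other])) / 2
```

Note that the normalization here uses var(), whose \(n-1\) denominator differs only negligibly from \(\sigma_{D_i'}^2 \cdot |D|\) for reasonably large samples.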

A flowchart summarizing the process can be seen below in Figure 2.

Figure 2: How the nTARP process works

As you might gather from the loop at the core of the algorithm, several candidate solutions are generated and one is chosen based on the within sum of squares. Although n-TARP has been used sparingly, its applications have been notably diverse. In some cases, researchers have addressed the randomness of the algorithm by tracking cluster membership across iterations and aggregating results into composite profiles. In others, n-TARP has been applied as a conventional clustering tool to identify meaningful splits between participant groups.

A conceptually distinct use of n-TARP is as a classifier. After training, the algorithm yields two key parameters: (1) the optimal random projection associated with the smallest normalized within-cluster sum of squares and (2) the threshold separating clusters along that projection. In some applications, researchers have used n-TARP to classify individuals and then examine how the resulting patterns relate to external outcomes, extending the method beyond description into inferential analysis.

Together, these examples illustrate the flexibility of n-TARP, which is now accessible through this package.

1.1 Installing the package and running the core n-TARP function

Let’s load the package.

library(nTARP)

To illustrate how the method works, we will use the classic iris dataset, which contains three types of flowers. Our goal is to cluster the observations based on the four measured variables.

data(iris)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Note that there are 150 observations with three classes, 50 observations each.

table(iris$Species)
## 
##     setosa versicolor  virginica 
##         50         50         50

The primary function in the package is nTARP. To use it, the main input is the dataset to be clustered.

You must then specify the number of projections, which determines how many random vectors are used to project the data into one-dimensional spaces for clustering. A common starting value is 1,000 projections, increasing to 10,000 or more for datasets with larger sample sizes or higher dimensionality. There is generally no statistical disadvantage to increasing the number of projections—only additional computational time and resource use.

Finally, the withinss_threshold argument sets the maximum normalized within-cluster sum of squares that a candidate solution can have to be considered acceptable. Based on prior work (Yellamraju & Boutin, 2018, IEEE Transactions on Image Processing, 27(4), 1927–1938), a value of 0.36 or lower is often a reasonable benchmark for determining whether a solution contains sufficient structure.

set.seed(123532) #random seed for reproducibility
cluster_solution <- nTARP(data = iris[,1:4],
                          number_of_projections = 1000,
                          withinss_threshold = 0.36
                          )

1.2 Getting the optimal solution from the nTARP function

From here, there are several ways to explore the results. We can examine the outputs one by one.

The first output is the result of the kmeans function corresponding to the best clustering solution identified according to the within-cluster sum of squares criterion. This object contains the usual kmeans components: the cluster assignments, the cluster centers, the total and within-cluster sums of squares, and the cluster sizes.

If you would like to inspect this solution directly, you can access it as follows:

cluster_solution$OptimalSolution
## K-means clustering with 2 clusters of sizes 100, 50
## 
## Cluster means:
##        [,1]
## 1 -0.145632
## 2  3.473701
## 
## Clustering vector:
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 34.651463  9.664858
##  (between_SS / total_SS =  90.8 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
paste("The optimal within sum of squares was", round(cluster_solution$OptimalWithinss,2))
## [1] "The optimal within sum of squares was 0.09"

1.3 Using the results of nTARP as a classifier

The next outputs of the nTARP function are particularly useful for those who want to use nTARP as a classifier — that is, to classify new data points using the best random projection. The three relevant outputs are OptimalProjection, Threshold, and Direction.

- OptimalProjection is the random vector associated with the solution that has the smallest within-cluster sum of squares.
- Threshold specifies the boundary that separates cluster 1 from cluster 2 in the one-dimensional projected space.
- Direction indicates which cluster has the larger mean along the optimal projection.

To classify a new data point, you take its dot product with the OptimalProjection to obtain its 1-D projection, then compare this value to the Threshold:

- If the projected value is greater than the threshold, assign it to the cluster indicated by Direction.
- If it is less than or equal to the threshold, assign it to the other cluster.

This mechanism allows nTARP to be used as a simple linear classifier based on the high-dimensional projections.
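As a sketch, this rule can be wrapped in a small helper function. Note that classify_ntarp is our own hypothetical name, not a function exported by the package; it assumes the OptimalProjection, Threshold, and Direction outputs described above:

```r
# Hypothetical helper (not part of the nTARP package) applying the rule above
classify_ntarp <- function(new_data, solution) {
  proj <- as.vector(as.matrix(new_data) %*% solution$OptimalProjection)
  other <- ifelse(solution$Direction == 1, 2, 1)  # the smaller-mean cluster
  ifelse(proj > solution$Threshold, solution$Direction, other)
}
# e.g., classify_ntarp(iris[, 1:4], cluster_solution)
```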

print("This is the random vector associated with the best cluster.")
## [1] "This is the random vector associated with the best cluster."
cluster_solution$OptimalProjection
## [1]  0.3276245  0.9473305 -0.9327176 -0.2040798
print("This is the boundary where the clusters transition.")
## [1] "This is the boundary where the clusters transition."
cluster_solution$Threshold
## [1] 2.785788
print("The *direction* tells us which cluster has the larger mean, so in this case, points with values larger than the threshold should be assigned to cluster 2.")
## [1] "The *direction* tells us which cluster has the larger mean, so in this case, points with values larger than the threshold should be assigned to cluster 2."
cluster_solution$Direction
## [1] 2

For example, we can take the first row of the iris dataset and see how it would be assigned to a cluster using the nTARP classifier. By applying the OptimalProjection and comparing the projected value to the Threshold, we find that the first observation should be assigned to cluster 2. We can confirm this by checking the optimal solution in cluster_solution$OptimalSolution; indeed, the first observation was assigned to cluster 2!

observation_to_classify <- iris[1,1:4]
observation_after_projection <- sum(observation_to_classify * cluster_solution$OptimalProjection)
print("The resulting 1-D projection is...")
## [1] "The resulting 1-D projection is..."
observation_after_projection 
## [1] 3.639921
paste("Our direction, ", cluster_solution$Direction,", tells us that cluster 2 has a mean bigger than the threshold, so since the projected value ", round(observation_after_projection,2)," is larger than the threshold ",round(cluster_solution$Threshold,2)," we should assign it to cluster 2.", sep="")
## [1] "Our direction, 2, tells us that cluster 2 has a mean bigger than the threshold, so since the projected value 3.64 is larger than the threshold 2.79 we should assign it to cluster 2."

1.4 Checking the clusterability of the dataset

The next output allows us to assess the overall clusterability of the dataset. Since nTARP uses multiple random projections, you can examine the within-cluster sum of squares (withinss) for each candidate solution it attempted. As the number of projections increases, the distribution of withinss values typically approaches an approximately normal distribution. What you want to look for is the proportion of solutions with withinss below the threshold (commonly 0.36): the larger this proportion, the stronger the evidence that the dataset contains meaningful clusters.

hist(cluster_solution$AllWithinss)
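Since AllWithinss stores the normalized within sum of squares for every attempted projection, this proportion can be computed directly; a one-line sketch using the 0.36 benchmark:

```r
# Proportion of candidate solutions below the 0.36 benchmark
mean(cluster_solution$AllWithinss < 0.36)
```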

1.5 The last outputs of the nTARP function

The final outputs provide a list of all the clusters, the original data, and any ID numbers used to keep track of observations; the latter two are used by the functions that extend the bisecting nature of nTARP. If you want to pull out a specific cluster solution, you can do so using the Clusterings output. Note that the solutions are stored as pairs, so the first and second entries together form one solution (as shown in the next chunk). The function is written this way to make the other functionality easier to run; users are not expected to interact with these outputs directly, as they exist mainly for internal calculations.

one_feasible_solution <- list(cluster_solution$Clusterings[[1]],cluster_solution$Clusterings[[2]])
one_feasible_solution 
## [[1]]
##  [1]  51  52  53  54  55  56  57  59  61  63  64  66  68  69  70  72  73  74  75
## [20]  76  77  78  79  80  81  82  83  84  87  88  90  91  92  93  95  96  97  98
## [39] 100 103 104 105 106 108 109 110 111 112 113 117 118 119 120 121 123 124 125
## [58] 126 127 128 129 130 131 132 133 134 135 136 138 140 141 142 144 146 147 148
## 
## [[2]]
##  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19
## [20]  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38
## [39]  39  40  41  42  43  44  45  46  47  48  49  50  58  60  62  65  67  71  85
## [58]  86  89  94  99 101 102 107 114 115 116 122 137 139 143 145 149 150

It should also be noted that nTARP only keeps the solutions meeting the within sum of squares threshold established by the user. As can be seen by dividing the length of the list tracking the cluster solutions by two, there are fewer solutions than the number of projections.

length(cluster_solution$Clusterings)/2
## [1] 949

2. Extending nTARP to form multiple clusters

In most cases, it is desirable to form more than two clusters for a dataset. To support this, the nTARP package includes a function called nTARP_bisecting, which applies nTARP in a recursive bisecting fashion. This function repeatedly splits clusters until a minimum cluster size is reached.

The arguments are the same as for the standard nTARP function, with the addition of a new argument: minimum_cluster_size_percent. This argument provides a stopping rule to prevent clusters from being broken down into very small groups. The function will stop splitting a cluster if its size is smaller than the percentage specified. By default, this threshold is set to 20%.

set.seed(123532) #random seed for reproducibility
cluster_solution_bisecting <- nTARP_bisecting(data = iris[,1:4],
                                    number_of_projections = 1000,
                                    withinss_threshold = 0.36,
                                    minimum_cluster_size_percent = 20 #default value
                                    )
## Finding solution for cluster: 0
## Finding solution for cluster: 01
## Finding solution for cluster: 012
## Finding solution for cluster: 011
## Finding solution for cluster: 02

After running the nTARP_bisecting function, you will obtain three main outputs:

- A list of the individual nTARP solutions (as described in Section 1).
- The best clusters at each step, based on the within-cluster sum of squares.
- All within-cluster sum of squares values for each solution at each step.

The most useful output for users is the BestClusters entry. Examining this output provides information on cluster sizes, cluster labels, and the within-cluster sum of squares for each selected solution.

cluster_solution_bisecting$BestClusters
## $Cluster_0
## $Cluster_0$BestSolutionWSS
## [1] 0.09152546
## 
## $Cluster_0$Cluster1Size
## [1] 100
## 
## $Cluster_0$Cluster2Size
## [1] 50
## 
## $Cluster_0$ClustersForSolution
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1
## 
## $Cluster_0$LabeledClusters
## $Cluster_0$LabeledClusters$`Cluster 1`
##   [1]  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68
##  [19]  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86
##  [37]  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104
##  [55] 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122
##  [73] 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
##  [91] 141 142 143 144 145 146 147 148 149 150
## 
## $Cluster_0$LabeledClusters$`Cluster 2`
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
## 
## 
## 
## $Cluster_01
## $Cluster_01$BestSolutionWSS
## [1] 0.2009998
## 
## $Cluster_01$Cluster1Size
## [1] 48
## 
## $Cluster_01$Cluster2Size
## [1] 52
## 
## $Cluster_01$ClustersForSolution
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 
## $Cluster_01$LabeledClusters
## $Cluster_01$LabeledClusters$`Cluster 1`
##  [1]  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  70
## [20]  71  72  74  75  76  77  78  79  80  81  82  83  85  86  87  88  89  90  91
## [39]  92  93  94  95  96  97  98  99 100 134
## 
## $Cluster_01$LabeledClusters$`Cluster 2`
##  [1]  69  73  84 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116
## [20] 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 135 136
## [39] 137 138 139 140 141 142 143 144 145 146 147 148 149 150
## 
## 
## 
## $Cluster_012
## $Cluster_012$BestSolutionWSS
## [1] 0.2420616
## 
## $Cluster_012$Cluster1Size
## [1] 26
## 
## $Cluster_012$Cluster2Size
## [1] 26
## 
## $Cluster_012$ClustersForSolution
##  [1] 1 1 1 2 1 2 1 2 2 1 1 1 2 1 1 2 1 2 2 1 2 2 1 2 1 2 1 2 1 1 1 2 1 2 2 2 1 2
## [39] 2 1 1 2 2 2 1 2 2 2 1 1 2 1
## 
## $Cluster_012$LabeledClusters
## $Cluster_012$LabeledClusters$`Cluster 1`
##  [1]  69  73  84 102 104 107 108 109 111 112 114 117 120 122 124 126 127 128 130
## [20] 135 138 139 143 147 148 150
## 
## $Cluster_012$LabeledClusters$`Cluster 2`
##  [1] 101 103 105 106 110 113 115 116 118 119 121 123 125 129 131 132 133 136 137
## [20] 140 141 142 144 145 146 149
## 
## 
## 
## $Cluster_011
## $Cluster_011$BestSolutionWSS
## [1] 0.2148706
## 
## $Cluster_011$Cluster1Size
## [1] 31
## 
## $Cluster_011$Cluster2Size
## [1] 17
## 
## $Cluster_011$ClustersForSolution
##  [1] 1 1 1 2 1 1 1 2 1 2 2 1 2 1 2 1 1 1 2 1 2 1 1 1 1 1 1 2 2 2 2 1 1 1 2 1 2 1
## [39] 1 2 2 1 1 1 1 2 1 1
## 
## $Cluster_011$LabeledClusters
## $Cluster_011$LabeledClusters$`Cluster 1`
##  [1]  51  52  53  55  56  57  59  62  64  66  67  68  71  74  75  76  77  78  79
## [20]  85  86  87  89  91  92  95  96  97  98 100 134
## 
## $Cluster_011$LabeledClusters$`Cluster 2`
##  [1] 54 58 60 61 63 65 70 72 80 81 82 83 88 90 93 94 99
## 
## 
## 
## $Cluster_02
## $Cluster_02$BestSolutionWSS
## [1] 0.2419245
## 
## $Cluster_02$Cluster1Size
## [1] 22
## 
## $Cluster_02$Cluster2Size
## [1] 28
## 
## $Cluster_02$ClustersForSolution
##  [1] 1 1 2 2 2 2 2 1 2 1 1 2 1 2 1 2 2 2 1 2 1 2 2 2 2 1 2 1 1 2 1 1 2 2 1 1 1 2
## [39] 2 1 2 1 2 2 2 1 2 2 1 1
## 
## $Cluster_02$LabeledClusters
## $Cluster_02$LabeledClusters$`Cluster 1`
##  [1]  1  2  8 10 11 13 15 19 21 26 28 29 31 32 35 36 37 40 42 46 49 50
## 
## $Cluster_02$LabeledClusters$`Cluster 2`
##  [1]  3  4  5  6  7  9 12 14 16 17 18 20 22 23 24 25 27 30 33 34 38 39 41 43 44
## [26] 45 47 48

Ultimately, we want to obtain the final cluster labels for each observation. This can be done using the build_solution_from_labeled_clusters function, which returns a data frame detailing how each datapoint was clustered.

The function requires two inputs:

- The BestClusters output from the nTARP_bisecting function.
- A vector of IDs to identify each observation.

If you don’t have specific IDs to track observations, you can simply use 1:nrow(x), where x is your dataset. This assigns sequential numeric labels to each row.

labeled_clusters <- build_solution_from_labeled_clusters(nTARP_best_clusters= cluster_solution_bisecting$BestClusters, ids = 1:nrow(iris))
labeled_clusters[sample(1:150,10,replace = FALSE),] #We'll take a sampling of the observations
##      ID ClusterPath Cluster_0 Cluster_01 Cluster_011 Cluster_012 Cluster_02
## 3     3          22         2         NA          NA          NA          2
## 40   40          21         2         NA          NA          NA          1
## 83   83         112         1          1           2          NA         NA
## 38   38          22         2         NA          NA          NA          2
## 78   78         111         1          1           1          NA         NA
## 131 131         122         1          2          NA           2         NA
## 119 119         122         1          2          NA           2         NA
## 44   44          22         2         NA          NA          NA          2
## 9     9          22         2         NA          NA          NA          2
## 10   10          21         2         NA          NA          NA          1
##     FinalClusterID
## 3                6
## 40               5
## 83               2
## 38               6
## 78               1
## 131              4
## 119              4
## 44               6
## 9                6
## 10               5

When we inspect the result of build_solution_from_labeled_clusters, we can see that the data frame keeps track of which solutions each observation participated in using a numeric coding system.

For example, a code like Cluster_012 means:

- The observation was in cluster 1 during the first split.
- Then, when cluster 1 was subdivided, the observation ended up in cluster 2 of the next split.

Finally, these cluster path codes are translated into a single integer to record the final cluster assignment for each observation.

2.1 Trimming clusters

If you think that the resulting clusters are too small after reviewing the output of build_solution_from_labeled_clusters, we have a simple function called consolidate_clusters that can be used. Let’s see what we have here relative to the class labels we expected from this dataset:

table(labeled_clusters$FinalClusterID,iris$Species)
##    
##     setosa versicolor virginica
##   1      0         30         1
##   2      0         17         0
##   3      0          3        23
##   4      0          0        26
##   5     22          0         0
##   6     28          0         0

It appears that nTARP was able to cluster the three flower types reasonably well, but it split clusters too finely. To reconcile this, we can use the consolidate_clusters function.

The only inputs required are:

- The output from build_solution_from_labeled_clusters.
- The two clusters you want to combine.

The function returns a data frame with the clusters merged, and we can reassign the labeled_clusters variable with this updated result:

labeled_clusters <- consolidate_clusters(cluster_path_matrix = labeled_clusters,
                                         first_cluster_to_combine = 1,
                                         second_cluster_to_combine = 2)
table(labeled_clusters$FinalClusterID,iris$Species)
##    
##     setosa versicolor virginica
##   1      0          3        23
##   2      0          0        26
##   3     22          0         0
##   4     28          0         0
##   5      0         47         1

Comparing this table to the previous one, we can see that we lost one cluster as a result of the merging. We can repeat this two more times.

#Merge clusters two more times
labeled_clusters <- consolidate_clusters(cluster_path_matrix = labeled_clusters,
                                         first_cluster_to_combine = 1,
                                         second_cluster_to_combine = 2)
labeled_clusters <- consolidate_clusters(cluster_path_matrix = labeled_clusters,
                                         first_cluster_to_combine = 1,
                                         second_cluster_to_combine = 2)
table(labeled_clusters$FinalClusterID,iris$Species)
##    
##     setosa versicolor virginica
##   1      0         47         1
##   2      0          3        49
##   3     50          0         0

After this consolidation step, we find that the clustering separated the flower types with 97% accuracy.

2.2 Using a contextual variable to support clustering

Because of how nTARP operates, we are effectively taking multiple random “slices” of the same dataset and obtaining different candidate clusterings. At its core, cluster analysis is descriptive—we are building a useful representation of the structure in our data so that we can ultimately act on what we learn about subgroups.

But what if there is a variable you would like to use to guide the clustering process?

Instead of selecting the best solution based solely on the within-cluster sum of squares, we can modify the objective function to evaluate how well candidate clusterings separate observations with respect to a contextual variable. In other words, we can allow nTARP to search for structure that is meaningful relative to some external criterion.

More specifically, we can approach clustering in a way similar to a decision tree.

In the traditional CART algorithm, splits are chosen to maximize separation of predefined class labels by minimizing node “impurity” (e.g., Gini index). Similarly, in nTARP, we can evaluate candidate clusterings based on how well they separate observations according to a contextual variable, such as a known class label.

To evaluate the internal validity of a clustering solution and examine how well clusters differ on a given variable, we use an impurity metric known as the Gini index.

Originally developed to measure income inequality, the Gini index has since been widely adopted in machine learning, particularly in training decision tree models. In clustering contexts, it provides a measure of how homogeneous each cluster is with respect to a categorical variable.

The Gini index for a cluster \(t\), denoted as \(\text{Gini}(t)\), is defined as:

\[ \text{Gini}(t) = 1 - \sum_{i=1}^{c} p(i \mid t)^2 \]

where \(c\) is the number of categories of the variable and \(p(i \mid t)\) is the proportion of observations in cluster \(t\) belonging to category \(i\).

Lower Gini values indicate greater purity (i.e., more homogeneous clusters), while higher values indicate greater impurity.

To assess how well a clustering solution separates observations based on a particular variable, we compute the gain, defined as the reduction in impurity from the full dataset (parent node) to the set of clusters.

The gain is calculated as:

\[ \text{Gain} = \text{Gini}(\text{Parent}) - \sum_{t=1}^{g} \frac{n_t}{n} \, \text{Gini}(t) \]

where \(g\) is the number of clusters in the solution, \(n_t\) is the number of observations in cluster \(t\), and \(n\) is the total number of observations.

The gain represents the weighted reduction in impurity achieved by the clustering solution. Larger gain values indicate that the clusters more effectively separate observations on the chosen variable.

The maximum possible gain is equal to \(\text{Gini}(\text{Parent})\), which would occur if each cluster were perfectly pure with respect to the variable (i.e., all \(\text{Gini}(t) = 0\)).
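These quantities are straightforward to compute by hand. The helpers below are our own illustrative sketch, not functions from the package:

```r
# Gini impurity of a categorical label vector
gini <- function(labels) {
  p <- table(labels) / length(labels)   # class proportions p(i | t)
  1 - sum(p^2)
}

# Gain: parent impurity minus the size-weighted impurity of the clusters
gini_gain <- function(clusters, labels) {
  n <- length(labels)
  weighted <- sum(tapply(labels, clusters,
                         function(x) length(x) / n * gini(x)))
  gini(labels) - weighted
}

# A perfectly pure split of two balanced classes recovers the parent
# impurity: gini_gain(c(1, 1, 2, 2), c("a", "a", "b", "b")) is 0.5.
```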

To illustrate the process, Figure 3 presents a simplified flowchart of the bisecting nTARP algorithm, highlighting its general workflow without delving into the finer details.

Figure 3: How the nTARP process works when iteratively breaking down clusters with and without a contextual variable

Let’s see how this works for the iris dataset, where the goal is to cluster observations with respect to the Species variable. We use the exact same function as before, but now we add the contextual variable:

set.seed(123532) #random seed for reproducibility
cluster_solution_bisecting_contextual <- nTARP_bisecting(data = iris[,1:4],
                                                         number_of_projections = 1000,
                                                         withinss_threshold = 0.36,
                                                         minimum_cluster_size_percent = 20,
                                                         contextual_variable = iris$Species
                                                         )
## Finding solution for cluster: 0
## Finding solution for cluster: 01
## Finding solution for cluster: 011
labeled_clusters <- build_solution_from_labeled_clusters(nTARP_best_clusters= cluster_solution_bisecting_contextual$BestClusters, ids = 1:nrow(iris))

If it’s desirable to do so, you can append the contextual data to the output of build_solution_from_labeled_clusters using the optional argument contextual_variables_df. Here, we can just append the iris data as an example.

labeled_clusters_with_context <- build_solution_from_labeled_clusters(nTARP_best_clusters= cluster_solution_bisecting_contextual$BestClusters, ids = 1:nrow(iris), contextual_variables_df = iris)
head(labeled_clusters_with_context)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species ID ClusterPath
## 1          5.1         3.5          1.4         0.2  setosa  1           2
## 2          4.9         3.0          1.4         0.2  setosa  2           2
## 3          4.7         3.2          1.3         0.2  setosa  3           2
## 4          4.6         3.1          1.5         0.2  setosa  4           2
## 5          5.0         3.6          1.4         0.2  setosa  5           2
## 6          5.4         3.9          1.7         0.4  setosa  6           2
##   Cluster_0 Cluster_01 Cluster_011 FinalClusterID
## 1         2         NA          NA              4
## 2         2         NA          NA              4
## 3         2         NA          NA              4
## 4         2         NA          NA              4
## 5         2         NA          NA              4
## 6         2         NA          NA              4

Let’s take a look at the results…

table(labeled_clusters$FinalClusterID,iris$Species)
##    
##     setosa versicolor virginica
##   1      0          0        36
##   2      0          2        14
##   3      0         48         0
##   4     50          0         0

First, when using a contextual variable, you may notice that the nTARP function is less aggressive about splitting clusters unnecessarily. In fact, in both runs the first split already identified a cluster containing only a single class; with the contextual variable, the algorithm recognizes that further splitting of that cluster is unnecessary and stops appropriately.

Without a contextual variable, the function may continue splitting, potentially creating clusters that are too fine. In the author’s experience, over-splitting is generally safer than under-splitting, as the resulting clusters can always be merged later.

To clean up the solution, we can use the consolidate_clusters function. For example, we can merge clusters 1 and 2 to ensure that all virginica observations are grouped together:

labeled_clusters <- consolidate_clusters(cluster_path_matrix = labeled_clusters,
                                         first_cluster_to_combine = 1,
                                         second_cluster_to_combine = 2)
table(labeled_clusters$FinalClusterID,iris$Species)
##    
##     setosa versicolor virginica
##   1      0         48         0
##   2     50          0         0
##   3      0          2        50

You’ll notice that this contextual variable made our clustering even better, with 99% accuracy.

By recursively bisecting clusters and assessing cluster “quality” through metrics like normalized within-cluster sum of squares or Gini-based purity gains, nTARP can uncover structure in high-dimensional data while controlling for minimum cluster size. The approach is flexible, allowing post-hoc merging of clusters, and is particularly well-suited for applications where standard clustering algorithms struggle with sparse, high-dimensional, or noisy data.

3. Another high dimensional example

To close out the demonstration, we can apply nTARP to one more dataset: the wine dataset from the HDclassif package. It contains 178 observations of wines derived from three different cultivars grown in the same region of Italy. Each wine is described by 13 continuous chemical measurements, such as alcohol content, flavanoids, magnesium, color intensity, and proline, along with a class label indicating cultivar membership.

library(HDclassif)
## Warning: package 'HDclassif' was built under R version 4.5.2
## Loading required package: MASS
data(wine)
head(wine)
##   class    V1   V2   V3   V4  V5   V6   V7   V8   V9  V10  V11  V12  V13
## 1     1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
## 2     1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
## 3     1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
## 4     1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
## 5     1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93  735
## 6     1 14.20 1.76 2.45 15.2 112 3.27 3.39 0.34 1.97 6.75 1.05 2.85 1450
wine[,2:14] <- scale(wine[,2:14])

Like the iris dataset, the wine dataset contains three distinct classes. However, unlike the iris data—where the classes are perfectly balanced—the wine dataset is moderately imbalanced. The imbalance is not extreme, but it is worth noting, as class imbalance can influence both clustering behavior and how we interpret separation quality.

table(wine$class)
## 
##  1  2  3 
## 59 71 48

We’ll try the bisecting nTARP approach without a contextual variable first. We’ll also use more projections (100^2 = 10,000) to give the random search a better chance of finding good splits in 13 dimensions.

set.seed(123532) #random seed for reproducibility
cluster_solution_bisecting_wine <- nTARP_bisecting(data = wine[,2:14],
                                    number_of_projections = 100^2,
                                    withinss_threshold = 0.36,
                                    minimum_cluster_size_percent = 20 #default value
                                    )
## Finding solution for cluster: 0
## Finding solution for cluster: 02
## Finding solution for cluster: 022
## Finding solution for cluster: 021
## Finding solution for cluster: 01
labeled_clusters_wine <- build_solution_from_labeled_clusters(nTARP_best_clusters= cluster_solution_bisecting_wine$BestClusters, ids = 1:nrow(wine))
table(labeled_clusters_wine$FinalClusterID,wine$class)
##    
##      1  2  3
##   1  0  3 25
##   2  0  1 23
##   3  0 28  0
##   4  2 31  0
##   5 38  1  0
##   6 19  7  0

Examining the table, we see that clusters 1+2, 3+4, and 5+6 appear to be reasonable combinations.

#First of three merges: combine clusters 1 and 2
print("This is the result of combining clusters 1 and 2, during which 3-6 were relabeled to be 1-4 with the merged cluster being labeled 5.")
## [1] "This is the result of combining clusters 1 and 2, during which 3-6 were relabeled to be 1-4 with the merged cluster being labeled 5."
labeled_clusters_wine <- consolidate_clusters(cluster_path_matrix = labeled_clusters_wine,
                                         first_cluster_to_combine = 1,
                                         second_cluster_to_combine = 2)
table(labeled_clusters_wine$FinalClusterID,wine$class)
##    
##      1  2  3
##   1  0 28  0
##   2  2 31  0
##   3 38  1  0
##   4 19  7  0
##   5  0  4 48
#Merge 1 and 2 again
print("This is the result of combining clusters 3 and 4 (which were relabeled 1 and 2 in the last step). Now there are 4 clusters left.")
## [1] "This is the result of combining clusters 3 and 4 (which were relabeled 1 and 2 in the last step). Now there are 4 clusters left."
labeled_clusters_wine <- consolidate_clusters(cluster_path_matrix = labeled_clusters_wine,
                                         first_cluster_to_combine = 1,
                                         second_cluster_to_combine = 2)
table(labeled_clusters_wine$FinalClusterID,wine$class)
##    
##      1  2  3
##   1 38  1  0
##   2 19  7  0
##   3  0  4 48
##   4  2 59  0
#Merge 1 and 2 again

print("Finally, we combine clusters 1 and 2 (originally clusters 5 and 6), leaving 3 clusters at the end.")
## [1] "Finally, we combine clusters 1 and 2 (originally clusters 5 and 6), leaving 3 clusters at the end."
labeled_clusters_wine <- consolidate_clusters(cluster_path_matrix = labeled_clusters_wine,
                                         first_cluster_to_combine = 1,
                                         second_cluster_to_combine = 2)
table(labeled_clusters_wine$FinalClusterID,wine$class)
##    
##      1  2  3
##   1  0  4 48
##   2  2 59  0
##   3 57  8  0

Now that we’ve reduced the solution to three clusters, we get a majority-class accuracy of about 92% (164 of 178 observations). Let’s see if this improves with the contextual variable.

set.seed(123532) #random seed for reproducibility
cluster_solution_bisecting_wine <- nTARP_bisecting(data = wine[,2:14],
                                    number_of_projections = 100^2,
                                    withinss_threshold = 0.36,
                                    minimum_cluster_size_percent = 20,
                                    contextual_variable = wine$class
                                    )
## Finding solution for cluster: 0
## Finding solution for cluster: 02
## Finding solution for cluster: 021
## Finding solution for cluster: 01
labeled_clusters_wine <- build_solution_from_labeled_clusters(nTARP_best_clusters= cluster_solution_bisecting_wine$BestClusters, ids = 1:nrow(wine))
table(labeled_clusters_wine$FinalClusterID,wine$class)
##    
##      1  2  3
##   1 55  0  0
##   2  4  2  0
##   3  0  0 40
##   4  0  2  8
##   5  0 67  0

Examining the table, we see that clusters 1+2 and 3+4 appear to be reasonable combinations.

#First of two merges: combine clusters 1 and 2
print("This is the result of combining clusters 1 and 2, during which 3-5 were relabeled to be 1-3 with the merged cluster being labeled 4.")
## [1] "This is the result of combining clusters 1 and 2, during which 3-5 were relabeled to be 1-3 with the merged cluster being labeled 4."
labeled_clusters_wine <- consolidate_clusters(cluster_path_matrix = labeled_clusters_wine,
                                         first_cluster_to_combine = 1,
                                         second_cluster_to_combine = 2)
table(labeled_clusters_wine$FinalClusterID,wine$class)
##    
##      1  2  3
##   1  0  0 40
##   2  0  2  8
##   3  0 67  0
##   4 59  2  0
#Merge 1 and 2 again
print("Next, we combine clusters 1 and 2 (originally clusters 3 and 4), leaving 3 clusters at the end.")
## [1] "Next, we combine clusters 1 and 2 (originally clusters 3 and 4), leaving 3 clusters at the end."
labeled_clusters_wine <- consolidate_clusters(cluster_path_matrix = labeled_clusters_wine,
                                         first_cluster_to_combine = 1,
                                         second_cluster_to_combine = 2)
table(labeled_clusters_wine$FinalClusterID,wine$class)
##    
##      1  2  3
##   1  0 67  0
##   2 59  2  0
##   3  0  2 48

Now that we’ve reduced the solution to three clusters, the clustering aligns closely with the true wine cultivars, yielding an accuracy of about 98% (174 of 178 observations). This demonstrates that nTARP, especially when combined with post-processing (e.g., consolidate_clusters), can recover meaningful structure even in moderately imbalanced datasets.

Importantly, the contextual variable does not have to be a class label; it can be any categorical variable of interest. Using a contextual variable allows nTARP to guide the clustering toward patterns that are most relevant for a particular analytical or predictive goal, much like how decision trees split nodes to maximize purity with respect to a target variable.
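
To make the decision-tree analogy concrete, here is a small sketch of a Gini-based purity gain for a candidate binary split. The function names (`gini`, `purity_gain`) are illustrative helpers, not part of the nTARP package:

```r
# Gini impurity of a label vector: 1 minus the sum of squared class proportions.
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Purity gain of a candidate split: parent impurity minus the size-weighted
# average impurity of the two child clusters. Larger gain = purer split.
purity_gain <- function(labels, membership) {
  n <- length(labels)
  children <- split(labels, membership)
  child_impurity <- sum(sapply(children, function(l) (length(l) / n) * gini(l)))
  gini(labels) - child_impurity
}

# A split that perfectly separates two balanced classes achieves the
# maximum possible gain of 0.5 for this parent.
purity_gain(c("a", "a", "b", "b"), c(1, 1, 2, 2))   # 0.5
```

A contextual variable scored this way rewards splits whose children are dominated by a single category, which is why the bisecting runs above stopped earlier and produced cleaner clusters when `contextual_variable` was supplied.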