2025-02-16
This is an implementation of the mapper algorithm by Singh, Mémoli, and Carlsson (2007).
To install and use the most recent CRAN upload of this package, run the following:
install.packages("mappeR")
library(mappeR)
To install the latest development version of this package from Github, run the following commands:
install.packages("devtools")
library(devtools)
devtools::install_github("https://github.com/Uiowa-Applied-Topology/mappeR/tree/dev", upgrade=FALSE)
library(mappeR)
If you’re installing from Github, you might need to do some more stuff:
apt-get install r-base-dev
(or similar).Mapper is a way to view a point cloud \(P\) through a “lens” of our choice.
Consider a function
f: P \to \mathbb{R}
Cover \(\mathbb{R}\) in a set of intervals \(`\{I_i\}_{i=1}^n`\), so that every point in \(P\) is contained in some level set \(L_i = f^{-1}(I_i)\). We may then construct a graph
G = (V,E)
based on this cover to reflect the original data, where
V = \{L_i \mid L_i \neq \varnothing\}
and
E = \{\{L_i, L_j\}\mid L_i\cap L_j \neq \varnothing,\ i\neq j\}
This is the basic idea of the mapper algorithm, with the addition that each level set is first refined into clusters based on the intrinsic pairwise distances of the data according to some clustering algorithm. That is, we partition each level set \(L_i\) into \(k_i\) disjoint clusters
L_i = \bigsqcup_{j=1}^{k_i} C_j
and build a new graph \(G' = (V', E')\) that is homomorphic to \(G\) defined by
V' = \bigsqcup_{i=1}^{n}\{C_j\}_{j=1}^{k_i}
and
E' = \{\{C_i, C_j\}\mid C_i\cap C_j \neq \varnothing\}
So in general, the ingredients to construct a mapper graph are
= 5000
num_points
= data.frame(
P.data x = sapply(1:num_points, function(x)
sin(x) * 10) + rnorm(num_points, 0, 0.1),
y = sapply(1:num_points, function(x)
cos(x) ^ 2 * sin(x) * 10) + rnorm(num_points, 0, 0.1),
z = sapply(1:num_points, function(x)
10 * sin(x) ^ 2 * cos(x)) + rnorm(num_points, 0, 0.1)
)
= dist(P.data) P.dist
Here is a point cloud \(P\) formed by adding a bit of uniform noise to 5000 points regularly sampled from the parametric curve
\gamma(t) = \begin{cases}x = 10\ \sin(t)\\ y=10\ \sin(t)\ \cos^2(t)\\ z=10\ \sin^2(t)\ \cos(t) \end{cases}
This seems to form a kind of figure-8 curve just based on this projection. But as we can see from the 2D projections, the “shape” of the data set we’re seeing really does depend on how we’re looking at it:
We will build graphs using the outline of the mapper algorithm described, with real-valued lens functions.
Parameters:
# lens function
= P.data$x
projx
# cover parameters to generate a width-balanced cover
= 10
num_bins = 25
percent_overlap
# generate the cover
= create_width_balanced_cover(min(projx), max(projx), num_bins, percent_overlap)
xcover
# bin tester machine machine
<- function(endpoints) {
check_in_interval return(function(x) (endpoints[1] - x <= 0) & (endpoints[2] - x >= 0))
}
# each of the "cover" elements will really be a function that checks if a data point lives in it
= apply(xcover, 1, check_in_interval)
xcovercheck
# build the mapper objects
= create_mapper_object(
xmapper data = P.data,
dists = P.dist,
filtered_data = projx,
cover_element_tests = xcovercheck,
clusterer = local_hierarchical_clusterer("single") # built-in mappeR method
)
The object returned by create_mapper_object
is a list of
two dataframes containing vertex and edge information.
Vertex information:
id
: vertex IDcluster_size
: number of datapoints in clustermean_dist_to_medoid
: mean distance to medoid of
clusterdata
: names of datapoints in clusterpatch
: level set IDEdge information:
source
: vertex ID of edge sourcetarget
: vertex ID of edge targetweight
: Jaccard index of edge; intersection divided by
unionoverlap_data
: names of datapoints in overlapoverlap_size
: number of datapoints overlapBy toying with the general mapper parameters, we can obtain different flavors of the algorithm. In the ball mapper flavor, we simply use the inclusion into the ambient space of the data as our lens function, and let the cover do the work. Specifically, we cover the ambient space with \(\varepsilon\)-balls by creating a \(\varepsilon\)-net, which can be done with a greedy algorithm.
Parameters:
# creates a cover using a greedy algorithm
= create_balls(data = P.data, dists = P.dist, eps = .25)
balls
# ball tester machine machine
<- function(ball) {
is_in_ball return(function(x) x %in% ball)
}
# filtering is just giving back the data (row names because my balls are lists of data point names, so the filter should match)
= create_mapper_object(P.data, P.dist, rownames(P.data), lapply(balls, is_in_ball)) ballmapper
mappeR
has built-in methods for:
1D mapper
create_1D_mapper_object(data, dists, filtered_data, cover, clusterer)
Ball mapper
create_ball_mapper_object(data, dists, eps)
Clusterball mapper
create_clusterball_mapper_object(data, ball_dists, clustering_dists, eps, clusterer)
mappeR
has two built-in clusterers, each of which
implement agglomerative hierarchical clustering using
fastcluster
. The first is
local_hierarchical_clusterer(method)
, which will cut each
dendrogram according to its longest unbroken branches — it cuts in a
“local” context. The second is
global_hierarchical_clusterer(method, dists)
, which will
run hierarchical clustering on dists
to obtain a uniform
cutting height for each dendrogram — it cuts in a “global” context.
Any of the linkage methods below will work:
"single"
: single linkage"complete"
: complete linkage"average"
: average linkage"mcquitty"
: McQuitty linkage"centroid"
: centroid linkage"median"
: median linkage"ward.D"
: Ward linkage (assumes squared distances)"ward.D2"
: Ward linkage (assumes nonsquared
distances)You may also build your own function to use for clustering; this
function must be able to accept a list of distance matrices and return a
list of cluster information vectors for each matrix. A “cluster
information vector” means a named vector (a vector whose elements have
names) with datapoint IDs for names and cluster IDs for values. The
cluster IDs should be positive integers but do not need to be unique
across all of the information vectors; mappeR
will handle
this for you.