fastmath.clustering
Clustering.
Various clustering algorithms backed by the SMILE library.
Only partition clustering is implemented.
Input data
It's always a sequence of n-dimensional samples, where each sample is itself a sequence.
For example, 2d samples: [[1 2] [2 2] [3 3] ...]
For 1d data you can pass a sequence of numbers or a sequence of 1d seqs of numbers:
[1 2 3]
;; or
[[1] [2] [3]]
Distances
Some of the methods use distance functions; use the fastmath.distance namespace to create one.
Output
Every function returns a record which contains:
- :type - name of the method used
- :data - input data
- :clustering - sequence of cluster ids
- :sizes - sizes of clusters
- :clusters - number of clusters
- :predict - predicting function (see below), qualifies an additional sample
- :representatives - list of centroids or averages
- :info - additional statistics for your samples (like distortion)
- :obj - SMILE object
Cluster id is an integer ranging from 0 to the number of clusters minus 1. Some methods mark outliers with outlier-id.
The record acts as a function and can qualify an additional sample by calling the :predict function (or just call predict), for example (data is a sequence of 3d samples):
(let [cl (k-means data 10)] (cl [0 1 2]))
See k-means
Regrouping
A clustering record can be regrouped into a list of individual clusters. Call regroup to get a list of maps with the following structure:
- :key - cluster id or :outliers
- :data - samples which belong to the cluster
- :representative - centroid or average vector if the former is not available
- :size - size of cluster
Categories
Other vars: ->ClusteringResult clarans clustering-methods-list dbscan denclue deterministic-annealing g-means k-means lloyd map->ClusteringResult mec predict regroup spectral x-means
Constants
- outlier-id =
2147483647
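2147483647 is Java's Integer/MAX_VALUE, so a quick sanity check (a sketch, assuming the namespace is loaded) is:

```clojure
;; outlier-id equals Java's Integer/MAX_VALUE; methods which mark
;; outliers (like dbscan) assign this id to noise samples
(= outlier-id Integer/MAX_VALUE)
;;=> true
```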
clarans
(clarans data clusters)
(clarans data dist clusters)
(clarans data dist clusters max-neighbor)
Clustering Large Applications based upon RANdomized Search algorithm.
Input:
- data - sequence of samples
- dist (optional) - distance method, default
euclidean
- clusters - number of clusters
- max-neighbor (optional) - maximum number of neighbors checked during random search
See more in SMILE doc
Examples
Usage
(dissoc (clarans (repeatedly
1000
(fn* []
(r/randval 0.1 (r/irand -10 10) (r/irand 100 150))))
d/chebyshev
2)
:data :clustering
:obj :predict)
;;=> {:clusters 2,
;;=> :info {:distortion 11761.0},
;;=> :representatives ((124.0) (0.0)),
;;=> :sizes (895 105 0),
;;=> :type :clarans}
clustering-methods-list
List of clustering methods.
Examples
List of methods
clustering-methods-list
;;=> (:spectral :dbscan
;;=> :k-means :mec
;;=> :clarans :g-means
;;=> :lloyd :x-means
;;=> :deterministic-annealing :denclue)
dbscan
(dbscan data min-pts radius)
(dbscan data dist min-pts radius)
Density-Based Spatial Clustering of Applications with Noise algorithm.
Input:
- data - sequence of samples
- dist (optional) - distance method, default
euclidean
- min-pts - minimum number of neighbors
- radius - the neighborhood radius
See more in SMILE doc
Examples
3d vectors
(dissoc (dbscan
(repeatedly
5000
(fn* []
(vector (r/randval 0.1 (r/irand -10 10) (r/irand 100 150))
(r/randval (r/irand -10 10) (r/irand 100 150))
(r/randval (r/irand -10 10) (r/irand 100 150)))))
10
20)
:data :clustering
:obj :predict)
;;=> {:clusters 8,
;;=> :info {},
;;=> :representatives nil,
;;=> :sizes (1093 1140 1173 110 1109 134 121 120 0),
;;=> :type :dbscan}
denclue
(denclue data sigma m)
(denclue data sigma m tolerance)
(denclue data sigma m tolerance min-pts)
DENsity CLUstering algorithm.
Input:
- data - sequence of samples
- sigma - gaussian kernel parameter
- m - number of selected samples, much smaller than number of all samples
- tolerance (optional) - tolerance of hill-climbing procedure
- min-pts (optional) - minimum number of neighbors for a core attractor
See more in SMILE doc
Examples
Expect 2 clusters, uniform distribution.
((juxt :clusters :sizes :representatives)
(denclue (repeatedly 100 (fn* [] (r/randval (r/drand) (r/drand 5 6))))
1
10))
;;=> [2 (51 49 0) nil]
(map
(fn [m] (dissoc m :data))
(regroup
(denclue (repeatedly 1000
(fn* [] (r/randval 0.1 (r/drand) (r/drand 5 6))))
1
10)))
;;=> ({:key 0, :representative 5.485104749166474, :size 909}
;;=> {:key 1, :representative 0.49484681954689236, :size 91})
deterministic-annealing
(deterministic-annealing data max-clusters)
(deterministic-annealing data max-clusters alpha)
(deterministic-annealing data max-clusters alpha max-iter)
(deterministic-annealing data max-clusters alpha max-iter tolerance)
(deterministic-annealing data max-clusters alpha max-iter tolerance split-tolerance)
Deterministic Annealing algorithm.
Input:
- data - sequence of samples
- max-clusters - number of clusters
- alpha (optional) - temperature decreasing factor (valued from 0 to 1)
- max-iter (optional) - maximum number of iterations
- tolerance (optional) - tolerance of convergence test
- split-tolerance (optional) - tolerance to split a cluster
See more in SMILE doc
Examples
Usage
(map (fn [m] (dissoc m :data))
(-> (repeatedly 1000
(fn* []
(vector (r/randval (r/grand) (r/grand 5 1.0))
(r/randval (r/grand) (r/grand 5 1.0)))))
(deterministic-annealing 4 0.5)
(regroup)))
;;=> ({:key 0,
;;=> :representative (0.02571403464849379 -0.041487546808452326),
;;=> :size 225}
;;=> {:key 3,
;;=> :representative (4.958779088412497 4.935661039001764),
;;=> :size 269}
;;=> {:key 2,
;;=> :representative (0.005632263753536093 5.024473916660704),
;;=> :size 265}
;;=> {:key 1,
;;=> :representative (5.132011425016014 -0.04658062914962453),
;;=> :size 241})
g-means
(g-means data clusters)
(g-means data clusters max-iter)
(g-means data clusters max-iter tolerance)
G-Means algorithm.
Input:
- data - sequence of samples
- clusters - number of clusters
- max-iter (optional) - maximum number of iterations
- tolerance (optional) - tolerance of convergence test
See more in SMILE doc
Examples
Expect 2 clusters, uniform distribution.
((juxt :clusters :sizes :representatives)
(g-means (repeatedly 100 (fn* [] (r/randval (r/drand) (r/drand 5 6))))
4))
;;=> [2 (48 52 0) ((5.529580900980793) (0.46507987463900274))]
k-means
(k-means data clusters)
(k-means data clusters max-iter)
(k-means data clusters max-iter tolerance)
K-Means++ algorithm.
Input:
- data - sequence of samples
- clusters - number of clusters
- max-iter (optional) - maximum number of iterations
- tolerance (optional) - tolerance of convergence test
See more in SMILE doc
Examples
Usage
(k-means [1 2 3 -1 -1 2 -1 11 111] 4)
;;=> #fastmath.clustering.ClusteringResult
;;=> {:clustering (3 3 3 0 0 3 0 2 1),
;;=> :clusters 4,
;;=> :data [1 2 3 -1 -1 2 -1 11 111],
;;=> :info {:distortion 2.0000000000000004},
;;=> :obj
;;=> #object[smile.clustering.KMeans 0x53cea40f "Cluster distortion: 2.00000\nCluster size of 9 data points:\nCluster 1 3 (33.3%)\nCluster 2 1 (11.1%)\nCluster 3 1 (11.1%)\nCluster 4 4 (44.4%)\n"],
;;=> :predict #,
;;=> :representatives ((-1.0) (111.0) (11.0) (2.0)),
;;=> :sizes (3 1 1 4 0),
;;=> :type :k-means}
Clusters group into separate maps.
(regroup (k-means [1 2 3 -1 -1 2 -1 11 111] 4))
;;=> ({:data (1 2 3 2), :key 3, :representative (2.0), :size 4}
;;=> {:data (-1 -1 -1), :key 0, :representative (-1.0), :size 3}
;;=> {:data (11), :key 2, :representative (11.0), :size 1}
;;=> {:data (111), :key 1, :representative (111.0), :size 1})
Use as predictor
(let [cl (k-means [1 2 3 -1 -1 2 -1 11 111] 4)]
[(cl -1) (cl 10) (cl 100) (cl 1) (cl -1000) (cl 1000)])
;;=> [3 2 1 0 3 1]
lloyd
(lloyd data clusters)
(lloyd data clusters max-iter)
(lloyd data clusters max-iter tolerance)
K-Means algorithm (Lloyd's variant).
Input:
- data - sequence of samples
- clusters - number of clusters
- max-iter (optional) - maximum number of iterations
- tolerance (optional) - tolerance of convergence test
See more in SMILE doc
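Examples
lloyd takes the same arguments as k-means, so a minimal sketch of a call looks like this (assuming the same r alias for the random namespace used by the other examples; sizes will vary per run):

```clojure
;; two well-separated 1d groups, clustered into 2 clusters
(let [data (repeatedly 1000 (fn [] (r/randval (r/drand) (r/drand 5 6))))
      cl (lloyd data 2)]
  ((juxt :clusters :sizes) cl))
```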
mec
(mec data max-clusters radius)
(mec data dist max-clusters radius)
Nonparametric Minimum Conditional Entropy Clustering algorithm.
Input:
- data - sequence of samples
- dist (optional) - distance method, default
euclidean
- max-clusters - maximum number of clusters
- radius - the neighborhood radius
See more in SMILE doc
Examples
2d vectors
(dissoc (mec (repeatedly
5000
(fn* []
(vector
(r/randval 0.1 (r/irand -10 10) (r/irand 100 150))
(r/randval (r/irand -10 10) (r/irand 100 150)))))
d/manhattan
8
20)
:data :clustering
:obj :predict)
;;=> {:clusters 6,
;;=> :info {:entropy 368.91228362531825},
;;=> :representatives nil,
;;=> :sizes (243 1155 257 1174 1056 1115 0),
;;=> :type :mec}
regroup
(regroup {:keys [clustering data representatives sizes]})
Transform clustering result into list of clusters as separate maps.
Every map contains:
- :key - cluster id or :outliers
- :data - samples which belong to the cluster
- :representative - centroid/medoid or average vector if the former is not available
- :size - size of cluster
The representative is always an n-dimensional sequence, even if the input is a list of numbers.
Empty clusters are skipped.
Examples
Result of clustering with regrouping
(k-means [1 2 3 -1 -1 2 -1 11 111] 7)
;;=> #fastmath.clustering.ClusteringResult
;;=> {:clustering (2 5 4 3 3 5 3 0 1),
;;=> :clusters 7,
;;=> :data [1 2 3 -1 -1 2 -1 11 111],
;;=> :info {:distortion 0.0},
;;=> :obj
;;=> #object[smile.clustering.KMeans 0xada293 "Cluster distortion: 0.00000\nCluster size of 9 data points:\nCluster 1 1 (11.1%)\nCluster 2 1 (11.1%)\nCluster 3 1 (11.1%)\nCluster 4 3 (33.3%)\nCluster 5 1 (11.1%)\nCluster 6 2 (22.2%)\nCluster 7 0 ( 0.0%)\n"],
;;=> :predict #,
;;=> :representatives ((11.0) (111.0) (1.0) (-1.0) (3.0) (2.0) (##NaN)),
;;=> :sizes (1 1 1 3 1 2 0 0),
;;=> :type :k-means}
(regroup (k-means [1 2 3 -1 -1 2 -1 11 111] 7))
;;=> ({:data (1), :key 0, :representative (1.0), :size 1}
;;=> {:data (2 2), :key 5, :representative (2.0), :size 2}
;;=> {:data (3), :key 4, :representative (3.0), :size 1}
;;=> {:data (-1 -1 -1), :key 3, :representative (-1.0), :size 3}
;;=> {:data (11), :key 2, :representative (11.0), :size 1}
;;=> {:data (111), :key 1, :representative (111.0), :size 1})
(count (regroup (k-means [1 2 3 -1 -1 2 -1 11 111] 7)))
;;=> 6
spectral
(spectral data clusters sigma)
(spectral data clusters sigma max-iters tolerance)
(spectral data clusters samples sigma)
(spectral data clusters samples sigma max-iters tolerance)
Spectral clustering algorithm.
Input:
- data - sequence of samples
- clusters - number of clusters
- samples (optional) - number of random samples for Nystrom approximation
- sigma - width parameter for Gaussian kernel
- max-iters (optional) - maximum number of iterations
- tolerance (optional) - tolerance of k-means convergence test
See more in SMILE doc
Examples
2d vectors
(dissoc (spectral
(repeatedly
500
(fn* []
(vector (r/randval 0.1 (r/irand -10 10) (r/irand 100 150))
(r/randval (r/irand -10 10) (r/irand 100 150)))))
4
1)
:data :clustering
:obj :predict)
;;=> {:clusters 4,
;;=> :info {:distortion 38.649549206140456},
;;=> :representatives nil,
;;=> :sizes (186 58 36 220 0),
;;=> :type :spectral}
2d vectors
(dissoc (lloyd
(repeatedly
500
(fn* []
(vector (r/randval 0.1 (r/irand -10 10) (r/irand 100 150))
(r/randval (r/irand -10 10) (r/irand 100 150)))))
4
1)
:data :clustering
:obj :predict)
;;=> {:clusters 4,
;;=> :info {:distortion 316885.16459203116},
;;=> :representatives ((114.44117647058823 122.84558823529412)
;;=> (125.50230414746544 -0.29493087557603687)
;;=> (-0.6888888888888889 76.68888888888888)
;;=> (137.15686274509804 127.91176470588235)),
;;=> :sizes (136 217 45 102 0),
;;=> :type :lloyd}
x-means
(x-means data clusters)
(x-means data clusters max-iter)
(x-means data clusters max-iter tolerance)
X-Means algorithm.
Input:
- data - sequence of samples
- clusters - number of clusters
- max-iter (optional) - maximum number of iterations
- tolerance (optional) - tolerance of convergence test
See more in SMILE doc
Examples
Expect 2 clusters, gaussian distribution.
((juxt :clusters :sizes :representatives)
(x-means (repeatedly 10000
(fn* [] (r/randval (r/grand) (r/grand 5 1.0))))
4))
;;=> [2 (4969 5031 0) ((4.998015855439847) (-0.0011218297060218083))]