fastmath-clustering.core
Clustering.
Various clustering algrorithms backed by SMILE library.
Only partition clustering is implemented.
Input data
It’s always sequence of n-sized samples as sequences.
For example, 2d samples [[1 2] [2 2] [3 3] ...]
For 1d data you can pass sequence of numbers of sequence of 1d seqs of numbers
[1 2 3]
;; or
[[1] [2] [3]]
Distances
Some of the methods use distance functions, use fastmath-clustering.distance namespace to create or use one.
Output
Every function returns record which contains:
:type
- name of the method used:data
- input data:clustering
- sequence of cluster ids:sizes
- sizes of clusters:clusters
- number of clusters:predict
- predicting function (see below), qualify additional sample:representatives
- list of centroids or averages:info
- additional statistics for your samples (like distortion):obj
- SMILE object
Cluster id is a integer ranging from 0 to the number of clusters minus 1. Some methods mark outliers with outlier-id.
Record acts as function and can qualify additonal sample by calling :predict
function (or just call predict), for example (data
is sequence of 3d samples):
(let [cl (k-means data 10)] (cl [0 1 2]))
See k-means
Regrouping
Clustering record can be regroupped to the list of individual clusters. Call regroup and get list of maps with following structure:
:key
- cluster id or:outliers
:data
- samples which belong to the cluster:representative
- centroid or average vector if the former is not available:size
- size of cluster
Categories
Other vars: ->ClusteringResult clarans clustering-methods-list dbscan denclue deterministic-annealing g-means k-means lloyd map->ClusteringResult mec predict regroup spectral x-means
Constants
- outlier-id =
2147483647
clarans
(clarans data clusters)
(clarans data dist clusters)
(clarans data dist clusters max-neighbor)
Clustering Large Applications based upon RANdomized Search algorithm.
Input:
- data - sequence of samples
- dist (optional) - distance method, default
euclidean
- clusters - number of clusters
- max-neighbor (optional) - maximum number of neighbors checked during random search
See more in SMILE doc
Examples
Usage
(dissoc (clarans (repeatedly
1000
(fn* [] (randval 0.1 (irand -10 10) (irand 100 150))))
d/chebyshev
2)
:data :clustering
:obj :predict)
;;=> {:clusters 2,
;;=> :info {:distortion 11663.0},
;;=> :representatives ((126.0) (0.0)),
;;=> :sizes (892 108 0),
;;=> :type :clarans}
clustering-methods-list
List of clustering methods.
Examples
List of methods
clustering-methods-list
;;=> (:spectral :dbscan
;;=> :k-means :mec
;;=> :clarans :g-means
;;=> :lloyd :x-means
;;=> :deterministic-annealing :denclue)
dbscan
(dbscan data min-pts radius)
(dbscan data dist min-pts radius)
Density-Based Spatial Clustering of Applications with Noise algorithm.
Input:
- data - sequence of samples
- dist (optional) - distance method, default
euclidean
- min-pts - minimum number of neighbors
- radius - the neighborhood radius
See more in SMILE doc
Examples
3d vectors
(dissoc (dbscan (repeatedly
5000
(fn* []
(vector (randval 0.1 (irand -10 10) (irand 100 150))
(randval (irand -10 10) (irand 100 150))
(randval (irand -10 10) (irand 100 150)))))
10
20)
:data :clustering
:obj :predict)
;;=> {:clusters 8,
;;=> :info {},
;;=> :representatives nil,
;;=> :sizes (1146 1135 1144 134 128 1104 105 104 0),
;;=> :type :dbscan}
denclue
(denclue data sigma m)
(denclue data sigma m tolerance)
(denclue data sigma m tolerance min-pts)
DENsity CLUstering algorithm.
Input:
- data - sequence of samples
- sigma - gaussian kernel parameter
- m - number of selected samples, much smaller than number of all samples
- tolerance (optional) - tolerance of hill-climbing procedure
- min-pts (optional) - minimum number of neighbors for a core attractor
See more in SMILE doc
Examples
Expect 2 clusters, uniform distribution.
((juxt :clusters :sizes :representatives)
(denclue (repeatedly 100 (fn* [] (randval (drand) (drand 5 6)))) 1 10))
;;=> [2 (56 44 0) nil]
(map
(fn [m] (dissoc m :data))
(regroup
(denclue (repeatedly 1000 (fn* [] (randval 0.1 (drand) (drand 5 6))))
1
10)))
;;=> ({:key 0, :representative 5.505301647597941, :size 904}
;;=> {:key 1, :representative 0.5564660309322467, :size 96})
deterministic-annealing
(deterministic-annealing data max-clusters)
(deterministic-annealing data max-clusters alpha)
(deterministic-annealing data max-clusters alpha max-iter)
(deterministic-annealing data max-clusters alpha max-iter tolerance)
(deterministic-annealing data max-clusters alpha max-iter tolerance split-tolerance)
Deterministic Annealing algorithm.
Input:
- data - sequence of samples
- max-clusters - number of clusters
- alpha (optional) - temperature decreasing factor (valued from 0 to 1)
- max-iter (optional) - maximum number of iterations
- tolerance (optional) - tolerance of convergence test
- split-tolerance (optional) - tolerance to split a cluster
See more in SMILE doc
Examples
Usage
(map (fn [m] (dissoc m :data))
(-> (repeatedly 1000
(fn* []
(vector (randval (grand) (grand 5 1.0))
(randval (grand) (grand 5 1.0)))))
(deterministic-annealing 4 0.5)
(regroup)))
;;=> ({:key 0,
;;=> :representative (-0.058611378919933725 -0.013552096996870055),
;;=> :size 281}
;;=> {:key 2,
;;=> :representative (4.940048941144213 0.02248054975189403),
;;=> :size 232}
;;=> {:key 3,
;;=> :representative (4.92413297904191 4.976792268230151),
;;=> :size 252}
;;=> {:key 1,
;;=> :representative (0.02674989928953795 5.039325640322584),
;;=> :size 235})
g-means
(g-means data clusters)
(g-means data clusters max-iter)
(g-means data clusters max-iter tolerance)
G-Means
Input:
- data - sequence of samples
- clusters - number of clusters
- max-iter (optional) - maximum number of iterations
- tolerance (optional) - tolerance of convergence test
See more in SMILE doc
Examples
Expect 2 clusters, uniform distribution.
((juxt :clusters :sizes :representatives)
(g-means (repeatedly 100 (fn* [] (randval (drand) (drand 5 6)))) 4))
;;=> [2 (52 48 0) ((5.499012776162546) (0.44981210779716996))]
k-means
(k-means data clusters)
(k-means data clusters max-iter)
(k-means data clusters max-iter tolerance)
K-Means++ algorithm.
Input:
- data - sequence of samples
- clusters - number of clusters
- max-iter (optional) - maximum number of iterations
- tolerance (optional) - tolerance of convergence test
See more in SMILE doc
Examples
Usage
(k-means [1 2 3 -1 -1 2 -1 11 111] 4)
;;=> #fastmath_clustering.core.ClusteringResult
;;=> {:clustering (3 3 3 0 0 3 0 2 1),
;;=> :clusters 4,
;;=> :data [1 2 3 -1 -1 2 -1 11 111],
;;=> :info {:distortion 2.0000000000000004},
;;=> :obj
;;=> #object[smile.clustering.KMeans 0x1fd5eb93 "Cluster distortion: 2.00000\nCluster size of 9 data points:\nCluster 1 3 (33.3%)\nCluster 2 1 (11.1%)\nCluster 3 1 (11.1%)\nCluster 4 4 (44.4%)\n"],
;;=> :predict #,
;;=> :representatives ((-1.0) (111.0) (11.0) (2.0)),
;;=> :sizes (3 1 1 4 0),
;;=> :type :k-means}
Clusters group into separate maps.
(regroup (k-means [1 2 3 -1 -1 2 -1 11 111] 4))
;;=> ({:data (1 2 3 2), :key 3, :representative (2.0), :size 4}
;;=> {:data (-1 -1 -1), :key 0, :representative (-1.0), :size 3}
;;=> {:data (11), :key 2, :representative (11.0), :size 1}
;;=> {:data (111), :key 1, :representative (111.0), :size 1})
Use as predictor
(let [cl (k-means [1 2 3 -1 -1 2 -1 11 111] 4)]
[(cl -1) (cl 10) (cl 100) (cl 1) (cl -1000) (cl 1000)])
;;=> [3 2 1 0 3 1]
lloyd
(lloyd data clusters)
(lloyd data clusters max-iter)
(lloyd data clusters max-iter tolerance)
K-Means algorithm, lloyd.
Input:
- data - sequence of samples
- clusters - number of clusters
- max-iter (optional) - maximum number of iterations
- tolerance (optional) - tolerance of convergence test
See more in SMILE doc
mec
(mec data max-clusters radius)
(mec data dist max-clusters radius)
Nonparametric Minimum Conditional Entropy Clustering algorithm.
Input:
- data - sequence of samples
- dist (optional) - distance method, default
:euclidean
- max-clusters - maximum number of clusters
- radius - the neighborhood radius
See more in SMILE doc
Examples
2d vectors
(dissoc (mec (repeatedly 5000
(fn* []
(vector
(randval 0.1 (irand -10 10) (irand 100 150))
(randval (irand -10 10) (irand 100 150)))))
d/manhattan
8
20)
:data :clustering
:obj :predict)
;;=> {:clusters 4,
;;=> :info {:entropy 0.0},
;;=> :representatives nil,
;;=> :sizes (236 2223 231 2310 0),
;;=> :type :mec}
regroup
(regroup {:keys [clustering data representatives sizes]})
Transform clustering result into list of clusters as separate maps.
Every map contain:
:key
- cluster id or:outliers
:data
- samples which belong to the cluster:representative
- centroid/medoid or average vector if the former is not available:size
- size of cluster
Representative is always a n-dimensional sequence even if input is a list of numbers.
Empty clusters are skipped.
Examples
Result of clustering with regrouping
(k-means [1 2 3 -1 -1 2 -1 11 111] 7)
;;=> #fastmath_clustering.core.ClusteringResult
;;=> {:clustering (5 3 4 0 0 3 0 2 1),
;;=> :clusters 7,
;;=> :data [1 2 3 -1 -1 2 -1 11 111],
;;=> :info {:distortion 0.0},
;;=> :obj
;;=> #object[smile.clustering.KMeans 0x1f15bbb "Cluster distortion: 0.00000\nCluster size of 9 data points:\nCluster 1 3 (33.3%)\nCluster 2 1 (11.1%)\nCluster 3 1 (11.1%)\nCluster 4 2 (22.2%)\nCluster 5 1 (11.1%)\nCluster 6 1 (11.1%)\nCluster 7 0 ( 0.0%)\n"],
;;=> :predict #,
;;=> :representatives ((-1.0) (111.0) (11.0) (2.0) (3.0) (1.0) (##NaN)),
;;=> :sizes (3 1 1 2 1 1 0 0),
;;=> :type :k-means}
(regroup (k-means [1 2 3 -1 -1 2 -1 11 111] 7))
;;=> ({:data (1), :key 5, :representative (1.0), :size 1}
;;=> {:data (2 2), :key 2, :representative (2.0), :size 2}
;;=> {:data (3), :key 4, :representative (3.0), :size 1}
;;=> {:data (-1 -1 -1), :key 3, :representative (-1.0), :size 3}
;;=> {:data (11), :key 0, :representative (11.0), :size 1}
;;=> {:data (111), :key 1, :representative (111.0), :size 1})
(count (regroup (k-means [1 2 3 -1 -1 2 -1 11 111] 7)))
;;=> 6
spectral
(spectral data clusters sigma)
(spectral data clusters sigma max-iters tolerance)
(spectral data clusters samples sigma)
(spectral data clusters samples sigma max-iters tolerance)
Spectral clustering
Input:
- data - sequence of samples
- clusters - number of clusters
- samples (optional) - number of random samples for Nystrom approximation
- sigma - width parameter for Gaussian kernel
- max-iter (optional) - maximum number of iterations
- tolerance (optional) - tolerance of k-means convergence test
See more in SMILE doc
Examples
2d vectors
(dissoc (spectral
(repeatedly 500
(fn* []
(vector
(randval 0.1 (irand -10 10) (irand 100 150))
(randval (irand -10 10) (irand 100 150)))))
4
1)
:data :clustering
:obj :predict)
;;=> {:clusters 4,
;;=> :info {:distortion 19.149259019097173},
;;=> :representatives nil,
;;=> :sizes (263 185 32 20 0),
;;=> :type :spectral}
2d vectors
(dissoc (lloyd (repeatedly
500
(fn* []
(vector (randval 0.1 (irand -10 10) (irand 100 150))
(randval (irand -10 10) (irand 100 150)))))
4
1)
:data :clustering
:obj :predict)
;;=> {:clusters 4,
;;=> :info {:distortion 340165.77580992214},
;;=> :representatives ((79.75 -1.1704545454545454)
;;=> (0.10344827586206896 123.89655172413794)
;;=> (124.16589861751152 123.33640552995392)
;;=> (130.3734939759036 -0.536144578313253)),
;;=> :sizes (88 29 217 166 0),
;;=> :type :lloyd}
x-means
(x-means data clusters)
(x-means data clusters max-iter)
(x-means data clusters max-iter tolerance)
X-Means
Input:
- data - sequence of samples
- clusters - number of clusters
- max-iter (optional) - maximum number of iterations
- tolerance (optional) - tolerance of convergence test
See more in SMILE doc
Examples
Expect 2 clusters, gaussian distribution.
((juxt :clusters :sizes :representatives)
(x-means (repeatedly 10000 (fn* [] (randval (grand) (grand 5 1.0))))
4))
;;=> [2 (5011 4989 0) ((5.008429199918388) (-0.004122666084406418))]