fastmath.clustering
Clustering algorithms.
Various clustering algorithms backed by the SMILE library.
Currently only partition clustering is implemented.
Input data
Input is always a sequence of n-dimensional samples, each sample itself a sequence.
For example, 2d samples: [[1 2] [2 2] [3 3] ...]
For 1d data you can pass either a sequence of numbers or a sequence of 1-element sequences:
[1 2 3]
;; or
[[1] [2] [3]]
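As a sketch (assuming fastmath is on the classpath and the namespace is required as shown), both input shapes work the same way:

```clojure
(require '[fastmath.clustering :as clust])

;; 2d samples: two well-separated groups
(:clusters (clust/k-means [[1 2] [2 2] [3 3] [100 100] [101 101]] 2))

;; 1d data: both forms produce the same clustering
(:clustering (clust/k-means [1 2 3 100] 2))
(:clustering (clust/k-means [[1] [2] [3] [100]] 2))
```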
Distances
Some of the methods use distance functions; use the fastmath.distance namespace to create one.
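A minimal sketch of passing a distance object (d/manhattan and d/chebyshev are the predefined distances used in the examples below):

```clojure
(require '[fastmath.clustering :as clust]
         '[fastmath.distance :as d])

;; clarans accepts a distance as its second argument
(clust/clarans [[0 0] [0 1] [10 10] [11 10]] d/manhattan 2)
```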
Output
Every function returns a record which contains:
- :type - name of the method used
- :data - input data
- :clustering - sequence of cluster ids
- :sizes - sizes of clusters
- :clusters - number of clusters
- :predict - predicting function (see below), qualifies an additional sample
- :representatives - list of centroids or medoids if available
- :info - additional statistics for your samples (like distortion)
- :obj - SMILE object
Cluster id is an integer ranging from 0 to the number of clusters minus 1. Some methods mark outliers with outlier-id.
The record acts as a function and can qualify an additional sample by calling the :predict function (or just call predict), for example (data is a sequence of 3d samples):
(let [cl (k-means data 10)] (cl [0 1 2]))
See k-means
Regrouping
A clustering record can be regrouped into a list of individual clusters. Call regroup to get a list of maps with the following structure:
- :key - cluster id
- :data - samples which belong to the cluster
- :outliers? - whether the cluster contains outliers
- :representative - centroid/medoid, or average vector if the former is not available
- :size - size of the cluster
Categories
Other vars: ->ClusteringResult clarans clustering-methods-list dbscan denclue deterministic-annealing g-means k-means map->ClusteringResult mec neural-gas predict regroup x-means
Constants
- outlier-id =
2147483647
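outlier-id equals Integer/MAX_VALUE. As a sketch, a hypothetical helper (not part of the library) that finds the positions of samples marked as outliers:

```clojure
;; outlier-positions is a hypothetical helper, not a fastmath function
(defn outlier-positions
  "Indices of samples marked with outlier-id in a clustering result."
  [clustering-result]
  (keep-indexed (fn [i id] (when (== id outlier-id) i))
                (:clustering clustering-result)))
```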
clarans
(clarans data clusters)
(clarans data dist clusters)
(clarans data dist clusters max-neighbor)
(clarans data dist clusters max-neighbor num-local)
Clustering Large Applications based upon RANdomized Search algorithm.
Input:
- data - sequence of samples
- clusters - number of clusters
Optional:
- dist - distance method, default
:euclidean
- max-neighbor - maximum number of neighbors checked during random search
- num-local - the number of local minima to search for
See more in SMILE doc
Examples
Usage
(dissoc (clarans (repeatedly 1000
                             #(r/randval 0.1 (r/irand -10 10) (r/irand 100 150)))
                 d/chebyshev
                 2)
        :data :clustering :obj :predict)
;;=> {:clusters 2,
;;=> :info {:distortion 11772.0, :maxneighbor 100, :numlocalminima 4},
;;=> :representatives ((124.0) (0.0)),
;;=> :sizes (906 94),
;;=> :type :clarans}
clustering-methods-list
List of clustering methods.
Examples
List of methods
clustering-methods-list
;;=> (:dbscan :k-means :mec
;;=> :clarans :g-means
;;=> :neural-gas :x-means
;;=> :deterministic-annealing :denclue)
dbscan
(dbscan data min-pts radius)
(dbscan data dist min-pts radius)
Density-Based Spatial Clustering of Applications with Noise algorithm.
Input:
- data - sequence of samples
- dist (optional) - distance method, default
:euclidean
- min-pts - minimum number of neighbors
- radius - the neighborhood radius
See more in SMILE doc
Examples
3d vectors
(dissoc (dbscan (repeatedly 5000
                            #(vector (r/randval 0.1 (r/irand -10 10) (r/irand 100 150))
                                     (r/randval (r/irand -10 10) (r/irand 100 150))
                                     (r/randval (r/irand -10 10) (r/irand 100 150))))
                10
                20)
        :data :clustering :obj :predict)
;;=> {:clusters 8,
;;=> :info {:minpts 10.0, :radius 20.0},
;;=> :representatives nil,
;;=> :sizes (1156 1112 136 1135 1086 134 122 119 0),
;;=> :type :dbscan}
denclue
(denclue data sigma m)
DENsity CLUstering algorithm.
Input:
- data - sequence of samples
- sigma - gaussian kernel parameter
- m - number of selected samples, should be much smaller than the number of all samples
See more in SMILE doc
Examples
Expect 2 clusters, uniform distribution.
((juxt :clusters :sizes :representatives)
 (denclue (repeatedly 100 #(r/randval (r/drand) (r/drand 5 6)))
          1
          10))
;;=> [2 (48 52) nil]
(map (fn [m] (dissoc m :data))
     (regroup
      (denclue (repeatedly 1000 #(r/randval 0.1 (r/drand) (r/drand 5 6)))
               1
               10)))
;;=> ({:key 0,
;;=> :outliers? false,
;;=> :representative 5.489619301329023,
;;=> :size 909}
;;=> {:key 1,
;;=> :outliers? false,
;;=> :representative 0.49735065256841965,
;;=> :size 91})
deterministic-annealing
(deterministic-annealing data max-clusters)
(deterministic-annealing data max-clusters alpha)
Deterministic Annealing algorithm.
Input:
- data - sequence of samples
- max-clusters - number of clusters
- alpha (optional) - temperature decreasing factor (value between 0 and 1)
See more in SMILE doc
Examples
Usage
(map (fn [m] (dissoc m :data))
     (-> (repeatedly 1000
                     #(vector (r/randval (r/grand) (r/grand 5 1.0))
                              (r/randval (r/grand) (r/grand 5 1.0))))
         (deterministic-annealing 4 0.5)
         (regroup)))
;;=> ({:key 1,
;;=> :outliers? false,
;;=> :representative (4.990142167914382 2.508954952403819),
;;=> :size 487}
;;=> {:key 0,
;;=> :outliers? false,
;;=> :representative (0.05914135281035301 2.3939707464726445),
;;=> :size 513})
g-means
(g-means data max-clusters)
G-Means algorithm.
Input:
- data - sequence of samples
- max-clusters - maximum number of clusters
See more in SMILE doc
Examples
Expect 2 clusters, uniform distribution.
((juxt :clusters :sizes :representatives)
 (g-means (repeatedly 100 #(r/randval (r/drand) (r/drand 5 6)))
          4))
;;=> [2 (49 51) ((0.5370248588228403) (5.485254148442907))]
k-means
(k-means data clusters)
(k-means data clusters max-iter)
(k-means data clusters max-iter runs)
K-Means++ algorithm.
Input:
- data - sequence of samples
- clusters - number of clusters
- max-iter (optional) - maximum number of iterations
- runs (optional) - maximum number of runs
See more in SMILE doc
Examples
Usage
(k-means [1 2 3 -1 -1 2 -1 11 111] 4)
;;=> #fastmath.clustering.ClusteringResult
;;=> {:clustering (3 3 3 0 0 3 0 2 1),
;;=> :clusters 4,
;;=> :data [1 2 3 -1 -1 2 -1 11 111],
;;=> :info {:distortion 2.0000000000000004},
;;=> :obj
;;=> #object[smile.clustering.KMeans 0x1e5a2318 "K-Means distortion: 2.00000\nClusters of 9 data points of dimension 1:\n 0\t 3 (33.3%)\n 1\t 1 (11.1%)\n 2\t 1 (11.1%)\n 3\t 4 (44.4%)\n"],
;;=> :predict #,
;;=> :representatives ((-1.0) (111.0) (11.0) (2.0)),
;;=> :sizes (3 1 1 4),
;;=> :type :k-means}
Clusters grouped into separate maps.
(regroup (k-means [1 2 3 -1 -1 2 -1 11 111] 4))
;;=> ({:data (1 2 3 2),
;;=> :key 3,
;;=> :outliers? false,
;;=> :representative (2.0),
;;=> :size 4}
;;=> {:data (-1 -1 -1),
;;=> :key 0,
;;=> :outliers? false,
;;=> :representative (-1.0),
;;=> :size 3}
;;=> {:data (11), :key 2, :outliers? false, :representative (11.0), :size 1}
;;=> {:data (111),
;;=> :key 1,
;;=> :outliers? false,
;;=> :representative (111.0),
;;=> :size 1})
Use as predictor
(let [cl (k-means [1 2 3 -1 -1 2 -1 11 111] 4)]
[(cl -1) (cl 10) (cl 100) (cl 1) (cl -1000) (cl 1000)])
;;=> [3 2 1 0 3 1]
mec
(mec data max-clusters radius)
(mec data dist max-clusters radius)
Nonparametric Minimum Conditional Entropy Clustering algorithm.
Input:
- data - sequence of samples
- dist (optional) - distance method, default
:euclidean
- max-clusters - maximum number of clusters
- radius - the neighborhood radius
See more in SMILE doc
Examples
2d vectors
(dissoc (mec (repeatedly 5000
                         #(vector (r/randval 0.1 (r/irand -10 10) (r/irand 100 150))
                                  (r/randval (r/irand -10 10) (r/irand 100 150))))
             d/manhattan
             8
             20)
        :data :clustering :obj :predict)
;;=> {:clusters 7,
;;=> :info {:entropy 445.78598430030064},
;;=> :representatives nil,
;;=> :sizes (981 801 1445 247 616 232 678),
;;=> :type :mec}
neural-gas
(neural-gas data clusters)
(neural-gas data clusters lambda-i lambda-f eps-i eps-f steps)
Neural Gas algorithm.
Input:
- data - sequence of samples
- clusters - number of clusters
Optional:
- lambda-i - initial lambda value (soft learning radius/rate)
- lambda-f - final lambda value
- eps-i - initial epsilon value (learning rate)
- eps-f - final epsilon value
- steps - number of iterations
See more in SMILE doc
Examples
Usage
(neural-gas [1 2 3 -1 -1 2 -1 11 111] 4)
;;=> #fastmath.clustering.ClusteringResult
;;=> {:clustering (0 0 0 0 0 0 0 3 1),
;;=> :clusters 4,
;;=> :data [1 2 3 -1 -1 2 -1 11 111],
;;=> :info {:distortion 100.21135606663056},
;;=> :obj
;;=> #object[smile.vq.NeuralGas 0xd814247 "Neural Gas distortion: 100.21136\nClusters of 9 data points of dimension 1:\n 0\t 7 (77.8%)\n 1\t 1 (11.1%)\n 2\t 0 ( 0.0%)\n 3\t 1 (11.1%)\n"],
;;=> :predict #,
;;=> :representatives ((0.6767057694246992)
;;=> (102.02680736489556)
;;=> (38.047386026088205)
;;=> (9.498429886559977)),
;;=> :sizes (7 1 0 1),
;;=> :type :neural-gas}
Clusters grouped into separate maps.
(regroup (neural-gas [1 2 3 -1 -1 2 -1 11 111] 4))
;;=> ({:data (1 2 3 -1 -1 2 -1),
;;=> :key 0,
;;=> :outliers? false,
;;=> :representative (0.6767057694246992),
;;=> :size 7}
;;=> {:data (11),
;;=> :key 3,
;;=> :outliers? false,
;;=> :representative (9.498429886559977),
;;=> :size 1}
;;=> {:data (111),
;;=> :key 1,
;;=> :outliers? false,
;;=> :representative (102.02680736489556),
;;=> :size 1})
regroup
(regroup clustered-data)
Transform clustering result into list of clusters as separate maps.
Every map contains:
- :key - cluster id
- :data - samples which belong to the cluster
- :outliers? - whether the cluster contains outliers
- :representative - centroid/medoid, or average vector if the former is not available
- :size - size of the cluster
Representative is always an n-dimensional sequence even if input is a list of numbers.
Empty clusters are skipped.
Examples
Result of clustering with regrouping
(k-means [1 2 3 -1 -1 2 -1 11 111] 7)
;;=> #fastmath.clustering.ClusteringResult
;;=> {:clustering (5 3 4 0 0 3 0 2 1),
;;=> :clusters 7,
;;=> :data [1 2 3 -1 -1 2 -1 11 111],
;;=> :info {:distortion 0.0},
;;=> :obj
;;=> #object[smile.clustering.KMeans 0x210e8d34 "K-Means distortion: 0.00000\nClusters of 9 data points of dimension 1:\n 0\t 3 (33.3%)\n 1\t 1 (11.1%)\n 2\t 1 (11.1%)\n 3\t 2 (22.2%)\n 4\t 1 (11.1%)\n 5\t 1 (11.1%)\n 6\t 0 ( 0.0%)\n"],
;;=> :predict #,
;;=> :representatives ((-1.0) (111.0) (11.0) (2.0) (3.0) (1.0) (##NaN)),
;;=> :sizes (3 1 1 2 1 1 0),
;;=> :type :k-means}
(regroup (k-means [1 2 3 -1 -1 2 -1 11 111] 7))
;;=> ({:data (1), :key 5, :outliers? false, :representative (1.0), :size 1}
;;=> {:data (2 2), :key 3, :outliers? false, :representative (2.0), :size 2}
;;=> {:data (3), :key 4, :outliers? false, :representative (3.0), :size 1}
;;=> {:data (-1 -1 -1),
;;=> :key 2,
;;=> :outliers? false,
;;=> :representative (-1.0),
;;=> :size 3}
;;=> {:data (11), :key 0, :outliers? false, :representative (11.0), :size 1}
;;=> {:data (111),
;;=> :key 1,
;;=> :outliers? false,
;;=> :representative (111.0),
;;=> :size 1})
(count (regroup (k-means [1 2 3 -1 -1 2 -1 11 111] 7)))
;;=> 6
x-means
(x-means data max-clusters)
X-Means algorithm.
Input:
- data - sequence of samples
- max-clusters - maximum number of clusters
See more in SMILE doc
Examples
Expect 2 clusters, gaussian distribution.
((juxt :clusters :sizes :representatives)
 (x-means (repeatedly 10000 #(r/randval (r/grand) (r/grand 5 1.0)))
          4))
;;=> [2 (4906 5094) ((-6.044219999840173E-4) (4.972183058432598))]