fastmath-clustering.core

Clustering.

Various clustering algorithms backed by the SMILE library.

Only partition clustering is implemented.

Input data

Input is always a sequence of n-sized samples, each sample itself a sequence.

For example, 2d samples [[1 2] [2 2] [3 3] ...]

For 1d data you can pass a sequence of numbers or a sequence of 1-element sequences of numbers:

[1 2 3]
;; or
[[1] [2] [3]]

Distances

Some of the methods accept a distance function; use the fastmath-clustering.distance namespace to create or select one.
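For instance, a method like clarans can take one of these distance functions as its dist argument. A sketch (the require form and the d alias are assumptions, chosen to match the alias used in the examples in this document; the sample data is illustrative):

```clojure
;; alias the distance namespace (assumed require form; the `d` alias
;; matches the one used in the examples in this document)
(require '[fastmath-clustering.distance :as d])

;; pass a distance function to a method that accepts one, e.g. clarans
(clarans [[0 0] [0 1] [10 10] [10 11]] d/chebyshev 2)
```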

Output

Every function returns a record which contains:

  • :type - name of the method used
  • :data - input data
  • :clustering - sequence of cluster ids
  • :sizes - sizes of clusters
  • :clusters - number of clusters
  • :predict - predicting function (see below); qualifies an additional sample
  • :representatives - list of centroids or averages
  • :info - additional statistics for your samples (like distortion)
  • :obj - SMILE object

Cluster id is an integer ranging from 0 to the number of clusters minus 1. Some methods mark outliers with outlier-id.

The record acts as a function and can qualify an additional sample by calling the :predict function (or just call predict). For example (data is a sequence of 3d samples):

(let [cl (k-means data 10)] (cl [0 1 2]))

See k-means

Regrouping

A clustering record can be regrouped into a list of individual clusters. Call regroup to get a list of maps with the following structure:

  • :key - cluster id or :outliers
  • :data - samples which belong to the cluster
  • :representative - centroid or average vector if the former is not available
  • :size - size of cluster

Constants

clarans

(clarans data clusters)
(clarans data dist clusters)
(clarans data dist clusters max-neighbor)

Clustering Large Applications based upon RANdomized Search algorithm.

Input:

  • data - sequence of samples
  • dist (optional) - distance method, default :euclidean
  • clusters - number of clusters
  • max-neighbor (optional) - maximum number of neighbors checked during random search

See more in SMILE doc

Examples

Usage

(dissoc (clarans (repeatedly
                  1000
                  (fn* [] (randval 0.1 (irand -10 10) (irand 100 150))))
                 d/chebyshev
                 2)
        :data :clustering
        :obj :predict)
;;=> {:clusters 2,
;;=>  :info {:distortion 11663.0},
;;=>  :representatives ((126.0) (0.0)),
;;=>  :sizes (892 108 0),
;;=>  :type :clarans}

clustering-methods-list

List of clustering methods.

Examples

List of methods

clustering-methods-list
;;=> (:spectral :dbscan
;;=>            :k-means :mec
;;=>            :clarans :g-means
;;=>            :lloyd :x-means
;;=>            :deterministic-annealing :denclue)

dbscan

(dbscan data min-pts radius)
(dbscan data dist min-pts radius)

Density-Based Spatial Clustering of Applications with Noise algorithm.

Input:

  • data - sequence of samples
  • dist (optional) - distance method, default :euclidean
  • min-pts - minimum number of neighbors
  • radius - the neighborhood radius

See more in SMILE doc

Examples

3d vectors

(dissoc (dbscan (repeatedly
                 5000
                 (fn* []
                   (vector (randval 0.1 (irand -10 10) (irand 100 150))
                           (randval (irand -10 10) (irand 100 150))
                           (randval (irand -10 10) (irand 100 150)))))
                10
                20)
        :data :clustering
        :obj :predict)
;;=> {:clusters 8,
;;=>  :info {},
;;=>  :representatives nil,
;;=>  :sizes (1146 1135 1144 134 128 1104 105 104 0),
;;=>  :type :dbscan}

denclue

(denclue data sigma m)
(denclue data sigma m tolerance)
(denclue data sigma m tolerance min-pts)

DENsity CLUstering algorithm.

Input:

  • data - sequence of samples
  • sigma - gaussian kernel parameter
  • m - number of selected samples, much smaller than number of all samples
  • tolerance (optional) - tolerance of hill-climbing procedure
  • min-pts (optional) - minimum number of neighbors for a core attractor

See more in SMILE doc

Examples

Expect 2 clusters, uniform distribution.

((juxt :clusters :sizes :representatives)
 (denclue (repeatedly 100 (fn* [] (randval (drand) (drand 5 6)))) 1 10))
;;=> [2 (56 44 0) nil]
(map
 (fn [m] (dissoc m :data))
 (regroup
  (denclue (repeatedly 1000 (fn* [] (randval 0.1 (drand) (drand 5 6))))
           1
           10)))
;;=> ({:key 0, :representative 5.505301647597941, :size 904}
;;=>  {:key 1, :representative 0.5564660309322467, :size 96})

deterministic-annealing

(deterministic-annealing data max-clusters)
(deterministic-annealing data max-clusters alpha)
(deterministic-annealing data max-clusters alpha max-iter)
(deterministic-annealing data max-clusters alpha max-iter tolerance)
(deterministic-annealing data max-clusters alpha max-iter tolerance split-tolerance)

Deterministic Annealing algorithm.

Input:

  • data - sequence of samples
  • max-clusters - number of clusters
  • alpha (optional) - temperature decreasing factor (valued from 0 to 1)
  • max-iter (optional) - maximum number of iterations
  • tolerance (optional) - tolerance of convergence test
  • split-tolerance (optional) - tolerance to split a cluster

See more in SMILE doc

Examples

Usage

(map (fn [m] (dissoc m :data))
     (-> (repeatedly 1000
                     (fn* []
                       (vector (randval (grand) (grand 5 1.0))
                               (randval (grand) (grand 5 1.0)))))
         (deterministic-annealing 4 0.5)
         (regroup)))
;;=> ({:key 0,
;;=>   :representative (-0.058611378919933725 -0.013552096996870055),
;;=>   :size 281}
;;=>  {:key 2,
;;=>   :representative (4.940048941144213 0.02248054975189403),
;;=>   :size 232}
;;=>  {:key 3,
;;=>   :representative (4.92413297904191 4.976792268230151),
;;=>   :size 252}
;;=>  {:key 1,
;;=>   :representative (0.02674989928953795 5.039325640322584),
;;=>   :size 235})

g-means

(g-means data clusters)
(g-means data clusters max-iter)
(g-means data clusters max-iter tolerance)

G-Means algorithm.

Input:

  • data - sequence of samples
  • clusters - number of clusters
  • max-iter (optional) - maximum number of iterations
  • tolerance (optional) - tolerance of convergence test

See more in SMILE doc

Examples

Expect 2 clusters, uniform distribution.

((juxt :clusters :sizes :representatives)
 (g-means (repeatedly 100 (fn* [] (randval (drand) (drand 5 6)))) 4))
;;=> [2 (52 48 0) ((5.499012776162546) (0.44981210779716996))]

k-means

(k-means data clusters)
(k-means data clusters max-iter)
(k-means data clusters max-iter tolerance)

K-Means++ algorithm.

Input:

  • data - sequence of samples
  • clusters - number of clusters
  • max-iter (optional) - maximum number of iterations
  • tolerance (optional) - tolerance of convergence test

See more in SMILE doc

Examples

Usage

(k-means [1 2 3 -1 -1 2 -1 11 111] 4)
;;=> #fastmath_clustering.core.ClusteringResult
;;=>  {:clustering (3 3 3 0 0 3 0 2 1),
;;=>   :clusters 4,
;;=>   :data [1 2 3 -1 -1 2 -1 11 111],
;;=>   :info {:distortion 2.0000000000000004},
;;=>   :obj
;;=>   #object[smile.clustering.KMeans 0x1fd5eb93 "Cluster distortion: 2.00000\nCluster size of 9 data points:\nCluster    1      3 (33.3%)\nCluster    2      1 (11.1%)\nCluster    3      1 (11.1%)\nCluster    4      4 (44.4%)\n"],
;;=>   :predict #,
;;=>   :representatives ((-1.0) (111.0) (11.0) (2.0)),
;;=>   :sizes (3 1 1 4 0),
;;=>   :type :k-means}

Clusters group into separate maps.

(regroup (k-means [1 2 3 -1 -1 2 -1 11 111] 4))
;;=> ({:data (1 2 3 2), :key 3, :representative (2.0), :size 4}
;;=>  {:data (-1 -1 -1), :key 0, :representative (-1.0), :size 3}
;;=>  {:data (11), :key 2, :representative (11.0), :size 1}
;;=>  {:data (111), :key 1, :representative (111.0), :size 1})

Use as predictor

(let [cl (k-means [1 2 3 -1 -1 2 -1 11 111] 4)]
  [(cl -1) (cl 10) (cl 100) (cl 1) (cl -1000) (cl 1000)])
;;=> [3 2 1 0 3 1]

lloyd

(lloyd data clusters)
(lloyd data clusters max-iter)
(lloyd data clusters max-iter tolerance)

K-Means algorithm (Lloyd's variant).

Input:

  • data - sequence of samples
  • clusters - number of clusters
  • max-iter (optional) - maximum number of iterations
  • tolerance (optional) - tolerance of convergence test

See more in SMILE doc
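A minimal usage sketch (the sample data here is illustrative; arguments follow the signatures above):

```clojure
;; two well-separated 2d groups; lloyd with 2 clusters should recover them
(let [data [[0 0] [0 1] [1 0] [10 10] [10 11] [11 10]]
      cl   (lloyd data 2)]
  ((juxt :clusters :sizes :representatives) cl))
```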

mec

(mec data max-clusters radius)
(mec data dist max-clusters radius)

Nonparametric Minimum Conditional Entropy Clustering algorithm.

Input:

  • data - sequence of samples
  • dist (optional) - distance method, default :euclidean
  • max-clusters - maximum number of clusters
  • radius - the neighborhood radius

See more in SMILE doc

Examples

2d vectors

(dissoc (mec (repeatedly 5000
                         (fn* []
                           (vector
                            (randval 0.1 (irand -10 10) (irand 100 150))
                            (randval (irand -10 10) (irand 100 150)))))
             d/manhattan
             8
             20)
        :data :clustering
        :obj :predict)
;;=> {:clusters 4,
;;=>  :info {:entropy 0.0},
;;=>  :representatives nil,
;;=>  :sizes (236 2223 231 2310 0),
;;=>  :type :mec}

outlier-id

const

;;=> 2147483647

Id of the cluster which contains outliers.
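For example, noise points found by dbscan are assigned this id in :clustering, and regroup gathers them under the :outliers key. A sketch (the sample data and parameters are illustrative; the exact grouping depends on the data):

```clojure
;; a dense 1d run plus one far-away point, which dbscan should treat as noise
(let [data (conj (mapv vector (range 20)) [1000])
      cl   (dbscan data 3 2.0)]
  (map :key (regroup cl)))
```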

predict

(predict cluster in)

Predicts the cluster for a given vector.
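A short sketch, reusing the k-means data from the examples above; calling predict explicitly is equivalent to calling the record itself:

```clojure
(let [cl (k-means [1 2 3 -1 -1 2 -1 11 111] 4)]
  ;; both forms qualify the same sample and return the same cluster id
  [(predict cl 10) (cl 10)])
```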

regroup

(regroup {:keys [clustering data representatives sizes]})

Transform clustering result into list of clusters as separate maps.

Every map contains:

  • :key - cluster id or :outliers
  • :data - samples which belong to the cluster
  • :representative - centroid/medoid or average vector if the former is not available
  • :size - size of cluster

The representative is always an n-dimensional sequence even if the input is a list of numbers.

Empty clusters are skipped.

Examples

Result of clustering with regrouping

(k-means [1 2 3 -1 -1 2 -1 11 111] 7)
;;=> #fastmath_clustering.core.ClusteringResult
;;=>  {:clustering (5 3 4 0 0 3 0 2 1),
;;=>   :clusters 7,
;;=>   :data [1 2 3 -1 -1 2 -1 11 111],
;;=>   :info {:distortion 0.0},
;;=>   :obj
;;=>   #object[smile.clustering.KMeans 0x1f15bbb "Cluster distortion: 0.00000\nCluster size of 9 data points:\nCluster    1      3 (33.3%)\nCluster    2      1 (11.1%)\nCluster    3      1 (11.1%)\nCluster    4      2 (22.2%)\nCluster    5      1 (11.1%)\nCluster    6      1 (11.1%)\nCluster    7      0 ( 0.0%)\n"],
;;=>   :predict #,
;;=>   :representatives ((-1.0) (111.0) (11.0) (2.0) (3.0) (1.0) (##NaN)),
;;=>   :sizes (3 1 1 2 1 1 0 0),
;;=>   :type :k-means}
(regroup (k-means [1 2 3 -1 -1 2 -1 11 111] 7))
;;=> ({:data (1), :key 5, :representative (1.0), :size 1}
;;=>  {:data (2 2), :key 2, :representative (2.0), :size 2}
;;=>  {:data (3), :key 4, :representative (3.0), :size 1}
;;=>  {:data (-1 -1 -1), :key 3, :representative (-1.0), :size 3}
;;=>  {:data (11), :key 0, :representative (11.0), :size 1}
;;=>  {:data (111), :key 1, :representative (111.0), :size 1})
(count (regroup (k-means [1 2 3 -1 -1 2 -1 11 111] 7)))
;;=> 6

spectral

(spectral data clusters sigma)
(spectral data clusters sigma max-iters tolerance)
(spectral data clusters samples sigma)
(spectral data clusters samples sigma max-iters tolerance)

Spectral clustering algorithm.

Input:

  • data - sequence of samples
  • clusters - number of clusters
  • samples (optional) - number of random samples for Nystrom approximation
  • sigma - width parameter for Gaussian kernel
  • max-iter (optional) - maximum number of iterations
  • tolerance (optional) - tolerance of k-means convergence test

See more in SMILE doc

Examples

2d vectors

(dissoc (spectral
         (repeatedly 500
                     (fn* []
                       (vector
                        (randval 0.1 (irand -10 10) (irand 100 150))
                        (randval (irand -10 10) (irand 100 150)))))
         4
         1)
        :data :clustering
        :obj :predict)
;;=> {:clusters 4,
;;=>  :info {:distortion 19.149259019097173},
;;=>  :representatives nil,
;;=>  :sizes (263 185 32 20 0),
;;=>  :type :spectral}

2d vectors

(dissoc (lloyd (repeatedly
                500
                (fn* []
                  (vector (randval 0.1 (irand -10 10) (irand 100 150))
                          (randval (irand -10 10) (irand 100 150)))))
               4
               1)
        :data :clustering
        :obj :predict)
;;=> {:clusters 4,
;;=>  :info {:distortion 340165.77580992214},
;;=>  :representatives ((79.75 -1.1704545454545454)
;;=>                    (0.10344827586206896 123.89655172413794)
;;=>                    (124.16589861751152 123.33640552995392)
;;=>                    (130.3734939759036 -0.536144578313253)),
;;=>  :sizes (88 29 217 166 0),
;;=>  :type :lloyd}

x-means

(x-means data clusters)
(x-means data clusters max-iter)
(x-means data clusters max-iter tolerance)

X-Means algorithm.

Input:

  • data - sequence of samples
  • clusters - number of clusters
  • max-iter (optional) - maximum number of iterations
  • tolerance (optional) - tolerance of convergence test

See more in SMILE doc

Examples

Expect 2 clusters, gaussian distribution.

((juxt :clusters :sizes :representatives)
 (x-means (repeatedly 10000 (fn* [] (randval (grand) (grand 5 1.0))))
          4))
;;=> [2 (5011 4989 0) ((5.008429199918388) (-0.004122666084406418))]