fastmath.clustering

Clustering.

Various clustering algorithms backed by the SMILE library.

Only partition clustering is implemented.

Input data

It’s always a sequence of n-dimensional samples, each sample being a sequence itself.

For example, 2d samples [[1 2] [2 2] [3 3] ...]

For 1d data you can pass a sequence of numbers or a sequence of 1-element sequences:

[1 2 3]
;; or
[[1] [2] [3]]

Distances

Some of the methods use distance functions; use the fastmath.distance namespace to create one.
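A minimal sketch of passing a distance, assuming the `d` alias for fastmath.distance used in the examples below (data and parameters here are illustrative):

```clojure
(require '[fastmath.clustering :as clust]
         '[fastmath.distance :as d])

;; pass a distance object as the optional `dist` argument
(def res (clust/clarans [[1 2] [2 2] [3 3] [100 100] [101 99] [99 101]]
                        d/chebyshev
                        2))

(:clusters res)
```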

Output

Every function returns a record which contains:

  • :type - name of the method used
  • :data - input data
  • :clustering - sequence of cluster ids
  • :sizes - sizes of clusters
  • :clusters - number of clusters
  • :predict - predicting function (see below), classifies an additional sample
  • :representatives - list of centroids or averages
  • :info - additional statistics for your samples (like distortion)
  • :obj - SMILE object

Cluster id is an integer ranging from 0 to the number of clusters minus 1. Some methods mark outliers with outlier-id.

The record acts as a function and can classify an additional sample by calling its :predict function (or just call predict), for example (data is a sequence of 3d samples):

(let [cl (k-means data 10)] (cl [0 1 2]))

See k-means

Regrouping

A clustering record can be regrouped into a list of individual clusters. Call regroup to get a list of maps with the following structure:

  • :key - cluster id or :outliers
  • :data - samples which belong to the cluster
  • :representative - centroid or average vector if the former is not available
  • :size - size of cluster

Constants

clarans

(clarans data clusters)(clarans data dist clusters)(clarans data dist clusters max-neighbor)

Clustering Large Applications based upon RANdomized Search algorithm.

Input:

  • data - sequence of samples
  • dist (optional) - distance method, default euclidean
  • clusters - number of clusters
  • max-neighbor (optional) - maximum number of neighbors checked during random search

See more in SMILE doc

Examples

Usage

(dissoc (clarans (repeatedly
                  1000
                  (fn* []
                    (r/randval 0.1 (r/irand -10 10) (r/irand 100 150))))
                 d/chebyshev
                 2)
        :data :clustering
        :obj :predict)
;;=> {:clusters 2,
;;=>  :info {:distortion 11761.0},
;;=>  :representatives ((124.0) (0.0)),
;;=>  :sizes (895 105 0),
;;=>  :type :clarans}

clustering-methods-list

List of clustering methods.

Examples

List of methods

clustering-methods-list
;;=> (:spectral :dbscan
;;=>            :k-means :mec
;;=>            :clarans :g-means
;;=>            :lloyd :x-means
;;=>            :deterministic-annealing :denclue)

dbscan

(dbscan data min-pts radius)(dbscan data dist min-pts radius)

Density-Based Spatial Clustering of Applications with Noise algorithm.

Input:

  • data - sequence of samples
  • dist (optional) - distance method, default euclidean
  • min-pts - minimum number of neighbors
  • radius - the neighborhood radius

See more in SMILE doc

Examples

3d vectors

(dissoc (dbscan
         (repeatedly
          5000
          (fn* []
            (vector (r/randval 0.1 (r/irand -10 10) (r/irand 100 150))
                    (r/randval (r/irand -10 10) (r/irand 100 150))
                    (r/randval (r/irand -10 10) (r/irand 100 150)))))
         10
         20)
        :data :clustering
        :obj :predict)
;;=> {:clusters 8,
;;=>  :info {},
;;=>  :representatives nil,
;;=>  :sizes (1093 1140 1173 110 1109 134 121 120 0),
;;=>  :type :dbscan}

denclue

(denclue data sigma m)(denclue data sigma m tolerance)(denclue data sigma m tolerance min-pts)

DENsity CLUstering algorithm.

Input:

  • data - sequence of samples
  • sigma - gaussian kernel parameter
  • m - number of selected samples, which should be much smaller than the total number of samples
  • tolerance (optional) - tolerance of hill-climbing procedure
  • min-pts (optional) - minimum number of neighbors for a core attractor

See more in SMILE doc

Examples

Expect 2 clusters, uniform distribution.

((juxt :clusters :sizes :representatives)
 (denclue (repeatedly 100 (fn* [] (r/randval (r/drand) (r/drand 5 6))))
          1
          10))
;;=> [2 (51 49 0) nil]
(map
 (fn [m] (dissoc m :data))
 (regroup
  (denclue (repeatedly 1000
                       (fn* [] (r/randval 0.1 (r/drand) (r/drand 5 6))))
           1
           10)))
;;=> ({:key 0, :representative 5.485104749166474, :size 909}
;;=>  {:key 1, :representative 0.49484681954689236, :size 91})

deterministic-annealing

(deterministic-annealing data max-clusters)(deterministic-annealing data max-clusters alpha)(deterministic-annealing data max-clusters alpha max-iter)(deterministic-annealing data max-clusters alpha max-iter tolerance)(deterministic-annealing data max-clusters alpha max-iter tolerance split-tolerance)

Deterministic Annealing algorithm.

Input:

  • data - sequence of samples
  • max-clusters - number of clusters
  • alpha (optional) - temperature decreasing factor (valued from 0 to 1)
  • max-iter (optional) - maximum number of iterations
  • tolerance (optional) - tolerance of convergence test
  • split-tolerance (optional) - tolerance to split a cluster

See more in SMILE doc

Examples

Usage

(map (fn [m] (dissoc m :data))
     (-> (repeatedly 1000
                     (fn* []
                       (vector (r/randval (r/grand) (r/grand 5 1.0))
                               (r/randval (r/grand) (r/grand 5 1.0)))))
         (deterministic-annealing 4 0.5)
         (regroup)))
;;=> ({:key 0,
;;=>   :representative (0.02571403464849379 -0.041487546808452326),
;;=>   :size 225}
;;=>  {:key 3,
;;=>   :representative (4.958779088412497 4.935661039001764),
;;=>   :size 269}
;;=>  {:key 2,
;;=>   :representative (0.005632263753536093 5.024473916660704),
;;=>   :size 265}
;;=>  {:key 1,
;;=>   :representative (5.132011425016014 -0.04658062914962453),
;;=>   :size 241})

g-means

(g-means data clusters)(g-means data clusters max-iter)(g-means data clusters max-iter tolerance)

G-Means algorithm.

Input:

  • data - sequence of samples
  • clusters - number of clusters
  • max-iter (optional) - maximum number of iterations
  • tolerance (optional) - tolerance of convergence test

See more in SMILE doc

Examples

Expect 2 clusters, uniform distribution.

((juxt :clusters :sizes :representatives)
 (g-means (repeatedly 100 (fn* [] (r/randval (r/drand) (r/drand 5 6))))
          4))
;;=> [2 (48 52 0) ((5.529580900980793) (0.46507987463900274))]

k-means

(k-means data clusters)(k-means data clusters max-iter)(k-means data clusters max-iter tolerance)

K-Means++ algorithm.

Input:

  • data - sequence of samples
  • clusters - number of clusters
  • max-iter (optional) - maximum number of iterations
  • tolerance (optional) - tolerance of convergence test

See more in SMILE doc

Examples

Usage

(k-means [1 2 3 -1 -1 2 -1 11 111] 4)
;;=> #fastmath.clustering.ClusteringResult
;;=>  {:clustering (3 3 3 0 0 3 0 2 1),
;;=>   :clusters 4,
;;=>   :data [1 2 3 -1 -1 2 -1 11 111],
;;=>   :info {:distortion 2.0000000000000004},
;;=>   :obj
;;=>   #object[smile.clustering.KMeans 0x53cea40f "Cluster distortion: 2.00000\nCluster size of 9 data points:\nCluster    1      3 (33.3%)\nCluster    2      1 (11.1%)\nCluster    3      1 (11.1%)\nCluster    4      4 (44.4%)\n"],
;;=>   :predict #,
;;=>   :representatives ((-1.0) (111.0) (11.0) (2.0)),
;;=>   :sizes (3 1 1 4 0),
;;=>   :type :k-means}

Clusters grouped into separate maps.

(regroup (k-means [1 2 3 -1 -1 2 -1 11 111] 4))
;;=> ({:data (1 2 3 2), :key 3, :representative (2.0), :size 4}
;;=>  {:data (-1 -1 -1), :key 0, :representative (-1.0), :size 3}
;;=>  {:data (11), :key 2, :representative (11.0), :size 1}
;;=>  {:data (111), :key 1, :representative (111.0), :size 1})

Use as predictor

(let [cl (k-means [1 2 3 -1 -1 2 -1 11 111] 4)]
  [(cl -1) (cl 10) (cl 100) (cl 1) (cl -1000) (cl 1000)])
;;=> [3 2 1 0 3 1]

lloyd

(lloyd data clusters)(lloyd data clusters max-iter)(lloyd data clusters max-iter tolerance)

K-Means algorithm, Lloyd’s variant.

Input:

  • data - sequence of samples
  • clusters - number of clusters
  • max-iter (optional) - maximum number of iterations
  • tolerance (optional) - tolerance of convergence test

See more in SMILE doc
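A minimal sketch mirroring the k-means example above (cluster ids and ordering may differ between runs):

```clojure
(require '[fastmath.clustering :as clust])

;; 1d data, 4 clusters, Lloyd's k-means
(def res (clust/lloyd [1 2 3 -1 -1 2 -1 11 111] 4))

;; the record carries the method name and cluster count
((juxt :type :clusters) res)
```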

mec

(mec data max-clusters radius)(mec data dist max-clusters radius)

Nonparametric Minimum Conditional Entropy Clustering algorithm.

Input:

  • data - sequence of samples
  • dist (optional) - distance method, default :euclidean
  • max-clusters - maximum number of clusters
  • radius - the neighborhood radius

See more in SMILE doc

Examples

2d vectors

(dissoc (mec (repeatedly
              5000
              (fn* []
                (vector
                 (r/randval 0.1 (r/irand -10 10) (r/irand 100 150))
                 (r/randval (r/irand -10 10) (r/irand 100 150)))))
             d/manhattan
             8
             20)
        :data :clustering
        :obj :predict)
;;=> {:clusters 6,
;;=>  :info {:entropy 368.91228362531825},
;;=>  :representatives nil,
;;=>  :sizes (243 1155 257 1174 1056 1115 0),
;;=>  :type :mec}

outlier-id

const

;;=> 2147483647

Id of the cluster which contains outliers.
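A sketch of using outlier-id, assuming methods such as dbscan mark noise points with it (the data and parameters here are illustrative):

```clojure
(require '[fastmath.clustering :as clust])

;; a dense blob plus a single far-away point
(def data (conj (vec (repeatedly 100 #(vector (rand) (rand))))
                [1000.0 1000.0]))

(def res (clust/dbscan data 5 0.5))

;; count samples marked as noise
(count (filter #(= % clust/outlier-id) (:clustering res)))
```

With regroup, such samples end up in the map whose :key is :outliers.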

predict

(predict cluster in)

Predicts the cluster for a given vector.
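A sketch showing that calling predict and calling the clustering record itself classify a new sample the same way:

```clojure
(require '[fastmath.clustering :as clust])

(def res (clust/k-means [[0 0] [0 1] [10 10] [10 11]] 2))

;; both forms delegate to the record's :predict function
(= (clust/predict res [0.0 0.5]) (res [0.0 0.5]))
```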

regroup

(regroup {:keys [clustering data representatives sizes]})

Transform clustering result into list of clusters as separate maps.

Every map contains:

  • :key - cluster id or :outliers
  • :data - samples which belong to the cluster
  • :representative - centroid/medoid or average vector if the former is not available
  • :size - size of cluster

Representative is always an n-dimensional sequence, even if the input is a sequence of numbers.

Empty clusters are skipped.

Examples

Result of clustering with regrouping

(k-means [1 2 3 -1 -1 2 -1 11 111] 7)
;;=> #fastmath.clustering.ClusteringResult
;;=>  {:clustering (2 5 4 3 3 5 3 0 1),
;;=>   :clusters 7,
;;=>   :data [1 2 3 -1 -1 2 -1 11 111],
;;=>   :info {:distortion 0.0},
;;=>   :obj
;;=>   #object[smile.clustering.KMeans 0xada293 "Cluster distortion: 0.00000\nCluster size of 9 data points:\nCluster    1      1 (11.1%)\nCluster    2      1 (11.1%)\nCluster    3      1 (11.1%)\nCluster    4      3 (33.3%)\nCluster    5      1 (11.1%)\nCluster    6      2 (22.2%)\nCluster    7      0 ( 0.0%)\n"],
;;=>   :predict #,
;;=>   :representatives ((11.0) (111.0) (1.0) (-1.0) (3.0) (2.0) (##NaN)),
;;=>   :sizes (1 1 1 3 1 2 0 0),
;;=>   :type :k-means}
(regroup (k-means [1 2 3 -1 -1 2 -1 11 111] 7))
;;=> ({:data (1), :key 0, :representative (1.0), :size 1}
;;=>  {:data (2 2), :key 5, :representative (2.0), :size 2}
;;=>  {:data (3), :key 4, :representative (3.0), :size 1}
;;=>  {:data (-1 -1 -1), :key 3, :representative (-1.0), :size 3}
;;=>  {:data (11), :key 2, :representative (11.0), :size 1}
;;=>  {:data (111), :key 1, :representative (111.0), :size 1})
(count (regroup (k-means [1 2 3 -1 -1 2 -1 11 111] 7)))
;;=> 6

spectral

(spectral data clusters sigma)(spectral data clusters sigma max-iters tolerance)(spectral data clusters samples sigma)(spectral data clusters samples sigma max-iters tolerance)

Spectral clustering algorithm.

Input:

  • data - sequence of samples
  • clusters - number of clusters
  • samples (optional) - number of random samples for Nystrom approximation
  • sigma - width parameter for Gaussian kernel
  • max-iters (optional) - maximum number of iterations
  • tolerance (optional) - tolerance of k-means convergence test

See more in SMILE doc

Examples

2d vectors

(dissoc (spectral
         (repeatedly
          500
          (fn* []
            (vector (r/randval 0.1 (r/irand -10 10) (r/irand 100 150))
                    (r/randval (r/irand -10 10) (r/irand 100 150)))))
         4
         1)
        :data :clustering
        :obj :predict)
;;=> {:clusters 4,
;;=>  :info {:distortion 38.649549206140456},
;;=>  :representatives nil,
;;=>  :sizes (186 58 36 220 0),
;;=>  :type :spectral}

2d vectors

(dissoc (lloyd
         (repeatedly
          500
          (fn* []
            (vector (r/randval 0.1 (r/irand -10 10) (r/irand 100 150))
                    (r/randval (r/irand -10 10) (r/irand 100 150)))))
         4
         1)
        :data :clustering
        :obj :predict)
;;=> {:clusters 4,
;;=>  :info {:distortion 316885.16459203116},
;;=>  :representatives ((114.44117647058823 122.84558823529412)
;;=>                    (125.50230414746544 -0.29493087557603687)
;;=>                    (-0.6888888888888889 76.68888888888888)
;;=>                    (137.15686274509804 127.91176470588235)),
;;=>  :sizes (136 217 45 102 0),
;;=>  :type :lloyd}

x-means

(x-means data clusters)(x-means data clusters max-iter)(x-means data clusters max-iter tolerance)

X-Means algorithm.

Input:

  • data - sequence of samples
  • clusters - number of clusters
  • max-iter (optional) - maximum number of iterations
  • tolerance (optional) - tolerance of convergence test

See more in SMILE doc

Examples

Expect 2 clusters, gaussian distribution.

((juxt :clusters :sizes :representatives)
 (x-means (repeatedly 10000
                      (fn* [] (r/randval (r/grand) (r/grand 5 1.0))))
          4))
;;=> [2 (4969 5031 0) ((4.998015855439847) (-0.0011218297060218083))]