fastmath.clustering

Clustering algorithms.

Various clustering algorithms backed by the SMILE library.

Currently only partition clustering is implemented.

Input data

Input is always a sequence of n-dimensional samples, each sample itself a sequence.

For example, 2d samples [[1 2] [2 2] [3 3] ...]

For 1d data you can pass either a sequence of numbers or a sequence of 1-element sequences:

[1 2 3]
;; or
[[1] [2] [3]]

Distances

Some of the methods use distance functions; use the fastmath.distance namespace to create one.
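For instance, a minimal sketch (the aliases match those used in the examples below; the sample data is hypothetical):

```clojure
(require '[fastmath.clustering :as clust]
         '[fastmath.distance :as d])

;; pass a distance object (here Chebyshev) to a method that accepts one
(clust/clarans [[1 1] [2 2] [100 100] [101 101]] d/chebyshev 2)
```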

Output

Every function returns a record which contains:

  • :type - name of the method used
  • :data - input data
  • :clustering - sequence of cluster ids
  • :sizes - sizes of clusters
  • :clusters - number of clusters
  • :predict - predicting function (see below) which classifies an additional sample
  • :representatives - list of centroids or medoids if available
  • :info - additional statistics for your samples (like distortion)
  • :obj - SMILE object

Cluster id is an integer ranging from 0 to the number of clusters minus 1. Some methods mark outliers with outlier-id.
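For example, a hedged sketch of finding outliers with dbscan (the data is made up; whether any sample is actually marked as an outlier depends on the data and the parameters):

```clojure
(require '[fastmath.clustering :as clust])

;; two dense 1d groups plus one distant point; the lone point may be
;; assigned to the outlier cluster rather than a regular cluster id
(let [data (concat (range 20) (range 100 120) [10000])
      cl   (clust/dbscan data 5 2)]
  ;; count samples whose cluster id equals outlier-id
  (count (filter #{clust/outlier-id} (:clustering cl))))
```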

The record acts as a function and can classify an additional sample by calling its :predict function (or just call predict), for example (data is a sequence of 3d samples):

(let [cl (k-means data 10)] (cl [0 1 2]))

See k-means

Regrouping

A clustering record can be regrouped into a list of individual clusters. Call regroup to get a list of maps with the following structure:

  • :key - cluster id
  • :data - samples which belong to the cluster
  • :outliers? - whether the cluster contains outliers
  • :representative - centroid/medoid or average vector if the former is not available
  • :size - size of cluster

Constants

clarans

(clarans data clusters)(clarans data dist clusters)(clarans data dist clusters max-neighbor)(clarans data dist clusters max-neighbor num-local)

Clustering Large Applications based upon RANdomized Search algorithm.

Input:

  • data - sequence of samples
  • clusters - number of clusters

Optional:

  • dist - distance method, default :euclidean
  • max-neighbor - maximum number of neighbors checked during random search
  • num-local - the number of local minima to search for

See more in SMILE doc

Examples

Usage

(dissoc (clarans (repeatedly
                  1000
                  (fn* []
                    (r/randval 0.1 (r/irand -10 10) (r/irand 100 150))))
                 d/chebyshev
                 2)
        :data :clustering
        :obj :predict)
;;=> {:clusters 2,
;;=>  :info {:distortion 11772.0, :maxneighbor 100, :numlocalminima 4},
;;=>  :representatives ((124.0) (0.0)),
;;=>  :sizes (906 94),
;;=>  :type :clarans}

clustering-methods-list

List of clustering methods.

Examples

List of methods

clustering-methods-list
;;=> (:dbscan :k-means :mec
;;=>          :clarans :g-means
;;=>          :neural-gas :x-means
;;=>          :deterministic-annealing :denclue)

dbscan

(dbscan data min-pts radius)(dbscan data dist min-pts radius)

Density-Based Spatial Clustering of Applications with Noise algorithm.

Input:

  • data - sequence of samples
  • dist (optional) - distance method, default :euclidean
  • min-pts - minimum number of neighbors
  • radius - the neighborhood radius

See more in SMILE doc

Examples

3d vectors

(dissoc (dbscan
         (repeatedly
          5000
          (fn* []
            (vector (r/randval 0.1 (r/irand -10 10) (r/irand 100 150))
                    (r/randval (r/irand -10 10) (r/irand 100 150))
                    (r/randval (r/irand -10 10) (r/irand 100 150)))))
         10
         20)
        :data :clustering
        :obj :predict)
;;=> {:clusters 8,
;;=>  :info {:minpts 10.0, :radius 20.0},
;;=>  :representatives nil,
;;=>  :sizes (1156 1112 136 1135 1086 134 122 119 0),
;;=>  :type :dbscan}

denclue

(denclue data sigma m)

DENsity CLUstering algorithm.

Input:

  • data - sequence of samples
  • sigma - gaussian kernel parameter
  • m - number of selected samples, which should be much smaller than the total number of samples

See more in SMILE doc

Examples

Expect 2 clusters, uniform distribution.

((juxt :clusters :sizes :representatives)
 (denclue (repeatedly 100 (fn* [] (r/randval (r/drand) (r/drand 5 6))))
          1
          10))
;;=> [2 (48 52) nil]
(map
 (fn [m] (dissoc m :data))
 (regroup
  (denclue (repeatedly 1000
                       (fn* [] (r/randval 0.1 (r/drand) (r/drand 5 6))))
           1
           10)))
;;=> ({:key 0,
;;=>   :outliers? false,
;;=>   :representative 5.489619301329023,
;;=>   :size 909}
;;=>  {:key 1,
;;=>   :outliers? false,
;;=>   :representative 0.49735065256841965,
;;=>   :size 91})

deterministic-annealing

(deterministic-annealing data max-clusters)(deterministic-annealing data max-clusters alpha)

Deterministic Annealing algorithm.

Input:

  • data - sequence of samples
  • max-clusters - maximum number of clusters
  • alpha (optional) - temperature decreasing factor (valued from 0 to 1)

See more in SMILE doc

Examples

Usage

(map (fn [m] (dissoc m :data))
     (-> (repeatedly 1000
                     (fn* []
                       (vector (r/randval (r/grand) (r/grand 5 1.0))
                               (r/randval (r/grand) (r/grand 5 1.0)))))
         (deterministic-annealing 4 0.5)
         (regroup)))
;;=> ({:key 1,
;;=>   :outliers? false,
;;=>   :representative (4.990142167914382 2.508954952403819),
;;=>   :size 487}
;;=>  {:key 0,
;;=>   :outliers? false,
;;=>   :representative (0.05914135281035301 2.3939707464726445),
;;=>   :size 513})

g-means

(g-means data max-clusters)

G-Means algorithm.

Input:

  • data - sequence of samples
  • max-clusters - maximum number of clusters

See more in SMILE doc

Examples

Expect 2 clusters, uniform distribution.

((juxt :clusters :sizes :representatives)
 (g-means (repeatedly 100 (fn* [] (r/randval (r/drand) (r/drand 5 6))))
          4))
;;=> [2 (49 51) ((0.5370248588228403) (5.485254148442907))]

k-means

(k-means data clusters)(k-means data clusters max-iter)(k-means data clusters max-iter runs)

K-Means++ algorithm.

Input:

  • data - sequence of samples
  • clusters - number of clusters
  • max-iter (optional) - maximum number of iterations
  • runs (optional) - maximum number of runs

See more in SMILE doc

Examples

Usage

(k-means [1 2 3 -1 -1 2 -1 11 111] 4)
;;=> #fastmath.clustering.ClusteringResult
;;=>  {:clustering (3 3 3 0 0 3 0 2 1),
;;=>   :clusters 4,
;;=>   :data [1 2 3 -1 -1 2 -1 11 111],
;;=>   :info {:distortion 2.0000000000000004},
;;=>   :obj
;;=>   #object[smile.clustering.KMeans 0x1e5a2318 "K-Means distortion: 2.00000\nClusters of 9 data points of dimension 1:\n  0\t    3 (33.3%)\n  1\t    1 (11.1%)\n  2\t    1 (11.1%)\n  3\t    4 (44.4%)\n"],
;;=>   :predict #,
;;=>   :representatives ((-1.0) (111.0) (11.0) (2.0)),
;;=>   :sizes (3 1 1 4),
;;=>   :type :k-means}

Clusters regrouped into separate maps.

(regroup (k-means [1 2 3 -1 -1 2 -1 11 111] 4))
;;=> ({:data (1 2 3 2),
;;=>   :key 3,
;;=>   :outliers? false,
;;=>   :representative (2.0),
;;=>   :size 4}
;;=>  {:data (-1 -1 -1),
;;=>   :key 0,
;;=>   :outliers? false,
;;=>   :representative (-1.0),
;;=>   :size 3}
;;=>  {:data (11), :key 2, :outliers? false, :representative (11.0), :size 1}
;;=>  {:data (111),
;;=>   :key 1,
;;=>   :outliers? false,
;;=>   :representative (111.0),
;;=>   :size 1})

Use as predictor

(let [cl (k-means [1 2 3 -1 -1 2 -1 11 111] 4)]
  [(cl -1) (cl 10) (cl 100) (cl 1) (cl -1000) (cl 1000)])
;;=> [3 2 1 0 3 1]

mec

(mec data max-clusters radius)(mec data dist max-clusters radius)

Nonparametric Minimum Conditional Entropy Clustering algorithm.

Input:

  • data - sequence of samples
  • dist (optional) - distance method, default :euclidean
  • max-clusters - maximum number of clusters
  • radius - the neighborhood radius

See more in SMILE doc

Examples

2d vectors

(dissoc (mec (repeatedly
              5000
              (fn* []
                (vector
                 (r/randval 0.1 (r/irand -10 10) (r/irand 100 150))
                 (r/randval (r/irand -10 10) (r/irand 100 150)))))
             d/manhattan
             8
             20)
        :data :clustering
        :obj :predict)
;;=> {:clusters 7,
;;=>  :info {:entropy 445.78598430030064},
;;=>  :representatives nil,
;;=>  :sizes (981 801 1445 247 616 232 678),
;;=>  :type :mec}

neural-gas

(neural-gas data clusters)(neural-gas data clusters lambda-i lambda-f eps-i eps-f steps)

Neural Gas algorithm.

Input:

  • data - sequence of samples
  • clusters - number of clusters

Optional:

  • lambda-i - initial lambda value (soft learning radius/rate)
  • lambda-f - final lambda value
  • eps-i - initial epsilon value (learning rate)
  • eps-f - final epsilon value
  • steps - number of iterations

See more in SMILE doc

Examples

Usage

(neural-gas [1 2 3 -1 -1 2 -1 11 111] 4)
;;=> #fastmath.clustering.ClusteringResult
;;=>  {:clustering (0 0 0 0 0 0 0 3 1),
;;=>   :clusters 4,
;;=>   :data [1 2 3 -1 -1 2 -1 11 111],
;;=>   :info {:distortion 100.21135606663056},
;;=>   :obj
;;=>   #object[smile.vq.NeuralGas 0xd814247 "Neural Gas distortion: 100.21136\nClusters of 9 data points of dimension 1:\n  0\t    7 (77.8%)\n  1\t    1 (11.1%)\n  2\t    0 ( 0.0%)\n  3\t    1 (11.1%)\n"],
;;=>   :predict #,
;;=>   :representatives ((0.6767057694246992)
;;=>                     (102.02680736489556)
;;=>                     (38.047386026088205)
;;=>                     (9.498429886559977)),
;;=>   :sizes (7 1 0 1),
;;=>   :type :neural-gas}

Clusters regrouped into separate maps.

(regroup (neural-gas [1 2 3 -1 -1 2 -1 11 111] 4))
;;=> ({:data (1 2 3 -1 -1 2 -1),
;;=>   :key 0,
;;=>   :outliers? false,
;;=>   :representative (0.6767057694246992),
;;=>   :size 7}
;;=>  {:data (11),
;;=>   :key 3,
;;=>   :outliers? false,
;;=>   :representative (9.498429886559977),
;;=>   :size 1}
;;=>  {:data (111),
;;=>   :key 1,
;;=>   :outliers? false,
;;=>   :representative (102.02680736489556),
;;=>   :size 1})

outlier-id

const

;;=> 2147483647

Id of the cluster which contains outliers.

predict

(predict cluster in)

Predicts the cluster id for a given sample.
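A minimal sketch with small hypothetical 1d data; calling the record directly delegates to predict, so both calls classify the new sample the same way:

```clojure
(require '[fastmath.clustering :as clust])

(let [cl (clust/k-means [1 2 3 100 101 102] 2)]
  ;; both forms return the same cluster id for the new sample 2.5
  [(clust/predict cl 2.5) (cl 2.5)])
```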

regroup

(regroup clustered-data)

Transform clustering result into list of clusters as separate maps.

Every map contains:

  • :key - cluster id
  • :data - samples which belong to the cluster
  • :outliers? - whether the cluster contains outliers
  • :representative - centroid/medoid or average vector if the former is not available
  • :size - size of cluster

The representative is always an n-dimensional sequence, even if the input is a list of numbers.

Empty clusters are skipped.

Examples

Result of clustering with regrouping

(k-means [1 2 3 -1 -1 2 -1 11 111] 7)
;;=> #fastmath.clustering.ClusteringResult
;;=>  {:clustering (5 3 4 0 0 3 0 2 1),
;;=>   :clusters 7,
;;=>   :data [1 2 3 -1 -1 2 -1 11 111],
;;=>   :info {:distortion 0.0},
;;=>   :obj
;;=>   #object[smile.clustering.KMeans 0x210e8d34 "K-Means distortion: 0.00000\nClusters of 9 data points of dimension 1:\n  0\t    3 (33.3%)\n  1\t    1 (11.1%)\n  2\t    1 (11.1%)\n  3\t    2 (22.2%)\n  4\t    1 (11.1%)\n  5\t    1 (11.1%)\n  6\t    0 ( 0.0%)\n"],
;;=>   :predict #,
;;=>   :representatives ((-1.0) (111.0) (11.0) (2.0) (3.0) (1.0) (##NaN)),
;;=>   :sizes (3 1 1 2 1 1 0),
;;=>   :type :k-means}
(regroup (k-means [1 2 3 -1 -1 2 -1 11 111] 7))
;;=> ({:data (1), :key 5, :outliers? false, :representative (1.0), :size 1}
;;=>  {:data (2 2), :key 3, :outliers? false, :representative (2.0), :size 2}
;;=>  {:data (3), :key 4, :outliers? false, :representative (3.0), :size 1}
;;=>  {:data (-1 -1 -1),
;;=>   :key 2,
;;=>   :outliers? false,
;;=>   :representative (-1.0),
;;=>   :size 3}
;;=>  {:data (11), :key 0, :outliers? false, :representative (11.0), :size 1}
;;=>  {:data (111),
;;=>   :key 1,
;;=>   :outliers? false,
;;=>   :representative (111.0),
;;=>   :size 1})
(count (regroup (k-means [1 2 3 -1 -1 2 -1 11 111] 7)))
;;=> 6

x-means

(x-means data max-clusters)

X-Means algorithm.

Input:

  • data - sequence of samples
  • max-clusters - maximum number of clusters

See more in SMILE doc

Examples

Expect 2 clusters, gaussian distribution.

((juxt :clusters :sizes :representatives)
 (x-means (repeatedly 10000
                      (fn* [] (r/randval (r/grand) (r/grand 5 1.0))))
          4))
;;=> [2 (4906 5094) ((-6.044219999840173E-4) (4.972183058432598))]