Statistics
This notebook provides a comprehensive overview and examples of the statistical functions available in the fastmath.stats namespace. It covers a wide array of descriptive statistics, measures of spread, quantiles, moments, correlation, distance metrics, contingency table analysis, binary classification metrics, effect sizes, various statistical tests (normality, binomial, t/z, variance, goodness-of-fit, ANOVA, autocorrelation), time series analysis tools (ACF, PACF), data transformations, and histogram generation. Each section introduces the relevant concepts and demonstrates the use of fastmath.stats functions with illustrative datasets (mtcars, iris, winequality).
Datasets
To illustrate the statistical functions, three datasets will be used:

- mtcars
- iris
- winequality

The fastmath.dev.dataset namespace (aliased as ds) will be used to access the data.
Select the :mpg column from mtcars:

(ds/mtcars :mpg)
;; => (21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26 30.4 15.8 19.7 15 21.4)
Select the :mpg column with a row predicate:

(ds/mtcars :mpg (fn [row] (and (< (row :mpg) 20.0) (zero? (row :am)))))
;; => (18.7 18.1 14.3 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7 15.5 15.2 13.3 19.2)
Group by the :am column and select :mpg:

(ds/by ds/mtcars :am :mpg)
;; => {1 (21 21 22.8 32.4 30.4 33.9 27.3 26 30.4 15.8 19.7 15 21.4), 0 (21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7 21.5 15.5 15.2 13.3 19.2)}
mtcars
A comparison of 32 different cars across 11 attributes.
name | qsec | cyl | am | gear | disp | wt | drat | hp | mpg | vs | carb |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 16.46 | 6 | 1 | 4 | 160 | 2.62 | 3.9 | 110 | 21 | 0 | 4 |
Mazda RX4 Wag | 17.02 | 6 | 1 | 4 | 160 | 2.875 | 3.9 | 110 | 21 | 0 | 4 |
Datsun 710 | 18.61 | 4 | 1 | 4 | 108 | 2.32 | 3.85 | 93 | 22.8 | 1 | 1 |
Hornet 4 Drive | 19.44 | 6 | 0 | 3 | 258 | 3.215 | 3.08 | 110 | 21.4 | 1 | 1 |
Hornet Sportabout | 17.02 | 8 | 0 | 3 | 360 | 3.44 | 3.15 | 175 | 18.7 | 0 | 2 |
Valiant | 20.22 | 6 | 0 | 3 | 225 | 3.46 | 2.76 | 105 | 18.1 | 1 | 1 |
Duster 360 | 15.84 | 8 | 0 | 3 | 360 | 3.57 | 3.21 | 245 | 14.3 | 0 | 4 |
Merc 240D | 20 | 4 | 0 | 4 | 146.7 | 3.19 | 3.69 | 62 | 24.4 | 1 | 2 |
Merc 230 | 22.9 | 4 | 0 | 4 | 140.8 | 3.15 | 3.92 | 95 | 22.8 | 1 | 2 |
Merc 280 | 18.3 | 6 | 0 | 4 | 167.6 | 3.44 | 3.92 | 123 | 19.2 | 1 | 4 |
Merc 280C | 18.9 | 6 | 0 | 4 | 167.6 | 3.44 | 3.92 | 123 | 17.8 | 1 | 4 |
Merc 450SE | 17.4 | 8 | 0 | 3 | 275.8 | 4.07 | 3.07 | 180 | 16.4 | 0 | 3 |
Merc 450SL | 17.6 | 8 | 0 | 3 | 275.8 | 3.73 | 3.07 | 180 | 17.3 | 0 | 3 |
Merc 450SLC | 18 | 8 | 0 | 3 | 275.8 | 3.78 | 3.07 | 180 | 15.2 | 0 | 3 |
Cadillac Fleetwood | 17.98 | 8 | 0 | 3 | 472 | 5.25 | 2.93 | 205 | 10.4 | 0 | 4 |
Lincoln Continental | 17.82 | 8 | 0 | 3 | 460 | 5.424 | 3 | 215 | 10.4 | 0 | 4 |
Chrysler Imperial | 17.42 | 8 | 0 | 3 | 440 | 5.345 | 3.23 | 230 | 14.7 | 0 | 4 |
Fiat 128 | 19.47 | 4 | 1 | 4 | 78.7 | 2.2 | 4.08 | 66 | 32.4 | 1 | 1 |
Honda Civic | 18.52 | 4 | 1 | 4 | 75.7 | 1.615 | 4.93 | 52 | 30.4 | 1 | 2 |
Toyota Corolla | 19.9 | 4 | 1 | 4 | 71.1 | 1.835 | 4.22 | 65 | 33.9 | 1 | 1 |
Toyota Corona | 20.01 | 4 | 0 | 3 | 120.1 | 2.465 | 3.7 | 97 | 21.5 | 1 | 1 |
Dodge Challenger | 16.87 | 8 | 0 | 3 | 318 | 3.52 | 2.76 | 150 | 15.5 | 0 | 2 |
AMC Javelin | 17.3 | 8 | 0 | 3 | 304 | 3.435 | 3.15 | 150 | 15.2 | 0 | 2 |
Camaro Z28 | 15.41 | 8 | 0 | 3 | 350 | 3.84 | 3.73 | 245 | 13.3 | 0 | 4 |
Pontiac Firebird | 17.05 | 8 | 0 | 3 | 400 | 3.845 | 3.08 | 175 | 19.2 | 0 | 2 |
Fiat X1-9 | 18.9 | 4 | 1 | 4 | 79 | 1.935 | 4.08 | 66 | 27.3 | 1 | 1 |
Porsche 914-2 | 16.7 | 4 | 1 | 5 | 120.3 | 2.14 | 4.43 | 91 | 26 | 0 | 2 |
Lotus Europa | 16.9 | 4 | 1 | 5 | 95.1 | 1.513 | 3.77 | 113 | 30.4 | 1 | 2 |
Ford Pantera L | 14.5 | 8 | 1 | 5 | 351 | 3.17 | 4.22 | 264 | 15.8 | 0 | 4 |
Ferrari Dino | 15.5 | 6 | 1 | 5 | 145 | 2.77 | 3.62 | 175 | 19.7 | 0 | 6 |
Maserati Bora | 14.6 | 8 | 1 | 5 | 301 | 3.57 | 3.54 | 335 | 15 | 0 | 8 |
Volvo 142E | 18.6 | 4 | 1 | 4 | 121 | 2.78 | 4.11 | 109 | 21.4 | 1 | 2 |
iris
Sepal and petal measurements for three iris species.
sepal-length | sepal-width | petal-length | petal-width | species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
4.6 | 3.1 | 1.5 | 0.2 | setosa |
5 | 3.6 | 1.4 | 0.2 | setosa |
5.4 | 3.9 | 1.7 | 0.4 | setosa |
4.6 | 3.4 | 1.4 | 0.3 | setosa |
5 | 3.4 | 1.5 | 0.2 | setosa |
4.4 | 2.9 | 1.4 | 0.2 | setosa |
4.9 | 3.1 | 1.5 | 0.1 | setosa |
5.4 | 3.7 | 1.5 | 0.2 | setosa |
4.8 | 3.4 | 1.6 | 0.2 | setosa |
4.8 | 3 | 1.4 | 0.1 | setosa |
4.3 | 3 | 1.1 | 0.1 | setosa |
5.8 | 4 | 1.2 | 0.2 | setosa |
5.7 | 4.4 | 1.5 | 0.4 | setosa |
5.4 | 3.9 | 1.3 | 0.4 | setosa |
5.1 | 3.5 | 1.4 | 0.3 | setosa |
5.7 | 3.8 | 1.7 | 0.3 | setosa |
5.1 | 3.8 | 1.5 | 0.3 | setosa |
5.4 | 3.4 | 1.7 | 0.2 | setosa |
5.1 | 3.7 | 1.5 | 0.4 | setosa |
4.6 | 3.6 | 1 | 0.2 | setosa |
5.1 | 3.3 | 1.7 | 0.5 | setosa |
4.8 | 3.4 | 1.9 | 0.2 | setosa |
5 | 3 | 1.6 | 0.2 | setosa |
5 | 3.4 | 1.6 | 0.4 | setosa |
5.2 | 3.5 | 1.5 | 0.2 | setosa |
5.2 | 3.4 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.6 | 0.2 | setosa |
4.8 | 3.1 | 1.6 | 0.2 | setosa |
5.4 | 3.4 | 1.5 | 0.4 | setosa |
5.2 | 4.1 | 1.5 | 0.1 | setosa |
5.5 | 4.2 | 1.4 | 0.2 | setosa |
4.9 | 3.1 | 1.5 | 0.1 | setosa |
5 | 3.2 | 1.2 | 0.2 | setosa |
5.5 | 3.5 | 1.3 | 0.2 | setosa |
4.9 | 3.1 | 1.5 | 0.1 | setosa |
4.4 | 3 | 1.3 | 0.2 | setosa |
5.1 | 3.4 | 1.5 | 0.2 | setosa |
5 | 3.5 | 1.3 | 0.3 | setosa |
4.5 | 2.3 | 1.3 | 0.3 | setosa |
4.4 | 3.2 | 1.3 | 0.2 | setosa |
5 | 3.5 | 1.6 | 0.6 | setosa |
5.1 | 3.8 | 1.9 | 0.4 | setosa |
4.8 | 3 | 1.4 | 0.3 | setosa |
5.1 | 3.8 | 1.6 | 0.2 | setosa |
4.6 | 3.2 | 1.4 | 0.2 | setosa |
5.3 | 3.7 | 1.5 | 0.2 | setosa |
5 | 3.3 | 1.4 | 0.2 | setosa |
7 | 3.2 | 4.7 | 1.4 | versicolor |
6.4 | 3.2 | 4.5 | 1.5 | versicolor |
6.9 | 3.1 | 4.9 | 1.5 | versicolor |
5.5 | 2.3 | 4 | 1.3 | versicolor |
6.5 | 2.8 | 4.6 | 1.5 | versicolor |
5.7 | 2.8 | 4.5 | 1.3 | versicolor |
6.3 | 3.3 | 4.7 | 1.6 | versicolor |
4.9 | 2.4 | 3.3 | 1 | versicolor |
6.6 | 2.9 | 4.6 | 1.3 | versicolor |
5.2 | 2.7 | 3.9 | 1.4 | versicolor |
5 | 2 | 3.5 | 1 | versicolor |
5.9 | 3 | 4.2 | 1.5 | versicolor |
6 | 2.2 | 4 | 1 | versicolor |
6.1 | 2.9 | 4.7 | 1.4 | versicolor |
5.6 | 2.9 | 3.6 | 1.3 | versicolor |
6.7 | 3.1 | 4.4 | 1.4 | versicolor |
5.6 | 3 | 4.5 | 1.5 | versicolor |
5.8 | 2.7 | 4.1 | 1 | versicolor |
6.2 | 2.2 | 4.5 | 1.5 | versicolor |
5.6 | 2.5 | 3.9 | 1.1 | versicolor |
5.9 | 3.2 | 4.8 | 1.8 | versicolor |
6.1 | 2.8 | 4 | 1.3 | versicolor |
6.3 | 2.5 | 4.9 | 1.5 | versicolor |
6.1 | 2.8 | 4.7 | 1.2 | versicolor |
6.4 | 2.9 | 4.3 | 1.3 | versicolor |
6.6 | 3 | 4.4 | 1.4 | versicolor |
6.8 | 2.8 | 4.8 | 1.4 | versicolor |
6.7 | 3 | 5 | 1.7 | versicolor |
6 | 2.9 | 4.5 | 1.5 | versicolor |
5.7 | 2.6 | 3.5 | 1 | versicolor |
5.5 | 2.4 | 3.8 | 1.1 | versicolor |
5.5 | 2.4 | 3.7 | 1 | versicolor |
5.8 | 2.7 | 3.9 | 1.2 | versicolor |
6 | 2.7 | 5.1 | 1.6 | versicolor |
5.4 | 3 | 4.5 | 1.5 | versicolor |
6 | 3.4 | 4.5 | 1.6 | versicolor |
6.7 | 3.1 | 4.7 | 1.5 | versicolor |
6.3 | 2.3 | 4.4 | 1.3 | versicolor |
5.6 | 3 | 4.1 | 1.3 | versicolor |
5.5 | 2.5 | 4 | 1.3 | versicolor |
5.5 | 2.6 | 4.4 | 1.2 | versicolor |
6.1 | 3 | 4.6 | 1.4 | versicolor |
5.8 | 2.6 | 4 | 1.2 | versicolor |
5 | 2.3 | 3.3 | 1 | versicolor |
5.6 | 2.7 | 4.2 | 1.3 | versicolor |
5.7 | 3 | 4.2 | 1.2 | versicolor |
5.7 | 2.9 | 4.2 | 1.3 | versicolor |
6.2 | 2.9 | 4.3 | 1.3 | versicolor |
5.1 | 2.5 | 3 | 1.1 | versicolor |
5.7 | 2.8 | 4.1 | 1.3 | versicolor |
6.3 | 3.3 | 6 | 2.5 | virginica |
5.8 | 2.7 | 5.1 | 1.9 | virginica |
7.1 | 3 | 5.9 | 2.1 | virginica |
6.3 | 2.9 | 5.6 | 1.8 | virginica |
6.5 | 3 | 5.8 | 2.2 | virginica |
7.6 | 3 | 6.6 | 2.1 | virginica |
4.9 | 2.5 | 4.5 | 1.7 | virginica |
7.3 | 2.9 | 6.3 | 1.8 | virginica |
6.7 | 2.5 | 5.8 | 1.8 | virginica |
7.2 | 3.6 | 6.1 | 2.5 | virginica |
6.5 | 3.2 | 5.1 | 2 | virginica |
6.4 | 2.7 | 5.3 | 1.9 | virginica |
6.8 | 3 | 5.5 | 2.1 | virginica |
5.7 | 2.5 | 5 | 2 | virginica |
5.8 | 2.8 | 5.1 | 2.4 | virginica |
6.4 | 3.2 | 5.3 | 2.3 | virginica |
6.5 | 3 | 5.5 | 1.8 | virginica |
7.7 | 3.8 | 6.7 | 2.2 | virginica |
7.7 | 2.6 | 6.9 | 2.3 | virginica |
6 | 2.2 | 5 | 1.5 | virginica |
6.9 | 3.2 | 5.7 | 2.3 | virginica |
5.6 | 2.8 | 4.9 | 2 | virginica |
7.7 | 2.8 | 6.7 | 2 | virginica |
6.3 | 2.7 | 4.9 | 1.8 | virginica |
6.7 | 3.3 | 5.7 | 2.1 | virginica |
7.2 | 3.2 | 6 | 1.8 | virginica |
6.2 | 2.8 | 4.8 | 1.8 | virginica |
6.1 | 3 | 4.9 | 1.8 | virginica |
6.4 | 2.8 | 5.6 | 2.1 | virginica |
7.2 | 3 | 5.8 | 1.6 | virginica |
7.4 | 2.8 | 6.1 | 1.9 | virginica |
7.9 | 3.8 | 6.4 | 2 | virginica |
6.4 | 2.8 | 5.6 | 2.2 | virginica |
6.3 | 2.8 | 5.1 | 1.5 | virginica |
6.1 | 2.6 | 5.6 | 1.4 | virginica |
7.7 | 3 | 6.1 | 2.3 | virginica |
6.3 | 3.4 | 5.6 | 2.4 | virginica |
6.4 | 3.1 | 5.5 | 1.8 | virginica |
6 | 3 | 4.8 | 1.8 | virginica |
6.9 | 3.1 | 5.4 | 2.1 | virginica |
6.7 | 3.1 | 5.6 | 2.4 | virginica |
6.9 | 3.1 | 5.1 | 2.3 | virginica |
5.8 | 2.7 | 5.1 | 1.9 | virginica |
6.8 | 3.2 | 5.9 | 2.3 | virginica |
6.7 | 3.3 | 5.7 | 2.5 | virginica |
6.7 | 3 | 5.2 | 2.3 | virginica |
6.3 | 2.5 | 5 | 1.9 | virginica |
6.5 | 3 | 5.2 | 2 | virginica |
6.2 | 3.4 | 5.4 | 2.3 | virginica |
5.9 | 3 | 5.1 | 1.8 | virginica |
winequality
White Portuguese Vinho Verde wines described by a set of quality parameters (physicochemical and sensory).
Data
Let’s bind some common columns to global vars.
mtcars
Miles per US gallon
(def mpg (ds/mtcars :mpg))
Horsepower
(def hp (ds/mtcars :hp))
Weight of the car
(def wt (ds/mtcars :wt))
iris
Sepal lengths
(def sepal-lengths (ds/by ds/iris :species :sepal-length))
(def setosa-sepal-length (sepal-lengths :setosa))
(def virginica-sepal-length (sepal-lengths :virginica))
Sepal widths
(def sepal-widths (ds/by ds/iris :species :sepal-width))
(def setosa-sepal-width (sepal-widths :setosa))
(def virginica-sepal-width (sepal-widths :virginica))
winequality
(def residual-sugar (ds/winequality "residual sugar"))
(def alcohol (ds/winequality "alcohol"))
Basic Descriptive Statistics
General summary statistics for describing the center and range of data.
- minimum, maximum
- sum
- mean
- geomean, harmean, powmean
- mode, modes
- wmode, wmodes
- stats-map
Basic
This section covers fundamental descriptive statistics, including finding the smallest (minimum) and largest (maximum) values and calculating the total (sum).
(stats/minimum mpg) ;; => 10.4
(stats/maximum mpg) ;; => 33.9
(stats/sum mpg) ;; => 642.8999999999999
Compensated summation can be used to reduce numerical error. There are three algorithms implemented:
- :kahan: The classic algorithm using a single correction variable to reduce numerical error.
- :neumayer: An improvement on Kahan, also using one correction variable but often providing better accuracy.
- :klein: A higher-order method using two correction variables, typically offering the highest accuracy at a slight computational cost.
As you can see below, all compensated summation algorithms give an accurate result for the mpg data.
(stats/sum mpg) ;; => 642.8999999999999
(stats/sum mpg :kahan) ;; => 642.9
(stats/sum mpg :neumayer) ;; => 642.9
(stats/sum mpg :klein) ;; => 642.9
But here is an example in which plain summation and :kahan fail.
(stats/sum [1.0 1.0E100 1.0 -1.0E100]) ;; => 0.0
(stats/sum [1.0 1.0E100 1.0 -1.0E100] :kahan) ;; => 0.0
(stats/sum [1.0 1.0E100 1.0 -1.0E100] :neumayer) ;; => 2.0
(stats/sum [1.0 1.0E100 1.0 -1.0E100] :klein) ;; => 2.0
Mean
The mean is a measure of central tendency. fastmath.stats provides several types of means:
- Arithmetic Mean (mean): The sum of values divided by their count. It’s the most common type of average.
\[\mu = \frac{1}{n} \sum_{i=1}^{n} x_i\]
- Geometric Mean (geomean): The n-th root of the product of n numbers. Suitable for averaging ratios, growth rates, or values that are multiplicative in nature. Requires all values to be positive. It’s less affected by extremely large values than the arithmetic mean, but more affected by extremely small values.
\[G = \left(\prod_{i=1}^{n} x_i\right)^{1/n} = \exp\left(\frac{1}{n}\sum_{i=1}^{n} \ln(x_i)\right)\]
- Harmonic Mean (harmean): The reciprocal of the arithmetic mean of the reciprocals of the observations. Appropriate for averaging rates (e.g., speeds). It is sensitive to small values and requires all values to be positive and non-zero.
\[H = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}\]
- Power Mean (powmean): Also known as the generalized mean or Hölder mean. It generalizes the arithmetic, geometric, and harmonic means. Defined by a power parameter \(p\).
\[M_p = \left(\frac{1}{n} \sum_{i=1}^{n} x_i^p\right)^{1/p} \text{ for } (p \neq 0)\]
Special cases:
- \(p \to 0\): Geometric Mean
- \(p = 1\): Arithmetic Mean
- \(p = -1\): Harmonic Mean
- \(p = 2\): Root Mean Square (RMS)
- \(p \to \infty\): Maximum value
- \(p \to -\infty\): Minimum value
The behavior depends on \(p\): higher \(p\) gives more weight to larger values, lower \(p\) gives more weight to smaller values.
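Before the library calls below, a quick sanity check of the geometric mean’s log identity, as a minimal sketch (values 2 and 8 are arbitrary; results agree up to floating-point rounding):

;; geomean equals exp of the mean of logs (sketch).
(Math/exp (stats/mean (map #(Math/log %) [2.0 8.0]))) ;; => 4.0 (up to rounding)
(stats/geomean [2.0 8.0]) ;; => 4.0 (up to rounding)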
(stats/mean residual-sugar) ;; => 6.391414863209474
(stats/geomean residual-sugar) ;; => 4.397022390150047
(stats/harmean residual-sugar) ;; => 2.9295723194226815
(stats/powmean residual-sugar ##-Inf) ;; => 0.6
(stats/powmean residual-sugar -4.5) ;; => 1.5790401256763393
(stats/powmean residual-sugar -1.0) ;; => 2.9295723194226815
(stats/powmean residual-sugar -0.0) ;; => 4.397022390150047
(stats/powmean residual-sugar 1.0) ;; => 6.391414863209474
(stats/powmean residual-sugar 4.5) ;; => 12.228524559326921
(stats/powmean residual-sugar ##Inf) ;; => 65.8
All values of the power mean for the residual-sugar data, for powers ranging from -5 to 5.
Weighted
Every mean function accepts an optional weights vector. The formulas for weighted means are as follows.
- Arithmetic Mean (mean):
\[\mu_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}\]
- Geometric Mean (geomean):
\[G_w = \left(\prod_{i=1}^{n} x_i^{w_i}\right)^{1/\sum w_i} = \exp\left(\frac{\sum_{i=1}^{n} w_i \ln(x_i)}{\sum_{i=1}^{n} w_i}\right)\]
- Harmonic Mean (harmean):
\[H_w = \frac{\sum_{i=1}^{n} w_i}{\sum_{i=1}^{n} \frac{w_i}{x_i}}\]
- Power Mean (powmean):
\[M_{w,p} = \left(\frac{\sum_{i=1}^{n} w_i x_i^p}{\sum_{i=1}^{n} w_i}\right)^{1/p} \text{ for } (p \neq 0)\]
Let’s calculate the means of hp (horsepower) weighted by wt (car weight).
(stats/mean hp) ;; => 146.6875
(stats/mean hp wt) ;; => 159.99440515968607
(stats/geomean hp) ;; => 131.88367883954564
(stats/geomean hp wt) ;; => 145.7870847521906
(stats/harmean hp) ;; => 118.2288915187372
(stats/harmean hp wt) ;; => 131.56286792601395
(stats/powmean hp -4.5) ;; => 87.4433284945516
(stats/powmean hp wt -4.5) ;; => 95.2308358812506
(stats/powmean hp 4.5) ;; => 193.97486455878905
(stats/powmean hp wt 4.5) ;; => 201.37510347727357
Expectile
The expectile is a measure of location, related to both the [mean] and [quantile]. For a given level τ (tau, a value between 0 and 1), the τ-th expectile is the value t that minimizes an asymmetrically weighted sum of squared differences from t. This is distinct from quantiles, which minimize an asymmetrically weighted sum of absolute differences.
A key property of expectiles is that the 0.5-expectile is identical to the arithmetic [mean]. As τ varies from 0 to 1, expectiles span a range of values, typically from the minimum (τ=0) to the maximum (τ=1) of the dataset. Like the mean, expectiles are influenced by the magnitude of all data points, making them generally more sensitive to outliers than corresponding quantiles (e.g., the median).
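A minimal sketch of this definition (not the library’s implementation): iterate the asymmetrically weighted mean until it stabilizes; the fixed point is the τ-expectile.

;; Fixed-point iteration for the tau-expectile (sketch, assumptions mine).
(defn expectile-by-iteration [vs tau]
  (let [step (fn [t]
               (let [ws (map #(if (> % t) tau (- 1.0 tau)) vs)]
                 (/ (reduce + (map * ws vs)) (reduce + ws))))]
    (->> (iterate step (stats/mean vs))
         (partition 2 1)
         (drop-while (fn [[a b]] (> (Math/abs (- (double b) (double a))) 1e-12)))
         ffirst)))

(expectile-by-iteration [1.0 2.0 3.0 10.0] 0.5) ;; => 4.0, the arithmetic mean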
(stats/expectile residual-sugar 0.1) ;; => 2.794600951101003
(stats/expectile residual-sugar 0.5) ;; => 6.391414863209474
(stats/mean residual-sugar) ;; => 6.391414863209474
(stats/expectile residual-sugar 0.9) ;; => 11.368419035291927
(stats/expectile hp wt 0.25) ;; => 130.66198036981677
Plotting expectiles for the residual-sugar data across the range of τ from 0.0 to 1.0.
Mode
The mode is the value that appears most frequently in a dataset. For numeric data, mode returns the first mode encountered, while modes returns a sequence of all modes (in increasing order for the default method). Let’s see that mode returns the elements with the highest frequency in hp (showing only the first 10 values):
^:kind/hidden
(->> (frequencies hp)
     (sort-by second >)
     (take 10))
value | frequency |
---|---|
110 | 3 |
175 | 3 |
180 | 3 |
150 | 2 |
66 | 2 |
245 | 2 |
123 | 2 |
65 | 1 |
62 | 1 |
205 | 1 |
(stats/mode hp) ;; => 110.0
(stats/modes hp) ;; => (110.0 175.0 180.0)
When dealing with data potentially from a continuous distribution, these functions can estimate the mode using different methods:
- :histogram: Mode(s) based on the peak(s) of a histogram.
- :pearson: Mode estimated using Pearson’s second skewness coefficient (mode ≈ 3 * median - 2 * mean).
- :kde: Mode(s) based on Kernel Density Estimation, finding original data points with the highest estimated density.
- The default method finds the exact most frequent value(s), suitable for discrete data.
(stats/mode residual-sugar) ;; => 1.2
(stats/mode residual-sugar :histogram) ;; => 1.700695134061569
(stats/mode residual-sugar :pearson) ;; => 2.8171702735810538
(stats/mode residual-sugar :kde) ;; => 1.65
For weighted data, or data of any type (not just numeric), use wmode and wmodes. wmode returns the first weighted mode (the one with the highest total weight encountered first), and wmodes returns all values that share the highest total weight. If weights are omitted, they default to 1.0 for each value, effectively calculating unweighted modes for any data type.
(stats/wmode [:a :b :c :d] [1 2.5 1 2.5]) ;; => :b
(stats/wmodes [:a :b :c :d] [1 2.5 1 2.5]) ;; => (:b :d)
Stats
The stats-map function provides a comprehensive summary of descriptive statistics for a given dataset. It returns a map where keys are statistic names (as keywords) and values are their calculated measures. This function is a convenient way to get a quick overview of the data’s characteristics.
The resulting map contains the following key-value pairs:
- :Size - The number of data points in the sequence.
- :Min - The smallest value in the sequence (see [minimum]).
- :Max - The largest value in the sequence (see [maximum]).
- :Range - The difference between the maximum and minimum values.
- :Mean - The arithmetic average of the values (see [mean]).
- :Median - The middle value of the sorted sequence (see [median]).
- :Mode - The most frequently occurring value(s) (see [mode]).
- :Q1 - The first quartile (25th percentile) of the data (see [percentile]).
- :Q3 - The third quartile (75th percentile) of the data (see [percentile]).
- :Total - The sum of all values in the sequence (see [sum]).
- :SD - The sample standard deviation, a measure of data dispersion around the mean.
- :Variance - The sample variance, the square of the standard deviation.
- :MAD - The Median Absolute Deviation, a robust measure of variability (see [median-absolute-deviation]).
- :SEM - The Standard Error of the Mean, an estimate of the standard deviation of the sample mean.
- :LAV - The Lower Adjacent Value, the smallest observation that is not an outlier (see [adjacent-values]).
- :UAV - The Upper Adjacent Value, the largest observation that is not an outlier (see [adjacent-values]).
- :IQR - The Interquartile Range, the difference between Q3 and Q1.
- :LOF - The Lower Outer Fence, a threshold for identifying extreme low outliers (Q1 - 3.0 * IQR).
- :UOF - The Upper Outer Fence, a threshold for identifying extreme high outliers (Q3 + 3.0 * IQR).
- :LIF - The Lower Inner Fence, a threshold for identifying mild low outliers (Q1 - 1.5 * IQR).
- :UIF - The Upper Inner Fence, a threshold for identifying mild high outliers (Q3 + 1.5 * IQR).
- :Outliers - A list of data points considered outliers (values outside the inner fences, see [outliers]).
- :Kurtosis - A measure of the “tailedness” or “peakedness” of the distribution (see [kurtosis]).
- :Skewness - A measure of the asymmetry of the distribution (see [skewness]).
(stats/stats-map residual-sugar)
{:IQR 8.200000000000001,
 :Kurtosis 3.4698201025634363,
:LAV 0.6,
:LIF -10.600000000000001,
:LOF -22.900000000000002,
:MAD 3.6,
:Max 65.8,
:Mean 6.391414863209486,
:Median 5.2,
:Min 0.6,
:Mode 1.2,
:Outliers (22.6 23.5 26.05 26.05 31.6 31.6 65.8),
:Q1 1.7,
:Q3 9.9,
:Range 65.2,
:SD 5.072057784014863,
:SEM 0.07247276021182479,
:Size 4898,
:Skewness 1.0770937564241123,
:Total 31305.150000000063,
:UAV 22.0,
:UIF 22.200000000000003,
:UOF 34.5,
:Variance 25.72577016438576}
Quantiles and Percentiles
Statistics related to points dividing the distribution of data.
- percentile, percentiles
- quantile, quantiles
- wquantile, wquantiles
- median, median-3
- wmedian
Quantiles and percentiles are statistics that divide the range of a probability distribution into continuous intervals with equal probabilities, or divide the observations in a sample in the same way.
fastmath.stats provides several functions to calculate these measures:
- percentile: Calculates the p-th percentile (a value from 0 to 100) of a sequence.
- percentiles: Calculates multiple percentiles for a sequence.
- quantile: Calculates the q-th quantile (a value from 0.0 to 1.0) of a sequence. This is equivalent to (percentile vs (* q 100.0)).
- quantiles: Calculates multiple quantiles for a sequence.
- median: Calculates the median (0.5 quantile or 50th percentile) of a sequence.
- median-3: A specialized function that calculates the median of exactly three numbers.
All of percentile, percentiles, quantile, quantiles, and median accept an optional estimation-strategy keyword. This parameter determines how the quantile is estimated, particularly how interpolation is handled when the desired quantile falls between data points in the sorted sequence.
(stats/quantile mpg 0.25) ;; => 15.274999999999999
(stats/quantiles mpg [0.1 0.25 0.5 0.75 0.9]) ;; => [13.600000000000001 15.274999999999999 19.2 22.8 30.4]
(stats/percentile mpg 50.0) ;; => 19.2
(stats/percentiles residual-sugar [10 25 50 75 90]) ;; => [1.2 1.7 5.2 9.9 14.0]
(stats/median mpg) ;; => 19.2
(stats/median-3 15 5 10) ;; => 10.0
The available estimation-strategy values are :legacy (default), :r1, :r2, :r3, :r4, :r5, :r6, :r7, :r8 and :r9. Formulas for all of them can be found in the Wikipedia article on quantiles. :legacy uses an estimate of the form: \(Q_p = x_{\lceil p(N+1) - 1/2 \rceil}\)
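To see where the :legacy numbers below come from, here is a minimal sketch of position-based estimation with pos = p(N+1) and linear interpolation between order statistics (my assumption about the exact edge handling; it reproduces the :legacy row for this sample):

;; Position-based quantile with linear interpolation (sketch).
(defn quantile-interpolated [vs p]
  (let [sorted (vec (sort vs))
        n (count sorted)
        pos (* p (inc n))]
    (cond (< pos 1) (first sorted)
          (>= pos n) (peek sorted)
          :else (let [i (long (Math/floor pos))
                      d (- pos i)
                      lo (sorted (dec i))
                      hi (sorted i)]
                  (+ lo (* d (- hi lo)))))))

(quantile-interpolated [1 10 10 30] 0.25) ;; => 3.25
(quantile-interpolated [1 10 10 30] 0.75) ;; => 25.0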
The plot below illustrates the differences between these estimation strategies for the sample vs = [1 10 10 30].

(def vs [1 10 10 30])
(stats/quantiles vs [0.1 0.25 0.5 0.75 0.9] :legacy) ;; => [1.0 3.25 10.0 25.0 30.0]
(stats/quantiles vs [0.1 0.25 0.5 0.75 0.9] :r1) ;; => [1.0 1.0 10.0 10.0 30.0]
(stats/quantiles vs [0.1 0.25 0.5 0.75 0.9] :r2) ;; => [1.0 5.5 10.0 20.0 30.0]
(stats/quantiles vs [0.1 0.25 0.5 0.75 0.9] :r3) ;; => [1.0 1.0 10.0 10.0 30.0]
(stats/quantiles vs [0.1 0.25 0.5 0.75 0.9] :r4) ;; => [1.0 1.0 10.0 10.0 22.0]
(stats/quantiles vs [0.1 0.25 0.5 0.75 0.9] :r5) ;; => [1.0 5.5 10.0 20.0 30.0]
(stats/quantiles vs [0.1 0.25 0.5 0.75 0.9] :r6) ;; => [1.0 3.25 10.0 25.0 30.0]
(stats/quantiles vs [0.1 0.25 0.5 0.75 0.9] :r7) ;; => [3.7 7.75 10.0 15.0 24.000000000000004]
(stats/quantiles vs [0.1 0.25 0.5 0.75 0.9] :r8) ;; => [1.0 4.749999999999998 10.0 21.66666666666667 30.0]
(stats/quantiles vs [0.1 0.25 0.5 0.75 0.9] :r9) ;; => [1.0 4.9375 10.0 21.25 30.0]
Weighted
There are also functions to calculate weighted quantiles and medians. These are useful when individual data points have different levels of importance or contribution.
- wquantile: Calculates the q-th weighted quantile for a sequence vs with corresponding weights.
- wquantiles: Calculates multiple weighted quantiles for a sequence vs with weights.
- wmedian: Calculates the weighted median (0.5 weighted quantile) for vs with weights.
All these functions accept an optional method keyword argument that specifies the interpolation strategy when a quantile falls between points in the weighted empirical cumulative distribution function (ECDF). The available methods are:
- :linear (default): Performs linear interpolation between the data values corresponding to the cumulative weights surrounding the target quantile.
- :step: Uses a step function (specifically, step-before interpolation) based on the weighted ECDF. The result is the data value whose cumulative weight range includes the target quantile.
- :average: Computes the average of the step-before and step-after interpolation methods. This can be useful when a quantile corresponds exactly to a cumulative weight boundary.
Let’s define a sample dataset and weights:
(def sample-data [10 15 30 50 100])
(def sample-weights [1 2 5 1 1])
(stats/wquantile sample-data sample-weights 0.25) ;; => 13.75
(stats/wquantile sample-data sample-weights 0.5) ;; => 21.0
(stats/wmedian sample-data sample-weights) ;; => 21.0
(stats/wquantile sample-data sample-weights 0.75) ;; => 28.5
(stats/wmedian sample-data sample-weights :linear) ;; => 21.0
(stats/wmedian sample-data sample-weights :step) ;; => 30.0
(stats/wmedian sample-data sample-weights :average) ;; => 22.5
(stats/wquantiles sample-data sample-weights [0.2 0.5 0.8]) ;; => [12.5 21.0 30.0]
(stats/wquantiles sample-data sample-weights [0.2 0.5 0.8] :step) ;; => [15.0 30.0 30.0]
(stats/wquantiles sample-data sample-weights [0.2 0.5 0.8] :average) ;; => [12.5 22.5 30.0]
Using the mpg data and wt (car weight) as weights:
(stats/wmedian mpg wt) ;; => 17.702906976744185
(stats/wquantile mpg wt 0.25) ;; => 14.894033613445378
(stats/wquantiles mpg wt [0.1 0.25 0.5 0.75 0.9] :average) ;; => [10.4 14.85 17.55 21.2 25.2]
When all weights are equal to 1.0, then:

- the :linear method is the same as the :r4 estimation strategy in quantiles
- :step is the same as :r1
- :average has no corresponding strategy
(stats/quantiles mpg [0.1 0.25 0.5 0.75 0.9] :r4) ;; => [13.5 15.2 19.2 22.8 29.78]
(stats/wquantiles mpg (repeat (count mpg) 1.0) [0.1 0.25 0.5 0.75 0.9]) ;; => [13.5 15.2 19.2 22.8 29.78]
(stats/quantiles mpg [0.1 0.25 0.5 0.75 0.9] :r1) ;; => [14.3 15.2 19.2 22.8 30.4]
(stats/wquantiles mpg (repeat (count mpg) 1.0) [0.1 0.25 0.5 0.75 0.9] :step) ;; => [14.3 15.2 19.2 22.8 30.4]
(stats/wquantiles mpg (repeat (count mpg) 1.0) [0.1 0.25 0.5 0.75 0.9] :average) ;; => [13.8 15.2 19.2 22.8 28.85]
Measures of Dispersion/Deviation
Statistics describing the spread or variability of data.
- variance, population-variance
- stddev, population-stddev
- wvariance, population-wvariance
- wstddev, population-wstddev
- pooled-variance, pooled-stddev
- variation, l-variation
- mean-absolute-deviation
- median-absolute-deviation, mad
- pooled-mad
- sem
Variance and standard deviation
Variance and standard deviation are fundamental measures of the dispersion or spread of a dataset around its mean.
Variance quantifies the average squared difference of each data point from the mean. A higher variance indicates that the data points are more spread out, while a lower variance indicates they are clustered closer to the mean.
Standard Deviation is the square root of the variance. It is expressed in the same units as the data, making it more interpretable than variance as a measure of spread.
- Sample Variance (variance) and Sample Standard Deviation (stddev): These are estimates of the population variance and standard deviation, calculated from a sample of data. They use a denominator of \(N-1\) (Bessel’s correction) to provide an unbiased estimate of the population variance.
\[s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}\]
\[s = \sqrt{s^2}\]
Both functions can optionally accept a pre-calculated mean (mu) as a second argument.
- Population Variance (population-variance) and Population Standard Deviation (population-stddev): These are used when the data represents the entire population of interest, or when a biased estimate (maximum likelihood estimate) from a sample is desired. They use a denominator of \(N\).
\[\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}\]
\[\sigma = \sqrt{\sigma^2}\]
These also accept an optional pre-calculated population mean (mu).
- Weighted Variance (wvariance, population-wvariance) and Weighted Standard Deviation (wstddev, population-wstddev): These calculate variance and standard deviation when each data point has an associated weight. For weighted sample variance (unbiased form, where \(w_i\) are weights):
\[\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i}\]
\[s_w^2 = \frac{\sum w_i (x_i - \bar{x}_w)^2}{(\sum w_i) - 1}\]
For weighted population variance:
\[\sigma_w^2 = \frac{\sum w_i (x_i - \bar{x}_w)^2}{\sum w_i}\]
Weighted standard deviations are the square roots of their respective variances.
- Pooled Variance (pooled-variance) and Pooled Standard Deviation (pooled-stddev): These are used to estimate a common variance when data comes from several groups that are assumed to have the same population variance. The following methods can be used (where \(k\) is the number of groups, each with \(n_i\) observations and sample variance \(s_i^2\)):
  - :unbiased (default): \[s_p^2 = \frac{\sum_{i=1}^{k} (n_i-1)s_i^2}{\sum_{i=1}^{k} n_i - k}\]
  - :biased: \[s_p^2 = \frac{\sum_{i=1}^{k} (n_i-1)s_i^2}{\sum_{i=1}^{k} n_i}\]
  - :avg - a simple average of the group variances: \[s_p^2 = \frac{\sum_{i=1}^{k} s_i^2}{k}\]
Pooled standard deviation is \(\sqrt{s_p^2}\).
(stats/variance mpg) ;; => 36.32410282258065
(stats/stddev mpg) ;; => 6.026948052089105
(stats/population-variance mpg) ;; => 35.188974609375
(stats/population-stddev mpg) ;; => 5.932029552301219
(stats/variance mpg (stats/mean mpg)) ;; => 36.32410282258065
(stats/population-variance mpg (stats/mean mpg)) ;; => 35.188974609375
Weighted variance and standard deviation
(stats/wvariance hp wt) ;; => 4477.350800154701
(stats/wstddev hp wt) ;; => 66.91300919966685
(stats/population-wvariance hp wt) ;; => 4433.861107869416
(stats/population-wstddev hp wt) ;; => 66.58724433305088
Pooled variance and standard deviation
(stats/pooled-variance [setosa-sepal-length virginica-sepal-length]) ;; => 0.264295918367347
(stats/pooled-stddev [setosa-sepal-length virginica-sepal-length]) ;; => 0.5140971876672221
(stats/pooled-variance [setosa-sepal-length virginica-sepal-length] :biased) ;; => 0.2590100000000001
(stats/pooled-variance [setosa-sepal-length virginica-sepal-length] :avg) ;; => 0.264295918367347
Beyond variance and standard deviation, we have three additional functions:
- Coefficient of Variation (variation): This is a standardized measure of dispersion, calculated as the ratio of the standard deviation \(s\) to the mean \(\bar{x}\).
\[CV = \frac{s}{\bar{x}}\]
The CV is unitless, making it useful for comparing the variability of datasets with different means or units. It’s most meaningful for data measured on a ratio scale (i.e., with a true zero point) and where all values are positive.
- Standard Error of the Mean (sem): The SEM estimates the standard deviation of the sample mean if you were to draw multiple samples from the same population. It indicates how precisely the sample mean estimates the true population mean.
\[SEM = \frac{s}{\sqrt{n}}\]
where \(s\) is the sample standard deviation and \(n\) is the sample size. A smaller SEM suggests a more precise estimate of the population mean.
- L-variation (l-variation): Calculates the coefficient of L-variation. This is a dimensionless measure of dispersion, analogous to the coefficient of variation.
\[\tau_2 = \lambda_2 / \lambda_1\]
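The CV and SEM formulas can be cross-checked directly (a sketch; compare with stats/variation and stats/sem below):

;; CV = s / mean, SEM = s / sqrt(n).
(/ (stats/stddev mpg) (stats/mean mpg)) ;; => 0.29998808160966145
(/ (stats/stddev mpg) (Math/sqrt (count mpg))) ;; => 1.0654239593728148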
(stats/variation mpg) ;; => 0.29998808160966145
(stats/variation residual-sugar) ;; => 0.7935735502339006
(stats/l-variation mpg) ;; => 0.1691378280874466
(stats/l-variation residual-sugar) ;; => 0.43403938821477966
(stats/sem mpg) ;; => 1.0654239593728148
(stats/sem residual-sugar) ;; => 0.07247276021182479
MAD
MAD typically refers to the Median Absolute Deviation, a robust measure of statistical dispersion. fastmath.stats also provides the Mean Absolute Deviation.
- Median Absolute Deviation (median-absolute-deviation or mad): This is a robust measure of the variability of a univariate sample. It is defined as the median of the absolute deviations from the data’s median.
\[MAD = \text{median}(|X_i - \text{median}(X)|)\]
If a specific center \(c\) is provided, it’s \(MAD_c = \text{median}(|X_i - c|)\). Different estimation strategies can also be used (see [median]). MAD is less sensitive to outliers than the standard deviation.
- Mean Absolute Deviation (mean-absolute-deviation): This measures variability as the average of the absolute deviations from a central point, typically the data’s mean.
\[MeanAD = \frac{1}{n} \sum_{i=1}^{n} |X_i - \text{mean}(X)|\]
If a specific center \(c\) is provided, it’s \(MeanAD_c = \frac{1}{n} \sum_{i=1}^{n} |X_i - c|\). MeanAD is more sensitive to outliers than MAD but less sensitive than the standard deviation.
- Pooled MAD (pooled-mad): This function calculates a pooled estimate of the Median Absolute Deviation when data comes from several groups. For each group \(i\), absolute deviations from its median \(M_i\) are calculated: \(Y_{ij} = |X_{ij} - M_i|\). The pooled MAD is then the median of all such \(Y_{ij}\) values, scaled by a constant const (which defaults to approximately 1.4826, to make it comparable to the standard deviation for normal data).
\[PooledMAD = \text{const} \cdot \text{median}(\{Y_{ij} \mid \text{for all groups } i \text{ and observations } j \text{ in group } i\})\]
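MAD can also be written out directly from its definition (a sketch; this should match (stats/mad mpg) below):

;; median of absolute deviations from the median.
(let [med (stats/median mpg)]
  (stats/median (map #(Math/abs (- (double %) med)) mpg)))
;; => same value as (stats/mad mpg)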
(stats/mad mpg) ;; => 3.6500000000000004
(stats/median-absolute-deviation mpg) ;; => 3.6500000000000004
(stats/median-absolute-deviation mpg (stats/median mpg) :r3) ;; => 3.6000000000000014
(stats/median-absolute-deviation mpg (stats/mean mpg)) ;; => 4.299999999999999
(stats/mean-absolute-deviation mpg) ;; => 4.714453125
(stats/mean-absolute-deviation mpg (stats/median mpg)) ;; => 4.634375
(stats/pooled-mad [setosa-sepal-length virginica-sepal-length]) ;; => 0.4447806655516804
(stats/pooled-mad [setosa-sepal-length virginica-sepal-length] 1.0) ;; => 0.2999999999999998
Moments and Shape
Moments and shape statistics describe the form of a dataset’s distribution, particularly its symmetry and peakedness.
- moment
- skewness
- kurtosis
- l-moment
Conventional Moments (moment)
The moment function calculates statistical moments of a dataset. Moments can be central (around the mean), raw (around zero), or around a specified center. They can also be absolute and/or normalized.
- k-th Central Moment: \(\mu_k = E[(X - \mu)^k] \approx \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^k\). Calculated when center is nil (default) and :mean? is true (default).
- k-th Raw Moment (about origin): \(\mu'_k = E[X^k] \approx \frac{1}{n} \sum_{i=1}^{n} x_i^k\). Calculated if center is 0.0.
- k-th Absolute Central Moment: \(E[|X - \mu|^k] \approx \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|^k\). Calculated if :absolute? is true.
- Normalization: If :normalize? is true, the moment is divided by \(\sigma^k\) (where \(\sigma\) is the standard deviation), yielding a scale-invariant measure. For example, the 3rd normalized central moment is related to skewness, and the 4th to kurtosis.
- Power of sum of differences: If :mean? is false, the function returns the sum \(\sum (x_i - c)^k\) (or the sum of absolute values) instead of the mean.
The order parameter specifies \(k\). For example, the 2nd central moment (\(k=2\)) is the population variance.
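The 2nd central moment written out directly (a sketch; it agrees with (stats/moment mpg 2) up to floating-point rounding):

;; mean of squared deviations = population variance.
(let [mu (stats/mean mpg)]
  (stats/mean (map #(Math/pow (- % mu) 2.0) mpg)))
;; => ~35.189, the population variance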
(stats/moment mpg 2) ;; => 35.188974609374995
(stats/variance mpg) ;; => 36.32410282258065
(stats/moment mpg 3) ;; => 133.68672198486328
(stats/moment mpg 4 {:normalize? true}) ;; => 2.6272339701791085
(stats/moment mpg 1 {:absolute? true, :center (stats/median mpg)}) ;; => 4.634375
Skewness
Skewness measures the asymmetry of a probability distribution about its mean. fastmath.stats/skewness offers several types:
Moment-based (sensitive to outliers):
- :G1 (default): Sample skewness based on the 3rd standardized moment, adjusted for sample bias (via Apache Commons Math).
\[G_1 = \frac{n}{(n-1)(n-2)} \sum_{i=1}^n \left(\frac{x_i - \bar{x}}{s}\right)^3\]
- :g1 or :pearson: Pearson’s moment coefficient of skewness, another bias-adjusted version of the 3rd standardized central moment \(m_3\).
\[g_1 = \frac{m_3}{m_2^{3/2}}\]
- :b1: Sample skewness coefficient, related to \(g_1\).
\[b_1 = \frac{m_3}{s^3}\]
- :skew: Skewness used in the BCa bootstrap.
\[SKEW = \frac{\sum_{i=1}^n (x_i - \bar{x})^3}{(\sum_{i=1}^n (x_i - \bar{x})^2)^{3/2}} = \frac{g_1}{\sqrt{n}}\]
Robust (less sensitive to outliers):
- :median: Median Skewness / Pearson’s second skewness coefficient.
\[S_P = 3 \frac{\text{mean} - \text{median}}{\text{stddev}}\]
- :mode: Mode Skewness / Pearson’s first skewness coefficient. The mode estimation method can be specified.
\[S_K = \frac{\text{mean} - \text{mode}}{\text{stddev}}\]
- :bowley or :yule (with \(u=0.25\)): Based on the quartiles \(Q_1, Q_2, Q_3\).
\[S_B = \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1} = \frac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1}\]
- :yule or :B1 (Yule’s coefficient): A generalization of Bowley’s, using quantiles \(Q_u, Q_{0.5}, Q_{1-u}\).
\[B_1 = S_Y(u) = \frac{(Q_{1-u} - Q_{0.5}) - (Q_{0.5} - Q_u)}{Q_{1-u} - Q_u}\]
- :B3: A robust measure by Groeneveld and Meeden.
\[B_3 = \frac{\text{mean} - \text{median}}{\text{mean}(|X_i - \text{median}|)}\]
- :hogg: Based on comparing trimmed means (\(U_{0.05}\): mean of top 5%, \(L_{0.05}\): mean of bottom 5%, \(M_{0.25}\): 25% trimmed mean).
\[S_H = \frac{U_{0.05} - M_{0.25}}{M_{0.25} - L_{0.05}}\]
- :l-skewness: L-moments based skewness.
\[\tau_3 = \lambda_3 / \lambda_2\]
Positive skewness indicates a tail on the right side of the distribution; negative skewness indicates a tail on the left. Zero indicates symmetry.
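Pearson’s median skewness can be cross-checked directly from its formula (a sketch; it should reproduce the :median result below):

;; 3 * (mean - median) / stddev.
(/ (* 3.0 (- (stats/mean residual-sugar) (stats/median residual-sugar)))
   (stats/stddev residual-sugar))
;; => ~0.7047, the :median value below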
(stats/skewness residual-sugar) ;; => 1.0770937564240939
(stats/skewness residual-sugar :G1) ;; => 1.0770937564240939
(stats/skewness residual-sugar :g1) ;; => 1.0767638711454521
(stats/skewness residual-sugar :pearson) ;; => 1.0767638711454521
(stats/skewness residual-sugar :b1) ;; => 1.07643413178962
(stats/skewness residual-sugar :skew) ;; => 0.01538548123095513
(stats/skewness residual-sugar :median) ;; => 0.7046931919610692
(stats/skewness residual-sugar :mode) ;; => 1.0235322790625094
(stats/skewness residual-sugar [:mode :histogram]) ;; => 0.9248159088272246
(stats/skewness residual-sugar [:mode :kde]) ;; => 0.9348108923665123
(stats/skewness residual-sugar :bowley) ;; => 0.1463414634146341
(stats/skewness residual-sugar :yule) ;; => 0.1463414634146341
(stats/skewness residual-sugar [:yule 0.1]) ;; => 0.37499999999999994
(stats/skewness residual-sugar :B3) ;; => 0.2864727410181956
(stats/skewness residual-sugar :hogg) ;; => 3.0343529984131266
(stats/skewness residual-sugar :l-skewness) ;; => 0.22296648073302056
The effect of an outlier is clearly visible for moment-based skewness, while it is much smaller when a robust method is used.
(stats/skewness (conj residual-sugar -1000)) ;; => -58.659786835804155
(stats/skewness (conj residual-sugar -1000) :l-skewness) ;; => 0.13914390517414776
Kurtosis
Kurtosis measures the “tailedness” or “peakedness” of a distribution. High kurtosis means heavy tails (more outliers) and a sharp peak (leptokurtic); low kurtosis means light tails and a flatter peak (platykurtic). fastmath.stats/kurtosis offers several types:
Moment-based (sensitive to outliers):
- :G2 (default): Sample excess kurtosis (Fisher’s definition), adjusted for sample bias (via Apache Commons Math). For a normal distribution, this is approximately 0.
\[G_2 = \frac{(n+1)n}{(n-1)(n-2)(n-3)} \sum_{i=1}^n \left(\frac{x_i - \bar{x}}{s}\right)^4 - 3\frac{(n-1)^2}{(n-2)(n-3)}\]
- :g2 or :excess: Sample excess kurtosis. For a normal distribution, this is approximately 0.
\[g_2 = \frac{m_4}{m_2^2}-3\]
- :kurt: Kurtosis defined as \(g_2 + 3\).
\[g_{kurt} = \frac{m_4}{m_2^2} = g_2 + 3\]
- :b2: Sample kurtosis.
\[b_2 = \frac{m_4}{s^4}-3\]
Robust (less sensitive to outliers):
- :geary: Geary’s ‘g’ measure of kurtosis. Normal \(\approx \sqrt{2/\pi} \approx 0.798\).
\[g = \frac{MeanAD}{\sigma}\]
- :moors: Based on octiles \(E_i\) (quantiles \(i/8\)) and centered by subtracting \(1.233\) (Moors’ constant for normality).
\[M_0 = \frac{(E_7-E_5) + (E_3-E_1)}{E_6-E_2}-1.233\]
- :crow (Crow-Siddiqui): Based on quantiles \(Q_\alpha, Q_{1-\alpha}, Q_\beta, Q_{1-\beta}\) and centered for normality (\(c\) is based on \(\alpha\) and \(\beta\)). By default \(\alpha=0.025\) and \(\beta=0.25\).
\[CS(\alpha, \beta) = \frac{Q_{1-\alpha} - Q_{\alpha}}{Q_{1-\beta} - Q_{\beta}}-c\]
- :hogg: Based on trimmed means \(U_p\) (mean of top \(p\%\)) and \(L_p\) (mean of bottom \(p\%\)), centered by subtracting \(2.585\). By default \(\alpha=0.005\) and \(\beta=0.5\).
\[K_H(\alpha, \beta) = \frac{U_{\alpha} - L_{\alpha}}{U_{\beta} - L_{\beta}}-2.585\]
- :l-kurtosis: L-moments based kurtosis.
\[\tau_4 = \lambda_4 / \lambda_2\]
(stats/kurtosis residual-sugar) ;; => 3.4698201025636317
(stats/kurtosis residual-sugar :G2) ;; => 3.4698201025636317
(stats/kurtosis residual-sugar :g1) ;; => 3.4698201025636317
(stats/kurtosis residual-sugar :excess) ;; => 3.4650542966048463
(stats/kurtosis residual-sugar :kurt) ;; => 6.465054296604846
(stats/kurtosis residual-sugar :b2) ;; => 3.4624146909177034
(stats/kurtosis residual-sugar :geary) ;; => 0.8336299967214688
(stats/kurtosis residual-sugar :moors) ;; => -0.35495121951219544
(stats/kurtosis residual-sugar :crow) ;; => -0.8936518297189449
(stats/kurtosis residual-sugar [:crow 0.05 0.25]) ;; => -0.6581758315571915
(stats/kurtosis residual-sugar :hogg) ;; => -0.5329102382273203
(stats/kurtosis residual-sugar [:hogg 0.025 0.45]) ;; => -0.5505736317691476
(stats/kurtosis residual-sugar :l-kurtosis) ;; => 0.02007386147996773
The effect of outliers is clearly visible for moment-based kurtosis, while it is much smaller when a robust method is used.
(stats/kurtosis (conj residual-sugar -1000 1000)) ;; => 2167.8435026710868
(stats/kurtosis (conj residual-sugar -1000 1000) :l-kurtosis) ;; => 0.1447670840462904
L-moment
L-moments are summary statistics analogous to conventional moments but are computed from linear combinations of order statistics (sorted data). They are more robust to outliers and provide better estimates for small samples compared to conventional moments.
- l-moment vs order: Calculates the L-moment of a specific order.
  - \(\lambda_1\): L-location (identical to the mean).
  - \(\lambda_2\): L-scale (a measure of dispersion).
  - Higher orders relate to shape.
- Trimmed L-moments (TL-moments) can be calculated by specifying :s (left trim) and :t (right trim) as the number of trimmed samples.
- L-moment Ratios: If :ratio? is true, normalized L-moments are returned.
  - \(\tau_3 = \lambda_3 / \lambda_2\): Coefficient of L-skewness (same as (stats/skewness vs :l-skewness)).
  - \(\tau_4 = \lambda_4 / \lambda_2\): Coefficient of L-kurtosis (same as (stats/kurtosis vs :l-kurtosis)).
L-moments often provide more reliable inferences about the underlying distribution shape, especially when data may contain outliers or come from heavy-tailed distributions.
(stats/l-moment mpg 1) ;; => 20.090624999999996
(stats/mean mpg) ;; => 20.090625
(stats/l-moment mpg 2) ;; => 3.3980846774193565
(stats/l-moment mpg 3) ;; => 0.534375
(stats/l-moment residual-sugar 3) ;; => 0.6185370660798765
(stats/l-moment residual-sugar 3 {:s 10}) ;; => 0.3017960335908669
(stats/l-moment residual-sugar 3 {:t 10}) ;; => 0.029252613288362147
(stats/l-moment residual-sugar 3 {:s 10, :t 10}) ;; => 0.003909359312292547
Relation to skewness and kurtosis
(stats/l-moment residual-sugar 3 {:ratio? true}) ;; => 0.22296648073302056
(stats/skewness residual-sugar :l-skewness) ;; => 0.22296648073302056
(stats/l-moment residual-sugar 4 {:ratio? true}) ;; => 0.02007386147996773
(stats/kurtosis residual-sugar :l-kurtosis) ;; => 0.02007386147996773
Intervals and Extents
This section provides functions to describe the spread or define specific ranges and intervals within a dataset.
- span, iqr
- extent, stddev-extent, mad-extent, sem-extent
- percentile-extent, quantile-extent
- pi, pi-extent
- hpdi-extent
- adjacent-values
- inner-fence-extent, outer-fence-extent
- percentile-bc-extent, percentile-bca-extent
- ci
- Basic Range: Functions like span (\(max - min\)) and extent (providing \([min, max]\) and optionally the mean) offer simple measures of the total spread of the data.
- Interquartile Range: iqr (\(Q_3 - Q_1\)) specifically measures the spread of the middle 50% of the data, providing a robust alternative to the total range.
- Symmetric Spread Intervals: Functions ending in -extent such as stddev-extent, mad-extent, and sem-extent define intervals typically centered around the mean or median. They represent a range defined by adding/subtracting a multiple (usually 1) of a measure of dispersion (Standard Deviation, Median Absolute Deviation, or Standard Error of the Mean) from the central point.
- Quantile-Based Intervals: percentile-extent, quantile-extent, pi, pi-extent, and hpdi-extent define intervals based on quantiles or percentiles of the data. These functions capture specific ranges containing a certain percentage of the data points (e.g., the middle 95% defined by quantiles 0.025 and 0.975). hpdi-extent calculates the shortest interval containing a given proportion of data, based on empirical density.
- Box Plot Boundaries: adjacent-values (LAV, UAV) and fence functions (inner-fence-extent, outer-fence-extent) calculate specific bounds based on quartiles and multiples of the IQR. These are primarily used in box plot visualization and as a conventional method for identifying potential outliers.
- Confidence and Prediction Intervals: ci, percentile-bc-extent, and percentile-bca-extent provide inferential intervals. ci estimates a confidence interval for the population mean using the t-distribution. percentile-bc-extent and percentile-bca-extent (Bias-Corrected and Bias-Corrected Accelerated) are advanced bootstrap methods for estimating confidence intervals for statistics, offering robustness against non-normality and bias.
Note that:
- \(IQR = Q_3-Q_1\)
- \(LIF=Q_1-1.5 \times IQR\)
- \(UIF=Q_3+1.5 \times IQR\)
- \(LOF=Q_1-3\times IQR\)
- \(UOF=Q_3+3\times IQR\)
- \(CI=\bar{x} \pm t_{\alpha/2, n-1} \frac{s}{\sqrt{n}}\)
Function | Returned value |
---|---|
span | \(max-min\) |
iqr | \(Q_3-Q_1\) |
extent | [min, max, mean] or [min, max] (when :mean? is false) |
stddev-extent | [mean - stddev, mean + stddev, mean] |
mad-extent | [median - mad, median + mad, median] |
sem-extent | [mean - sem, mean + sem, mean] |
percentile-extent | [p1-val, p2-val, median] with default p1=25 and p2=75 |
quantile-extent | [q1-val, q2-val, median] with default q1=0.25 and q2=0.75 |
pi | {p1 p1-val, p2 p2-val} defined by size=p2-p1 |
pi-extent | [p1-val, p2-val, median] defined by size=p2-p1 |
hpdi-extent | [p1-val, p2-val, median] defined by size=p2-p1 |
adjacent-values | [LAV, UAV, median] |
inner-fence-extent | [LIF, UIF, median] |
outer-fence-extent | [LOF, UOF, median] |
ci | [lower, upper, mean] |
percentile-bc-extent | [lower, upper, mean] |
percentile-bca-extent | [lower, upper, mean] |
(stats/span mpg) ;; => 23.5
(stats/iqr mpg) ;; => 7.525000000000002
(stats/extent mpg) ;; => #vec3 [10.4, 33.9, 20.090625]
(stats/extent mpg false) ;; => #vec2 [10.4, 33.9]
(stats/stddev-extent mpg) ;; => [14.063676947910896 26.117573052089103 20.090625]
(stats/mad-extent mpg) ;; => [15.549999999999999 22.85 19.2]
(stats/sem-extent mpg) ;; => [19.025201040627184 21.156048959372814 20.090625]
(stats/percentile-extent mpg) ;; => [15.274999999999999 22.8 19.2]
(stats/percentile-extent mpg 2.5 97.5) ;; => [10.4 33.9 19.2]
(stats/percentile-extent mpg 2.5 97.5 :r9) ;; => [10.4 33.628125 19.2]
(stats/quantile-extent mpg) ;; => [15.274999999999999 22.8 19.2]
(stats/quantile-extent mpg 0.025 0.975) ;; => [10.4 33.9 19.2]
(stats/quantile-extent mpg 0.025 0.975 :r9) ;; => [10.4 33.628125 19.2]
(stats/pi mpg 0.95) ;; => {2.5 10.4, 97.5 33.9}
(stats/pi-extent mpg 0.95) ;; => [10.4 33.9 19.2]
(stats/hpdi-extent mpg 0.95) ;; => [10.4 32.4 19.2]
(stats/adjacent-values mpg) ;; => [10.4 33.9 19.2]
(stats/inner-fence-extent mpg) ;; => [3.9874999999999954 34.087500000000006 19.2]
(stats/outer-fence-extent mpg) ;; => [-7.300000000000008 45.37500000000001 19.2]
(stats/ci mpg) ;; => [17.917678508746246 22.263571491253753 20.090625]
(stats/ci mpg 0.1) ;; => [18.284178665508097 21.8970713344919 20.090625]
(stats/percentile-bc-extent mpg) ;; => [10.4 33.9 20.090625]
(stats/percentile-bc-extent mpg 10.0) ;; => [14.85121570396848 32.66635783408668 20.090625]
(stats/percentile-bca-extent mpg) ;; => [10.4 33.9 20.090625]
(stats/percentile-bca-extent mpg 10.0) ;; => [14.79162798537463 32.44980004741413 20.090625]
Outlier Detection
Outlier detection involves identifying data points that are significantly different from other observations. Outliers can distort statistical analyses and require careful handling. fastmath.stats provides functions to find and optionally remove such values based on the Interquartile Range (IQR) method.
- outliers

The outliers function uses the inner fence rule based on the IQR and returns a sequence containing only the data points identified as outliers:
- Lower Inner Fence (LIF): \(Q_1 - 1.5 \times IQR\)
- Upper Inner Fence (UIF): \(Q_3 + 1.5 \times IQR\)
Where \(Q_1\) is the first quartile (25th percentile) and \(Q_3\) is the third quartile (75th percentile). Points falling below the LIF or above the UIF are considered outliers.
The function accepts an optional estimation-strategy keyword (see [quantile]) to control how quartiles are calculated, which affects the fence boundaries.
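The fence rule itself is straightforward to write out (a sketch using the default quartile estimation; it should flag the same points as stats/outliers):

;; Inner-fence outlier rule, written out (sketch).
(let [q1 (stats/quantile residual-sugar 0.25)
      q3 (stats/quantile residual-sugar 0.75)
      iqr (- q3 q1)
      lif (- q1 (* 1.5 iqr))
      uif (+ q3 (* 1.5 iqr))]
  (filter #(or (< % lif) (> % uif)) residual-sugar))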
Let’s find the outliers in the residual-sugar data.
(stats/outliers residual-sugar)
(23.5 31.6 31.6 65.8 26.05 26.05 22.6)
Data Transformation
Functions to modify data (scaling, normalizing, transforming).
- standardize, robust-standardize, demean
- rescale
- remove-outliers
- trim, trim-lower, trim-upper, winsor
- box-cox-infer-lambda, box-cox-transformation
- yeo-johnson-infer-lambda, yeo-johnson-transformation
Data transformations are often necessary preprocessing steps in statistical analysis and machine learning. They can help meet the assumptions of certain models (e.g., normality, constant variance), improve interpretability, or reduce the influence of outliers. fastmath.stats offers several functions for these purposes, broadly categorized into linear scaling/centering, outlier handling, and power transformations for normality.
Let’s demonstrate some of these transformations using the residual-sugar data from the wine quality dataset.
(stats/mean residual-sugar) ;; => 6.391414863209474
(stats/stddev residual-sugar) ;; => 5.072057784014863
(stats/median residual-sugar) ;; => 5.2
(stats/mad residual-sugar) ;; => 3.6
(stats/extent residual-sugar false) ;; => #vec2 [0.6, 65.8]
(count residual-sugar) ;; => 4898
Linear Transformations: standardize, robust-standardize, demean, and rescale linearly transform data, preserving its shape but changing its location and/or scale.
- demean centers the data by subtracting the mean, resulting in a dataset with a mean of zero.
- standardize scales the demeaned data by dividing by the standard deviation, resulting in data with mean zero and standard deviation one (z-score normalization). This makes the scale of different features comparable.
- robust-standardize provides a version less sensitive to outliers by centering around the median and scaling by the Median Absolute Deviation (MAD) or a quantile range (like the IQR).
- rescale linearly maps the data to a specific target range (e.g., [0, 1]), useful for algorithms sensitive to input scale.
(def residual-sugar-demeaned (-> residual-sugar stats/demean))
(def residual-sugar-standardized (-> residual-sugar stats/standardize))
(def residual-sugar-robust-standardized (-> residual-sugar stats/robust-standardize))
(def residual-sugar-rescaled (-> residual-sugar stats/rescale))
(stats/mean residual-sugar-demeaned) ;; => -3.2386416038881656E-16
(stats/stddev residual-sugar-demeaned) ;; => 5.072057784014863
(stats/median residual-sugar-demeaned) ;; => -1.1914148632094737
(stats/mad residual-sugar-demeaned) ;; => 3.5999999999999996
(stats/extent residual-sugar-demeaned false) ;; => #vec2 [-5.791414863209474, 59.40858513679052]
(stats/mean residual-sugar-standardized) ;; => -1.5386267642212256E-16
(stats/stddev residual-sugar-standardized) ;; => 1.0000000000000004
(stats/median residual-sugar-standardized) ;; => -0.23489773065368974
(stats/mad residual-sugar-standardized) ;; => 0.7097710935679375
(stats/extent residual-sugar-standardized false) ;; => #vec2 [-1.1418274613238324, 11.712915677739927]
(stats/mean residual-sugar-robust-standardized) ;; => 0.3309485731137426
(stats/stddev residual-sugar-robust-standardized) ;; => 1.4089049400041327
(stats/median residual-sugar-robust-standardized) ;; => 0.0
(stats/mad residual-sugar-robust-standardized) ;; => 1.0
(stats/extent residual-sugar-robust-standardized false) ;; => #vec2 [-1.277777777777778, 16.833333333333332]
(stats/mean residual-sugar-rescaled) ;; => 0.0888253813375686
(stats/stddev residual-sugar-rescaled) ;; => 0.07779229730084207
(stats/median residual-sugar-rescaled) ;; => 0.07055214723926381
(stats/mad residual-sugar-rescaled) ;; => 0.055214723926380375
(stats/extent residual-sugar-rescaled false) ;; => #vec2 [0.0, 1.0]
Outlier Handling: remove-outliers, trim, trim-lower, trim-upper, and winsor address outliers based on quantile fences.
- remove-outliers returns a sequence containing the data points from the original sequence, excluding those identified as outliers.
- trim removes values outside a specified quantile range (defaulting to the 0.2 quantile, removing the bottom and top 20%).
- trim-lower and trim-upper remove only below or above a single quantile.
- winsor caps values outside a quantile range to the boundary values instead of removing them. This retains the sample size but reduces the influence of extreme values (a minimal sketch follows the list).
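Winsorizing written out directly (a sketch; the clamp bounds use stats/quantile, whose defaults may differ slightly from stats/winsor):

;; Clamp values to the [q, 1-q] quantile bounds.
(defn winsorize [vs q]
  (let [lo (stats/quantile vs q)
        hi (stats/quantile vs (- 1.0 q))]
    (map #(-> % (max lo) (min hi)) vs)))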
(def residual-sugar-no-outliers (stats/remove-outliers residual-sugar))
(def residual-sugar-trimmed (stats/trim residual-sugar))
(def residual-sugar-winsorized (stats/winsor residual-sugar))
(stats/mean residual-sugar-no-outliers) ;; => 6.354109589041096
(stats/stddev residual-sugar-no-outliers) ;; => 4.950545246813552
(stats/median residual-sugar-no-outliers) ;; => 5.2
(stats/mad residual-sugar-no-outliers) ;; => 3.6
(stats/extent residual-sugar-no-outliers false) ;; => #vec2 [0.6, 22.0]
(count residual-sugar-no-outliers) ;; => 4891
(stats/mean residual-sugar-trimmed) ;; => 5.27940414507772
(stats/stddev residual-sugar-trimmed) ;; => 2.9594599192772546
(stats/median residual-sugar-trimmed) ;; => 5.0
(stats/mad residual-sugar-trimmed) ;; => 2.7
(stats/extent residual-sugar-trimmed false) ;; => #vec2 [1.5, 11.2]
(count residual-sugar-trimmed) ;; => 3088
(stats/mean residual-sugar-winsorized) ;; => 5.789893834218048
(stats/stddev residual-sugar-winsorized) ;; => 3.8241868005472104
(stats/median residual-sugar-winsorized) ;; => 5.2
(stats/mad residual-sugar-winsorized) ;; => 3.6
(stats/quantiles residual-sugar [0.2 0.8]) ;; => [1.5 11.2]
(stats/extent residual-sugar-winsorized false) ;; => #vec2 [1.5, 11.2]
(count residual-sugar-winsorized) ;; => 4898
Power Transformations: box-cox-transformation and yeo-johnson-transformation (and their infer-lambda counterparts) are non-linear transformations that can change the shape of the distribution to be more symmetric or normally distributed. They are particularly useful for data that is skewed or violates assumptions of linear models. Both are invertible.
- box-cox-transformation works in general for strictly positive data. It includes the log transformation as a special case (when lambda is \(0.0\)) and generalizes square root, reciprocal, and other power transformations. box-cox-infer-lambda helps find the optimal lambda parameter. Optional parameters:
  - :negative? (default: false): when set to true, a specific transformation is performed to keep information about the sign.
  - :scaled? (default: false): when set to true, data is scaled by the geometric mean; when it is a number, this number is used as the scale.
\[y_{BC}^{(\lambda)}=\begin{cases} \frac{y^\lambda-1}{\lambda} & \lambda\neq 0 \\ \log(y) & \lambda = 0 \end{cases}\]
Scaled version, with default scale set to geometric mean (GM):
\[y_{BC}^{(\lambda, s)}=\begin{cases} \frac{y^\lambda-1}{\lambda s^{\lambda - 1}} & \lambda\neq 0 \\ s\log(y) & \lambda = 0 \end{cases}\]
When :negative? is set to true, the formula takes the following form:
\[y_{BCneg}^{(\lambda)}=\begin{cases} \frac{\operatorname{sgn}(y)|y|^\lambda-1}{\lambda} & \lambda\neq 0 \\ \operatorname{sgn}(y)\log(|y|+1) & \lambda = 0 \end{cases}\]
- yeo-johnson-transformation extends Box-Cox to handle zero and negative values. yeo-johnson-infer-lambda finds the optimal lambda for this transformation.
\[y_{YJ}^{(\lambda)}=\begin{cases} \frac{(y+1)^\lambda - 1}{\lambda} & \lambda \neq 0, y\geq 0 \\ \log(y+1) & \lambda = 0, y\geq 0 \\ \frac{(1-y)^{2-\lambda} - 1}{\lambda - 2} & \lambda \neq 2, y < 0 \\ -\log(1-y) & \lambda = 2, y < 0 \end{cases}\]
Both functions accept additional parameters:
- :alpha (default: 0.0): shift the dataset by the value of :alpha before the transformation.
- :inverse? (default: false): perform the inverse transformation for a given lambda.
When lambda is set to nil, the optimal lambda will be calculated (only when :inverse? is false).
(stats/box-cox-transformation [0 1 10] 0.0) ;; => (##-Inf 0.0 2.302585092994046)
(stats/box-cox-transformation [0 1 10] 2.0) ;; => (-0.5 0.0 49.5)
(stats/box-cox-transformation [0 1 10] -2.0 {:alpha 2}) ;; => (0.375 0.4444444444444444 0.4965277777777778)
(stats/box-cox-transformation [0.375 0.444 0.497] -2.0 {:alpha 2, :inverse? true}) ;; => (0.0 0.9880715233359845 10.90994448735805)
(stats/box-cox-transformation [0 1 10] nil {:alpha 1}) ;; => (-0.0 0.5989997131047903 1.493287747539177)
(stats/box-cox-transformation [0 1 10] nil {:scaled? true, :alpha 1}) ;; => (-0.0 2.619403814271841 6.530092646346955)
(stats/box-cox-transformation [0 1 10] nil {:alpha -5, :negative? true}) ;; => (-42.0 -21.666666666666668 41.333333333333336)
(stats/box-cox-transformation [0 1 10] 2.0 {:alpha -5, :negative? true, :scaled? 2}) ;; => (-6.5 -4.25 6.0)
(stats/box-cox-transformation [-6.5 -4.25 6.0] 2.0 {:alpha -5, :negative? true, :scaled? 2, :inverse? true}) ;; => (0.0 1.0 10.0)
(stats/box-cox-transformation [0 1 10]) ;; => (-0.0 0.5989997131047903 1.493287747539177)
(stats/yeo-johnson-transformation [0 1 10] 0.0) ;; => (0.0 0.6931471805599453 2.3978952727983707)
(stats/yeo-johnson-transformation [0 1 10] 2.0 {:alpha -5}) ;; => (-1.791759469228055 -1.6094379124341003 17.5)
(stats/yeo-johnson-transformation [-1.79 -1.61 17.5] 2.0 {:alpha -5, :inverse? true}) ;; => (0.010547533616885651 0.9971887721664103 10.0)
Let’s illustrate how real data look after transformation. We’ll start by finding an optimal lambda parameter for both transformations.
(stats/box-cox-infer-lambda residual-sugar)
0.12450565747077313
(stats/yeo-johnson-infer-lambda residual-sugar)
-0.004232775028107413

(def residual-sugar-box-cox (stats/box-cox-transformation residual-sugar nil))
(def residual-sugar-yeo-johnson (stats/yeo-johnson-transformation residual-sugar nil))
(stats/mean residual-sugar-box-cox)   ;; => 1.6895596029306466
(stats/stddev residual-sugar-box-cox) ;; => 1.1032589652895741
(stats/median residual-sugar-box-cox) ;; => 1.83006349249072
(stats/mad residual-sugar-box-cox)    ;; => 1.0527777588806666
(stats/extent residual-sugar-box-cox false) ;; => #vec2 [-0.4949201739139414, 5.494888687185149]

(stats/mean residual-sugar-yeo-johnson)   ;; => 1.7445922710465025
(stats/stddev residual-sugar-yeo-johnson) ;; => 0.7180340465417893
(stats/median residual-sugar-yeo-johnson) ;; => 1.8175219821484885
(stats/mad residual-sugar-yeo-johnson)    ;; => 0.6888246769156978
(stats/extent residual-sugar-yeo-johnson false) ;; => #vec2 [0.46953642189900135, 4.164560241279393]
As you can see, the Yeo-Johnson transformation with the inferred lambda has made the residual-sugar
distribution appear more symmetric and perhaps closer to a normal distribution shape.
Both power transformations can work on negative data as well. When Box-Cox is used, the :negative? option should be set to true.
(stats/box-cox-infer-lambda residual-sugar nil {:alpha (- (stats/mean residual-sugar)) :negative? true})
1.0967709597378346
(stats/yeo-johnson-infer-lambda residual-sugar nil {:alpha (- (stats/mean residual-sugar))})
0.7290829398083033

(def residual-sugar-box-cox-demeaned
  (stats/box-cox-transformation residual-sugar nil {:alpha (- (stats/mean residual-sugar)) :negative? true}))
(def residual-sugar-yeo-johnson-demeaned
  (stats/yeo-johnson-transformation residual-sugar nil {:alpha (- (stats/mean residual-sugar))}))
(stats/mean residual-sugar-box-cox-demeaned)   ;; => -0.817408751945805
(stats/stddev residual-sugar-box-cox-demeaned) ;; => 5.608979218510315
(stats/median residual-sugar-box-cox-demeaned) ;; => -2.0166286943919243
(stats/mad residual-sugar-box-cox-demeaned)    ;; => 3.9790383049427582
(stats/extent residual-sugar-box-cox-demeaned false) ;; => #vec2 [-7.1704674516316125, 79.51310067615422]

(stats/mean residual-sugar-yeo-johnson-demeaned)   ;; => -1.3556628530573829
(stats/stddev residual-sugar-yeo-johnson-demeaned) ;; => 4.783153826599674
(stats/median residual-sugar-yeo-johnson-demeaned) ;; => -1.3457965294079925
(stats/mad residual-sugar-yeo-johnson-demeaned)    ;; => 4.669926884717297
(stats/extent residual-sugar-yeo-johnson-demeaned false) ;; => #vec2 [-8.192317681811963, 25.905106824581896]
Correlation and Covariance
Measures of the relationship between two or more variables.
covariance, correlation
pearson-correlation, spearman-correlation, kendall-correlation
coefficient-matrix, correlation-matrix, covariance-matrix
Covariance vs. Correlation:
covariance measures the extent to which two variables change together. A positive covariance means they tend to increase or decrease simultaneously. A negative covariance means one tends to increase when the other decreases. A covariance near zero suggests no linear relationship. The magnitude of covariance depends on the scales of the variables, making it difficult to compare covariances between different pairs of variables. The sample covariance between two sequences \(X = \{x_1, \dots, x_n\}\) and \(Y = \{y_1, \dots, y_n\}\) is calculated as:
\[ \text{Cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) \]
where \(\bar{x}\) and \(\bar{y}\) are the sample means.
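As a small illustration of this formula, the sample covariance can be computed by hand on toy vectors and compared with stats/covariance (a sketch, toy data only):

(let [xs [1.0 2.0 3.0 4.0]
      ys [2.0 4.0 5.0 7.0]
      mx (stats/mean xs)
      my (stats/mean ys)]
  [(/ (reduce + (map (fn [x y] (* (- x mx) (- y my))) xs ys))
      (dec (count xs)))        ;; => 2.6666666666666665
   (stats/covariance xs ys)])  ;; expected to return the same value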
correlation standardizes the covariance, resulting in a unitless measure that ranges from -1 to +1. It indicates both the direction and strength of a relationship. A correlation of +1 indicates a perfect positive relationship, -1 a perfect negative relationship, and 0 no linear relationship. The correlation function in fastmath.stats defaults to computing the Pearson correlation coefficient.
Types of Correlation:
- pearson-correlation: The most common correlation coefficient, also known as the Pearson product-moment correlation coefficient (\(r\)). It measures the strength and direction of a linear relationship between two continuous variables. It assumes the variables are approximately normally distributed and that the relationship is linear. It is sensitive to outliers. The formula for the sample Pearson correlation coefficient is:
\[r = \frac{\text{Cov}(X, Y)}{s_x s_y} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2}}\]
where \(s_x\) and \(s_y\) are the sample standard deviations.
- spearman-correlation: Spearman’s rank correlation coefficient (\(\rho\)) is a non-parametric measure of the strength and direction of a monotonic relationship between two variables. A monotonic relationship is one that is either consistently increasing or consistently decreasing, but not necessarily linear. Spearman’s correlation is calculated by applying the Pearson formula to the ranks of the data values rather than the values themselves. This makes it less sensitive to outliers than Pearson correlation and suitable for ordinal data or when the relationship is monotonic but non-linear.
- kendall-correlation: Kendall’s Tau rank correlation coefficient (\(\tau\)) is another non-parametric measure of the strength and direction of a monotonic relationship. It is based on the number of concordant and discordant pairs of observations. A pair of data points is concordant if their values move in the same direction (both increase or both decrease) relative to each other, and discordant if they move in opposite directions. Kendall’s Tau is generally preferred over Spearman’s Rho for smaller sample sizes or when there are many tied ranks. One common formulation, Kendall’s Tau-a, is:
\[\tau_A = \frac{N_c - N_d}{n(n-1)/2}\]
where \(N_c\) is the number of concordant pairs and \(N_d\) is the number of discordant pairs.
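The sketch below counts concordant and discordant pairs by hand on a tiny example; the resulting \(1/3\) is what kendall-correlation is expected to return for this data:

(let [xs [1 2 3]
      ys [1 3 2]
      n  (count xs)
      signs (for [i (range n) j (range (inc i) n)]
              (* (compare (xs j) (xs i)) (compare (ys j) (ys i))))
      nc (count (filter pos? signs))
      nd (count (filter neg? signs))]
  [(/ (- nc nd) (/ (* n (dec n)) 2))   ;; => 1/3
   (stats/kendall-correlation xs ys)]) ;; expected ≈ 0.333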
Comparison of Correlation Methods:
- Use Pearson for measuring linear relationships between continuous, normally distributed variables.
- Use Spearman or Kendall for measuring monotonic relationships (linear or non-linear) between variables, especially when data is not normally distributed, contains outliers, or is ordinal. Kendall is often more robust with ties and smaller samples.
Matrix Functions for Multiple Variables:
- coefficient-matrix: A generic function that computes a specified pairwise measure (defined by a function passed as an argument) between all pairs of sequences in a collection. Useful for generating matrices of custom similarity, distance, or correlation measures.
- covariance-matrix: A specialization that computes the pairwise covariance for all sequences in a collection. The output is a symmetric matrix where the element at row i, column j is the covariance between sequence i and sequence j. The diagonal elements are the variances of the individual sequences.
- correlation-matrix: A specialization that computes the pairwise correlation (Pearson by default, or specified via keyword like :spearman or :kendall) for all sequences in a collection. The output is a symmetric matrix where the element at row i, column j is the correlation between sequence i and sequence j. The diagonal elements are always 1.0 (a variable is perfectly correlated with itself).
Let’s examine the correlations between the numerical features in the iris
dataset.
(stats/covariance virginica-sepal-length setosa-sepal-length)           ;; => 0.03007346938775512
(stats/correlation virginica-sepal-length setosa-sepal-length)          ;; => 0.13417210385493564
(stats/pearson-correlation virginica-sepal-length setosa-sepal-length)  ;; => 0.1341721038549354
(stats/spearman-correlation virginica-sepal-length setosa-sepal-length) ;; => 0.038837958926489176
(stats/kendall-correlation virginica-sepal-length setosa-sepal-length)  ;; => 0.030531668042830747
To generate matrices we’ll use the three sepal-length samples. The last two examples use a custom measure function: Euclidean distance between samples and Glass’ delta.
(stats/covariance-matrix (vals sepal-lengths)) ;; => ([0.12424897959183674 -0.014710204081632658 0.03007346938775512] [-0.014710204081632658 0.2664326530612246 -0.04649795918367346] [0.03007346938775512 -0.04649795918367346 0.4043428571428573])
(stats/correlation-matrix (vals sepal-lengths)) ;; => ([1.0 -0.08084972701756978 0.13417210385493492] [-0.08084972701756978 0.9999999999999999 -0.14166588513698952] [0.13417210385493492 -0.14166588513698952 1.0])
(stats/correlation-matrix (vals sepal-lengths) :kendall) ;; => ([1.0 -0.06357129445203882 0.030531668042830747] [-0.06357129445203882 1.0 -0.10454171307909799] [0.030531668042830747 -0.10454171307909799 1.0])
(stats/correlation-matrix (vals sepal-lengths) :spearman) ;; => ([1.0 -0.10163684956357029 0.03883795892648921] [-0.10163684956357029 1.0 -0.14067854670792204] [0.03883795892648921 -0.14067854670792204 1.0])
(stats/coefficient-matrix (vals sepal-lengths) stats/L2 true) ;; => ([0.0 7.989367934949548 12.16922347563722] [7.989367934949548 0.0 7.660287200882224] [12.16922347563722 7.660287200882224 0.0])
(stats/coefficient-matrix (vals sepal-lengths) stats/glass-delta) ;; => ([0.0 -1.8017279836157427 -2.4878923883271122] [2.6383750609896146 0.0 -1.0253513509413892] [4.488074566113517 1.263146930448887 0.0])
Distance and Similarity Metrics
Measures of distance, error, or similarity between sequences or distributions.
me, mae, mape
rss, mse, rmse
r2
count=, L0, L1, L2sq, L2, LInf
psnr
dissimilarity, similarity
Distance metrics quantify how far apart or different two data sequences or probability distributions are. Similarity metrics, conversely, measure how close or alike they are, often being the inverse or a transformation of a distance. fastmath.stats
provides a range of these measures suitable for comparing numerical sequences, observed counts (histograms), or theoretical probability distributions.
Error Metrics
These functions typically quantify the difference between an observed sequence and a predicted or reference sequence, focusing on the magnitude of errors. All can accept a constant as a second argument.
- me (Mean Error): The average of the differences between corresponding elements. \[ ME = \frac{1}{n} \sum_{i=1}^n (x_i - y_i) \]
- mae (Mean Absolute Error): The average of the absolute differences. More robust to outliers than squared error. \[ MAE = \frac{1}{n} \sum_{i=1}^n |x_i - y_i| \]
- mape (Mean Absolute Percentage Error): The average of the absolute relative errors, often quoted as a percentage. Undefined if the reference value \(x_i\) is zero; note that fastmath returns the raw fraction (without the \(\times 100\%\) scaling), as the examples below show. \[ MAPE = \frac{1}{n} \sum_{i=1}^n \left| \frac{x_i - y_i}{x_i} \right| \]
- rss (Residual Sum of Squares): The sum of the squared differences. Used in least squares regression. \[ RSS = \sum_{i=1}^n (x_i - y_i)^2 \]
- mse (Mean Squared Error): The average of the squared differences. Penalizes larger errors more heavily. \[ MSE = \frac{1}{n} \sum_{i=1}^n (x_i - y_i)^2 \]
- rmse (Root Mean Squared Error): The square root of the MSE. Has the same units as the original data. \[ RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - y_i)^2} \]
- r2 (Coefficient of Determination): Measures the proportion of the variance in the dependent variable that is predictable from the independent variable(s). Calculated as \(1 - (RSS / TSS)\), where TSS is the Total Sum of Squares. It ranges from 0 to 1 for linear regression. \[ R^2 = 1 - \frac{\sum (x_i - y_i)^2}{\sum (x_i - \bar{x})^2} \]
- adjusted r2: A modified version of \(R^2\) that has been adjusted for the number of predictors in the model. It increases only if the new term improves the model more than would be expected by chance. \[ R^2_{adj} = 1 - (1 - R^2) \frac{n-1}{n-p-1} \]
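Before turning to the iris samples, a minimal sketch on toy vectors shows the relationships between these measures (values computed by hand from the formulas above; fastmath.core aliased as m, as elsewhere in this notebook):

(let [obs  [1.0 2.0 3.0]
      pred [1.5 2.0 2.0]]
  [(stats/mse obs pred)             ;; => 0.4166666666666667 ((0.25 + 0.0 + 1.0) / 3)
   (stats/rmse obs pred)            ;; => 0.6454972243679028
   (m/sqrt (stats/mse obs pred))])  ;; rmse is just the square root of mse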
Let’s use setosa-sepal-length
as observed and virginica-sepal-length
as predicted (though they are independent samples, not predictions) to illustrate error measures.
(stats/me setosa-sepal-length virginica-sepal-length)   ;; => -1.582
(stats/mae setosa-sepal-length virginica-sepal-length)  ;; => 1.582
(stats/mape setosa-sepal-length virginica-sepal-length) ;; => 0.3211321155719454
(stats/rss setosa-sepal-length virginica-sepal-length)  ;; => 148.09000000000006
(stats/mse setosa-sepal-length virginica-sepal-length)  ;; => 2.9618
(stats/rmse setosa-sepal-length virginica-sepal-length) ;; => 1.720988088279521
(stats/r2 setosa-sepal-length virginica-sepal-length)   ;; => -23.324102361946068
(stats/r2 setosa-sepal-length virginica-sepal-length 2) ;; => -24.359170547560794
(stats/r2 setosa-sepal-length virginica-sepal-length 5) ;; => -26.0882049030763
We can also compare an observed sequence to a constant value, for example the mean of the virginica sepal length.
(def vsl-mean (stats/mean virginica-sepal-length))

(stats/me setosa-sepal-length vsl-mean)   ;; => -1.582
(stats/mae setosa-sepal-length vsl-mean)  ;; => 1.582
(stats/mape setosa-sepal-length vsl-mean) ;; => 0.32244506843926823
(stats/rss setosa-sepal-length vsl-mean)  ;; => 131.2244000000001
(stats/mse setosa-sepal-length vsl-mean)  ;; => 2.6244880000000004
(stats/rmse setosa-sepal-length vsl-mean) ;; => 1.6200271602661482
(stats/r2 setosa-sepal-length vsl-mean)   ;; => -20.55389113366842
(stats/r2 setosa-sepal-length vsl-mean 2) ;; => -21.471077990420266
(stats/r2 setosa-sepal-length vsl-mean 5) ;; => -23.003196944312556
Distance Metrics (L-p Norms and others)
These functions represent common distance measures, often related to L-p norms between vectors (sequences).
- count=, L0: Counts the number of elements that are equal in both sequences. While related to the L0 “norm” (which counts non-zero elements), this implementation counts equal elements after subtraction. \[ \operatorname{count}_= = \sum_{i=1}^n \mathbb{I}(x_i = y_i) \]
- L1 (Manhattan/City Block Distance): The sum of the absolute differences. \[ L_1 = \sum_{i=1}^n |x_i - y_i| \]
- L2sq (Squared Euclidean Distance): The sum of the squared differences. Equivalent to rss. \[ L_2^2 = \sum_{i=1}^n (x_i - y_i)^2 \]
- L2 (Euclidean Distance): The square root of the sum of the squared differences. The most common distance metric. \[ L_2 = \sqrt{\sum_{i=1}^n (x_i - y_i)^2} \]
- LInf (Chebyshev Distance): The maximum absolute difference between corresponding elements. \[ L_\infty = \max_{i} |x_i - y_i| \]
- psnr (Peak Signal-to-Noise Ratio): A measure of signal quality often used in image processing, derived from the MSE. Higher PSNR indicates better quality (less distortion). Calculated based on the maximum possible value of the data and the MSE. \[ PSNR = 10 \cdot \log_{10} \left( \frac{MAX^2}{MSE} \right) \]
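A tiny sketch of these norms on toy vectors (the values follow directly from the definitions above):

(let [xs [1.0 2.0 3.0]
      ys [2.0 2.0 5.0]]
  [(stats/L1 xs ys)       ;; => 3.0 (|1-2| + |2-2| + |3-5|)
   (stats/L2sq xs ys)     ;; => 5.0 (1 + 0 + 4)
   (stats/LInf xs ys)     ;; => 2.0 (largest absolute difference)
   (stats/count= xs ys)]) ;; => 1 (only the middle elements match)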
Using the sepal length samples again:
(stats/count= setosa-sepal-length virginica-sepal-length) ;; => 1
(stats/L0 setosa-sepal-length virginica-sepal-length)     ;; => 1
(stats/L1 setosa-sepal-length virginica-sepal-length)     ;; => 79.10000000000005
(stats/L2sq setosa-sepal-length virginica-sepal-length)   ;; => 148.09000000000006
(stats/L2 setosa-sepal-length virginica-sepal-length)     ;; => 12.16922347563722
(stats/LInf setosa-sepal-length virginica-sepal-length)   ;; => 3.1000000000000005
(stats/psnr setosa-sepal-length virginica-sepal-length)   ;; => 13.236984537937193
Dissimilarity and Similarity
dissimilarity and similarity functions provide measures for comparing probability distributions or frequency counts (like histograms). They quantify how ‘far apart’ or ‘alike’ two data sequences, interpreted as distributions, are. They take a method keyword specifying the desired measure. Many methods exist, each with different properties and interpretations. They can accept raw data sequences, automatically creating histograms for comparison (controlled by :bins), or they can take pre-calculated frequency sequences, or a data sequence and a fastmath.random distribution object.
Parameters:
- method - The specific distance or similarity method to use.
- P-observed - Frequencies, probabilities, or raw data (when Q-expected is a distribution or :bins is set).
- Q-expected - Frequencies, probabilities, or a distribution object (when P-observed is raw data or :bins is set).
- opts (map, optional) - Configuration options, including:
  - :probabilities? (boolean, default: true): If true, input sequences are normalized to sum to 1.0 before calculating the measure, treating them as probability distributions.
  - :epsilon (double, default: 1.0e-6): A small number used to replace zero values in denominators or logarithms to avoid division-by-zero or log-of-zero errors.
  - :log-base (double, default: m/E): The base for logarithms in information-theoretic measures.
  - :power (double, default: 2.0): The exponent for the :minkowski distance.
  - :remove-zeros? (boolean, default: false): Removes pairs where both P and Q are zero before calculation.
  - :bins (number, keyword, or seq): Used for comparisons involving raw data or distributions. Specifies the number of histogram bins, an estimation method (see [histogram]), or explicit bin edges for histogram creation.
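For example, the :probabilities? option controls whether raw counts are normalized before comparison (a sketch; the second value is the plain Euclidean distance between the count vectors, \(\sqrt{4+4+0}\)):

;; default: counts are normalized to probabilities first
(stats/dissimilarity :euclidean [10 20 30] [12 18 30])                          ;; expected ≈ 0.0471
(stats/dissimilarity :euclidean [10 20 30] [12 18 30] {:probabilities? false})  ;; expected => 2.8284271247461903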
Dissimilarity Methods
Higher values generally indicate greater difference.
- L-p Norms and Related:
  - :euclidean: Euclidean distance (\(L_2\) norm) between the frequency/probability vectors. \[ D(P, Q) = \sqrt{\sum_i (P_i - Q_i)^2} \]
  - :city-block / :manhattan: Manhattan distance (\(L_1\) norm). \[ D(P, Q) = \sum_i |P_i - Q_i| \]
  - :chebyshev: Chebyshev distance (\(L_\infty\) norm), the maximum absolute difference. \[ D(P, Q) = \max_i |P_i - Q_i| \]
  - :minkowski: Minkowski distance (generalized \(L_p\) norm, controlled by :power). \[ D(P, Q) = \left(\sum_i |P_i - Q_i|^p\right)^{1/p} \]
  - :euclidean-sq / :squared-euclidean: Squared Euclidean distance. \[ D(P, Q) = \sum_i (P_i - Q_i)^2 \]
  - :squared-chord: Squared chord distance, related to Hellinger distance. \[ D(P, Q) = \sum_i (\sqrt{P_i} - \sqrt{Q_i})^2 \]
- Set-based/Overlap: Measures derived from the concept of set overlap applied to frequencies/probabilities.
  - :sorensen: Sorensen-Dice dissimilarity (1 - Dice similarity). \[ D(P, Q) = \frac{\sum_i |P_i - Q_i|}{\sum_i (P_i + Q_i)} \]
  - :gower: Gower distance (average Manhattan distance). \[ D(P, Q) = \frac{1}{N} \sum_i |P_i - Q_i| \]
  - :soergel: Soergel distance (1 - Jaccard similarity). \[ D(P, Q) = \frac{\sum_i |P_i - Q_i|}{\sum_i \max(P_i, Q_i)} \]
  - :kulczynski: Kulczynski dissimilarity (1 - Kulczynski similarity, can be > 1). \[ D(P, Q) = \frac{\sum_i |P_i - Q_i|}{\sum_i \min(P_i, Q_i)} \]
  - :canberra: Canberra distance, sensitive to small values. \[ D(P, Q) = \sum_i \frac{|P_i - Q_i|}{P_i + Q_i} \]
  - :lorentzian: Lorentzian distance. \[ D(P, Q) = \sum_i \ln(1 + |P_i - Q_i|) \]
  - :non-intersection: Non-intersection measure. \[ D(P, Q) = \frac{1}{2} \sum_i |P_i - Q_i| \]
  - :wave-hedges: Wave Hedges distance. \[ D(P, Q) = \sum_i \frac{|P_i - Q_i|}{\max(P_i, Q_i)} \]
  - :czekanowski: Czekanowski dissimilarity (same as Sorensen).
  - :motyka: Motyka dissimilarity. \[ D(P, Q) = 1 - \frac{\sum_i \min(P_i, Q_i)}{\sum_i (P_i + Q_i)} \]
  - :tanimoto: Tanimoto dissimilarity (extended Jaccard or Dice). \[ D(P, Q) = \frac{\sum_i (\max(P_i, Q_i) - \min(P_i, Q_i))}{\sum_i \max(P_i, Q_i)} \]
  - :jaccard: Jaccard dissimilarity (1 - Jaccard similarity). \[ D(P, Q) = \frac{\sum_i (P_i - Q_i)^2}{\sum_i P_i^2 + \sum_i Q_i^2 - \sum_i P_i Q_i} \]
  - :dice: Dice dissimilarity (1 - Dice similarity). \[ D(P, Q) = \frac{\sum_i (P_i - Q_i)^2}{\sum_i P_i^2 + \sum_i Q_i^2} \]
  - :bhattacharyya: Bhattacharyya distance. \[ D(P, Q) = -\ln \left( \sum_i \sqrt{P_i Q_i} \right) \]
  - :hellinger: Hellinger distance, derived from the Bhattacharyya coefficient. \[ D(P, Q) = \sqrt{2 \sum_i (\sqrt{P_i} - \sqrt{Q_i})^2} \]
  - :matusita: Matusita distance. \[ D(P, Q) = \sqrt{\sum_i (\sqrt{P_i} - \sqrt{Q_i})^2} \]
- Chi-squared based:
  - :pearson-chisq / :chisq: Pearson’s Chi-squared statistic. \[ D(P, Q) = \sum_i \frac{(P_i - Q_i)^2}{Q_i} \]
  - :neyman-chisq: Neyman’s Chi-squared statistic. \[ D(P, Q) = \sum_i \frac{(P_i - Q_i)^2}{P_i} \]
  - :squared-chisq: Squared Chi-squared distance. \[ D(P, Q) = \sum_i \frac{(P_i - Q_i)^2}{P_i + Q_i} \]
  - :symmetric-chisq: Symmetric Chi-squared distance. \[ D(P, Q) = 2 \sum_i \frac{(P_i - Q_i)^2}{P_i + Q_i} \]
  - :divergence: Divergence statistic. \[ D(P, Q) = 2 \sum_i \frac{(P_i - Q_i)^2}{(P_i + Q_i)^2} \]
  - :clark: Clark distance. \[ D(P, Q) = \sqrt{\sum_i \left(\frac{P_i - Q_i}{P_i + Q_i}\right)^2} \]
  - :additive-symmetric-chisq: Additive Symmetric Chi-squared distance. \[ D(P, Q) = \sum_i \frac{(P_i - Q_i)^2 (P_i + Q_i)}{P_i Q_i} \]
- Information Theory based (Divergences): Measure the difference in information content.
  - :kullback-leibler: Kullback-Leibler divergence (not symmetric, \(KL(P\|Q)\)). \[ D(P, Q) = \sum_i P_i \ln\left(\frac{P_i}{Q_i}\right) \]
  - :jeffreys: Jeffreys divergence (symmetric KL). \[ D(P, Q) = \sum_i (P_i - Q_i) \ln\left(\frac{P_i}{Q_i}\right) \]
  - :k-divergence: K divergence (related to KL). \[ D(P, Q) = \sum_i P_i \ln\left(\frac{2 P_i}{P_i + Q_i}\right) \]
  - :topsoe: Topsoe divergence. \[ D(P, Q) = \sum_i \left( P_i \ln\left(\frac{2 P_i}{P_i + Q_i}\right) + Q_i \ln\left(\frac{2 Q_i}{P_i + Q_i}\right) \right) \]
  - :jensen-shannon: Jensen-Shannon divergence (symmetric, finite, based on KL). \[ D(P, Q) = \frac{1}{2} \left( KL(P \| M) + KL(Q \| M) \right), \text{ where } M = \frac{P+Q}{2} \]
  - :jensen-difference: Jensen difference divergence. \[ D(P, Q) = \sum_i \left( \frac{P_i \ln P_i + Q_i \ln Q_i}{2} - \frac{P_i+Q_i}{2} \ln\left(\frac{P_i+Q_i}{2}\right) \right) \]
  - :taneja: Taneja divergence. \[ D(P, Q) = \sum_i \frac{P_i + Q_i}{2} \ln\left(\frac{(P_i+Q_i)/2}{\sqrt{P_i Q_i}}\right) \]
  - :kumar-johnson: Kumar-Johnson divergence. \[ D(P, Q) = \sum_i \frac{(P_i^2 - Q_i^2)^2}{2 (P_i Q_i)^{3/2}} \]
- Other:
  - :avg: Average of Manhattan and Chebyshev distances. \[ D(P, Q) = \frac{1}{2} \left( \sum_i |P_i - Q_i| + \max_i |P_i - Q_i| \right) \]
Let’s use the sepal length samples from the iris dataset.
(stats/dissimilarity :euclidean setosa-sepal-length virginica-sepal-length) ;; => 0.015621495942713016
(stats/dissimilarity :manhattan setosa-sepal-length virginica-sepal-length) ;; => 0.09134539463390746
(stats/dissimilarity :chebyshev setosa-sepal-length virginica-sepal-length) ;; => 0.005564421661826087
(stats/dissimilarity :minkowski setosa-sepal-length virginica-sepal-length) ;; => 0.015621495942713016
(stats/dissimilarity :minkowski setosa-sepal-length virginica-sepal-length {:power 0.5}) ;; => 3.931956786218106
(stats/dissimilarity :euclidean-sq setosa-sepal-length virginica-sepal-length) ;; => 2.4403113548819921E-4
(stats/dissimilarity :squared-chord setosa-sepal-length virginica-sepal-length) ;; => 0.0030351358423127664
(stats/dissimilarity :sorensen setosa-sepal-length virginica-sepal-length) ;; => 0.04567269731695373
(stats/dissimilarity :gower setosa-sepal-length virginica-sepal-length) ;; => 0.0018269078926781493
(stats/dissimilarity :kulczynski setosa-sepal-length virginica-sepal-length) ;; => 0.09571705050991852
(stats/dissimilarity :canberra setosa-sepal-length virginica-sepal-length) ;; => 2.2736449691924085
(stats/dissimilarity :lorentzian setosa-sepal-length virginica-sepal-length) ;; => 0.09122364351967303
(stats/dissimilarity :non-intersection setosa-sepal-length virginica-sepal-length) ;; => 0.04567269731695373
(stats/dissimilarity :wave-hedges setosa-sepal-length virginica-sepal-length) ;; => 4.267542427615257
(stats/dissimilarity :czekanowski setosa-sepal-length virginica-sepal-length) ;; => 0.04567269731695373
(stats/dissimilarity :motyka setosa-sepal-length virginica-sepal-length) ;; => 0.5228363486584767
(stats/dissimilarity :tanimoto setosa-sepal-length virginica-sepal-length) ;; => 0.08735562750016011
(stats/dissimilarity :jaccard setosa-sepal-length virginica-sepal-length) ;; => 0.012043840252620624
(stats/dissimilarity :dice setosa-sepal-length virginica-sepal-length) ;; => 0.0060584033473610865
(stats/dissimilarity :bhattacharyya setosa-sepal-length virginica-sepal-length) ;; => 0.001518720593674244
(stats/dissimilarity :hellinger setosa-sepal-length virginica-sepal-length) ;; => 0.0779119482789752
(stats/dissimilarity :matusita setosa-sepal-length virginica-sepal-length) ;; => 0.05509206696351892
(stats/dissimilarity :pearson-chisq setosa-sepal-length virginica-sepal-length) ;; => 0.01230611574919367
(stats/dissimilarity :neyman-chisq setosa-sepal-length virginica-sepal-length) ;; => 0.012113936206954518
(stats/dissimilarity :squared-chisq setosa-sepal-length virginica-sepal-length) ;; => 0.006058781702682577
(stats/dissimilarity :symmetric-chisq setosa-sepal-length virginica-sepal-length) ;; => 0.012117563405365154
(stats/dissimilarity :divergence setosa-sepal-length virginica-sepal-length) ;; => 0.30215920027968135
(stats/dissimilarity :clark setosa-sepal-length virginica-sepal-length) ;; => 0.38868959355743066
(stats/dissimilarity :additive-symmetric-chisq setosa-sepal-length virginica-sepal-length) ;; => 0.024420051956148194
(stats/dissimilarity :kullback-leibler setosa-sepal-length virginica-sepal-length) ;; => 0.006089967649043811
(stats/dissimilarity :jeffreys setosa-sepal-length virginica-sepal-length) ;; => 0.012148239424007154
(stats/dissimilarity :k-divergence setosa-sepal-length virginica-sepal-length) ;; => 0.0015126771903720877
(stats/dissimilarity :topsoe setosa-sepal-length virginica-sepal-length) ;; => 0.0030332163488238145
(stats/dissimilarity :jensen-shannon setosa-sepal-length virginica-sepal-length) ;; => 0.0015166081744119081
(stats/dissimilarity :jensen-difference setosa-sepal-length virginica-sepal-length) ;; => 0.001516608174411918
(stats/dissimilarity :taneja setosa-sepal-length virginica-sepal-length) ;; => 0.00152045168158987
(stats/dissimilarity :kumar-johnson setosa-sepal-length virginica-sepal-length) ;; => 0.024513334129211098
(stats/dissimilarity :avg setosa-sepal-length virginica-sepal-length) ;; => 0.04845490814786678
We can compare our data to a distribution. The method used here builds a histogram for P and quantizes the distribution for Q.
(stats/dissimilarity :chisq (stats/standardize setosa-sepal-length) r/default-normal) ;; => 0.11684559542476217
(stats/dissimilarity :chisq setosa-sepal-length (r/distribution :normal {:mu (stats/mean setosa-sepal-length), :sd (stats/stddev setosa-sepal-length)})) ;; => 0.11684559542476244
(stats/dissimilarity :chisq (repeatedly 1000 r/grand) r/default-normal) ;; => 0.021612937008293368
When the sample counts are not equal we can use histograms. We can also bin our data before comparison.
(stats/dissimilarity :gower (repeatedly 1000 r/grand) (repeatedly 800 r/grand) {:bins :auto}) ;; => 0.007386363636363636
(stats/dissimilarity :gower setosa-sepal-length virginica-sepal-length {:bins :auto}) ;; => 0.22499999999999998
(stats/dissimilarity :gower setosa-sepal-length virginica-sepal-length {:bins 10}) ;; => 0.18400000000000002
Similarity Methods
Higher values generally indicate greater similarity.
- Overlap/Set-based:
  - :intersection: Intersection measure (sum of element-wise minimums). \[ S(P, Q) = \sum_i \min(P_i, Q_i) \]
  - :czekanowski: Czekanowski similarity (same as Sorensen-Dice). \[ S(P, Q) = \frac{2 \sum_i \min(P_i, Q_i)}{\sum_i (P_i + Q_i)} \]
  - :motyka: Motyka similarity. \[ S(P, Q) = \frac{\sum_i \min(P_i, Q_i)}{\sum_i (P_i + Q_i)} \]
  - :kulczynski: Kulczynski similarity (can be > 1). \[ S(P, Q) = \frac{\sum_i \min(P_i, Q_i)}{\sum_i |P_i - Q_i|} \]
  - :ruzicka: Ruzicka similarity. \[ S(P, Q) = \frac{\sum_i \min(P_i, Q_i)}{\sum_i \max(P_i, Q_i)} \]
  - :fidelity: Probability fidelity (Bhattacharyya coefficient). \[ S(P, Q) = \sum_i \sqrt{P_i Q_i} \]
  - :squared-chord: Squared chord similarity (1 - Squared Chord dissimilarity). \[ S(P, Q) = 2 \sum_i \sqrt{P_i Q_i} - 1 \]
- Inner Product / Angle:
  - :inner-product: Inner product of the vectors. \[ S(P, Q) = \sum_i P_i Q_i \]
  - :cosine: Cosine similarity. \[ S(P, Q) = \frac{\sum_i P_i Q_i}{\sqrt{\sum_i P_i^2} \sqrt{\sum_i Q_i^2}} \]
- Set-based (adapted):
  - :jaccard: Jaccard similarity (generalized to distributions). \[ S(P, Q) = \frac{\sum_i P_i Q_i}{\sum_i P_i^2 + \sum_i Q_i^2 - \sum_i P_i Q_i} \]
  - :dice: Dice similarity (generalized to distributions). \[ S(P, Q) = \frac{2 \sum_i P_i Q_i}{\sum_i P_i^2 + \sum_i Q_i^2} \]
- Harmonic Mean:
  - :harmonic-mean: Harmonic mean similarity. \[ S(P, Q) = 2 \sum_i \frac{P_i Q_i}{P_i + Q_i} \]
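Two quick sanity checks for :cosine on toy vectors (expected values, up to the :epsilon replacement of zeros described above):

(stats/similarity :cosine [1.0 2.0] [2.0 4.0]) ;; parallel vectors, expected ≈ 1.0
(stats/similarity :cosine [1.0 0.0] [0.0 1.0]) ;; orthogonal vectors, expected ≈ 0.0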
(stats/similarity :intersection setosa-sepal-length virginica-sepal-length) ;; => 0.9543273026830466
(stats/similarity :czekanowski setosa-sepal-length virginica-sepal-length) ;; => 0.9543273026830466
(stats/similarity :motyka setosa-sepal-length virginica-sepal-length) ;; => 0.4771636513415233
(stats/similarity :kulczynski setosa-sepal-length virginica-sepal-length) ;; => 10.447459409505903
(stats/similarity :ruzicka setosa-sepal-length virginica-sepal-length) ;; => 0.9126443724998404
(stats/similarity :fidelity setosa-sepal-length virginica-sepal-length) ;; => 0.9984824320788436
(stats/similarity :squared-chord setosa-sepal-length virginica-sepal-length) ;; => 0.9969648641576871
(stats/similarity :inner-product setosa-sepal-length virginica-sepal-length) ;; => 0.020017872905882722
(stats/similarity :cosine setosa-sepal-length virginica-sepal-length) ;; => 0.9939438317187768
(stats/similarity :jaccard setosa-sepal-length virginica-sepal-length) ;; => 0.9879561597473809
(stats/similarity :dice setosa-sepal-length virginica-sepal-length) ;; => 0.9939415966526397
(stats/similarity :harmonic-mean setosa-sepal-length virginica-sepal-length) ;; => 0.9969706091486588
As for dissimilarity, we can compare our data to a distribution. The method used here builds a histogram for P and quantizes the distribution for Q.
(stats/similarity :dice (stats/standardize setosa-sepal-length) r/default-normal) ;; => 0.941700344004566
(stats/similarity :dice setosa-sepal-length (r/distribution :normal {:mu (stats/mean setosa-sepal-length), :sd (stats/stddev setosa-sepal-length)})) ;; => 0.9417003440045658
(stats/similarity :dice (repeatedly 10000 r/grand) r/default-normal) ;; => 0.9984869550039952
When the sample counts are not equal we can use histograms. We can also bin our data before comparison.
(stats/similarity :ruzicka (repeatedly 1000 r/grand) (repeatedly 800 r/grand) {:bins :auto}) ;; => 0.8535681186283597
(stats/similarity :ruzicka setosa-sepal-length virginica-sepal-length {:bins :auto}) ;; => 0.052631578947368425
(stats/similarity :ruzicka setosa-sepal-length virginica-sepal-length {:bins 10}) ;; => 0.04166666666666666
Contingency Tables
Functions for creating and analyzing contingency tables.
contingency-table, rows->contingency-table, contingency-table->marginals
contingency-2x2-measures-all, contingency-2x2-measures
mcc
cramers-c, cramers-v, cramers-v-corrected
cohens-w, tschuprows-t
cohens-kappa, weighted-kappa
Contingency tables, also known as cross-tabulations or cross-tabs, are a fundamental tool in statistics for displaying the frequency distribution of two or more categorical variables. They help visualize and analyze the relationship or association between these variables.
For two variables, a contingency table typically looks like this:
| | Category 1 (Col) | Category 2 (Col) | … | Row Totals (Marginals) |
|---|---|---|---|---|
| Cat A (Row) | \(n_{11}\) | \(n_{12}\) | … | \(R_1 = \sum_j n_{1j}\) |
| Cat B (Row) | \(n_{21}\) | \(n_{22}\) | … | \(R_2 = \sum_j n_{2j}\) |
| … | … | … | … | … |
| Column Totals (Marginals) | \(C_1 = \sum_i n_{i1}\) | \(C_2 = \sum_i n_{i2}\) | … | \(N = \sum_i \sum_j n_{ij}\) |
Where \(n_{ij}\) is the count of observations falling into row category \(i\) and column category \(j\). \(R_i\) are row marginal totals, \(C_j\) are column marginal totals, and \(N\) is the grand total number of observations.
Let’s use data from a Wikipedia article as the 2x2 example.
| | Right-handed | Left-handed | Total |
|---|---|---|---|
| Male | 43 | 9 | 52 |
| Female | 44 | 4 | 48 |
| Total | 87 | 13 | 100 |
(def ct-data-1 [[43 9] [44 4]])
Another example comes from an OpenStax book.
| | Lake Path | Hilly Path | Wooded Path | Total |
|---|---|---|---|---|
| Younger | 45 | 38 | 27 | 110 |
| Older | 26 | 52 | 12 | 90 |
| Total | 71 | 90 | 39 | 200 |
(def ct-data-2 [[45 38 27] [26 52 12]])
The last example consists of two sequences for gender and exam outcome.
(def gender [:male :female :male :female :male :male :female :female :female :male :male])
(def outcome [:pass :fail :pass :pass :fail :pass :fail :pass :pass :pass :pass])
fastmath.stats provides functions for creating and analyzing these tables:
- contingency-table: Creates a frequency map from one or more sequences. If given two sequences of equal length, say vars1 and vars2, it produces a map where keys are pairs [value-from-vars1, value-from-vars2] and values are the counts of these co-occurrences.
- rows->contingency-table: Takes a sequence of sequences, interpreted as rows of counts in a grid, and converts it into a map format where keys are [row-index, column-index] and values are the non-zero counts. This is useful for inputting tables structured as lists of lists.
- contingency-table->marginals: Calculates the row totals (:rows), column totals (:cols), grand total (:n), and diagonal elements (:diag) from a contingency table (either in map format or sequence of sequences format).
Let’s create a simple contingency table from above data.
(def ct-gender-outcome (stats/contingency-table gender outcome))
(def contingency-table-1 (stats/rows->contingency-table ct-data-1))
(def contingency-table-2 (stats/rows->contingency-table ct-data-2))

ct-gender-outcome   ;; => {[:male :pass] 5, [:female :fail] 2, [:female :pass] 3, [:male :fail] 1}
contingency-table-1 ;; => {[0 0] 43, [0 1] 9, [1 0] 44, [1 1] 4}
contingency-table-2 ;; => {[0 0] 45, [0 1] 38, [0 2] 27, [1 0] 26, [1 1] 52, [1 2] 12}
(stats/contingency-table->marginals ct-gender-outcome)   ;; => {:rows ([:female 5.0] [:male 6.0]), :cols ([:fail 3.0] [:pass 8.0]), :n 11.0, :diag ()}
(stats/contingency-table->marginals contingency-table-1) ;; => {:rows ([0 52.0] [1 48.0]), :cols ([0 87.0] [1 13.0]), :n 100.0, :diag ([[0 0] 43] [[1 1] 4])}
(stats/contingency-table->marginals contingency-table-2) ;; => {:rows ([0 110.0] [1 90.0]), :cols ([0 71.0] [1 90.0] [2 39.0]), :n 200.0, :diag ([[0 0] 45] [[1 1] 52])}
Measures of Association: These statistics quantify the strength and nature of the relationship between the variables in the table. They are often derived from the Pearson’s Chi-squared statistic (\(\chi^2\), obtainable via [chisq-test]), which tests for independence.
- cramers-c: Cramer’s C is a measure of association for any \(R \times K\) contingency table. It ranges from 0 to 1, where 0 indicates no association and 1 indicates perfect association. \[ C = \sqrt{\frac{\chi^2}{N + \chi^2}} \]
- cramers-v: Cramer’s V is another measure of association, also ranging from 0 to 1. It is widely used and is often preferred over Tschuprow’s T because it can attain the value 1 even for non-square tables. \[ V = \sqrt{\frac{\chi^2/N}{\min(R-1, K-1)}} \]
- cramers-v-corrected: Corrected Cramer’s V (\(V^*\)) applies a bias correction, which is particularly important for small sample sizes or tables with many cells having low expected counts.
- cohens-w: Cohen’s W is a measure of effect size for Chi-squared tests. It quantifies the magnitude of the difference between observed and expected frequencies. It ranges from 0 upwards, with 0 indicating no difference (independence). \[ W = \sqrt{\frac{\chi^2}{N}} \]
- tschuprows-t: Tschuprow’s T is a measure of association ranging from 0 to 1. However, it can only reach 1 in square tables (\(R=K\)). \[ T = \sqrt{\frac{\chi^2/N}{\sqrt{(R-1)(K-1)}}} \]
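Cohen’s W can be reproduced by hand from Pearson’s \(\chi^2\); the value below is taken from the contingency-2x2-measures-all output for ct-data-1 shown later in this section (\(\chi^2 \approx 1.7774\), \(N = 100\); fastmath.core aliased as m):

;; W = sqrt(chi2 / N)
(m/sqrt (/ 1.7774150400145103 100.0)) ;; => 0.13331972997326805, the cohens-w value computed below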
fastmath.stats allows you to calculate these measures by providing either the raw sequences or a pre-calculated contingency table:
(stats/cramers-c ct-gender-outcome)           ;; => 0.25242645609111675
(stats/cramers-v ct-gender-outcome)           ;; => 0.26087459737497565
(stats/cramers-v-corrected ct-gender-outcome) ;; => 0.0
(stats/cohens-w ct-gender-outcome)            ;; => 0.26087459737497565
(stats/tschuprows-t ct-gender-outcome)        ;; => 0.26087459737497565

(stats/cramers-c contingency-table-1)           ;; => 0.13215047155447307
(stats/cramers-v contingency-table-1)           ;; => 0.13331972997326805
(stats/cramers-v-corrected contingency-table-1) ;; => 0.08804224922800516
(stats/cohens-w contingency-table-1)            ;; => 0.13331972997326805
(stats/tschuprows-t contingency-table-1)        ;; => 0.13331972997326805

(stats/cramers-c contingency-table-2)           ;; => 0.22972682313063483
(stats/cramers-v contingency-table-2)           ;; => 0.23603966869637247
(stats/cramers-v-corrected contingency-table-2) ;; => 0.21423142299458464
(stats/cohens-w contingency-table-2)            ;; => 0.23603966869637247
(stats/tschuprows-t contingency-table-2)        ;; => 0.19848491126445403
Measures of Agreement: These statistics are typically used for square tables (\(R=K\)) to assess the consistency of agreement between two independent raters or methods that categorize items.
- cohens-kappa: Cohen’s Kappa (\(\kappa\)) measures the agreement between two raters for nominal (or ordinal) categories, correcting for the agreement that would be expected by chance. It ranges from -1 (perfect disagreement) to +1 (perfect agreement), with 0 indicating agreement equivalent to chance. \[ \kappa = \frac{p_0 - p_e}{1 - p_e} \] where \(p_0\) is the observed proportional agreement (sum of diagonal cells divided by N) and \(p_e\) is the proportional agreement expected by chance (based on marginal probabilities).
- weighted-kappa: Weighted Kappa (\(\kappa_w\)) is an extension of Cohen’s Kappa that allows for different levels of disagreement penalties, suitable for ordinal categories. Disagreements between categories that are closer together (e.g., “Good” vs “Very Good”) are penalized less than disagreements between categories that are further apart (e.g., “Poor” vs “Excellent”). It requires specifying a weighting scheme (e.g., :equal-spacing, :fleiss-cohen).
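For the handedness table (ct-data-1), \(\kappa\) can be reproduced by hand from the definition: the diagonal agreement is \((43+4)/100\) and the chance agreement comes from the marginals:

;; kappa = (p0 - pe) / (1 - pe) for [[43 9] [44 4]]
(let [p0 (/ (+ 43 4) 100.0)               ;; observed agreement
      pe (+ (* 0.52 0.87) (* 0.48 0.13))] ;; chance agreement from row/column marginals
  (/ (- p0 pe) (- 1.0 pe)))               ;; ≈ -0.0923330585, the cohens-kappa value below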
Examples for agreement measures (using the [rows->contingency-table] format for clarity on cell positions):
(stats/cohens-kappa ct-gender-outcome)   ;; => -1.0862068965517244
(stats/cohens-kappa contingency-table-1) ;; => -0.09233305853256403
(stats/weighted-kappa contingency-table-1 :equal-spacing) ;; => -0.09233305853256403
(stats/weighted-kappa contingency-table-1 :fleiss-cohen)  ;; => -0.09233305853256403
(stats/weighted-kappa contingency-table-1 {[0 0] 0.1, [1 0] 0.2, [0 1] 0.4, [1 1] 0.3}) ;; => 0.005427145418423201
(stats/weighted-kappa contingency-table-1 (fn [dim row-id col-id] (/ (max row-id col-id) dim))) ;; => 0.049513704686118425
2x2 Specific Measures: These functions are tailored for 2x2 tables.
- mcc: Calculates the [mcc], which is equivalent to the Phi coefficient for a 2x2 table. It is a single, balanced measure of classification performance, ranging from -1 to +1.
- contingency-2x2-measures: A convenience function that returns a map containing a selection of commonly used statistics specifically for 2x2 tables (Chi-squared, Kappa, Phi, Yule’s Q, Odds Ratio, etc.).
- contingency-2x2-measures-all: Provides a very comprehensive map of measures for a 2x2 table, including various Chi-squared statistics and p-values, measures of association, agreement, and risk/effect size measures like Odds Ratio (OR), Relative Risk (RR), and Number Needed to Treat (NNT). It’s the most detailed summary for a 2x2 table.
(stats/mcc contingency-table-1) ;; => -0.133319729973268
The contingency-2x2-measures-all function accepts the counts a, b, c, d directly, a sequence [a b c d], a matrix [[a b] [c d]], or a map {:a a :b b :c c :d d}.
The map returned by contingency-2x2-measures-all
is extensive. Here’s a description of its main keys and their contents:
- :n: The grand total number of observations in the table (\(N\)).
- :table: A map representation of the input counts, typically {:a a, :b b, :c c, :d d} corresponding to the top-left, top-right, bottom-left, bottom-right cells.
- :expected: A map of the expected counts {:a exp_a, :b exp_b, :c exp_c, :d exp_d} for each cell under the assumption of independence.
- :marginals: A map containing the row totals (:row1, :row2), column totals (:col1, :col2), and the grand total (:total).
- :proportions: A nested map providing proportions:
  - :table: Cell counts divided by the grand total (N).
  - :rows: Cell counts divided by their respective row totals (conditional probabilities of columns given rows).
  - :cols: Cell counts divided by their respective column totals (conditional probabilities of rows given columns).
  - :marginals: Row and column totals divided by the grand total (N).
- :p-values: A map containing p-values for various Chi-squared tests:
  - :chi2: Pearson’s Chi-squared test p-value.
  - :yates: Yates’ continuity corrected Chi-squared test p-value.
  - :cochran-mantel-haenszel: Cochran-Mantel-Haenszel statistic p-value (useful for stratified data, but calculated for the single table here).
- :OR: The Odds Ratio (\(OR\)) for the 2x2 table, typically \((a \times d) / (b \times c)\). Quantifies the strength of association, representing the odds of an outcome occurring in one group compared to the odds in another group.
- :lOR: The natural logarithm of the Odds Ratio (\(\ln(OR)\)). Useful for constructing confidence intervals for the OR.
- :RR: The Relative Risk (\(RR\)) for the 2x2 table, typically \((a/(a+b)) / (c/(c+d))\). Compares the probability of an outcome in one group to the probability in another group. Requires row marginals to represent the exposed/unexposed or intervention/control groups.
- :risk: A map containing various risk-related measures, especially relevant in epidemiology or clinical trials, derived from the counts assuming a specific row structure (e.g., exposure/outcome):
  - :RR: Relative Risk (same as top-level :RR).
  - :RRR: Relative Risk Reduction (\(1 - RR\)).
  - :RD: Risk Difference (\(a/(a+b) - c/(c+d)\)).
  - :ES: Exposure Sample size (row 1 total).
  - :CS: Control Sample size (row 2 total).
  - :EER: Experimental Event Rate (\(a/(a+b)\)).
  - :CER: Control Event Rate (\(c/(c+d)\)).
  - :ARR: Absolute Risk Reduction (\(CER - EER\)).
  - :NNT: Number Needed to Treat (\(1 / ARR\)). The average number of patients who need to be treated to prevent one additional bad outcome.
  - :ARI: Absolute Risk Increase (\(-ARR\)).
  - :NNH: Number Needed to Harm (\(-NNT\)).
  - :RRI: Relative Risk Increase (\(RR - 1\)).
  - :AFe: Attributable Fraction among the exposed (\((EER - CER)/EER\)).
  - :PFu: Prevented Fraction among the unexposed (\(1 - RR\)).
- :SE: Standard Error related values used in some calculations.
- :measures: A map containing a wide array of statistical measures:
  - :chi2: Pearson’s Chi-squared statistic (the statistic itself; the corresponding p-value is under :p-values).
  - :yates: Yates’ continuity corrected Chi-squared statistic.
  - :cochran-mantel-haenszel: Cochran-Mantel-Haenszel statistic.
  - :cohens-kappa: Cohen’s Kappa coefficient.
  - :yules-q: Yule’s Q, a measure of association based on the Odds Ratio. Ranges -1 to +1.
  - :holley-guilfords-g: Holley-Guilford’s G.
  - :huberts-gamma: Hubert’s Gamma.
  - :yules-y: Yule’s Y, based on square roots of cell proportions. Ranges -1 to +1.
  - :cramers-v: Cramer’s V measure of association.
  - :phi: Phi coefficient (equivalent to MCC for 2x2 tables), calculated as \(\frac{ad-bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}\). Ranges -1 to +1.
  - :scotts-pi: Scott’s Pi, another measure of inter-rater reliability, similar to Kappa.
  - :cohens-h: Cohen’s H, a measure of effect size for comparing two proportions.
  - :PCC: Pearson’s Contingency Coefficient.
  - :PCC-adjusted: Adjusted Pearson’s Contingency Coefficient.
  - :TCC: Tschuprow’s Contingency Coefficient.
  - :F1: F1 Score, common in binary classification, the harmonic mean of precision and recall. Calculated assuming a=TP, b=FP, c=FN, d=TN.
  - :bangdiwalas-b: Bangdiwala’s B, a measure of agreement/reproducibility for clustered data, but calculated for the 2x2 table.
  - :mcnemars-chi2: McNemar’s Chi-squared test statistic, specifically for paired nominal data (e.g., before/after) to test for changes in proportions. Calculated from the off-diagonal elements (\(b\) and \(c\)).
  - :gwets-ac1: Gwet’s AC1, another measure of inter-rater reliability claimed to be more robust to prevalence issues than Kappa.
(stats/contingency-2x2-measures-all (flatten ct-data-1))
{:OR 0.43434343434343436,
 :RR 0.9020979020979021,
:SE 0.6380393387494788,
:expected {:a 45.24, :b 6.76, :c 41.76, :d 6.24},
:lOR -0.8339197344410274,
:marginals {:col1 87, :col2 13, :row1 52, :row2 48, :total 100},
:measures {:F1 0.6187050359712231,
:PCC 0.13215047155447307,
:PCC-adjusted 0.18688898914633573,
:TCC -0.3172384169581813,
:bangdiwalas-b 0.3622766122766123,
:chi2 1.7774150400145103,
:cochran-mantel-haenszel 1.7596408896143674,
:cohens-h -0.27245415531951833,
:cohens-kappa -0.09233305853256389,
:cramers-v 0.133319729973268,
:gwets-ac1 0.07994097734571652,
:holley-guilfords-g -0.06,
:huberts-gamma 0.0036,
:mcnemars-chi2 23.11320754716981,
:phi -0.133319729973268,
:scotts-pi -0.2501474230451704,
:yates 1.0724852071005921,
:youdens-j -0.08974358974358974,
:yules-q -0.3943661971830986,
:yules-y -0.20551108882876298},
:n 100,
:p-values {:chi2 0.1824670652605519,
:cochran-mantel-haenszel 0.18466931414678545,
:yates 0.3003847703905692},
:proportions {:cols {:a 0.4942528735632184,
:b 0.6923076923076923,
:c 0.5057471264367817,
:d 0.3076923076923077},
:marginals {:col1 0.87, :col2 0.13, :row1 0.52, :row2 0.48},
:rows {:a 0.8269230769230769,
:b 0.17307692307692307,
:c 0.9166666666666666,
:d 0.08333333333333333},
:table {:a 0.43, :b 0.09, :c 0.44, :d 0.04}},
:risk {:AFe -0.10852713178294575,
:ARI -0.08974358974358976,
:ARR 0.08974358974358976,
:CER 0.9166666666666666,
:CS 48,
:EER 0.8269230769230769,
:ES 52,
:NNH -11.14285714285714,
:NNT 11.14285714285714,
:PFu 0.09790209790209792,
:RD -0.08974358974358976,
:RR 0.9020979020979021,
:RRI -0.09790209790209792,
:RRR 0.09790209790209792},
:table {:a 43, :b 9, :c 44, :d 4}}
Binary Classification Metrics
Metrics derived from a 2x2 confusion matrix are essential for evaluating the performance of algorithms in binary classification tasks (problems where the outcome belongs to one of two classes, e.g., “positive” or “negative”, “success” or “failure”). These metrics quantify how well a classifier distinguishes between the two classes.
confusion-matrix
binary-measures-all, binary-measures
The foundation for these metrics is the confusion matrix, a 2x2 table summarizing the results by comparing the actual class of each instance to the predicted class:
| | Predicted Positive (P’) | Predicted Negative (N’) | Total |
|---|---|---|---|
| Actual Positive (P) | True Positives (TP) | False Negatives (FN) | P |
| Actual Negative (N) | False Positives (FP) | True Negatives (TN) | N |
| Total | P’ | N’ | Total |
- True Positives (TP): Instances correctly predicted as positive.
- False Negatives (FN): Instances incorrectly predicted as negative (Type II error). These are positive instances missed by the classifier.
- False Positives (FP): Instances incorrectly predicted as positive (Type I error). These are negative instances misclassified as positive.
- True Negatives (TN): Instances correctly predicted as negative.
The counts from these four cells form the basis for almost all binary classification metrics.
fastmath.stats
provides functions to generate and analyze these matrices:
- confusion-matrix: This function constructs the 2x2 confusion matrix counts (:tp, :fn, :fp, :tn). It can take the four counts directly, a structured map/sequence representation, or two sequences of actual and predicted outcomes. It can also handle different data types for outcomes using an encode-true parameter.
(stats/confusion-matrix 10 2 5 80)    ;; => {:tp 10, :fn 2, :fp 5, :tn 80}
(stats/confusion-matrix [10 2 5 80])  ;; => {:tp 10, :fn 2, :fp 5, :tn 80}
(stats/confusion-matrix [[10 5] [2 80]]) ;; => {:tp 10, :fn 5, :fp 2, :tn 80}
(stats/confusion-matrix {:tp 10, :fn 2, :fp 5, :tn 80}) ;; => {:tp 10, :fn 2, :fp 5, :tn 80}
(stats/confusion-matrix [:pos :neg :pos :pos :neg :pos :neg :pos :neg :pos :pos] [:pos :neg :pos :pos :neg :pos :neg :neg :pos :pos :pos] #{:pos}) ;; => {:tp 6, :tn 3, :fn 1, :fp 1}
(stats/confusion-matrix [1 0 1 1 0 1 0 1 0 1 1] [1 0 1 1 0 1 0 0 1 1 1]) ;; => {:tp 6, :tn 3, :fn 1, :fp 1}
(stats/confusion-matrix [true false true true false true false true false true true] [true false true true false true false false true true true]) ;; => {:tp 6, :tn 3, :fn 1, :fp 1}
- binary-measures-all: This is the primary function for calculating a wide range of metrics from a 2x2 confusion matrix. It accepts the same input formats as confusion-matrix. It returns a map containing a comprehensive set of derived statistics.
- binary-measures: A convenience function that calls binary-measures-all but returns a smaller, more commonly used subset of the metrics. It accepts the same input formats.
The map returned by binary-measures-all
contains numerous metrics. Key metrics include:
- :tp, :fn, :fp, :tn: The raw counts from the confusion matrix.
- :total: Total number of instances (\(TP + FN + FP + TN\)).
- :cp, :cn, :pcp, :pcn: Marginal totals (e.g., :cp is actual positives, \(TP + FN\)).
- :accuracy: Overall proportion of correct predictions: \[ Accuracy = \frac{TP + TN}{TP + FN + FP + TN} \]
- :sensitivity, :recall, :tpr: True Positive Rate (proportion of actual positives correctly identified): \[ Sensitivity = \frac{TP}{TP + FN} \]
- :specificity, :tnr: True Negative Rate (proportion of actual negatives correctly identified): \[ Specificity = \frac{TN}{FP + TN} \]
- :precision, :ppv: Positive Predictive Value (proportion of positive predictions that were actually positive): \[ Precision = \frac{TP}{TP + FP} \]
- :fdr: False Discovery Rate (\(1 - Precision\)).
- :npv: Negative Predictive Value (proportion of negative predictions that were actually negative): \[ NPV = \frac{TN}{FN + TN} \]
- :for: False Omission Rate (\(1 - NPV\)).
- :fpr: False Positive Rate (proportion of actual negatives incorrectly identified as positive): \[ FPR = \frac{FP}{FP + TN} \]
- :fnr: False Negative Rate (proportion of actual positives incorrectly identified as negative): \[ FNR = \frac{FN}{TP + FN} \]
- :f1-score, :f-measure: The harmonic mean of Precision and Recall. It provides a balance between the two: \[ F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall} \]
- :mcc, :phi: Matthews Correlation Coefficient / Phi coefficient. A balanced measure ranging from -1 to +1, considered robust for imbalanced datasets: \[ MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \]
- :prevalence: Proportion of the positive class in the actual data: \[ Prevalence = \frac{TP + FN}{TP + FN + FP + TN} \]
- Other metrics like :ba (Balanced Accuracy), :lr+, :lr-, :dor, :fm, :pt, :bm, :kappa (Cohen’s Kappa for 2x2), :mk, and :f-beta (a function to calculate the F-beta score for any beta value).
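A few of these can be derived by hand from the confusion matrix used in the example below, {:tp 10 :fn 2 :fp 5 :tn 80} (a sketch; the values follow from the formulas above):

(let [tp 10.0 fn' 2.0 fp 5.0 tn 80.0
      precision (/ tp (+ tp fp))
      recall    (/ tp (+ tp fn'))]
  {:accuracy  (/ (+ tp tn) (+ tp fn' fp tn)) ;; => 0.9278350515463918
   :precision precision                      ;; => 0.6666666666666666
   :recall    recall                         ;; => 0.8333333333333334
   :f1        (/ (* 2.0 precision recall)
                 (+ precision recall))})     ;; => 0.7407407407407408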
binary-measures returns a map containing :tp, :tn, :fp, :fn, :accuracy, :fdr, :f-measure, :fall-out (FPR), :precision, :recall (Sensitivity/TPR), :sensitivity, :specificity (TNR), and :prevalence. This provides a useful quick summary. For a deeper dive or specific niche metrics, use binary-measures-all.
Let’s calculate the metrics for a sample confusion matrix {:tp 10 :fn 2 :fp 5 :tn 80}.
(stats/binary-measures-all {:tp 10, :fn 2, :fp 5, :tn 80})
{:accuracy 0.9278350515463918,
 :ba 0.8872549019607843,
:bm 0.7745098039215685,
:cn 85.0,
:cp 12.0,
:dor 80.00000000000003,
:f-beta #<Fn@1cc15f2e fastmath.stats/binary_measures_all_calc[fn]>,
:f-measure 0.7407407407407408,
:f1-score 0.7407407407407408,
:fall-out 0.058823529411764705,
:fdr 0.33333333333333337,
:fm 0.7453559924999299,
:fn 2,
:fnr 0.16666666666666663,
:for 0.024390243902439046,
:fp 5,
:fpr 0.058823529411764705,
:hit-rate 0.8333333333333334,
:jaccard 0.5882352941176471,
:kappa 0.6994245241257193,
:lr+ 14.166666666666668,
:lr- 0.1770833333333333,
:mcc 0.7053009189406806,
:miss-rate 0.16666666666666663,
:mk 0.6422764227642275,
:n 85.0,
:npv 0.975609756097561,
:p 12.0,
:pcn 82.0,
:pcp 15.0,
:phi 0.7053009189406806,
:pn 82.0,
:pp 15.0,
:ppv 0.6666666666666666,
:precision 0.6666666666666666,
:prevalence 0.12371134020618557,
:pt 0.20991366558572697,
:recall 0.8333333333333334,
:selectivity 0.9411764705882353,
:sensitivity 0.8333333333333334,
:specificity 0.9411764705882353,
:tn 80,
:tnr 0.9411764705882353,
:total 97.0,
:tp 10,
:tpr 0.8333333333333334,
:ts 0.5882352941176471}
(stats/binary-measures {:tp 10, :fn 2, :fp 5, :tn 80})
{:accuracy 0.9278350515463918,
 :f-measure 0.7407407407407408,
:fall-out 0.058823529411764705,
:fdr 0.33333333333333337,
:fn 2,
:fp 5,
:precision 0.6666666666666666,
:prevalence 0.12371134020618557,
:recall 0.8333333333333334,
:sensitivity 0.8333333333333334,
:specificity 0.9411764705882353,
:tn 80,
:tp 10}
Effect Size
Measures quantifying the magnitude of a phenomenon or relationship.
Difference Family
Measures quantifying the magnitude of the difference between two group means, standardized by a measure of variability. These are widely used effect size statistics, particularly in comparing experimental and control groups or two distinct conditions.
cohens-d, cohens-d-corrected
hedges-g, hedges-g-corrected, hedges-g*
glass-delta
These functions primarily quantify the difference between the means of two groups (usually group 1 and group 2: \(\bar{x}_1 - \bar{x}_2\)), scaled by some estimate of the population standard deviation.
- Cohen’s d (cohens-d): Measures the difference between two means divided by the pooled standard deviation. It assumes equal variances between the groups. The formula is: \[ d = \frac{\bar{x}_1 - \bar{x}_2}{s_p} \] where \(s_p = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}}\) is the unbiased pooled standard deviation. An optional method argument can be passed to select the pooled standard deviation calculation method (see [pooled-stddev]).
- Hedges’ g (hedges-g): In fastmath.stats, this function is equivalent to cohens-d using the default (unbiased) pooled standard deviation. Conceptually, Hedges’ g also divides the mean difference by a pooled standard deviation, but often refers to a bias-corrected version for small samples (see below). \[ g = \frac{\bar{x}_1 - \bar{x}_2}{s_p} \]
- Bias-Corrected Cohen’s d (cohens-d-corrected) / Bias-Corrected Hedges’ g (hedges-g-corrected): These functions apply a small-sample bias correction factor to Cohen’s d or Hedges’ g. This correction factor (an approximation) is multiplied by the calculated d or g value to provide a less biased estimate of the population effect size, especially for small sample sizes (\(n < 20\)). The correction factor is approximately \(1 - \frac{3}{4\nu - 1}\), where \(\nu = n_1+n_2-2\) are the degrees of freedom.
- Exact Bias-Corrected Hedges’ g (hedges-g*): Applies the precise bias correction factor \(J(\nu)\) (based on the Gamma function) to Hedges’ g (or Cohen’s d with unbiased pooling). This provides the most theoretically accurate bias correction for small samples. \[ g^* = g \cdot J(\nu) \] where \(\nu = n_1+n_2-2\) and \(J(\nu) = \frac{\Gamma(\nu/2)}{\sqrt{\nu/2}\, \Gamma((\nu-1)/2)}\).
- Glass’s Delta (glass-delta): Measures the difference between two means divided by the standard deviation of the control group (conventionally group 2). \[ \Delta = \frac{\bar{x}_1 - \bar{x}_2}{s_2} \] This is useful when the control group’s variance is considered a better estimate of the population variance than a pooled variance, or when the intervention is expected to affect variance.
The returned value from these functions represents the difference between the means of Group 1 and Group 2, expressed in units of the relevant standard deviation (pooled for Cohen’s/Hedges’, control group’s for Glass’s).
- A positive value indicates that the mean of Group 1 is greater than the mean of Group 2.
- A negative value indicates that the mean of Group 1 is less than the mean of Group 2.
- The magnitude of the value indicates the size of the difference relative to the spread of the data. Cohen’s informal guidelines for interpreting the magnitude of d (and often g) are:
- \(|d| \approx 0.2\): small effect
- \(|d| \approx 0.5\): medium effect
- \(|d| \approx 0.8\): large effect
Comparison:
- cohens-d and hedges-g (as implemented here) are standard measures for the mean difference scaled by pooled standard deviation, assuming equal variances. Cohen’s d is more commonly cited, while Hedges’ g is often used when sample size is small.
- cohens-d-corrected, hedges-g-corrected, and hedges-g* are preferred over the standard d/g when sample sizes are small, as they provide less biased estimates of the population effect size. hedges-g* uses the most accurate correction formula.
- glass-delta is distinct in using only the control group’s standard deviation for scaling. Use this when the variance of the control group is more appropriate as a baseline (e.g., comparing an experimental group to a control, or if the intervention might affect variance).
Let’s illustrate these with the sepal lengths from the iris dataset, using ‘virginica’ as group 1 and ‘setosa’ as group 2.
(stats/cohens-d virginica-sepal-length setosa-sepal-length) ;; => 3.0772391640158845
(stats/cohens-d-corrected virginica-sepal-length setosa-sepal-length) ;; => 3.053628633345686
(stats/cohens-d virginica-sepal-length setosa-sepal-length :biased) ;; => 3.1084809717263635
(stats/cohens-d-corrected virginica-sepal-length setosa-sepal-length :biased) ;; => 3.0851089343449623
(stats/cohens-d virginica-sepal-length setosa-sepal-length :avg) ;; => 3.0772391640158845
(stats/cohens-d-corrected virginica-sepal-length setosa-sepal-length :avg) ;; => 3.053628633345686
(stats/hedges-g virginica-sepal-length setosa-sepal-length) ;; => 3.0772391640158845
(stats/hedges-g-corrected virginica-sepal-length setosa-sepal-length) ;; => 3.053628633345686
(stats/hedges-g* virginica-sepal-length setosa-sepal-length) ;; => 3.0536185452069557
(stats/glass-delta virginica-sepal-length setosa-sepal-length) ;; => 4.488074566113517
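As a cross-check, Cohen’s d can be rebuilt from its definition using only stats/mean and stats/variance (the unbiased sample variance). This is a minimal sketch, not part of the library API:

;; Cohen's d by hand: mean difference over the unbiased pooled standard deviation.
(defn cohens-d-manual [xs ys]
  (let [n1 (count xs) n2 (count ys)
        pooled (m/sqrt (/ (+ (* (dec n1) (stats/variance xs))
                             (* (dec n2) (stats/variance ys)))
                          (+ n1 n2 -2)))]
    (/ (- (stats/mean xs) (stats/mean ys)) pooled)))

(cohens-d-manual virginica-sepal-length setosa-sepal-length)
;; expected to match (stats/cohens-d virginica-sepal-length setosa-sepal-length)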
Ratio Family
Effect size measures in the ratio family quantify the difference between two group means as a multiplicative factor, rather than a difference. This is useful when comparing values that are inherently ratios or when expressing the effect in terms of relative change.
means-ratio, means-ratio-corrected
- Ratio of Means (means-ratio): Calculates the ratio of the mean of the first group (\(\bar{x}_1\)) to the mean of the second group (\(\bar{x}_2\)). \[ Ratio = \frac{\bar{x}_1}{\bar{x}_2} \] A value greater than 1 means the first group’s mean is larger; less than 1 means it’s smaller. A value of 1 indicates equal means.
- Corrected Ratio of Means (means-ratio-corrected): Applies a small-sample bias correction to the simple ratio of means. This provides a less biased estimate of the population ratio, especially for small sample sizes. The correction is based on incorporating the variances of the two groups. This function is equivalent to calling (means-ratio group1 group2 true).
Use means-ratio for a direct, uncorrected ratio. Use means-ratio-corrected for a less biased estimate of the population ratio, particularly advisable with small sample sizes.
Let’s calculate these for the virginica and setosa sepal lengths from the iris dataset.
(stats/means-ratio virginica-sepal-length setosa-sepal-length) ;; => 1.316020775069916
(stats/means-ratio-corrected virginica-sepal-length setosa-sepal-length) ;; => 1.3160781315136816
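Since the uncorrected ratio is, by definition, just the quotient of the sample means, it can be confirmed directly:

(/ (stats/mean virginica-sepal-length) (stats/mean setosa-sepal-length))
;; => 1.316020775069916, the same value as means-ratio above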
Ordinal / Non-parametric Family
Effect size measures in the ordinal or non-parametric family are suitable for data that are not normally distributed, have outliers, or are measured on an ordinal scale. They are often based on comparing pairs of observations between two groups.
cliffs-delta
ameasure, wmw-odds
- Cliff’s Delta (cliffs-delta): A non-parametric measure of the amount of separation or overlap between two distributions. It is calculated as the probability that a randomly selected observation from one group is greater than a randomly selected observation from the other group, minus the probability of the reverse: \[\delta = P(X > Y) - P(Y > X)\] It ranges from -1 (complete separation, with values in the second group always greater than values in the first) to +1 (complete separation, with values in the first group always greater than values in the second). A value of 0 indicates complete overlap (stochastic equality). It is a robust alternative to Cohen’s d when assumptions for parametric tests are not met.
- Vargha-Delaney A (ameasure): A non-parametric measure of stochastic superiority. It quantifies the probability that a randomly chosen value from the first sample (group1) is greater than a randomly chosen value from the second sample (group2). \[A = P(X > Y)\] It ranges from 0 to 1. A value of 0.5 indicates stochastic equality (distributions are overlapping). Values > 0.5 mean group1 tends to have larger values; values < 0.5 mean group2 tends to have larger values. It is directly related to Cliff’s Delta: \(A = (\delta + 1)/2\).
- Wilcoxon-Mann-Whitney Odds (wmw-odds): This non-parametric measure quantifies the odds that a randomly chosen observation from the first group (group1) is greater than a randomly chosen observation from the second group (group2). \[\psi = \frac{P(X > Y)}{P(Y > X)} = \frac{1 + \delta}{1 - \delta}\] It ranges from 0 to infinity. A value of 1 indicates stochastic equality. Values > 1 mean group1 values tend to be larger; values < 1 mean group1 values tend to be smaller. The natural logarithm of \(\psi\) is the log-odds that a random observation from group 1 is greater than a random observation from group 2.
These three measures (cliffs-delta, ameasure, wmw-odds) are non-parametric and robust, making them suitable for ordinal data or when assumptions of normality and equal variances are violated. They are inter-related and can be transformed into one another. Cliff’s Delta is centered around 0, A is centered around 0.5, and WMW Odds is centered around 1 (or 0 on a log scale), offering different interpretations of the effect magnitude and direction.
Let’s illustrate these with the sepal lengths from the iris dataset, using ‘virginica’ as group 1 and ‘setosa’ as group 2.
(stats/cliffs-delta virginica-sepal-length setosa-sepal-length) ;; => 0.9692
(stats/ameasure virginica-sepal-length setosa-sepal-length) ;; => 0.9846
(stats/wmw-odds virginica-sepal-length setosa-sepal-length) ;; => 63.93506493506461
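These pairwise definitions translate directly into code. A minimal brute-force sketch (O(n·m), illustrative only; xs and ys are numeric sequences):

;; Cliff's delta over all (x, y) pairs: P(X > Y) - P(Y > X).
(defn cliffs-delta-manual [xs ys]
  (let [cmps (for [x xs, y ys] (compare x y))]
    (/ (- (count (filter pos? cmps))
          (count (filter neg? cmps)))
       (double (count cmps)))))

(cliffs-delta-manual virginica-sepal-length setosa-sepal-length)
;; expected to agree with stats/cliffs-delta above (0.9692)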
Overlap Family
Measures quantifying the degree to which two distributions share common values. These metrics assess how much the data from one group tends to blend or overlap with the data from another group. They provide an alternative perspective on effect size compared to mean differences or ratios, which is particularly useful when the focus is on the proportion of scores that are shared or distinct between groups.
p-overlap
cohens-u1-normal, cohens-u2-normal, cohens-u3-normal
cohens-u1, cohens-u2, cohens-u3
These functions provide different ways to quantify overlap:
- p-overlap: Estimates the proportion of overlap between two distributions based on Kernel Density Estimation (KDE). It calculates the area under the minimum of the two estimated density functions. \[ \text{Overlap} = \int_{-\infty}^{\infty} \min(f_1(x), f_2(x)) \, dx \] This measure is non-parametric, meaning it doesn’t assume the data comes from a specific distribution (like normal), and it is symmetric (p-overlap(A, B) == p-overlap(B, A)). The result is a proportion between 0 (no overlap) and 1 (complete overlap, identical distributions).
(stats/p-overlap virginica-sepal-length setosa-sepal-length) ;; => 0.12090979659086024
(stats/p-overlap virginica-sepal-length setosa-sepal-length {:kde :epanechnikov, :bandwidth 0.7}) ;; => 0.15938870929802704
- Cohen’s U measures (cohens-u1-normal, cohens-u2-normal, cohens-u3-normal): These are measures of non-overlap or overlap derived from Cohen’s d, assuming normal distributions with equal variances. They quantify the proportion of one group’s scores that fall beyond a certain point in relation to the other group.
  - cohens-u1-normal: Quantifies the proportion of non-overlap between the two distributions. Calculated from Cohen’s d (\(d\)) using the standard normal CDF (\(\Phi\)) as \((2\Phi(|d|/2) - 1) / \Phi(|d|/2)\). Range is [0, 1]. 0 means complete overlap, 1 means complete separation (no overlap).
  - cohens-u2-normal: Quantifies the proportion of scores in the lower-scoring group that are below the point located halfway between the means of the two groups. Calculated as \(\Phi(|d|/2)\). Range is [0.5, 1]. 0.5 means complete overlap (the means coincide); values approaching 1 mean the distributions are far apart.
  - cohens-u3-normal: Quantifies the proportion of scores in the lower-scoring group that fall below the mean of the higher-scoring group. Calculated as \(\Phi(d)\). Range is [0, 1]. This measure is asymmetric: U3(A, B) is the proportion of B below A’s mean; U3(B, A) is the proportion of A below B’s mean.
(stats/cohens-u1-normal virginica-sepal-length setosa-sepal-length) ;; => 0.9339603379976781
(stats/cohens-u2-normal virginica-sepal-length setosa-sepal-length) ;; => 0.9380514024419309
(stats/cohens-u3-normal virginica-sepal-length setosa-sepal-length) ;; => 0.9989553620117301
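Because the normal-theory U measures are plain functions of Cohen’s d, they can be checked against their formulas. A sketch for cohens-u2-normal, assuming r is fastmath.random (as used elsewhere in this notebook):

;; cohens-u2-normal from its definition: Phi(|d| / 2).
(let [d (stats/cohens-d virginica-sepal-length setosa-sepal-length)]
  (r/cdf (r/distribution :normal) (/ (m/abs d) 2.0)))
;; expected => 0.9380514024419309, matching cohens-u2-normal above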

- Non-parametric Cohen’s U measures (cohens-u1, cohens-u2, cohens-u3): These are analogous measures that do not assume normality or equal variances. They are based on comparing percentiles or medians of the empirical distributions.
  - cohens-u1: Non-parametric measure of difference/separation, related to [cohens-u2]. Its value indicates greater separation as it increases towards +1, and more overlap as it approaches -1. Symmetric.
  - cohens-u2: Quantifies the minimum overlap between corresponding quantiles. It finds the smallest distance between quantiles \(q\) and \(1-q\) of the two distributions across \(q \in [0.5, 1.0]\). Range is typically [0, max_diff]. It is symmetric.
  - cohens-u3: Quantifies the proportion of values in the second group (group2) that are less than the median of the first group (group1). This is a non-parametric version of the concept behind cohens-u3-normal, but is calculated directly from the empirical distributions’ medians and CDFs. Range is [0, 1]. This measure is also asymmetric.
(stats/cohens-u1 virginica-sepal-length setosa-sepal-length) ;; => 0.9372815512903497
(stats/cohens-u2 virginica-sepal-length setosa-sepal-length) ;; => 0.9409830056250525
(stats/cohens-u3 virginica-sepal-length setosa-sepal-length) ;; => 1.0
Comparison:
- Use p-overlap for a robust, non-parametric, symmetric measure of direct area overlap based on estimated densities.
- Use the cohens-u*-normal functions when you are comfortable assuming normality and equal variances, and want measures directly derived from Cohen’s d. u1, u2, and u3 offer slightly different views of overlap/separation relative to means and standard deviations.
- Use the non-parametric cohens-u* functions when assumptions of normality are not met or when working with ordinal data. u2 is a symmetric overlap measure based on quantiles, while u3 is an asymmetric measure based on comparing to the median.
Correlation / Association Family
Effect size measures in the correlation and association family quantify the strength and magnitude of the relationship between variables. Unlike difference-based measures (like Cohen’s d), these focus on how well one variable predicts or is associated with another, often in terms of shared or explained variance. They are widely used in regression, ANOVA, and general correlation analysis.
pearson-r, r2-determination
eta-sq, omega-sq, epsilon-sq
cohens-f2, cohens-f, cohens-q
rank-eta-sq, rank-epsilon-sq
- Pearson r (pearson-r): The Pearson product-moment correlation coefficient. An alias for [pearson-correlation]. Measures the strength and direction of a linear relationship between two continuous variables (\(r\)). Ranges from -1 to +1. \[ r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \sum(y_i - \bar{y})^2}} \]
- R² Determination (r2-determination): The coefficient of determination. An alias for the standard [r2] calculation between two sequences. For a simple linear relationship, this is the square of the Pearson correlation (\(R^2 = r^2\)). It quantifies the proportion of the variance in one variable that is linearly predictable from the other. Ranges from 0 to 1. \[ R^2 = 1 - \frac{RSS}{TSS} = \frac{SS_{regression}}{SS_{total}} \]
- Eta-squared (eta-sq): A measure of the proportion of the variance in a dependent variable that is associated with an independent variable. In the context of two numerical sequences, this function calculates \(R^2\) from the simple linear regression of group1 on group2. \[ \eta^2 = \frac{SS_{regression}}{SS_{total}} \]
- Omega-squared (omega-sq): A less biased estimate of the population Eta-squared (\(\omega^2\)). It estimates the proportion of variance in the dependent variable accounted for by the independent variable in the population. Often preferred over Eta-squared as a population effect size estimate, especially for small sample sizes. \[ \omega^2 = \frac{SS_{regression} - p \cdot MSE}{SS_{total} + MSE} \] where \(p\) is the number of predictors (1 for simple linear regression) and \(MSE\) is the Mean Squared Error of the residuals.
- Epsilon-squared (epsilon-sq): Another less biased estimate of population Eta-squared (\(\varepsilon^2\)), similar to adjusted \(R^2\). Also aims to estimate the proportion of variance explained in the population. \[ \varepsilon^2 = \frac{SS_{regression} - MSE}{SS_{total}} \]
- Cohen’s f² (cohens-f2): Measures the ratio of the variance explained by the effect to the unexplained variance. It can be calculated based on Eta-squared, Omega-squared, or Epsilon-squared (controlled by the :type parameter). Ranges from 0 upwards. \[ f^2 = \frac{\text{Proportion of Variance Explained}}{\text{Proportion of Variance Unexplained}} = \frac{\eta^2}{1-\eta^2} \]
- Cohen’s f (cohens-f): The square root of Cohen’s f² (\(f = \sqrt{f^2}\)). It quantifies the magnitude of the effect size in standardized units. Ranges from 0 upwards.
Let’s look at the relationship between mpg (Miles Per Gallon) and hp (Horsepower) from the mtcars dataset. We expect a negative relationship (higher HP means lower MPG).
The Pearson r value shows a strong negative linear correlation (about -0.78). Squaring this gives the R² (r2-determination, eta-sq), indicating that about 60% of the variance in MPG can be explained by the linear relationship with HP in this sample. omega-sq and epsilon-sq provide adjusted estimates for the population, which are slightly lower, as expected. Cohen’s f² and f quantify the magnitude of this effect, suggesting a large effect size according to conventional guidelines.
(stats/pearson-r mpg hp) ;; => -0.7761683718265864
(stats/r2-determination mpg hp) ;; => 0.602437341423934
(stats/eta-sq mpg hp) ;; => 0.602437341423934
(stats/omega-sq mpg hp) ;; => 0.5814794357913807
(stats/epsilon-sq mpg hp) ;; => 0.5891852528047318
(stats/cohens-f2 mpg hp) ;; => 1.515326775360793
(stats/cohens-f2 mpg hp :omega) ;; => 1.3893688519007434
(stats/cohens-f2 mpg hp :epsilon-sq) ;; => 1.515326775360793
(stats/cohens-f mpg hp) ;; => 1.2309860987682977
(stats/cohens-f mpg hp :omega) ;; => 1.178714915448491
(stats/cohens-f mpg hp :epsilon-sq) ;; => 1.2309860987682977
- Cohen’s q (cohens-q): Measures the difference between two correlation coefficients after applying the Fisher z-transformation (atanh). Useful for comparing the strength of correlation between different pairs of variables or in different samples. \[ q = \text{atanh}(r_1) - \text{atanh}(r_2) \]
Let’s compare correlations using cohens-q. We’ll compare the correlation between mpg and hp against the correlation between mpg and wt (weight). The second test compares the correlation between setosa sepal widths and lengths against the correlation between virginica sepal widths and lengths.
(def r-mpg-hp (stats/pearson-correlation mpg hp))
(def r-mpg-wt (stats/pearson-correlation mpg wt))

r-mpg-hp ;; => -0.7761683718265864
r-mpg-wt ;; => -0.8676593765172276
(stats/cohens-q r-mpg-hp r-mpg-wt) ;; => 0.2878712810066375
(stats/cohens-q mpg hp wt) ;; => 0.2878712810066375
(stats/cohens-q setosa-sepal-width setosa-sepal-length virginica-sepal-width virginica-sepal-length) ;; => 0.4718354551725414
The cohens-q value quantifies the difference in the strength of these two (dependent) correlations. A larger absolute value of q suggests a more substantial difference between the correlation coefficients.
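The definition is easy to reproduce with the Fisher z-transform; a minimal sketch using fastmath.core’s atanh (aliased as m elsewhere in this notebook):

;; Cohen's q by hand: difference of Fisher z-transformed correlations.
(- (m/atanh r-mpg-hp) (m/atanh r-mpg-wt))
;; expected => 0.2878712810066375, matching cohens-q above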
- Rank Eta-squared (rank-eta-sq): Effect size measure for the Kruskal-Wallis test. It represents the proportion of variation in the dependent variable accounted for by group membership, based on ranks. Ranges from 0 to 1. Calculated based on the Kruskal-Wallis H statistic (\(H\)), number of groups (\(k\)), and total sample size (\(n\)). \[ \eta^2_H = \frac{\max(0, H - (k-1))}{n - k} \]
- Rank Epsilon-squared (rank-epsilon-sq): Another effect size measure for the Kruskal-Wallis test, also based on ranks and providing a less biased estimate than Rank Eta-squared. Ranges from 0 to 1. Calculated based on \(H\) and \(n\). \[ \varepsilon^2_H = \frac{H}{n-1} \]
We can treat the different species (setosa, virginica and versicolor) as groups and compare their sepal lengths using a non-parametric approach conceptually related to Kruskal-Wallis (though the function itself is a generic effect size calculation).
(stats/rank-eta-sq (vals sepal-lengths)) ;; => 0.6458328979635933
(stats/rank-eta-sq (vals sepal-widths)) ;; => 0.41152809592198036
(stats/rank-epsilon-sq (vals sepal-lengths)) ;; => 0.6505868187962968
(stats/rank-epsilon-sq (vals sepal-widths)) ;; => 0.41942704765457123
Comparison:
- pearson-r measures linear relationship direction and strength.
- r2-determination quantifies the proportion of shared variance (\(r^2\)) for a simple linear relationship.
- eta-sq, omega-sq, and epsilon-sq all quantify the proportion of variance explained. eta-sq is a sample-based measure (equal to \(R^2\) here), while omega-sq and epsilon-sq are less biased population estimates, preferred when generalizing beyond the sample.
- cohens-f2 and cohens-f measure effect magnitude relative to unexplained variance. They are useful for power analysis and interpreting the practical significance of a regression or ANOVA effect. cohens-f is often more interpretable as it’s in the same units as standard deviations (though it’s not directly scaled by a single standard deviation like Cohen’s d).
- cohens-q is specifically for comparing correlation coefficients, assessing if the strength of relationship between two variables differs significantly from the strength between another pair or in another context.
- rank-eta-sq and rank-epsilon-sq are specific to rank-based tests like Kruskal-Wallis, providing effect sizes analogous to Eta-squared/Epsilon-squared but suitable for non-parametric data.
Interpretation Guidelines:
- Correlation (r):
- \(|r| < 0.3\): weak/small linear relationship
- \(0.3 \le |r| < 0.7\): moderate/medium linear relationship
- \(|r| \ge 0.7\): strong/large linear relationship
- Proportion of Variance Explained (\(R^2, \eta^2, \omega^2, \varepsilon^2\)):
- \(0.01\): small effect (1% of variance explained)
- \(0.06\): medium effect (6% of variance explained)
- \(0.14\): large effect (14% of variance explained)
- Cohen’s f (and f²):
- \(f = 0.1\) (\(f^2 = 0.01\)): small effect
- \(f = 0.25\) (\(f^2 \approx 0.06\)): medium effect
- \(f = 0.4\) (\(f^2 = 0.16\)): large effect
Statistical Tests
Functions for hypothesis testing.
Normality and Shape Tests
These tests assess whether a dataset deviates significantly from a normal distribution, or exhibits specific characteristics related to skewness (asymmetry) and kurtosis (tailedness). Deviations from normality are important to check as many statistical methods assume normal data.
skewness-test
kurtosis-test
normality-test
jarque-bera-test
bonett-seier-test
fastmath.stats provides several functions for these tests:
- skewness-test: Tests if the sample skewness significantly differs from the zero skewness expected of a normal distribution. It calculates a standardized test statistic (approximately Z) and its p-value. By default, it uses the :g1 type of skewness (Pearson’s moment coefficient).
- kurtosis-test: Tests if the sample kurtosis significantly differs from the value expected of a normal distribution (3 for raw kurtosis, 0 for excess kurtosis). It also calculates a standardized test statistic (approximately Z) and its p-value. By default, it uses the :kurt type of kurtosis (Excess Kurtosis + 3).
- normality-test: The D’Agostino-Pearson K² test. This is an omnibus test that combines the skewness and kurtosis tests into a single statistic (\(K^2 = Z_{skewness}^2 + Z_{kurtosis}^2\)). \(K^2\) follows approximately a Chi-squared distribution with 2 degrees of freedom under the null hypothesis of normality. It tests for overall departure from normality. It internally uses the default :g1 skewness and default :kurt kurtosis types from the individual tests.
- jarque-bera-test: Another omnibus test for normality based on sample skewness (\(S\), specifically type :g1) and excess kurtosis (\(K\), specifically type :g2). The test statistic is \(JB = \frac{n}{6}(S^2 + \frac{1}{4}K^2)\), which also follows approximately a Chi-squared distribution with 2 degrees of freedom under the null hypothesis. Similar to the K² test, it assesses whether the combined deviation in skewness and kurtosis is significant. Note that the specific :g1 and :g2 types are used for the JB statistic formula regardless of any :type option passed to the underlying skewness or kurtosis functions.
- bonett-seier-test: A test for normality based on Geary’s ‘g’ measure of kurtosis, which is more robust to outliers than standard moment-based kurtosis. It calculates a Z-statistic comparing the sample Geary’s ‘g’ to its expected value for a normal distribution (\(\sqrt{2/\pi}\)). This test specifically uses the :geary kurtosis type.
Interpretation: The output of these functions is typically a map containing:
- :stat (or an alias like :Z, :K2, :JB, :F, :chi2): The calculated value of the test statistic.
- :p-value: The probability of observing a test statistic as extreme as, or more extreme than, the one calculated, assuming the null hypothesis (that the data comes from a normal distribution, or has the expected shape property) is true.
- Other keys might include :df (degrees of freedom), :skewness (the value of the type used in the test), :kurtosis (the value of the type used in the test), :n (sample size), and :sides (the alternative hypothesis used).
A small p-value (typically less than the chosen significance level, e.g., 0.05) suggests that the observed data is unlikely to have come from a normal distribution (or satisfy the specific shape property being tested), leading to rejection of the null hypothesis. A large p-value indicates that there is not enough evidence to reject the null hypothesis.
Let’s apply some of these tests to the residual-sugar data, which we observed earlier to be right-skewed. We’ll show the specific skewness/kurtosis values used by each test for clarity.
(stats/skewness residual-sugar :g1) ;; => 1.0767638711454521
(stats/kurtosis residual-sugar :kurt) ;; => 6.465054296604846
(stats/kurtosis residual-sugar :g2) ;; => 3.4650542966048463
(stats/kurtosis residual-sugar :geary) ;; => 0.8336299967214688

(stats/skewness-test residual-sugar)
;; => {:p-value 0.0, :Z 25.47884190247254, :skewness 1.0767638711454521}
(stats/kurtosis-test residual-sugar)
;; => {:p-value 0.0, :Z 20.004147071664516, :kurtosis 6.465054296604846}
(stats/normality-test residual-sugar)
;; => {:p-value 0.0, :Z 1049.3372847559745, :skewness 1.0767638711454521, :kurtosis 6.465054296604846}
(stats/jarque-bera-test residual-sugar)
;; => {:p-value 0.0, :Z 3396.8207586928006, :skewness 1.0767638711454521, :kurtosis 3.4650542966048463}
(stats/bonett-seier-test residual-sugar)
;; => {:p-value 1.2876717402423177E-30, :stat -11.50208472276902, :Z -11.50208472276902, :kurtosis 0.8336299967214688, :n 4898, :sides :two-sided}
As expected for the right-skewed residual-sugar data, the skewness-test, normality-test (K²), and jarque-bera-test yield very small p-values, strongly suggesting that the data is not normally distributed. The kurtosis-test and bonett-seier-test examine tailedness; their p-values indicate whether the deviation from normal kurtosis is significant.
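To tie the output back to the formula, the Jarque-Bera statistic can be assembled by hand from the :g1 skewness and :g2 excess kurtosis shown above. A minimal sketch:

;; JB = (n / 6) * (S^2 + K^2 / 4) with S = :g1 skewness, K = :g2 excess kurtosis.
(let [n (count residual-sugar)
      s (stats/skewness residual-sugar :g1)
      k (stats/kurtosis residual-sugar :g2)]
  (* (/ n 6.0) (+ (* s s) (* 0.25 k k))))
;; expected to be close to the jarque-bera-test statistic above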
Binomial Tests
These tests are designed for data representing counts of successes in a fixed number of trials, where each trial has only two possible outcomes (success or failure). They are used to make inferences about the true underlying proportion of successes in the population.
binomial-test
binomial-ci
- binomial-test: Performs an exact hypothesis test on a binomial proportion. This test is used to determine if the observed number of successes in a given number of trials is significantly different from what would be expected under a specific hypothesized probability of success (\(p_0\)). The null hypothesis is typically \(H_0: p = p_0\). The test calculates a p-value based on the binomial probability distribution, assessing the likelihood of observing the sample results (or more extreme) if \(p_0\) were the true population proportion.
Let’s illustrate with some examples. Imagine we check the am column of the mtcars dataset, where 0 represents automatic and 1 represents manual transmission. We want to test if the proportion of cars with manual transmission is significantly different from 0.5 (an equal split). We also want to estimate a confidence interval for this proportion.
Let’s get the counts:
(def mtcars-am (ds/mtcars :am))
(def manual-count (count (filter m/one? mtcars-am)))
(def total-count (count mtcars-am))

manual-count ;; => 13
total-count ;; => 32
(stats/mean mtcars-am) ;; => 0.40625
Now, let’s perform the binomial test:
(stats/binomial-test manual-count total-count)
{:alpha 0.05,
 :ci-method :asymptotic,
 :confidence-interval [0.4008057508099545 0.4116942491900455],
 :estimate 0.40625,
 :level 0.95,
 :p 0.5,
 :p-value 0.3770855874754484,
 :stat 13,
 :successes 13,
 :test-type :two-sided,
 :trials 32}
(stats/binomial-test manual-count total-count {:p 0.5, :sides :one-sided-greater})

{:alpha 0.05,
 :ci-method :asymptotic,
 :confidence-interval [0.39533998822335753 1.0],
 :estimate 0.40625,
 :level 0.95,
 :p 0.5,
 :p-value 0.8923364251386376,
 :stat 13,
 :successes 13,
 :test-type :one-sided-greater,
 :trials 32}
(stats/binomial-test manual-count total-count {:p 0.8, :sides :one-sided-less, :alpha 0.01})

{:alpha 0.01,
 :ci-method :asymptotic,
 :confidence-interval [0.0 0.40842650129634167],
 :estimate 0.40625,
 :level 0.99,
 :p 0.8,
 :p-value 1.1904334239476455E-6,
 :stat 13,
 :successes 13,
 :test-type :one-sided-less,
 :trials 32}
The output map from binomial-test contains:
- :p-value: The probability of observing the sample result or more extreme outcomes if the true population proportion were equal to :p. A small p-value (typically < 0.05) suggests evidence against the null hypothesis.
- :stat: The observed number of successes.
- :estimate: The observed proportion of successes (:successes / :trials).
- :confidence-interval: A confidence interval for the true population proportion, based on the observed data and using the specified :ci-method and :alpha.
Binomial Confidence Intervals
- binomial-ci: Calculates a confidence interval for a binomial proportion. Given the observed number of successes and trials, this function estimates a range of values that is likely to contain the true population probability of success (\(p\)) with a specified level of confidence.
The binomial-ci function offers various methods for calculating the confidence interval for a binomial proportion, controlled by the optional method keyword. These methods differ in their underlying assumptions and formulas, leading to intervals that can vary in width and coverage properties, particularly with small sample sizes or proportions close to 0 or 1.

Available method values:
- :asymptotic: Normal Approximation (Wald) Interval. Based on the Central Limit Theorem, using the sample proportion and its estimated standard error. Simple to calculate but can have poor coverage (too narrow) for small sample sizes or proportions near 0 or 1.
- :agresti-coull: Agresti-Coull Interval. An adjusted Wald interval that adds ‘pseudo-counts’ (typically 2 successes and 2 failures) to the observed counts. This adjustment improves coverage and performance, especially for small samples.
- :clopper-pearson: Clopper-Pearson Interval. An ‘exact’ method based on inverting binomial tests. It provides guaranteed minimum coverage probability (i.e., the true proportion is included in the interval at least 100 * (1-alpha)% of the time). However, it is often wider than necessary and can be overly conservative.
- :wilson: Wilson Score Interval. Derived from the score test, which is based on the null hypothesis standard error rather than the observed standard error. It performs well across different sample sizes and probabilities and is generally recommended over the Wald interval.
- :prop.test: Continuity-Corrected Wald Interval. Applies a continuity correction to the Wald interval, similar to what is often used with the Chi-squared test for 2x2 tables.
- :cloglog: Complementary Log-log Transformation Interval. Calculates the interval on the clog-log scale and then transforms it back to the probability scale. Can be useful when dealing with skewed data or probabilities close to 0.
- :logit: Logit Transformation Interval. Calculates the interval on the logit scale (log(p/(1-p))) and transforms it back. Also useful for handling probabilities near boundaries.
- :probit: Probit Transformation Interval. Calculates the interval on the probit scale (inverse of the standard normal CDF) and transforms it back.
- :arcsine: Arcsine Transformation Interval. Calculates the interval using the arcsine square root transformation (asin(sqrt(p))) and transforms it back.
- :all: Calculates and returns a map of intervals from all the above methods.
(stats/binomial-ci manual-count total-count) ;; => [0.23608446629942265 0.5764155337005774 0.40625]
(stats/binomial-ci manual-count total-count :asymptotic 0.01) ;; => [0.18261458036097775 0.6298854196390222 0.40625]
(stats/binomial-ci manual-count total-count :asymptotic 0.1) ;; => [0.26344258236512375 0.5490574176348763 0.40625]
(stats/binomial-ci manual-count total-count :agresti-coull) ;; => [0.25491682311999536 0.5776792765528302 0.40625]
(stats/binomial-ci manual-count total-count :clopper-pearson) ;; => [0.23698410097201839 0.5935507534231756 0.40625]
(stats/binomial-ci manual-count total-count :wilson) ;; => [0.25519634842665434 0.5773997512461713 0.40625]
(stats/binomial-ci manual-count total-count :prop.test) ;; => [0.24219139863150996 0.5921456974454622 0.40625]
(stats/binomial-ci manual-count total-count :cloglog) ;; => [0.238336877315159 0.5678979157795636 0.40625]
(stats/binomial-ci manual-count total-count :logit) ;; => [0.25256981246106547 0.5807794598608678 0.40625]
(stats/binomial-ci manual-count total-count :probit) ;; => [0.2495476615316991 0.5798499631463624 0.40625]
(stats/binomial-ci manual-count total-count :arcsine) ;; => [0.2450397617966424 0.578602376146695 0.40625]
For methods other than :all, the function returns a vector [lower-bound, upper-bound, estimated-p].
- lower-bound, upper-bound: The ends of the calculated confidence interval. We are 100 * (1 - alpha)% confident that the true population proportion lies within this range.
- estimated-p: The observed proportion of successes in the sample.
If the method is :all, a map is returned where keys are the method keywords and values are the [lower, upper, estimate] vectors for each method. Note that different methods can produce different interval widths and positions.
(stats/binomial-ci manual-count total-count :all)

{:agresti-coull [0.25491682311999536 0.5776792765528302 0.40625],
 :arcsine [0.2450397617966424 0.578602376146695 0.40625],
 :asymptotic [0.23608446629942265 0.5764155337005774 0.40625],
 :cloglog [0.238336877315159 0.5678979157795636 0.40625],
 :clopper-pearson [0.23698410097201839 0.5935507534231756 0.40625],
 :logit [0.25256981246106547 0.5807794598608678 0.40625],
 :probit [0.2495476615316991 0.5798499631463624 0.40625],
 :prop.test [0.24219139863150996 0.5921456974454622 0.40625],
 :wilson [0.25519634842665434 0.5773997512461713 0.40625]}
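The :asymptotic (Wald) interval is simple enough to rebuild from its formula, \(\hat{p} \pm z_{1-\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}\). A minimal sketch, assuming r is fastmath.random (used elsewhere in this notebook) for the normal quantile:

;; Wald interval by hand: p-hat +/- z * sqrt(p-hat * (1 - p-hat) / n).
(let [p-hat (/ manual-count (double total-count))
      se    (m/sqrt (/ (* p-hat (- 1.0 p-hat)) total-count))
      z     (r/icdf (r/distribution :normal) 0.975)] ;; two-sided, alpha = 0.05
  [(- p-hat (* z se)) (+ p-hat (* z se)) p-hat])
;; expected to match the :asymptotic entry above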
Location Tests (T/Z Tests)
Location tests, specifically t-tests and z-tests, are fundamental statistical tools used to compare means. They help determine if the mean of a sample is significantly different from a known or hypothesized value (one-sample tests), or if the means of two different samples are significantly different from each other (two-sample tests). The choice between a t-test and a z-test typically depends on whether the population standard deviation is known and the sample size.
t-test-one-sample, z-test-one-sample
t-test-two-samples, z-test-two-samples
p-value
fastmath.stats provides the following functions for location tests:
- t-test-one-sample: Performs a one-sample Student’s t-test. This tests the null hypothesis that the true population mean (\(\mu\)) is equal to a hypothesized value (\(\mu_0\)). It is typically used when the population standard deviation is unknown and estimated from the sample. The test statistic follows a t-distribution with \(n-1\) degrees of freedom, where \(n\) is the sample size. \[ t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \]
- z-test-one-sample: Performs a one-sample Z-test. This tests the null hypothesis that the true population mean (\(\mu\)) is equal to a hypothesized value (\(\mu_0\)). It is typically used when the population standard deviation is known or when the sample size is large (generally \(n > 30\)), allowing the sample standard deviation to be used as a reliable estimate for the population standard deviation; the test statistic approximates a standard normal distribution. \[ z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \quad \text{or} \quad z \approx \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \]
- t-test-two-samples: Performs a two-sample t-test to compare the means of two samples. It can perform:
  - Unpaired tests: For independent samples. By default, it performs Welch’s t-test (:equal-variances? false), which does not assume equal population variances. It can also perform Student’s t-test (:equal-variances? true), assuming equal variances and using a pooled standard deviation.
  - Paired tests: For dependent samples (:paired? true), essentially performing a one-sample t-test on the differences between pairs.
  The null hypothesis is typically that the true difference between population means is zero, or equal to a hypothesized value (\(\mu_0\)). The test statistic follows a t-distribution with degrees of freedom calculated appropriately for the specific variant (pooled for Student’s, Satterthwaite approximation for Welch’s, \(n-1\) for paired).
- z-test-two-samples: Performs a two-sample Z-test to compare the means of two independent or paired samples. Similar to the one-sample Z-test, it is used when population variances are known or samples are large. It can handle independent samples (with or without assuming equal variances, affecting the standard error calculation) or paired samples (by performing a one-sample Z-test on differences). The test statistic approximates a standard normal distribution.
Comparison:
- T vs Z: Choose Z-tests primarily when population standard deviations are known or with large samples (often \(n > 30\) for each group in two-sample tests), as the sampling distribution of the mean is well-approximated by the normal distribution. Use T-tests when population standard deviations are unknown and estimated from samples, especially with small to moderate sample sizes, as the sampling distribution is better described by the t-distribution.
- One-Sample vs Two-Sample: Use one-sample tests to compare a single sample mean against a known or hypothesized constant. Use two-sample tests to compare the means of two distinct samples.
- Paired vs Unpaired (Two-Sample): Use paired tests when the two samples consist of paired observations (e.g., measurements before and after an intervention on the same subjects). Use unpaired tests for independent samples (e.g., comparing a treatment group to a control group with different subjects in each).
- Welch’s vs Student’s (Unpaired Two-Sample T-test): Welch’s test is generally recommended as it does not assume equal variances, a common violation in practice. Student’s t-test requires the assumption of equal population variances.
All these functions return a map containing key results, typically including the test statistic (:t or :z), :p-value, :confidence-interval for the mean (or mean difference), :estimate (sample mean or mean difference), sample size(s) (:n, :nx, :ny), :mu (hypothesized value), :stderr (standard error of the estimate), :alpha (significance level), and :sides (alternative hypothesis type). Two-sample tests also report :paired? and, if unpaired, :equal-variances?.
Arguments:
- xs (sequence of numbers): The sample data for one-sample tests or the first sample for two-sample tests.
- ys (sequence of numbers): The second sample for two-sample tests.
- params (map, optional): A map of options. Common keys:
  - :alpha (double, default 0.05): Significance level. Confidence level is \(1 - \alpha\).
  - :sides (keyword, default :two-sided): Alternative hypothesis. Can be :two-sided, :one-sided-greater, or :one-sided-less.
  - :mu (double, default 0.0): Hypothesized population mean (one-sample) or hypothesized difference in population means (two-sample).
  - :paired? (boolean, default false): For two-sample tests, specifies if samples are paired.
  - :equal-variances? (boolean, default false): For unpaired two-sample tests, specifies if equal population variances are assumed (Student’s vs Welch’s).
Return Value (Map):
- :stat (double): The calculated test statistic (alias :t or :z).
- :p-value (double): The probability of observing the test statistic or more extreme values under the null hypothesis.
- :confidence-interval (vector of doubles): [lower-bound, upper-bound]. A range likely containing the true population mean (one-sample) or mean difference (two-sample) with \(1-\alpha\) confidence. Includes the estimate as a third value (e.g., [lower upper estimate]).
- :estimate (double): The sample mean (one-sample) or the difference between sample means (two-sample).
- :n (long or vector of longs): Sample size (one-sample) or sample sizes [nx ny] (two-sample).
- :nx, :ny (long): Sample sizes of xs and ys respectively (two-sample unpaired).
- :estimated-mu (vector of doubles): Sample means [mean xs, mean ys] (two-sample unpaired).
- :mu (double): The hypothesized value used in the null hypothesis.
- :stderr (double): The standard error of the sample mean or mean difference.
- :alpha (double): The significance level used.
- :sides / :test-type (keyword): The alternative hypothesis side.
- :df (long or vector of longs): Degrees of freedom (t-tests only).
- :paired? (boolean): Indicates if a paired test was performed (two-sample only).
- :equal-variances? (boolean): Indicates if equal variances were assumed (two-sample unpaired only).
Helper function p-value:

The p-value function is a general utility used internally by many statistical tests (including t-tests and z-tests) to calculate the p-value from a given test statistic, its null distribution, and the specified alternative hypothesis (sides). It determines the probability of observing a statistic value as extreme as or more extreme than the one obtained from the sample, assuming the null hypothesis is true. The interpretation of ‘extreme’ depends on whether a two-sided, one-sided-greater, or one-sided-less test is performed. For discrete distributions, a continuity correction is applied.
Let’s apply these tests. First, one-sample tests on mpg data against a hypothesized mean of 20.0:
(stats/t-test-one-sample mpg {:mu 20.0})

{:alpha 0.05,
 :confidence-interval [17.917678508746246 22.263571491253753],
 :df 31,
 :estimate 20.090625,
 :level 0.95,
 :mu 20.0,
 :n 32,
 :p-value 0.9327606409093825,
 :stat 0.08506003568133355,
 :stderr 1.0654239593728148,
 :t 0.08506003568133355,
 :test-type :two-sided}
(stats/z-test-one-sample mpg {:mu 20.0, :sides :one-sided-greater})

{:alpha 0.05,
 :confidence-interval [18.338158536184626 ##Inf],
 :estimate 20.090625,
 :level 0.95,
 :mu 20.0,
 :n 32,
 :p-value 0.4661068310107286,
 :stat 0.08506003568133355,
 :stderr 1.0654239593728148,
 :test-type :one-sided-greater,
 :z 0.08506003568133355}
Now, two-sample tests comparing setosa-sepal-length and virginica-sepal-length. We’ll use unpaired tests first, comparing Welch’s and Student’s:
(stats/t-test-two-samples setosa-sepal-length virginica-sepal-length)
{:alpha 0.05,
 :confidence-interval [-1.7867603323573877 -1.3772396676426117],
 :df 76.51586702413655,
 :equal-variances? false,
 :estimate -1.5819999999999999,
 :estimated-mu [5.006 6.588],
 :level 0.95,
 :mu 0.0,
 :n [50 50],
 :nx 50,
 :ny 50,
 :p-value 3.966867270985708E-25,
 :paired? false,
 :sides :two-sided,
 :stat -15.386195820079422,
 :stderr 0.10281943753344443,
 :t -15.386195820079422,
 :test-type :two-sided}
(stats/t-test-two-samples setosa-sepal-length virginica-sepal-length {:equal-variances? true})

{:alpha 0.05,
 :confidence-interval [-1.786041827461154 -1.377958172538846],
 :df 98.0,
 :equal-variances? true,
 :estimate -1.5819999999999999,
 :estimated-mu [5.006 6.588],
 :level 0.95,
 :mu 0.0,
 :n [50 50],
 :nx 50,
 :ny 50,
 :p-value 6.892546060673613E-28,
 :paired? false,
 :sides :two-sided,
 :stat -15.386195820079422,
 :stderr 0.10281943753344443,
 :t -15.386195820079422,
 :test-type :two-sided}
And the corresponding Z-tests (note the Z-test doesn’t need degrees of freedom):
(stats/z-test-two-samples setosa-sepal-length virginica-sepal-length)
{:alpha 0.05,
 :confidence-interval [-1.7835223944762166 -1.380477605523783],
 :equal-variances? false,
 :estimate -1.5819999999999999,
 :estimated-mu [5.006 6.588],
 :level 0.95,
 :mu 0.0,
 :n [50 50],
 :nx 50,
 :ny 50,
 :p-value 2.0259860303505476E-53,
 :paired? false,
 :sides :two-sided,
 :stat -15.386195820079422,
 :stderr 0.10281943753344443,
 :test-type :two-sided,
 :z -15.386195820079422}
(stats/z-test-two-samples setosa-sepal-length virginica-sepal-length {:equal-variances? true})

{:alpha 0.05,
 :confidence-interval [-1.7835223944762166 -1.380477605523783],
 :equal-variances? true,
 :estimate -1.5819999999999999,
 :estimated-mu [5.006 6.588],
 :level 0.95,
 :mu 0.0,
 :n [50 50],
 :nx 50,
 :ny 50,
 :p-value 2.0259860303505476E-53,
 :paired? false,
 :sides :two-sided,
 :stat -15.386195820079422,
 :stderr 0.10281943753344443,
 :test-type :two-sided,
 :z -15.386195820079422}
If we had paired data (e.g., hp measurements before and after a modification for the same cars, stored in hp-before and hp-after), we would use the paired option:
Assume hp-before and hp-after are sequences of the same length.

(def hp-before [100 120 150])
(def hp-after [110 125 165])
(stats/t-test-two-samples hp-before hp-after {:paired? true})

{:alpha 0.05,
 :confidence-interval [-22.420688558603995 2.4206885586039952],
 :df 2,
 :estimate -10.0,
 :level 0.95,
 :mu 0.0,
 :n 3,
 :p-value 0.07417990022744848,
 :paired? true,
 :stat -3.4641016151377544,
 :stderr 2.886751345948129,
 :t -3.4641016151377544,
 :test-type :two-sided}
(stats/z-test-two-samples hp-before hp-after {:paired? true})

{:alpha 0.05,
 :confidence-interval [-15.65792867038086 -4.3420713296191416],
 :estimate -10.0,
 :level 0.95,
 :mu 0.0,
 :n 3,
 :p-value 5.320055051392497E-4,
 :paired? true,
 :stat -3.4641016151377544,
 :stderr 2.886751345948129,
 :test-type :two-sided,
 :z -3.4641016151377544}
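Since a paired test is just a one-sample test on the pairwise differences, the paired result above can be reproduced directly:

;; Paired t-test as a one-sample t-test on the differences.
(stats/t-test-one-sample (map - hp-before hp-after))
;; expected: the same :t (-3.4641...), :df (2), and :p-value as the paired test above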
Variance Tests
Variance tests are used to assess whether the variances of two or more independent samples are statistically different. The null hypothesis for these tests is typically that the population variances are equal (homogeneity of variances, or homoscedasticity). This assumption is important for many statistical procedures, such as ANOVA and Student’s t-test.
f-test
levene-test
brown-forsythe-test
fligner-killeen-test
- f-test: Performs an F-test for the equality of variances of two independent samples. It calculates the ratio of the sample variances (\(s_1^2 / s_2^2\)), which follows an F-distribution under the null hypothesis of equal population variances, assuming the data in both groups are normally distributed. \[ F = \frac{s_1^2}{s_2^2} \]
  Parameters:
  - xs (seq of numbers): The first sample.
  - ys (seq of numbers): The second sample.
  - params (map, optional): Options map with :sides (default :two-sided) and :alpha (default 0.05).

  Returns a map with keys like :F (the F-statistic), :p-value, :df ([numerator-df, denominator-df]), :estimate (the variance ratio), :n ([nx, ny]), and :sides.

  Assumptions: Independent samples, normality within each sample. Sensitive to departures from normality.
- levene-test: Performs Levene’s test for homogeneity of variances across two or more independent groups. This test is more robust to departures from normality than the F-test. It performs a one-way ANOVA on the absolute deviations of each observation from its group’s mean. \[ W = \frac{N-k}{k-1} \frac{\sum_{i=1}^k n_i (\bar{Z}_{i\cdot} - \bar{Z}_{\cdot\cdot})^2}{\sum_{i=1}^k \sum_{j=1}^{n_i} (Z_{ij} - \bar{Z}_{i\cdot})^2} \] where \(Z_{ij} = |X_{ij} - \bar{X}_{i\cdot}|\) are the absolute deviations from the group means, \(N\) is total sample size, \(k\) is number of groups, \(n_i\) is size of group \(i\), \(\bar{Z}_{i\cdot}\) is mean of absolute deviations in group \(i\), and \(\bar{Z}_{\cdot\cdot}\) is the grand mean of absolute deviations.
  Parameters:
  - xss (sequence of sequences): Collection of groups.
  - params (map, optional): Options map with :sides (default :one-sided-greater), :statistic (default [mean]), and :scorediff (default [abs]).

  Returns a map with keys like :W (the F-statistic for the ANOVA on deviations), :p-value, :df ([DFt, DFe]), :n (group sizes), and standard ANOVA output keys.

  Assumptions: Independent samples. Less sensitive to non-normality than F-test.
- brown-forsythe-test: Performs the Brown-Forsythe test, a modification of Levene’s test using the median as the center. This version is even more robust to non-normality than the standard Levene’s test. It performs an ANOVA on the absolute deviations from group medians. \[ F^* = \frac{N-k}{k-1} \frac{\sum_{i=1}^k n_i (\tilde{Z}_{i\cdot} - \tilde{Z}_{\cdot\cdot})^2}{\sum_{i=1}^k \sum_{j=1}^{n_i} (Z_{ij} - \tilde{Z}_{i\cdot})^2} \] where \(Z_{ij} = |X_{ij} - \tilde{X}_{i\cdot}|\) are the absolute deviations from the group medians, \(\tilde{Z}_{i\cdot}\) is the median of absolute deviations in group \(i\), and \(\tilde{Z}_{\cdot\cdot}\) is the grand median of absolute deviations.
  Parameters:
  - xss (sequence of sequences): Collection of groups.
  - params (map, optional): Options map with :sides (default :one-sided-greater) and :scorediff (default [abs]). Internally sets :statistic to [median] in [levene-test].

  Returns a map similar to levene-test, with an F-statistic derived from the ANOVA on median deviations.

  Assumptions: Independent samples. Most robust to non-normality among parametric-based tests.
- fligner-killeen-test: Performs the Fligner-Killeen test for homogeneity of variances across two or more independent groups. This is a non-parametric test based on ranks of the absolute deviations from group medians. It is generally considered one of the most robust tests for homogeneity of variances when assumptions about the underlying distribution are questionable. The test statistic is based on the ranks of \(|X_{ij} - \tilde{X}_{i\cdot}|\), where \(\tilde{X}_{i\cdot}\) is the median of group \(i\).
  Parameters:
  - xss (sequence of sequences): Collection of groups.
  - params (map, optional): Options map with :sides (default :one-sided-greater).

  Returns a map with keys like :chi2 (the Chi-squared statistic), :p-value, :df (number of groups - 1), :n (group sizes), and standard ANOVA output keys (calculated on transformed ranks).

  Assumptions: Independent samples. Non-parametric, robust to non-normality. Requires distributions to have similar shape for valid inference.
Comparison:
- The F-test is simple but highly sensitive to non-normality. Use cautiously if data isn’t clearly normal.
- Levene’s and Brown-Forsythe tests are more robust to non-normality. Brown-Forsythe (median-based) is generally preferred over standard Levene’s (mean-based) when distributions are potentially heavy-tailed or skewed.
- Fligner-Killeen is a non-parametric alternative and the most robust to non-normality, based on ranks.

The choice depends on sample size, suspected distribution shape, and the need for robustness.
Function-specific keys:
- f-test additionally returns:
  - :F: The calculated F-statistic (ratio of sample variances, Var(xs) / Var(ys)). Alias for :stat.
  - :nx, :ny: Sample sizes of the first (xs) and second (ys) groups.
  - :estimate: The ratio of the sample variances (same value as :F).
  - :confidence-interval: A confidence interval for the true ratio of population variances (Var(xs) / Var(ys)).
- levene-test and brown-forsythe-test additionally return:
  - :W: The calculated test statistic (which is an F-statistic derived from the ANOVA on deviations). Alias for :stat.
  - :F: Alias for :W (as it’s an F-statistic).
  - :SSt: Sum of Squares Treatment (between groups of deviations).
  - :SSe: Sum of Squares Error (within groups of deviations).
  - :DFt: Degrees of Freedom Treatment (number of groups - 1).
  - :DFe: Degrees of Freedom Error (total sample size - number of groups).
  - :MSt: Mean Square Treatment (SSt / DFt).
  - :MSe: Mean Square Error (SSe / DFe).
- fligner-killeen-test additionally returns:
  - :chi2: The calculated Chi-squared statistic. Alias for :stat.
  - :SSt: Sum of Squares Treatment, calculated from transformed ranks.
  - :SSe: Sum of Squares Error, calculated from transformed ranks.
  - :DFt: Degrees of Freedom Treatment (number of groups - 1).
  - :DFe: Degrees of Freedom Error (total sample size - number of groups).
  - :MSt: Mean Square Treatment (SSt / DFt).
  - :MSe: Mean Square Error (SSe / DFe).
Let’s apply these tests to compare variances of sepal lengths for ‘setosa’ and ‘virginica’ species, and then across all three species (including ‘versicolor’ from the original sepal-lengths map).
(stats/f-test setosa-sepal-length virginica-sepal-length)
{:F 0.30728619882096414,
 :confidence-interval [0.1743775950172501 0.5414962165868265],
 :df [49 49],
 :estimate 0.30728619882096414,
 :n [50 50],
 :nx 50,
 :ny 50,
 :p-value 6.366473152645907E-5,
 :sides :two-sided,
 :stat 0.30728619882096414,
 :test-type :two-sided}
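The :F statistic is simply the ratio of the two unbiased sample variances, which can be confirmed directly:

(/ (stats/variance setosa-sepal-length) (stats/variance virginica-sepal-length))
;; expected => 0.30728619882096414, the :F / :estimate value above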
(stats/levene-test (vals sepal-lengths))

{:DFe 147,
 :DFt 2,
 :MSe 0.09376069877551022,
 :MSt 0.6920563200000005,
 :SSe 13.782822720000002,
 :SSt 1.384112640000001,
 :W 7.381091747801285,
 :df [2 147],
 :n (50 50 50),
 :p-value 8.817887814641656E-4,
 :stat 7.381091747801285}
(stats/brown-forsythe-test (vals sepal-lengths))

{:DFe 147,
 :DFt 2,
 :MSe 0.10096462585034013,
 :MSt 0.6414000000000003,
 :SSe 14.8418,
 :SSt 1.2828000000000006,
 :W 6.352720020482694,
 :df [2 147],
 :n (50 50 50),
 :p-value 0.0022585277836218998,
 :stat 6.352720020482694}
(stats/fligner-killeen-test (vals sepal-lengths))

{:DFe 147,
 :DFt 2,
 :MSe 0.3194726743934957,
 :MSt 1.9857373674434182,
 :SSe 46.96248313584387,
 :SSt 3.9714747348868364,
 :chi2 11.617980621101287,
 :df 2,
 :n (50 50 50),
 :p-value 0.0030004580742586384,
 :stat 11.617980621101287}
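As described above, Brown-Forsythe is an ANOVA on absolute deviations from group medians, so it can be reproduced with the namespace’s one-way ANOVA. A sketch under that assumption (stats/one-way-anova-test, covered with the other ANOVA tools in this notebook):

;; One-way ANOVA on |x - median(group)| reproduces the Brown-Forsythe statistic.
(let [abs-devs (fn [xs]
                 (let [md (stats/median xs)]
                   (map #(m/abs (- % md)) xs)))]
  (stats/one-way-anova-test (map abs-devs (vals sepal-lengths))))
;; expected: an F statistic matching :W above (6.352720020482694)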
We can also compare sepal widths across species using these tests.
(stats/f-test (sepal-widths :setosa) (sepal-widths :virginica))

{:F 1.3959028295592801,
 :confidence-interval [0.7921415905767488 2.459843962499585],
 :df [49 49],
 :estimate 1.3959028295592801,
 :n [50 50],
 :nx 50,
 :ny 50,
 :p-value 0.24654059117021854,
 :sides :two-sided,
 :stat 1.3959028295592801,
 :test-type :two-sided}
(stats/levene-test (vals sepal-widths))

{:DFe 147,
 :DFt 2,
 :MSe 0.04547069387755103,
 :MSt 0.029199786666666727,
 :SSe 6.684192000000001,
 :SSt 0.058399573333333454,
 :W 0.642167166951519,
 :df [2 147],
 :n (50 50 50),
 :p-value 0.5276204473923196,
 :stat 0.642167166951519}
(stats/brown-forsythe-test (vals sepal-widths))

{:DFe 147,
 :DFt 2,
 :MSe 0.04818367346938773,
 :MSt 0.031199999999999922,
 :SSe 7.082999999999997,
 :SSt 0.062399999999999844,
 :W 0.6475222363405323,
 :df [2 147],
 :n (50 50 50),
 :p-value 0.5248269975064549,
 :stat 0.6475222363405323}
(stats/fligner-killeen-test (vals sepal-widths))

{:DFe 147,
 :DFt 2,
 :MSe 0.3425246920779396,
 :MSt 0.17247526965670978,
 :SSe 50.35112973545712,
 :SSt 0.34495053931341957,
 :chi2 1.0138383496145384,
 :df 2,
 :n (50 50 50),
 :p-value 0.6023484534454747,
 :stat 1.0138383496145384}
Goodness-of-Fit and Independence Tests
These tests are used to evaluate how well an observed distribution of data matches a hypothesized theoretical distribution (Goodness-of-Fit) or whether there is a statistically significant association between two or more categorical variables (Independence). They are commonly applied to categorical data or data that has been grouped into categories (e.g., histograms).
power-divergence-test
chisq-test
multinomial-likelihood-ratio-test
minimum-discrimination-information-test
neyman-modified-chisq-test
freeman-tukey-test
cressie-read-test
ad-test-one-sample
ks-test-one-sample, ks-test-two-samples
Power Divergence Tests:
The power-divergence-test is a generalized framework for several statistical tests that compare observed frequencies to expected frequencies. The specific test performed is determined by the lambda parameter.
The general test statistic is:
\[ 2 \sum_i O_i \left( \frac{\left(\frac{O_i}{E_i}\right)^\lambda - 1}{\lambda(\lambda+1)} \right) \]
where \(O_i\) are the observed counts, \(E_i\) are the expected counts, and \(\lambda\) is the power parameter.
This test can be used for two main purposes:
- Goodness-of-Fit: Testing if a sequence of observed counts matches a set of expected counts (proportions/weights) or if a dataset matches a theoretical distribution.
- Independence: Testing if there is an association between categorical variables in a contingency table.
The test statistic approximately follows a Chi-squared distribution with degrees of freedom determined by the specific application (e.g., number of categories - 1 - number of estimated parameters for GOF, \((rows-1)(cols-1)\) for independence).
fastmath.stats provides the general power-divergence-test and aliases for common lambda values:
- chisq-test: Pearson’s Chi-squared test (\(\lambda=1\)). The most common test in this family. \[ \chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} \]
- multinomial-likelihood-ratio-test: G-test (\(\lambda=0\)). Based on the ratio of likelihoods under the null and alternative hypotheses. \[ G = 2 \sum_i O_i \ln\left(\frac{O_i}{E_i}\right) \]
- minimum-discrimination-information-test: Minimum Discrimination Information test (\(\lambda=-1\)). Also known as the G-test on expected counts vs observed. \[ I = 2 \sum_i E_i \ln\left(\frac{E_i}{O_i}\right) \]
- neyman-modified-chisq-test: Neyman Modified Chi-squared test (\(\lambda=-2\)). \[ NM = \sum_i \frac{(O_i - E_i)^2}{O_i} \]
- freeman-tukey-test: Freeman-Tukey test (\(\lambda=-0.5\)). \[ FT = 4 \sum_i (\sqrt{O_i} - \sqrt{E_i})^2 \]
- cressie-read-test: Cressie-Read test (\(\lambda=2/3\), default for power-divergence-test). A compromise test.
Arguments:
- contingency-table-or-xs: For independence tests, a contingency table (sequence of sequences or map). For goodness-of-fit with counts, a sequence of observed counts. For goodness-of-fit with data, a sequence of raw data.
- params (map, optional): Options including:
  - :lambda (double): Power parameter (default 2/3).
  - :p: For GOF, expected probabilities/weights (seq) or a fastmath.random distribution. Ignored for independence tests.
  - :sides (keyword, default :one-sided-greater): For p-value calculation.
  - :alpha (double, default 0.05): For confidence intervals.
  - :ci-sides (keyword, default :two-sided): For confidence intervals.
  - :bootstrap-samples (long, default 1000): For bootstrap CI.
  - :ddof (long, default 0): Delta degrees of freedom subtracted from the calculated DF.
  - :bins: For GOF with data vs distribution, histogram bins (see [histogram]).
Returns: A map with keys:
- :stat (or :chi2): The calculated test statistic.
- :df: Degrees of freedom.
- :p-value: P-value.
- :n: Total number of observations.
- :estimate: Observed proportions.
- :expected: Expected counts or proportions.
- :confidence-interval: Bootstrap confidence intervals for observed proportions.
- :lambda, :alpha, :sides, :ci-sides: Input parameters.
Examples:
Goodness-of-Fit test comparing observed counts to expected proportions (e.g., testing if a six-sided die is fair based on 60 rolls).
(def observed-rolls [10 12 8 11 9 10])
(def expected-proportions [1 1 1 1 1 1])
Weights proportional to the expected counts can be used; they are normalized internally.
(stats/chisq-test observed-rolls {:p expected-proportions})
{:alpha 0.05,
 :chi2 1.0000000000000009,
 :ci-sides :two-sided,
 :confidence-interval ([0.08333333333333333 0.26666666666666666]
                       [0.11666666666666667 0.3]
                       [0.05 0.21666666666666667]
                       [0.08333333333333333 0.2833333333333333]
                       [0.06666666666666667 0.25]
                       [0.08333333333333333 0.26666666666666666]),
 :df 5,
 :estimate (0.16666666666666666 0.2 0.13333333333333333 0.18333333333333332 0.15 0.16666666666666666),
 :expected (10.0 10.0 10.0 10.0 10.0 10.0),
 :lambda 1.0,
 :level 0.95,
 :n 60.0,
 :p (0.16666666666666666 0.16666666666666666 0.16666666666666666 0.16666666666666666 0.16666666666666666 0.16666666666666666),
 :p-value 0.9625657732472963,
 :sides :one-sided-greater,
 :stat 1.0000000000000009,
 :test-type :one-sided-greater}
Goodness-of-Fit test comparing sample data (mpg
) to a theoretical distribution (Normal). This implicitly creates a histogram from mpg
and compares its counts to the counts expected from a Normal distribution within those bins.
(stats/chisq-test mpg {:p (r/distribution :normal {:mu (stats/mean mpg), :sd (stats/stddev mpg)})})
{:alpha 0.05,
 :chi2 5.1376170090363775,
 :ci-sides :two-sided,
 :confidence-interval ([0.0625 0.3125]
                       [0.21875 0.53125]
                       [0.09453125000000007 0.40625]
                       [0.0 0.15625]
                       [0.03125 0.25]),
 :df 4,
 :estimate (0.1875 0.375 0.25 0.0625 0.125),
 :expected (6.522258911613709 8.862383891077252 9.18485015889101 5.339688018565394 2.090819019852639),
 :lambda 1.0,
 :level 0.95,
 :n 32.0,
 :p (0.2038205909879284 0.27694949659616414 0.28702656746534405 0.16686525058016857 0.06533809437039496),
 :p-value 0.27346634528388103,
 :sides :one-sided-greater,
 :stat 5.1376170090363775,
 :test-type :one-sided-greater}
Independence test using a contingency table (from the previous section).
(stats/chisq-test ct-data-1)
{:alpha 0.05,
 :chi2 1.7774150400145103,
 :ci-sides :two-sided,
 :confidence-interval
 {[0 0] [0.33 0.53], [0 1] [0.04 0.15], [1 0] [0.34 0.54], [1 1] [0.01 0.08]},
 :df 1,
 :estimate {[0 0] 0.43, [0 1] 0.09, [1 0] 0.44, [1 1] 0.04},
 :expected {[0 0] 45.24, [0 1] 6.76, [1 0] 41.76, [1 1] 6.24},
 :k 2,
 :lambda 1.0,
 :level 0.95,
 :n 100.0,
 :p-value 0.1824670652605519,
 :r 2,
 :sides :one-sided-greater,
 :stat 1.7774150400145103,
 :test-type :one-sided-greater}
(stats/multinomial-likelihood-ratio-test ct-data-2)
{:alpha 0.05,
 :chi2 11.250923175273865,
 :ci-sides :two-sided,
 :confidence-interval {[0 0] [0.17 0.28500000000000003],
                       [0 1] [0.14 0.245],
                       [0 2] [0.09 0.185],
                       [1 0] [0.085 0.18],
                       [1 1] [0.2 0.32],
                       [1 2] [0.03 0.1]},
 :df 2,
 :estimate
 {[0 0] 0.225, [0 1] 0.19, [0 2] 0.135, [1 0] 0.13, [1 1] 0.26, [1 2] 0.06},
 :expected
 {[0 0] 39.05, [0 1] 49.5, [0 2] 21.45, [1 0] 31.95, [1 1] 40.5, [1 2] 17.55},
 :k 2,
 :lambda 0.0,
 :level 0.95,
 :n 200.0,
 :p-value 0.0036048987752140826,
 :r 3,
 :sides :one-sided-greater,
 :stat 11.250923175273865,
 :test-type :one-sided-greater}
Distribution Comparison Tests (AD/KS):
These tests directly compare empirical cumulative distribution functions (ECDFs) or density estimates, providing non-parametric ways to assess goodness-of-fit or compare two samples without strong assumptions about the underlying distribution shape.
- ad-test-one-sample: Anderson-Darling test. Primarily a Goodness-of-Fit test. It is particularly sensitive to differences in the tails of the distributions being compared. It tests if a sample xs comes from a specified distribution or an empirical distribution estimated from ys. \[ A^2 = -n - \sum_{i=1}^n \frac{2i-1}{n} [\ln(F(X_i)) + \ln(1-F(X_{n-i+1}))] \] where \(X_i\) are the ordered data, \(n\) is the sample size, and \(F\) is the CDF of the hypothesized distribution.
  - Arguments: xs (sample), distribution-or-ys (reference distribution or sample), opts (map: :sides, :kernel, :bandwidth).
  - Returns: Map with :A2, :stat, :p-value, :n, :mean, :stddev, :sides.
- ks-test-one-sample: One-sample Kolmogorov-Smirnov test. Compares the ECDF of a sample xs to a theoretical CDF or the ECDF of another sample ys. It is sensitive to the largest vertical difference between the two CDFs. \[ D = \max_i(|F_n(X_i) - F(X_i)|) \] where \(F_n\) is the ECDF of the sample, \(F\) is the reference CDF, and \(X_i\) are the ordered data.
  - Arguments: xs (sample), distribution-or-ys (reference distribution or sample), opts (map: :sides, :kernel, :bandwidth, :distinct?).
  - Returns: Map with :n, :dp, :dn, :d, :stat, :KS, :p-value, :sides.
- ks-test-two-samples: Two-sample Kolmogorov-Smirnov test. Compares the ECDFs of two independent samples xs and ys. It is sensitive to the largest vertical difference between the two ECDFs. \[ D_{n,m} = \max_i(|F_n(X_i) - G_m(X_i)|) \] where \(F_n\) and \(G_m\) are the ECDFs of the two samples.
  - Arguments: xs (sample 1), ys (sample 2), opts (map: :method (exact/approx), :sides, :distinct?, :correct?).
  - Returns: Map with :nx, :ny, :n, :dp, :dn, :d, :stat, :KS, :p-value, :sides, :method.
Notes:
- The AD test is generally more sensitive to differences in the tails of the distributions than the KS test.
- The KS test (both one-sample and two-sample) is based on the maximum difference between CDFs, making it sensitive to any point where the distributions diverge, but perhaps less so specifically in the extreme tails compared to AD.
- The KS test has issues with discrete data and ties; the :distinct? option attempts to mitigate this.
- The Power Divergence tests are typically applied to counts or binned data, while AD/KS can be applied directly to continuous data (or compared against continuous distributions).
Handling Ties in KS Tests:
The Kolmogorov-Smirnov (KS) test is theoretically defined for continuous distributions, where the probability of observing duplicate values (ties) is zero. In practice, real-world data often contains ties, which can affect the accuracy of the KS test, particularly for the exact p-value calculation methods. The ks-test-one-sample
and ks-test-two-samples
functions in fastmath.stats
provide the :distinct?
option to address this.
The :distinct? option controls how ties are handled:
- :ties (default): Keeps all observed data points, including ties. If the :method is :exact, information about the ties is passed to the underlying exact calculation algorithm to attempt a correction. The accuracy depends on the specific algorithm’s tie-handling capabilities.
- true: Applies the Clojure distinct function separately to each input sequence (xs and ys for the two-sample test) before combining and processing. This removes duplicate values within each sample but does not guarantee that ties between the samples are resolved or correctly handled for the exact method.
- :jitter: Adds a small amount of random noise to each data point to break all ties. This is a common pragmatic approach when an exact tie correction is unavailable or complex, but it slightly alters the original data.
- false: The data is used exactly as provided, without any specific handling or correction for ties.
The presence and handling of ties can significantly influence the calculated p-value, especially when using exact calculation methods. Choosing the appropriate method depends on the nature of the data and the required accuracy.
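As a sketch of how the option is passed (samples invented here to contain many ties; outputs omitted):
;; heavily tied samples
(def xs-tied [1 1 2 2 3 3 4])
(def ys-tied [2 2 3 3 4 4 5])
(stats/ks-test-two-samples xs-tied ys-tied)                      ;; default, equivalent to {:distinct? :ties}
(stats/ks-test-two-samples xs-tied ys-tied {:distinct? :jitter}) ;; break ties with a little random noise
(stats/ks-test-two-samples xs-tied ys-tied {:distinct? false})   ;; use the data exactly as provided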
Examples:
AD test comparing setosa-sepal-length
data to a normal distribution.
(stats/ad-test-one-sample setosa-sepal-length (r/distribution :normal {:mu (stats/mean setosa-sepal-length), :sd (stats/stddev setosa-sepal-length)}))
{:A2 0.407985975495866,
 :mean 5.006,
 :n 50,
 :p-value 0.8401982598779278,
 :sides :right,
 :stat 0.407985975495866,
 :stddev 0.3524896872134514}
KS test comparing setosa-sepal-length data to an empirical distribution (KDE-based) built from normal samples.
(let [d (r/distribution :normal {:mu (stats/mean setosa-sepal-length), :sd (stats/stddev setosa-sepal-length)})]
  (stats/ks-test-one-sample setosa-sepal-length (r/->seq d 100) {:kernel :epanechnikov}))
{:d 0.1689233351927874,
 :dn 0.13173953566463703,
 :dp 0.1689233351927874,
 :n 15,
 :p-value 0.7249184402452267,
 :sides :two-sided,
 :stat 0.1689233351927874}
Two-sample KS test comparing setosa-sepal-length
and virginica-sepal-length
.
(stats/ks-test-two-samples setosa-sepal-length virginica-sepal-length)
{:KS 0.9200000000000005,
 :d 0.9200000000000005,
 :dn 6.245004513516506E-17,
 :dp 0.9200000000000005,
 :method :exact,
 :n 100,
 :nx 50,
 :ny 50,
 :p-value 4.719076766639571E-23,
 :sides :two-sided,
 :stat 0.9200000000000005}
ANOVA and Rank Sum Tests
Analysis of Variance (ANOVA) and rank sum tests are used to determine if there are statistically significant differences between the means (ANOVA) or distributions (rank sum tests) of two or more independent groups. One-way tests are used when comparing groups based on a single categorical independent variable.
one-way-anova-test
kruskal-test
- One-way ANOVA Test (one-way-anova-test): A parametric test used to compare the means of two or more independent groups. It assesses whether the variation among group means is larger than the variation within groups.
  - Null Hypothesis (\(H_0\)): The means of all groups are equal.
  - Alternative Hypothesis (\(H_1\)): At least one group mean is different from the others.
  - Assumptions: Independence of observations, normality within each group, and homogeneity of variances (equal variances across groups).
  - Test Statistic (F-statistic): Calculated as the ratio of the variance between groups (Mean Square Treatment, \(MSt\)) to the variance within groups (Mean Square Error, \(MSe\)). \[ F = \frac{MSt}{MSe} \] Where \(MSt = SSt / DFt\) and \(MSe = SSe / DFe\). \(SSt\) is the Sum of Squares Treatment (between groups), \(SSe\) is the Sum of Squares Error (within groups), \(DFt\) is the degrees of freedom for the treatment (number of groups - 1), and \(DFe\) is the degrees of freedom for the error (total observations - number of groups).
- Kruskal-Wallis H-Test (kruskal-test): A non-parametric alternative to the one-way ANOVA. It is used to compare the distributions of two or more independent groups when the assumptions of ANOVA (especially normality) are not met. It tests whether the groups come from the same distribution.
  - Null Hypothesis (\(H_0\)): The distributions of all groups are identical (specifically, that the median ranks are equal).
  - Alternative Hypothesis (\(H_1\)): At least one group’s distribution is different from the others.
  - Assumptions: Independence of observations. While it doesn’t assume normality, it assumes that the distributions have similar shapes if the intent is to compare medians.
  - Test Statistic (H): Calculated based on the ranks of the combined data from all groups. The sum of ranks is computed for each group, and H is derived from deviations of these rank sums from what would be expected under the null hypothesis.
Comparison:
- one-way-anova-test is a parametric test that compares means. It assumes normality and equal variances.
- kruskal-test is a non-parametric test that compares distributions (or median ranks). It does not assume normality or equal variances, making it suitable for ordinal data or data violating ANOVA assumptions.
Both functions accept a sequence of sequences (xss
) where each inner sequence represents a group. Both return a map containing the test statistic, degrees of freedom, and p-value.
Arguments:
- xss (sequence of sequences of numbers): The groups of data to compare.
- params (map, optional): An options map.
  - :sides (keyword, default :one-sided-greater for both): Specifies the alternative hypothesis side for the test statistic’s distribution.
Return Value (Map):
- :stat (double): The calculated test statistic (F for ANOVA, H for Kruskal-Wallis).
- :p-value (double): The probability of observing the test statistic or more extreme values under the null hypothesis.
- :df (long or vector): Degrees of freedom. Vector [DFt, DFe] for ANOVA, scalar DFt for Kruskal-Wallis.
- :n (sequence of longs): Sample sizes of each group.
- :k (long): Number of groups (Kruskal-Wallis only).
- :sides (keyword): The alternative hypothesis side used.
- ANOVA only: :F (alias for :stat), :SSt, :SSe, :DFt, :DFe, :MSt, :MSe.
- Kruskal-Wallis only: :H (alias for :stat).
Let’s compare the sepal lengths across the three iris species using both tests.
(stats/one-way-anova-test (vals sepal-lengths))
{:DFe 147,
 :DFt 2,
 :F 119.26450218450461,
 :MSe 0.2650081632653062,
 :MSt 31.606066666666663,
 :SSe 38.95620000000001,
 :SSt 63.21213333333333,
 :df [2 147],
 :n (50 50 50),
 :p-value 0.0,
 :stat 119.26450218450461}
(stats/kruskal-test (vals sepal-lengths))
{:df 2, :k 3, :n 150, :p-value 0.0, :sides :right, :stat 96.93743600064822}
Autocorrelation Tests
Autocorrelation tests examine whether values in a sequence are correlated with past values in the same sequence. This is particularly important when analyzing time series data or the residuals from a regression analysis, as autocorrelated residuals violate the assumption of independent errors, which can invalidate statistical inferences. fastmath.stats
provides the Durbin-Watson test, a common method for detecting first-order (lag-1) autocorrelation in regression residuals.
durbin-watson
Durbin-Watson Test (durbin-watson): Calculates the Durbin-Watson statistic (d), which is used to test for the presence of serial correlation, especially first-order (lag-1) autocorrelation, in the residuals of a regression analysis.
The Durbin-Watson statistic is calculated as: \[ d = \frac{\sum_{t=2}^T (e_t - e_{t-1})^2}{\sum_{t=1}^T e_t^2} \] where \(e_t\) are the residuals at time \(t\), and \(T\) is the total number of observations.
The value of the statistic \(d\) ranges from 0 to 4.
- Values near 2 suggest no first-order autocorrelation.
- Values less than 2 suggest positive autocorrelation (residuals tend to be followed by residuals of the same sign).
- Values greater than 2 suggest negative autocorrelation (residuals tend to be followed by residuals of the opposite sign).
Testing for significance requires comparing the calculated statistic to lower (\(d_L\)) and upper (\(d_U\)) critical values from Durbin-Watson tables, which depend on the sample size and number of predictors in the regression model.
Parameters:
- rs (sequence of numbers): The sequence of residuals from a regression model. The sequence should represent observations ordered by time or sequence index.
Returns the calculated Durbin-Watson statistic as a double.
Note: This function only calculates the statistic. Determining statistical significance typically requires consulting Durbin-Watson critical value tables based on the sample size and the number of independent variables in the model.
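The statistic is simple enough to cross-check by hand. Below is a direct transcription of the formula above into a hypothetical helper (for illustration only, not part of fastmath.stats); applied to the residual sequences below, it should agree with stats/durbin-watson.
(defn durbin-watson-manual
  "Sum of squared successive residual differences divided by the sum of squared residuals."
  [es]
  (/ (reduce + (map (fn [e1 e2] (m/sq (- e2 e1))) es (rest es)))
     (reduce + (map m/sq es))))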
Let’s calculate the Durbin-Watson statistic for a few example sequences representing different scenarios: no autocorrelation, positive autocorrelation, and negative autocorrelation.
(def residuals-no-autocorrelation (repeatedly 100 r/grand))
(def residuals-positive-autocorrelation [2.0 1.8 1.5 1.0 -0.3 -0.8 -1.2 -1.5])
(def residuals-negative-autocorrelation [1.0 -1.0 0.5 -0.5 0.2 -0.2 0.1 -0.1])
(stats/durbin-watson residuals-no-autocorrelation)       ;; => 2.064286029995981
(stats/durbin-watson residuals-positive-autocorrelation) ;; => 0.17236753856472167
(stats/durbin-watson residuals-negative-autocorrelation) ;; => 3.0884615384615386
Time Series Analysis
Functions specifically for analyzing sequential data, such as time series. These functions help identify patterns like trends, seasonality, and autocorrelation, which are crucial for understanding the underlying process generating the data and for building forecasting models (like ARIMA).
acf
pacf
acf-ci, pacf-ci
Autocorrelation and Partial Autocorrelation functions are fundamental tools for analyzing the dependence structure of a time series. They measure the correlation between a series and its lagged values.
Autocorrelation Function (ACF) (
acf
): Measures the linear dependence between a time series and its lagged values. Specifically, the ACF at lag \(k\) is the correlation between the series and itself shifted by \(k\) time units. It captures both direct and indirect dependencies. For a stationary series, the sample ACF at lag \(k\) is estimated as: \[ \rho_k = \frac{\sum_{t=k+1}^n (y_t - \bar{y})(y_{t-k} - \bar{y})}{\sum_{t=1}^n (y_t - \bar{y})^2} \] where \(y_t\) is the observation at time \(t\), \(\bar{y}\) is the sample mean, and \(n\) is the series length.Partial Autocorrelation Function (PACF) (
pacf
): Measures the linear dependence between a time series and its lagged values after removing the linear dependence from the intermediate lags. The PACF at lag \(k\) is the correlation between \(y_t\) and \(y_{t-k}\), conditional on the intermediate observations \(y_{t-1}, y_{t-2}, \dots, y_{t-k+1}\). It isolates the direct relationship at each lag. The PACF values (\(\phi_{kk}\)) are the last coefficients in a sequence of autoregressive models of increasing order. For example, \(\phi_{11}\) is the correlation at lag 1, \(\phi_{22}\) is the correlation at lag 2 after accounting for lag 1, etc.
Comparison:
- The ACF shows correlations at various lags, including those that are simply due to preceding correlations. It tends to decay gradually for autoregressive (AR) processes and cut off sharply for moving average (MA) processes.
- The PACF shows only the direct correlation at each lag, after removing the effects of shorter lags. It tends to cut off sharply for AR processes and decay gradually for MA processes.
These patterns in ACF and PACF plots are diagnostic tools for identifying the order of AR or MA components in time series models (e.g., ARIMA).
Confidence Intervals:
- acf-ci, pacf-ci: Calculate ACF and PACF values, respectively, and provide approximate confidence intervals around these estimates. These intervals help determine whether the autocorrelation/partial autocorrelation at a given lag is statistically significant (i.e., unlikely to be zero in the population). For a stationary series, the standard error of the sample ACF at lag \(k\) is approximately \(1/\sqrt{n}\) for \(k>0\). The PACF has a similar standard error for lags \(k>0\). The confidence interval at level \(\alpha\) is typically \(\pm z_{\alpha/2} / \sqrt{n}\).
- acf-ci also provides cumulative confidence intervals (:cis) based on the variance of the sum of squared autocorrelations up to each lag.
Arguments:
- data (seq of numbers): The time series data.
- lags (long or seq of longs, optional): The maximum lag to calculate ACF/PACF for (if a number), or a specific list of lags (for acf only). Defaults to (dec (count data)).
- alpha (double, optional, for *-ci functions): Significance level for the confidence intervals (default 0.05 for 95% CI).
Returns:
- acf: A sequence of doubles representing the autocorrelation coefficients at lags 0 (always 1.0) up to lags (or the specified lags).
- pacf: A sequence of doubles representing the partial autocorrelation coefficients at lags 0 (always 0.0) up to lags.
- acf-ci, pacf-ci: A map containing:
  - :ci (double): The value of the standard confidence interval (e.g., \(1.96/\sqrt{n}\) for 95% CI).
  - :acf or :pacf (seq of doubles): The calculated ACF or PACF values.
  - :cis (seq of doubles, only for acf-ci): The cumulative confidence intervals for ACF.
Let’s illustrate ACF and PACF with the white noise data (\(\sigma=0.5\)) and AR, MA and ARFIMA processes.
(def white-noise (take 1000 (r/white-noise (r/rng :mersenne 1) 0.5)))
(stats/acf white-noise 10)
;; => (1.0 -0.018326052355220286 -0.003929181207704636 0.03404395796429751 -0.0437602077864766 -0.019833239371372824 -0.03199281424195406 0.01948375863906099 -0.0358413563927969 -0.03550751208554428 0.005884896971007567)
(stats/pacf white-noise 10)
;; => (0.0 -0.018326052355220286 -0.004266458267873071 0.033905460948715424 -0.0425955681599929 -0.021160385359453224 -0.034290284810821074 0.021155121365213258 -0.03604909879044502 -0.0363697118356963 -4.2588688461055706E-4)
(stats/acf-ci white-noise 10)
;; => {:ci 0.06197950323045615, :acf (1.0 -0.018326052355220286 -0.003929181207704636 0.03404395796429751 -0.0437602077864766 -0.019833239371372824 -0.03199281424195406 0.01948375863906099 -0.0358413563927969 -0.03550751208554428 0.005884896971007567), :cis (0.06197950323045615 0.06200031519261883 0.062001271732432243 0.06207303866741653 0.06219143491666359 0.062215727186766975 0.062278892765961415 0.06230230372268246 0.062381459961140466 0.06245905092154892 0.062461180879962185)}
(stats/pacf-ci white-noise 10)
;; => {:ci 0.06197950323045615, :pacf (0.0 -0.018326052355220286 -0.004266458267873071 0.033905460948715424 -0.0425955681599929 -0.021160385359453224 -0.034290284810821074 0.021155121365213258 -0.03604909879044502 -0.0363697118356963 -4.2588688461055706E-4)}

![]() |
![]() |
And now for an AR(20) time series.
(def ar20 (take 10000 (drop 100 (r/ar (v/sq (m/slice-range 0.35 0.001 20)) white-noise))))
(stats/acf ar20 10)
;; => (1.0 0.29824685915970284 0.2896474498894056 0.311904955709081 0.2564971222020911 0.28562296383470065 0.2594494703108183 0.2895861922578007 0.25082092948344153 0.2345937181515176 0.2551966287040645)
(stats/pacf ar20 10)
;; => (0.0 0.29824685915970284 0.22029144703035636 0.2063218881353284 0.10446181152034764 0.13372974449872 0.08364335528216932 0.12094872497857509 0.053209972317661176 0.03814507902442836 0.061857342887178354)
(stats/acf-ci ar20 10)
;; => {:ci 0.06683421715111557, :acf (1.0 0.29824685915970284 0.2896474498894056 0.311904955709081 0.2564971222020911 0.28562296383470065 0.2594494703108183 0.2895861922578007 0.25082092948344153 0.2345937181515176 0.2551966287040645), :cis (0.06683421715111557 0.07253598529450296 0.07753039023526297 0.08294616607495518 0.08641652954167832 0.09053521958042302 0.09379757073484547 0.09770956726933358 0.10054443827222243 0.10296037633006667 0.10574802260744404)}
(stats/pacf-ci ar20 10)
;; => {:ci 0.06683421715111557, :pacf (0.0 0.29824685915970284 0.22029144703035636 0.2063218881353284 0.10446181152034764 0.13372974449872 0.08364335528216932 0.12094872497857509 0.053209972317661176 0.03814507902442836 0.061857342887178354)}

![]() |
![]() |
MA(10) time series.
(def ma10 (take 1000 (drop 100 (r/ma [0.1 0.1 0.1 2 1 0.1 0.1 0.1 -1 -2]))))
(stats/acf ma10 10)
;; => (1.0 0.3474133398834613 -0.011709728133522722 0.03660551621472599 0.1290588710109751 -0.27880917656908644 -0.39441644491101663 0.007875954440366985 9.42232762079619E-4 -0.11113130274255513 -0.18146580064411183)
(stats/pacf ma10 10)
;; => (0.0 0.3474133398834613 -0.15058018749888494 0.10891720657535767 0.08680688985888793 -0.4201731045736111 -0.14824488723286847 0.2457700222087262 -0.21043734168926814 0.07859630704564799 -0.1922480943507606)
(stats/acf-ci ma10 10)
;; => {:ci 0.06197950323045615, :acf (1.0 0.3474133398834613 -0.011709728133522722 0.03660551621472599 0.1290588710109751 -0.27880917656908644 -0.39441644491101663 0.007875954440366985 9.42232762079619E-4 -0.11113130274255513 -0.18146580064411183), :cis (0.06197950323045615 0.06905618342380006 0.06906381059072134 0.06913830172170955 0.07005763996703814 0.07419771638402556 0.08185651511897317 0.08185942611490751 0.08185946777725892 0.08243699276698035 0.08395745946947249)}
(stats/pacf-ci ma10 10)
;; => {:ci 0.06197950323045615, :pacf (0.0 0.3474133398834613 -0.15058018749888494 0.10891720657535767 0.08680688985888793 -0.4201731045736111 -0.14824488723286847 0.2457700222087262 -0.21043734168926814 0.07859630704564799 -0.1922480943507606)}

![]() |
![]() |
ARFIMA(3,0.1,3)
(def arfima (take 1000 (drop 100 (r/arfima [0.1 0.1 -0.2] 0.1 [0.9 0.1 0.01]))))
(stats/acf arfima 10)
;; => (1.0 0.6830024280427495 0.22641627601133527 0.0016587457637389108 -0.026890694280513158 0.05057199855655252 0.10543886974267487 0.10660727398567701 0.07819917581099999 0.04057428494707494 0.006148517819922725)
(stats/pacf arfima 10)
;; => (0.0 0.6830024280427495 -0.4499954700211661 0.19924878783442407 -0.044557439709321446 0.13585661229874513 -0.046948653961081485 0.06617673322436798 -0.015899361204672176 0.010727119397509012 -0.028901582713062784)
(stats/acf-ci arfima 10)
;; => {:ci 0.06197950323045615, :acf (1.0 0.6830024280427495 0.22641627601133527 0.0016587457637389108 -0.026890694280513158 0.05057199855655252 0.10543886974267487 0.10660727398567701 0.07819917581099999 0.04057428494707494 0.006148517819922725), :cis (0.06197950323045615 0.08617122994558603 0.08842703487053694 0.08842715439876447 0.08845856219342262 0.08856955738246763 0.08905043638142435 0.08953936246420274 0.08980133253702899 0.08987172804736311 0.08987334393090488)}
(stats/pacf-ci arfima 10)
;; => {:ci 0.06197950323045615, :pacf (0.0 0.6830024280427495 -0.4499954700211661 0.19924878783442407 -0.044557439709321446 0.13585661229874513 -0.046948653961081485 0.06617673322436798 -0.015899361204672176 0.010727119397509012 -0.028901582713062784)}

![]() |
![]() |
Histograms
Histograms are fundamental graphical and statistical tools used to represent the distribution of numerical data. They group data into bins along the x-axis and display the frequency (count or proportion) of data points falling into each bin as bars on the y-axis. Histograms provide a visual summary of the shape, center, and spread of the data, highlighting features like modality, symmetry, and outliers.
histogram
estimate-bins
fastmath.stats
provides functions to construct histograms and assist in choosing appropriate binning strategies:
- histogram: Computes and returns the data structure representing a histogram. It takes the input data and parameters defining the bins (number, estimation method, or explicit intervals). It can also process collections of data sequences (for grouped histograms).
- estimate-bins: A utility function to recommend the number of bins for a given dataset based on various commonly used heuristic rules.
The histogram
function calculates the counts of data points falling into predefined intervals (bins).
Parameters:
- vs (sequence of numbers or sequence of sequences): The input data. Can be a single sequence for a simple histogram or a collection of sequences for grouped histograms.
- bins-or-estimate-method (number, keyword, or sequence, optional): Defines the histogram bins.
  - A number: The desired number of bins. The function calculates equally spaced intervals between the minimum and maximum values.
  - A keyword (default: :freedman-diaconis): Uses a specific heuristic to estimate the number of bins (see [estimate-bins]).
  - A sequence of numbers: Explicitly defines the bin edges (intervals). Data points are counted if they fall into [edge_i, edge_{i+1}).
  - If omitted, uses the default :freedman-diaconis estimation method.
- mn, mx (doubles, optional): Explicit minimum and maximum values to consider for binning. Data outside [mn, mx] are excluded. If omitted, the minimum and maximum of the data are used. When an explicit bins-or-estimate-method (sequence of edges) is provided, mn and mx are inferred from the provided edges and the vs data is filtered to fit.
Return Value (Map):
Returns a map describing the histogram structure and counts. Key elements include:
- :size: The number of bins.
- :step: The average width of the bins (only applicable for equally spaced bins).
- :samples: The total number of data points included in the histogram (may be less than the input size if mn/mx are specified).
- :min, :max: The minimum and maximum data values used for binning.
- :intervals: The sequence of numbers defining the bin edges.
- :bins: A sequence of [lower-edge, count] pairs for each bin.
- :frequencies: A map where keys are the average value of each bin and values are the counts.
- :bins-maps: A sequence of detailed maps for each bin, including :min, :max, :step (bin width), :count, :avg (mean value within the bin), and :probability (count / total samples).
If the input vs is a sequence of sequences, the function returns a sequence of such maps, one for each inner sequence.
(stats/histogram mpg)
{:bins ([10.4 6] [15.100000000000001 12] [19.8 8] [24.5 2] [29.200000000000003 4]),
 :bins-maps ({:avg 13.016666666666666,
              :count 6,
              :max 15.100000000000001,
              :min 10.4,
              :probability 0.1875,
              :step 4.700000000000001}
             {:avg 17.341666666666665,
              :count 12,
              :max 19.8,
              :min 15.100000000000001,
              :probability 0.375,
              :step 4.699999999999999}
             {:avg 22.0375,
              :count 8,
              :max 24.5,
              :min 19.8,
              :probability 0.25,
              :step 4.699999999999999}
             {:avg 26.65,
              :count 2,
              :max 29.200000000000003,
              :min 24.5,
              :probability 0.0625,
              :step 4.700000000000003}
             {:avg 31.775,
              :count 4,
              :max 33.9,
              :min 29.200000000000003,
              :probability 0.125,
              :step 4.699999999999996}),
 :frequencies
 {13.016666666666666 6, 17.341666666666665 12, 22.0375 8, 26.65 2, 31.775 4},
 :intervals (10.4 15.100000000000001 19.8 24.5 29.200000000000003 33.9),
 :max 33.9,
 :min 10.4,
 :samples 32,
 :size 5,
 :step 4.7}

Histogram with 3 bins
(stats/histogram mpg 3)
{:bins ([10.4 14] [18.233333333333334 13] [26.066666666666666 5]),
 :bins-maps ({:avg 14.957142857142857,
              :count 14,
              :max 18.233333333333334,
              :min 10.4,
              :probability 0.4375,
              :step 7.833333333333334}
             {:avg 21.469230769230766,
              :count 13,
              :max 26.066666666666666,
              :min 18.233333333333334,
              :probability 0.40625,
              :step 7.833333333333332}
             {:avg 30.879999999999995,
              :count 5,
              :max 33.9,
              :min 26.066666666666666,
              :probability 0.15625,
              :step 7.833333333333332}),
 :frequencies
 {14.957142857142857 14, 21.469230769230766 13, 30.879999999999995 5},
 :intervals (10.4 18.233333333333334 26.066666666666666 33.9),
 :max 33.9,
 :min 10.4,
 :samples 32,
 :size 3,
 :step 7.833333333333333}

Histogram with irregular bins
(stats/histogram mpg [10 20 25 27 35])
{:bins ([10.0 18] [20.0 8] [25.0 1] [27.0 5]),
 :bins-maps
 ({:avg 15.899999999999999,
   :count 18,
   :max 20.0,
   :min 10.0,
   :probability 0.5625,
   :step 10.0}
  {:avg 22.0375, :count 8, :max 25.0, :min 20.0, :probability 0.25, :step 5.0}
  {:avg 26.0, :count 1, :max 27.0, :min 25.0, :probability 0.03125, :step 2.0}
  {:avg 30.879999999999995,
   :count 5,
   :max 35.0,
   :min 27.0,
   :probability 0.15625,
   :step 8.0}),
 :frequencies {15.899999999999999 18, 22.0375 8, 26.0 1, 30.879999999999995 5},
 :intervals (10 20 25 27 35),
 :max 35.0,
 :min 10.0,
 :samples 32,
 :size 4,
 :step 6.25}

Histogram with given min and max range
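A sketch, assuming mn and mx are passed positionally after the bin specification, as described in the parameters above; values outside the range are excluded:
;; bin only the mpg values falling into the [15.0, 25.0] range
(stats/histogram mpg 5 15.0 25.0)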

The estimate-bins
function provides recommendations for the number of bins in a histogram based on common rules.
Parameters:
- vs (sequence of numbers): The input data.
- bins-or-estimate-method (keyword or number, optional): The rule to use for estimation.
  - A keyword (default: :freedman-diaconis): Specifies the rule.
    - :sqrt: Square root rule (simple). \(k = \lceil\sqrt{n}\rceil\)
    - :sturges: Sturges’ rule (assumes approximately normal data). \(k = \lceil\log_2(n) + 1\rceil\)
    - :rice: Rice rule. \(k = \lceil 2 n^{1/3} \rceil\)
    - :doane: Doane’s rule (modification of Sturges’ for non-normal data). \(k = \lceil \log_2(n) + 1 + \log_2(1 + \frac{|g_1|}{\sigma_{g_1}}) \rceil\), where \(g_1\) is sample skewness and \(\sigma_{g_1}\) is its standard error.
    - :scott: Scott’s normal reference rule (bin width \(h = 3.5 \hat{\sigma} n^{-1/3}\)). \(k = \lceil (max - min)/h \rceil\)
    - :freedman-diaconis (default): Freedman-Diaconis rule (robust to outliers, bin width \(h = 2 \cdot IQR \cdot n^{-1/3}\)). \(k = \lceil (max - min)/h \rceil\)
  - A number: Returns the number itself (useful for passing a fixed value through).
  - If omitted, uses the default :freedman-diaconis rule.
Return Value (Long):
Returns the estimated number of bins as a long integer. The returned value is constrained to be no greater than the number of samples.
Estimate bins for alcohol
data using different methods
(stats/estimate-bins alcohol)          ;; => 28
(stats/estimate-bins alcohol :sqrt)    ;; => 69
(stats/estimate-bins alcohol :sturges) ;; => 14
(stats/estimate-bins alcohol :rice)    ;; => 34
(stats/estimate-bins alcohol :doane)   ;; => 17
(stats/estimate-bins alcohol :scott)   ;; => 25
(stats/estimate-bins alcohol 3)        ;; => 3
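As a rough cross-check, the Freedman-Diaconis count can be recomputed from its definition (a sketch; assumes stats/iqr and the fastmath.core alias m used elsewhere in this notebook):
;; bin width h = 2*IQR*n^(-1/3); bin count = ceil((max - min) / h)
(let [n (count alcohol)
      h (* 2.0 (stats/iqr alcohol) (m/pow n (/ -1.0 3.0)))]
  (m/ceil (/ (- (apply max alcohol) (apply min alcohol)) h)))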
Bootstrap
Bootstrap is a widely used resampling technique in statistics. It involves repeatedly drawing samples with replacement from an original dataset (or simulating data from a model) to create many “bootstrap samples”. By analyzing the distribution of a statistic computed from each of these bootstrap samples, one can estimate the sampling distribution of the statistic, its standard error, bias, and construct confidence intervals without relying on strong parametric assumptions.
bootstrap
jackknife, jackknife+
bootstrap-stats
ci-normal, ci-basic, ci-percentile, ci-bc, ci-bca, ci-studentized, ci-t
The core of the bootstrap process is generating the resamples. fastmath.stats
provides the main bootstrap
function for general resampling, and dedicated functions for specific resampling techniques like Jackknife.
- bootstrap: Generates bootstrap samples from data or a probabilistic model. It supports various sampling methods (standard resampling with replacement, Jackknife variants, or sampling from a distribution) and options like smoothing, antithetic sampling, and handling multidimensional data.
- jackknife: Generates samples using the leave-one-out Jackknife method. For a dataset of size \(n\), it produces \(n\) samples, each by removing one observation from the original dataset. This is a less computationally intensive resampling method than standard bootstrap, primarily used for bias and variance estimation.
- jackknife+: Generates samples using the positive Jackknife method. For a dataset of size \(n\), it produces \(n\) samples, each by adding one extra copy of an observation to the original dataset, resulting in samples of size \(n+1\).
The primary function for generating bootstrap samples. It’s flexible, supporting both nonparametric (resampling from data) and parametric (sampling from a model) approaches, and various options.
Parameters:
- input (sequence or map): The data source.
  - If a sequence of data values (e.g., [1.2 3.4 5.0]), it’s treated as nonparametric input.
  - If a map {:data data-sequence :model model-object (optional)}, it supports parametric bootstrap. If :model is omitted, a distribution is automatically built from :data.
  - Can be a sequence of sequences (e.g., [[1 2] [3 4]]) for multidimensional data when :dimensions is :multi.
- statistic (function, optional): A function (fn [sample-sequence]) that calculates a statistic (e.g., fastmath.stats/mean). If provided, bootstrap-stats is automatically called on the results. If nil, the raw bootstrap samples are returned.
- params (map, optional): Configuration options:
  - :samples (long, default: 500): Number of bootstrap samples. Ignored for :jackknife/:jackknife+.
  - :size (long, optional): Size of each sample. Defaults to the original data size. Ignored for :jackknife/:jackknife+.
  - :method (keyword, optional): Sampling method.
    - nil (default): Standard resampling with replacement from data or model.
    - :jackknife: Uses [jackknife].
    - :jackknife+: Uses [jackknife+].
    - Other keywords are passed to fastmath.random/->seq for sampling from a distribution model.
  - :rng (random number generator, optional): fastmath.random RNG. Defaults to the JVM RNG.
  - :smoothing (keyword, optional): Applies smoothing. :kde for Kernel Density Estimation on the model; :gaussian adds noise to resampled values.
  - :distribution (keyword, default :real-discrete-distribution): Type of distribution to build if :model is missing.
  - :dimensions (keyword, optional): :multi for multidimensional data (sequence of sequences).
  - :antithetic? (boolean, default false): Uses antithetic sampling (requires a distribution model).
  - :include? (boolean, default false): Includes the original dataset as one sample.
Returns:
- If statistic is provided: A map including the original input plus analysis results from bootstrap-stats (e.g., :t0, :ts, :bias, :mean, :stddev).
- If statistic is nil: A map including the original input plus :samples (a collection of bootstrap sample sequences).
Let’s demonstrate various sampling methods for mpg
data and see the outcome.
First we’ll generate two samples using the default method (resampling with replacement) without calculating statistics.
(def rng (r/rng :mersenne 12345))
(def boot-mpg (boot/bootstrap mpg {:samples 2 :rng rng}))
boot-mpg
{:data (21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26 30.4 15.8 19.7 15 21.4), :model #object[fastmath.random$eval76834$fn$reify__76848 0x380abdde "fastmath.random$eval76834$fn$reify__76848@380abdde"], :samples ((30.4 21.4 18.1 21.4 26.0 19.2 22.8 33.9 30.4 15.0 22.8 33.9 16.4 30.4 21.4 21.0 21.5 21.4 14.7 19.2 21.0 18.1 15.2 18.1 16.4 26.0 14.7 18.1 21.4 26.0 21.4 19.7) (17.3 10.4 10.4 15.8 15.2 32.4 21.0 18.7 21.0 22.8 10.4 22.8 22.8 15.8 13.3 18.1 22.8 21.4 26.0 24.4 15.2 22.8 16.4 17.8 22.8 33.9 14.7 21.5 27.3 15.8 27.3 15.5))}
Now we’ll create sets of samples using various bootstrapping methods: jackknife, jackknife+, gaussian and KDE (Epanechnikov kernel) smoothing, antithetic, systematic, and stratified.
jackknife
Each sample is the original dataset with one data point removed.
(def boot-mpg-jacknife (boot/bootstrap mpg {:method :jackknife}))
(count (:samples boot-mpg-jacknife))
32
jackknife+
Each sample is the original dataset with one data point duplicated.
(def boot-mpg-jacknife+ (boot/bootstrap mpg {:method :jackknife+}))
(count (:samples boot-mpg-jacknife+))
32
gaussian smoothing
Adds random gaussian noise to bootstrapped samples.
(def boot-mpg-gaussian (boot/bootstrap mpg {:samples 2 :rng rng :smoothing :gaussian}))
boot-mpg-gaussian
{:data (21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26 30.4 15.8 19.7 15 21.4), :model #object[fastmath.stats.bootstrap$build_model$reify__82752 0x1b843e88 "fastmath.stats.bootstrap$build_model$reify__82752@1b843e88"], :samples ((17.67627223078807 14.421273983904124 19.78342723622221 16.07408954450633 14.316546938062789 27.798521999070797 30.77113828285854 18.302867531466383 15.960192150840488 15.864337425252026 14.941708701743925 26.804072729895143 16.574983247993774 29.666351699518902 17.339228247855342 31.40604151247253 22.395701311599353 32.930311855428364 33.78970082261317 30.756294112465746 21.585229731828957 26.74839610035781 23.885829284382098 30.639999179485162 21.687174326300102 27.024298138739095 20.996894755344975 15.037358451484616 27.75666678457644 21.681540315448597 18.258506429956125 21.224613290413426) (15.557680619017212 21.857534902478296 17.173787452087744 21.772458916268835 30.79048018443092 14.401834047062994 29.91805443291403 23.84835433680409 14.97275468619625 15.831977361111361 12.75807146600363 29.478799801629 13.031673716237787 19.899937617191505 10.412081378944666 15.807822683372736 20.77838101998807 34.10293524709373 23.32318729591913 15.691254243065426 15.523902532414375 22.156319169598923 13.174824168708717 21.945995006952142 16.102210424539535 12.657658291816556 22.057155613451208 13.808960669738454 27.20472770459077 15.71216788846756 15.340973186948972 21.86783436275048))}
KDE (epanechnikov) smoothing
Builds a KDE-based continuous distribution and samples from it.
(def boot-mpg-kde (boot/bootstrap mpg {:samples 2 :rng rng :smoothing :kde
                                       :kernel :epanechnikov}))
boot-mpg-kde
{:data (21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26 30.4 15.8 19.7 15 21.4), :model #object[fastmath.random$eval76726$fn$reify__76741 0x3ebb436 "fastmath.random$eval76726$fn$reify__76741@3ebb436"], :samples ((22.71166231862099 21.5726212037736 21.491485668355004 26.80401596363164 11.036961153662787 19.671900487652536 19.19056125766834 14.688160551611023 16.200514869948623 24.206628273571038 17.059352260277656 12.680303574909846 22.250524161111514 17.436331799861314 25.681192054352337 22.158874322682465 15.681506740965093 31.02194432899891 19.462270416888536 33.728606429171 20.47057953199578 18.524421058031663 14.958105453393623 14.69506556273741 17.17075042372946 28.295932367320248 13.829212961072399 24.535423656489346 27.104003147221192 18.25942212803992 27.780775341929104 17.920535668692203) (17.28666485268042 12.638080093377924 14.96857016839802 15.521948357649412 8.753906073248814 29.75592094916252 16.347734250920293 34.81462170357484 32.232667295931996 23.212608320127185 10.744533703230774 20.269444624032467 17.09041749905982 17.345419179032728 19.814015784485466 12.438328238728095 15.908886080593 16.16469970328886 20.98631911829065 13.02615677642237 26.615189109218168 12.421811034398033 14.480450305155927 17.91739680794446 16.56234614189687 20.673676874252102 22.65887627864257 28.21681519740815 11.93320510417053 20.01124443774085 18.687367694505223 14.7899861991578))}
antithetic
Samples in pairs.
(def boot-mpg-antithetic (boot/bootstrap mpg {:samples 2 :rng rng :antithetic? true}))
boot-mpg-antithetic
{:data (21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26 30.4 15.8 19.7 15 21.4), :model #object[fastmath.random$eval76834$fn$reify__76848 0x67cfa7aa "fastmath.random$eval76834$fn$reify__76848@67cfa7aa"], :samples ((32.4 30.4 21.0 21.5 30.4 21.5 18.7 21.0 21.5 30.4 17.8 14.7 14.3 30.4 15.8 27.3 30.4 15.0 18.7 24.4 15.2 19.2 19.2 14.7 21.4 22.8 21.0 21.0 17.8 24.4 21.5 15.8) (10.4 13.3 18.1 15.8 14.3 15.8 19.7 17.8 15.8 13.3 21.0 27.3 30.4 13.3 21.5 14.7 13.3 26.0 19.7 15.2 24.4 19.2 19.2 27.3 16.4 15.5 18.1 17.8 21.0 15.2 15.8 21.5))}
systematic
When the :systematic method is used, its effect becomes visible when the model is continuous.
(def boot-mpg-systematic (boot/bootstrap mpg {:samples 2 :rng rng :method :systematic
                                              :smoothing :kde}))
boot-mpg-systematic
{:data (21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26 30.4 15.8 19.7 15 21.4), :model #object[fastmath.random$eval76726$fn$reify__76741 0x478756ee "fastmath.random$eval76726$fn$reify__76741@478756ee"], :samples ((7.606150009949184 10.075237326852191 11.413673809250726 12.393660938519448 13.192701804947857 13.883853741720861 14.504864771667009 15.077964794554122 15.617447530191576 16.133133483769825 16.632135165532496 17.11985976024962 17.600623861695897 18.078056714893737 18.555409167666742 19.035805185460838 19.52248025896614 20.019034758408264 20.529731011895603 21.059877586339276 21.61637474894962 22.208511540390997 22.849235227484154 23.55710632341578 24.359299899230063 25.29497771190241 26.413769034642186 27.750946296365644 29.27566787162567 30.9234351368731 32.76189217919069 35.34339715127126) (8.676493856283274 10.569043222796827 11.758115889203658 12.66764775786899 13.42609913122416 14.091376822965778 14.694919542040203 15.255831778482156 15.78668865312256 16.29628956322358 16.79110934287723 17.27614772688062 17.755459243241486 18.232521242300177 18.710516956755633 19.19258011239735 19.68203595423166 20.182661491418965 20.69901308885707 21.23684228788875 21.803719707918162 22.409951783548575 23.07004582120764 23.804977419951637 24.645471268906874 25.634844499022098 26.822677828928917 28.227959649536935 29.79700481962044 31.487573747147117 33.45546155582857 36.89510371835037))}
stratified
As with :systematic, we should sample from a continuous distribution.
(def boot-mpg-stratified (boot/bootstrap mpg {:samples 2 :rng rng :method :stratified
                                              :smoothing :kde}))
boot-mpg-stratified
{:data (21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26 30.4 15.8 19.7 15 21.4), :model #object[fastmath.random$eval76726$fn$reify__76741 0x783decbe "fastmath.random$eval76726$fn$reify__76741@783decbe"], :samples ((7.246091436112749 9.941275520026366 11.32406615601483 12.32370249682059 13.133759200629084 13.831823291372558 14.45745740742365 15.03376742295543 15.575522401157436 16.092810948228664 16.59292399884376 17.081376183935813 17.56255414339351 18.04012960877534 18.51737397306466 18.99741250071758 19.483462705173242 19.97908634934658 20.48847962773443 21.016853572877427 21.570950008680434 22.159839800707662 22.796108151907163 23.497783186703003 24.291216370254393 25.21453611919852 26.316973452729822 27.636791537277393 29.149395484719662 30.78779806206383 32.6021090153638 35.06525265475502) (7.015778582858773 9.864058413269005 11.273191811589978 12.284239339511783 13.100626827648787 13.802643523386974 14.43091808088285 15.00905552836559 15.552100855151096 16.070304383761837 16.57105020709445 17.059919878505468 17.541338233013303 18.01900226855315 18.49619503554834 18.97604305379609 19.461755664336977 19.956872139879 20.465554298720164 20.992957541911824 21.54574327611805 22.13285599843595 22.766691619155246 23.4649857783239 24.253644833601463 25.170221950551866 26.263670285499295 27.573746699267982 29.079371999052828 30.712664195652117 32.51454108242075 34.92000288227524))}
Model
The automatically created model is in most cases a distribution object, and it can be used to generate more data.
(r/->seq (:model boot-mpg) 10)
(27.3 18.7 15.5 22.8 10.4 33.9 14.3 30.4 15.8 30.4)
(r/->seq (:model boot-mpg) 10 :stratified)
(10.4 14.3 15.2 15.8 17.8 19.2 21.0 21.5 24.4 30.4)
Let’s extract the median.
(r/icdf (:model boot-mpg) 0.5)
19.2
Model based sampling
Now let’s simulate the data when a model is given. Assume that mpg follows a normal distribution: instead of resampling, we will draw samples from a fitted normal distribution.
(def boot-mpg-model (boot/bootstrap {:data mpg
                                     :model (r/distribution :normal {:mu (stats/mean mpg)
                                                                     :sd (stats/stddev mpg)})}
                                    stats/mean
                                    {:samples 10000 :rng rng}))
Non-numerical data
Bootstrap can also be applied to categorical data.
(boot/bootstrap [:cat :dog :cat :dog :dog] {:samples 2, :size 20})
{:data [:cat :dog :cat :dog :dog],
 :model #object[fastmath.random$eval76859$fn$reify__76863 0x47ab6488 "fastmath.random$eval76859$fn$reify__76863@47ab6488"],
 :samples ((:cat :dog :cat :dog :dog :cat :dog :dog :cat :dog :dog :dog :dog :dog :dog :dog :dog :dog :dog :dog)
           (:cat :cat :cat :cat :dog :dog :cat :dog :dog :cat :dog :dog :dog :dog :dog :dog :dog :cat :cat :dog))}
Statistics
Once bootstrap samples are generated, one typically applies the statistic of interest to each sample to get the bootstrap distribution of the statistic.
- bootstrap-stats: Computes a specified statistic on the original data (t0) and on each bootstrap sample (ts), and calculates descriptive statistics (mean, median, variance, stddev, SEM, bias) for the distribution of ts. This function is called automatically by bootstrap when a statistic function is provided.
Parameters:
- boot-data (map): A map containing :data (original data) and :samples (collection of bootstrap samples).
- statistic (function): The function (fn [sample-sequence]) to apply to the data and samples.
Returns: The input boot-data map augmented with results:
- :statistic - the statistic calculation function
- :t0 - statistic for the original data
- :ts - statistics for all bootstrapped samples
- :bias - difference between t0 and the mean of ts
- :mean - mean of ts
- :median - median of ts
- :variance - variance of ts
- :stddev - standard deviation of ts
- :sem - standard error of the mean of ts
):
(def raw-boot-samples (boot/bootstrap mpg {:samples 1000}))
(def analyzed-boot-data (boot/bootstrap-stats raw-boot-samples stats/mean))
(dissoc analyzed-boot-data :samples :ts)
{:bias 0.029050000000001575,
 :data (21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26 30.4 15.8 19.7 15 21.4),
 :mean 20.119675,
 :median 20.10625,
 :model #object[fastmath.random$eval76834$fn$reify__76848 0x2725a8f2 "fastmath.random$eval76834$fn$reify__76848@2725a8f2"],
 :sem 0.19447546049065773,
 :statistic #<Fn@dcbacae fastmath.stats/mean>,
 :stddev 1.1001193350985647,
 :t0 20.090625,
 :variance 1.2102625514577083}
Now we’ll see the bootstrapped ts for the different sampling methods with sample size = 10000. The vertical line marks t0.
Different statistic
We can also estimate the median instead of the mean from smoothed samples.
(def boot-mpg-median (boot/bootstrap mpg stats/median {:samples 10000 :smoothing :gaussian :rng rng}))

CI
Bootstrap confidence intervals are constructed from the distribution of the statistic calculated on the bootstrap samples (:ts
). fastmath.stats.bootstrap
provides several methods, varying in complexity and robustness. All CI functions take the results map (typically from bootstrap-stats
) and an optional significance level alpha
(default 0.05 for 95% CI). They return a vector [lower-bound, upper-bound, t0]
.
- ci-normal: Normal approximation interval. Assumes the distribution of ts is normal.
- ci-basic: Basic (percentile-t) interval.
- ci-percentile: Percentile interval. The simplest method, directly using the quantiles of the bootstrap distribution of the statistic.
- ci-bc: Bias-Corrected (BC) interval. Adjusts the percentile interval by correcting for median bias in the bootstrap distribution.
- ci-bca: Bias-Corrected and Accelerated (BCa) interval. Corrects for both bias (\(z_0\)) and skewness/non-constant variance (acceleration factor \(a\)). This is often considered the most accurate method but is more complex. Requires the original :data and :statistic in the input map to calculate \(a\) via Jackknife, or it estimates \(a\) from ts skewness if data/statistic are missing.
- ci-studentized: Studentized (bootstrap-t) interval.
- ci-t: T-distribution based interval.
Parameters (common to most CI functions):
- boot-data (map): Bootstrap analysis results, typically from [bootstrap-stats]. Must contain :t0 and :ts. Some methods (ci-bca, ci-studentized) require additional keys like :data, :samples, :statistic.
- alpha (double, optional): Significance level. Default is 0.05 (for 95% CI).
- estimation-strategy (keyword, optional, for ci-basic, ci-percentile, ci-bc, ci-bca): Quantile estimation strategy. Defaults to :legacy. See [quantiles].
Returns (common): A vector [lower-bound, upper-bound, t0]
.
Let’s calculate various confidence intervals for the mean of mpg using the boot-results map defined below (which includes :t0, :ts, etc.):
(def boot-results (boot/bootstrap mpg stats/mean {:samples 500 :rng rng}))
(:t0 boot-results)                 ;; => 20.090625
(boot/ci-normal boot-results)      ;; => [18.029228142021623 22.207409357978378 20.090625]
(boot/ci-basic boot-results)       ;; => [17.969375 22.1184375 20.090625]
(boot/ci-percentile boot-results)  ;; => [18.0628125 22.211875 20.090625]
(boot/ci-bc boot-results)          ;; => [18.077650533275566 22.225951643655776 20.090625]
(boot/ci-bca boot-results)         ;; => [18.166803016582605 22.32402694368883 20.090625]
(boot/ci-studentized boot-results) ;; => [18.114093259364665 22.75279971068179 20.090625]
(boot/ci-t boot-results)           ;; => [17.99645503084558 22.184794969154417 20.090625]
Reference
fastmath.stats
Namespace provides a comprehensive collection of functions for performing statistical analysis in Clojure. It focuses on providing efficient implementations for common statistical tasks, leveraging fastmath’s underlying numerical capabilities.
This namespace covers a wide range of statistical methods, including:
- Descriptive Statistics: Measures of central tendency (mean, median, mode, expectile), dispersion (variance, standard deviation, MAD, SEM), and shape (skewness, kurtosis, L-moments).
- Quantiles and Percentiles: Functions for calculating percentiles, quantiles, and the median, including weighted versions and various estimation strategies.
- Intervals and Extents: Methods for defining ranges within data, such as span, IQR, standard deviation/MAD/SEM extents, percentile/quantile intervals, prediction intervals (PI, HPDI), and fence boundaries for outlier detection.
- Outlier Detection: Functions for identifying data points outside conventional fence boundaries.
- Data Transformation: Utilities for scaling, centering, trimming, winsorizing, and applying power transformations (Box-Cox, Yeo-Johnson) to data.
- Correlation and Covariance: Measures of the linear and monotonic relationship between two or more variables (Pearson, Spearman, Kendall), and functions for generating covariance and correlation matrices.
- Distance and Similarity Metrics: Functions for quantifying differences or likeness between data sequences or distributions, including error metrics (MAE, MSE, RMSE), L-p norms, and various distribution dissimilarity/similarity measures.
- Contingency Tables: Functions for creating, analyzing, and deriving measures of association and agreement (Cramer’s V, Cohen’s Kappa) from contingency tables, including specialized functions for 2x2 tables.
- Binary Classification Metrics: Functions for generating confusion matrices and calculating a wide array of performance metrics (Accuracy, Precision, Recall, F1, MCC, etc.).
- Effect Size: Measures quantifying the magnitude of statistical effects, including difference-based (Cohen’s d, Hedges’ g, Glass’s delta), ratio-based, ordinal/non-parametric (Cliff’s Delta, Vargha-Delaney A), and overlap-based (Cohen’s U, p-overlap), as well as measures related to explained variance (Eta-squared, Omega-squared, Cohen’s f²).
- Statistical Tests: Functions for performing hypothesis tests, including:
- Normality and Shape tests (Skewness, Kurtosis, D’Agostino-Pearson K², Jarque-Bera, Bonett-Seier).
- Binomial tests and confidence intervals.
- Location tests (one-sample and two-sample T/Z tests, paired/unpaired).
- Variance tests (F-test, Levene’s, Brown-Forsythe, Fligner-Killeen).
- Goodness-of-Fit and Independence tests (Power Divergence family including Chi-squared, G-test; AD/KS tests).
- ANOVA and Rank Sum tests (One-way ANOVA, Kruskal-Wallis).
- Autocorrelation tests (Durbin-Watson).
- Time Series Analysis: Functions for analyzing the dependence structure of time series data, such as Autocorrelation (ACF) and Partial Autocorrelation (PACF).
- Histograms: Functions for computing histograms and estimating optimal binning strategies.
This namespace aims to provide a robust set of statistical tools for data analysis and modeling within the Clojure ecosystem.
->confusion-matrix DEPRECATED
Deprecated: Use confusion-matrix
L0
Count equal values in both seqs. Alias for count==
L1
(L1 [vs1 vs2-or-val])
(L1 vs1 vs2-or-val)
Calculates the L1 distance (Manhattan or City Block distance) between two sequences or a sequence and a constant value.
The L1 distance is the sum of the absolute differences between corresponding elements.
Parameters:
- vs1 (sequence of numbers): The first sequence.
- vs2-or-val (sequence of numbers or single number): The second sequence of numbers, or a single number to compare against each element of vs1.
If both inputs are sequences, they must have the same length. If vs2-or-val is a single number, it is effectively treated as a sequence of that number repeated (count vs1) times.
Returns the calculated L1 distance as a double.
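For instance (differences hand-computed from the definition):
(stats/L1 [1 2 3] [2 2 5]) ;; |1-2| + |2-2| + |3-5| = 3.0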
L2
(L2 [vs1 vs2-or-val])
(L2 vs1 vs2-or-val)
Calculates the L2 distance (Euclidean distance) between two sequences or a sequence and a constant value.
This is the standard straight-line distance between two points (vectors) in Euclidean space. It is the square root of the L2sq distance.
Parameters:
- vs1 (sequence of numbers): The first sequence.
- vs2-or-val (sequence of numbers or single number): The second sequence of numbers, or a single number to compare against each element of vs1.
If both inputs are sequences, they must have the same length. If vs2-or-val is a single number, it is effectively treated as a sequence of that number repeated (count vs1) times.
Returns the calculated L2 distance as a double.
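For instance (a 3-4-5 right triangle, hand-computed from the definition):
(stats/L2 [0.0 0.0] [3.0 4.0]) ;; sqrt(3^2 + 4^2) = 5.0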
L2sq
(L2sq [vs1 vs2-or-val])
(L2sq vs1 vs2-or-val)
Calculates the Squared Euclidean distance between two sequences or a sequence and a constant value.
This is the sum of the squared differences between corresponding elements. It is equivalent to the rss (Residual Sum of Squares).
Parameters:
- vs1 (sequence of numbers): The first sequence.
- vs2-or-val (sequence of numbers or single number): The second sequence of numbers, or a single number to compare against each element of vs1.
If both inputs are sequences, they must have the same length. If vs2-or-val is a single number, it is effectively treated as a sequence of that number repeated (count vs1) times.
Returns the calculated Squared Euclidean distance as a double.
See also L1, L2, LInf, rss (Residual Sum of Squares), mse (Mean Squared Error).
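For instance (hand-computed from the definition):
(stats/L2sq [0.0 0.0] [3.0 4.0]) ;; 3^2 + 4^2 = 25.0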
LInf
(LInf [vs1 vs2-or-val])
(LInf vs1 vs2-or-val)
Calculates the L-infinity distance (Chebyshev distance) between two sequences or a sequence and a constant value.
The Chebyshev distance is the maximum absolute difference between corresponding elements.
Parameters:

- `vs1` (sequence of numbers): The first sequence.
- `vs2-or-val` (sequence of numbers or single number): The second sequence, or a single number to compare against each element of `vs1`.

If both inputs are sequences, they must have the same length. If `vs2-or-val` is a single number, it is effectively treated as a sequence of that number repeated `(count vs1)` times.
Returns the calculated L-infinity distance as a double.
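The three measures relate simply: L2 is the square root of L2sq, and LInf keeps only the largest coordinate difference. A quick check (same assumptions as the sketch above):

```clojure
(require '[fastmath.stats :as stats])

(stats/L2sq [1 2 3] [2 4 6]) ;; => 14.0 (1 + 4 + 9)
(stats/L2   [1 2 3] [2 4 6]) ;; => 3.7416... (sqrt of 14)
(stats/LInf [1 2 3] [2 4 6]) ;; => 3.0 (max of |1-2|, |2-4|, |3-6|)
```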
acf
(acf data)
(acf data lags)
Calculates the Autocorrelation Function (ACF) for the given time series `data`.
The ACF measures the linear dependence between a time series and its lagged values. It helps identify patterns (like seasonality or trend) and inform the selection of models for time series analysis (e.g., in ARIMA modeling).
Parameters:

- `data` (seq of numbers): The time series data.
- `lags` (long or seq of longs, optional):
  - If a number, calculates ACF for lags from 0 up to this maximum lag.
  - If a sequence of numbers, calculates ACF for each lag specified in the sequence.
  - If omitted (1-arity call), calculates ACF for lags from 0 up to `(dec (count data))`.
Returns a sequence of doubles: the autocorrelation coefficients for the specified lags. The value at lag 0 is always 1.0.
See also acf-ci (Calculates ACF with confidence intervals), pacf, pacf-ci.
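A minimal sketch (assuming `fastmath.stats` is required as `stats`); a strongly periodic series should show large positive coefficients at lags matching its period:

```clojure
(require '[fastmath.stats :as stats])

(def series [1 2 3 4 3 2 1 2 3 4 3 2 1 2 3 4])

;; coefficients for lags 0..4; the value at lag 0 is always 1.0
(stats/acf series 4)
```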
acf-ci
(acf-ci data)
(acf-ci data lags)
(acf-ci data lags alpha)
Calculates the Autocorrelation Function (ACF) for a time series and provides approximate confidence intervals.
This function computes the ACF of the input time series data
for specified lags (see acf) and includes approximate confidence intervals around the ACF estimates. These intervals help determine whether the autocorrelation at a specific lag is statistically significant (i.e., likely non-zero in the population).
Parameters:

- `data` (seq of numbers): The time series data.
- `lags` (long or seq of longs, optional):
  - If a number, calculates ACF for lags from 0 up to this maximum lag.
  - If a sequence of numbers, calculates ACF for each lag specified in the sequence.
  - If omitted (1-arity call), calculates ACF for lags from 0 up to `(dec (count data))`.
- `alpha` (double, optional): The significance level for the confidence intervals. Defaults to `0.05` (for a 95% CI).
Returns a map containing:

- `:ci` (double): The approximate standard confidence interval bound for lags > 0. If the absolute value of an ACF coefficient at lag `k > 0` exceeds this value, it is considered statistically significant.
- `:acf` (seq of doubles): The autocorrelation coefficients at lags from 0 up to `lags` (or the specified lags if `lags` is a sequence), calculated using acf.
- `:cis` (seq of doubles): Cumulative confidence intervals for the ACF, based on the variance of the sum of squared sample autocorrelations up to each lag.
ad-test-one-sample
(ad-test-one-sample xs)
(ad-test-one-sample xs distribution-or-ys)
(ad-test-one-sample xs distribution-or-ys {:keys [sides kernel bandwidth], :or {sides :right, kernel :gaussian}})
Performs the Anderson-Darling (AD) test for goodness-of-fit.
This test assesses the null hypothesis that a sample xs
comes from a specified theoretical distribution or another empirical distribution. It is sensitive to differences in the tails of the distributions.
Parameters:

- `xs` (seq of numbers): The sample data to be tested.
- `distribution-or-ys` (optional):
  - A `fastmath.random` distribution object to test against. If omitted, defaults to the standard normal distribution (`fastmath.random/default-normal`).
  - A sequence of numbers (`ys`). In this case, an empirical distribution is estimated from `ys` using Kernel Density Estimation (KDE) or an enumerated distribution (see the `:kernel` option).
- `opts` (map, optional): Options map:
  - `:sides` (keyword, default `:right`): Specifies the side(s) of the A^2 statistic’s distribution used for p-value calculation.
    - `:right` (default): Tests if the observed A^2 statistic is significantly large (standard approach for the AD test, indicating poor fit).
    - `:left`: Tests if the observed A^2 statistic is significantly small.
    - `:two-sided`: Tests if the observed A^2 statistic is extreme in either tail.
  - `:kernel` (keyword, default `:gaussian`): Used only when `distribution-or-ys` is a sequence. Specifies the method to estimate the empirical distribution:
    - `:gaussian` (or other KDE kernels): Uses Kernel Density Estimation.
    - `:enumerated`: Creates a discrete empirical distribution from `ys`.
  - `:bandwidth` (double, optional): Bandwidth for KDE (if applicable).
Returns a map containing:

- `:A2`: The Anderson-Darling test statistic (A^2).
- `:stat`: Alias for `:A2`.
- `:p-value`: The p-value associated with the test statistic and the specified `:sides`.
- `:n`: Sample size of `xs`.
- `:mean`: Mean of the sample `xs` (for context).
- `:stddev`: Standard deviation of the sample `xs` (for context).
- `:sides`: The alternative hypothesis side used for p-value calculation.
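A minimal sketch (assuming `fastmath.stats` and `fastmath.random` are required as below, and that `r/grand` draws a standard-normal sample); data drawn from a standard normal should produce a large p-value:

```clojure
(require '[fastmath.stats :as stats]
         '[fastmath.random :as r])

;; test a normal sample against the default standard normal
(def xs (repeatedly 200 r/grand))

(select-keys (stats/ad-test-one-sample xs) [:A2 :p-value])
```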
adjacent-values
(adjacent-values vs)
(adjacent-values vs estimation-strategy)
(adjacent-values vs q1 q3 m)
Lower and upper adjacent values (LAV and UAV).
Let Q1 be the 25th percentile and Q3 the 75th percentile; the IQR is `(- Q3 Q1)`.

- LAV is the smallest value greater than or equal to the LIF = `(- Q1 (* 1.5 IQR))`.
- UAV is the largest value less than or equal to the UIF = `(+ Q3 (* 1.5 IQR))`.
- The third returned value is the median of the samples.

The optional `estimation-strategy` argument changes the quantile estimation type. See estimation-strategies.
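A minimal sketch (assuming `fastmath.stats` is required as `stats`); the extreme value 100 lies outside the upper inner fence, so the UAV is the largest non-outlying observation:

```clojure
(require '[fastmath.stats :as stats])

;; returns [LAV UAV median]
(stats/adjacent-values [1 2 3 4 5 6 7 8 9 100])
```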
ameasure
(ameasure [group1 group2])
(ameasure group1 group2)
Calculates the Vargha-Delaney A measure for two independent samples.
A non-parametric effect size measure quantifying the probability that a randomly chosen value from the first sample (group1
) is greater than a randomly chosen value from the second sample (group2
).
Parameters:

- `group1`: The first independent sample.
- `group2`: The second independent sample.

Returns the calculated A measure (a double) in the range [0, 1]. A value of 0.5 indicates stochastic equality (the distributions overlap completely). Values > 0.5 mean `group1` tends to be larger; values < 0.5 mean `group2` tends to be larger.
Related to cliffs-delta and the Wilcoxon-Mann-Whitney U test statistic.
See also cliffs-delta, wmw-odds.
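A minimal sketch (assuming `fastmath.stats` is required as `stats`); complete separation and identical samples mark the two reference points of the scale:

```clojure
(require '[fastmath.stats :as stats])

;; every value in group2 exceeds every value in group1
(stats/ameasure [1 2 3] [4 5 6]) ;; => 0.0

;; identical groups: stochastic equality
(stats/ameasure [1 2 3] [1 2 3]) ;; => 0.5
```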
binary-measures
(binary-measures tp fn fp tn)
(binary-measures confusion-matrix)
(binary-measures actual prediction)
(binary-measures actual prediction true-value)
Calculates a selected subset of common evaluation metrics for binary classification results.
This function is a convenience wrapper around binary-measures-all, providing a map containing the most frequently used metrics derived from a 2x2 confusion matrix.
The 2x2 confusion matrix is based on True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN):
|              | Predicted True | Predicted False |
|--------------|----------------|-----------------|
| Actual True  | TP             | FN              |
| Actual False | FP             | TN              |
The function accepts the same input formats as binary-measures-all:

- `(binary-measures tp fn fp tn)`: Direct input of the four counts.
- `(binary-measures confusion-matrix)`: Input as a structured representation (a map with keys like `:tp`, `:fn`, `:fp`, `:tn`; a sequence of sequences `[[TP FP] [FN TN]]`; or a flat sequence `[TP FN FP TN]`).
- `(binary-measures actual prediction)`: Input as two sequences of outcomes.
- `(binary-measures actual prediction true-value)`: Input as two sequences with a specified encoding for `true` (success).
Parameters:

- `tp, fn, fp, tn` (long): Counts from the confusion matrix.
- `confusion-matrix` (map or sequence): Representation of the confusion matrix.
- `actual`, `prediction` (sequences): Sequences of true and predicted outcomes.
- `true-value` (optional): Specifies how outcomes are converted to boolean `true`/`false`.
Returns a map containing the following selected metrics:

- `:tp` (True Positives), `:tn` (True Negatives), `:fp` (False Positives), `:fn` (False Negatives)
- `:accuracy`
- `:fdr` (False Discovery Rate, 1 - Precision)
- `:f-measure` (F1 Score, harmonic mean of Precision and Recall)
- `:fall-out` (False Positive Rate)
- `:precision` (Positive Predictive Value)
- `:recall` (True Positive Rate / Sensitivity)
- `:sensitivity` (alias for Recall/TPR)
- `:specificity` (True Negative Rate)
- `:prevalence` (proportion of positive cases)
See also confusion-matrix, binary-measures-all, mcc, contingency-2x2-measures-all.
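A minimal sketch (assuming `fastmath.stats` is required as `stats`), using the `(tp fn fp tn)` arity:

```clojure
(require '[fastmath.stats :as stats])

;; 10 TP, 2 FN, 5 FP, 80 TN -> accuracy is (10 + 80) / 97
(select-keys (stats/binary-measures 10 2 5 80)
             [:accuracy :precision :recall :f-measure])
```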
binary-measures-all
(binary-measures-all tp fn fp tn)
(binary-measures-all confusion-matrix)
(binary-measures-all actual prediction)
(binary-measures-all actual prediction true-value)
Calculates a comprehensive set of evaluation metrics for binary classification results.
This function computes various statistics derived from a 2x2 confusion matrix, summarizing the performance of a binary classifier.
The 2x2 confusion matrix is based on True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN):
|              | Predicted True | Predicted False |
|--------------|----------------|-----------------|
| Actual True  | TP             | FN              |
| Actual False | FP             | TN              |
The function supports several input formats:

- `(binary-measures-all tp fn fp tn)`: Direct input of the four counts as arguments.
  - `tp` (long): True Positive count.
  - `fn` (long): False Negative count.
  - `fp` (long): False Positive count.
  - `tn` (long): True Negative count.
- `(binary-measures-all confusion-matrix)`: Input as a structured representation of the confusion matrix. `confusion-matrix` can be:
  - A map with keys like `:tp`, `:fn`, `:fp`, `:tn` (e.g., `{:tp 10 :fn 2 :fp 5 :tn 80}`).
  - A sequence of sequences representing rows `[[TP FP] [FN TN]]` (e.g., `[[10 5] [2 80]]`).
  - A flat sequence `[TP FN FP TN]` (e.g., `[10 2 5 80]`).
- `(binary-measures-all actual prediction)`: Input as two sequences of outcomes.
  - `actual` (sequence): Sequence of true outcomes.
  - `prediction` (sequence): Sequence of predicted outcomes. Must have the same length as `actual`. Values in `actual` and `prediction` are converted to boolean `true`/`false`. By default, any non-`nil` or non-zero numeric value is treated as `true`, and `nil` or `0.0` is treated as `false`.
- `(binary-measures-all actual prediction true-value)`: Input as two sequences with a specified encoding for `true`.
  - `actual`, `prediction`: Sequences as in the previous arity.
  - `true-value` (optional): Specifies how values in `actual` and `prediction` are converted to boolean `true` (success) or `false` (failure).
    - `nil` (default): Non-`nil`/non-zero (for numbers) is true.
    - Any sequence/set: Values found in this collection are true.
    - A map: Values are mapped according to the map; if a key is not found or maps to `false`, the value is false.
    - A predicate function: Returns `true` if the value satisfies the predicate.
Returns a map containing a wide array of calculated metrics, including but not limited to:

- Basic counts: `:tp`, `:fn`, `:fp`, `:tn`
- Totals: `:cp` (Actual Positives), `:cn` (Actual Negatives), `:pcp` (Predicted Positives), `:pcn` (Predicted Negatives), `:total` (Grand Total)
- Rates (ratios of counts):
  - `:tpr` (True Positive Rate, Recall, Sensitivity, Hit Rate)
  - `:fnr` (False Negative Rate, Miss Rate)
  - `:fpr` (False Positive Rate, Fall-out)
  - `:tnr` (True Negative Rate, Specificity, Selectivity)
  - `:ppv` (Positive Predictive Value, Precision)
  - `:fdr` (False Discovery Rate, `1 - ppv`)
  - `:npv` (Negative Predictive Value)
  - `:for` (False Omission Rate, `1 - npv`)
- Ratios/odds:
  - `:lr+` (Positive Likelihood Ratio)
  - `:lr-` (Negative Likelihood Ratio)
  - `:dor` (Diagnostic Odds Ratio)
- Combined scores:
  - `:accuracy`
  - `:ba` (Balanced Accuracy)
  - `:fm` (Fowlkes–Mallows index)
  - `:pt` (Prevalence Threshold)
  - `:ts` (Threat Score, Jaccard index)
  - `:f-measure`/`:f1-score` (F1 Score, special case of the F-beta score)
  - `:f-beta` (function to calculate F-beta for any beta)
  - `:mcc`/`:phi` (Matthews Correlation Coefficient, Phi coefficient)
  - `:bm` (Bookmaker Informedness)
  - `:kappa` (Cohen’s Kappa, for a 2x2 table)
  - `:mk` (Markedness)
Metrics are generally calculated using standard formulas based on the TP, FN, FP, TN counts. For more details on specific metrics, refer to standard classification literature or the Wikipedia page on Precision and recall, which covers many of these concepts.
See also confusion-matrix, binary-measures (for a selected subset of metrics), mcc, contingency-2x2-measures-all (for a broader set of 2x2 table measures).
binomial-ci
(binomial-ci number-of-successes number-of-trials)
(binomial-ci number-of-successes number-of-trials method)
(binomial-ci number-of-successes number-of-trials method alpha)
Calculates a confidence interval for a binomial proportion.
Given the number of observed successes
in a fixed number of trials
, this function estimates a confidence interval for the true underlying probability of success (p
).
Different statistical methods are available for calculating the interval, as the accuracy and behavior of the interval can vary, especially for small sample sizes or probabilities close to 0 or 1.
Parameters:

- `number-of-successes` (long): The count of successful outcomes.
- `number-of-trials` (long): The total number of independent trials.
- `method` (keyword, optional): The method used to calculate the confidence interval. Defaults to `:asymptotic`.
- `alpha` (double, optional): The significance level for the interval. The confidence level is `1 - alpha`. Defaults to `0.05` (yielding a 95% CI).

Available `method` values:

- `:asymptotic`: Normal approximation (Wald) interval, based on the Central Limit Theorem. Simple but can be inaccurate for small samples or probabilities near 0 or 1.
- `:agresti-coull`: An adjustment to the asymptotic interval, adding ‘pseudo-counts’ to improve performance for small samples.
- `:clopper-pearson`: An exact method based on inverting binomial tests. Provides guaranteed coverage but can be overly conservative (wider than necessary).
- `:wilson`: Score interval, derived from the score test. Generally recommended as a good balance of accuracy and coverage for various sample sizes.
- `:prop.test`: Interval typically used with `prop.test` in R; applies a continuity correction.
- `:cloglog`: Confidence interval based on the complementary log-log transformation.
- `:logit`: Confidence interval based on the logit transformation.
- `:probit`: Confidence interval based on the probit transformation (inverse of the standard normal CDF).
- `:arcsine`: Confidence interval based on the arcsine transformation.
- `:all`: Applies all available methods and returns a map where keys are method keywords and values are their respective confidence intervals (as triplets).

Returns:

- A vector `[lower-bound, upper-bound, estimated-p]`:
  - `lower-bound` (double): The lower limit of the confidence interval.
  - `upper-bound` (double): The upper limit of the confidence interval.
  - `estimated-p` (double): The observed proportion of successes (`number-of-successes / number-of-trials`).
- If `method` is `:all`, returns a map of results from each method.
See also binomial-test for performing a hypothesis test on a binomial proportion.
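A minimal sketch (assuming `fastmath.stats` is required as `stats`); the last element of the triplet is the observed proportion:

```clojure
(require '[fastmath.stats :as stats])

;; 7 successes in 10 trials, Wilson score interval
(stats/binomial-ci 7 10 :wilson)
;; => [lower-bound upper-bound 0.7]
```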
binomial-ci-methods
binomial-test
(binomial-test xs)
(binomial-test xs maybe-params)
(binomial-test number-of-successes number-of-trials {:keys [alpha p ci-method sides], :or {alpha 0.05, p 0.5, ci-method :asymptotic, sides :two-sided}})
Performs an exact test of a simple null hypothesis about the probability of success in a Bernoulli experiment, based on the binomial distribution.
This test assesses the null hypothesis that the true probability of success (p
) in the underlying population is equal to a specified value (default 0.5).
The function can be called in two ways:

- With counts: `(binomial-test number-of-successes number-of-trials params)`
- With data: `(binomial-test xs params)`, where `xs` is a sequence of outcomes. In this case, the outcomes in `xs` are converted to true/false based on the `:true-false-conv` parameter (if provided; otherwise numeric 1s are true), and the number of successes and total trials are derived from `xs`.
Parameters:

- `number-of-successes` (long): Observed number of successful outcomes.
- `number-of-trials` (long): Total number of trials.
- `xs` (sequence): Sample data (used in the alternative call signature).
- `params` (map, optional): Options map:
  - `:p` (double, default `0.5`): The hypothesized probability of success under the null hypothesis.
  - `:alpha` (double, default `0.05`): Significance level for confidence interval calculation.
  - `:sides` (keyword, default `:two-sided`): Specifies the alternative hypothesis.
    - `:two-sided` (default): The true probability `p` is not equal to the hypothesized `p`.
    - `:one-sided-greater`: The true probability `p` is greater than the hypothesized `p`.
    - `:one-sided-less`: The true probability `p` is less than the hypothesized `p`.
  - `:ci-method` (keyword, default `:asymptotic`): Method used to calculate the confidence interval for the probability of success. See binomial-ci and binomial-ci-methods for available options (e.g., `:wilson`, `:clopper-pearson`).
  - `:true-false-conv` (optional, used only with `xs`): A function, set, or map to convert elements of `xs` into boolean `true` (success) or `false` (failure). See the binary-measures-all documentation for details. If `nil` and `xs` contains numbers, `1.0` is treated as success.
Returns a map containing:

- `:p-value`: The probability of observing a result as extreme as, or more extreme than, the observed number of successes, assuming the null hypothesis is true. Calculated using the binomial distribution.
- `:p`: The hypothesized probability of success used in the test.
- `:successes`: The observed number of successes.
- `:trials`: The total number of trials.
- `:alpha`: Significance level used for the confidence interval.
- `:level`: Confidence level (`1 - alpha`).
- `:sides`/`:test-type`: Alternative hypothesis side used.
- `:stat`: The test statistic (the observed number of successes).
- `:estimate`: The observed proportion of successes (`successes / trials`).
- `:ci-method`: Confidence interval method used.
- `:confidence-interval`: A confidence interval for the true probability of success, calculated using the specified `:ci-method` and adjusted for the `:sides` parameter.
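A minimal sketch (assuming `fastmath.stats` is required as `stats`); testing whether 9 successes in 10 trials is consistent with a fair coin:

```clojure
(require '[fastmath.stats :as stats])

(-> (stats/binomial-test 9 10 {:p 0.5 :sides :two-sided})
    (select-keys [:p-value :estimate :confidence-interval]))
```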
bonett-seier-test
(bonett-seier-test xs)
(bonett-seier-test xs params)
(bonett-seier-test xs geary-kurtosis {:keys [sides], :or {sides :two-sided}})
Performs the Bonett-Seier test for normality based on Geary’s ‘g’ kurtosis measure.
This test assesses the null hypothesis that the data comes from a normally distributed population by checking if the sample Geary’s ‘g’ statistic significantly deviates from the value expected under normality (sqrt(2/pi)
).
Parameters:

- `xs` (seq of numbers): The sample data. Requires `(count xs) > 3` for the variance calculation.
- `geary-kurtosis` (double, optional): A pre-calculated Geary’s ‘g’ kurtosis value. If omitted, it is calculated from `xs`.
- `params` (map, optional): Options map:
  - `:sides` (keyword, default `:two-sided`): Specifies the alternative hypothesis regarding the deviation from normal kurtosis.
    - `:two-sided` (default): The population kurtosis (measured by ‘g’) is different from normal.
    - `:one-sided-greater`: The population is leptokurtic (‘g’ < sqrt(2/pi)); note that Geary’s ‘g’ decreases with peakedness.
    - `:one-sided-less`: The population is platykurtic (‘g’ > sqrt(2/pi)); note that Geary’s ‘g’ increases with flatness.
Returns a map containing:

- `:Z`: The final test statistic (approximately standard normal under H0).
- `:stat`: Alias for `:Z`.
- `:p-value`: The p-value associated with `Z` and the specified `:sides`.
- `:kurtosis`: The Geary’s ‘g’ kurtosis value used in the test.
- `:n`: The sample size.
- `:sides`: The alternative hypothesis side used.
References: - Bonett, D. G., & Seier, E. (2002). A test of normality with high uniform power. Computational Statistics & Data Analysis, 40(3), 435-445. (Provides theoretical basis)
See also kurtosis, kurtosis-test, normality-test, jarque-bera-test.
bootstrap DEPRECATED
Deprecated: Please use fastmath.stats.bootstrap/bootstrap instead
(bootstrap vs)
(bootstrap vs samples)
(bootstrap vs samples size)
Generate set of samples of given size from provided data.
Default samples
is 200, number of size
defaults to sample size.
bootstrap-ci DEPRECATED
Deprecated: Please use fastmath.stats.bootstrap/ci-basic instead
(bootstrap-ci vs)
(bootstrap-ci vs alpha)
(bootstrap-ci vs alpha samples)
(bootstrap-ci vs alpha samples stat-fn)
Bootstrap method to calculate a confidence interval.
`alpha` defaults to 0.98, `samples` to 1000. The last parameter is the statistic function used for the measurement (default: mean).
Returns the CI and the value of the statistic function.
box-cox-infer-lambda
(box-cox-infer-lambda xs)
(box-cox-infer-lambda xs lambda-range)
(box-cox-infer-lambda xs lambda-range opts)
Finds the optimal lambda (λ) parameter for the Box-Cox transformation of a dataset using the Maximum Likelihood Estimation (MLE) method.
The Box-Cox transformation is a family of power transformations often applied to positive data to make it more closely resemble a normal distribution and stabilize variance. This function estimates the lambda value that maximizes the log-likelihood function of the transformed data, assuming the transformed data is normally distributed.
Parameters:

- `xs` (sequence of numbers): The input numerical data sequence.
- `lambda-range` (vector of two numbers, optional): A pair `[min-lambda, max-lambda]` defining the closed interval within which the optimal lambda is searched. Defaults to `[-3.0, 3.0]`.
- `opts` (map, optional): Additional options affecting the data used for the likelihood calculation. These options are passed to the internal data preparation step. Key options include:
  - `:alpha` (double, default 0.0): A constant value added to `xs` before estimating lambda. This is often used when `xs` contains zero or negative values and the standard Box-Cox (which requires positive input) is desired, or to explore transformations around a shifted location.
  - `:negative?` (boolean, default `false`): If `true`, the likelihood is estimated based on the modified Box-Cox transformation (the Bickel and Doksum approach) suitable for negative values. The estimation process then works with the absolute values of the data shifted by `:alpha`.
Returns the estimated optimal lambda value as a double.
The inferred lambda value can then be used as the lambda
parameter for the box-cox-transformation function to apply the actual transformation to the dataset.
See also box-cox-transformation, yeo-johnson-infer-lambda, yeo-johnson-transformation.
box-cox-transformation
(box-cox-transformation xs)
(box-cox-transformation xs lambda)
(box-cox-transformation xs lambda {:keys [scaled? inverse?], :as opts})
Applies the Box-Cox transformation to data.
The Box-Cox transformation is a family of power transformations used to stabilize variance and make data more normally distributed.
Parameters:

- `xs` (seq of numbers): The input data.
- `lambda` (default `0.0`): The power parameter. If `nil` or a range `[lambda-min, lambda-max]`, `lambda` is inferred by maximum log-likelihood.
- Options map:
  - `alpha` (optional): A shift parameter applied before transformation.
  - `scaled?` (default `false`): Scale by the geometric mean (or by any other given number).
  - `negative?` (default `false`): Allow negative values.
  - `inverse?` (default `false`): Perform the inverse operation; in this case `lambda` can’t be inferred.
Returns transformed data.
Related: yeo-johnson-transformation
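A minimal sketch (assuming `fastmath.stats` is required as `stats`), inferring lambda first and then applying the transformation with it:

```clojure
(require '[fastmath.stats :as stats])

;; right-skewed, strictly positive data
(def skewed (map #(Math/exp %) [0.1 0.4 0.5 0.9 1.3 1.8 2.4 3.1]))

;; MLE estimate of lambda over the default [-3, 3] range
(def lambda (stats/box-cox-infer-lambda skewed))

(stats/box-cox-transformation skewed lambda)
```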
brown-forsythe-test
(brown-forsythe-test xss)
(brown-forsythe-test xss params)
Brown-Forsythe test for homogeneity of variances.
This test is a modification of Levene’s test, using the median instead of the mean for calculating the spread within each group. This makes the test more robust against non-normally distributed data.
Calls levene-test with :statistic
set to median. Accepts the same parameters as levene-test, except for :statistic
.
Parameters:

- `xss` (sequence of sequences): A collection of data groups.
- `params` (map, optional): Options map (see levene-test).
chisq-test
(chisq-test contingency-table-or-xs)
(chisq-test contingency-table-or-xs params)
Chi-squared test: a power divergence test with `lambda = 1.0`.
Performs a power divergence test, which encompasses several common statistical tests (Chi-squared, G-test / likelihood ratio, etc.) depending on the lambda parameter. This function can perform either a goodness-of-fit test or a test for independence in a contingency table.
Usage:

- Goodness-of-Fit (GOF):
  - Input: `observed-counts` (sequence of numbers) and `:p` (expected probabilities/weights).
  - Input: `data` (sequence of numbers) and `:p` (a distribution object). In this case, a histogram of `data` is created (controlled by `:bins`) and compared against the probability mass/density of the distribution in those bins.
- Test for Independence:
  - Input: `contingency-table` (2D sequence or map format). The `:p` option is ignored.
Options map:

- `:lambda` (double, default: `2/3`): Determines the specific test statistic. Common values:
  - `1.0`: Pearson Chi-squared test (chisq-test).
  - `0.0`: G-test / Multinomial Likelihood Ratio test (multinomial-likelihood-ratio-test).
  - `-0.5`: Freeman-Tukey test (freeman-tukey-test).
  - `-1.0`: Minimum Discrimination Information test (minimum-discrimination-information-test).
  - `-2.0`: Neyman Modified Chi-squared test (neyman-modified-chisq-test).
  - `2/3`: Cressie-Read test (default, cressie-read-test).
- `:p` (seq of numbers or distribution): Expected probabilities/weights (for GOF with counts) or a `fastmath.random` distribution object (for GOF with data). Ignored for independence tests.
- `:alpha` (double, default: `0.05`): Significance level for confidence intervals.
- `:ci-sides` (keyword, default: `:two-sided`): Sides for bootstrap confidence intervals (`:two-sided`, `:one-sided-greater`, `:one-sided-less`).
- `:sides` (keyword, default: `:one-sided-greater`): Alternative hypothesis side for the p-value calculation against the Chi-squared distribution (`:one-sided-greater`, `:one-sided-less`, `:two-sided`).
- `:bootstrap-samples` (long, default: `1000`): Number of bootstrap samples for confidence interval estimation.
- `:ddof` (long, default: `0`): Delta degrees of freedom, subtracted from the calculated degrees of freedom.
- `:bins` (number, keyword, or seq): Used only for a GOF test against a distribution. Specifies the number of bins, an estimation method (see histogram), or explicit bin edges for histogram creation.
Returns a map containing:

- `:stat`: The calculated power divergence test statistic.
- `:chi2`: Alias for `:stat`.
- `:df`: Degrees of freedom for the test.
- `:p-value`: The p-value associated with the test statistic.
- `:n`: Total number of observations.
- `:estimate`: Observed proportions.
- `:expected`: Expected counts or proportions under the null hypothesis.
- `:confidence-interval`: Bootstrap confidence intervals for the observed proportions.
- `:lambda`, `:alpha`, `:sides`, `:ci-sides`: Input options used.
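A minimal sketch (assuming `fastmath.stats` is required as `stats`); a 2D sequence is treated as a contingency table, so this runs a test for independence:

```clojure
(require '[fastmath.stats :as stats])

;; rows of a 2x2 contingency table
(-> (stats/chisq-test [[10 20] [30 40]])
    (select-keys [:chi2 :df :p-value]))
```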
ci
(ci vs)
(ci vs alpha)
Student’s t-based confidence interval for the given data. `alpha` defaults to 0.05.
The last returned value is the mean.
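A minimal sketch (assuming `fastmath.stats` is required as `stats`):

```clojure
(require '[fastmath.stats :as stats])

;; [lower upper mean] at the default 95% confidence level
(stats/ci [10 12 11 14 9 13 10 12])
```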
cliffs-delta
(cliffs-delta [group1 group2])
(cliffs-delta group1 group2)
Calculates Cliff’s Delta (δ), a non-parametric effect size measure for assessing the difference between two groups of ordinal or continuous data.
Cliff’s Delta quantifies the degree of overlap between two distributions. It represents the probability that a randomly chosen value from the first group is greater than a randomly chosen value from the second group, minus the reverse probability.
Parameters:

- `group1` (seq of numbers): The first sample.
- `group2` (seq of numbers): The second sample.
Returns the calculated Cliff’s Delta value as a double.
Interpretation:

- A value of +1 indicates complete separation where every value in `group1` is greater than every value in `group2`.
- A value of -1 indicates complete separation where every value in `group2` is greater than every value in `group1`.
- A value of 0 indicates complete overlap between the distributions.
- Values between -1 and 1 indicate varying degrees of overlap. Commonly cited guidelines for the effect size are: |δ| < 0.147 (negligible), 0.147 ≤ |δ| < 0.33 (small), 0.33 ≤ |δ| < 0.474 (medium), |δ| ≥ 0.474 (large).

Cliff’s Delta is a robust measure, suitable for ordinal data or when the assumptions of parametric tests (such as normality or equal variances) are violated. It is closely related to wmw-odds (Wilcoxon-Mann-Whitney odds) and ameasure (Vargha-Delaney A).
See also wmw-odds, ameasure, cohens-d, glass-delta.
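A minimal sketch (assuming `fastmath.stats` is required as `stats`); the endpoints of the scale follow directly from the interpretation above:

```clojure
(require '[fastmath.stats :as stats])

;; complete separation, group2 larger
(stats/cliffs-delta [1 2 3] [4 5 6]) ;; => -1.0

;; identical samples, complete overlap
(stats/cliffs-delta [1 2 3] [1 2 3]) ;; => 0.0
```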
coefficient-matrix
(coefficient-matrix vss)
(coefficient-matrix vss measure-fn)
(coefficient-matrix vss measure-fn symmetric?)
Generates a matrix of pairwise coefficients from a sequence of sequences.
This function calculates a matrix where the element at row i
and column j
is the result of applying the provided measure-fn
to the i
-th sequence and the j
-th sequence from the input vss
.
Parameters:

- `vss` (sequence of sequences of numbers): The collection of data sequences. Each inner sequence is treated as a variable or set of observations. All inner sequences should ideally have the same length if the `measure-fn` expects it.
- `measure-fn` (function, optional): A function of two arguments (sequences) that returns a double representing the coefficient or measure between them. Defaults to pearson-correlation.
- `symmetric?` (boolean, optional): If `true`, the function assumes that `(measure-fn a b)` equals `(measure-fn b a)`; it calculates the upper (or lower) triangle of the matrix and mirrors the values to the other side. This is an optimization for symmetric measures such as correlation and covariance. If `false` (default), all pairwise combinations `(i, j)` are calculated independently.
Returns a sequence of sequences (a matrix) of doubles.
Note: While this function’s symmetric?
parameter defaults to false
, convenience functions like correlation-matrix and covariance-matrix wrap this function and explicitly set symmetric?
to true
as their respective measures are symmetric.
See also correlation-matrix, covariance-matrix.
cohens-d
(cohens-d [group1 group2])
(cohens-d group1 group2)
(cohens-d group1 group2 method)
Calculate Cohen’s d effect size between two groups.
Cohen’s d is a standardized measure used to quantify the magnitude of the difference between the means of two independent groups. It expresses the mean difference in terms of standard deviation units.
The most common formula for Cohen’s d is:
d = (mean(group1) - mean(group2)) / pooled_stddev
where pooled_stddev
is the pooled standard deviation of the two groups, calculated under the assumption of equal variances.
Parameters:

- `group1` (seq of numbers): The first independent sample.
- `group2` (seq of numbers): The second independent sample.
- `method` (optional keyword): Specifies the method for calculating the pooled standard deviation, affecting the denominator of the formula. Possible values are `:unbiased` (default), `:biased`, or `:avg`. See pooled-stddev for details on these methods.
Returns the calculated Cohen’s d effect size as a double.
Interpretation guidelines (approximate for normal distributions): - |d| = 0.2: small effect - |d| = 0.5: medium effect - |d| = 0.8: large effect
Assumptions: - The two samples are independent. - Data within each group are approximately normally distributed. - The choice of :method
implies assumptions about equal variances (default :unbiased
and :biased
assume equal variances, while :avg
does not but might be less standard).
See also hedges-g (a version bias-corrected for small sample sizes), glass-delta (an alternative effect size measure using the control group standard deviation), pooled-stddev.
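A minimal sketch (assuming `fastmath.stats` is required as `stats`); the sign follows the direction of `mean(group1) - mean(group2)`:

```clojure
(require '[fastmath.stats :as stats])

;; means differ by 2, pooled sd is about 1.29,
;; so d is roughly -1.55 (a large effect, group1 below group2)
(stats/cohens-d [1 2 3 4] [3 4 5 6])
```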
cohens-d-corrected
(cohens-d-corrected [group1 group2])
(cohens-d-corrected group1 group2)
(cohens-d-corrected group1 group2 method)
Calculates Cohen’s d effect size corrected for bias in small sample sizes.
This function applies a correction factor (derived from the gamma function) to Cohen’s d (cohens-d) to provide a less biased estimate of the population effect size when sample sizes are small. This corrected measure is sometimes referred to as Hedges’ g, though this function specifically implements the correction applied to Cohen’s d.
The correction factor is (1 - 3 / (4 * df - 1))
where df
is the degrees of freedom used in the standard Cohen’s d calculation.
Parameters:

- `group1` (seq of numbers): The first independent sample.
- `group2` (seq of numbers): The second independent sample.
- `method` (optional keyword): Specifies the method for calculating the pooled standard deviation, affecting the denominator of the formula (passed to cohens-d). Possible values are `:unbiased` (default), `:biased`, or `:avg`. See pooled-stddev for details on these methods.
Returns the calculated bias-corrected Cohen’s d effect size as a double.
Note: While this function is named cohens-d-corrected
, Hedges’ g (calculated by hedges-g-corrected) also applies a similar small-sample bias correction. Differences might exist based on the specific correction formula or degree of freedom definition used. This function uses (count group1) + (count group2) - 2
as the degrees of freedom for the correction by default (when :unbiased
method is used for cohens-d
).
See also cohens-d, hedges-g, hedges-g-corrected.
cohens-f
(cohens-f [group1 group2])
(cohens-f group1 group2)
(cohens-f group1 group2 type)
Calculates Cohen’s f, a measure of effect size derived as the square root of Cohen’s f² (cohens-f2).
Cohen’s f is a standardized measure quantifying the magnitude of an effect, often used in the context of ANOVA or regression. It is the square root of the ratio of the variance explained by the effect to the unexplained variance.
Parameters:

- `group1` (seq of numbers): The dependent variable.
- `group2` (seq of numbers): The independent variable (or predictor). Must have the same length as `group1`.
- `type` (keyword, optional): Specifies the measure of ‘Proportion of Variance Explained’ used in the underlying cohens-f2 calculation. Defaults to `:eta`.
  - `:eta` (default): Uses Eta-squared (sample R²), a measure of variance explained in the sample.
  - `:omega`: Uses Omega-squared, a less biased estimate of variance explained in the population.
  - `:epsilon`: Uses Epsilon-squared, another less biased estimate of variance explained in the population.
  - Any function: A function accepting `group1` and `group2` and returning a double representing the proportion of variance explained.
Returns the calculated Cohen’s f effect size as a double. Values range from 0 upwards.
Interpretation:

- Values are positive. Larger values indicate a stronger effect (more variance in `group1` explained by `group2`).
- Cohen’s guidelines for interpreting the magnitude of f² (and by extension, f) are:
  - \(f = 0.10\) (approx. \(f^2 = 0.01\)): small effect
  - \(f = 0.25\) (approx. \(f^2 = 0.0625\)): medium effect
  - \(f = 0.40\) (approx. \(f^2 = 0.16\)): large effect

(Guidelines are often quoted for f²; interpret f as \(\sqrt{f^2}\).)
See also cohens-f2, eta-sq, omega-sq, epsilon-sq.
cohens-f2
(cohens-f2 [group1 group2])
(cohens-f2 group1 group2)
(cohens-f2 group1 group2 type)
Calculates Cohen’s f², a measure of effect size often used in ANOVA or regression.
Cohen’s f² quantifies the magnitude of the effect of an independent variable or set of predictors on a dependent variable, expressed as the ratio of the variance explained by the effect to the unexplained variance.
This function allows calculating f² using different measures for the ‘Proportion of Variance Explained’, specified by the `type` parameter:

- `:eta` (default): Uses eta-sq (Eta-squared), which in this implementation is equivalent to the sample \(R^2\) from a linear regression of `group1` on `group2`. This is a measure of the proportion of variance explained in the sample.
- `:omega`: Uses omega-sq (Omega-squared), a less biased estimate of the proportion of variance explained in the population.
- `:epsilon`: Uses epsilon-sq (Epsilon-squared), another less biased estimate of the proportion of variance explained in the population, similar to adjusted \(R^2\).
- Any function: A function accepting `group1` and `group2` and returning a double representing the proportion of variance explained.
Parameters:

- `group1` (seq of numbers): The dependent variable.
- `group2` (seq of numbers): The independent variable (or predictor). Must have the same length as `group1`.
- `type` (keyword, optional): Specifies the measure of ‘Proportion of Variance Explained’ to use (`:eta`, `:omega`, `:epsilon`, or any function). Defaults to `:eta`.
Returns the calculated Cohen’s f² effect size as a double. Values range from 0 upwards.
Interpretation Guidelines (approximate, often used for F-tests in ANOVA/regression): - \(f^2 = 0.02\): small effect - \(f^2 = 0.15\): medium effect - \(f^2 = 0.35\): large effect
See also cohens-f, eta-sq, omega-sq, epsilon-sq.
cohens-kappa
(cohens-kappa group1 group2)
(cohens-kappa contingency-table)
Calculates Cohen’s Kappa coefficient (κ), a statistic that measures inter-rater agreement for categorical items, while correcting for chance agreement.
It is often used to assess the consistency of agreement between two raters or methods. Its value typically ranges from -1 to +1:

- `κ = 1`: Perfect agreement.
- `κ = 0`: Agreement is no better than chance.
- `κ < 0`: Agreement is worse than chance.
The function can be called in two ways:

- With two sequences `group1` and `group2`: The function automatically constructs a 2x2 contingency table from the unique values in the sequences (assuming they represent two binary variables). The mapping of values to table cells (e.g., what corresponds to TP, TN, FP, FN) depends on how contingency-table orders the unique values. For direct control over which cell is which, use the contingency table input.
- With a contingency table, provided as:
  - A map where keys are `[row-index, column-index]` tuples and values are counts (e.g., `{[0 0] TP, [0 1] FP, [1 0] FN, [1 1] TN}`). This is the output format of contingency-table with two inputs. The mapping of indices to TP/TN/FP/FN depends on the order of unique values in the original data if generated by contingency-table, or on the explicit structure if created manually or via rows->contingency-table. The standard convention maps `[0 0]` to TP, `[0 1]` to FP, `[1 0]` to FN, and `[1 1]` to TN for binary outcomes.
  - A sequence of sequences representing the rows of the table (e.g., `[[TP FP] [FN TN]]`). This is equivalent to rows->contingency-table.
Parameters:

- `group1` (sequence): The first sequence of binary outcomes/categories.
- `group2` (sequence): The second sequence of binary outcomes/categories. Must have the same length as `group1`.
- `contingency-table` (map or sequence of sequences): A pre-computed 2x2 contingency table. The cell values should represent counts (e.g., TP, FN, FP, TN).
Returns the calculated Cohen’s Kappa coefficient as a double.
See also weighted-kappa (for ordinal data with partial agreement), contingency-table, contingency-2x2-measures, binary-measures-all.
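A minimal sketch (assuming `fastmath.stats` is required as `stats`), using the rows-of-counts form:

```clojure
(require '[fastmath.stats :as stats])

;; two raters agree on 20 + 15 = 35 of 50 items;
;; kappa corrects that raw agreement for chance
(stats/cohens-kappa [[20 5] [10 15]])
```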
cohens-q
(cohens-q r1 r2)
(cohens-q group1 group2a group2b)
(cohens-q group1a group2a group1b group2b)
Compares two correlation coefficients by calculating the difference between their Fisher z-transformations.
The Fisher z-transformation (atanh
) of a correlation coefficient r
helps normalize the sampling distribution of correlation coefficients. The difference between two z’-transformed correlations is often used as a test statistic.
The function supports comparing correlations in different scenarios via its arities:

- `(cohens-q r1 r2)`: Calculates the difference between the Fisher z-transformations of two correlation values `r1` and `r2` provided directly. Typically used when comparing two independent correlation coefficients (e.g., correlations from two separate studies). Returns `atanh(r1) - atanh(r2)`.
  - `r1`, `r2` (double): Correlation coefficient values (-1.0 to 1.0).
- `(cohens-q group1 group2a group2b)`: Calculates the difference between the correlation of `group1` with `group2a` and the correlation of `group1` with `group2b`. Commonly used for comparing dependent correlations (where `group1` is a common variable). Calculates `atanh(pearson-correlation(group1, group2a)) - atanh(pearson-correlation(group1, group2b))`.
  - `group1`, `group2a`, `group2b` (sequences): Data sequences from which Pearson correlations are computed.
- `(cohens-q group1a group2a group1b group2b)`: Calculates the difference between the correlation of `group1a` with `group2a` and the correlation of `group1b` with `group2b`. Typically used for comparing two independent correlations obtained from two distinct pairs of variables (all four sequences are independent). Calculates `atanh(pearson-correlation(group1a, group2a)) - atanh(pearson-correlation(group1b, group2b))`.
  - `group1a`, `group2a`, `group1b`, `group2b` (sequences): Data sequences from which Pearson correlations are computed.
Returns the difference between the Fisher z-transformed correlation values as a double.
Note: For comparing dependent correlations (3-arity case), standard statistical tests (e.g., Steiger’s test) are more complex than a simple difference of z-transforms and involve the correlation between group2a
and group2b
. This function provides the basic difference value.
cohens-u1
(cohens-u1 [group1 group2])
(cohens-u1 group1 group2)
Calculates a non-parametric measure of difference or separation between two samples.
This function computes a value derived from cohens-u2, which internally quantifies a minimal difference between corresponding quantiles of the two empirical distributions.
Parameters:

- `group1` (seq of numbers): The first sample.
- `group2` (seq of numbers): The second sample.
Returns the calculated measure as a double.
Interpretation:
- Values close to -1 indicate high similarity or maximum overlap between the distributions (as the minimal difference between quantiles approaches zero).
- Increasing values indicate greater difference or separation between the distributions (as the minimal difference between quantiles is larger).
This measure is symmetric, meaning the order of group1
and group2
does not affect the result. It is a non-parametric measure applicable to any data samples.
See also cohens-u2 (the measure this calculation is based on), cohens-u3 (related non-parametric measure), cohens-u1-normal (the version applicable to normal data).
cohens-u1-normal
(cohens-u1-normal group1 group2)
(cohens-u1-normal group1 group2 method)
(cohens-u1-normal d)
Calculates Cohen’s U1, a measure of non-overlap between two distributions assumed to be normal with equal variances.
Cohen’s U1 quantifies the proportion of non-overlap between the two distributions. A U1 of 0 means complete overlap (the distributions are identical), while a U1 of 1 means no overlap.
This measure is calculated directly from Cohen’s d statistic (cohens-d) assuming normal distributions and equal variances.
Parameters:

- `group1` (seq of numbers): The first sample.
- `group2` (seq of numbers): The second sample.
- `method` (optional keyword): Specifies the method for calculating the pooled standard deviation used in the underlying cohens-d calculation. Possible values are `:unbiased` (default), `:biased`, or `:avg`. See pooled-stddev for details.
- `d` (double): A pre-calculated Cohen’s d value. If provided, `group1`, `group2`, and `method` are ignored.
Returns the calculated Cohen’s U1 as a double [0, 1].
Assumptions: - Both samples are drawn from normally distributed populations. - The populations have equal variances (homoscedasticity).
See also cohens-d, cohens-u2-normal, cohens-u3-normal, p-overlap (a non-parametric overlap measure).
cohens-u2
(cohens-u2 [group1 group2])
(cohens-u2 group1 group2)
Calculates a measure of overlap between two samples, referred to as Cohen’s U2.
This function quantifies the degree to which the distributions of `group1` and `group2` overlap. It is related to comparing values at corresponding percentile levels across the two groups, or the proportion of values in one group that fall below the median of the other. A value near 0.5 indicates complete overlap, while values approaching 0 or 1 indicate separation between the distributions.
The measure is symmetric, meaning (cohens-u2 group1 group2)
is equal to (cohens-u2 group2 group1)
.
This is a non-parametric measure, suitable for any data samples, and does not assume normality, unlike cohens-u2-normal.
Parameters:

- `group1`, `group2` (sequences): The two samples, passed directly as arguments.
Returns the calculated Cohen’s U2 value as a double. The value typically ranges from 0 to 1. A value closer to 0.5 indicates substantial overlap between the distributions (e.g., the median of one group is near the median of the other); values closer to 0 or 1 indicate less overlap (greater separation between the distributions).
cohens-u2-normal
(cohens-u2-normal group1 group2)
(cohens-u2-normal group1 group2 method)
(cohens-u2-normal d)
Calculates Cohen’s U2, a measure of overlap between two distributions assumed to be normal with equal variances.
Cohen’s U2 quantifies the proportion of scores in the lower-scoring group that are below the point located halfway between the means of the two groups (or equivalently, the proportion of scores in the higher-scoring group that are above this halfway point). This measure is calculated from Cohen’s d statistic (cohens-d) using the standard normal cumulative distribution function (\(\Phi\)): \(\Phi(0.5 |d|)\).
Parameters:

- `group1` (seq of numbers): The first sample.
- `group2` (seq of numbers): The second sample.
- `method` (optional keyword): Specifies the method for calculating the pooled standard deviation used in the underlying cohens-d calculation. Possible values are `:unbiased` (default), `:biased`, or `:avg`. See pooled-stddev for details.
- `d` (double): A pre-calculated Cohen’s d value. If provided, `group1`, `group2`, and `method` are ignored.
Returns the calculated Cohen’s U2 as a double [0.0, 1.0]. A value closer to 0.5 indicates greater overlap between the distributions; values closer to 0 or 1 indicate less overlap.
Assumptions: - Both samples are drawn from normally distributed populations. - The populations have equal variances (homoscedasticity).
See also cohens-d, cohens-u1-normal, cohens-u3-normal, p-overlap (a non-parametric overlap measure).
cohens-u3
(cohens-u3 [group1 group2])
(cohens-u3 group1 group2)
(cohens-u3 group1 group2 estimation-strategy)
Calculates Cohen’s U3 for two samples.
In this implementation, Cohen’s U3 is defined as the proportion of values in the second sample (group2
) that are less than the median of the first sample (group1
).
Parameters:

- `group1` (seq of numbers): The first sample. The median of this sample is used as the threshold.
- `group2` (seq of numbers): The second sample. Values from this sample are counted if they fall below the median of `group1`.
- `estimation-strategy` (optional keyword): The strategy used to estimate the median of `group1`. Defaults to `:legacy`. See median or quantile for available strategies (e.g., `:r1` through `:r9`).
Returns the calculated proportion as a double between 0.0 and 1.0.
Interpretation:

- A value close to 0 means most values in `group2` are greater than or equal to the median of `group1`.
- A value close to 0.5 means approximately half the values in `group2` are below the median of `group1`.
- A value close to 1 means most values in `group2` are less than the median of `group1`.
Note: This measure is not symmetric. (cohens-u3 group1 group2)
is generally not equal to (cohens-u3 group2 group1)
.
This is a non-parametric measure, suitable for any data samples, and does not assume normality, unlike cohens-u3-normal.
See also cohens-u3-normal (the version applicable to normal data), cohens-u2 (a related symmetric non-parametric measure), median, quantile.
cohens-u3-normal
(cohens-u3-normal group1 group2)
(cohens-u3-normal group1 group2 method)
(cohens-u3-normal d)
Calculates Cohen’s U3, a measure of overlap between two distributions assumed to be normal with equal variances.
Cohen’s U3 quantifies the proportion of scores in the lower-scoring group that fall below the mean of the higher-scoring group. It is calculated from Cohen’s d statistic (cohens-d) using the standard normal cumulative distribution function (\(\Phi\)): U3 = Φ(d)
.
The measure is asymmetric: U3(group1, group2)
is not necessarily equal to U3(group2, group1)
. The interpretation depends on which group is considered the ‘higher-scoring’ one based on the sign of d. By convention, the result often represents the proportion of the first group (group1
) that is below the mean of the second group (group2
) if d is negative, or the proportion of the second group (group2
) that is below the mean of the first group (group1
) if d is positive.
Parameters:

- `group1` (seq of numbers): The first sample.
- `group2` (seq of numbers): The second sample.
- `method` (optional keyword): Specifies the method for calculating the pooled standard deviation used in the underlying cohens-d calculation. Possible values are `:unbiased` (default), `:biased`, or `:avg`. See pooled-stddev for details.
- `d` (double): A pre-calculated Cohen’s d value. If provided, `group1`, `group2`, and `method` are ignored.
Returns the calculated Cohen’s U3 as a double [0.0, 1.0]. A value close to 0.5 suggests significant overlap. Values closer to 0 or 1 suggest less overlap (greater separation between the means).
Assumptions: - Both samples are drawn from normally distributed populations. - The populations have equal variances (homoscedasticity).
See also cohens-d, cohens-u1-normal, cohens-u2-normal, p-overlap (a non-parametric overlap measure).
cohens-w
(cohens-w group1 group2)
(cohens-w contingency-table)
Calculates Cohen’s W effect size for the association between two nominal variables represented in a contingency table.
Cohen’s W is a measure of association derived from the Pearson’s Chi-squared statistic. It quantifies the magnitude of the difference between the observed frequencies and the frequencies expected under the assumption of independence between the variables.
Its value ranges from 0 upwards: - A value of 0 indicates no association between the variables. - Larger values indicate a stronger association.
The function can be called in two ways:

- With two sequences `group1` and `group2`: The function automatically constructs a contingency table from the unique values in the sequences.
- With a contingency table, provided as:
  - A map where keys are `[row-index, column-index]` tuples and values are counts (e.g., `{[0 0] 10, [0 1] 5, [1 0] 3, [1 1] 12}`). This is the output format of contingency-table with two inputs.
  - A sequence of sequences representing the rows of the table (e.g., `[[10 5] [3 12]]`). This is equivalent to rows->contingency-table.
Parameters:

- `group1` (sequence): The first sequence of categorical data.
- `group2` (sequence): The second sequence of categorical data. Must have the same length as `group1`.
- `contingency-table` (map or sequence of sequences): A pre-computed contingency table.
Returns the calculated Cohen’s W coefficient as a double.
See also chisq-test, cramers-v, cramers-c, tschuprows-t, contingency-table.
confusion-matrix
(confusion-matrix tp fn fp tn)
(confusion-matrix confusion-mat)
(confusion-matrix actual prediction)
(confusion-matrix actual prediction encode-true)
Creates a 2x2 confusion matrix for binary classification.
A confusion matrix summarizes the results of a binary classification problem, showing the counts of True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN).
- TP: Actual is True, Predicted is True
- FP: Actual is False, Predicted is True (Type I error)
- FN: Actual is True, Predicted is False (Type II error)
- TN: Actual is False, Predicted is False
The function supports several input formats:

- `(confusion-matrix tp fn fp tn)`: Direct input of the four counts.
  - `tp` (long): True Positive count.
  - `fn` (long): False Negative count.
  - `fp` (long): False Positive count.
  - `tn` (long): True Negative count.
- `(confusion-matrix confusion-matrix-representation)`: Input as a structured representation, which can be:
  - A map with keys like `:tp`, `:fn`, `:fp`, `:tn` (e.g., `{:tp 10 :fn 2 :fp 5 :tn 80}`).
  - A sequence of sequences representing rows `[[TP FP] [FN TN]]` (e.g., `[[10 5] [2 80]]`).
  - A flat sequence `[TP FN FP TN]` (e.g., `[10 2 5 80]`).
- `(confusion-matrix actual prediction)`: Input as two sequences of outcomes.
  - `actual` (sequence): Sequence of true outcomes.
  - `prediction` (sequence): Sequence of predicted outcomes. Must have the same length as `actual`. Values in `actual` and `prediction` are compared element-wise. By default, any non-`nil` or non-zero value is treated as `true`, and `nil` or `0.0` is treated as `false`.
- `(confusion-matrix actual prediction encode-true)`: Input as two sequences with a specified encoding for `true`.
  - `actual`, `prediction`: Sequences as in the previous arity.
  - `encode-true`: Specifies how values in `actual` and `prediction` are converted to boolean `true` or `false`.
    - `nil` (default): Non-`nil`/non-zero is true.
    - Any sequence/set: Values found in this collection are true.
    - A map: Values are mapped according to the map; if a key is not found or maps to `false`, the value is false.
    - A predicate function: Returns `true` if the value satisfies the predicate.
Returns a map with keys `:tp`, `:fn`, `:fp`, and `:tn` representing the counts.
This function is commonly used to prepare input for binary classification metrics like those provided by binary-measures-all and binary-measures.
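A minimal sketch (assuming `fastmath.stats` is required as `stats`), using the two-sequences form with the default truthiness encoding (1 is true, 0 is false):

```clojure
(require '[fastmath.stats :as stats])

;; actual vs predicted binary outcomes
(stats/confusion-matrix [1 1 0 0 1] [1 0 0 1 1])
;; => {:tp 2, :fn 1, :fp 1, :tn 1}
```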
contingency-2x2-measures
(contingency-2x2-measures & args)
Calculates a subset of common statistics and measures for a 2x2 contingency table.
This function provides a selection of the most frequently used measures from the more comprehensive contingency-2x2-measures-all.
The function accepts the same input formats as contingency-2x2-measures-all:

- `(contingency-2x2-measures a b c d)`: Takes the four counts as arguments.
- `(contingency-2x2-measures [a b c d])`: Takes a sequence of the four counts.
- `(contingency-2x2-measures [[a b] [c d]])`: Takes a sequence of sequences representing the rows.
- `(contingency-2x2-measures {:a a :b b :c c :d d})`: Takes a map of counts (accepts `:a`/`:b`/`:c`/`:d` keys).
Parameters:

- `a, b, c, d` (long): Counts in the 2x2 table cells.
- `map-or-seq` (map or sequence): A representation of the 2x2 table.
Returns a map containing a selection of measures:

- `:OR`: Odds Ratio
- `:chi2`: Pearson’s Chi-squared statistic
- `:yates`: Yates’ continuity-corrected Chi-squared statistic
- `:cochran-mantel-haenszel`: Cochran-Mantel-Haenszel statistic
- `:cohens-kappa`: Cohen’s Kappa coefficient
- `:yules-q`: Yule’s Q measure of association
- `:holley-guilfords-g`: Holley-Guilford’s G measure
- `:huberts-gamma`: Hubert’s Gamma measure
- `:yules-y`: Yule’s Y measure of association
- `:cramers-v`: Cramer’s V measure of association
- `:phi`: Phi coefficient (Matthews Correlation Coefficient)
- `:scotts-pi`: Scott’s Pi measure of agreement
- `:cohens-h`: Cohen’s H measure
- `:PCC`: Pearson’s Contingency Coefficient
- `:PCC-adjusted`: Adjusted Pearson’s Contingency Coefficient
- `:TCC`: Tschuprow’s Contingency Coefficient
- `:F1`: F1 Score
- `:bangdiwalas-b`: Bangdiwala’s B statistic
- `:mcnemars-chi2`: McNemar’s Chi-squared test statistic
- `:gwets-ac1`: Gwet’s AC1 measure
For a more comprehensive set of 2x2 measures and their detailed descriptions, see contingency-2x2-measures-all.
contingency-2x2-measures-all
(contingency-2x2-measures-all a b c d)
(contingency-2x2-measures-all map-or-seq)
(contingency-2x2-measures-all [a b] [c d])
Calculates a comprehensive set of statistics and measures for a 2x2 contingency table.
A 2x2 contingency table cross-tabulates two categorical variables, each with two levels. The table counts are typically represented as:
    +---+---+
    | a | b |
    +---+---+
    | c | d |
    +---+---+

Where `a, b, c, d` are the counts in the respective cells.
This function calculates numerous measures, including:
- Chi-squared statistics (Pearson, Yates’ corrected, CMH) and their p-values.
- Measures of association (Phi, Yule’s Q, Holley-Guilford’s G, Hubert’s Gamma, Yule’s Y, Cramer’s V, Scott’s Pi, Cohen’s H, Pearson/Tschuprow’s CC).
- Measures of agreement (Cohen’s Kappa).
- Risk and effect size measures (Odds Ratio (OR), Relative Risk (RR), Risk Difference (RD), NNT, etc.).
- Table marginals and proportions.
The function can be called with the four counts directly or with a representation of the contingency table:

- `(contingency-2x2-measures-all a b c d)`: Takes the four counts as arguments.
- `(contingency-2x2-measures-all [a b c d])`: Takes a sequence of the four counts.
- `(contingency-2x2-measures-all [[a b] [c d]])`: Takes a sequence of sequences representing the rows.
- `(contingency-2x2-measures-all {:a a :b b :c c :d d})`: Takes a map of counts (accepts `:a`/`:b`/`:c`/`:d` keys).
Parameters:

- `a` (long): Count in the top-left cell.
- `b` (long): Count in the top-right cell.
- `c` (long): Count in the bottom-left cell.
- `d` (long): Count in the bottom-right cell.
- `map-or-seq` (map or sequence): A representation of the 2x2 table as described above.
Returns a map containing a wide range of calculated statistics. Keys include: `:n`, `:table`, `:expected`, `:marginals`, `:proportions`, `:p-values` (map), `:OR`, `:lOR`, `:RR`, `:risk` (map), `:SE`, `:measures` (map).
See also contingency-2x2-measures for a selected subset of these measures, mcc for the Matthews Correlation Coefficient (Phi), and binary-measures-all for metrics derived from a confusion matrix (often a 2x2 table in binary classification).
contingency-table
(contingency-table & seqs)
Creates a frequency map (contingency table) from one or more sequences.
If one sequence xs
is provided, it returns a simple frequency map of the values in xs
.
If multiple sequences s1, s2, ..., sn
are provided, it creates a contingency table of the tuples formed by the corresponding elements [s1_i, s2_i, ..., sn_i]
at each index i
. The returned map keys are these tuples, and values are their frequencies.
Parameters:
seqs
(one or more sequences): The input sequences. All sequences should ideally have the same length, as elements are paired by index.
Returns a map where keys represent unique combinations of values (or single values if only one sequence is input) and values are the counts of these combinations.
See also rows->contingency-table, contingency-table->marginals.
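For instance (a sketch; output shapes follow the description above):

(stats/contingency-table [:a :b :a :a])
;; => {:a 3, :b 1}
(stats/contingency-table [:a :a :b] [0 1 0])
;; => {[:a 0] 1, [:a 1] 1, [:b 0] 1} - keys are tuples paired by index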
contingency-table->marginals
(contingency-table->marginals ct)
Calculates marginal sums (row and column totals) and the grand total from a contingency table.
A contingency table represents the frequency distribution of observations for two or more categorical variables. This function summarizes these frequencies along the rows and columns.
The function accepts two main input formats for the contingency table:
- A map where keys are [row-index, column-index] tuples and values are counts (e.g., {[0 0] 10, [0 1] 5, [1 0] 3, [1 1] 12}). This format is produced by contingency-table when given multiple sequences or by rows->contingency-table.
- A sequence of sequences representing the rows of the table, where each inner sequence contains counts for the columns in that row (e.g., [[10 5] [3 12]]). The function internally converts this format to the map format.
Parameters:
ct
(map or sequence of sequences): The contingency table input.
Returns a map containing:
:rows : A sequence of [row-index, row-total] pairs.
:cols : A sequence of [column-index, column-total] pairs.
:n : The grand total of all counts in the table.
:diag : A sequence of [[index, index], count] pairs for cells on the diagonal (where row index equals column index). This is useful for square tables like confusion matrices.
See also contingency-table, rows->contingency-table.
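A quick sketch; the totals are easy to verify by hand, though the exact output shape may differ slightly:

(stats/contingency-table->marginals [[10 5] [3 12]])
;; row totals 15 and 15, column totals 13 and 17, :n 30, diagonal counts 10 and 12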
correlation
(correlation [vs1 vs2])
(correlation vs1 vs2)
Calculates the correlation coefficient between two sequences.
By default, this function calculates the Pearson product-moment correlation coefficient, which measures the linear relationship between two datasets.
This function handles the standard deviation normalization based on whether the inputs vs1
and vs2
are treated as samples or populations (it uses sample standard deviation derived from variance).
Parameters:
[vs1 vs2] (sequence of two sequences): A sequence containing the two sequences of numbers.
vs1, vs2 (sequences): The two sequences of numbers directly as arguments.
Both sequences must have the same length.
Returns the calculated correlation coefficient (a value between -1.0 and 1.0) as a double. Returns NaN
if one or both sequences have zero variance (are constant).
See also covariance, pearson-correlation, spearman-correlation, kendall-correlation, correlation-matrix.
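For example, a perfectly linear relationship yields 1.0:

(stats/correlation [1 2 3 4] [2 4 6 8])
;; => 1.0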
correlation-matrix
(correlation-matrix vss)
(correlation-matrix vss measure)
Generates a matrix of pairwise correlation coefficients from a sequence of sequences.
Given a collection of data sequences vss
, where each inner sequence represents a variable, this function calculates a square matrix where the element at row i
and column j
is the correlation coefficient between the i
-th and j
-th sequences in vss
.
Parameters:
vss (sequence of sequences of numbers): The collection of data sequences. Each inner sequence is treated as a variable. All inner sequences must have the same length.
measure (keyword, optional): Specifies the type of correlation coefficient to calculate. Defaults to :pearson.
  :pearson (default): Calculates the Pearson product-moment correlation coefficient.
  :kendall : Calculates Kendall’s Tau rank correlation coefficient.
  :spearman : Calculates Spearman’s rank correlation coefficient.
Returns a sequence of sequences (a matrix) of doubles representing the correlation matrix. The matrix is symmetric, as correlation is a symmetric measure.
See also pearson-correlation, spearman-correlation, kendall-correlation, covariance-matrix, coefficient-matrix.
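A short sketch on three mtcars columns, using the ds dataset helper:

(stats/correlation-matrix [(ds/mtcars :mpg) (ds/mtcars :wt) (ds/mtcars :hp)])           ;; Pearson by default
(stats/correlation-matrix [(ds/mtcars :mpg) (ds/mtcars :wt) (ds/mtcars :hp)] :spearman) ;; rank-based variant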
count=
(count= [vs1 vs2-or-val])
(count= vs1 vs2-or-val)
Count equal values in both sequences; equivalent to the L0 distance.
Calculates the number of pairs of corresponding elements that are equal between two sequences, or between a sequence and a single scalar value.
Parameters:
vs1 (sequence of numbers): The first sequence.
vs2-or-val (sequence of numbers or single number): The second sequence of numbers, or a single number to compare against each element of vs1.
If both inputs are sequences, they must have the same length. If vs2-or-val
is a single number, it is effectively treated as a sequence of that number repeated count(vs1)
times.
Returns the count of equal elements as a long integer.
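For example:

(stats/count= [1 2 3 2] [1 0 3 2]) ;; => 3 (positions 0, 2 and 3 match)
(stats/count= [1 2 1 1] 1)         ;; => 3 (scalar compared element-wise)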
covariance
(covariance [vs1 vs2])
(covariance vs1 vs2)
Covariance of two sequences.
This function calculates the sample covariance.
Parameters:
[vs1 vs2] (sequence of two sequences): A sequence containing the two sequences of numbers.
vs1, vs2 (sequences): The two sequences of numbers directly as arguments.
Both sequences must have the same length.
Returns the calculated sample covariance as a double.
See also correlation, covariance-matrix.
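For example, with an exactly linear pair (sample covariance, n-1 denominator):

(stats/covariance [1 2 3] [2 4 6])
;; => 2.0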
covariance-matrix
(covariance-matrix vss)
Generates a matrix of pairwise covariance coefficients from a sequence of sequences.
Given a collection of data sequences vss
, where each inner sequence represents a variable, this function calculates a square matrix where the element at row i
and column j
is the sample covariance between the i
-th and j
-th sequences in vss
.
Parameters:
vss
(sequence of sequences of numbers): The collection of data sequences. Each inner sequence is treated as a variable. All inner sequences must have the same length.
Returns a sequence of sequences (a matrix) of doubles representing the covariance matrix. The matrix is symmetric, as covariance is a symmetric measure (\(Cov(X,Y) = Cov(Y,X)\)).
Internally uses coefficient-matrix with the covariance function and symmetric?
set to true
.
See also covariance, correlation-matrix, coefficient-matrix.
cramers-c
(cramers-c group1 group2)
(cramers-c contingency-table)
Calculates Cramer’s C, a measure of association (effect size) between two nominal variables represented in a contingency table.
Its value ranges from 0 to 1, where 0 indicates no association and 1 indicates a perfect association. It is particularly useful for tables larger than 2x2.
The function can be called in two ways:
- With two sequences group1 and group2: The function will automatically construct a contingency table from the unique values in the sequences.
- With a contingency table, provided as:
  - A map where keys are [row-index, column-index] tuples and values are counts (e.g., {[0 0] 10, [0 1] 5, [1 0] 3, [1 1] 12}). This is the output format of contingency-table with two inputs.
  - A sequence of sequences representing the rows of the table (e.g., [[10 5] [3 12]]). This is equivalent to rows->contingency-table.
Parameters:
group1 (sequence): The first sequence of categorical data.
group2 (sequence): The second sequence of categorical data. Must have the same length as group1.
contingency-table (map or sequence of sequences): A pre-computed contingency table.
Returns the calculated Cramer’s C coefficient as a double.
See also chisq-test, cramers-v, cohens-w, tschuprows-t, contingency-table.
cramers-v
(cramers-v group1 group2)
(cramers-v contingency-table)
Calculates Cramer’s V, a measure of association (effect size) between two nominal variables represented in a contingency table.
Its value ranges from 0 to 1, where 0 indicates no association and 1 indicates a perfect association. It is related to the Pearson’s Chi-squared statistic and is useful for tables of any size.
The function can be called in two ways:
- With two sequences group1 and group2: The function will automatically construct a contingency table from the unique values in the sequences.
- With a contingency table, provided as:
  - A map where keys are [row-index, column-index] tuples and values are counts (e.g., {[0 0] 10, [0 1] 5, [1 0] 3, [1 1] 12}). This is the output format of contingency-table with two inputs.
  - A sequence of sequences representing the rows of the table (e.g., [[10 5] [3 12]]). This is equivalent to rows->contingency-table.
Parameters:
group1 (sequence): The first sequence of categorical data.
group2 (sequence): The second sequence of categorical data. Must have the same length as group1.
contingency-table (map or sequence of sequences): A pre-computed contingency table.
Returns the calculated Cramer’s V coefficient as a double.
See also chisq-test, cramers-c, cohens-w, tschuprows-t, contingency-table.
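Both calling styles, as a sketch:

(stats/cramers-v (ds/mtcars :cyl) (ds/mtcars :am)) ;; from two raw categorical columns
(stats/cramers-v [[10 5] [3 12]])                  ;; from a pre-computed table of counts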
cramers-v-corrected
(cramers-v-corrected group1 group2)
(cramers-v-corrected contingency-table)
Calculates the corrected Cramer’s V, a measure of association (effect size) between two nominal variables represented in a contingency table, with a correction to reduce bias, particularly for small sample sizes or tables with many cells having small expected counts.
Like the uncorrected Cramer’s V (cramers-v), its value ranges from 0 to 1, where 0 indicates no association and 1 indicates a perfect association. The correction tends to yield a value closer to the true population value in biased situations.
The function can be called in two ways:
- With two sequences group1 and group2: The function will automatically construct a contingency table from the unique values in the sequences.
- With a contingency table, provided as:
  - A map where keys are [row-index, column-index] tuples and values are counts (e.g., {[0 0] 10, [0 1] 5, [1 0] 3, [1 1] 12}). This is the output format of contingency-table with two inputs.
  - A sequence of sequences representing the rows of the table (e.g., [[10 5] [3 12]]). This is equivalent to rows->contingency-table.
Parameters:
group1 (sequence): The first sequence of categorical data.
group2 (sequence): The second sequence of categorical data. Must have the same length as group1.
contingency-table (map or sequence of sequences): A pre-computed contingency table.
Returns the calculated corrected Cramer’s V coefficient as a double.
See also chisq-test, cramers-v (uncorrected), cramers-c, cohens-w, tschuprows-t, contingency-table.
cressie-read-test
(cressie-read-test contingency-table-or-xs)
(cressie-read-test contingency-table-or-xs params)
Cressie-Read test, a power divergence test for lambda = 2/3.
Performs a power divergence test, which encompasses several common statistical tests like Chi-squared, G-test (likelihood ratio), etc., based on the lambda parameter. This function can perform either a goodness-of-fit test or a test for independence in a contingency table.
Usage:
- Goodness-of-Fit (GOF):
  - Input: observed-counts (sequence of numbers) and :p (expected probabilities/weights).
  - Input: data (sequence of numbers) and :p (a distribution object). In this case, a histogram of data is created (controlled by :bins) and compared against the probability mass/density of the distribution in those bins.
- Test for Independence:
  - Input: contingency-table (2D sequence or map format). The :p option is ignored.
Options map:
:lambda (double, default: 2/3): Determines the specific test statistic. Common values:
  1.0 : Pearson Chi-squared test (chisq-test).
  0.0 : G-test / Multinomial Likelihood Ratio test (multinomial-likelihood-ratio-test).
  -0.5 : Freeman-Tukey test (freeman-tukey-test).
  -1.0 : Minimum Discrimination Information test (minimum-discrimination-information-test).
  -2.0 : Neyman Modified Chi-squared test (neyman-modified-chisq-test).
  2/3 : Cressie-Read test (default, cressie-read-test).
:p (seq of numbers or distribution): Expected probabilities/weights (for GOF with counts) or a fastmath.random distribution object (for GOF with data). Ignored for independence tests.
:alpha (double, default: 0.05): Significance level for confidence intervals.
:ci-sides (keyword, default: :two-sided): Sides for bootstrap confidence intervals (:two-sided, :one-sided-greater, :one-sided-less).
:sides (keyword, default: :one-sided-greater): Alternative hypothesis side for the p-value calculation against the Chi-squared distribution (:one-sided-greater, :one-sided-less, :two-sided).
:bootstrap-samples (long, default: 1000): Number of bootstrap samples for confidence interval estimation.
:ddof (long, default: 0): Delta degrees of freedom. Adjustment subtracted from the calculated degrees of freedom.
:bins (number, keyword, or seq): Used only for GOF test against a distribution. Specifies the number of bins, an estimation method (see histogram), or explicit bin edges for histogram creation.
Returns a map containing:
:stat : The calculated power divergence test statistic.
:chi2 : Alias for :stat.
:df : Degrees of freedom for the test.
:p-value : The p-value associated with the test statistic.
:n : Total number of observations.
:estimate : Observed proportions.
:expected : Expected counts or proportions under the null hypothesis.
:confidence-interval : Bootstrap confidence intervals for the observed proportions.
:lambda, :alpha, :sides, :ci-sides : Input options used.
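A goodness-of-fit sketch with illustrative counts and hypothesized proportions:

(-> (stats/cressie-read-test [89 37 30 28 2] {:p [0.4 0.2 0.2 0.15 0.05]})
    (select-keys [:stat :df :p-value]))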
demean
(demean vs)
Subtract mean from a sequence
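For example (results shown as doubles):

(stats/demean [1 2 3 4 5])
;; => (-2.0 -1.0 0.0 1.0 2.0)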
dissimilarity
(dissimilarity method P-observed Q-expected)
(dissimilarity method P-observed Q-expected {:keys [bins probabilities? epsilon log-base power remove-zeros?], :or {probabilities? true, epsilon 1.0E-6, log-base m/E, power 2.0}})
Various PDF distance between two histograms (frequencies) or probabilities.
Q can be a distribution object. Then, histogram will be created out of P.
Arguments:
method - distance method
P-observed - frequencies, probabilities or actual data (when Q is a distribution or :bins is set)
Q-expected - frequencies, probabilities or a distribution object (when P is data or :bins is set)
Options:
:probabilities? - should P/Q be converted to probabilities, default: true
:epsilon - small number which replaces 0.0 when division or logarithm is used
:log-base - base for logarithms, default: e
:power - exponent for :minkowski distance, default: 2.0
:bins - number of bins or bins estimation method, see histogram
The list of methods: :euclidean, :city-block, :manhattan, :chebyshev, :minkowski, :sorensen, :gower, :soergel, :kulczynski, :canberra, :lorentzian, :non-intersection, :wave-hedges, :czekanowski, :motyka, :tanimoto, :jaccard, :dice, :bhattacharyya, :hellinger, :matusita, :squared-chord, :euclidean-sq, :squared-euclidean, :pearson-chisq, :chisq, :neyman-chisq, :squared-chisq, :symmetric-chisq, :divergence, :clark, :additive-symmetric-chisq, :kullback-leibler, :jeffreys, :k-divergence, :topsoe, :jensen-shannon, :jensen-difference, :taneja, :kumar-johnson, :avg
See more: Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions by Sung-Hyuk Cha
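A minimal sketch on two small frequency vectors (method keywords come from the list above):

(stats/dissimilarity :hellinger [10 5 3] [8 6 4])
(stats/dissimilarity :kullback-leibler [10 5 3] [8 6 4] {:epsilon 1.0e-6})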
durbin-watson
(durbin-watson rs)
Calculates the Durbin-Watson statistic (d) for a sequence of residuals.
This statistic is used to test for the presence of serial correlation, especially first-order (lag-1) autocorrelation, in the residuals from a regression analysis. Autocorrelation violates the assumption of independent errors.
Parameters:
rs
(sequence of numbers): The sequence of residuals from a regression model. The sequence should represent observations ordered by time or sequence index.
Returns the calculated Durbin-Watson statistic as a double. The value ranges from 0 to 4.
Interpretation:
- Values near 2 suggest no first-order autocorrelation.
- Values less than 2 suggest positive autocorrelation (residuals tend to be followed by residuals of the same sign).
- Values greater than 2 suggest negative autocorrelation (residuals tend to be followed by residuals of the opposite sign).
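Two hand-checkable extremes, assuming the textbook statistic d = Σ(e_t - e_{t-1})² / Σ e_t²:

(stats/durbin-watson [1 1 1 1])     ;; => 0.0 (constant residuals, extreme positive autocorrelation)
(stats/durbin-watson [1 -1 1 -1 1]) ;; => 3.2 (alternating signs, strong negative autocorrelation)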
epsilon-sq
(epsilon-sq [group1 group2])
(epsilon-sq group1 group2)
Calculates Epsilon squared (ε²), an effect size measure for the simple linear regression of group1
on group2
.
Epsilon squared estimates the proportion of variance in the dependent variable (group1
) that is accounted for by the independent variable (group2
) in the population. It is considered a less biased alternative to the sample R-squared (r2-determination).
The calculation is based on the sums of squares from the simple linear regression of group1
on group2
.
Parameters:
group1 (seq of numbers): The dependent variable.
group2 (seq of numbers): The independent variable. Must have the same length as group1.
Returns the calculated Epsilon squared value as a double. The value typically ranges from 0.0 to 1.0.
Interpretation:
- 0.0 indicates that group2 explains none of the variance in group1 in the population.
- 1.0 indicates that group2 perfectly explains the variance in group1 in the population.
Note: While often presented in the context of ANOVA, this implementation applies the formula to the sums of squares obtained from a simple linear regression between the two sequences.
See also eta-sq (Eta-squared, often based on \(R^2\)), omega-sq (another adjusted R²-like measure), r2-determination (R-squared).
estimate-bins
(estimate-bins vs)
(estimate-bins vs bins-or-estimate-method)
Estimate number of bins for histogram.
Possible methods are: :sqrt
:sturges
:rice
:doane
:scott
:freedman-diaconis
(default).
The number returned is not higher than the number of samples.
estimation-strategies-list
List of estimation strategies for percentile/quantile functions.
eta-sq
(eta-sq [group1 group2])
(eta-sq group1 group2)
Calculates a measure of association between two sequences, named eta-sq
(Eta-squared).
Note: The current implementation calculates the R-squared coefficient of determination from a simple linear regression where the first input sequence (group1
) is treated as the dependent variable and the second (group2
) as the independent variable. In this context, it quantifies the proportion of the variance in group1
that is linearly predictable from group2
.
Parameters:
group1 (seq of numbers): The first sequence (treated as dependent variable).
group2 (seq of numbers): The second sequence (treated as independent variable).
Returns the calculated R-squared value as a double [0.0, 1.0].
Interpretation:
- 0.0 indicates that group2 explains none of the variance in group1 linearly.
- 1.0 indicates that group2 linearly explains all the variance in group1.
While Eta-squared (\(\eta^2\)) is commonly used in ANOVA to quantify the proportion of variance in a dependent variable explained by group membership, this function’s calculation method differs from the standard ANOVA \(\eta^2\) unless group2
explicitly represents numeric codes for two groups.
See also r2-determination (which is equivalent to this function), pearson-correlation, omega-sq, epsilon-sq, one-way-anova-test.
expectile
(expectile vs tau)
(expectile vs weights tau)
Calculate the tau-th expectile of a sequence vs
.
Expectiles are related to quantiles but are determined by minimizing an asymmetrically weighted sum of squared differences, rather than absolute differences. The tau
parameter controls the asymmetry.
A key property is that the expectile for tau = 0.5
is equal to the mean.
The calculation involves finding the value t
such that the weighted sum of w_i * (v_i - t)
is zero, where the effective weights depend on tau
and whether v_i
is above or below t
.
Parameters:
vs : Sequence of data values.
weights (optional): Sequence of corresponding non-negative weights. Must have the same count as vs. If omitted, calculates the unweighted expectile.
tau : The expectile level, a value between 0.0 and 1.0 (inclusive).
Returns the calculated expectile as a double.
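The tau = 0.5 case reduces to the mean, which gives a quick sanity check:

(stats/expectile [1 2 3 4 100] 0.5)
;; => 22.0, the mean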
extent
(extent vs)
(extent vs mean?)
Return extent (min, max, mean) values from sequence. Mean is optional (default: true)
f-test
(f-test xs ys)
(f-test xs ys {:keys [sides alpha], :or {sides :two-sided, alpha 0.05}})
Performs an F-test to compare the variances of two independent samples.
The test assesses the null hypothesis that the variances of the populations from which xs
and ys
are drawn are equal.
Assumes independence of samples. The test is sensitive to departures from the assumption that both populations are normally distributed.
Parameters:
xs (seq of numbers): The first sample.
ys (seq of numbers): The second sample.
params (map, optional): Options map:
  :sides (keyword, default :two-sided): Specifies the alternative hypothesis regarding the ratio of variances (Var(xs) / Var(ys)).
    :two-sided (default): Variances are not equal (ratio != 1).
    :one-sided-greater : Variance of xs is greater than variance of ys (ratio > 1).
    :one-sided-less : Variance of xs is less than variance of ys (ratio < 1).
  :alpha (double, default 0.05): Significance level for the confidence interval.
Returns a map containing:
:F : The calculated F-statistic (ratio of sample variances: Var(xs) / Var(ys)).
:stat : Alias for :F.
:estimate : Alias for :F, representing the estimated ratio of variances.
:df : Degrees of freedom as [numerator-df, denominator-df], corresponding to [(count xs)-1, (count ys)-1].
:n : Sample sizes as [count xs, count ys].
:nx : Sample size of xs.
:ny : Sample size of ys.
:sides : The alternative hypothesis side used (:two-sided, :one-sided-greater, or :one-sided-less).
:test-type : Alias for :sides.
:p-value : The p-value associated with the F-statistic and the specified :sides.
:confidence-interval : A confidence interval for the true ratio of the population variances (Var(xs) / Var(ys)).
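A sketch comparing mpg variance across transmission types, assuming the ds/by dataset helper and the stats alias:

(let [{manual 1 automatic 0} (ds/by ds/mtcars :am :mpg)]
  (select-keys (stats/f-test manual automatic)
               [:F :df :p-value :confidence-interval]))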
fligner-killeen-test
(fligner-killeen-test xss)
(fligner-killeen-test xss {:keys [sides], :or {sides :one-sided-greater}})
Performs the Fligner-Killeen test for homogeneity of variances across two or more groups.
The Fligner-Killeen test is a non-parametric test that assesses the null hypothesis that the variances of the groups are equal. It is robust against departures from normality. The test is based on ranks of the absolute deviations from the group medians.
Parameters:
xss (sequence of sequences): A collection where each element is a sequence representing a group of observations.
params (map, optional): Options map with the following key:
  :sides (keyword, default :one-sided-greater): Alternative hypothesis side for the Chi-squared test. Possible values: :one-sided-greater, :one-sided-less, :two-sided.
Returns a map containing:
:chi2 : The Fligner-Killeen test statistic (Chi-squared value).
:stat : Alias for :chi2.
:p-value : The p-value for the test.
:df : Degrees of freedom for the test (number of groups - 1).
:n : Sequence of sample sizes for each group.
:SSt : Sum of squares between groups (treatment) based on transformed ranks.
:SSe : Sum of squares within groups (error) based on transformed ranks.
:DFt : Degrees of freedom between groups.
:DFe : Degrees of freedom within groups.
:MSt : Mean square between groups.
:MSe : Mean square within groups.
:sides : Test side used.
freeman-tukey-test
(freeman-tukey-test contingency-table-or-xs)
(freeman-tukey-test contingency-table-or-xs params)
Freeman-Tukey test, a power divergence test for lambda = -0.5.
Performs a power divergence test, which encompasses several common statistical tests like Chi-squared, G-test (likelihood ratio), etc., based on the lambda parameter. This function can perform either a goodness-of-fit test or a test for independence in a contingency table.
Usage:
- Goodness-of-Fit (GOF):
  - Input: observed-counts (sequence of numbers) and :p (expected probabilities/weights).
  - Input: data (sequence of numbers) and :p (a distribution object). In this case, a histogram of data is created (controlled by :bins) and compared against the probability mass/density of the distribution in those bins.
- Test for Independence:
  - Input: contingency-table (2D sequence or map format). The :p option is ignored.
Options map:
:lambda (double, default: 2/3): Determines the specific test statistic. Common values:
  1.0 : Pearson Chi-squared test (chisq-test).
  0.0 : G-test / Multinomial Likelihood Ratio test (multinomial-likelihood-ratio-test).
  -0.5 : Freeman-Tukey test (freeman-tukey-test).
  -1.0 : Minimum Discrimination Information test (minimum-discrimination-information-test).
  -2.0 : Neyman Modified Chi-squared test (neyman-modified-chisq-test).
  2/3 : Cressie-Read test (default, cressie-read-test).
:p (seq of numbers or distribution): Expected probabilities/weights (for GOF with counts) or a fastmath.random distribution object (for GOF with data). Ignored for independence tests.
:alpha (double, default: 0.05): Significance level for confidence intervals.
:ci-sides (keyword, default: :two-sided): Sides for bootstrap confidence intervals (:two-sided, :one-sided-greater, :one-sided-less).
:sides (keyword, default: :one-sided-greater): Alternative hypothesis side for the p-value calculation against the Chi-squared distribution (:one-sided-greater, :one-sided-less, :two-sided).
:bootstrap-samples (long, default: 1000): Number of bootstrap samples for confidence interval estimation.
:ddof (long, default: 0): Delta degrees of freedom. Adjustment subtracted from the calculated degrees of freedom.
:bins (number, keyword, or seq): Used only for GOF test against a distribution. Specifies the number of bins, an estimation method (see histogram), or explicit bin edges for histogram creation.
Returns a map containing:
:stat : The calculated power divergence test statistic.
:chi2 : Alias for :stat.
:df : Degrees of freedom for the test.
:p-value : The p-value associated with the test statistic.
:n : Total number of observations.
:estimate : Observed proportions.
:expected : Expected counts or proportions under the null hypothesis.
:confidence-interval : Bootstrap confidence intervals for the observed proportions.
:lambda, :alpha, :sides, :ci-sides : Input options used.
geomean
(geomean vs)
(geomean vs weights)
Calculates the geometric mean of a sequence vs
.
The geometric mean is suitable for averaging ratios or rates of change and requires all values in the sequence to be positive. It is calculated as the n-th root of the product of n numbers.
Parameters:
vs : Sequence of numbers. Non-positive values will result in NaN or 0.0 due to the internal use of log.
weights (optional): Sequence of non-negative weights corresponding to vs. Must have the same count as vs.
Returns the calculated geometric mean as a double.
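For example, the cube root of 1·3·9 = 27:

(stats/geomean [1 3 9])
;; => 3.0 (up to floating-point rounding)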
glass-delta
(glass-delta [group1 group2])
(glass-delta group1 group2)
Calculates Glass’s delta (Δ), an effect size measure for the difference between two group means, using the standard deviation of the control group.
Glass’s delta is used to quantify the magnitude of the difference between an experimental group and a control group, specifically when the control group’s standard deviation is considered a better estimate of the population standard deviation than a pooled variance.
Parameters:
group1 (seq of numbers): The experimental group.
group2 (seq of numbers): The control group.
Returns the calculated Glass’s delta as a double.
This measure is less common than cohens-d or hedges-g but is preferred when the intervention is expected to affect the variance or when group2 (the control) is clearly the baseline against which variability should be assessed.
harmean
(harmean vs)
(harmean vs weights)
Calculates the harmonic mean of a sequence vs
.
The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals of the observations.
Parameters:
vs : Sequence of numbers. Values must be non-zero.
weights (optional): Sequence of non-negative weights corresponding to vs. Must have the same count as vs.
Returns the calculated harmonic mean as a double.
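For example, 3 / (1/1 + 1/2 + 1/4):

(stats/harmean [1 2 4])
;; => ~1.714 (12/7)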
hedges-g
(hedges-g [group1 group2])
(hedges-g group1 group2)
Calculates Hedges’s g effect size for comparing the means of two independent groups.
Hedges’s g is a standardized measure quantifying the magnitude of the difference between the means of two independent groups. It is similar to Cohen’s d but uses the unbiased pooled standard deviation in the denominator.
This implementation calculates g using the unbiased pooled standard deviation as the denominator.
Parameters:
group1, group2 (sequences): The two independent samples directly as arguments.
Returns the calculated Hedges’s g effect size as a double.
Note: This specific function uses the unbiased pooled standard deviation but does not apply the small-sample bias correction factor (often denoted as J) sometimes associated with Hedges’s g. For a bias-corrected version, see hedges-g-corrected. This function is equivalent to calling (cohens-d group1 group2 :unbiased)
.
See also cohens-d, hedges-g-corrected, glass-delta, pooled-stddev.
hedges-g*
(hedges-g* [group1 group2])
(hedges-g* group1 group2)
Calculates a less biased estimate of Hedges’s g effect size for comparing the means of two independent groups, using the exact J bias correction.
Hedges’s g is a standardized measure of the difference between two means. For small sample sizes, the standard Hedges’s g (and Cohen’s d) can overestimate the true population effect size. This function applies a specific correction factor, often denoted as J, to mitigate this bias.
The calculation involves:
1. Calculating the standard Hedges’s g (equivalent to hedges-g, which uses the unbiased pooled standard deviation).
2. Calculating the J correction factor based on the degrees of freedom (n1 + n2 - 2) using the gamma function.
3. Multiplying the standard Hedges’s g by the J factor.
The J factor is calculated as J = Gamma(df/2) / (sqrt(df/2) * Gamma((df-1)/2)).
Parameters:
group1 (seq of numbers): The first independent sample.
group2 (seq of numbers): The second independent sample.
Returns the calculated bias-corrected Hedges’s g effect size as a double.
This version of Hedges’s g is generally preferred over the standard version or Cohen’s d when working with small sample sizes, as it provides a more accurate estimate of the population effect size.
Assumptions:
- The two samples are independent.
- Data within each group are approximately normally distributed.
- Equal variances are assumed for calculating the pooled standard deviation.
See also cohens-d, hedges-g (uncorrected), hedges-g-corrected (another correction method).
hedges-g-corrected
(hedges-g-corrected [group1 group2])
(hedges-g-corrected group1 group2)
Calculates a small-sample bias-corrected effect size for comparing the means of two independent groups, often referred to as a form of Hedges’s g.
This function calculates Cohen’s d (cohens-d) using the unbiased pooled standard deviation (equivalent to hedges-g), and then applies a specific correction factor designed to reduce the bias in the effect size estimate for small sample sizes.
The correction factor applied is (1 - 3 / (4 * df - 1))
, where df
is the degrees of freedom for the unbiased pooled variance calculation (n1 + n2 - 2
). This corresponds to calling cohens-d-corrected with the :unbiased
method for pooled standard deviation.
Parameters:
group1 (seq of numbers): The first independent sample.
group2 (seq of numbers): The second independent sample.
Returns the calculated bias-corrected effect size as a double.
Note: This function applies a correction factor. For the more standard Hedges’s g bias correction using the exact gamma function based correction factor, see hedges-g*.
See also cohens-d, cohens-d-corrected, hedges-g, hedges-g*, pooled-stddev.
histogram
(histogram vs)
(histogram vs bins-or-estimate-method)
(histogram vs bins-or-estimate-method [mn mx])
(histogram vs bins-or-estimate-method mn mx)
Calculate histogram.
Estimation method can be a number, named method: :sqrt
:sturges
:rice
:doane
:scott
:freedman-diaconis
(default) or a sequence of points used as intervals. In the latter case or when mn
and mx
values are provided - data will be filtered to fit in desired interval(s).
Returns map with keys:
:size - number of bins
:step - average distance between bins
:bins - seq of pairs of range lower value and number of elements
:min - min value
:max - max value
:samples - number of used samples
:frequencies - a map containing counts for bin’s average
:intervals - intervals used to create bins
:bins-maps - seq of maps containing:
  :min - lower bound
  :max - upper bound
  :step - actual distance between bins
  :count - number of elements
  :avg - average value
  :probability - probability for bin
If difference between min and max values is 0
, number of bins is set to 1.
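A small sketch on the mtcars mpg column (keys picked from the list above):

(-> (stats/histogram (ds/mtcars :mpg) :sturges)
    (select-keys [:size :step :min :max :samples]))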
hpdi-extent
(hpdi-extent vs)
(hpdi-extent vs size)
Highest Posterior Density interval (HPDI) + median.
size
parameter is the target probability content of the interval.
inner-fence-extent
(inner-fence-extent vs)
(inner-fence-extent vs estimation-strategy)
Returns LIF (lower inner fence), UIF (upper inner fence) and median
iqr
(iqr vs)
(iqr vs estimation-strategy)
Interquartile range.
jarque-bera-test
(jarque-bera-test xs)
(jarque-bera-test xs params)
(jarque-bera-test xs skew kurt {:keys [sides], :or {sides :one-sided-greater}})
Performs the Jarque-Bera goodness-of-fit test to determine if sample data exhibits skewness and kurtosis consistent with a normal distribution.
The test assesses the null hypothesis that the data comes from a normally distributed population (i.e., population skewness is 0 and population excess kurtosis is 0).
The test statistic is calculated as: JB = (n/6) * (S^2 + (1/4)*K^2)
where n
is the sample size, S
is the sample skewness (using :g1
type), and K
is the excess kurtosis :g2
. Under the null hypothesis, the JB statistic asymptotically follows a Chi-squared distribution with 2 degrees of freedom.
Parameters:
xs
(seq of numbers): The sample data.skew
(double, optional): A pre-calculated sample skewness value (type:g1
). If omitted, it’s calculated fromxs
.kurt
(double, optional): A pre-calculated sample excess kurtosis value (type:g2
). If omitted, it’s calculated fromxs
.params
(map, optional): Options map::sides
(keyword, default:one-sided-greater
): Specifies the side(s) of the Chi-squared(2) distribution used for p-value calculation.:one-sided-greater
(default and standard for JB): Tests if the JB statistic is significantly large, indicating departure from normality.:one-sided-less
: Tests if the statistic is significantly small.:two-sided
: Tests if the statistic is extreme in either tail.
Returns a map containing:
:Z : The calculated Jarque-Bera test statistic (labeled :Z for consistency, though it follows Chi-squared(2)).
:stat : Alias for :Z.
:p-value : The p-value associated with the test statistic and :sides, derived from the Chi-squared(2) distribution.
:skewness : The sample skewness (type :g1) used in the calculation.
:kurtosis : The sample kurtosis (type :g2) used in the calculation.
See also skewness-test, kurtosis-test, normality-test, bonett-seier-test.
jensen-shannon-divergence DEPRECATED
Deprecated: Use dissimilarity.
(jensen-shannon-divergence [vs1 vs2])
(jensen-shannon-divergence vs1 vs2)
Jensen-Shannon divergence of two sequences.
kendall-correlation
(kendall-correlation [vs1 vs2])
(kendall-correlation vs1 vs2)
Calculates Kendall’s rank correlation coefficient (Kendall’s Tau) between two sequences.
Kendall’s Tau is a non-parametric statistic used to measure the ordinal association between two measured quantities. It assesses the degree of similarity between the orderings of data when ranked by each of the quantities.
The coefficient value ranges from -1.0 (perfect disagreement in ranking) to 1.0 (perfect agreement in ranking), with 0.0 indicating no monotonic relationship. Unlike Pearson correlation, it does not require the relationship to be linear.
Parameters:
[vs1 vs2] (sequence of two sequences): A sequence containing the two sequences of numbers.
vs1, vs2 (sequences): The two sequences of numbers directly as arguments.
Both input sequences must contain only numbers and must have the same length.
Returns the calculated Kendall’s Tau coefficient as a double.
See also pearson-correlation, spearman-correlation, correlation.
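Monotone agreement gives 1.0 even when the relationship is non-linear:

(stats/kendall-correlation [1 2 3 4] [1 8 27 64])
;; => 1.0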
kruskal-test
(kruskal-test xss)
(kruskal-test xss {:keys [sides], :or {sides :right}})
Performs the Kruskal-Wallis H-test (rank sum test) for independent samples.
The Kruskal-Wallis test is a non-parametric alternative to one-way ANOVA. It determines whether there is a statistically significant difference between the distributions of two or more independent groups. It does not assume normality but requires that distributions have a similar shape for the test to be valid.
Parameters:
xss (vector of sequences): A collection where each element is a sequence representing a group of observations.
params (map, optional): A map containing the :sides key with values of: :right (default), :left or :both.
Returns a map containing:
:stat : The Kruskal-Wallis H statistic.
:n : Total number of observations across all groups.
:df : Degrees of freedom (number of groups - 1).
:k : Number of groups.
:sides : Test side.
:p-value : The p-value for the test (null hypothesis: all groups have the same distribution).
ks-test-one-sample
(ks-test-one-sample xs)
(ks-test-one-sample xs distribution-or-ys)
(ks-test-one-sample xs distribution-or-ys {:keys [sides kernel bandwidth distinct?], :or {sides :two-sided, kernel :gaussian, distinct? true}})
Performs the one-sample Kolmogorov-Smirnov (KS) test.
This test compares the empirical cumulative distribution function (ECDF) of a sample xs
against a specified theoretical distribution or the ECDF of another empirical sample. It assesses the null hypothesis that xs
is drawn from the reference distribution.
Parameters:
xs (seq of numbers): The sample data to be tested.
distribution-or-ys (optional):
  - A fastmath.random distribution object to test against. If omitted, defaults to the standard normal distribution (fastmath.random/default-normal).
  - A sequence of numbers (ys). In this case, an empirical distribution is estimated from ys using Kernel Density Estimation (KDE) or an enumerated distribution (see :kernel option).
opts (map, optional): Options map:
  :sides (keyword, default :two-sided): Specifies the alternative hypothesis regarding the difference between the ECDF of xs and the reference CDF.
    :two-sided (default): Tests if the ECDF of xs is different from the reference CDF.
    :right : Tests if the ECDF of xs is significantly below the reference CDF (i.e., xs tends to have larger values, stochastically greater).
    :left : Tests if the ECDF of xs is significantly above the reference CDF (i.e., xs tends to have smaller values, stochastically smaller).
  :kernel (keyword, default :gaussian): Used only when distribution-or-ys is a sequence. Specifies the method to estimate the empirical distribution:
    :gaussian (or other KDE kernels): Uses Kernel Density Estimation.
    :enumerated : Creates a discrete empirical distribution from ys.
  :bandwidth (double, optional): Bandwidth for KDE (if applicable).
  :distinct? (boolean or keyword, default true): How to handle duplicate values in xs.
    true (default): Removes duplicate values from xs before computation.
    false : Uses all values in xs, including duplicates.
    :jitter : Adds a small amount of random noise to each value in xs to break ties.
Returns a map containing:
:n : Sample size of xs (after applying :distinct?).
:dp : Maximum positive difference (ECDF(xs) - CDF(ref)).
:dn : Maximum positive difference (CDF(ref) - ECDF(xs)).
:d : The KS test statistic (max absolute difference: max(dp, dn)).
:stat : The specific statistic used for p-value calculation, depending on :sides (d, dp, or dn).
:p-value : The p-value associated with the test statistic and the specified :sides.
:sides : The alternative hypothesis side used.
ks-test-two-samples
(ks-test-two-samples xs ys)
(ks-test-two-samples xs ys {:keys [method sides distinct? correct?], :or {sides :two-sided, distinct? :ties, correct? true}})
Performs the two-sample Kolmogorov-Smirnov (KS) test.
This test compares the empirical cumulative distribution functions (ECDFs) of two independent samples, xs
and ys
, to assess the null hypothesis that they are drawn from the same continuous distribution.
Parameters:
xs (seq of numbers): The first sample.
ys (seq of numbers): The second sample.
opts (map, optional): Options map:
  :method (keyword, optional): Specifies the calculation method for the p-value.
    :exact : Attempts an exact calculation (suitable for small samples, sensitive to ties). Default if nx * ny < 10000.
    :approximate : Uses the asymptotic Kolmogorov distribution (suitable for larger samples). Default otherwise.
  :sides (keyword, default :two-sided): Specifies the alternative hypothesis.
    :two-sided (default): Tests if the distributions differ (ECDFs are different).
    :right : Tests if xs is stochastically greater than ys (ECDF(xs) is below ECDF(ys)).
    :left : Tests if xs is stochastically smaller than ys (ECDF(xs) is above ECDF(ys)).
  :distinct? (keyword or boolean, default :ties): How to handle duplicate values (ties).
    :ties (default): Includes all points. Passes information about ties to the :exact calculation method. Accuracy depends on the exact method’s tie handling.
    :jitter : Adds a small amount of random noise to break ties before comparison. A practical approach if exact tie handling is complex or not required.
    true : Applies distinct to xs and ys separately before combining. May not resolve all ties between the combined samples.
    false : Uses the data as-is, without attempting to handle ties explicitly (may lead to less accurate p-values, especially with the exact method).
  :correct? (boolean, default true): Apply continuity correction when using the :exact calculation method for a more accurate p-value, especially for smaller sample sizes.
Returns a map containing:
:nx : Number of observations in xs (after :distinct? processing if applicable).
:ny : Number of observations in ys (after :distinct? processing if applicable).
:n : Effective sample size used for asymptotic calculation (nx*ny / (nx+ny)).
:dp : Maximum positive difference (ECDF(xs) - ECDF(ys)).
:dn : Maximum positive difference (ECDF(ys) - ECDF(xs)).
:d : The KS test statistic (max absolute difference: max(dp, dn)).
:stat : The specific statistic used for p-value calculation (d, dp, or dn for exact; scaled version for approximate).
:KS : Alias for :stat.
:p-value : The p-value associated with the test statistic and :sides.
:sides : The alternative hypothesis side used.
:method : The calculation method used (:exact or :approximate).
Note on Ties: The KS test is strictly defined for continuous distributions where ties have zero probability. The presence of ties in sample data affects the p-value calculation. The :distinct?
option provides ways to manage this, with :jitter
being a common pragmatic choice.
kullback-leibler-divergence DEPRECATED
Deprecated: Use dissimilarity.
(kullback-leibler-divergence [vs1 vs2])
(kullback-leibler-divergence vs1 vs2)
Kullback-Leibler divergence of two sequences.
kurtosis
(kurtosis vs)
(kurtosis vs typ)
Calculates the kurtosis of a sequence, a measure of the ‘tailedness’ or ‘peakedness’ of the distribution compared to a normal distribution.
Parameters:
vs (seq of numbers): The input sequence.
typ (keyword or sequence, optional): Specifies the type of kurtosis measure to calculate. Different types use different algorithms and may have different expected values under normality (e.g., 0 or 3). Defaults to :G2.
Available typ
values:
:G2 (default): Sample kurtosis based on the fourth standardized moment, as implemented by Apache Commons Math Kurtosis. Its value approaches 3 for a large normal sample, but the exact expected value depends on sample size.
:g2 or :excess : Sample excess kurtosis. This is calculated from :G2 and adjusted for sample bias, such that the expected value for a normal distribution is approximately 0.
:kurt : Kurtosis definition where normal = 3. Calculated as :g2 + 3.
:b2 : Kurtosis defined as the fourth moment divided by the standard deviation to the power of 4.
:geary : Geary’s ‘g’, a robust measure calculated as mean_abs_deviation / population_stddev. Expected value for normal is sqrt(2/pi) ≈ 0.798. Lower values indicate leptokurtosis.
:moors : Moors’ robust kurtosis measure based on octiles. The implementation returns a centered version where the expected value for normal is 0.
:crow : Crow-Siddiqui robust kurtosis measure based on quantiles. The implementation returns a centered version where the expected value for normal is 0. Can accept parameters alpha and beta via sequential type [:crow alpha beta].
:hogg : Hogg’s robust kurtosis measure based on trimmed means. The implementation returns a centered version where the expected value for normal is 0. Can accept parameters alpha and beta via sequential type [:hogg alpha beta].
:l-kurtosis : L-kurtosis (τ₄), the ratio of the 4th L-moment (λ₄) to the 2nd L-moment (λ₂, L-scale). Calculated directly using l-moment with the :ratio? option set to true. It’s a robust measure. Expected value for a normal distribution is ≈ 0.1226.
Interpretation (for excess kurtosis :g2
):
- Positive values indicate a leptokurtic distribution (heavier tails, more peaked than normal).
- Negative values indicate a platykurtic distribution (lighter tails, flatter than normal).
- Values near 0 suggest kurtosis similar to a normal distribution.
Returns the calculated kurtosis value as a double.
See also kurtosis-test, bonett-seier-test, normality-test, jarque-bera-test, l-moment.
kurtosis-test
(kurtosis-test xs)
(kurtosis-test xs params)
(kurtosis-test xs kurt {:keys [sides type], :or {sides :two-sided, type :kurt}})
Performs a test for normality based on sample kurtosis.
This test assesses the null hypothesis that the data comes from a normally distributed population by checking if the sample kurtosis significantly deviates from the kurtosis expected under normality (approximately 3).
The test works by:
- Calculating the sample kurtosis (type configurable via :type, default :kurt).
- Standardizing the difference between the sample kurtosis and the expected kurtosis under normality using the theoretical standard error.
- Applying a further transformation (e.g., Anscombe-Glynn/D’Agostino) to this standardized score to yield a final test statistic Z that more closely follows a standard normal distribution under the null hypothesis, especially for smaller sample sizes.
Parameters:
xs (seq of numbers): The sample data.
kurt (double, optional): A pre-calculated kurtosis value. If omitted, it’s calculated from xs.
params (map, optional): Options map:
  :sides (keyword, default :two-sided): Specifies the alternative hypothesis.
    :two-sided (default): The population kurtosis is different from normal.
    :one-sided-greater : The population kurtosis is greater than normal (leptokurtic).
    :one-sided-less : The population kurtosis is less than normal (platykurtic).
  :type (keyword, default :kurt): The type of kurtosis to calculate if kurt is not provided. See kurtosis for options (e.g., :kurt, :G2, :g2).
Returns a map containing:
:Z : The final test statistic, approximately standard normal under H0.
:stat : Alias for :Z.
:p-value : The p-value associated with Z and the specified :sides.
:kurtosis : The sample kurtosis value used in the test (either provided or calculated).
See also skewness-test, normality-test, jarque-bera-test, bonett-seier-test.
l-moment
(l-moment vs order)
(l-moment vs order {:keys [s t sorted? ratio?], :or {s 0, t 0}, :as opts})
Calculates L-moment, TL-moment (trimmed) or (T)L-moment ratios.
Options:
:s (default: 0) - number of left trimmed values
:t (default: 0) - number of right trimmed values
:sorted? (default: false) - if input is already sorted
:ratio? (default: false) - normalized l-moment, l-moment ratio
l-variation
(l-variation vs)
Coefficient of L-variation, L-CV
levene-test
(levene-test xss)
(levene-test xss {:keys [sides statistic scorediff], :or {sides :one-sided-greater, statistic mean, scorediff abs}})
Performs Levene’s test for homogeneity of variances across two or more groups.
Levene’s test assesses the null hypothesis that the variances of the groups are equal. It calculates an ANOVA on the absolute deviations of the data points from their group center (mean by default).
Parameters:
xss (sequence of sequences): A collection where each element is a sequence representing a group of observations.
params (map, optional): Options map with the following keys:
  :sides (keyword, default :one-sided-greater): Alternative hypothesis side for the F-test. Possible values: :one-sided-greater, :one-sided-less, :two-sided.
  :statistic (fn, default mean): Function to calculate the center of each group (e.g., mean, median). Using median results in the Brown-Forsythe test.
  :scorediff (fn, default abs): Function applied to the difference between each data point and its group center (e.g., abs, sq).
Returns a map containing:
:W : The Levene test statistic (which is an F-statistic).
:stat : Alias for :W.
:p-value : The p-value for the test.
:df : Degrees of freedom for the F-statistic ([DFt, DFe]).
:n : Sequence of sample sizes for each group.
:SSt : Sum of squares between groups (treatment).
:SSe : Sum of squares within groups (error).
:DFt : Degrees of freedom between groups.
:DFe : Degrees of freedom within groups.
:MSt : Mean square between groups.
:MSe : Mean square within groups.
:sides : Test side used.
See also brown-forsythe-test.
mad
Alias for median-absolute-deviation
mad-extent
(mad-extent vs)
-/+ median-absolute-deviation and median
mae
(mae [vs1 vs2-or-val])
(mae vs1 vs2-or-val)
Calculates the Mean Absolute Error (MAE) between two sequences or a sequence and constant value.
MAE is a measure of the difference between two sequences of values. It quantifies the average magnitude of the errors, without considering their direction.
Parameters:
vs1 (sequence of numbers): The first sequence (often the observed or true values).
vs2-or-val (sequence of numbers or single number): The second sequence (often the predicted or reference values), or a single number to compare against each element of vs1.
If both inputs are sequences, they must have the same length. If vs2-or-val
is a single number, it is effectively treated as a sequence of that number repeated count(vs1)
times.
Returns the calculated Mean Absolute Error as a double.
Note: MAE is less sensitive to large outliers than metrics like Mean Squared Error (MSE) because it uses the absolute value of differences rather than the squared difference.
See also me (Mean Error), mse (Mean Squared Error), rmse (Root Mean Squared Error).
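For example, the mean of |1-2|, |2-2|, |3-2|:

(stats/mae [1 2 3] [2 2 2])
;; => ~0.667 (2/3)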
mape
(mape [vs1 vs2-or-val])
(mape vs1 vs2-or-val)
Calculates the Mean Absolute Percentage Error (MAPE) between two sequences or a sequence and a constant value.
MAPE is a measure of prediction accuracy of a forecasting method, for example in time series analysis. It is calculated as the average of the absolute percentage errors.
Parameters:
vs1 (sequence of numbers): The first sequence (conventionally, the actual or true values).
vs2-or-val (sequence of numbers or single number): The second sequence (conventionally, the predicted or reference values), or a single number to compare against each element of vs1.
If both inputs are sequences, they must have the same length. If vs2-or-val
is a single number, it is effectively treated as a sequence of that number repeated count(vs1)
times.
Returns the calculated Mean Absolute Percentage Error as a double.
Note: MAPE is scale-independent and useful for comparing performance across different datasets. However, it is undefined if any of the actual values (x_i
) are zero, and can be skewed by small actual values.
See also me (Mean Error), mae (Mean Absolute Error), mse (Mean Squared Error), rmse (Root Mean Squared Error).
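For example, both errors are 10% of their actual values:

(stats/mape [100 200] [110 180])
;; => 0.1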
maximum
(maximum vs)
Finds the maximum value in a sequence of numbers.
mcc
(mcc group1 group2)
(mcc ct)
Calculates the Matthews Correlation Coefficient (MCC), also known as the Phi coefficient, for a 2x2 contingency table or binary classification outcomes.
MCC is a measure of the quality of binary classifications. It is a balanced measure which can be used even if the classes are of very different sizes. Its value ranges from -1 to +1.
- A coefficient of +1 represents a perfect prediction.
- 0 represents a prediction no better than random.
- -1 represents a perfect inverse prediction.
The function can be called in two ways:
- With two sequences group1 and group2: The function will automatically construct a 2x2 contingency table from the unique values in the sequences (assuming they represent two binary variables). The mapping of values to table cells (e.g., what corresponds to TP, TN, FP, FN) depends on how contingency-table orders the unique values. For direct control over which cell is which, use the contingency table input.
- With a contingency table, provided as:
  - A map where keys are [row-index, column-index] tuples and values are counts (e.g., {[0 0] TP, [0 1] FP, [1 0] FN, [1 1] TN}). This is the output format of contingency-table with two inputs.
  - A sequence of sequences representing the rows of the table (e.g., [[TP FP] [FN TN]]). This is equivalent to rows->contingency-table.
Parameters:
group1 (sequence): The first sequence of binary outcomes/categories.
group2 (sequence): The second sequence of binary outcomes/categories. Must have the same length as group1.
contingency-table (map or sequence of sequences): A pre-computed 2x2 contingency table.
Returns the calculated Matthews Correlation Coefficient as a double.
Note: The implementation uses marginal sums from the contingency table, which is mathematically equivalent to the standard formula but avoids potential division by zero in the denominator product if any marginal sum is zero.
See also contingency-table, contingency-2x2-measures, binary-measures-all.
me
(me [vs1 vs2-or-val])
(me vs1 vs2-or-val)
Calculates the Mean Error (ME) between two sequences or a sequence and constant value.
Parameters:
vs1 (sequence of numbers): The first sequence.
vs2-or-val (sequence of numbers or single number): The second sequence of numbers, or a single number to compare against each element of vs1.
Both sequences (vs1
and vs2
) must have the same length if both are sequences. If vs2-or-val
is a single number, it is compared element-wise to vs1
.
Returns the calculated Mean Error as a double.
Note: Positive ME indicates that vs1
values tend to be greater than vs2
values on average, while negative ME indicates vs1
values tend to be smaller. ME can be influenced by the magnitude of errors and their signs. It does not directly measure the magnitude of the typical error due to potential cancellation of positive and negative differences.
See also mae (Mean Absolute Error), mse (Mean Squared Error), rmse (Root Mean Squared Error).
mean
(mean vs)
(mean vs weights)
Calculates the arithmetic mean (average) of a sequence vs
.
If weights
are provided, calculates the weighted arithmetic mean.
Parameters:
vs : Sequence of numbers.
weights (optional): Sequence of non-negative weights corresponding to vs. Must have the same count as vs.
Returns the calculated mean as a double.
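For example (the weighted mean is Σwᵢxᵢ / Σwᵢ):

(stats/mean [1 2 3 4])       ;; => 2.5
(stats/mean [1 2 3] [1 1 2]) ;; => 2.25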
mean-absolute-deviation
(mean-absolute-deviation vs)
(mean-absolute-deviation vs center)
Calculates the Mean Absolute Deviation of a sequence vs
.
MeanAD is a measure of the variability of a univariate sample of quantitative data. It is defined as the mean of the absolute deviations from a central point, typically the data’s mean.
MeanAD = mean(|X_i - center|)
Parameters:
vs : Sequence of numbers.
center (optional, double): The central point from which to calculate deviations. If nil or not provided, the arithmetic mean of vs is used as the center.
Returns the calculated Mean Absolute Deviation as a double.
Unlike median-absolute-deviation, which uses the median of absolute deviations from the median, the Mean Absolute Deviation uses the mean of absolute deviations from the mean (or specified center). This makes it more sensitive to outliers than median-absolute-deviation but less sensitive than the standard deviation.
See also median-absolute-deviation, stddev, mean.
means-ratio
(means-ratio [group1 group2])
(means-ratio group1 group2)
(means-ratio group1 group2 adjusted?)
Calculates the ratio of the mean of group1
to the mean of group2
.
This is a measure of effect size in the ‘Ratio Family’, comparing the central tendency of two groups multiplicatively.
Parameters:
group1 (seq of numbers): The first independent sample. The mean of this group is the numerator.
group2 (seq of numbers): The second independent sample. The mean of this group is the denominator.
adjusted? (boolean, optional): If true, applies a small-sample bias correction to the ratio. Defaults to false.
Returns the calculated ratio of means as a double.
A value greater than 1 indicates that group1
has a larger mean than group2
. A value less than 1 indicates group1
has a smaller mean. A value close to 1 indicates similar means.
The adjusted?
version attempts to provide a less biased estimate of the population mean ratio, particularly for small sample sizes, by incorporating variances into the calculation (based on Bickel and Doksum, see also means-ratio-corrected).
See also means-ratio-corrected (which is equivalent to calling this with adjusted?
set to true
).
means-ratio-corrected
(means-ratio-corrected [group1 group2])
(means-ratio-corrected group1 group2)
Calculates a bias-corrected ratio of the mean of group1
to the mean of group2
.
This function applies a correction (based on Bickel and Doksum) to the simple ratio mean(group1) / mean(group2)
to reduce bias, particularly for small sample sizes.
It is equivalent to calling (means-ratio group1 group2 true)
.
Parameters:
- group1 (seq of numbers): The first independent sample. The mean of this group is the numerator.
- group2 (seq of numbers): The second independent sample. The mean of this group is the denominator.
Returns the calculated bias-corrected ratio of means as a double.
See also means-ratio (for the simple, uncorrected ratio).
median
(median vs estimation-strategy)
(median vs)
Calculates median of a sequence vs
.
An optional estimation-strategy
keyword can be provided to specify the method used for estimating the quantile, particularly how interpolation is handled when the desired quantile falls between data points in the sorted sequence.
Available estimation-strategy values:
- :legacy (Default): The original method used in Apache Commons Math.
- :r1 through :r9: Correspond to the nine quantile estimation algorithms recommended by Hyndman and Fan (1996). Each strategy differs slightly in how it calculates the index (e.g., using np or (n+1)p) and how it interpolates between points.
For detailed mathematical descriptions of each estimation strategy, refer to the Apache Commons Math Percentile documentation.
median-3
(median-3 a b c)
Median of three values. See median.
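A short illustration; the :r1 value assumes that strategy selects an actual data point rather than interpolating, as described above:

;; assumes (require '[fastmath.stats :as stats])
(stats/median [1.0 2.0 3.0 4.0])     ;; => 2.5, interpolated between 2 and 3
(stats/median [1.0 2.0 3.0 4.0] :r1) ;; => 2.0, picks a data point
(stats/median-3 3.0 1.0 2.0)         ;; => 2.0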
median-absolute-deviation
(median-absolute-deviation vs)
(median-absolute-deviation vs center-or-estimation-strategy)
(median-absolute-deviation vs center estimation-strategy)
Calculates the Median Absolute Deviation (MAD) of a sequence vs
.
MAD is a robust measure of the variability of a univariate sample of quantitative data. It is defined as the median of the absolute deviations from the data’s median (or a specified center).
MAD = median(|X_i - median(X)|)
Parameters:
- vs: Sequence of numbers.
- center-or-estimation-strategy (optional): The central point from which to calculate deviations, or an estimation strategy. If nil or not provided, the median of vs is used as the center. If a keyword, it is treated as the estimation strategy for the median.
- estimation-strategy (optional, keyword): The estimation strategy to use for calculating the median(s). This applies to the calculation of the central value (if center is not provided) and to the final median of the absolute deviations. See median or quantile for available strategies (e.g., :legacy, :r1 through :r9).
Returns the calculated MAD as a double.
MAD is less sensitive to outliers than the standard deviation.
See also mean-absolute-deviation, stddev, median, quantile.
minimum
(minimum vs)
Finds the minimum value in a sequence of numbers.
minimum-discrimination-information-test
(minimum-discrimination-information-test contingency-table-or-xs)
(minimum-discrimination-information-test contingency-table-or-xs params)
Minimum discrimination information test, a power divergence test with lambda = -1.0.
Performs a power divergence test, which encompasses several common statistical tests like Chi-squared, G-test (likelihood ratio), etc., based on the lambda parameter. This function can perform either a goodness-of-fit test or a test for independence in a contingency table.
Usage:
- Goodness-of-Fit (GOF):
  - Input: observed-counts (sequence of numbers) and :p (expected probabilities/weights).
  - Input: data (sequence of numbers) and :p (a distribution object). In this case, a histogram of data is created (controlled by :bins) and compared against the probability mass/density of the distribution in those bins.
- Test for Independence:
  - Input: contingency-table (2D sequence or map format). The :p option is ignored.
Options map:
- :lambda (double, default: 2/3): Determines the specific test statistic. Common values:
  - 1.0: Pearson Chi-squared test (chisq-test).
  - 0.0: G-test / Multinomial Likelihood Ratio test (multinomial-likelihood-ratio-test).
  - -0.5: Freeman-Tukey test (freeman-tukey-test).
  - -1.0: Minimum Discrimination Information test (minimum-discrimination-information-test).
  - -2.0: Neyman Modified Chi-squared test (neyman-modified-chisq-test).
  - 2/3: Cressie-Read test (default, cressie-read-test).
- :p (seq of numbers or distribution): Expected probabilities/weights (for GOF with counts) or a fastmath.random distribution object (for GOF with data). Ignored for independence tests.
- :alpha (double, default: 0.05): Significance level for confidence intervals.
- :ci-sides (keyword, default: :two-sided): Sides for bootstrap confidence intervals (:two-sided, :one-sided-greater, :one-sided-less).
- :sides (keyword, default: :one-sided-greater): Alternative hypothesis side for the p-value calculation against the Chi-squared distribution (:one-sided-greater, :one-sided-less, :two-sided).
- :bootstrap-samples (long, default: 1000): Number of bootstrap samples for confidence interval estimation.
- :ddof (long, default: 0): Delta degrees of freedom. Adjustment subtracted from the calculated degrees of freedom.
- :bins (number, keyword, or seq): Used only for the GOF test against a distribution. Specifies the number of bins, an estimation method (see histogram), or explicit bin edges for histogram creation.
Returns a map containing:
- :stat: The calculated power divergence test statistic.
- :chi2: Alias for :stat.
- :df: Degrees of freedom for the test.
- :p-value: The p-value associated with the test statistic.
- :n: Total number of observations.
- :estimate: Observed proportions.
- :expected: Expected counts or proportions under the null hypothesis.
- :confidence-interval: Bootstrap confidence intervals for the observed proportions.
- :lambda, :alpha, :sides, :ci-sides: Input options used.
mode
(mode vs method)
(mode vs method opts)
(mode vs)
Find the value that appears most often in a dataset vs
.
If multiple values share the same highest frequency (or estimated density/histogram peak), this function returns only the first one encountered during processing. The specific mode returned in case of a tie is not guaranteed to be stable. Use modes if you need all tied modes.
For samples potentially drawn from a continuous distribution, several estimation methods are provided via the method
argument:
- :histogram: Calculates the mode based on the peak of a histogram constructed from vs. Uses interpolation within the bin with the highest frequency. Accepts options via opts, primarily :bins to control histogram construction (see histogram).
- :pearson: Estimates the mode using Pearson's second skewness coefficient formula: mode ≈ 3 * median - 2 * mean. Accepts :estimation-strategy in opts for median calculation (see median).
- :kde: Estimates the mode by finding the original data point in vs with the highest estimated probability density, based on Kernel Density Estimation (KDE). Accepts KDE options in opts like :kernel, :bandwidth, etc. (passed to fastmath.kernel.density/kernel-density).
- :default (or when method is omitted): Finds the exact value that occurs most frequently in vs. Suitable for discrete data.
The optional opts
map provides method-specific configuration.
See also modes (returns all modes) and wmode (for weighted data).
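A minimal sketch for discrete data with the default method:

;; assumes (require '[fastmath.stats :as stats])
(stats/mode [1 2 2 3 3 3]) ;; => 3, the most frequent value
(stats/modes [1 1 2 2 3])  ;; => (1 2), tied modes in increasing order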
modes
(modes vs method)
(modes vs method opts)
(modes vs)
Find the values that appear most often in a dataset vs
.
Returns a sequence of all the most frequently appearing values. For the default method (discrete data), modes are sorted in increasing order.
For samples potentially drawn from a continuous distribution, simply finding the most frequent exact value might not be meaningful. Several estimation methods are provided via the method
argument:
- :histogram: Calculates the mode(s) based on the peak(s) of a histogram constructed from vs. Uses interpolation within the bin(s) with the highest frequency. Accepts options via opts, primarily :bins to control histogram construction (see histogram).
- :pearson: Estimates the mode using Pearson's second skewness coefficient formula: mode ≈ 3 * median - 2 * mean. Accepts :estimation-strategy in opts for median calculation (see median). Returns a single estimated mode.
- :kde: Estimates the mode(s) by finding the original data points in vs with the highest estimated probability density, based on Kernel Density Estimation (KDE). Accepts KDE options in opts like :kernel, :bandwidth, etc. (passed to fastmath.kernel.density/kernel-density).
- :default (or when method is omitted): Finds the exact value(s) that occur most frequently in vs. Suitable for discrete data.
The optional opts
map provides method-specific configuration.
See also mode (returns only the first mode) and wmodes (for weighted data).
modified-power-transformation DEPRECATED
Deprecated: Use (box-cox-transformation xs lambda {:negative? true}) instead.
(modified-power-transformation xs)
(modified-power-transformation xs lambda)
(modified-power-transformation xs lambda alpha)
Applies a modified power transformation (Bickel and Doksum) to data.
moment
(moment vs)
(moment vs order)
(moment vs order {:keys [absolute? center mean? normalize?], :or {mean? true}})
Calculate moment (central or/and absolute) of given order (default: 2).
Additional parameters as a map:
- :absolute? - calculate sum as absolute values (default: false)
- :mean? - returns mean (proper moment) or just sum of differences (default: true)
- :center - value of center (default: nil = mean)
- :normalize? - apply normalization by standard deviation to the order power
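For example, the default second central moment, and a higher order on symmetric data (values follow from the definition):

;; assumes (require '[fastmath.stats :as stats])
(stats/moment [1.0 2.0 3.0 4.0])   ;; => 1.25, mean of squared deviations from the mean
(stats/moment [1.0 2.0 3.0 4.0] 3) ;; => 0.0, third central moment vanishes for symmetric data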
mse
(mse [vs1 vs2-or-val])
(mse vs1 vs2-or-val)
Calculates the Mean Squared Error (MSE) between two sequences or a sequence and a constant value.
MSE is a measure of the quality of an estimator or predictor. It quantifies the average of the squared differences between corresponding elements of the input sequences.
Parameters:
- vs1 (sequence of numbers): The first sequence (often the observed or true values).
- vs2-or-val (sequence of numbers or single number): The second sequence (often the predicted or reference values), or a single number to compare against each element of vs1.

If both inputs are sequences, they must have the same length. If vs2-or-val is a single number, it is effectively treated as a sequence of that number repeated count(vs1) times.
Returns the calculated Mean Squared Error as a double.
Note: MSE penalizes larger errors more heavily than smaller errors because the errors are squared. This makes it sensitive to outliers. It equals the rss (Residual Sum of Squares) divided by the number of observations. Its square root is the rmse.
See also rss (Residual Sum of Squares), rmse (Root Mean Squared Error), me (Mean Error), mae (Mean Absolute Error), r2 (Coefficient of Determination).
multinomial-likelihood-ratio-test
(multinomial-likelihood-ratio-test contingency-table-or-xs)
(multinomial-likelihood-ratio-test contingency-table-or-xs params)
Multinomial likelihood ratio test, a power divergence test with lambda = 0.0.
Performs a power divergence test, which encompasses several common statistical tests like Chi-squared, G-test (likelihood ratio), etc., based on the lambda parameter. This function can perform either a goodness-of-fit test or a test for independence in a contingency table.
Usage:
- Goodness-of-Fit (GOF):
  - Input: observed-counts (sequence of numbers) and :p (expected probabilities/weights).
  - Input: data (sequence of numbers) and :p (a distribution object). In this case, a histogram of data is created (controlled by :bins) and compared against the probability mass/density of the distribution in those bins.
- Test for Independence:
  - Input: contingency-table (2D sequence or map format). The :p option is ignored.
Options map:
- :lambda (double, default: 2/3): Determines the specific test statistic. Common values:
  - 1.0: Pearson Chi-squared test (chisq-test).
  - 0.0: G-test / Multinomial Likelihood Ratio test (multinomial-likelihood-ratio-test).
  - -0.5: Freeman-Tukey test (freeman-tukey-test).
  - -1.0: Minimum Discrimination Information test (minimum-discrimination-information-test).
  - -2.0: Neyman Modified Chi-squared test (neyman-modified-chisq-test).
  - 2/3: Cressie-Read test (default, cressie-read-test).
- :p (seq of numbers or distribution): Expected probabilities/weights (for GOF with counts) or a fastmath.random distribution object (for GOF with data). Ignored for independence tests.
- :alpha (double, default: 0.05): Significance level for confidence intervals.
- :ci-sides (keyword, default: :two-sided): Sides for bootstrap confidence intervals (:two-sided, :one-sided-greater, :one-sided-less).
- :sides (keyword, default: :one-sided-greater): Alternative hypothesis side for the p-value calculation against the Chi-squared distribution (:one-sided-greater, :one-sided-less, :two-sided).
- :bootstrap-samples (long, default: 1000): Number of bootstrap samples for confidence interval estimation.
- :ddof (long, default: 0): Delta degrees of freedom. Adjustment subtracted from the calculated degrees of freedom.
- :bins (number, keyword, or seq): Used only for the GOF test against a distribution. Specifies the number of bins, an estimation method (see histogram), or explicit bin edges for histogram creation.
Returns a map containing:
- :stat: The calculated power divergence test statistic.
- :chi2: Alias for :stat.
- :df: Degrees of freedom for the test.
- :p-value: The p-value associated with the test statistic.
- :n: Total number of observations.
- :estimate: Observed proportions.
- :expected: Expected counts or proportions under the null hypothesis.
- :confidence-interval: Bootstrap confidence intervals for the observed proportions.
- :lambda, :alpha, :sides, :ci-sides: Input options used.
neyman-modified-chisq-test
(neyman-modified-chisq-test contingency-table-or-xs)
(neyman-modified-chisq-test contingency-table-or-xs params)
Neyman modified chi-squared test, a power divergence test with lambda = -2.0.
Performs a power divergence test, which encompasses several common statistical tests like Chi-squared, G-test (likelihood ratio), etc., based on the lambda parameter. This function can perform either a goodness-of-fit test or a test for independence in a contingency table.
Usage:
- Goodness-of-Fit (GOF):
  - Input: observed-counts (sequence of numbers) and :p (expected probabilities/weights).
  - Input: data (sequence of numbers) and :p (a distribution object). In this case, a histogram of data is created (controlled by :bins) and compared against the probability mass/density of the distribution in those bins.
- Test for Independence:
  - Input: contingency-table (2D sequence or map format). The :p option is ignored.
Options map:
- :lambda (double, default: 2/3): Determines the specific test statistic. Common values:
  - 1.0: Pearson Chi-squared test (chisq-test).
  - 0.0: G-test / Multinomial Likelihood Ratio test (multinomial-likelihood-ratio-test).
  - -0.5: Freeman-Tukey test (freeman-tukey-test).
  - -1.0: Minimum Discrimination Information test (minimum-discrimination-information-test).
  - -2.0: Neyman Modified Chi-squared test (neyman-modified-chisq-test).
  - 2/3: Cressie-Read test (default, cressie-read-test).
- :p (seq of numbers or distribution): Expected probabilities/weights (for GOF with counts) or a fastmath.random distribution object (for GOF with data). Ignored for independence tests.
- :alpha (double, default: 0.05): Significance level for confidence intervals.
- :ci-sides (keyword, default: :two-sided): Sides for bootstrap confidence intervals (:two-sided, :one-sided-greater, :one-sided-less).
- :sides (keyword, default: :one-sided-greater): Alternative hypothesis side for the p-value calculation against the Chi-squared distribution (:one-sided-greater, :one-sided-less, :two-sided).
- :bootstrap-samples (long, default: 1000): Number of bootstrap samples for confidence interval estimation.
- :ddof (long, default: 0): Delta degrees of freedom. Adjustment subtracted from the calculated degrees of freedom.
- :bins (number, keyword, or seq): Used only for the GOF test against a distribution. Specifies the number of bins, an estimation method (see histogram), or explicit bin edges for histogram creation.
Returns a map containing:
- :stat: The calculated power divergence test statistic.
- :chi2: Alias for :stat.
- :df: Degrees of freedom for the test.
- :p-value: The p-value associated with the test statistic.
- :n: Total number of observations.
- :estimate: Observed proportions.
- :expected: Expected counts or proportions under the null hypothesis.
- :confidence-interval: Bootstrap confidence intervals for the observed proportions.
- :lambda, :alpha, :sides, :ci-sides: Input options used.
normality-test
(normality-test xs)
(normality-test xs params)
(normality-test xs skew kurt {:keys [sides], :or {sides :one-sided-greater}})
Performs the D’Agostino-Pearson K² omnibus test for normality.
This test combines the results of the skewness and kurtosis tests to provide an overall assessment of whether the sample data deviates from a normal distribution in terms of either asymmetry or peakedness/tailedness.
The test works by:
1. Calculating a normalized test statistic (Z₁) for skewness using skewness-test.
2. Calculating a normalized test statistic (Z₂) for kurtosis using kurtosis-test.
3. Combining these into an omnibus statistic: K² = Z₁² + Z₂².
4. Under the null hypothesis that the data comes from a normal distribution, K² approximately follows a Chi-squared distribution with 2 degrees of freedom.
Parameters:
- xs (seq of numbers): The sample data.
- skew (double, optional): A pre-calculated skewness value (type :g1 used by default in the underlying test).
- kurt (double, optional): A pre-calculated kurtosis value (type :kurt used by default in the underlying test).
- params (map, optional): Options map:
  - :sides (keyword, default :one-sided-greater): Specifies the side(s) of the Chi-squared(2) distribution used for p-value calculation.
    - :one-sided-greater (default and standard): Tests if K² is significantly large, indicating departure from normality in skewness, kurtosis, or both.
    - :one-sided-less: Tests if the K² statistic is significantly small.
    - :two-sided: Tests if the K² statistic is extreme in either tail.
Returns a map containing:
- :Z: The calculated K² omnibus test statistic (labeled :Z for consistency, though it follows Chi-squared(2)).
- :stat: Alias for :Z.
- :p-value: The p-value associated with the K² statistic and :sides.
- :skewness: The sample skewness value used (either provided or calculated).
- :kurtosis: The sample kurtosis value used (either provided or calculated).
See also skewness-test, kurtosis-test, jarque-bera-test.
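A usage sketch; only the shape of the result is shown, not specific values:

;; assumes (require '[fastmath.stats :as stats])
(stats/normality-test [2.1 2.3 1.9 2.8 2.2 2.4 2.0 2.6 2.5 1.8
                       2.7 2.2 2.1 2.3 1.7 2.9 2.4 2.0 2.6 2.2])
;; => {:Z ..., :stat ..., :p-value ..., :skewness ..., :kurtosis ...}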
omega-sq
(omega-sq [group1 group2])
(omega-sq group1 group2)
(omega-sq group1 group2 degrees-of-freedom)
Calculates Omega squared (ω²), an effect size measure for the simple linear regression of group1
on group2
.
Omega squared estimates the proportion of variance in the dependent variable (group1
) that is accounted for by the independent variable (group2
) in the population. It is considered a less biased alternative to r2-determination.
Parameters:
- group1 (seq of numbers): The dependent variable.
- group2 (seq of numbers): The independent variable. Must have the same length as group1.
- degrees-of-freedom (double, optional): The degrees of freedom for the regression model. Defaults to 1.0, which is standard for simple linear regression and is used in the 2-arity version. Providing a different value allows calculating ω² for cases with multiple predictors if the sums of squares are computed for the overall model.
Returns the calculated Omega squared value as a double. The value typically ranges from 0.0 to 1.0.
Interpretation:
- 0.0 indicates that group2 explains none of the variance in group1 in the population.
- 1.0 indicates that group2 perfectly explains the variance in group1 in the population.
Note: While often presented in the context of ANOVA, this implementation applies the formula to the sums of squares obtained from a simple linear regression between the two sequences. The 3-arity version allows specifying a custom degrees of freedom for regression, which might be relevant for calculating overall \(\omega^2\) in multiple regression contexts (where degrees-of-freedom
would be the number of predictors).
See also eta-sq (Eta-squared, often based on \(R^2\)), epsilon-sq (another adjusted R²-like measure), r2-determination (R-squared).
one-way-anova-test
(one-way-anova-test xss)
(one-way-anova-test xss {:keys [sides], :or {sides :one-sided-greater}})
Performs a one-way analysis of variance (ANOVA) test.
ANOVA tests the null hypothesis that the means of two or more independent groups are equal. It assumes that the data within each group are normally distributed and have equal variances.
Parameters:
- xss (sequence of sequences): A collection where each element is a sequence representing a group of observations.
- params (map, optional): Options map with the following key:
  - :sides (keyword, default :one-sided-greater): Alternative hypothesis side for the F-test. Possible values: :one-sided-greater, :one-sided-less, :two-sided.
Returns a map containing:
- :F: The F-statistic for the test.
- :stat: Alias for :F.
- :p-value: The p-value for the test.
- :df: Degrees of freedom for the F-statistic ([DFt, DFe]).
- :n: Sequence of sample sizes for each group.
- :SSt: Sum of squares between groups (treatment).
- :SSe: Sum of squares within groups (error).
- :DFt: Degrees of freedom between groups.
- :DFe: Degrees of freedom within groups.
- :MSt: Mean square between groups.
- :MSe: Mean square within groups.
- :sides: Test side used.
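A usage sketch with three small groups; for these group sizes DFt = 2 and DFe = 9:

;; assumes (require '[fastmath.stats :as stats])
(stats/one-way-anova-test [[4.2 4.8 5.1 4.9]
                           [5.6 5.9 6.1 5.8]
                           [4.4 4.6 4.5 4.9]])
;; => map with :F, :p-value, :df, :n and the sums/means of squares listed above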
outer-fence-extent
(outer-fence-extent vs)
(outer-fence-extent vs estimation-strategy)
Returns LOF, UOF and median
outliers
(outliers vs)
(outliers vs estimation-strategy)
(outliers vs q1 q3)
Find outliers defined as values outside inner fences.
Let Q1 be the 25th percentile and Q3 the 75th percentile; IQR is (- Q3 Q1).
- LIF (Lower Inner Fence) equals (- Q1 (* 1.5 IQR)).
- UIF (Upper Inner Fence) equals (+ Q3 (* 1.5 IQR)).
Returns a sequence of outliers.
Optional estimation-strategy
argument can be set to change quantile calculations estimation type. See estimation-strategies.
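For instance, with the default quantile estimation a single extreme value lies beyond the upper inner fence:

;; assumes (require '[fastmath.stats :as stats])
(stats/outliers [1 2 3 4 5 6 7 8 9 100]) ;; => (100)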
p-overlap
(p-overlap [group1 group2])
(p-overlap group1 group2)
(p-overlap group1 group2 {:keys [kde bandwidth min-iterations steps], :or {kde :gaussian, min-iterations 3, steps 500}})
Calculates the overlapping index between the estimated distributions of two samples using Kernel Density Estimation (KDE).
This function estimates the probability density function (PDF) for group1
and group2
using KDE and then calculates the area of overlap between the two estimated PDFs. The area of overlap is the integral of the minimum of the two density functions.
Parameters:
- group1 (seq of numbers): The first sample.
- group2 (seq of numbers): The second sample.
- opts (map, optional): Options map for KDE and integration:
  - :kde (keyword, default :gaussian): The kernel function to use for KDE. See fastmath.kernel.density/kernel-density+ for options.
  - :bandwidth (double, optional): The bandwidth for KDE. If omitted, it is automatically estimated.
  - :min-iterations (long, default 3): Minimum number of iterations for Romberg integration.
  - :steps (long, default 500): Number of steps (subintervals) for numerical integration over the relevant range.
Returns the calculated overlapping index as a double, representing the area of overlap between the two estimated distributions. A value closer to 1 indicates greater overlap, while a value closer to 0 indicates less overlap.
This measure quantifies the degree to which two distributions share common values and can be seen as a measure of similarity.
p-value
(p-value stat)
(p-value distribution stat)
(p-value distribution stat sides)
Calculates the p-value for a given test statistic based on a reference probability distribution.
The p-value represents the probability of observing a test statistic as extreme as, or more extreme than, the provided stat
, assuming the null hypothesis is true (where the null hypothesis implies stat
follows the given distribution
).
Parameters:
- distribution (distribution object, optional): The probability distribution object (from fastmath.random) that the test statistic follows under the null hypothesis. Defaults to the standard normal distribution (fastmath.random/default-normal) if omitted.
- stat (double): The observed value of the test statistic.
- sides (keyword, optional): Specifies the type of alternative hypothesis and how ‘extremeness’ is defined. Defaults to :two-sided.
  - :two-sided or :both: Alternative hypothesis is that the true parameter is different from the null value (tests for extremeness in either tail). Calculates 2 * min(CDF(stat), CCDF(stat)) (adjusted for discrete distributions).
  - :one-sided-greater or :right: Alternative hypothesis is that the true parameter is greater than the null value (tests for extremeness in the right tail). Calculates CCDF(stat) (adjusted for discrete distributions).
  - :one-sided-less, :left, or :one-sided: Alternative hypothesis is that the true parameter is less than the null value (tests for extremeness in the left tail). Calculates CDF(stat).
Note: For discrete distributions, a continuity correction (stat - 1
for CCDF calculations) is applied when calculating right-tail or two-tail probabilities involving the upper tail. This ensures the probability mass at the statistic value is correctly accounted for.
Returns the calculated p-value (a double between 0.0 and 1.0).
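Two quick checks against the standard normal distribution, with fastmath.random required as r:

;; assumes (require '[fastmath.stats :as stats])
(require '[fastmath.random :as r])
(stats/p-value 1.96)                                      ;; => ~0.05, two-sided by default
(stats/p-value r/default-normal 1.645 :one-sided-greater) ;; => ~0.05, right tail only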
pacf
(pacf data)
(pacf data lags)
Calculates the Partial Autocorrelation Function (PACF) for a given time series data
.
The PACF measures the linear dependence between a time series and its lagged values after removing the effects of the intermediate lags. It helps identify the direct relationship at each lag and is used to determine the order of autoregressive (AR) components in time series models (e.g., ARIMA).
Parameters:
- data (seq of numbers): The time series data.
- lags (long, optional): The maximum lag for which to calculate the PACF. If omitted, calculates PACF for lags from 0 up to (dec (count data)).
Returns a sequence of doubles representing the partial autocorrelation coefficients for the specified lags. The value at lag 0 is always 0.0.
pacf-ci
(pacf-ci data)
(pacf-ci data lags)
(pacf-ci data lags alpha)
Calculates the Partial Autocorrelation Function (PACF) for a time series and provides approximate confidence intervals.
This function computes the PACF of the input time series data
for specified lags (see pacf) and includes approximate confidence intervals around the PACF estimates. These intervals help determine whether the partial autocorrelation at a specific lag is statistically significant (i.e., likely non-zero in the population).
Parameters:
- data (seq of numbers): The time series data.
- lags (long, optional): The maximum lag for which to calculate the PACF and CI. If omitted, calculates for lags up to (dec (count data)).
- alpha (double, optional): The significance level for the confidence intervals. Defaults to 0.05 (for a 95% CI).
Returns a map containing:
- :ci (double): The value of the approximate standard confidence interval bound for lags > 0. If the absolute value of a PACF coefficient at lag k > 0 exceeds this value, it is considered statistically significant.
- :pacf (seq of doubles): The sequence of partial autocorrelation coefficients at lags from 0 up to lags (calculated using pacf).
pearson-correlation
(pearson-correlation [vs1 vs2])
(pearson-correlation vs1 vs2)
Calculates the Pearson product-moment correlation coefficient between two sequences.
This function measures the linear relationship between two datasets. The coefficient value ranges from -1.0 (perfect negative linear correlation) to 1.0 (perfect positive linear correlation), with 0.0 indicating no linear correlation.
Parameters:
- [vs1 vs2] (sequence of two sequences): A sequence containing the two sequences of numbers.
- vs1, vs2 (sequences): The two sequences of numbers directly as arguments.
Both input sequences must contain only numbers and must have the same length.
Returns the calculated Pearson correlation coefficient as a double. Returns NaN
if either sequence has zero variance (i.e., all elements are the same).
See also correlation (general correlation, defaults to Pearson), spearman-correlation, kendall-correlation, correlation-matrix.
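A quick contrast between linear and merely monotonic association:

;; assumes (require '[fastmath.stats :as stats])
(stats/pearson-correlation [1 2 3 4] [2 4 6 8])   ;; => 1.0, perfectly linear
(stats/pearson-correlation [1 2 3 4] [1 4 9 16])  ;; < 1.0, relationship is non-linear
(stats/spearman-correlation [1 2 3 4] [1 4 9 16]) ;; => 1.0, perfectly monotonic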
pearson-r
(pearson-r [group1 group2])
(pearson-r group1 group2)
Calculates the Pearson r
correlation coefficient between two sequences.
This function is an alias for pearson-correlation.
See pearson-correlation for detailed documentation, parameters, and usage examples.
percentile
(percentile vs p)
(percentile vs p estimation-strategy)
Calculates the p-th percentile of a sequence vs
.
The percentile p
is a value between 0 and 100, inclusive.
An optional estimation-strategy
keyword can be provided to specify the method used for estimating the percentile, particularly how interpolation is handled when the desired percentile falls between data points in the sorted sequence.
Available estimation-strategy values:
- :legacy (Default): The original method used in Apache Commons Math.
- :r1 through :r9: Correspond to the nine quantile estimation algorithms recommended by Hyndman and Fan (1996). Each strategy differs slightly in how it calculates the index (e.g., using np or (n+1)p) and how it interpolates between points.
For detailed mathematical descriptions of each estimation strategy, refer to the Apache Commons Math Percentile documentation.
See also quantile (which uses a 0.0-1.0 range) and percentiles.
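The two functions differ only in the scale of the second argument:

;; assumes (require '[fastmath.stats :as stats])
(stats/percentile [10 20 30 40 50] 50) ;; => 30.0
(stats/quantile [10 20 30 40 50] 0.5)  ;; => 30.0, the same point on the 0-1 scale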
percentile-bc-extent
(percentile-bc-extent vs)
(percentile-bc-extent vs p)
(percentile-bc-extent vs p1 p2)
(percentile-bc-extent vs p1 p2 estimation-strategy)
Return bias corrected percentile range and mean for bootstrap samples. See https://projecteuclid.org/euclid.ss/1032280214
p - calculates extent of bias corrected p and 100-p (default: p=2.5)
Set estimation-strategy to :r7 to get the same result as in R coxed::bca.
percentile-bca-extent
(percentile-bca-extent vs)
(percentile-bca-extent vs p)
(percentile-bca-extent vs p1 p2)
(percentile-bca-extent vs p1 p2 estimation-strategy)
(percentile-bca-extent vs p1 p2 accel estimation-strategy)
Return bias corrected percentile range and mean for bootstrap samples. Also accounts for variance variation through the acceleration parameter. See https://projecteuclid.org/euclid.ss/1032280214
p - calculates extent of bias corrected p and 100-p (default: p=2.5)
Set estimation-strategy to :r7 to get the same result as in R coxed::bca.
percentile-extent
(percentile-extent vs)
(percentile-extent vs p)
(percentile-extent vs p1 p2)
(percentile-extent vs p1 p2 estimation-strategy)
Return percentile range and median.
p - calculates extent of p and 100-p (default: p=25)
percentiles
(percentiles vs)
(percentiles vs ps)
(percentiles vs ps estimation-strategy)
Calculates the sequence of p-th percentiles of a sequence vs
.
The percentiles ps are a sequence of values between 0 and 100, inclusive.
An optional estimation-strategy
keyword can be provided to specify the method used for estimating the percentile, particularly how interpolation is handled when the desired percentile falls between data points in the sorted sequence.
Available estimation-strategy values:
- :legacy (Default): The original method used in Apache Commons Math.
- :r1 through :r9: Correspond to the nine quantile estimation algorithms recommended by Hyndman and Fan (1996). Each strategy differs slightly in how it calculates the index (e.g., using np or (n+1)p) and how it interpolates between points.
For detailed mathematical descriptions of each estimation strategy, refer to the Apache Commons Math Percentile documentation.
See also quantiles (which uses a 0.0-1.0 range) and percentile.
pi
(pi vs)
(pi vs size)
(pi vs size estimation-strategy)
Returns PI as a map, quantile intervals based on interval size.
Quantiles are (1-size)/2
and 1-(1-size)/2
pi-extent
(pi-extent vs)
(pi-extent vs size)
(pi-extent vs size estimation-strategy)
Returns PI extent, quantile intervals based on interval size + median.
Quantiles are (1-size)/2
and 1-(1-size)/2
pooled-mad
(pooled-mad groups)
(pooled-mad groups const)
Calculate pooled median absolute deviation for samples.
const is a scaling constant, approximately 1.4826 by default (the value that makes MAD consistent with the standard deviation for normally distributed data).
pooled-stddev
(pooled-stddev groups)
(pooled-stddev groups method)
Calculate pooled standard deviation for samples and method
Methods:
- :unbiased - sqrt of weighted average of variances (default)
- :biased - biased version of :unbiased, no count correction
- :avg - sqrt of average of variances
pooled-variance
(pooled-variance groups)
(pooled-variance groups method)
Calculate pooled variance for samples and method.
Methods:
- :unbiased - weighted average of variances (default)
- :biased - biased version of :unbiased, no count correction
- :avg - average of variances
population-stddev
(population-stddev vs)
(population-stddev vs mu)
Calculate population standard deviation of vs
.
See stddev.
population-variance
(population-variance vs)
(population-variance vs mu)
Calculate population variance of vs
.
See variance.
population-wstddev
(population-wstddev vs weights)
Calculate population weighted standard deviation of vs
population-wvariance
(population-wvariance vs freqs)
Calculate weighted population variance of vs
.
power-divergence-test
(power-divergence-test contingency-table-or-xs)
(power-divergence-test contingency-table-or-xs {:keys [lambda ci-sides sides p alpha bootstrap-samples ddof bins], :or {lambda m/TWO_THIRD, sides :one-sided-greater, ci-sides :two-sided, alpha 0.05, bootstrap-samples 1000, ddof 0}})
Performs a power divergence test, which encompasses several common statistical tests like Chi-squared, G-test (likelihood ratio), etc., based on the lambda parameter. This function can perform either a goodness-of-fit test or a test for independence in a contingency table.
Usage:
- Goodness-of-Fit (GOF):
  - Input: observed-counts (sequence of numbers) and :p (expected probabilities/weights).
  - Input: data (sequence of numbers) and :p (a distribution object). In this case, a histogram of data is created (controlled by :bins) and compared against the probability mass/density of the distribution in those bins.
- Test for Independence:
  - Input: contingency-table (2D sequence or map format). The :p option is ignored.
Options map:
- :lambda (double, default: 2/3): Determines the specific test statistic. Common values:
  - 1.0: Pearson Chi-squared test (chisq-test).
  - 0.0: G-test / Multinomial Likelihood Ratio test (multinomial-likelihood-ratio-test).
  - -0.5: Freeman-Tukey test (freeman-tukey-test).
  - -1.0: Minimum Discrimination Information test (minimum-discrimination-information-test).
  - -2.0: Neyman Modified Chi-squared test (neyman-modified-chisq-test).
  - 2/3: Cressie-Read test (default, cressie-read-test).
- :p (seq of numbers or distribution): Expected probabilities/weights (for GOF with counts) or a fastmath.random distribution object (for GOF with data). Ignored for independence tests.
- :alpha (double, default: 0.05): Significance level for confidence intervals.
- :ci-sides (keyword, default: :two-sided): Sides for bootstrap confidence intervals (:two-sided, :one-sided-greater, :one-sided-less).
- :sides (keyword, default: :one-sided-greater): Alternative hypothesis side for the p-value calculation against the Chi-squared distribution (:one-sided-greater, :one-sided-less, :two-sided).
- :bootstrap-samples (long, default: 1000): Number of bootstrap samples for confidence interval estimation.
- :ddof (long, default: 0): Delta degrees of freedom. Adjustment subtracted from the calculated degrees of freedom.
- :bins (number, keyword, or seq): Used only for the GOF test against a distribution. Specifies the number of bins, an estimation method (see histogram), or explicit bin edges for histogram creation.
Returns a map containing:
- :stat: The calculated power divergence test statistic.
- :chi2: Alias for :stat.
- :df: Degrees of freedom for the test.
- :p-value: The p-value associated with the test statistic.
- :n: Total number of observations.
- :estimate: Observed proportions.
- :expected: Expected counts or proportions under the null hypothesis.
- :confidence-interval: Bootstrap confidence intervals for the observed proportions.
- :lambda, :alpha, :sides, :ci-sides: Input options used.
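A goodness-of-fit sketch: with :lambda 1.0 the statistic reduces to Pearson's Chi-squared, here (20-25)²/25 + (30-25)²/25 + (50-50)²/50 = 2 with 2 degrees of freedom:

;; assumes (require '[fastmath.stats :as stats])
(-> (stats/power-divergence-test [20 30 50] {:p [0.25 0.25 0.5] :lambda 1.0})
    (select-keys [:stat :df :p-value]))
;; => {:stat 2.0, :df 2, :p-value ...}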
power-transformation DEPRECATED
Deprecated: Use (box-cox-transformation xs lambda {:scaled true}) instead.
(power-transformation xs)
(power-transformation xs lambda)
(power-transformation xs lambda alpha)
Applies a power transformation to data.
powmean
(powmean vs power)
(powmean vs weights power)
Calculates the generalized power mean (also known as the Hölder mean) of a sequence vs
.
The power mean is a generalization of the Pythagorean means (arithmetic, geometric, harmonic) and other means like the quadratic mean (RMS). It is defined for a non-zero real number power
.
Parameters:
- vs: Sequence of numbers. Constraints depend on the power:
  - For power > 0, values should be non-negative.
  - For power = 0, values must be positive (reduces to geometric mean).
  - For power < 0, values must be positive and non-zero.
- weights (optional): Sequence of non-negative weights corresponding to vs. Must have the same count as vs.
- power (double): The exponent defining the mean.
Special Cases:
- power = 0: Returns the geomean.
- power = 1: Returns the arithmetic mean.
- power = -1: Equivalent to the harmean (handled by the general formula).
- power = 2: Returns the Root Mean Square (RMS) or quadratic mean.
- power = inf: Returns the maximum.
- power = -inf: Returns the minimum.
- The implementation includes optimized paths for power values 1/3, 0.5, 2, and 3.
Returns the calculated power mean as a double.
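The Pythagorean means as special cases (values follow from the definitions):

;; assumes (require '[fastmath.stats :as stats])
(stats/powmean [1.0 2.0 4.0] 1.0)  ;; => ~2.333, arithmetic mean
(stats/powmean [1.0 2.0 4.0] 0.0)  ;; => 2.0, geometric mean, cube root of 8
(stats/powmean [1.0 2.0 4.0] -1.0) ;; => ~1.714, harmonic mean, 3 / (1 + 1/2 + 1/4)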
psnr
(psnr [vs1 vs2-or-val])
(psnr vs1 vs2-or-val)
(psnr vs1 vs2-or-val max-value)
Peak Signal-to-Noise Ratio (PSNR).
PSNR is a measure used to quantify the quality of reconstruction of lossy compression codecs (e.g., for images or video). It is calculated using the Mean Squared Error (MSE) between the original and compressed images/signals. A higher PSNR generally indicates a higher quality signal reconstruction (i.e., less distortion).
Parameters:
- vs1 (sequence of numbers): The first sequence (conventionally, the original or reference signal/data).
- vs2-or-val (sequence of numbers or single number): The second sequence (conventionally, the reconstructed or noisy signal/data), or a single number to compare against each element of vs1.
- max-value (optional, double): The maximum possible value of a sample in the data. If not provided, the function automatically determines the maximum value present across both inputs. Providing an explicit max-value is often more appropriate, based on the data type's theoretical maximum range (e.g., 255 for 8-bit data).
If vs2-or-val is a sequence, both vs1 and vs2 must have the same length.
Returns the calculated Peak Signal-to-Noise Ratio as a double. An MSE of zero (a perfect match) yields an infinite PSNR.
quantile
(quantile vs q)
(quantile vs q estimation-strategy)
Calculates the q-th quantile of a sequence vs
.
The quantile q
is a value between 0.0 and 1.0, inclusive.
An optional estimation-strategy
keyword can be provided to specify the method used for estimating the quantile, particularly how interpolation is handled when the desired quantile falls between data points in the sorted sequence.
Available estimation-strategy values:
- :legacy (Default): The original method used in Apache Commons Math.
- :r1 through :r9: Correspond to the nine quantile estimation algorithms recommended by Hyndman and Fan (1996). Each strategy differs slightly in how it calculates the index (e.g., using np or (n+1)p) and how it interpolates between points.
For detailed mathematical descriptions of each estimation strategy, refer to the Apache Commons Math Percentile documentation.
See also percentile (which uses a 0-100 range) and quantiles.
quantile-extent
(quantile-extent vs)
(quantile-extent vs q)
(quantile-extent vs q1 q2)
(quantile-extent vs q1 q2 estimation-strategy)
Return quantile range and median.
q - calculates extent of q and 1.0-q (default: q=0.25)
quantiles
(quantiles vs)
(quantiles vs qs)
(quantiles vs qs estimation-strategy)
Calculates the sequence of q-th quantiles of a sequence vs
.
The quantiles qs are a sequence of values between 0.0 and 1.0, inclusive.
An optional estimation-strategy
keyword can be provided to specify the method used for estimating the quantile, particularly how interpolation is handled when the desired quantile falls between data points in the sorted sequence.
Available estimation-strategy values:
- :legacy (Default): The original method used in Apache Commons Math.
- :r1 through :r9: Correspond to the nine quantile estimation algorithms recommended by Hyndman and Fan (1996). Each strategy differs slightly in how it calculates the index (e.g., using np or (n+1)p) and how it interpolates between points.
For detailed mathematical descriptions of each estimation strategy, refer to the Apache Commons Math Percentile documentation.
See also percentiles (which uses a 0-100 range) and quantile.
r2
(r2 [vs1 vs2-or-val])
(r2 vs1 vs2-or-val)
(r2 vs1 vs2-or-val no-of-variables)
Calculates the Coefficient of Determination (\(R^2\)) or adjusted version between two sequences or a sequence and a constant value.
\(R^2\) is a statistical measure that represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s) in a statistical model. It indicates how well the model fits the observed data.
The standard \(R^2\) is calculated as \(1 - (RSS / TSS)\), where:
- \(RSS\) (Residual Sum of Squares) is the sum of the squared differences between the observed values (vs1) and the predicted/reference values (vs2 or vs2-or-val). See rss.
- \(TSS\) (Total Sum of Squares) is the sum of the squared differences between the observed values (vs1) and their mean. This is calculated using moment of order 2 with :mean? set to false.

This function has two arities:
- (r2 vs1 vs2-or-val): Calculates the standard \(R^2\).
  - vs1 (seq of numbers): The sequence of observed or actual values.
  - vs2-or-val (seq of numbers or single number): The sequence of predicted or reference values, or a single constant value to compare against.
  Returns the calculated standard \(R^2\) as a double. For simple linear regression, this is equal to the square of the Pearson correlation coefficient (r2-determination). \(R^2\) typically ranges from 0 to 1 in this context, but can be negative if the chosen model fits the data worse than a horizontal line through the mean of the observed data.
- (r2 vs1 vs2-or-val no-of-variables): Calculates the adjusted \(R^2\), a modified version of \(R^2\) that accounts for the number of predictors in the model. It increases only if a new term improves the model more than would be expected by chance. The formula for adjusted \(R^2\) is:
  \[ R^2_{adj} = 1 - (1 - R^2) \frac{n-1}{n-p-1} \]
  where \(n\) is the number of observations (length of vs1) and \(p\) is the number of independent variables (no-of-variables).
  - vs1 (seq of numbers): The sequence of observed or actual values.
  - vs2-or-val (seq of numbers or single number): The sequence of predicted or reference values, or a single constant value to compare against.
  - no-of-variables (double): The number of independent variables (\(p\)) used in the model that produced the vs2-or-val predictions.
  Returns the calculated adjusted \(R^2\) as a double.

Both vs1 and vs2 (if vs2-or-val is a sequence) must have the same length.
See also rss, mse, rmse, pearson-correlation, r2-determination.
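A worked example: RSS = 0.01 + 0.01 + 0.04 + 0.04 = 0.1 and TSS = 5.0, so R² = 1 - 0.1/5.0 = 0.98:

;; assumes (require '[fastmath.stats :as stats])
(stats/r2 [1.0 2.0 3.0 4.0] [1.1 1.9 3.2 3.8])   ;; => ~0.98
(stats/r2 [1.0 2.0 3.0 4.0] [1.1 1.9 3.2 3.8] 1) ;; => ~0.97, adjusted: 1 - (1-0.98)*3/2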
r2-determination
(r2-determination [group1 group2])
(r2-determination group1 group2)
Calculates the Coefficient of Determination (\(R^2\)) between two sequences.
This function computes the square of the Pearson product-moment correlation coefficient (pearson-correlation) between group1
and group2
.
\(R^2\) measures the proportion of the variance in one variable that is predictable from the other variable in a linear relationship. For a simple linear regression with one independent variable, this value is equivalent to the \(R^2\) calculated from the Residual Sum of Squares (RSS) and Total Sum of Squares (TSS).
Parameters:
- group1 (seq of numbers): The first sequence.
- group2 (seq of numbers): The second sequence.
Both sequences must have the same length.
Returns the calculated \(R^2\) value as a double between 0.0 and 1.0. Returns NaN if the Pearson correlation cannot be calculated (e.g., one sequence is constant).
See also r2 (for general \(R^2\) and adjusted \(R^2\)), pearson-correlation.
rank-epsilon-sq
(rank-epsilon-sq xss)
Calculates Rank Epsilon-squared (ε²), a measure of effect size for the Kruskal-Wallis H-test.
Rank Epsilon-squared is a non-parametric measure quantifying the proportion of the total variability (based on ranks) in the dependent variable that is associated with group membership (the independent variable). It is analogous to Eta-squared or Epsilon-squared in one-way ANOVA but used for the rank-based Kruskal-Wallis test.
This function calculates Epsilon-squared based on the Kruskal-Wallis H statistic (H
) and the total number of observations (n
) across all groups.
Parameters:
- xss (sequence of sequences): A collection where each element is a sequence representing a group of observations, as used in kruskal-test.
Returns the calculated Rank Epsilon-squared value as a double, ranging from 0 to 1.
Interpretation:
- A value of 0 indicates no difference in the distributions across groups.
- A value closer to 1 indicates that a large proportion of the variability is due to differences between group ranks.
Rank Epsilon-squared is a useful supplement to the Kruskal-Wallis test, providing a measure of the magnitude of the group effect that is not sensitive to assumptions about the data distribution shape (beyond having similar shapes for valid interpretation of the Kruskal-Wallis test itself).
See also kruskal-test, rank-eta-sq (another rank-based effect size).
rank-eta-sq
(rank-eta-sq xss)
Calculates the Rank Eta-squared (η²), an effect size measure for the Kruskal-Wallis H-test.
Rank Eta-squared is a non-parametric measure quantifying the proportion of the total variability (based on ranks) in the dependent variable that is associated with group membership (the independent variable). It is analogous to Eta-squared in one-way ANOVA but used for the rank-based Kruskal-Wallis test.
The statistic is calculated based on the Kruskal-Wallis H statistic, the number of groups (k
), and the total number of observations (n
).
Parameters:
- xss (sequence of sequences): A collection where each element is a sequence representing a group of observations, as used in kruskal-test.
Returns the calculated Rank Eta-squared value as a double, ranging from 0 to 1.
Interpretation:
- A value of 0 indicates no difference in the distributions across groups (all variability is within groups).
- A value closer to 1 indicates that a large proportion of the variability is due to differences between group ranks.
Rank Eta-squared is a useful supplement to the Kruskal-Wallis test, providing a measure of the magnitude of the group effect that is not sensitive to assumptions about the data distribution shape (beyond having similar shapes for valid interpretation of the Kruskal-Wallis test itself).
See also kruskal-test, rank-epsilon-sq (another rank-based effect size).
remove-outliers
(remove-outliers vs)
(remove-outliers vs estimation-strategy)
(remove-outliers vs q1 q3)
Remove outliers defined as values outside inner fences.
Let Q1 be the 25th percentile and Q3 the 75th percentile; IQR is (- Q3 Q1).
- LIF (Lower Inner Fence) equals (- Q1 (* 1.5 IQR)).
- UIF (Upper Inner Fence) equals (+ Q3 (* 1.5 IQR)).
Returns a sequence without outliers.
Optional estimation-strategy
argument can be set to change quantile calculations estimation type. See estimation-strategies.
rescale
(rescale vs)
(rescale vs low high)
Linearly rescale data to the desired range, [0,1] by default.
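For example (values follow from the linear mapping):

;; assumes (require '[fastmath.stats :as stats])
(stats/rescale [2.0 4.0 6.0])          ;; => (0.0 0.5 1.0)
(stats/rescale [2.0 4.0 6.0] -1.0 1.0) ;; => (-1.0 0.0 1.0)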
rmse
(rmse [vs1 vs2-or-val])
(rmse vs1 vs2-or-val)
Calculates the Root Mean Squared Error (RMSE) between two sequences or a sequence and a constant value.
RMSE is the square root of the mse (Mean Squared Error). It represents the standard deviation of the residuals (prediction errors) and has the same units as the original data, making it more interpretable than MSE. It measures the average magnitude of the errors, penalizing larger errors more than smaller ones due to the squaring involved.
Parameters:
- vs1 (sequence of numbers): The first sequence (often the observed or true values).
- vs2-or-val (sequence of numbers or single number): The second sequence (often the predicted or reference values), or a single number to compare against each element of vs1.

If both inputs are sequences, they must have the same length. If vs2-or-val is a single number, it is effectively treated as a sequence of that number repeated count(vs1) times.
Returns the calculated Root Mean Squared Error as a double.
See also mse (Mean Squared Error), rss (Residual Sum of Squares), me (Mean Error), mae (Mean Absolute Error), r2 (Coefficient of Determination).
robust-standardize
(robust-standardize vs)
(robust-standardize vs q)
Normalize samples to have median = 0 and MAD = 1.
If the q argument is used, scaling is done by the quantile difference (Q_q, Q_(1-q)). Set q to 0.25 for the IQR.
rows->contingency-table
(rows->contingency-table xss)
Converts a sequence of sequences (representing rows of counts) into a map-based contingency table.
This function takes a collection where each inner sequence is treated as a row of counts in a grid or matrix. It transforms this matrix representation into a map where keys are [row-index, column-index]
tuples and values are the non-zero counts at that intersection.
This is particularly useful for converting structured count data, like the output of some grouping or tabulation processes, into a format suitable for functions expecting a contingency table map (like contingency-table->marginals
or chi-squared tests).
Parameters:
- xss (sequence of sequences of numbers): A collection where each inner sequence xs_i contains counts for row i. Values within xs_i are interpreted as counts for columns 0, 1, ....
Returns a map where keys are [row-index, column-index]
vectors and values are the corresponding non-zero counts from the input matrix. Zero counts are omitted from the output map.
See also contingency-table (for building tables from raw data), contingency-table->marginals.
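For instance, two rows of counts become a sparse map keyed by [row-index, column-index], with zero cells omitted:

;; assumes (require '[fastmath.stats :as stats])
(stats/rows->contingency-table [[5 0 2]
                                [1 3 0]])
;; => {[0 0] 5, [0 2] 2, [1 0] 1, [1 1] 3}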
rss
(rss [vs1 vs2-or-val])
(rss vs1 vs2-or-val)
Calculates the Residual Sum of Squares (RSS) between two sequences or a sequence and a constant value.
RSS is a measure of the discrepancy between data and a model, often used in regression analysis to quantify the total squared difference between observed values and predicted (or reference) values.
Parameters:
- vs1 (sequence of numbers): The first sequence (often observed values).
- vs2-or-val (sequence of numbers or single number): The second sequence (often predicted or reference values), or a single number to compare against each element of vs1.

If both sequences are provided, they must have the same length. If vs2-or-val is a single number, it is effectively treated as a sequence of that number repeated count(vs1) times.
Returns the calculated Residual Sum of Squares as a double.
See also mse (Mean Squared Error), rmse (Root Mean Squared Error), r2 (Coefficient of Determination).
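The three error measures relate directly, as a quick check against a constant prediction shows:

;; assumes (require '[fastmath.stats :as stats])
(stats/rss [1.0 2.0 3.0] 2.0)  ;; => 2.0, (1-2)² + (2-2)² + (3-2)²
(stats/mse [1.0 2.0 3.0] 2.0)  ;; => ~0.667, RSS divided by n
(stats/rmse [1.0 2.0 3.0] 2.0) ;; => ~0.816, square root of MSE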
second-moment DEPRECATED
Deprecated: Use moment function
sem
(sem vs)
Calculates the Standard Error of the Mean (SEM) for a sequence vs
.
The SEM estimates the standard deviation of the sample mean, providing an indication of how accurately the sample mean represents the population mean. It is calculated as:
SEM = stddev(vs) / sqrt(count(vs))
where stddev(vs)
is the sample standard deviation and count(vs)
is the sample size.
Parameters:
- vs: Sequence of numbers.
Returns the calculated SEM as a double.
A smaller SEM indicates that the sample mean is likely to be a more precise estimate of the population mean.
sem-extent
(sem-extent vs)
Returns the extent (mean - SEM, mean + SEM) and the mean.
similarity
(similarity method P-observed Q-expected)
(similarity method P-observed Q-expected {:keys [bins probabilities? epsilon], :or {probabilities? true, epsilon 1.0E-6}})
Various PDF similarities between two histograms (frequencies) or probabilities.
Q can be a distribution object; in that case a histogram is created out of P.
Arguments:
- method - distance method
- P-observed - frequencies, probabilities or actual data (when Q is a distribution)
- Q-expected - frequencies, probabilities or a distribution object (when P is data)
Options:
- :probabilities? - should P/Q be converted to probabilities, default: true
- :epsilon - small number which replaces 0.0 when division or logarithm is used
- :bins - number of bins or a bins estimation method, see histogram
The list of methods: :intersection
, :czekanowski
, :motyka
, :kulczynski
, :ruzicka
, :inner-product
, :harmonic-mean
, :cosine
, :jaccard
, :dice
, :fidelity
, :squared-chord
See more: Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions by Sung-Hyuk Cha
skewness
(skewness vs)
(skewness vs typ)
Calculate skewness from sequence, a measure of the asymmetry of the probability distribution about its mean.
Parameters:
- vs (seq of numbers): The input sequence.
- typ (keyword or sequence, optional): Specifies the type of skewness measure to calculate. Defaults to :G1.
Available typ values:
- :G1 (Default): Sample skewness based on the third standardized moment, as implemented by Apache Commons Math Skewness. Adjusted for sample size bias.
- :g1 or :pearson: Pearson's moment coefficient of skewness (g1), a bias-adjusted version of the third standardized moment. Expected value 0 for symmetric distributions.
- :b1: Sample skewness coefficient (b1), related to :g1.
- :B1 or :yule: Yule's coefficient (robust), based on quantiles. Takes an optional quantile u (default 0.25) via sequence [:B1 u] or [:yule u].
- :B3: Robust measure comparing the mean and median relative to the mean absolute deviation around the median.
- :skew: An adjusted skewness definition sometimes used in bootstrap (BCa) calculations.
- :mode: Pearson's first skewness coefficient (mode skewness): (mean - mode) / stddev. Requires calculating the mode. Mode calculation method can be specified via sequence [:mode method opts], see mode.
- :median: Robust measure: 3 * (mean - median) / stddev.
- :bowley: Bowley's coefficient (robust), based on quartiles (Q1, Q2, Q3). Also known as the Yule-Bowley coefficient. Calculated as (Q3 + Q1 - 2*Q2) / (Q3 - Q1).
- :hogg: Hogg's robust measure based on the ratio of differences between trimmed means.
- :l-skewness: L-skewness (τ₃), the ratio of the 3rd L-moment (λ₃) to the 2nd L-moment (λ₂, L-scale). Calculated directly using l-moment with the :ratio? option set to true. It's a robust measure of asymmetry. Expected value 0 for symmetric distributions.
Interpretation:
- Positive values generally indicate a distribution skewed to the right (tail is longer on the right).
- Negative values generally indicate a distribution skewed to the left (tail is longer on the left).
- Values near 0 suggest relative symmetry.
Returns the calculated skewness value as a double.
See also skewness-test, normality-test, jarque-bera-test, l-moment.
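A quick look at the sign convention on right-skewed data:

;; assumes (require '[fastmath.stats :as stats])
(stats/skewness [1 2 2 3 3 3 10])         ;; positive, the tail is on the right
(stats/skewness [1 2 2 3 3 3 10] :median) ;; robust variant, 3 * (mean - median) / stddev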
skewness-test
(skewness-test xs)
(skewness-test xs params)
(skewness-test xs skew {:keys [sides type], :or {sides :two-sided, type :g1}})
Performs the D’Agostino test for normality based on sample skewness.
This test assesses the null hypothesis that the data comes from a normally distributed population by checking if the sample skewness significantly deviates from the zero skewness expected under normality.
The test works by:
- Calculating the sample skewness (type configurable via
:type
, default:g1
). - Standardizing the sample skewness relative to its expected value (0) and standard error under the null hypothesis.
- Applying a further transformation (inverse hyperbolic sine based) to this standardized score to yield a final test statistic
Z
that more closely follows a standard normal distribution under the null hypothesis.
Parameters:

- `xs` (seq of numbers): The sample data.
- `skew` (double, optional): A pre-calculated skewness value. If omitted, it is calculated from `xs`.
- `params` (map, optional): Options map:
  - `:sides` (keyword, default `:two-sided`): Specifies the alternative hypothesis.
    - `:two-sided` (default): The population skewness is different from 0.
    - `:one-sided-greater`: The population skewness is greater than 0 (right-skewed).
    - `:one-sided-less`: The population skewness is less than 0 (left-skewed).
  - `:type` (keyword, default `:g1`): The type of skewness to calculate if `skew` is not provided. Note that the internal normalization constants are derived based on `:g1`. See skewness for options.
Returns a map containing:

- `:Z`: The final test statistic, approximately standard normal under H0.
- `:stat`: Alias for `:Z`.
- `:p-value`: The p-value associated with `Z` and the specified `:sides`.
- `:skewness`: The sample skewness value used in the test (either provided or calculated).
See also kurtosis-test, normality-test, jarque-bera-test.
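A sketch of a typical call (same aliases as above):

```clojure
;; is mpg significantly right-skewed?
(stats/skewness-test (ds/mtcars :mpg) {:sides :one-sided-greater})
;; => map with :Z, :stat, :p-value and :skewness
```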
span
(span vs)
Width of the sample: the maximum value minus the minimum value.
spearman-correlation
(spearman-correlation [vs1 vs2])
(spearman-correlation vs1 vs2)
Calculates Spearman’s rank correlation coefficient between two sequences.
Spearman’s rank correlation is a non-parametric measure of the monotonic relationship between two datasets. It assesses how well the relationship between two variables can be described using a monotonic function. It does not require the data to be linearly related or follow a specific distribution. The coefficient is calculated on the ranks of the data rather than the raw values.
Parameters:

- `[vs1 vs2]` (sequence of two sequences): A sequence containing the two sequences of numbers.
- `vs1`, `vs2` (sequences): The two sequences of numbers passed directly as arguments.

Both sequences must have the same length.
Returns the calculated Spearman rank correlation coefficient (a value between -1.0 and 1.0) as a double. A value of 1 indicates a perfect monotonic increasing relationship, -1 a perfect monotonic decreasing relationship, and 0 no monotonic relationship.
See also pearson-correlation, kendall-correlation, correlation.
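For example (a sketch, aliases as above):

```clojure
;; monotonic relationship between car weight and fuel economy
(stats/spearman-correlation (ds/mtcars :wt) (ds/mtcars :mpg))
;; => a value close to -1: heavier cars tend to have lower mpg
```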
standardize
(standardize vs)
Normalize samples to have mean = 0 and stddev = 1.
stats-map
(stats-map vs)
(stats-map vs estimation-strategy)
Calculates a comprehensive set of descriptive statistics for a numerical dataset.
This function computes various summary measures and returns them as a map, providing a quick overview of the data’s central tendency, dispersion, shape, and potential outliers.
Parameters:

- `vs` (seq of numbers): The input sequence of numerical data.
- `estimation-strategy` (keyword, optional): Specifies the method for calculating quantiles (including median, quartiles, and values used for fences). Defaults to `:legacy`. See percentile or quantile for available strategies (e.g., `:r1` through `:r9`).

Returns a map where keys are statistic names (as keywords) and values are their calculated measures:

- `:Size`: The number of data points in the sequence (count).
- `:Min`: The minimum value (see minimum).
- `:Max`: The maximum value (see maximum).
- `:Range`: The difference between the maximum and minimum values (Max - Min).
- `:Mean`: The arithmetic average (see mean).
- `:Median`: The middle value (see median with `estimation-strategy`).
- `:Mode`: The most frequent value (see mode with default method).
- `:Q1`: The first quartile (25th percentile) (see percentile with `estimation-strategy`).
- `:Q3`: The third quartile (75th percentile) (see percentile with `estimation-strategy`).
- `:Total`: The sum of all values (see sum).
- `:SD`: The sample standard deviation (see stddev).
- `:Variance`: The sample variance (SD^2, see variance).
- `:MAD`: The Median Absolute Deviation (see median-absolute-deviation).
- `:SEM`: The Standard Error of the Mean (see sem).
- `:LAV`: The Lower Adjacent Value (smallest value within the inner fence, see adjacent-values).
- `:UAV`: The Upper Adjacent Value (largest value within the inner fence, see adjacent-values).
- `:IQR`: The Interquartile Range (Q3 - Q1).
- `:LOF`: The Lower Outer Fence (Q1 - 3*IQR, see outer-fence-extent).
- `:UOF`: The Upper Outer Fence (Q3 + 3*IQR, see outer-fence-extent).
- `:LIF`: The Lower Inner Fence (Q1 - 1.5*IQR, see inner-fence-extent).
- `:UIF`: The Upper Inner Fence (Q3 + 1.5*IQR, see inner-fence-extent).
- `:Outliers`: A sequence of data points falling outside the inner fences (see outliers).
- `:Kurtosis`: A measure of tailedness/peakedness (see kurtosis with default `:G2` type).
- `:Skewness`: A measure of asymmetry (see skewness with default `:G1` type).
This function is a convenient way to get a standard set of summary statistics for a dataset in a single call.
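A sketch of a typical call, picking a few of the documented keys from the result:

```clojure
(-> (stats/stats-map (ds/mtcars :mpg))
    (select-keys [:Size :Mean :Median :SD :IQR :Outliers]))
```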
stddev
(stddev vs)
(stddev vs mu)
Calculate standard deviation of `vs`.
See population-stddev.
stddev-extent
(stddev-extent vs)
Extent of one standard deviation around the mean, i.e., `mean - stddev` and `mean + stddev`.
sum
(sum vs)
(sum vs compensation-method)
Sum of all `vs` values.
Possible compensated summation methods are: `:kahan`, `:neumayer`, and `:klein`.
t-test-one-sample
(t-test-one-sample xs)
(t-test-one-sample xs m)
Performs a one-sample Student’s t-test to compare the sample mean against a hypothesized population mean.
This test assesses the null hypothesis that the true population mean is equal to `mu`. It is suitable when the population standard deviation is unknown and is estimated from the sample.
Parameters:

- `xs` (seq of numbers): The sample data.
- `params` (map, optional): Options map:
  - `:alpha` (double, default `0.05`): Significance level for the confidence interval.
  - `:sides` (keyword, default `:two-sided`): Specifies the alternative hypothesis.
    - `:two-sided` (default): The true mean is not equal to `mu`.
    - `:one-sided-greater`: The true mean is greater than `mu`.
    - `:one-sided-less`: The true mean is less than `mu`.
  - `:mu` (double, default `0.0`): The hypothesized population mean under the null hypothesis.
Returns a map containing:

- `:t`: The calculated t-statistic.
- `:stat`: Alias for `:t`.
- `:df`: Degrees of freedom (`n-1`).
- `:p-value`: The p-value associated with the t-statistic and `:sides`.
- `:confidence-interval`: Confidence interval for the true population mean.
- `:estimate`: The calculated sample mean.
- `:n`: The sample size.
- `:mu`: The hypothesized population mean used in the test.
- `:stderr`: The standard error of the mean (calculated from the sample).
- `:alpha`: Significance level used.
- `:sides`: Alternative hypothesis side used.
- `:test-type`: Alias for `:sides`.
Assumptions:
- The data are independent observations.
- The data are drawn from a population that is approximately normally distributed. (The t-test is relatively robust to moderate violations, especially with larger sample sizes).
See also z-test-one-sample for large samples or known population standard deviation.
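A sketch (aliases as above), passing the options map described above as the second argument:

```clojure
;; does mean mpg differ from 20?
(stats/t-test-one-sample (ds/mtcars :mpg) {:mu 20.0})
;; => map with :t, :df, :p-value, :confidence-interval, ...
```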
t-test-two-samples
(t-test-two-samples xs ys)
(t-test-two-samples xs ys {:keys [paired? equal-variances?], :or {paired? false, equal-variances? false}, :as params})
Performs a two-sample Student’s t-test to compare the means of two samples.
This function can perform:

- An unpaired t-test (assuming independent samples) using either:
  - Welch's t-test (default, `:equal-variances? false`): Does not assume equal population variances. Uses the Satterthwaite approximation for degrees of freedom. Recommended unless variances are known to be equal.
  - Student's t-test (`:equal-variances? true`): Assumes equal population variances and uses a pooled variance estimate.
- A paired t-test (`:paired? true`): Assumes observations in `xs` and `ys` are paired (e.g., before/after measurements on the same subjects). This performs a one-sample t-test on the differences between paired observations.
The test assesses the null hypothesis that the true difference between the population means (or the mean of the differences for a paired test) is equal to `mu`.
Parameters:

- `xs` (seq of numbers): The first sample.
- `ys` (seq of numbers): The second sample.
- `params` (map, optional): Options map:
  - `:alpha` (double, default `0.05`): Significance level for the confidence interval.
  - `:sides` (keyword, default `:two-sided`): Specifies the alternative hypothesis.
    - `:two-sided` (default): The true difference in means is not equal to `mu`.
    - `:one-sided-greater`: The true difference (`mean(xs) - mean(ys)` or `mean(diff)`) is greater than `mu`.
    - `:one-sided-less`: The true difference (`mean(xs) - mean(ys)` or `mean(diff)`) is less than `mu`.
  - `:mu` (double, default `0.0`): The hypothesized difference in means under the null hypothesis.
  - `:paired?` (boolean, default `false`): If `true`, performs a paired t-test (requires `xs` and `ys` to have the same length). If `false`, performs an unpaired test.
  - `:equal-variances?` (boolean, default `false`): Used only when `paired?` is `false`. If `true`, assumes equal population variances (Student's). If `false`, does not assume equal variances (Welch's).
Returns a map containing:

- `:t`: The calculated t-statistic.
- `:stat`: Alias for `:t`.
- `:df`: Degrees of freedom used for the t-distribution.
- `:p-value`: The p-value associated with the t-statistic and `:sides`.
- `:confidence-interval`: Confidence interval for the true difference in means.
- `:estimate`: The observed difference between sample means (`mean(xs) - mean(ys)` or `mean(differences)`).
- `:n`: Sample sizes as `[count xs, count ys]` (or `count diffs` if paired).
- `:nx`: Sample size of `xs` (if unpaired).
- `:ny`: Sample size of `ys` (if unpaired).
- `:estimated-mu`: Observed sample means as `[mean xs, mean ys]` (if unpaired).
- `:mu`: The hypothesized difference under the null hypothesis.
- `:stderr`: The standard error of the difference between the means (or of the mean difference if paired).
- `:alpha`: Significance level used.
- `:sides`: Alternative hypothesis side used.
- `:test-type`: Alias for `:sides`.
- `:paired?`: Boolean indicating if a paired test was performed.
- `:equal-variances?`: Boolean indicating the variance assumption used (if unpaired).
Assumptions:

- Independence of observations (within and between groups for unpaired).
- Normality of the underlying populations (or of the differences for paired). The t-test is relatively robust to violations of normality, especially with larger sample sizes.
- Equal variances (only if `:equal-variances? true`).
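A sketch comparing mpg of manual vs automatic cars, using the `ds/by` grouping helper:

```clojure
;; Welch's t-test (default) comparing manual (am=1) vs automatic (am=0) mpg
(let [{manual 1 automatic 0} (ds/by ds/mtcars :am :mpg)]
  (stats/t-test-two-samples manual automatic))
```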
trim
(trim vs)
(trim vs quantile)
(trim vs quantile estimation-strategy)
(trim vs low high nan)
Returns trimmed data. Trimming is done using quantiles, by default set to 0.2: values below the lower quantile or above the upper quantile are removed.
trim-lower
(trim-lower vs)
(trim-lower vs quantile)
(trim-lower vs quantile estimation-strategy)
Trim data below the given quantile, default: 0.2.
trim-upper
(trim-upper vs)
(trim-upper vs quantile)
(trim-upper vs quantile estimation-strategy)
Trim data above the given quantile, default: 0.2.
tschuprows-t
(tschuprows-t group1 group2)
(tschuprows-t contingency-table)
Calculates Tschuprow’s T, a measure of association between two nominal variables represented in a contingency table.
Tschuprow’s T is derived from the Pearson’s Chi-squared statistic and measures the strength of the association. Its value ranges from 0 to 1.
- A value of 0 indicates no association between the variables.
- A value of 1 indicates perfect association, but only when the number of rows (`r`) equals the number of columns (`k`) in the contingency table. If `r != k`, Tschuprow's T cannot reach 1, making Cramer's V (cramers-v) often preferred, as it can reach 1 for any table size.

The function can be called in two ways:

- With two sequences `group1` and `group2`: The function will automatically construct a contingency table from the unique values in the sequences.
- With a contingency table, which can be provided as:
  - A map where keys are `[row-index, column-index]` tuples and values are counts (e.g., `{[0 0] 10, [0 1] 5, [1 0] 3, [1 1] 12}`). This is the output format of contingency-table with two inputs.
  - A sequence of sequences representing the rows of the table (e.g., `[[10 5] [3 12]]`). This is equivalent to rows->contingency-table.
Parameters:

- `group1` (sequence): The first sequence of categorical data.
- `group2` (sequence): The second sequence of categorical data. Must have the same length as `group1`.
- `contingency-table` (map or sequence of sequences): A pre-computed contingency table.
Returns the calculated Tschuprow’s T coefficient as a double.
See also chisq-test, cramers-c, cramers-v, cohens-w, contingency-table.
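A sketch with two categorical columns from mtcars (aliases as above):

```clojure
;; association between cylinder count and transmission type
(stats/tschuprows-t (ds/mtcars :cyl) (ds/mtcars :am))
```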
ttest-one-sample DEPRECATED
Deprecated: Use t-test-one-sample
ttest-two-samples DEPRECATED
Deprecated: Use t-test-two-samples
variance
(variance vs)
(variance vs mu)
Calculate variance of `vs`.
See population-variance.
variation
(variation vs)
Calculates the coefficient of variation (CV) for a sequence `vs`.
The CV is a standardized measure of dispersion of a probability distribution or frequency distribution. It is defined as the ratio of the standard deviation to the mean:
CV = stddev(vs) / mean(vs)
This measure is unitless and allows for comparison of variability between datasets with different means or different units.
Parameters:

- `vs`: Sequence of numbers.
Returns the calculated coefficient of variation as a double.
Note: The CV is undefined if the mean is zero, and may be misleading if the mean is close to zero or if the data can take both positive and negative values. All values in `vs` should ideally be positive.
weighted-kappa
(weighted-kappa contingency-table)
(weighted-kappa contingency-table weights)
Calculates Cohen’s weighted Kappa coefficient (κ) for a contingency table, allowing for partial agreement between categories, typically used for ordinal data.
Weighted Kappa measures inter-rater agreement, similar to cohens-kappa, but assigns different penalties to disagreements based on their magnitude. Disagreements between closely related categories are penalized less than disagreements between distantly related categories.
The function can be called in two ways:

- With two sequences `group1` and `group2`: The function will automatically construct a contingency table from the unique values in the sequences. These values are assumed to be ordinal, and their position in the sorted unique value list determines their index. The mapping of values to table indices might need verification.
- With a contingency table, which can be provided as:
  - A map where keys are `[row-index, column-index]` tuples and values are counts (e.g., `{[0 0] 10, [0 1] 5, [1 0] 3, [1 1] 12}`). This is the output format of contingency-table with two inputs. Indices are assumed to represent the ordered categories.
  - A sequence of sequences representing the rows of the table (e.g., `[[10 5] [3 12]]`). This is equivalent to rows->contingency-table. The row and column indices are assumed to correspond to the ordered categories.
Parameters:

- `group1` (sequence): The first sequence of ordinal outcomes/categories.
- `group2` (sequence): The second sequence of ordinal outcomes/categories. Must have the same length as `group1`.
- `contingency-table` (map or sequence of sequences): A pre-computed contingency table where row and column indices correspond to ordered categories.
- `weights` (keyword, function, or map, optional): Specifies the weighting scheme used to quantify the difference between categories. Defaults to `:equal-spacing`.
  - `:equal-spacing` (default, linear weights): Penalizes disagreements linearly with the distance between categories. Weight is `1 - |i-j|/R`, where `i` is the row index, `j` is the column index, and `R` is the maximum dimension of the table (`max(max_row_index, max_col_index)`).
  - `:fleiss-cohen` (quadratic weights): Penalizes disagreements quadratically with the distance. Weight is `1 - (|i-j|/R)^2`.
  - A function `(fn [R id1 id2])`: A custom function that takes the maximum dimension `R`, row index `id1`, and column index `id2`, and returns the weight (typically between 0 and 1, where 1 is perfect agreement).
  - A map `{[id1 id2] weight}`: A custom map providing weights for specific `[row-index, column-index]` pairs. Missing pairs default to a weight of 0.0.
Returns the calculated weighted Cohen’s Kappa coefficient as a double.
Interpretation:

- `κ_w = 1`: Perfect agreement.
- `κ_w = 0`: Agreement is no better than chance.
- `κ_w < 0`: Agreement is worse than chance.
See also cohens-kappa (unweighted Kappa).
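A sketch with a hypothetical 3x3 table of two raters' ordinal scores (rows: rater A, columns: rater B), passed in the rows-of-table format described above:

```clojure
;; quadratic (Fleiss-Cohen) weights penalize large disagreements more
(stats/weighted-kappa [[20 5 1]
                       [4 15 6]
                       [1 5 18]]
                      :fleiss-cohen)
```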
winsor
(winsor vs)
(winsor vs quantile)
(winsor vs quantile estimation-strategy)
(winsor vs low high nan)
Returns winsorized data. Winsorizing is done using quantiles, by default set to 0.2: values below the lower quantile or above the upper quantile are replaced with the corresponding quantile values.
wmean DEPRECATED
Deprecated: Use mean
(wmean vs)
(wmean vs weights)
Weighted mean
wmedian
(wmedian vs ws)
(wmedian vs ws method)
Calculates the median of a sequence `vs` with corresponding weights `ws`.
Parameters:

- `vs`: Sequence of data values.
- `ws`: Sequence of corresponding non-negative weights. Must have the same count as `vs`.
- `method` (optional keyword): Specifies the interpolation method used when the median level (`q=0.5`) falls between points in the weighted ECDF. Defaults to `:linear`.
  - `:linear`: Performs linear interpolation between the data values corresponding to the cumulative weights surrounding `q=0.5`.
  - `:step`: Uses a step function (specifically, step-before) based on the weighted ECDF. The result is the data value whose cumulative weight range includes `q=0.5`.
  - `:average`: Computes the average of the step-before and step-after interpolation methods.
wmode
(wmode vs)
(wmode vs weights)
Returns the primary weighted mode of a sequence `vs`.
The mode is the value that appears most often in a dataset. This function generalizes the mode concept by considering weights associated with each value. A value’s contribution to the mode calculation is proportional to its weight.
If multiple values share the same highest total weight (i.e., there are ties for the mode), this function returns only the first one encountered during processing. The specific mode returned in case of a tie is not guaranteed to be stable across different runs or environments. Use wmodes if you need all tied modes.
Parameters:

- `vs`: Sequence of data values. Can contain any data type (numbers, keywords, etc.).
- `weights` (optional): Sequence of non-negative weights corresponding to `vs`. Must have the same count as `vs`. Defaults to a sequence of 1.0s if omitted, effectively calculating the unweighted mode.
Returns a single value representing the mode (or one of the modes if ties exist).
See also wmodes (returns all modes) and mode (for unweighted numeric data).
wmodes
(wmodes vs)
(wmodes vs weights)
Returns the weighted mode(s) of a sequence `vs`.
The mode is the value that appears most often in a dataset. This function generalizes the mode concept by considering weights associated with each value. A value’s contribution to the mode calculation is proportional to its weight.
Parameters:

- `vs`: Sequence of data values. Can contain any data type (numbers, keywords, etc.).
- `weights` (optional): Sequence of non-negative weights corresponding to `vs`. Must have the same count as `vs`. Defaults to a sequence of 1.0s if omitted, effectively calculating the unweighted modes.
Returns a sequence containing all values that have the highest total weight. If there are ties (multiple values share the same maximum total weight), all tied values are included in the returned sequence. The order of modes in the returned sequence is not guaranteed.
See also wmode (returns only one mode in case of ties) and modes (for unweighted numeric data).
wmw-odds
(wmw-odds [group1 group2])
(wmw-odds group1 group2)
Calculates the Wilcoxon-Mann-Whitney odds (often denoted as ψ) for two independent samples.
This non-parametric effect size measure quantifies the odds that a randomly chosen observation from the first group (`group1`) is greater than a randomly chosen observation from the second group (`group2`).
The statistic is directly related to cliffs-delta (δ): ψ = (1 + δ) / (1 - δ).
Parameters:

- `group1` (seq of numbers): The first independent sample.
- `group2` (seq of numbers): The second independent sample.

Returns the calculated WMW odds as a double.

Interpretation:

- A value greater than 1 indicates that values from `group1` tend to be larger than values from `group2`.
- A value less than 1 indicates that values from `group1` tend to be smaller than values from `group2`.
- A value of 1 indicates stochastic equality between the distributions (50/50 odds).
This measure is robust to violations of normality and is suitable for ordinal data. It is closely related to Cliff’s Delta (δ) and the Mann-Whitney U test statistic.
See also cliffs-delta, ameasure.
wquantile
(wquantile vs ws q)
(wquantile vs ws q method)
Calculates the q-th weighted quantile of a sequence `vs` with corresponding weights `ws`.

The quantile `q` is a value between 0.0 and 1.0, inclusive.

The calculation involves constructing a weighted empirical cumulative distribution function (ECDF) and interpolating to find the value at quantile `q`.
Parameters:

- `vs`: Sequence of data values.
- `ws`: Sequence of corresponding non-negative weights. Must have the same count as `vs`.
- `q`: The quantile level (0.0 < q <= 1.0).
- `method` (optional keyword): Specifies the interpolation method used when `q` falls between points in the weighted ECDF. Defaults to `:linear`.
  - `:linear`: Performs linear interpolation between the data values corresponding to the cumulative weights surrounding `q`.
  - `:step`: Uses a step function (specifically, step-before) based on the weighted ECDF. The result is the data value whose cumulative weight range includes `q`.
  - `:average`: Computes the average of the step-before and step-after interpolation methods. Useful when `q` corresponds exactly to a cumulative weight boundary.
See also: wmedian, wquantiles, quantile.
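A sketch with hypothetical weights, where the last value carries most of the total weight:

```clojure
;; weighted 0.9 quantile: the heavy weight on 4.0 pulls the quantile toward it
(stats/wquantile [1.0 2.0 3.0 4.0] [1.0 1.0 1.0 5.0] 0.9)
```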
wquantiles
(wquantiles vs ws)
(wquantiles vs ws qs)
(wquantiles vs ws qs method)
Calculates weighted quantiles of a sequence `vs` with corresponding weights `ws` at the quantile levels `qs`, a sequence of values between 0.0 and 1.0, inclusive.

The calculation involves constructing a weighted empirical cumulative distribution function (ECDF) and interpolating to find the value at each level in `qs`.
Parameters:

- `vs`: Sequence of data values.
- `ws`: Sequence of corresponding non-negative weights. Must have the same count as `vs`.
- `qs`: Sequence of quantile levels (0.0 < q <= 1.0).
- `method` (optional keyword): Specifies the interpolation method used when a level `q` falls between points in the weighted ECDF. Defaults to `:linear`.
  - `:linear`: Performs linear interpolation between the data values corresponding to the cumulative weights surrounding `q`.
  - `:step`: Uses a step function (specifically, step-before) based on the weighted ECDF. The result is the data value whose cumulative weight range includes `q`.
  - `:average`: Computes the average of the step-before and step-after interpolation methods. Useful when `q` corresponds exactly to a cumulative weight boundary.
wstddev
(wstddev vs freqs)
Calculate weighted (unbiased) standard deviation of `vs`.
wvariance
(wvariance vs freqs)
Calculate weighted (unbiased) variance of `vs`.
yeo-johnson-infer-lambda
(yeo-johnson-infer-lambda xs)
(yeo-johnson-infer-lambda xs lambda-range)
(yeo-johnson-infer-lambda xs lambda-range {:keys [alpha], :or {alpha 0.0}})
Find the optimal `lambda` parameter for the Yeo-Johnson transformation using the maximum log-likelihood method.
yeo-johnson-transformation
(yeo-johnson-transformation xs)
(yeo-johnson-transformation xs lambda)
(yeo-johnson-transformation xs lambda {:keys [alpha inverse?], :or {alpha 0.0}})
Applies the Yeo-Johnson transformation to a dataset.
This transformation is used to stabilize variance and make data more normally distributed. It extends the Box-Cox transformation to allow for zero and negative values.
Parameters:

- `xs`: The input dataset.
- `lambda` (default: 0.0): The power parameter controlling the transformation. If `lambda` is `nil` or a range `[lambda-min, lambda-max]`, it will be inferred using the maximum log-likelihood method.
- Options map:
  - `:alpha` (optional): A shift parameter applied before the transformation.
  - `:inverse?` (optional): Perform the inverse operation; `lambda` must be provided (it can't be inferred).
Returns:
- A transformed sequence of numbers.
Related: box-cox-transformation
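A sketch on a small, hypothetical right-skewed sample; passing `nil` for `lambda` lets it be inferred:

```clojure
;; transform toward normality with an inferred lambda
(stats/yeo-johnson-transformation [0.1 0.4 1.2 3.5 9.8] nil)

;; or infer the lambda explicitly first
(stats/yeo-johnson-infer-lambda [0.1 0.4 1.2 3.5 9.8])
```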
z-test-one-sample
(z-test-one-sample xs)
(z-test-one-sample xs m)
Performs a one-sample Z-test to compare the sample mean against a hypothesized population mean.
This test assesses the null hypothesis that the true population mean is equal to `mu`. It typically assumes either a known population standard deviation or relies on a large sample size (e.g., n > 30) where the sample standard deviation provides a reliable estimate. This implementation uses the sample standard deviation to calculate the standard error.
Parameters:

- `xs` (seq of numbers): The sample data.
- `params` (map, optional): Options map:
  - `:alpha` (double, default `0.05`): Significance level for the confidence interval.
  - `:sides` (keyword, default `:two-sided`): Specifies the alternative hypothesis.
    - `:two-sided` (default): The true mean is not equal to `mu`.
    - `:one-sided-greater`: The true mean is greater than `mu`.
    - `:one-sided-less`: The true mean is less than `mu`.
  - `:mu` (double, default `0.0`): The hypothesized population mean under the null hypothesis.
Returns a map containing:

- `:z`: The calculated Z-statistic.
- `:stat`: Alias for `:z`.
- `:p-value`: The p-value associated with the Z-statistic and the specified `:sides`.
- `:confidence-interval`: Confidence interval for the true population mean.
- `:estimate`: The calculated sample mean.
- `:n`: The sample size.
- `:mu`: The hypothesized population mean used in the test.
- `:stderr`: The standard error of the mean (calculated using the sample standard deviation).
- `:alpha`: Significance level used.
- `:sides`: Alternative hypothesis side used.
- `:test-type`: Alias for `:sides`.
See also t-test-one-sample for smaller samples or when the population standard deviation is unknown.
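A sketch (aliases as above):

```clojure
;; one-sided z-test: is mean mpg greater than 18?
(stats/z-test-one-sample (ds/mtcars :mpg) {:mu 18.0 :sides :one-sided-greater})
```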
z-test-two-samples
(z-test-two-samples xs ys)
(z-test-two-samples xs ys {:keys [paired? equal-variances?], :or {paired? false, equal-variances? false}, :as params})
Performs a two-sample Z-test to compare the means of two independent or paired samples.
This test assesses the null hypothesis that the difference between the population means is equal to `mu` (default 0). It typically assumes known population variances or relies on large sample sizes where sample variances provide good estimates. This implementation calculates the standard error using the provided sample variances.
Parameters:

- `xs` (seq of numbers): The first sample.
- `ys` (seq of numbers): The second sample.
- `params` (map, optional): Options map:
  - `:alpha` (double, default `0.05`): Significance level for the confidence interval.
  - `:sides` (keyword, default `:two-sided`): Specifies the alternative hypothesis.
    - `:two-sided` (default): The true difference in means is not equal to `mu`.
    - `:one-sided-greater`: The true difference in means (`mean(xs) - mean(ys)`) is greater than `mu`.
    - `:one-sided-less`: The true difference in means (`mean(xs) - mean(ys)`) is less than `mu`.
  - `:mu` (double, default `0.0`): The hypothesized difference in means under the null hypothesis.
  - `:paired?` (boolean, default `false`): If `true`, performs a paired Z-test by applying z-test-one-sample to the differences between paired observations in `xs` and `ys` (requires `xs` and `ys` to have the same length). If `false`, performs a two-sample test assuming independence.
  - `:equal-variances?` (boolean, default `false`): Used only when `paired?` is `false`. If `true`, assumes population variances are equal and calculates a pooled standard error. If `false`, calculates the standard error without assuming equal variances (Welch's approach adapted for the Z-test). This affects the standard error calculation, but the standard normal distribution is still used for inference.
Returns a map containing:

- `:z`: The calculated Z-statistic.
- `:stat`: Alias for `:z`.
- `:p-value`: The p-value associated with the Z-statistic and the specified `:sides`.
- `:confidence-interval`: Confidence interval for the true difference in means.
- `:estimate`: The observed difference between sample means (`mean(xs) - mean(ys)`).
- `:n`: Sample sizes as `[count xs, count ys]`.
- `:nx`: Sample size of `xs`.
- `:ny`: Sample size of `ys`.
- `:estimated-mu`: The observed sample means as `[mean xs, mean ys]`.
- `:mu`: The hypothesized difference under the null hypothesis.
- `:stderr`: The standard error of the difference between the means.
- `:alpha`: Significance level used.
- `:sides`: Alternative hypothesis side used.
- `:test-type`: Alias for `:sides`.
- `:paired?`: Boolean indicating if a paired test was performed.
- `:equal-variances?`: Boolean indicating the assumption used for standard error calculation (if unpaired).
See also t-test-two-samples for smaller samples or when population variances are unknown.
fastmath.stats.bootstrap
Bootstrap methods and confidence intervals
bootstrap
(bootstrap input)
(bootstrap input params-or-statistic)
(bootstrap input statistic {:keys [rng samples size method antithetic? dimensions include? multi?], :or {samples 500}, :as params})
Generates bootstrap samples from a given dataset or probabilistic model for resampling purposes.
This function supports both nonparametric bootstrap (resampling directly from the data) and parametric bootstrap (resampling from a statistical model estimated from or provided for the data). It can optionally apply a statistic function to the original data and each sample, returning summary statistics for the bootstrap distribution.
The primary input can be:

- A sequence of data values (for nonparametric bootstrap).
- A map containing:
  - `:data`: The sequence of data values.
  - `:model`: An optional model for parametric bootstrap. If not provided, a default discrete distribution is built from the data (see `:distribution` and `:smoothing`).
The function offers various parameters to control the sampling process and model generation.
Parameters:

- `input` (sequence or map): The data source. Can be a sequence of numbers or a map containing `:data` and optionally `:model`. Can be a sequence of sequences for multidimensional data (when `:dimensions` is `:multi`).
- `statistic` (function, optional): A function that takes a sequence of data and returns a single numerical value (e.g., `fastmath.stats/mean`, `fastmath.stats/median`). If provided, `bootstrap-stats` is called on the results.
- `params` (map, optional): An options map to configure the bootstrap process. Keys include:
  - `:samples` (long, default: 500): The number of bootstrap samples to generate.
  - `:size` (long, optional): The size of each individual bootstrap sample. Defaults to the size of the original data.
  - `:method` (keyword, optional): Specifies the sampling method.
    - `nil` (default): Standard random sampling with replacement.
    - `:jackknife`: Performs leave-one-out jackknife resampling (ignores `:samples` and `:size`).
    - `:jackknife+`: Performs positive jackknife resampling (duplicates each observation once; ignores `:samples`).
    - Other keywords are passed to `fastmath.random/->seq` for sampling from a distribution (only relevant if a `:model` is used or built).
  - `:rng` (random number generator, optional): An instance of a random number generator (see `fastmath.random/rng`). A default JVM RNG is used if not provided.
  - `:smoothing` (keyword, optional): Applies smoothing to the bootstrap process.
    - `:kde`: Uses Kernel Density Estimation to smooth the empirical distribution before sampling. Requires specifying `:kernel` (default `:gaussian`) and optionally `:bandwidth` (auto-estimated by default).
    - `:gaussian`: Adds random noise drawn from N(0, standard error) to each resampled value.
  - `:distribution` (keyword, default: `:real-discrete-distribution`): The type of discrete distribution to build automatically from the data if no explicit `:model` is provided. Other options include `:integer-discrete-distribution` (for integer data) and `:categorical-distribution` (for any data type).
  - `:dimensions` (keyword, optional): If set to `:multi`, treats the input `:data` as a sequence of sequences (multidimensional data). Models are built or used separately for each dimension, and samples are generated as sequences of vectors.
  - `:antithetic?` (boolean, default: `false`): If `true`, uses antithetic sampling for variance reduction (paired samples are generated as `x` and `1-x` from a uniform distribution, then transformed by the inverse CDF of the model). Requires sampling from a distribution model.
  - `:include?` (boolean, default: `false`): If `true`, the original dataset is included as one of the samples in the output collection.
Model for parametric bootstrap:

The `:model` parameter in the input map can be:

- Any `fastmath.random` distribution object (e.g., `(r/distribution :normal {:mu 0 :sd 1})`).
- Any 0-arity function that returns a random sample when called.

If `:model` is omitted from the input map, a default discrete distribution (`:real-discrete-distribution` by default, see the `:distribution` param) is built from the `:data`. Smoothing options (`:smoothing`) apply to this automatically built model.

When `:dimensions` is `:multi`, `:model` should be a sequence of models, one for each dimension.
Returns:

- If `statistic` is provided: A map containing the original input map augmented with analysis results from `bootstrap-stats` (e.g., `:t0`, `:ts`, `:bias`, `:mean`, `:stddev`).
- If `statistic` is `nil`: A map containing the original input map augmented with the generated bootstrap samples under the `:samples` key. The `:samples` value is a collection of sequences, where each inner sequence is one bootstrap sample. If `:dimensions` is `:multi`, samples are sequences of vectors.
See also jackknife, jackknife+, bootstrap-stats, ci-normal, ci-basic, ci-percentile, ci-bc, ci-bca, ci-studentized, ci-t.
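A sketch, assuming `fastmath.stats.bootstrap` is required as `boot` (and `fastmath.stats` as `stats`):

```clojure
;; nonparametric bootstrap of the median, 1000 resamples
(def boot-med
  (boot/bootstrap (ds/mtcars :mpg) stats/median {:samples 1000}))

;; summary of the bootstrap distribution
(select-keys boot-med [:t0 :bias :mean :stddev])
```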
bootstrap-stats
(bootstrap-stats {:keys [data samples], :as input} statistic)
Calculates summary statistics for bootstrap results.
Takes bootstrap output (typically from bootstrap) and a statistic function, computes the statistic on the original data (`t0`) and on each bootstrap sample (`ts`), and derives various descriptive statistics from the distribution of `ts`.
Parameters:

- `boot-data` (map): A map containing:
  - `:data`: The original dataset.
  - `:samples`: A collection of bootstrap samples (e.g., from bootstrap).
  - (optional) other keys like `:model` from bootstrap generation.
- `statistic` (function): A function that accepts a sequence of data and returns a single numerical statistic (e.g., `fastmath.stats/mean`, `fastmath.stats/median`).
Returns a map which is the input `boot-data` augmented with bootstrap analysis results:

- `:statistic`: The statistic function applied.
- `:t0`: The statistic calculated on the original `:data`.
- `:ts`: A sequence of the statistic calculated on each bootstrap sample in `:samples`.
- `:bias`: The estimated bias of the statistic: `mean(:ts) - :t0`.
- `:mean`, `:median`, `:variance`, `:stddev`, `:sem`: Descriptive statistics (mean, median, variance, standard deviation, standard error of the mean) calculated from the distribution of `:ts`.
This function prepares the results for calculating various bootstrap confidence intervals (e.g., ci-normal, ci-percentile, etc.).
ci-basic
(ci-basic boot-data)
(ci-basic boot-data alpha)
(ci-basic {:keys [t0 ts]} alpha estimation-strategy)
Calculates the Basic (also known as the reverse percentile) bootstrap confidence interval.

This method assumes that the distribution of the bootstrap replicates (`:ts`) around the original statistic (`:t0`) approximates the distribution of the original statistic around the true parameter value.

The interval is constructed by reflecting the quantiles of the bootstrap replicates (`:ts`) about the original statistic (`:t0`). Specifically, the lower bound is `2 * :t0 - q_upper` and the upper bound is `2 * :t0 - q_lower`, where `q_lower` and `q_upper` are the `alpha/2` and `1 - alpha/2` quantiles of `:ts`, respectively.
Parameters:

- `boot-data` (map): A map containing bootstrap results, typically from bootstrap-stats. Requires keys:
  - `:t0` (double): The statistic calculated on the original data.
  - `:ts` (sequence of numbers): The statistic calculated on each bootstrap sample.
- `alpha` (double, optional): The significance level for the interval. Defaults to `0.05` (for a 95% CI). The interval is based on the `alpha/2` and `1 - alpha/2` quantiles of the `:ts` distribution.
- `estimation-strategy` (keyword, optional): Specifies the quantile estimation strategy used to calculate the quantiles of `:ts`. Defaults to `:legacy`. See quantiles for available options (e.g., `:r1` through `:r9`).

Returns a vector `[lower-bound, upper-bound, t0]`:

- `lower-bound` (double): The lower limit of the confidence interval.
- `upper-bound` (double): The upper limit of the confidence interval.
- `t0` (double): The statistic calculated on the original data (from `boot-data`).
See also bootstrap-stats for input preparation and other confidence interval methods: ci-normal, ci-percentile, ci-bc, ci-bca, ci-studentized, ci-t, quantiles.
ci-bc
(ci-bc boot-data)
(ci-bc boot-data alpha)
(ci-bc {:keys [t0 ts]} alpha estimation-strategy)
Calculates the Bias-Corrected (BC) bootstrap confidence interval.
This method adjusts the standard Percentile bootstrap interval (ci-percentile) to account for potential bias in the statistic's distribution. The correction is based on the proportion of bootstrap replicates of the statistic (`:ts`) that are less than the statistic calculated on the original data (`:t0`).
The procedure involves:

1. Calculating a bias correction factor \(z_0\) from the empirical cumulative distribution function (CDF) of the bootstrap replicates evaluated at the original statistic: \(z_0 = \Phi^{-1}(\text{proportion of } t^* < t_0)\), where \(\Phi^{-1}\) is the inverse standard normal CDF.
2. Shifting the standard normal quantiles corresponding to the desired confidence level (\(\alpha/2\) and \(1-\alpha/2\)) by \(z_0\).
3. Finding the corresponding quantiles in the distribution of bootstrap replicates (`:ts`) based on these shifted probabilities.
Parameters:

- `boot-data` (map): A map containing bootstrap results, typically from bootstrap-stats. Requires keys:
  - `:t0` (double): The statistic calculated on the original data.
  - `:ts` (sequence of numbers): The statistic calculated on each bootstrap sample.
- `alpha` (double, optional): The significance level for the interval. Defaults to `0.05` (for a 95% CI). The interval is based on quantiles of the `:ts` distribution, adjusted by the bias correction factor.
- `estimation-strategy` (keyword, optional): Specifies the quantile estimation strategy used to calculate the final interval bounds from `:ts` after applying corrections. Defaults to `:legacy`. See quantiles for available options (e.g., `:r1` through `:r9`).

Returns a vector `[lower-bound, upper-bound, t0]`:

- `lower-bound` (double): The lower limit of the confidence interval.
- `upper-bound` (double): The upper limit of the confidence interval.
- `t0` (double): The statistic calculated on the original data (from `boot-data`).
See also bootstrap-stats for input preparation and other confidence interval methods: ci-normal, ci-basic, ci-percentile, ci-bca, ci-studentized, ci-t, quantiles.
ci-bca
(ci-bca boot-data)
(ci-bca boot-data alpha)
(ci-bca {:keys [t0 ts data statistic]} alpha estimation-strategy)
Calculates the Bias-Corrected and Accelerated (BCa) bootstrap confidence interval.
The BCa interval is a sophisticated method that corrects for both bias and skewness in the distribution of the bootstrap statistic replicates. It is considered a more accurate interval, particularly when the bootstrap distribution is skewed.
The calculation requires two components:

1. A bias correction factor \(z_0\), based on the proportion of bootstrap replicates less than the original statistic \(t_0\).
2. An acceleration factor \(a\), which quantifies the rate of change of the standard error of the statistic with respect to the true parameter value.

The function uses one of two methods to calculate the acceleration factor:

- Jackknife method: If the input `boot-data` map contains the original `:data` and the `:statistic` function used to compute `:t0` and `:ts`, the acceleration factor is estimated using the jackknife method (by computing the statistic on leave-one-out jackknife samples).
- Empirical method: If `:data` or `:statistic` are missing from `boot-data`, the acceleration factor is estimated empirically from the distribution of the bootstrap replicates (`:ts`) using its skewness.
Parameters:

- `boot-data` (map): A map containing bootstrap results, typically from bootstrap-stats. Requires keys:
  - `:t0` (double): The statistic calculated on the original data.
  - `:ts` (sequence of numbers): The statistic calculated on each bootstrap sample.
  May optionally include:
  - `:data` (sequence): The original dataset (required for jackknife acceleration).
  - `:statistic` (function): The function used to calculate the statistic (required for jackknife acceleration).
- `alpha` (double, optional): The significance level for the interval. Defaults to `0.05` (for a 95% CI). The BCa method uses quantiles of the normal distribution and the bootstrap replicates, adjusted by the bias and acceleration factors.
- `estimation-strategy` (keyword, optional): Specifies the quantile estimation strategy used to calculate the quantiles of the bootstrap replicates (`:ts`) for the final interval bounds after applying corrections. Defaults to `:legacy`. See quantiles for available options (e.g., `:r1` through `:r9`).

Returns a vector `[lower-bound, upper-bound, t0]`:

- `lower-bound` (double): The lower limit of the confidence interval.
- `upper-bound` (double): The upper limit of the confidence interval.
- `t0` (double): The statistic calculated on the original data (from `boot-data`).
See also bootstrap-stats for input preparation and other confidence interval methods: ci-normal, ci-basic, ci-percentile, ci-bc, ci-studentized, ci-t, jackknife, quantiles.
ci-normal
(ci-normal boot-data)
(ci-normal {:keys [t0 ts stddev bias]} alpha)
Calculates a Normal (Gaussian) approximation bias-corrected confidence interval.
This method assumes the distribution of the bootstrap replicates of the statistic (`:ts`) is approximately normal. It computes a confidence interval centered around the mean of the bootstrap statistics, adjusted by the estimated bias (`mean(:ts) - :t0`), and uses the standard error of the bootstrap statistics for scaling.
Parameters:

- `boot-data` (map): A map containing bootstrap results, typically produced by bootstrap-stats. Requires keys:
  - `:t0` (double): The statistic calculated on the original data.
  - `:ts` (sequence of numbers): The statistic calculated on each bootstrap sample.
  May optionally include pre-calculated `:stddev` (standard deviation of `:ts`) and `:bias` for efficiency.
- `alpha` (double, optional): The significance level for the interval. Defaults to `0.05` (for a 95% CI). The interval is based on the `alpha/2` and `1 - alpha/2` quantiles of the standard normal distribution.

Returns a vector `[lower-bound, upper-bound, t0]`:

- `lower-bound` (double): The lower limit of the confidence interval.
- `upper-bound` (double): The upper limit of the confidence interval.
- `t0` (double): The statistic calculated on the original data (from `boot-data`).
See also bootstrap-stats for input preparation and other confidence interval methods: ci-basic, ci-percentile, ci-bc, ci-bca, ci-studentized, ci-t.
ci-percentile
(ci-percentile boot-data)
(ci-percentile boot-data alpha)
(ci-percentile {:keys [t0 ts]} alpha estimation-strategy)
Calculates the Percentile bootstrap confidence interval.
This is the simplest bootstrap confidence interval method. It directly uses the quantiles of the bootstrap replicates of the statistic (`:ts`) as the confidence interval bounds.

For a confidence level of `1 - alpha`, the interval is formed by taking the `alpha/2` and `1 - alpha/2` quantiles of the distribution of bootstrap replicates (`:ts`).
Parameters:

- `boot-data` (map): A map containing bootstrap results, typically from bootstrap-stats. Requires keys:
  - `:t0` (double): The statistic calculated on the original data.
  - `:ts` (sequence of numbers): The statistic calculated on each bootstrap sample.
- `alpha` (double, optional): The significance level for the interval. Defaults to `0.05` (for a 95% CI). The interval is based on the `alpha/2` and `1 - alpha/2` quantiles of the `:ts` distribution.
- `estimation-strategy` (keyword, optional): Specifies the quantile estimation strategy used to calculate the quantiles of `:ts`. Defaults to `:legacy`. See quantiles for available options (e.g., `:r1` through `:r9`).

Returns a vector `[lower-bound, upper-bound, t0]`:

- `lower-bound` (double): The `alpha/2` quantile of `:ts`.
- `upper-bound` (double): The `1 - alpha/2` quantile of `:ts`.
- `t0` (double): The statistic calculated on the original data (from `boot-data`).
See also bootstrap-stats for input preparation and other confidence interval methods: ci-normal, ci-basic, ci-bc, ci-bca, ci-studentized, ci-t, quantiles.
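A sketch continuing the hypothetical `boot-med` example from the bootstrap section above:

```clojure
;; 95% percentile confidence interval for the bootstrapped median
(boot/ci-percentile boot-med 0.05)
;; => [lower-bound upper-bound t0]
```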
ci-studentized
(ci-studentized boot-data)
(ci-studentized boot-data alpha)
(ci-studentized {:keys [t0 ts data samples]} alpha estimation-strategy)
Calculates the Studentized (or Bootstrap-t) confidence interval.
This method is based on the distribution of the studentized pivotal quantity `(statistic(sample) - statistic(data)) / standard_error(statistic(sample))`. It estimates the quantiles of this distribution using bootstrap replicates and then uses them to construct a confidence interval around the statistic calculated on the original data (`:t0`), scaled by the standard error of the statistic calculated on the original data (`stddev(:data)`).
Parameters:

- `boot-data` (map): A map containing bootstrap results and necessary inputs. This map typically comes from bootstrap-stats, augmented with `:data` and `:samples` from the original bootstrap call if not already present. Requires keys:
  - `:t0` (double): The statistic calculated on the original data.
  - `:ts` (sequence of numbers): The statistic calculated on each bootstrap sample.
  - `:data` (sequence): The original dataset used for bootstrapping. Needed to estimate the standard error of the statistic for scaling the interval.
  - `:samples` (collection of sequences): The collection of bootstrap samples. Needed to calculate the standard error of the statistic for each bootstrap sample.
- `alpha` (double, optional): The significance level for the interval. Defaults to `0.05` (for a 95% CI). The interval is based on the `alpha/2` and `1 - alpha/2` quantiles of the studentized bootstrap replicates.
- `estimation-strategy` (keyword, optional): Specifies the quantile estimation strategy used to calculate the quantiles of the studentized replicates. Defaults to `:legacy`. See quantiles for available options (e.g., `:r1` through `:r9`).

Returns a vector `[lower-bound, upper-bound, t0]`:

- `lower-bound` (double): The lower limit of the confidence interval.
- `upper-bound` (double): The upper limit of the confidence interval.
- `t0` (double): The statistic calculated on the original data (from `boot-data`).
See also bootstrap-stats for input preparation and other confidence interval methods: ci-normal, ci-basic, ci-percentile, ci-bc, ci-bca, ci-t, stats/stddev, stats/quantiles.
ci-t
(ci-t boot-data)
(ci-t {:keys [t0 ts stddev]} alpha)
Calculates a confidence interval based on Student’s t-distribution, centered at the original statistic value.
This method constructs a confidence interval centered at the statistic calculated on the original data (`:t0`). The width of the interval is determined by the standard deviation of the bootstrap replicates (`:ts`), scaled by a critical value from a Student's t-distribution. The degrees of freedom for the t-distribution are based on the number of bootstrap replicates (`count(:ts) - 1`).

This interval does not explicitly use the Studentized bootstrap pivotal quantity. Instead, it applies a standard t-interval structure using components derived from the bootstrap results and the original data.
Parameters:

- `boot-data` (map): A map containing bootstrap results, typically from bootstrap-stats. Requires keys:
  - `:t0` (double): The statistic calculated on the original data.
  - `:ts` (sequence of numbers): The statistic calculated on each bootstrap sample.
  May optionally include a pre-calculated `:stddev` (standard deviation of `:ts`) for efficiency.
- `alpha` (double, optional): The significance level for the interval. Defaults to `0.05` (for a 95% CI). The interval is based on the `alpha/2` and `1 - alpha/2` quantiles of the Student's t-distribution with `count(:ts) - 1` degrees of freedom.

Returns a vector `[lower-bound, upper-bound, t0]`:

- `lower-bound` (double): The lower limit of the confidence interval.
- `upper-bound` (double): The upper limit of the confidence interval.
- `t0` (double): The statistic calculated on the original data (from `boot-data`).
See also bootstrap-stats for input preparation and other confidence interval methods: ci-normal, ci-basic, ci-percentile, ci-bc, ci-bca, ci-studentized.
jackknife
(jackknife vs)
Generates a set of samples from a given sequence using the jackknife leave-one-out method.
For an input sequence `vs` of size `n`, this method creates `n` samples. Each sample is formed by removing a single observation from the original sequence.
Parameters:

- `vs` (sequence): The input data sequence.
Returns a sequence of sequences. The i-th inner sequence is `vs` with the i-th element removed.
These samples are commonly used for estimating the bias and standard error of a statistic (e.g., via bootstrap-stats).
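A tiny sketch showing the leave-one-out structure (`boot` alias as above):

```clojure
(boot/jackknife [1 2 3])
;; => ((2 3) (1 3) (1 2))
```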
jackknife+
(jackknife+ vs)
Generates a set of samples from a sequence using the ‘jackknife positive’ method.
For an input sequence `vs` of size `n`, this method creates `n` samples. Each sample is formed by duplicating a single observation and adding it back to the original sequence. Thus, each sample has size `n+1`.
Parameters:

- `vs` (sequence): The input data sequence.
Returns a sequence of sequences. The i-th inner sequence is `vs` with an additional copy of the i-th element of `vs`.
This method is used in specific resampling techniques for estimating bias and variance of a statistic.
source: clay/stats.clj