Statistics

(ns stats
  (:require [fastmath.stats :as stats]
            [fastmath.dev.codox :as codox]))

Reference

fastmath.stats

Statistics functions.

  • Descriptive statistics.
  • Correlation / covariance
  • Outliers
  • Confidence intervals
  • Extents
  • Effect size
  • Tests
  • Histogram
  • ACF/PACF
  • Bootstrap (see fastmath.stats.bootstrap)
  • Binary measures

Functions are backed by Apache Commons Math or SMILE libraries. All work with Clojure sequences.

##### Descriptive statistics

The all-in-one function stats-map returns a map containing:

  • :Size - size of the samples, (count ...)
  • :Min - minimum value
  • :Max - maximum value
  • :Range - range of values
  • :Mean - mean/average
  • :Median - median, see also: median-3
  • :Mode - mode, see also: modes
  • :Q1 - first quartile, use: percentile, quantile
  • :Q3 - third quartile, use: percentile, quantile
  • :Total - sum of all samples
  • :SD - sample standard deviation
  • :Variance - variance
  • :MAD - median-absolute-deviation
  • :SEM - standard error of mean
  • :LAV - lower adjacent value, use: adjacent-values
  • :UAV - upper adjacent value, use: adjacent-values
  • :IQR - interquartile range, (- q3 q1)
  • :LOF - lower outer fence, (- q1 (* 3.0 iqr))
  • :UOF - upper outer fence, (+ q3 (* 3.0 iqr))
  • :LIF - lower inner fence, (- q1 (* 1.5 iqr))
  • :UIF - upper inner fence, (+ q3 (* 1.5 iqr))
  • :Outliers - list of outliers, samples which are outside outer fences
  • :Kurtosis - kurtosis
  • :Skewness - skewness

Note: percentile and quantile support 10 different interpolation strategies. See docs.
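The fence entries listed above follow directly from the quartiles. A minimal sketch, assuming q1 and q3 have already been computed; the function name fences is hypothetical and this is not fastmath's implementation:

```clojure
;; Illustrative only: derive IQR and the inner/outer fences from two quartiles.
;; The keys mirror the stats-map keys listed above.
(defn fences
  [q1 q3]
  (let [iqr (- q3 q1)]
    {:IQR iqr
     :LIF (- q1 (* 1.5 iqr))   ; lower inner fence
     :UIF (+ q3 (* 1.5 iqr))   ; upper inner fence
     :LOF (- q1 (* 3.0 iqr))   ; lower outer fence
     :UOF (+ q3 (* 3.0 iqr))})) ; upper outer fence

(fences 2.0 6.0)
;; => {:IQR 4.0, :LIF -4.0, :UIF 12.0, :LOF -10.0, :UOF 18.0}
```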

->confusion-matrix

  • (->confusion-matrix tp fn fp tn)
  • (->confusion-matrix confusion-matrix)
  • (->confusion-matrix actual prediction)
  • (->confusion-matrix actual prediction encode-true)

Convert input to confusion matrix

L0

Count equal values in both seqs. Same as count=

L1

  • (L1 [vs1 vs2-or-val])
  • (L1 vs1 vs2-or-val)

Manhattan distance

L2

  • (L2 [vs1 vs2-or-val])
  • (L2 vs1 vs2-or-val)

Euclidean distance

L2sq

  • (L2sq [vs1 vs2-or-val])
  • (L2sq vs1 vs2-or-val)

Squared euclidean distance

LInf

  • (LInf [vs1 vs2-or-val])
  • (LInf vs1 vs2-or-val)

Chebyshev distance
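For reference, the four distances L1, L2, L2sq and LInf can be sketched in plain Clojure for two sequences of equal length. The function names are hypothetical, and the vs2-or-val form (a single value as second argument) accepted by the library functions is not handled here:

```clojure
;; Illustrative only: element-wise absolute differences of two equal-length seqs.
(defn abs-diffs [vs1 vs2]
  (map (fn [a b] (Math/abs (double (- a b)))) vs1 vs2))

(defn manhattan [vs1 vs2]          ; L1: sum of absolute differences
  (reduce + (abs-diffs vs1 vs2)))

(defn squared-euclidean [vs1 vs2]  ; L2sq: sum of squared differences
  (reduce + (map #(* % %) (abs-diffs vs1 vs2))))

(defn euclidean [vs1 vs2]          ; L2: square root of L2sq
  (Math/sqrt (squared-euclidean vs1 vs2)))

(defn chebyshev [vs1 vs2]          ; LInf: largest absolute difference
  (reduce max (abs-diffs vs1 vs2)))

(manhattan [1 2 3] [4 6 3]) ;; => 7.0
(euclidean [1 2 3] [4 6 3]) ;; => 5.0
(chebyshev [1 2 3] [4 6 3]) ;; => 4.0
```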

acf

  • (acf data)
  • (acf data lags)

Calculate acf (autocorrelation function) for given number of lags or a list of lags.

If lags is omitted, the function returns values for the maximum possible number of lags.

See also acf-ci, pacf, pacf-ci
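As an illustration of what an autocorrelation value is, the sketch below uses the common sample estimator (lagged products of deviations from the mean, divided by the sum of squared deviations). Whether acf uses exactly this normalization is an assumption, and the names autocorrelation and acf-sketch are hypothetical:

```clojure
;; Illustrative only: sample autocorrelation at a single lag k (0 <= k < n).
(defn autocorrelation
  [xs k]
  (let [n     (count xs)
        m     (/ (reduce + xs) (double n))
        ds    (mapv #(- % m) xs)                  ; deviations from the mean
        denom (reduce + (map #(* % %) ds))        ; sum of squared deviations
        numer (reduce + (map * ds (drop k ds)))]  ; lagged products
    (/ numer denom)))

;; Autocorrelation for an explicit list of lags.
(defn acf-sketch [xs lags]
  (mapv #(autocorrelation xs %) lags))

(acf-sketch [1 2 3 4 5] [0 1 2])
```

Lag 0 always yields 1.0 with this estimator, since numerator and denominator coincide.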

acf-ci

  • (acf-ci data)
  • (acf-ci data lags)
  • (acf-ci data lags alpha)

acf with added confidence interval data.

:cis contains a list of the confidence intervals calculated for every lag.

ad-test-one-sample

  • (ad-test-one-sample xs)
  • (ad-test-one-sample xs distribution-or-ys)
  • (ad-test-one-sample xs distribution-or-ys {:keys [sides kernel bandwidth], :or {sides :one-sided-greater, kernel :gaussian}})

Anderson-Darling test

adjacent-values

  • (adjacent-values vs)
  • (adjacent-values vs estimation-strategy)
  • (adjacent-values vs q1 q3 m)

Lower and upper adjacent values (LAV and UAV).

Let Q1 be the 25th percentile and Q3 the 75th percentile. IQR is (- Q3 Q1).

  • LAV is the smallest value greater than or equal to the LIF = (- Q1 (* 1.5 IQR)).
  • UAV is the largest value less than or equal to the UIF = (+ Q3 (* 1.5 IQR)).
  • The third returned value is the median of the samples.

The optional estimation-strategy argument changes the estimation type used for quantile calculations. See estimation-strategies-list.
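The LAV/UAV definition can be sketched as follows, assuming the quartiles are passed in directly. The name adjacent-values-sketch is hypothetical, and the median that the library function also returns is omitted here:

```clojure
;; Illustrative only: lower and upper adjacent values for given quartiles.
(defn adjacent-values-sketch
  [vs q1 q3]
  (let [iqr (- q3 q1)
        lif (- q1 (* 1.5 iqr))   ; lower inner fence
        uif (+ q3 (* 1.5 iqr))]  ; upper inner fence
    [(apply min (filter #(>= % lif) vs))    ; LAV: smallest value >= LIF
     (apply max (filter #(<= % uif) vs))])) ; UAV: largest value <= UIF

(adjacent-values-sketch [1 2 3 4 5 100] 2 5)
;; => [1 5]
```

Here 100 lies above the upper inner fence (9.5), so the upper adjacent value is 5.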

ameasure

  • (ameasure [group1 group2])
  • (ameasure group1 group2)

Vargha-Delaney A measure for two groups, group1 and group2
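A minimal sketch of the Vargha-Delaney A statistic, computed by brute force over all pairs: the probability that a value from the first group exceeds a value from the second, with ties counted as one half. The orientation (first group greater than second) and the name vargha-delaney-a are assumptions of this example:

```clojure
;; Illustrative only: A = (#{x > y} + 0.5 * #{x = y}) / (n1 * n2).
(defn vargha-delaney-a
  [group1 group2]
  (let [pairs   (for [x group1 y group2] [x y])
        greater (count (filter (fn [[x y]] (> x y)) pairs))
        ties    (count (filter (fn [[x y]] (== x y)) pairs))]
    (/ (+ greater (* 0.5 ties))
       (* (count group1) (count group2)))))

(vargha-delaney-a [1 2 3] [2 3 4])
```

A value of 0.5 indicates no stochastic difference between the groups.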

binary-measures

  • (binary-measures tp fn fp tn)
  • (binary-measures confusion-matrix)
  • (binary-measures actual prediction)
  • (binary-measures actual prediction true-value)

Subset of binary measures. See binary-measures-all.

Following keys are returned: [:tp :tn :fp :fn :accuracy :fdr :f-measure :fall-out :precision :recall :sensitivity :specificity :prevalence]

binary-measures-all

  • (binary-measures-all tp fn fp tn)
  • (binary-measures-all confusion-matrix)
  • (binary-measures-all actual prediction)
  • (binary-measures-all actual prediction true-value)

Collection of binary measures.

Arguments:

  • confusion-matrix - either a map or a sequence with [:tp :fn :fp :tn] values

or

  • actual - list of ground truth values
  • prediction - list of predicted values
  • true-value - optional, true/false encoding, what is true in truth and prediction

true-value can be one of:

  • nil - values are treated as booleans
  • any sequence - values from sequence will be treated as true
  • map - conversion will be done according to the provided map (if there is no corresponding key, the value is treated as false)
  • any predicate

https://en.wikipedia.org/wiki/Precision_and_recall
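Several of the returned measures can be sketched from the four confusion-matrix counts using their standard definitions. The argument order follows the (tp fn fp tn) arity above; the parameter names fneg and fpos and the function name are hypothetical, and only a subset of keys is shown:

```clojure
;; Illustrative only: a few standard binary measures from raw counts.
(defn binary-measures-sketch
  [tp fneg fpos tn]
  (let [total     (double (+ tp fneg fpos tn))
        precision (/ tp (double (+ tp fpos)))
        recall    (/ tp (double (+ tp fneg)))]
    {:accuracy    (/ (+ tp tn) total)
     :precision   precision
     :recall      recall                        ; also called sensitivity
     :specificity (/ tn (double (+ tn fpos)))
     :fall-out    (/ fpos (double (+ tn fpos))) ; 1 - specificity
     :f-measure   (/ (* 2.0 precision recall) (+ precision recall))
     :prevalence  (/ (+ tp fneg) total)}))

(binary-measures-sketch 8 2 1 9)
```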

binomial-ci

  • (binomial-ci number-of-successes number-of-trials)
  • (binomial-ci number-of-successes number-of-trials method)
  • (binomial-ci number-of-successes number-of-trials method alpha)

Return confidence interval for a binomial distribution.

Possible methods are:

  • :asymptotic (normal approximation, based on central limit theorem), default
  • :agresti-coull
  • :clopper-pearson
  • :wilson
  • :prop.test - one sample proportion test
  • :cloglog
  • :logit
  • :probit
  • :arcsine
  • :all - apply all methods and return a map of triplets

Default alpha is 0.05

Returns a triple [lower ci, upper ci, p=successes/trials]
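To illustrate the shape of the result, here is a sketch of the normal-approximation interval p ± z·sqrt(p(1-p)/n) for alpha = 0.05. The normal quantile is hardcoded as a constant, the names z-975 and asymptotic-binomial-ci are hypothetical, and it is an assumption that the :asymptotic method corresponds to this textbook formula:

```clojure
;; 0.975 quantile of the standard normal distribution (two-sided alpha = 0.05).
(def z-975 1.959963984540054)

;; Illustrative only: returns [lower upper p], mirroring the documented triple.
(defn asymptotic-binomial-ci
  [successes trials]
  (let [p      (/ successes (double trials))
        margin (* z-975 (Math/sqrt (/ (* p (- 1.0 p)) trials)))]
    [(- p margin) (+ p margin) p]))

(asymptotic-binomial-ci 50 100)
```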

binomial-ci-methods

binomial-test

  • (binomial-test xs)
  • (binomial-test xs maybe-params)
  • (binomial-test number-of-successes number-of-trials {:keys [alpha p ci-method sides], :or {alpha 0.05, p 0.5, ci-method :asymptotic, sides :two-sided}})

Binomial test

  • alpha - significance level (default: 0.05)
  • sides - one of: :two-sided (default), :one-sided-less (short: :one-sided) or :one-sided-greater
  • ci-method - see binomial-ci-methods
  • p - tested probability

bootstrap DEPRECATED

Deprecated: Please use fastmath.stats.bootstrap/bootstrap instead

  • (bootstrap vs)
  • (bootstrap vs samples)
  • (bootstrap vs samples size)

Generate set of samples of given size from provided data.

samples defaults to 200; size defaults to the size of the input data.

bootstrap-ci DEPRECATED

Deprecated: Please use fastmath.stats.bootstrap/ci-basic instead

  • (bootstrap-ci vs)
  • (bootstrap-ci vs alpha)
  • (bootstrap-ci vs alpha samples)
  • (bootstrap-ci vs alpha samples stat-fn)

Bootstrap method to calculate confidence interval.

Alpha defaults to 0.98, samples to 1000. Last parameter is statistical function used to measure, default: mean.

Returns ci and statistical function value.

brown-forsythe-test

  • (brown-forsythe-test xss)
  • (brown-forsythe-test xss params)

chisq-test

  • (chisq-test contingency-table-or-xs)
  • (chisq-test contingency-table-or-xs params)

Chi square test, a power divergence test for lambda 1.0

Power divergence test.

First argument should be one of:

  • contingency table
  • sequence of counts (for goodness of fit)
  • sequence of data (for goodness of fit against distribution)

For goodness of fit there are two options:

  • comparison of observed counts vs expected probabilities or weights (:p)
  • comparison of data against a given distribution (:p); in this case a histogram is created from the data and compared to the distribution PDF within the bin ranges. Use the :bins option to control histogram creation.

Options are:

ci

  • (ci vs)
  • (ci vs alpha)

Student's t-based confidence interval for given data. Alpha value defaults to 0.05.

The last value of the result is the mean.

cliffs-delta

  • (cliffs-delta [group1 group2])
  • (cliffs-delta group1 group2)

Cliff’s delta effect size for ordinal data.

coefficient-matrix

  • (coefficient-matrix vss)
  • (coefficient-matrix vss measure-fn)
  • (coefficient-matrix vss measure-fn symmetric?)

Generate coefficient (correlation, covariance, any two arg function) matrix from seq of seqs. Row order.

Default method: pearson-correlation

cohens-d

  • (cohens-d [group1 group2])
  • (cohens-d group1 group2)
  • (cohens-d group1 group2 method)

Cohen’s d effect size for two groups
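A minimal sketch of Cohen's d as the difference of group means divided by a pooled standard deviation. The unbiased (n-1 weighted) pooling used here is an assumption about the default method, and all function names are hypothetical:

```clojure
;; Illustrative only: helper statistics.
(defn sample-mean [vs]
  (/ (reduce + vs) (double (count vs))))

(defn sample-variance [vs]
  (let [m (sample-mean vs)]
    (/ (reduce + (map (fn [v] (let [d (- v m)] (* d d))) vs))
       (dec (count vs)))))

;; d = (mean1 - mean2) / pooled standard deviation.
(defn cohens-d-sketch
  [group1 group2]
  (let [n1 (count group1)
        n2 (count group2)
        pooled-var (/ (+ (* (dec n1) (sample-variance group1))
                         (* (dec n2) (sample-variance group2)))
                      (+ n1 n2 -2))]
    (/ (- (sample-mean group1) (sample-mean group2))
       (Math/sqrt pooled-var))))

(cohens-d-sketch [2 4 6] [1 3 5])
;; => 0.5
```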

cohens-d-corrected

  • (cohens-d-corrected [group1 group2])
  • (cohens-d-corrected group1 group2)
  • (cohens-d-corrected group1 group2 method)

Cohen’s d corrected for small group size

cohens-f

  • (cohens-f [group1 group2])
  • (cohens-f group1 group2)
  • (cohens-f group1 group2 type)

Cohen’s f, the square root of Cohen’s f2.

Possible type values are: :eta (default), :omega and :epsilon.

cohens-f2

  • (cohens-f2 [group1 group2])
  • (cohens-f2 group1 group2)
  • (cohens-f2 group1 group2 type)

Cohen’s f2, by default based on eta-sq.

Possible type values are: :eta (default), :omega and :epsilon.

cohens-kappa

  • (cohens-kappa group1 group2)
  • (cohens-kappa contingency-table)

Cohen’s kappa

cohens-q

  • (cohens-q r1 r2)
  • (cohens-q group1 group2a group2b)
  • (cohens-q group1a group2a group1b group2b)

Comparison of two correlations.

Arity:

  • 2 - compare two correlation values
  • 3 - compare correlation of group1 and group2a with correlation of group1 and group2b
  • 4 - compare correlation of first two arguments with correlation of last two arguments
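For the two-argument arity, Cohen's q is conventionally the difference between the Fisher z-transformed correlations. A sketch under that definition; the names fisher-z and cohens-q-sketch are hypothetical and the sign convention (first minus second) is an assumption:

```clojure
;; Fisher z-transformation, i.e. atanh(r), written with log because
;; java.lang.Math has no atanh.
(defn fisher-z [r]
  (* 0.5 (Math/log (/ (+ 1.0 r) (- 1.0 r)))))

;; Illustrative only: q = z(r1) - z(r2).
(defn cohens-q-sketch [r1 r2]
  (- (fisher-z r1) (fisher-z r2)))

(cohens-q-sketch 0.5 0.0)
```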

cohens-u2

  • (cohens-u2 [group1 group2])
  • (cohens-u2 group1 group2)
  • (cohens-u2 group1 group2 estimation-strategy)

Cohen’s U2, the proportion of one of the groups that exceeds the same proportion in the other group.

cohens-u3

  • (cohens-u3 [group1 group2])
  • (cohens-u3 group1 group2)
  • (cohens-u3 group1 group2 estimation-strategy)

Cohen’s U3, the proportion of the second group that is smaller than the median of the first group.

cohens-w

  • (cohens-w group1 group2)
  • (cohens-w contingency-table)

Cohen’s W effect size for discrete data.

contingency-2x2-measures

  • (contingency-2x2-measures & args)

contingency-2x2-measures-all

  • (contingency-2x2-measures-all a b c d)
  • (contingency-2x2-measures-all map-or-seq)
  • (contingency-2x2-measures-all [a b] [c d])

contingency-table

  • (contingency-table & seqs)

Returns frequencies map of tuples built from seqs.
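The description above amounts to zipping the sequences into tuples and counting them. A sketch in plain Clojure (the name contingency-table-sketch is hypothetical; the library function may differ in details):

```clojure
;; Illustrative only: count co-occurring tuples across the given seqs.
(defn contingency-table-sketch [& seqs]
  (frequencies (apply map vector seqs)))

(contingency-table-sketch [:a :a :b :a] [:x :y :x :x])
;; => {[:a :x] 2, [:a :y] 1, [:b :x] 1}
```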

contingency-table->marginals

  • (contingency-table->marginals ct)

correlation

  • (correlation [vs1 vs2])
  • (correlation vs1 vs2)

Correlation of two sequences.

correlation-matrix

  • (correlation-matrix vss)
  • (correlation-matrix vss measure)

Generate correlation matrix from seq of seqs. Row order.

Possible measures: :pearson (default), :kendall, :spearman.

count=

  • (count= [vs1 vs2-or-val])
  • (count= vs1 vs2-or-val)

Count equal values in both seqs. Same as L0

covariance

  • (covariance [vs1 vs2])
  • (covariance vs1 vs2)

Covariance of two sequences.

covariance-matrix

  • (covariance-matrix vss)

Generate covariance matrix from seq of seqs. Row order.

cramers-c

  • (cramers-c group1 group2)
  • (cramers-c contingency-table)

Cramer’s C effect size for discrete data.

cramers-v

  • (cramers-v group1 group2)
  • (cramers-v contingency-table)

Cramer’s V effect size for discrete data.

cramers-v-corrected

  • (cramers-v-corrected group1 group2)
  • (cramers-v-corrected contingency-table)

Corrected Cramer’s V

cressie-read-test

  • (cressie-read-test contingency-table-or-xs)
  • (cressie-read-test contingency-table-or-xs params)

Cressie-Read test, a power divergence test for lambda 2/3

Power divergence test.

First argument should be one of:

  • contingency table
  • sequence of counts (for goodness of fit)
  • sequence of data (for goodness of fit against distribution)

For goodness of fit there are two options:

  • comparison of observed counts vs expected probabilities or weights (:p)
  • comparison of data against a given distribution (:p); in this case a histogram is created from the data and compared to the distribution PDF within the bin ranges. Use the :bins option to control histogram creation.

Options are:

demean

  • (demean vs)

Subtract mean from sequence

dissimilarity

  • (dissimilarity method P-observed Q-expected)
  • (dissimilarity method P-observed Q-expected {:keys [bins probabilities? epsilon log-base power remove-zeros?], :or {probabilities? true, epsilon 1.0E-6, log-base m/E, power 2.0}})

Various PDF distances between two histograms (frequencies) or probabilities.

Q can be a distribution object. Then, histogram will be created out of P.

Arguments:

  • method - distance method
  • P-observed - frequencies, probabilities or actual data (when Q is a distribution or :bins is set)
  • Q-expected - frequencies, probabilities or distribution object (when P is data or :bins is set)

Options:

  • :probabilities? - should P/Q be converted to a probabilities, default: true.
  • :epsilon - small number which replaces 0.0 when division or logarithm is used
  • :log-base - base for logarithms, default: e
  • :power - exponent for :minkowski distance, default: 2.0
  • :bins - number of bins or bins estimation method, see histogram.

The list of methods: :euclidean, :city-block, :manhattan, :chebyshev, :minkowski, :sorensen, :gower, :soergel, :kulczynski, :canberra, :lorentzian, :non-intersection, :wave-hedges, :czekanowski, :motyka, :tanimoto, :jaccard, :dice, :bhattacharyya, :hellinger, :matusita, :squared-chord, :euclidean-sq, :squared-euclidean, :pearson-chisq, :chisq, :neyman-chisq, :squared-chisq, :symmetric-chisq, :divergence, :clark, :additive-symmetric-chisq, :kullback-leibler, :jeffreys, :k-divergence, :topsoe, :jensen-shannon, :jensen-difference, :taneja, :kumar-johnson, :avg

See more: Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions by Sung-Hyuk Cha

durbin-watson

  • (durbin-watson rs)

Lag-1 Autocorrelation test for residuals
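The Durbin-Watson statistic is the sum of squared successive differences of the residuals divided by the sum of squared residuals. A sketch of that formula (the name durbin-watson-sketch is hypothetical):

```clojure
;; Illustrative only: d = sum((e_t - e_{t-1})^2) / sum(e_t^2).
(defn durbin-watson-sketch [rs]
  (let [diffs (map - (rest rs) rs)   ; successive differences e_t - e_{t-1}
        sq    #(* % %)]
    (/ (reduce + (map sq diffs))
       (double (reduce + (map sq rs))))))

(durbin-watson-sketch [1 -1 1 -1])
;; => 3.0
```

Values near 2 suggest no lag-1 autocorrelation; values toward 0 or 4 suggest positive or negative autocorrelation respectively.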

epsilon-sq

  • (epsilon-sq [group1 group2])
  • (epsilon-sq group1 group2)

Less biased R2

estimate-bins

  • (estimate-bins vs)
  • (estimate-bins vs bins-or-estimate-method)

Estimate number of bins for histogram.

Possible methods are: :sqrt :sturges :rice :doane :scott :freedman-diaconis (default).

The number returned is not higher than number of samples.
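Three of the listed rules depend only on the number of samples n and can be sketched with their textbook formulas. The exact rounding used by estimate-bins is an assumption, the name bins-sketch is hypothetical, and the data-dependent rules (:doane, :scott, :freedman-diaconis) are not covered:

```clojure
;; Illustrative only: textbook bin-count rules, capped at the sample count n.
(defn bins-sketch
  [method n]
  (let [nd  (double n)
        raw (case method
              :sqrt    (Math/ceil (Math/sqrt nd))                           ; ceil(sqrt n)
              :sturges (inc (Math/ceil (/ (Math/log nd) (Math/log 2.0))))   ; ceil(log2 n) + 1
              :rice    (Math/ceil (* 2.0 (Math/cbrt nd))))]                 ; ceil(2 * n^(1/3))
    (min n (long raw))))

(bins-sketch :sqrt 100)    ;; => 10
(bins-sketch :sturges 100) ;; => 8
(bins-sketch :rice 100)    ;; => 10
```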

estimation-strategies-list

List of estimation strategies for percentile/quantile functions.

eta-sq

  • (eta-sq [group1 group2])
  • (eta-sq group1 group2)

R2, coefficient of determination

extent

  • (extent vs)

Return extent (min, max, mean) values from sequence

f-test

  • (f-test xs ys)
  • (f-test xs ys {:keys [sides alpha], :or {sides :two-sided, alpha 0.05}})

Variance F-test of two samples.

  • alpha - significance level (default: 0.05)
  • sides - one of: :two-sided (default), :one-sided-less (short: :one-sided) or :one-sided-greater

fligner-killeen-test

  • (fligner-killeen-test xss)
  • (fligner-killeen-test xss {:keys [sides], :or {sides :one-sided-greater}})

freeman-tukey-test

  • (freeman-tukey-test contingency-table-or-xs)
  • (freeman-tukey-test contingency-table-or-xs params)

Freeman-Tukey test, a power divergence test for lambda -0.5

Power divergence test.

First argument should be one of:

  • contingency table
  • sequence of counts (for goodness of fit)
  • sequence of data (for goodness of fit against distribution)

For goodness of fit there are two options:

  • comparison of observed counts vs expected probabilities or weights (:p)
  • comparison of data against a given distribution (:p); in this case a histogram is created from the data and compared to the distribution PDF within the bin ranges. Use the :bins option to control histogram creation.

Options are:

geomean

  • (geomean vs)
  • (geomean vs weights)

Geometric mean for positive values only with optional weights

glass-delta

  • (glass-delta [group1 group2])
  • (glass-delta group1 group2)

Glass’s delta effect size for two groups

harmean

  • (harmean vs)
  • (harmean vs weights)

Harmonic mean with optional weights

hedges-g

  • (hedges-g [group1 group2])
  • (hedges-g group1 group2)

Hedges’s g effect size for two groups

hedges-g*

  • (hedges-g* [group1 group2])
  • (hedges-g* group1 group2)

Less biased Hedges’s g effect size for two groups, J term correction.

hedges-g-corrected

  • (hedges-g-corrected [group1 group2])
  • (hedges-g-corrected group1 group2)

Cohen’s d corrected for small group size

histogram

  • (histogram vs)
  • (histogram vs bins-or-estimate-method)
  • (histogram vs bins-or-estimate-method [mn mx])
  • (histogram vs bins-or-estimate-method mn mx)

Calculate histogram.

Estimation method can be a number, a named method (:sqrt, :sturges, :rice, :doane, :scott or :freedman-diaconis (default)), or a sequence of points used as intervals. In the latter case, or when mn and mx values are provided, data will be filtered to fit in the desired interval(s).

Returns map with keys:

  • :size - number of bins
  • :step - average distance between bins
  • :bins - seq of pairs of range lower value and number of elements
  • :min - min value
  • :max - max value
  • :samples - number of used samples
  • :frequencies - a map containing counts for bin’s average
  • :intervals - intervals used to create bins
  • :bins-maps - seq of maps containing:
    • :min - lower bound
    • :max - upper bound
    • :step - actual distance between bins
    • :count - number of elements
    • :avg - average value
    • :probability - probability for bin

If difference between min and max values is 0, number of bins is set to 1.
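The shape of the :bins entry (pairs of lower bound and count) can be illustrated with a simple equal-width binning. This is a sketch for a fixed number of bins only; the name histogram-sketch is hypothetical and the library's interval handling may differ:

```clojure
;; Illustrative only: equal-width bins as [lower-bound count] pairs.
(defn histogram-sketch
  [vs bins]
  (let [mn (apply min vs)
        mx (apply max vs)]
    (if (== mn mx)
      ;; Zero range: a single bin holding all samples, as described above.
      [[mn (count vs)]]
      (let [step   (/ (- mx mn) (double bins))
            ;; The maximum value falls into the last bin rather than a new one.
            index  (fn [v] (min (dec bins) (long (/ (- v mn) step))))
            counts (frequencies (map index vs))]
        (mapv (fn [i] [(+ mn (* i step)) (get counts i 0)])
              (range bins))))))

(histogram-sketch (range 11) 5)
;; => [[0.0 2] [2.0 2] [4.0 2] [6.0 2] [8.0 3]]
```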

hpdi-extent

  • (hpdi-extent vs)
  • (hpdi-extent vs size)

Highest Posterior Density interval + median.

size parameter is the target probability content of the interval.

inner-fence-extent

  • (inner-fence-extent vs)
  • (inner-fence-extent vs estimation-strategy)

Returns LIF, UIF and median

iqr

  • (iqr vs)
  • (iqr vs estimation-strategy)

Interquartile range.

jarque-bera-test

  • (jarque-bera-test xs)
  • (jarque-bera-test xs params)
  • (jarque-bera-test xs skew kurt {:keys [sides], :or {sides :one-sided-greater}})

Goodness of fit test whether skewness and kurtosis of data match normal distribution

jensen-shannon-divergence DEPRECATED

Deprecated: Use dissimilarity.

  • (jensen-shannon-divergence [vs1 vs2])
  • (jensen-shannon-divergence vs1 vs2)

Jensen-Shannon divergence of two sequences.

kendall-correlation

  • (kendall-correlation [vs1 vs2])
  • (kendall-correlation vs1 vs2)

Kendall’s correlation of two sequences.

kruskal-test

  • (kruskal-test xss)
  • (kruskal-test xss {:keys [sides], :or {sides :right}})

Kruskal-Wallis rank sum test.

ks-test-one-sample

  • (ks-test-one-sample xs)
  • (ks-test-one-sample xs distribution-or-ys)
  • (ks-test-one-sample xs distribution-or-ys {:keys [sides kernel bandwidth distinct?], :or {sides :two-sided, kernel :gaussian, distinct? true}})

One sample Kolmogorov-Smirnov test

ks-test-two-samples

  • (ks-test-two-samples xs ys)
  • (ks-test-two-samples xs ys {:keys [sides distinct?], :or {sides :two-sided, distinct? true}})

Two samples Kolmogorov-Smirnov test

kullback-leibler-divergence DEPRECATED

Deprecated: Use dissimilarity.

  • (kullback-leibler-divergence [vs1 vs2])
  • (kullback-leibler-divergence vs1 vs2)

Kullback-Leibler divergence of two sequences.

kurtosis

  • (kurtosis vs)
  • (kurtosis vs typ)

Calculate kurtosis from sequence.

Possible values of typ: :G2 (default), :g2 (or :excess), :geary, :crow, :moors, :hogg or :kurt.

kurtosis-test

  • (kurtosis-test xs)
  • (kurtosis-test xs params)
  • (kurtosis-test xs kurt {:keys [sides type], :or {sides :two-sided, type :kurt}})

Normality test for kurtosis

levene-test

  • (levene-test xss)
  • (levene-test xss {:keys [sides statistic scorediff], :or {sides :one-sided-greater, statistic mean, scorediff abs}})

mad

Alias for median-absolute-deviation

mad-extent

  • (mad-extent vs)

-/+ median-absolute-deviation and median

mae

  • (mae [vs1 vs2-or-val])
  • (mae vs1 vs2-or-val)

Mean absolute error

mape

  • (mape [vs1 vs2-or-val])
  • (mape vs1 vs2-or-val)

Mean absolute percentage error

maximum

  • (maximum vs)

Maximum value from sequence.

mcc

  • (mcc group1 group2)
  • (mcc ct)

Matthews correlation coefficient also known as phi coefficient.

me

  • (me [vs1 vs2-or-val])
  • (me vs1 vs2-or-val)

Mean error

mean

  • (mean vs)
  • (mean vs weights)

Calculate mean of vs with optional weights.

mean-absolute-deviation

  • (mean-absolute-deviation vs)
  • (mean-absolute-deviation vs center)

Calculate mean absolute deviation

means-ratio

  • (means-ratio [group1 group2])
  • (means-ratio group1 group2)
  • (means-ratio group1 group2 adjusted?)

Means ratio

means-ratio-corrected

  • (means-ratio-corrected [group1 group2])
  • (means-ratio-corrected group1 group2)

Bias-corrected means ratio

median

  • (median vs estimation-strategy)
  • (median vs)

Calculate median of vs. See median-3.

median-3

  • (median-3 a b c)

Median of three values. See median.

median-absolute-deviation

  • (median-absolute-deviation vs)
  • (median-absolute-deviation vs center)
  • (median-absolute-deviation vs center estimation-strategy)

Calculate MAD

minimum

  • (minimum vs)

Minimum value from sequence.

minimum-discrimination-information-test

  • (minimum-discrimination-information-test contingency-table-or-xs)
  • (minimum-discrimination-information-test contingency-table-or-xs params)

Minimum discrimination information test, a power divergence test for lambda -1.0

Power divergence test.

First argument should be one of:

  • contingency table
  • sequence of counts (for goodness of fit)
  • sequence of data (for goodness of fit against distribution)

For goodness of fit there are two options:

  • comparison of observed counts vs expected probabilities or weights (:p)
  • comparison of data against a given distribution (:p); in this case a histogram is created from the data and compared to the distribution PDF within the bin ranges. Use the :bins option to control histogram creation.

Options are:

mode

  • (mode vs method)
  • (mode vs method opts)
  • (mode vs)

Find the value that appears most often in a dataset vs.

For a sample from a continuous distribution, three algorithms are available in addition to the default:

  • :histogram - calculated from histogram
  • :kde - calculated from KDE
  • :pearson - mode = mean-3(mean-median)
  • :default - discrete mode

Histogram accepts optional :bins (see histogram). KDE method accepts :kde for kernel name (default :gaussian) and :bandwidth (auto). Pearson can accept :estimation-strategy for median.

See also modes.

modes

  • (modes vs method)
  • (modes vs method opts)
  • (modes vs)

Find the values that appear most often in a dataset vs.

Returns a sequence with all the most frequent values in increasing order.

See also mode.

modified-power-transformation

  • (modified-power-transformation xs)
  • (modified-power-transformation xs lambda)
  • (modified-power-transformation xs lambda alpha)

Modified power transformation (Box-Cox transformation) of data.

There is no scaling by geometric mean.

Arguments:

  • lambda - power parameter (default: 0.0)
  • alpha - shift parameter (optional)

moment

  • (moment vs)
  • (moment vs order)
  • (moment vs order {:keys [absolute? center mean? normalize?], :or {mean? true}})

Calculate moment (central or/and absolute) of given order (default: 2).

Additional parameters as a map:

  • :absolute? - calculate sum as absolute values (default: false)
  • :mean? - returns mean (proper moment) or just sum of differences (default: true)
  • :center - value of center (default: nil = mean)
  • :normalize? - apply normalization by standard deviation to the order power
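A sketch of how the :absolute?, :center and :mean? options could interact, based only on the descriptions above. The :normalize? option is omitted, the name moment-sketch is hypothetical, and reading :absolute? as "take absolute deviations before raising to the power" is an assumption:

```clojure
;; Illustrative only: moment of given order around a center (default: mean).
(defn moment-sketch
  ([vs] (moment-sketch vs 2))
  ([vs order] (moment-sketch vs order {}))
  ([vs order {:keys [absolute? center mean?] :or {mean? true}}]
   (let [c     (or center (/ (reduce + vs) (double (count vs))))
         terms (map (fn [v]
                      (let [d (double (- v c))]
                        (Math/pow (if absolute? (Math/abs d) d) (double order))))
                    vs)
         total (reduce + terms)]
     ;; :mean? true divides by the sample count; false returns the raw sum.
     (if mean? (/ total (count vs)) total))))

(moment-sketch [1 2 3 4 5])                    ;; => 2.0
(moment-sketch [1 2 3 4 5] 2 {:mean? false})   ;; => 10.0
```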

mse

  • (mse [vs1 vs2-or-val])
  • (mse vs1 vs2-or-val)

Mean squared error

multinomial-likelihood-ratio-test

  • (multinomial-likelihood-ratio-test contingency-table-or-xs)
  • (multinomial-likelihood-ratio-test contingency-table-or-xs params)

Multinomial likelihood ratio test, a power divergence test for lambda 0.0

Power divergence test.

First argument should be one of:

  • contingency table
  • sequence of counts (for goodness of fit)
  • sequence of data (for goodness of fit against distribution)

For goodness of fit there are two options:

  • comparison of observed counts vs expected probabilities or weights (:p)
  • comparison of data against a given distribution (:p); in this case a histogram is created from the data and compared to the distribution PDF within the bin ranges. Use the :bins option to control histogram creation.

Options are:

neyman-modified-chisq-test

  • (neyman-modified-chisq-test contingency-table-or-xs)
  • (neyman-modified-chisq-test contingency-table-or-xs params)

Neyman modified chi square test, a power divergence test for lambda -2.0

Power divergence test.

First argument should be one of:

  • contingency table
  • sequence of counts (for goodness of fit)
  • sequence of data (for goodness of fit against distribution)

For goodness of fit there are two options:

  • comparison of observed counts vs expected probabilities or weights (:p)
  • comparison of data against a given distribution (:p); in this case a histogram is created from the data and compared to the distribution PDF within the bin ranges. Use the :bins option to control histogram creation.

Options are:

normality-test

  • (normality-test xs)
  • (normality-test xs params)
  • (normality-test xs skew kurt {:keys [sides], :or {sides :one-sided-greater}})

Normality test based on skewness and kurtosis

omega-sq

  • (omega-sq [group1 group2])
  • (omega-sq group1 group2)
  • (omega-sq group1 group2 degrees-of-freedom)

Adjusted R2

one-way-anova-test

  • (one-way-anova-test xss)
  • (one-way-anova-test xss {:keys [sides], :or {sides :one-sided-greater}})

outer-fence-extent

  • (outer-fence-extent vs)
  • (outer-fence-extent vs estimation-strategy)

Returns LOF, UOF and median

outliers

  • (outliers vs)
  • (outliers vs estimation-strategy)
  • (outliers vs q1 q3)

Find outliers defined as values outside inner fences.

Let Q1 be the 25th percentile and Q3 the 75th percentile. IQR is (- Q3 Q1).

  • LIF (Lower Inner Fence) equals (- Q1 (* 1.5 IQR)).
  • UIF (Upper Inner Fence) equals (+ Q3 (* 1.5 IQR)).

Returns a sequence of outliers.

The optional estimation-strategy argument changes the estimation type used for quantile calculations. See estimation-strategies-list.

p-overlap

  • (p-overlap [group1 group2])
  • (p-overlap group1 group2)
  • (p-overlap group1 group2 {:keys [kde bandwidth min-iterations steps], :or {kde :gaussian, min-iterations 3, steps 500}})

Overlapping index, kernel density approximation

p-value

  • (p-value stat)
  • (p-value distribution stat)
  • (p-value distribution stat sides)

Calculate p-value for given distribution (default: N(0,1)), stat and sides (one of :two-sided, :one-sided-greater or :one-sided-less/:one-sided).

pacf

  • (pacf data)
  • (pacf data lags)

Calculate pacf (partial autocorrelation function) for given number of lags.

If lags is omitted, the function returns values for the maximum possible number of lags.

pacf returns also lag 0 (which is 0.0).

See also acf, acf-ci, pacf-ci

pacf-ci

  • (pacf-ci data)
  • (pacf-ci data lags)
  • (pacf-ci data lags alpha)

pacf with added confidence interval data.

pearson-correlation

  • (pearson-correlation [vs1 vs2])
  • (pearson-correlation vs1 vs2)

Pearson’s correlation of two sequences.

pearson-r

  • (pearson-r [group1 group2])
  • (pearson-r group1 group2)

Pearson r correlation coefficient

percentile

  • (percentile vs p)
  • (percentile vs p estimation-strategy)

Calculate percentile of a vs.

Percentile p is from range 0-100.

See docs.

Optionally, provide estimation-strategy to change the interpolation method used for selecting values. Default is :legacy. See estimation-strategies-list.

See also quantile.
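To show what an interpolation strategy does, the sketch below implements one widely used rule, linear interpolation between order statistics (known as type 7 in R). This is not the :legacy default, the name percentile-r7 is hypothetical, and that the library's :r7 strategy matches this rule exactly is an assumption:

```clojure
;; Illustrative only: percentile with linear interpolation (R type 7).
(defn percentile-r7
  [vs p]
  (let [sorted (vec (sort vs))
        n      (count sorted)
        h      (* (dec n) (/ p 100.0))     ; fractional index into sorted data
        lo     (long (Math/floor h))
        hi     (min (dec n) (inc lo))
        frac   (- h lo)]
    (+ (sorted lo) (* frac (- (sorted hi) (sorted lo))))))

(percentile-r7 [1 2 3 4 5] 50) ;; => 3.0
(percentile-r7 [4 1 3 2] 50)   ;; => 2.5
```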

percentile-bc-extent

  • (percentile-bc-extent vs)
  • (percentile-bc-extent vs p)
  • (percentile-bc-extent vs p1 p2)
  • (percentile-bc-extent vs p1 p2 estimation-strategy)

Return bias corrected percentile range and mean for bootstrap samples. See https://projecteuclid.org/euclid.ss/1032280214

p - calculates extent of bias corrected p and 100-p (default: p=2.5)

Set estimation-strategy to :r7 to get the same result as in R coxed::bca.

percentile-bca-extent

  • (percentile-bca-extent vs)
  • (percentile-bca-extent vs p)
  • (percentile-bca-extent vs p1 p2)
  • (percentile-bca-extent vs p1 p2 estimation-strategy)
  • (percentile-bca-extent vs p1 p2 accel estimation-strategy)

Return bias corrected percentile range and mean for bootstrap samples. Also accounts for variance variations through the acceleration parameter. See https://projecteuclid.org/euclid.ss/1032280214

p - calculates extent of bias corrected p and 100-p (default: p=2.5)

Set estimation-strategy to :r7 to get the same result as in R coxed::bca.

percentile-extent

  • (percentile-extent vs)
  • (percentile-extent vs p)
  • (percentile-extent vs p1 p2)
  • (percentile-extent vs p1 p2 estimation-strategy)

Return percentile range and median.

p - calculates extent of p and 100-p (default: p=25)

percentiles

  • (percentiles vs)
  • (percentiles vs ps)
  • (percentiles vs ps estimation-strategy)

Calculate percentiles of a vs.

Percentiles are sequence of values from range 0-100.

See docs.

Optionally, provide estimation-strategy to change the interpolation method used for selecting values. Default is :legacy. See estimation-strategies-list.

See also quantiles.

pi

  • (pi vs)
  • (pi vs size)
  • (pi vs size estimation-strategy)

Returns PI as a map, quantile intervals based on interval size.

Quantiles are (1-size)/2 and 1-(1-size)/2

pi-extent

  • (pi-extent vs)
  • (pi-extent vs size)
  • (pi-extent vs size estimation-strategy)

Returns PI extent, quantile intervals based on interval size + median.

Quantiles are (1-size)/2 and 1-(1-size)/2

pooled-stddev

  • (pooled-stddev groups)
  • (pooled-stddev groups method)

Calculate pooled standard deviation for samples and method

pooled-variance

  • (pooled-variance groups)
  • (pooled-variance groups method)

Calculate pooled variance for samples and method.

Methods:

  • :unbiased - sqrt of weighted average of variances (default)
  • :biased - biased version of :unbiased
  • :avg - sqrt of average of variances

population-stddev

  • (population-stddev vs)
  • (population-stddev vs mu)

Calculate population standard deviation of vs.

See stddev.

population-variance

  • (population-variance vs)
  • (population-variance vs mu)

Calculate population variance of vs.

See variance.

population-wstddev

  • (population-wstddev vs weights)

Calculate population weighted standard deviation of vs

population-wvariance

  • (population-wvariance vs freqs)

Calculate population weighted variance of vs.

power-divergence-test

  • (power-divergence-test contingency-table-or-xs)
  • (power-divergence-test contingency-table-or-xs {:keys [lambda ci-sides sides p alpha bootstrap-samples ddof bins], :or {lambda m/TWO_THIRD, sides :one-sided-greater, ci-sides :two-sided, alpha 0.05, bootstrap-samples 1000, ddof 0}})

Power divergence test.

First argument should be one of:

  • contingency table
  • sequence of counts (for goodness of fit)
  • sequence of data (for goodness of fit against distribution)

For goodness of fit there are two options:

  • comparison of observed counts vs expected probabilities or weights (:p)
  • comparison of data against a given distribution (:p); in this case a histogram is created from the data and compared to the distribution PDF within the bin ranges. Use the :bins option to control histogram creation.

Options are:

power-transformation

  • (power-transformation xs)
  • (power-transformation xs lambda)
  • (power-transformation xs lambda alpha)

Power transformation of data.

All values should be positive.

Arguments:

  • lambda - power parameter (default: 0.0)
  • alpha - shift parameter (optional)

powmean

  • (powmean vs power)
  • (powmean vs weights power)

Generalized power mean
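An unweighted sketch of the generalized power mean, (mean of x^p)^(1/p), with the geometric mean as the limiting case for p = 0. The name power-mean-sketch is hypothetical, the weighted arity is not covered, and values are assumed to be positive:

```clojure
;; Illustrative only: generalized (power) mean of positive values.
(defn power-mean-sketch
  [vs power]
  (let [n (double (count vs))
        p (double power)]
    (if (zero? p)
      ;; Limit p -> 0: geometric mean, computed via logarithms.
      (Math/exp (/ (reduce + (map #(Math/log (double %)) vs)) n))
      (Math/pow (/ (reduce + (map #(Math/pow (double %) p) vs)) n)
                (/ 1.0 p)))))

(power-mean-sketch [1 2 4] 1)   ; arithmetic mean
(power-mean-sketch [1 2 4] 0)   ; geometric mean
(power-mean-sketch [1 2 4] -1)  ; harmonic mean
```

With this definition, power 1 gives the arithmetic mean, -1 the harmonic mean and 2 the quadratic mean.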

psnr

  • (psnr [vs1 vs2-or-val])
  • (psnr vs1 vs2-or-val)
  • (psnr vs1 vs2-or-val max-value)

Peak signal-to-noise ratio. max-value is the maximum possible value (default: maximum of vs1 and vs2)

quantile

  • (quantile vs q)
  • (quantile vs q estimation-strategy)

Calculate quantile of a vs.

Quantile q is from range 0.0-1.0.

See docs for interpolation strategy.

Optionally, provide estimation-strategy to change the interpolation method used for selecting values. Default is :legacy. See estimation-strategies-list.

See also percentile.

quantile-extent

  • (quantile-extent vs)
  • (quantile-extent vs q)
  • (quantile-extent vs q1 q2)
  • (quantile-extent vs q1 q2 estimation-strategy)

Return quantile range and median.

q - the extent spans the quantiles q and 1.0-q (default: q=0.25)

quantiles

  • (quantiles vs)
  • (quantiles vs qs)
  • (quantiles vs qs estimation-strategy)

Calculate quantiles of vs.

Quantiles qs is a sequence of values from the range 0.0-1.0.

See docs for interpolation strategy.

Optionally, provide estimation-strategy to change the interpolation method used for selecting values. Default is :legacy. See more here

See also percentiles.

r2

  • (r2 [vs1 vs2-or-val])
  • (r2 vs1 vs2-or-val)
  • (r2 vs1 vs2-or-val no-of-variables)

R2

r2-determination

  • (r2-determination [group1 group2])
  • (r2-determination group1 group2)

Coefficient of determination

rank-epsilon-sq

  • (rank-epsilon-sq xs)

Effect size for Kruskal-Wallis test

rank-eta-sq

  • (rank-eta-sq xs)

Effect size for Kruskal-Wallis test

remove-outliers

  • (remove-outliers vs)
  • (remove-outliers vs estimation-strategy)
  • (remove-outliers vs q1 q3)

Remove outliers defined as values outside inner fences.

Let Q1 be the 25th percentile and Q3 the 75th percentile. IQR is (- Q3 Q1).

  • LIF (Lower Inner Fence) equals (- Q1 (* 1.5 IQR)).
  • UIF (Upper Inner Fence) equals (+ Q3 (* 1.5 IQR)).

Returns a sequence without outliers.

The optional estimation-strategy argument changes the estimation type used for quantile calculations. See estimation-strategies.
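The fence formulas can be sketched in plain Clojure as follows, mirroring the (remove-outliers vs q1 q3) arity where the quartiles are supplied by the caller. The name remove-outliers-sketch is hypothetical, and keeping values that lie exactly on a fence is an assumption of this sketch.

```clojure
;; Hypothetical helper, not part of fastmath.stats.
(defn remove-outliers-sketch
  "Keep only the values of vs that lie within the inner fences."
  [vs q1 q3]
  (let [iqr (- q3 q1)
        lif (- q1 (* 1.5 iqr))   ; lower inner fence
        uif (+ q3 (* 1.5 iqr))]  ; upper inner fence
    (filter #(<= lif % uif) vs)))
```

For example, with q1 = 2 and q3 = 4 the fences are -1.0 and 7.0, so a value of 100 is dropped while 1 to 4 are kept.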

rescale

  • (rescale vs)
  • (rescale vs low high)

Linearly rescale data to the desired range, [0,1] by default.
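A minimal sketch of linear rescaling in plain Clojure: the minimum maps to low, the maximum maps to high, and everything in between is interpolated linearly. The name rescale-sketch is hypothetical and this is not the library's code.

```clojure
;; Hypothetical helper, not part of fastmath.stats.
(defn rescale-sketch
  "Map vs linearly so that its minimum becomes low and its maximum becomes high."
  ([vs] (rescale-sketch vs 0.0 1.0))
  ([vs low high]
   (let [mn    (apply min vs)
         mx    (apply max vs)
         ;; target width divided by source width
         scale (/ (- high low) (double (- mx mn)))]
     (map #(+ low (* scale (- % mn))) vs))))
```

Constant input has zero width, so this sketch produces NaN values in that case.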

rmse

  • (rmse [vs1 vs2-or-val])
  • (rmse vs1 vs2-or-val)

Root mean squared error
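As an illustration, root mean squared error is the square root of the mean of squared differences. The sketch below is plain Clojure with a hypothetical name, rmse-sketch. Treating a non-sequential second argument as a constant compared against every element is an assumption based on the vs2-or-val parameter name.

```clojure
;; Hypothetical helper, not part of fastmath.stats.
(defn rmse-sketch
  "Root mean squared error between vs1 and a sequence or a single value."
  [vs1 vs2-or-val]
  (let [;; a single value is compared against every element of vs1
        vs2       (if (sequential? vs2-or-val) vs2-or-val (repeat vs2-or-val))
        sq-errors (map (fn [a b] (let [d (- a b)] (* d d))) vs1 vs2)]
    (Math/sqrt (/ (reduce + sq-errors) (double (count vs1))))))
```

Dropping the division and the square root from the last line leaves the residual sum of squares.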

robust-standardize

  • (robust-standardize vs)
  • (robust-standardize vs q)

Normalize samples to have median = 0 and MAD = 1.

If the q argument is given, scaling uses the quantile difference Q_(1-q) - Q_q instead. Set q to 0.25 to scale by IQR.
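The default behaviour (no q argument) can be sketched in plain Clojure as subtracting the median and dividing by the median absolute deviation. Both helper names are hypothetical, the median here uses the simple midpoint rule, and the library may use a different quantile estimation strategy.

```clojure
;; Hypothetical helpers, not part of fastmath.stats.
(defn median-sketch
  "Median by sorting; averages the two middle values for even counts."
  [vs]
  (let [s (vec (sort vs))
        n (count s)
        h (quot n 2)]
    (if (odd? n)
      (double (s h))
      (/ (+ (s (dec h)) (s h)) 2.0))))

(defn robust-standardize-sketch
  "Shift vs to median 0 and scale it to median absolute deviation 1."
  [vs]
  (let [med (median-sketch vs)
        ;; MAD: median of absolute deviations from the median
        mad (median-sketch (map #(Math/abs (double (- % med))) vs))]
    (map #(/ (- % med) mad) vs)))
```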

rows->contingency-table

  • (rows->contingency-table xss)

rss

  • (rss [vs1 vs2-or-val])
  • (rss vs1 vs2-or-val)

Residual sum of squares

second-moment DEPRECATED

Deprecated: Use moment function

sem

  • (sem vs)

Standard error of mean

sem-extent

  • (sem-extent vs)

Extent of mean -/+ sem, plus the mean.

similarity

  • (similarity method P-observed Q-expected)
  • (similarity method P-observed Q-expected {:keys [bins probabilities? epsilon], :or {probabilities? true, epsilon 1.0E-6}})

Various PDF similarities between two histograms (frequencies) or probabilities.

Q can be a distribution object, in which case a histogram is created from P.

Arguments:

  • method - distance method
  • P-observed - frequencies, probabilities or actual data (when Q is a distribution)
  • Q-expected - frequencies, probabilities or distribution object (when P is data)

Options:

  • :probabilities? - whether P/Q should be converted to probabilities, default: true.
  • :epsilon - small number which replaces 0.0 when division or logarithm is used
  • :bins - number of bins or bins estimation method, see histogram.

The list of methods: :intersection, :czekanowski, :motyka, :kulczynski, :ruzicka, :inner-product, :harmonic-mean, :cosine, :jaccard, :dice, :fidelity, :squared-chord

See more: Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions by Sung-Hyuk Cha
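To make one of the listed methods concrete, the intersection similarity is defined in the cited survey as the sum of element-wise minima of the two probability vectors. The sketch below is plain Clojure with hypothetical helper names. It assumes both inputs are frequency sequences of equal length and normalizes them first, loosely mirroring :probabilities? true. It is not the library's code.

```clojure
;; Hypothetical helpers, not part of fastmath.stats.
(defn normalize-sketch
  "Convert frequencies to probabilities that sum to 1."
  [freqs]
  (let [total (double (reduce + freqs))]
    (map #(/ % total) freqs)))

(defn intersection-similarity-sketch
  "Sum of element-wise minima of the normalized P and Q."
  [p q]
  (reduce + (map min (normalize-sketch p) (normalize-sketch q))))
```

Identical histograms score 1.0, and histograms with no overlapping mass score 0.0.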

skewness

  • (skewness vs)
  • (skewness vs typ)

Calculate skewness from sequence.

Possible types: :G1 (default), :g1 (:pearson), :b1, :B1 (:yule), :B3, :skew, :mode, :bowley, :hogg or :median.

skewness-test

  • (skewness-test xs)
  • (skewness-test xs params)
  • (skewness-test xs skew {:keys [sides type], :or {sides :two-sided, type :g1}})

Normality test for skewness.

span

  • (span vs)

Width of the sample, maximum value minus minimum value

spearman-correlation

  • (spearman-correlation [vs1 vs2])
  • (spearman-correlation vs1 vs2)

Spearman’s correlation of two sequences.

standardize

  • (standardize vs)

Normalize samples to have mean = 0 and stddev = 1.
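A plain Clojure sketch of the idea: subtract the mean, then divide by the standard deviation. The name standardize-sketch is hypothetical, and using the sample (n - 1) standard deviation is an assumption of this sketch.

```clojure
;; Hypothetical helper, not part of fastmath.stats.
(defn standardize-sketch
  "Shift vs to mean 0 and scale it to sample standard deviation 1."
  [vs]
  (let [n    (count vs)
        mean (/ (reduce + vs) (double n))
        ;; sample standard deviation with n - 1 in the denominator
        sd   (Math/sqrt (/ (reduce + (map #(let [d (- % mean)] (* d d)) vs))
                           (dec n)))]
    (map #(/ (- % mean) sd) vs)))
```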

stats-map

  • (stats-map vs)
  • (stats-map vs estimation-strategy)

Calculate several statistics of vs and return as map.

The optional estimation-strategy argument changes the estimation type used for quantile calculations. See estimation-strategies.

stddev

  • (stddev vs)
  • (stddev vs mu)

Calculate standard deviation of vs.

See population-stddev.

stddev-extent

  • (stddev-extent vs)

Extent of mean -/+ stddev, plus the mean.

sum

  • (sum vs)

Sum of all vs values.

t-test-one-sample

  • (t-test-one-sample xs)
  • (t-test-one-sample xs m)

One-sample Student’s t-test

  • alpha - significance level (default: 0.05)
  • sides - one of: :two-sided, :one-sided-less (short: :one-sided) or :one-sided-greater
  • mu - mean (default: 0.0)
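For reference, the test statistic of a one-sample t-test is t = (mean - mu) / (sd / sqrt(n)). The sketch below computes only that statistic in plain Clojure, with a hypothetical name, t-statistic-sketch. It does not compute the p-value or confidence interval and is not the library's implementation.

```clojure
;; Hypothetical helper, not part of fastmath.stats.
(defn t-statistic-sketch
  "One-sample t statistic of xs against the hypothesized mean mu."
  ([xs] (t-statistic-sketch xs 0.0))
  ([xs mu]
   (let [n    (count xs)
         mean (/ (reduce + xs) (double n))
         ;; sample standard deviation with n - 1 in the denominator
         sd   (Math/sqrt (/ (reduce + (map #(let [d (- % mean)] (* d d)) xs))
                            (dec n)))
         ;; standard error of the mean
         sem  (/ sd (Math/sqrt (double n)))]
     (/ (- mean mu) sem))))
```

The statistic is compared against a Student's t distribution with n - 1 degrees of freedom.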

t-test-two-samples

  • (t-test-two-samples xs ys)
  • (t-test-two-samples xs ys {:keys [paired? equal-variances?], :or {paired? false, equal-variances? false}, :as params})

Two-sample Student’s t-test

  • alpha - significance level (default: 0.05)
  • sides - one of: :two-sided (default), :one-sided-less (short: :one-sided) or :one-sided-greater
  • mu - mean (default: 0.0)
  • paired? - unpaired or paired test, boolean (default: false)
  • equal-variances? - unequal or equal variances, boolean (default: false)

trim

  • (trim vs)
  • (trim vs quantile)
  • (trim vs quantile estimation-strategy)
  • (trim vs low high nan)

Return trimmed data. Trimming uses quantiles; the default quantile is 0.2.

trim-lower

  • (trim-lower vs)
  • (trim-lower vs quantile)
  • (trim-lower vs quantile estimation-strategy)

Trim data below the given quantile, default: 0.2.

trim-upper

  • (trim-upper vs)
  • (trim-upper vs quantile)
  • (trim-upper vs quantile estimation-strategy)

Trim data above the given quantile, default: 0.2.

tschuprows-t

  • (tschuprows-t group1 group2)
  • (tschuprows-t contingency-table)

Tschuprow’s T effect size for discrete data

ttest-one-sample DEPRECATED

Deprecated: Use t-test-one-sample

ttest-two-samples DEPRECATED

Deprecated: Use t-test-two-samples

variance

  • (variance vs)
  • (variance vs mu)

Calculate variance of vs.

See population-variance.

variation

  • (variation vs)

Coefficient of variation CV = stddev / mean

weighted-kappa

  • (weighted-kappa contingency-table)
  • (weighted-kappa contingency-table weights)

Cohen’s weighted kappa for indexed contingency table

winsor

  • (winsor vs)
  • (winsor vs quantile)
  • (winsor vs quantile estimation-strategy)
  • (winsor vs low high nan)

Return winsorized data. Winsorizing uses quantiles; the default quantile is 0.2.
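To contrast winsorizing with trimming, the sketch below takes explicit low and high bounds: trimming drops values outside the bounds, while winsorizing clamps them to the nearest bound. Both helper names are hypothetical, the bounds are assumed to be precomputed quantile values, and the nan argument of the four-argument arities is not modelled.

```clojure
;; Hypothetical helpers, not part of fastmath.stats.
(defn trim-sketch
  "Drop values of vs that fall outside [low, high]."
  [vs low high]
  (filter #(<= low % high) vs))

(defn winsor-sketch
  "Clamp values of vs into [low, high], keeping the sample size unchanged."
  [vs low high]
  (map #(max low (min high %)) vs))
```

Given [1 2 3 4 100] with bounds 2 and 4, trimming leaves three values while winsorizing keeps all five, replacing 1 with 2 and 100 with 4.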

wmean DEPRECATED

Deprecated: Use mean

  • (wmean vs)
  • (wmean vs weights)

Weighted mean

wmedian

  • (wmedian vs ws)
  • (wmedian vs ws method)

Weighted median.

Calculation is done using interpolation. There are three methods:

  • :linear - linear interpolation, default
  • :step - step interpolation
  • :average - average of ties

Based on spatstat.geom::weighted.quantile from R.

wmw-odds

  • (wmw-odds [group1 group2])
  • (wmw-odds group1 group2)

Wilcoxon-Mann-Whitney odds

wquantile

  • (wquantile vs ws q)
  • (wquantile vs ws q method)

Weighted quantile.

Calculation is done using interpolation. There are three methods:

  • :linear - linear interpolation, default
  • :step - step interpolation
  • :average - average of ties

Based on spatstat.geom::weighted.quantile from R.

wquantiles

  • (wquantiles vs ws)
  • (wquantiles vs ws qs)
  • (wquantiles vs ws qs method)

Weighted quantiles.

Calculation is done using interpolation. There are three methods:

  • :linear - linear interpolation, default
  • :step - step interpolation
  • :average - average of ties

Based on spatstat.geom::weighted.quantile from R.

wstddev

  • (wstddev vs freqs)

Calculate weighted (unbiased) standard deviation of vs

wvariance

  • (wvariance vs freqs)

Calculate weighted (unbiased) variance of vs.

yeo-johnson-transformation

  • (yeo-johnson-transformation xs)
  • (yeo-johnson-transformation xs lambda)
  • (yeo-johnson-transformation xs lambda alpha)

Yeo-Johnson transformation

Arguments:

  • lambda - power parameter (default: 0.0)
  • alpha - shift parameter (optional)

z-test-one-sample

  • (z-test-one-sample xs)
  • (z-test-one-sample xs m)

One-sample z-test

  • alpha - significance level (default: 0.05)
  • sides - one of: :two-sided, :one-sided-less (short: :one-sided) or :one-sided-greater
  • mu - mean (default: 0.0)

z-test-two-samples

  • (z-test-two-samples xs ys)
  • (z-test-two-samples xs ys {:keys [paired? equal-variances?], :or {paired? false, equal-variances? false}, :as params})

Two-sample z-test

  • alpha - significance level (default: 0.05)
  • sides - one of: :two-sided (default), :one-sided-less (short: :one-sided) or :one-sided-greater
  • mu - mean (default: 0.0)
  • paired? - unpaired or paired test, boolean (default: false)
  • equal-variances? - unequal or equal variances, boolean (default: false)