Machine Learning
ns ml
(:require [fastmath.ml.regression :as regr]
(:as clust]
[fastmath.ml.clustering :as codox])) [fastmath.dev.codox
Regression
Clustering
fastmath.ml.regression
OLS, WLS and GLM regression models with analysis.
->Family
(->Family default-link variance initialize residual-deviance aic quantile-residuals-fun dispersion)
Positional factory function for class fastmath.ml.regression.Family.
->GLMData
(->GLMData model transformer xtxinv ys intercept? offset? intercept beta coefficients observations residuals fitted weights offset names deviance df dispersion dispersions estimated-dispersion? & overage)
Positional factory function for class fastmath.ml.regression.GLMData.
->LMData
(->LMData model intercept? offset? transformer xtxinv intercept beta coefficients offset weights residuals fitted df observations names r-squared adjusted-r-squared sigma2 sigma tss & overage)
Positional factory function for class fastmath.ml.regression.LMData.
->Link
(->Link g mean derivative)
Positional factory function for class fastmath.ml.regression.Link.
->family
(->family family-map)
(->family default-link variance initialize residual-deviance aic quantile-residuals-fun dispersion)
(->family variance residual-deviance)
Create Family
record.
Arguments:
default-link
- canonical link function, default::identity
variance
- variance function in terms of meaninitialize
- initialization of glm, default: the same as in:gaussian
residual-deviance
- calculates residual devianceaic
- calculates AIC, default(constantly ##NaN)
quantile-residuals-fun
- calculates quantile residuals, default as in:gaussian
disperation
- value or:estimate
(default),:pearson
or:mean-deviance
Initialization will be called with ys
and weights
and should return:
- ys, possibly changed if any adjustment is necessary
- init-mu, starting point
- weights, possibly changes or orignal
- (optional) any other data used to calculate AIC
AIC function should accept: ys
, fitted
, weights
, deviance
, observation
, rank
(fitted parameters) and additional data created by initialization
Minimum version should define variance
and residual-deviance
.
->link
(->link link-map)
(->link g mean mean-derivative)
Creates link record.
Args:
g
- link functionmean
- mean, inverse link functionmean-derivative
- derivative of mean
->string
analysis
(analysis model)
Influence analysis, laverage, standardized and studentized residuals, correlation.
dose
(dose glm-model)
(dose glm-model p)
(dose glm-model p coeff-id)
(dose {:keys [link-fun xtxinv coefficients]} p intercept-id coeff-id)
Predict Lethal/Effective dose for given p
(default: p=0.5, median).
- intercept-id - id of intercept, default: 0
- coeff-id is the coefficient used for calculating dose, default: 1
families
family-with-link
(family-with-link family)
(family-with-link family params)
(family-with-link family link params)
Returns family with a link as single map.
glm
(glm ys xss)
(glm ys xss {:keys [max-iters tol epsilon family link weights alpha offset dispersion-estimator intercept? init-mu simple? transformer names], :or {max-iters 25, tol 1.0E-8, epsilon 1.0E-8, family :gaussian, alpha 0.05, intercept? true, simple? false}, :as params})
Fit a generalized linear model using IRLS method.
Arguments:
ys
- response vectorxss
- terms of systematic component- optional parameters
Parameters:
:tol
- tolerance for matrix decomposition (SVD and Cholesky), default:1.0e-8
:epsilon
- tolerance for IRLS (stopping condition), default:1.0e-8
:max-iters
- maximum numbers of iterations, default:25
:weights
- optional weights:offset
- optional offset:alpha
- significance level, default:0.05
:intercept?
- should intercept term be included, default:true
:init-mu
- initial response vector for IRLS:simple?
- returns simplified result:dispersion-estimator
-:pearson
,:mean-deviance
or any number, replaces default one.:family
- family, default::gaussian
:link
- link:nbinomial-theta
- theta for:nbinomial
family, default:1.0
.:transformer
- an optional function which will be used to transform systematic componentxs
before fitting and prediction:names
- an optional vector of names to use when printing the model
Family is one of the: :gaussian
(default), :binomial
, :quasi-binomial
, :poisson
, :quasi-poisson
, :gamma
, :inverse-gaussian
, :nbinomial
, custom Family
record (see ->family) or a function returning Family (accepting a map as an argument)
Link is one of the: :probit
, :identity
, :loglog
, :sqrt
, :inverse
, :logit
, :power
, :nbinomial
, :cauchit
, :distribution
, :cloglog
, :inversesq
, :log
, :clog
, custom Link
record (see ->link) or a function returning Link (accepting a map as an argument)
Notes:
- SVD decomposition is used instead of more common QR
- intercept term is added implicitely if
intercept?
is set totrue
(by default) :nbinomial
family requires:nbinomial-theta
parameter- Each family has its own default (canonical) link.
Returned record implementes IFn
protocol and contains:
:model
- set to:glm
:intercept?
- whether intercept term is included or not:xtxinv
- (X^T X)^-1:intercept
- intercept term value:beta
- vector of model coefficients (without intercept):coefficients
- coefficient analysis, a list of maps containing:estimate
,:stderr
,:t-value
,:p-value
and:confidence-interval
:weights
- weights,:weights
(working) and:initial
:residuals
- a map containing:raw
,:working
,:pearsons
and:deviance
residuals:fitted
- fitted values for xss:df
- degrees of freedom::residual
,:null
and:intercept
:observations
- number of observations:deviance
- deviances::residual
and:null
:dispersion
- default or calculated, used in a model:dispersions
-:pearson
and:mean-deviance
:family
- family used:link
- link used:link-fun
- link function,g
:mean-fun
- mean function,g^-1
:q
- (1-alpha/2) quantile of T or Normal distribution for residual degrees of freedom:chi2
and:p-value
- Chi-squared statistic and respective p-value:ll
- a map containing log-likelihood and AIC/BIC:analysis
- laverage, residual and influence analysis - a delay:iters
and:converged?
- number of iterations and convergence indicator
Analysis, delay containing a map:
:residuals
-:standardized
and:studentized
residuals (pearsons and deviance):laverage
-:hat
,:sigmas
and laveraged:coefficients
(leave-one-out):influence
-:cooks-distance
,:dffits
,:dfbetas
and:covratio
:influential
- list of influential observations (ids) for influence measures:correlation
- correlation matrix of estimated parameters
glm-nbinomial
(glm-nbinomial ys xss)
(glm-nbinomial ys xss {:keys [nbinomial-theta max-iters epsilon], :or {max-iters 25, epsilon 1.0E-8}, :as params})
Fits theta for negative binomial glm in iterative process.
Returns fitted model with :nbinomial-theta
key.
Arguments and parameters are the same as for glm
.
Additional parameters:
:nbinomial-theta
- initial theta used as a starting point for optimization.
links
lm
(lm ys xss)
(lm ys xss {:keys [tol weights alpha intercept? offset transformer names], :or {tol 1.0E-8, alpha 0.05, intercept? true}})
Fit a linear model using ordinary (OLS) or weighted (WLS) least squares.
Arguments:
ys
- response vectorxss
- terms of systematic component- optional parameters
Parameters:
:tol
- tolerance for matrix decomposition (SVD and Cholesky), default:1.0e-8
:weights
- optional weights for WLS:offset
- optional offset:alpha
- significance level, default:0.05
:intercept?
- should intercept term be included, default:true
:transformer
- an optional function which will be used to transform systematic componentxs
before fitting and prediction:names
- sequence or string, used as name for coefficient when pretty-printing model, default'X'
Notes:
- SVD decomposition is used instead of more common QR
- intercept term is added implicitely if
intercept?
is set totrue
(by default) - Two variants of AIC/BIC are calculated, one based on log-likelihood, second on RSS/n
Returned record implementes IFn
protocol and contains:
:model
-:ols
or:wls
:intercept?
- whether intercept term is included or not:xtxinv
- (X^T X)^-1:intercept
- intercept term value:beta
- vector of model coefficients (without intercept):coefficients
- coefficient analysis, a list of maps containing:estimate
,:stderr
,:t-value
,:p-value
and:confidence-interval
:weights
- initial weights:residuals
- a map containing:raw
and:weighted
residuals:fitted
- fitted values for xss:df
- degrees of freedom::residual
,:model
and:intercept
:observations
- number of observations:r-squared
and:adjusted-r-squared
:sigma
and:sigma2
- deviance and variance:msreg
- regression mean squared:rss
,:regss
,:tss
- residual, regression and total sum of squares:qt
- (1-alpha/2) quantile of T distribution for residual degrees of freedom:f-statistic
and:p-value
- F statistic and respective p-value:ll
- a map containing log-likelihood and AIC/BIC in two variants: based on log-likelihood and RSS:analysis
- laverage, residual and influence analysis - a delay
Analysis, delay containing a map:
:residuals
-:standardized
and:studentized
weighted residuals:laverage
-:hat
,:sigmas
and laveraged:coefficients
(leave-one-out):influence
-:cooks-distance
,:dffits
,:dfbetas
and:covratio
:influential
- list of influential observations (ids) for influence measures:correlation
- correlation matrix of estimated parameters:normality
- residuals normality tests::skewness
,:kurtosis
,:durbin-watson
(for raw and weighted),:jarque-berra
and:omnibus
(normality)
map->Family
(map->Family m__7997__auto__)
Factory function for class fastmath.ml.regression.Family, taking a map of keywords to field values.
map->GLMData
(map->GLMData m__7997__auto__)
Factory function for class fastmath.ml.regression.GLMData, taking a map of keywords to field values.
map->LMData
(map->LMData m__7997__auto__)
Factory function for class fastmath.ml.regression.LMData, taking a map of keywords to field values.
map->Link
(map->Link m__7997__auto__)
Factory function for class fastmath.ml.regression.Link, taking a map of keywords to field values.
predict
(predict model xs)
(predict model xs stderr?)
Predict from the given model and data point.
If stderr?
is true, standard error and confidence interval is added. If model is fitted with offset, first element of data point should contain provided offset.
Expected data point:
[x1,x2,...,xn]
- when model was trained without offset[offset,x1,x2,...,xn]
- when offset was used for training[]
ornil
- when model was trained with intercept only[offset]
- when model was trained with intercept and offset
quantile-residuals
(quantile-residuals {:keys [quantile-residuals-fun residuals dispersion], :as model})
Quantile residuals for a model, possibly randomized.
fastmath.ml.clustering
dbscan
(dbscan xss)
(dbscan xss {:keys [eps points distance add-data?], :or {distance :euclidean, add-data? true}})
fuzzy-kmeans
(fuzzy-kmeans xss)
(fuzzy-kmeans xss {:keys [clusters fuzziness max-iters distance epsilon rng add-data?], :or {clusters 1, fuzziness 2, max-iters -1, distance :euclidean, epsilon 0.001, rng (r/rng :default), add-data? true}})
infer-dbscan-radius
(infer-dbscan-radius xss dist)
kmeans++
(kmeans++ xss)
(kmeans++ xss {:keys [clusters max-iters distance rng empty-cluster-strategy trials add-data?], :or {clusters 1, max-iters -1, distance :euclidean, empty-cluster-strategy :largest-variance, trials 1, rng (r/rng :default), add-data? true}})
source: clay/ml.clj