Statistics and Machine Learning Toolbox

Analyze and model data using statistics and machine learning

Statistics and Machine Learning Toolbox™ provides functions and apps to describe, analyze, and model data. You can use descriptive statistics, visualizations, and clustering for exploratory data analysis; fit probability distributions to data; generate random numbers for Monte Carlo simulations; and perform hypothesis tests. Regression and classification algorithms let you draw inferences from data and build predictive models, either interactively with the Classification Learner and Regression Learner apps or programmatically with AutoML.

For multidimensional data analysis and feature extraction, the toolbox provides principal component analysis (PCA), regularization, dimensionality reduction, and feature selection methods that let you identify variables with the best predictive power.

The toolbox provides supervised, semi-supervised, and unsupervised machine learning algorithms, including support vector machines (SVMs), boosted decision trees, k-means, and other clustering methods. You can apply interpretability techniques such as partial dependence plots and LIME, and automatically generate C/C++ code for embedded deployment. Many toolbox algorithms can be used on data sets that are too big to be stored in memory.


Exploratory Data Analysis

Explore data through statistical plotting with interactive graphics and descriptive statistics. Identify patterns and features with clustering.

Visualizations

Visually explore data using probability plots, box plots, histograms, quantile-quantile plots, and advanced plots for multivariate analysis, such as dendrograms, biplots, and Andrews plots.

Use a multidimensional scatter plot to explore relationships between variables.
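
For instance, a minimal sketch using the fisheriris sample data set that ships with the toolbox:

    load fisheriris                             % meas: 150x4 measurements, species: labels
    gscatter(meas(:,1), meas(:,2), species)     % grouped scatter plot of two variables
    figure; boxplot(meas)                       % box plot for each measurement column
    figure; qqplot(meas(:,1))                   % quantile-quantile plot against a normal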

Descriptive Statistics

Understand and describe potentially large sets of data quickly using a few highly relevant numbers.

Explore data using grouped means and variances.
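
A quick sketch of grouped summary statistics with grpstats, again on the fisheriris sample data:

    load fisheriris
    tbl = table(categorical(species), meas(:,1), ...
        'VariableNames', {'Species', 'SepalLength'});
    grpstats(tbl, 'Species', {'mean', 'var'})   % per-species means and variances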

Cluster Analysis

Identify natural groupings in data using clustering methods such as k-means and DBSCAN.

Applying DBSCAN to two concentric groups.
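
A minimal sketch on synthetic two-ring data (the data here is made up purely for illustration):

    theta = 2*pi*rand(200, 1);                  % random angles
    X = [cos(theta) sin(theta); 3*cos(theta) 3*sin(theta)] + 0.1*randn(400, 2);
    idx = dbscan(X, 0.5, 5);                    % epsilon = 0.5, at least 5 neighbors
    gscatter(X(:,1), X(:,2), idx)               % the two rings emerge as two clusters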

Feature Extraction and Dimensionality Reduction

Transform raw data into features that are most suitable for machine learning. Iteratively explore and create new features, and select the ones that optimize performance.

Feature Extraction

Extract features from data using unsupervised learning techniques such as sparse filtering and reconstruction ICA. You can also use specialized techniques to extract features from images, signals, text, and numeric data.

Extracting features from signals provided by mobile devices. 
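
As a sketch of the calling pattern, sparse filtering on a placeholder matrix (the data is random and stands in for real signal features):

    X = randn(1000, 20);                        % placeholder: 1000 observations, 20 raw features
    Mdl = sparsefilt(X, 5);                     % learn q = 5 sparse features, unsupervised
    Z = transform(Mdl, X);                      % 1000x5 matrix of extracted features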

Feature Selection

Automatically identify the subset of features that provide the best predictive power in modeling the data. Feature selection methods include stepwise regression, sequential feature selection, regularization, and ensemble methods.

NCA helps select the features that preserve most of the model's accuracy.
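
For example, ranking features with neighborhood component analysis (fscnca) on the ionosphere sample data; the regularization value here is an illustrative choice:

    load ionosphere                             % X: 351x34 predictors, Y: class labels
    nca = fscnca(X, Y, 'Lambda', 1/351);        % fit NCA feature weights
    [~, order] = sort(nca.FeatureWeights, 'descend');
    order(1:5)                                  % indices of the five highest-weight features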

Feature Transformation and Dimensionality Reduction

Reduce dimensionality by transforming existing (non-categorical) features into new predictor variables where less descriptive features can be dropped. Feature transformation methods include PCA, factor analysis, and nonnegative matrix factorization.

PCA can project high-dimensional vectors onto a lower-dimensional orthogonal coordinate system with most of their information preserved.
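
A minimal PCA sketch on the hald sample data:

    load hald                                   % ingredients: 13x4 composition data
    [coeff, score, ~, ~, explained] = pca(ingredients);
    cumsum(explained)                           % cumulative percent variance explained
    X2 = score(:, 1:2);                         % project onto the first two components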

Machine Learning

Build predictive classification and regression models using interactive apps or automated machine learning (AutoML). Automatically select features, identify the best model, and tune hyperparameters.

Train, Validate, and Tune Predictive Models

Compare various machine learning algorithms, select features, adjust hyperparameters, and evaluate the performance of many popular classification and regression algorithms. Build and automatically optimize predictive models with interactive apps, and incrementally improve models with streaming data.
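
As a sketch, comparing two classifiers by cross-validated error on the ionosphere sample data:

    load ionosphere
    svm  = fitcsvm(X, Y, 'KernelFunction', 'rbf', 'Standardize', true);
    tree = fitctree(X, Y);
    kfoldLoss(crossval(svm,  'KFold', 5))       % estimated error, SVM
    kfoldLoss(crossval(tree, 'KFold', 5))       % estimated error, decision tree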

Model Interpretability

Enhance the interpretability of black-box machine learning models by applying established interpretability methods including partial dependence plots, individual conditional expectations (ICE), and local interpretable model-agnostic explanations (LIME).

LIME builds simple approximations of complex models in a local area.
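
For example, a partial dependence plot and a LIME explanation for an ensemble trained on the carsmall sample data (the query point is an arbitrary illustrative choice):

    load carsmall
    tbl = table(Weight, Horsepower, MPG);
    mdl = fitrensemble(tbl, 'MPG');             % a black-box regression model
    plotPartialDependence(mdl, 'Weight')        % partial dependence of MPG on Weight
    results = lime(mdl, 'QueryPoint', tbl(1, 1:2), 'NumImportantPredictors', 2);
    figure; plot(results)                       % local linear approximation near the query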

Automated Machine Learning (AutoML)

Improve model performance by automatically tuning hyperparameters, selecting features and models, and addressing data set imbalances with cost matrices.

Optimizing hyperparameters efficiently using Bayesian optimization.
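
A minimal fitcauto sketch; the 60-second budget is an arbitrary cap to keep the search short:

    load ionosphere
    mdl = fitcauto(X, Y, ...                    % search models and hyperparameters
        'HyperparameterOptimizationOptions', struct('MaxTime', 60));
    label = predict(mdl, X(1,:));               % use the selected model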

Regression and ANOVA

Model a continuous response variable as a function of one or more predictors, using linear and nonlinear regression, mixed-effects models, generalized linear models, and nonparametric regression. Assign variance to different sources using ANOVA.

Linear and Nonlinear Regression

Model the behavior of complex systems with multiple predictors or response variables by choosing from many linear and nonlinear regression algorithms. Fit multilevel or hierarchical linear, nonlinear, and generalized linear mixed-effects models, with nested or crossed random effects, to perform longitudinal or panel analyses, repeated measures modeling, and growth modeling.

Fit regression models interactively with the Regression Learner app.
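
Programmatically, a multiple linear regression on the carsmall sample data looks like this:

    load carsmall
    tbl = table(Weight, Horsepower, MPG);
    lm = fitlm(tbl, 'MPG ~ Weight + Horsepower')   % fit and display coefficient table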

Nonparametric Regression

Generate an accurate fit without specifying a parametric model of the relationship between predictors and response, using SVMs, random forests, Gaussian processes, and Gaussian kernels.

Identify outliers using quantile regression.
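
For instance, a Gaussian process fit of MPG against vehicle weight on the carsmall sample data (rows with missing values dropped first):

    load carsmall
    tbl = rmmissing(table(Weight, MPG));        % drop rows with missing values
    gpr = fitrgp(tbl.Weight, tbl.MPG, 'KernelFunction', 'squaredexponential');
    w = linspace(min(tbl.Weight), max(tbl.Weight), 200)';
    plot(tbl.Weight, tbl.MPG, '.', w, predict(gpr, w), '-')   % data and nonparametric fit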

Analysis of Variance (ANOVA)

Assign sample variance to different sources and determine whether the variation arises within or among different population groups. Use one-way, two-way, multiway, multivariate, and nonparametric ANOVA, as well as analysis of covariance (ANOCOVA) and repeated measures analysis of variance (RANOVA).

Test groups using multiway ANOVA.
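
As a sketch, a two-way ANOVA on the carsmall sample data:

    load carsmall
    p = anovan(MPG, {Origin, Model_Year}, ...   % variance by origin and model year
        'varnames', {'Origin', 'Year'})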

Probability Distributions and Hypothesis Tests

Fit distributions to data. Analyze whether sample-to-sample differences are significant or consistent with random data variation. Generate random numbers from various distributions.

Probability Distributions

Fit continuous and discrete distributions, use statistical plots to evaluate goodness-of-fit, and compute probability density functions and cumulative distribution functions for more than 40 different distributions.

Fit distributions interactively using the Distribution Fitter app.
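
Programmatically, fitting and evaluating a distribution might look like this (carsmall sample data; the normal model is an illustrative choice):

    load carsmall
    mpg = MPG(~isnan(MPG));                     % remove missing values
    pd = fitdist(mpg, 'Normal')                 % fit a normal distribution
    x = linspace(min(mpg), max(mpg), 100);
    plot(x, pdf(pd, x))                         % density of the fitted model
    cdf(pd, 30)                                 % P(MPG <= 30) under the fitted model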

Random Number Generation

Generate pseudorandom and quasi-random number streams from either a fitted or a constructed probability distribution.

Interactively generate random numbers.
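
For example, pseudorandom draws from a constructed distribution and a quasi-random Halton stream (the parameter values are arbitrary):

    pd = makedist('Weibull', 'a', 2, 'b', 1.5); % construct a distribution object
    r = random(pd, 1000, 1);                    % 1000 pseudorandom draws
    q = qrandstream(haltonset(1, 'Skip', 1000));   % quasi-random stream
    u = qrand(q, 1000);                         % 1000 quasi-random points in [0, 1]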

Hypothesis Testing

Perform t-tests, distribution tests (chi-square, Jarque-Bera, Lilliefors, and Kolmogorov-Smirnov), and nonparametric tests for one, paired, or independent samples. Test for autocorrelation and randomness, and compare distributions (two-sample Kolmogorov-Smirnov).

Rejection region in a one-sided t-test.
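
A minimal sketch on synthetic samples (the 0.3 mean shift is made up for illustration):

    x = randn(100, 1);                          % sample with mean 0
    y = randn(100, 1) + 0.3;                    % sample with mean 0.3
    [h, p]   = ttest2(x, y)                     % two-sample t-test
    [h2, p2] = kstest2(x, y)                    % two-sample Kolmogorov-Smirnov test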

Industrial Statistics

Statistically analyze effects and data trends. Apply industrial statistical techniques such as customized design of experiments and statistical process control.

Design of Experiments (DOE)

Define, analyze, and visualize a customized DOE. Create and test practical plans for varying input factors in tandem to learn their effects on outputs.

Apply a Box-Behnken design to generate higher order response surfaces.
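
For instance, generating a three-factor Box-Behnken design:

    d = bbdesign(3)                             % each row is one experimental run;
                                                % columns are coded levels -1, 0, +1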

Statistical Process Control (SPC)

Monitor and improve products or processes by evaluating process variability. Create control charts, estimate process capability, and perform gage repeatability and reproducibility studies.

Monitoring manufacturing processes using control charts.
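
A sketch using the parts sample data that ships with the toolbox:

    load parts                                  % runout: manufacturing measurements
    st = controlchart(runout, 'charttype', {'xbar', 'r'});   % X-bar and R charts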

Reliability and Survival Analysis

Visualize and analyze time-to-failure data, with and without censoring, by performing Cox proportional hazards regression and fitting distributions. Compute empirical hazard, survivor, and cumulative distribution functions, as well as kernel density estimates.

Failure data as an example of “censored” values.
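
For example, an empirical survivor function and a Cox fit on synthetic, right-censored failure times (both the data and the single covariate are made up for illustration):

    t = exprnd(10, 100, 1);                     % synthetic failure times
    cens = t > 15;  t(cens) = 15;               % right-censor observations at t = 15
    [f, x] = ecdf(t, 'Censoring', cens, 'Function', 'survivor');
    stairs(x, f)                                % empirical survivor function
    b = coxphfit(randn(100, 1), t, 'Censoring', cens)   % Cox proportional hazards fit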

Big Data, Parallelization, and Cloud Computing

Apply statistical and machine learning techniques to out-of-memory data. Speed up statistical computations and machine learning model training with parallelization on clusters and cloud instances.

Analyze Big Data with Tall Arrays

Use tall arrays and tables with many classification, regression, and clustering algorithms to train models on data sets that do not fit in memory, without changing your code.

Speed up computations with Parallel Computing Toolbox or MATLAB Parallel Server.
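
A minimal tall-array sketch using the airlinesmall.csv sample file that ships with MATLAB:

    ds = datastore('airlinesmall.csv', 'TreatAsMissing', 'NA');
    ds.SelectedVariableNames = {'Distance', 'ArrDelay'};
    t = tall(ds);                               % tall table; data stays on disk
    mdl = fitlm(t, 'ArrDelay ~ Distance')       % regression on out-of-memory data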

Cloud and Distributed Computing

Use cloud instances to speed up statistical and machine learning computations. Perform the complete machine learning workflow in MATLAB Online™.

Perform computations on Amazon or Azure cloud instances.

Deployment, Code Generation, and Simulink Integration

Deploy statistical and machine learning models to embedded systems, accelerate computationally intensive calculations using generated C code, and integrate with enterprise systems and Simulink models.

Code Generation

Generate portable and readable C or C++ code for inference of classification and regression algorithms, descriptive statistics, and probability distributions using MATLAB Coder™. Generate C/C++ prediction code with reduced precision using Fixed Point Designer™, and update parameters of deployed models without regenerating the prediction code.

Two paths to deployment: generate C code or compile MATLAB code.
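
A sketch of the save/load workflow for code generation (predictSVM.m is a hypothetical file name, and the final step requires MATLAB Coder):

    load ionosphere
    mdl = fitcsvm(X, Y);                        % train an SVM on in-memory data
    saveLearnerForCoder(compact(mdl), 'svmModel');   % save a codegen-compatible model
    % contents of predictSVM.m (hypothetical):
    %   function label = predictSVM(x) %#codegen
    %   mdl = loadLearnerForCoder('svmModel');
    %   label = predict(mdl, x);
    % then generate C code: codegen predictSVM -args {X(1,:)}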

Integration with Simulink

Integrate machine learning models with Simulink models for deployment to embedded hardware or for system simulation, verification, and validation.

Integrate with Applications and Enterprise Systems

Deploy statistical and machine learning models as standalone, MapReduce, or Spark™ applications; as web apps; or as Microsoft® Excel® add-ins using MATLAB Compiler™. Build C/C++ shared libraries, Microsoft .NET assemblies, Java® classes, and Python® packages using MATLAB Compiler SDK™.

Use MATLAB Compiler to integrate an air quality classification model.

Latest Features

AutoML

Automatically select the best model and associated hyperparameters for regression (fitrauto)

Interpretability

Obtain local interpretable model-agnostic explanations (LIME)

SVM Prediction Blocks

Simulate and generate code for SVM models in Simulink

Incremental Learning

Train linear regression and binary classification models incrementally

Semi-Supervised Learning

Extrapolate partial class labels to the entire data set using graphs and self-trained models (fitsemigraph, fitsemiself)

Code Generation

Generate single precision C/C++ code for predictions

Performance

Speed up training of SVM models

See the release notes for details on any of these features and corresponding functions.

Machine Learning Onramp

An interactive introduction to practical machine learning methods for classification problems.