
Wilkinson notation provides a way to describe regression and repeated measures models without specifying coefficient values. This specialized notation identifies the response variable and which predictor variables to include or exclude from the model. You can also include squared and higher-order terms, interaction terms, and grouping variables in the model formula.

Specifying a model using Wilkinson notation provides several advantages:

- You can include or exclude individual predictors and interaction terms from the model. For example, the `'Interactions'` name-value pair available in each model fitting function includes interaction terms for all pairs of variables. Wilkinson notation instead lets you include only the interaction terms of interest.
- You can change the model formula without changing the design matrix if your input data uses the `table` data type. For example, if you fit an initial model using all the available predictor variables but then decide to remove a variable that is not statistically significant, you can rewrite the model formula to include only the variables of interest. You do not need to make any changes to the input data itself.
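As a minimal sketch of the second point, the following code fits two models to the same table and changes only the formula. It assumes the `carsmall` sample data used elsewhere on this page; the names `mdlFull` and `mdlReduced` are illustrative.

```matlab
% Load sample data and collect the variables of interest in a table.
load carsmall
tbl = table(Weight,Acceleration,MPG);

% Initial model with both predictors.
mdlFull = fitlm(tbl,'MPG ~ Weight + Acceleration');

% Reduced model: only the formula changes; tbl is reused as-is.
mdlReduced = fitlm(tbl,'MPG ~ Weight');

% Compare the terms included in each fit.
disp(mdlFull.CoefficientNames)
disp(mdlReduced.CoefficientNames)
```

The table still contains `Acceleration` after the second fit; the formula alone determines which columns enter the model.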

Statistics and Machine Learning Toolbox™ offers several model fitting functions that use Wilkinson notation, including:

- Linear models (using `fitlm` and `stepwiselm`)
- Generalized linear models (using `fitglm`)
- Linear mixed-effects models (using `fitlme` and `fitlmematrix`)
- Generalized linear mixed-effects models (using `fitglme`)
- Repeated measures models (using `fitrm`)
- Cox proportional hazards models (using `fitcox`)

A formula for model specification is a character vector or string scalar of the form `y ~ terms`, where `y` is the name of the response variable, and `terms` defines the model using the predictor variable names and the following operators.

Predictor Terms in Model | Wilkinson Notation |
---|---|
intercept | `1` |
no intercept | `-1` |
x_{1} | `x1` |
x_{1}, x_{2} | `x1 + x2` |
x_{1}, x_{2}, x_{1}x_{2} | `x1*x2` or `x1 + x2 + x1:x2` |
x_{1}x_{2} | `x1:x2` |
x_{1}, x_{1}^{2} | `x1^2` |
x_{1}^{2} | `x1^2 - x1` |

Wilkinson notation includes an intercept term in the model by default, even if you do not add `1` to the model formula. To exclude the intercept from the model, use `-1` in the formula.

The `*` operator (for interactions) and the `^` operator (for powers and exponents) automatically include all lower-order terms. For example, if you specify `x^3`, the model automatically includes *x*^{3}, *x*^{2}, and *x*. To exclude certain terms from the model, use the `-` operator to remove them.
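The effect of these operators can be checked by inspecting the terms of a fitted model. The sketch below assumes the `carsmall` sample data and uses a single predictor, so the default variable name is `x1`; the model variable names are illustrative.

```matlab
load carsmall
X = Weight;   % one predictor column, referred to as x1 in the formula
y = MPG;

% ^ pulls in the lower-order term: intercept, x1, and x1^2.
mdlQuad = fitlm(X,y,'y ~ x1^2');

% Subtracting x1 leaves the intercept and x1^2 only.
mdlSquareOnly = fitlm(X,y,'y ~ x1^2 - x1');

% -1 drops the default intercept: x1 only.
mdlNoIntercept = fitlm(X,y,'y ~ -1 + x1');

disp(mdlQuad.CoefficientNames)
disp(mdlSquareOnly.CoefficientNames)
disp(mdlNoIntercept.CoefficientNames)
```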

For random-effects and mixed-effects models, the formula specification includes the names of the predictor variables and the grouping variables. For example, if the predictor variable *x*_{1} is a random effect grouped by the variable *g*, then represent this in Wilkinson notation as follows:

`(x1 | g)`

For repeated measures models, the formula specification includes all of the repeated measures as responses, and the factors as predictor variables. Specify the response variables for repeated measures models as described in the following table.

Response Terms in Model | Wilkinson Notation |
---|---|
y_{1} | `y1` |
y_{1}, y_{2}, y_{3} | `y1,y2,y3` |
y_{1}, y_{2}, y_{3}, y_{4}, y_{5} | `y1-y5` |

For example, if you have three repeated measures as responses and the factors *x*_{1}, *x*_{2}, and *x*_{3} as the predictor variables, then you can define the repeated measures model using Wilkinson notation as follows:

`'y1,y2,y3 ~ x1 + x2 + x3'`

or

`'y1-y3 ~ x1 + x2 + x3'`
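The two response specifications can be compared directly. This sketch assumes the `fisheriris` sample data, treating the four measurement columns as repeated measures and `species` as the only factor; the variable names `meas1` through `meas4`, `rmList`, and `rmRange` are chosen here for illustration.

```matlab
load fisheriris

% One table column per repeated measure, plus the between-subjects factor.
t = table(species,meas(:,1),meas(:,2),meas(:,3),meas(:,4), ...
    'VariableNames',{'species','meas1','meas2','meas3','meas4'});

% Comma-separated list of responses.
rmList = fitrm(t,'meas1,meas2,meas3,meas4 ~ species');

% Equivalent range specification.
rmRange = fitrm(t,'meas1-meas4 ~ species');

% Both models should use the same four response variables.
disp(isequal(rmList.ResponseNames,rmRange.ResponseNames))
```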

If the input data (response and predictor variables) is stored in a table or dataset array, you can specify the formula using the variable names. For example, load the `carsmall` sample data. Create a table containing `Weight`, `Acceleration`, and `MPG`. Name each variable using the `'VariableNames'` name-value pair argument of the `table` function. Then use `fitlm` to fit the following model to the data:

$$MPG={\beta}_{0}+{\beta}_{1}Weight+{\beta}_{2}Acceleration$$

    load carsmall
    tbl = table(Weight,Acceleration,MPG, ...
        'VariableNames',{'Weight','Acceleration','MPG'});
    mdl = fitlm(tbl,'MPG ~ Weight + Acceleration')

    mdl = 
    Linear regression model:
        MPG ~ 1 + Weight + Acceleration

    Estimated Coefficients:
                         Estimate         SE         tStat       pValue  
                        __________    __________    _______    __________

        (Intercept)         45.155        3.4659     13.028    1.6266e-22
        Weight          -0.0082475    0.00059836    -13.783    5.3165e-24
        Acceleration       0.19694       0.14743     1.3359       0.18493

    Number of observations: 94, Error degrees of freedom: 91
    Root Mean Squared Error: 4.12
    R-squared: 0.743,  Adjusted R-Squared: 0.738
    F-statistic vs. constant model: 132, p-value = 1.38e-27

The model object display uses the variable names provided in the input table.

If the input data is stored as a matrix, you can specify the formula using default variable names such as `y`, `x1`, and `x2`. For example, load the `carsmall` sample data. Create a matrix containing the predictor variables `Weight` and `Acceleration`. Then fit the following model to the data:

$$MPG={\beta}_{0}+{\beta}_{1}Weight+{\beta}_{2}Acceleration$$

    load carsmall
    X = [Weight,Acceleration];
    y = MPG;
    mdl = fitlm(X,y,'y ~ x1 + x2')

    mdl = 
    Linear regression model:
        y ~ 1 + x1 + x2

    Estimated Coefficients:
                        Estimate         SE         tStat       pValue  
                       __________    __________    _______    __________

        (Intercept)        45.155        3.4659     13.028    1.6266e-22
        x1             -0.0082475    0.00059836    -13.783    5.3165e-24
        x2                0.19694       0.14743     1.3359       0.18493

    Number of observations: 94, Error degrees of freedom: 91
    Root Mean Squared Error: 4.12
    R-squared: 0.743,  Adjusted R-Squared: 0.738
    F-statistic vs. constant model: 132, p-value = 1.38e-27

The term `x1` in the model specification formula corresponds to the first column of the predictor variable matrix `X`. The term `x2` corresponds to the second column of the input matrix. The term `y` corresponds to the response variable.
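If the default names are inconvenient, one option with matrix input is the `'VarNames'` name-value argument of `fitlm`, which names the columns of `X` first and the response last. The following is a sketch under that assumption, again using `carsmall`:

```matlab
load carsmall
X = [Weight,Acceleration];
y = MPG;

% Name the predictor columns first and the response last, then use
% those names in the formula instead of x1, x2, and y.
mdl = fitlm(X,y,'MPG ~ Weight + Acceleration', ...
    'VarNames',{'Weight','Acceleration','MPG'});

disp(mdl.CoefficientNames)
```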

Use `fitlm` and `stepwiselm` to fit linear models.

For a linear regression model with an intercept and two fixed-effects predictors, such as

$${y}_{i}={\beta}_{0}+{\beta}_{1}{x}_{i1}+{\beta}_{2}{x}_{i2}+{\epsilon}_{i},$$

specify the model formula using Wilkinson notation as follows:

`'y ~ x1 + x2'`

For a linear regression model with no intercept and two fixed-effects predictors, such as

$${y}_{i}={\beta}_{1}{x}_{i1}+{\beta}_{2}{x}_{i2}+{\epsilon}_{i},$$

specify the model formula using Wilkinson notation as follows:

`'y ~ -1 + x1 + x2'`

For a linear regression model with an intercept, two fixed-effects predictors, and an interaction term, such as

$${y}_{i}={\beta}_{0}+{\beta}_{1}{x}_{i1}+{\beta}_{2}{x}_{i2}+{\beta}_{3}{x}_{i1}{x}_{i2}+{\epsilon}_{i},$$

specify the model formula using Wilkinson notation as follows:

`'y ~ x1*x2'`

or

`'y ~ x1 + x2 + x1:x2'`

For a linear regression model with an intercept, three fixed-effects predictors, and interaction effects between all three predictors plus all lower-order terms, such as

$${y}_{i}={\beta}_{0}+{\beta}_{1}{x}_{i1}+{\beta}_{2}{x}_{i2}+{\beta}_{3}{x}_{i3}+{\beta}_{4}{x}_{i1}{x}_{i2}+{\beta}_{5}{x}_{i1}{x}_{i3}+{\beta}_{6}{x}_{i2}{x}_{i3}+{\beta}_{7}{x}_{i1}{x}_{i2}{x}_{i3}+{\epsilon}_{i},$$

specify the model formula using Wilkinson notation as follows:

`'y ~ x1*x2*x3'`

For a linear regression model with an intercept, three fixed-effects predictors, and interaction effects between two of the predictors, such as

$${y}_{i}={\beta}_{0}+{\beta}_{1}{x}_{i1}+{\beta}_{2}{x}_{i2}+{\beta}_{3}{x}_{i3}+{\beta}_{4}{x}_{i1}{x}_{i2}+{\epsilon}_{i},$$

specify the model formula using Wilkinson notation as follows:

`'y ~ x1*x2 + x3'`

or

`'y ~ x1 + x2 + x3 + x1:x2'`

For a linear regression model with an intercept, three fixed-effects predictors, and pairwise interaction effects between all three predictors, but excluding an interaction effect between all three predictors simultaneously, such as

$${y}_{i}={\beta}_{0}+{\beta}_{1}{x}_{i1}+{\beta}_{2}{x}_{i2}+{\beta}_{3}{x}_{i3}+{\beta}_{4}{x}_{i1}{x}_{i2}+{\beta}_{5}{x}_{i1}{x}_{i3}+{\beta}_{6}{x}_{i2}{x}_{i3}+{\epsilon}_{i},$$

specify the model formula using Wilkinson notation as follows:

`'y ~ x1*x2*x3 - x1:x2:x3'`
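One way to confirm which terms each formula produces is to count the estimated coefficients. The sketch below uses synthetic data generated only for illustration (the coefficients and noise level are arbitrary assumptions), with matrix input so that the predictors take the default names `x1`, `x2`, and `x3`.

```matlab
rng(0)   % reproducible synthetic data, for illustration only
X = randn(100,3);
y = 1 + X*[2; -1; 0.5] + 0.3*X(:,1).*X(:,2) + 0.1*randn(100,1);

% All main effects, all pairwise interactions, and the three-way term.
mdlAll = fitlm(X,y,'y ~ x1*x2*x3');

% Same as above with the three-way interaction removed.
mdlPairwise = fitlm(X,y,'y ~ x1*x2*x3 - x1:x2:x3');

% Three main effects plus the single x1:x2 interaction.
mdlOneInteraction = fitlm(X,y,'y ~ x1*x2 + x3');

% Coefficient counts include the intercept: 8, 7, and 5.
disp([mdlAll.NumCoefficients, mdlPairwise.NumCoefficients, ...
    mdlOneInteraction.NumCoefficients])
```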

Use `fitlme` and `fitlmematrix` to fit linear mixed-effects models.

For a linear mixed-effects model that contains a random intercept but no predictor terms, such as

$${y}_{im}={\beta}_{0m},$$

where

$${\beta}_{0m}={\beta}_{00}+{b}_{0m},\quad {b}_{0m}\sim N\left(0,{\sigma}_{0}^{2}\right)$$

and *g* is the grouping variable with *m* levels, specify the model formula using Wilkinson notation as follows:

`'y ~ (1 | g)'`

For a linear mixed-effects model that contains a fixed intercept, random intercept, and fixed slope for the continuous predictor variable, such as

$${y}_{im}={\beta}_{0m}+{\beta}_{1}{x}_{im},$$

where

$${\beta}_{0m}={\beta}_{00}+{b}_{0m},\quad {b}_{0m}\sim N\left(0,{\sigma}_{0}^{2}\right)$$

and *g* is the grouping variable with *m* levels, specify the model formula using Wilkinson notation as follows:

`'y ~ x1 + (1 | g)'`

For a linear mixed-effects model that contains a fixed intercept, plus a random intercept and a random slope that have a possible correlation between them, such as

$${y}_{im}={\beta}_{0m}+{\beta}_{1m}{x}_{im},$$

where

$${\beta}_{0m}={\beta}_{00}+{b}_{0m}$$

$${\beta}_{1m}={\beta}_{10}+{b}_{1m}$$

$$\left[\begin{array}{c}{b}_{0m}\\ {b}_{1m}\end{array}\right]\sim N\left\{0,{\sigma}^{2}D\left(\theta \right)\right\}$$

and *D* is a 2-by-2 symmetric and positive semidefinite covariance matrix, parameterized by a variance component vector θ, specify the model formula using Wilkinson notation as follows:

`'y ~ x1 + (x1 | g)'`

The pattern of the random effects covariance matrix is determined by the model fitting function. To specify the covariance matrix pattern, use the name-value pairs available through `fitlme` when fitting the model. For example, you can specify the assumption that the random intercept and random slope are independent of one another using the `'CovariancePattern'` name-value pair argument in `fitlme`.
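The following sketch fits the random-intercept, random-slope model twice: once with the default covariance pattern and once with `'CovariancePattern','Diagonal'`, which treats the two random effects as uncorrelated. The data are synthetic and the group count, effect sizes, and variable names are assumptions made for illustration.

```matlab
rng(1)   % reproducible synthetic data, for illustration only
nGroups = 10;
nPerGroup = 20;
idx = repelem((1:nGroups)',nPerGroup);   % group index for each observation
g = categorical(idx);                    % grouping variable
x1 = randn(nGroups*nPerGroup,1);

b0 = 0.8*randn(nGroups,1);   % group-specific intercept deviations
b1 = 0.5*randn(nGroups,1);   % group-specific slope deviations
y = 2 + b0(idx) + (1.5 + b1(idx)).*x1 + 0.2*randn(nGroups*nPerGroup,1);
tbl = table(y,x1,g);

% Random intercept and slope with the default covariance pattern.
lmeFull = fitlme(tbl,'y ~ x1 + (x1 | g)');

% Same formula, with independent random intercept and slope.
lmeDiag = fitlme(tbl,'y ~ x1 + (x1 | g)','CovariancePattern','Diagonal');

disp(fixedEffects(lmeDiag))   % fixed intercept and fixed slope estimates
```

The formula is identical in both calls; only the name-value argument changes the assumed covariance structure.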

Use `fitglm` and `stepwiseglm` to fit generalized linear models.

In a generalized linear model, the *y* response variable has a distribution other than normal, but you can represent the model as an equation that is linear in the regression coefficients. Specifying a generalized linear model requires three parts:

- Distribution of the response variable
- Link function
- Linear predictor

The distribution of the response variable and the link function are specified using name-value pair arguments in the fit function `fitglm` or `stepwiseglm`.

The linear predictor portion of the equation, which appears on the right side of the `~` symbol in the model specification formula, uses Wilkinson notation in the same way as for the linear model examples.

A generalized linear model relates the link function of the response, rather than the response itself, to the linear predictor. The output display for the model object reflects this.

For a generalized linear regression model with an intercept and two predictors, such as

$$\mathrm{log}({y}_{i})={\beta}_{0}+{\beta}_{1}{x}_{i1}+{\beta}_{2}{x}_{i2},$$

specify the model formula using Wilkinson notation as follows:

`'y ~ x1 + x2'`
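As a sketch of how the three parts fit together, the code below passes the linear predictor as a formula and the distribution and link as name-value arguments to `fitglm`. The Poisson data are simulated, and the coefficient values are arbitrary assumptions for illustration.

```matlab
rng(2)   % reproducible synthetic data, for illustration only
x1 = randn(200,1);
x2 = randn(200,1);

% Counts whose log-mean is linear in x1 and x2.
y = poissrnd(exp(0.5 + 0.3*x1 - 0.2*x2));
tbl = table(x1,x2,y);

% The formula gives the linear predictor; the name-value arguments
% give the response distribution and the link function.
glm = fitglm(tbl,'y ~ x1 + x2','Distribution','poisson','Link','log');

disp(glm.Coefficients)
```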

Use `fitglme` to fit generalized linear mixed-effects models.

In a generalized linear mixed-effects model, the *y* response variable has a distribution other than normal, but you can represent the model as an equation that is linear in the regression coefficients. Specifying a generalized linear mixed-effects model requires three parts:

- Distribution of the response variable
- Link function
- Linear predictor

The distribution of the response variable and the link function are specified using name-value pair arguments in the fit function `fitglme`.

The linear predictor portion of the equation, which appears on the right side of the `~` symbol in the model specification formula, uses Wilkinson notation in the same way as for the linear mixed-effects model examples.

A generalized linear mixed-effects model relates the link function of the response, rather than the response itself, to the linear predictor. The output display for the model object reflects this.

The pattern of the random effects covariance matrix is determined by the model fitting function. To specify the covariance matrix pattern, use the name-value pairs available through `fitglme` when fitting the model. For example, you can specify the assumption that the random intercept and random slope are independent of one another using the `'CovariancePattern'` name-value pair argument in `fitglme`.

For a generalized linear mixed-effects model that contains a fixed intercept, random intercept, and fixed slope for the continuous predictor variable, where the response can be modeled using a Poisson distribution, such as

$$\mathrm{log}({y}_{im})={\beta}_{0}+{\beta}_{1}{x}_{im}+{b}_{m},$$

where

$${b}_{m}\sim N\left(0,{\sigma}_{b}^{2}\right)$$

and *g* is the grouping variable with *m* levels, specify the model formula using Wilkinson notation as follows:

`'y ~ x1 + (1 | g)'`
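A sketch of this Poisson model with `fitglme` follows. The data are simulated with a random intercept for each level of `g`; the group count, coefficients, and variable names are assumptions chosen for illustration.

```matlab
rng(3)   % reproducible synthetic data, for illustration only
nGroups = 12;
nPerGroup = 25;
idx = repelem((1:nGroups)',nPerGroup);   % group index for each observation
g = categorical(idx);                    % grouping variable
x1 = randn(nGroups*nPerGroup,1);

b = 0.4*randn(nGroups,1);                % random intercept per group
y = poissrnd(exp(0.2 + 0.5*x1 + b(idx)));
tbl = table(y,x1,g);

% Fixed intercept and slope, plus a random intercept grouped by g.
glme = fitglme(tbl,'y ~ x1 + (1 | g)', ...
    'Distribution','Poisson','Link','log');

disp(fixedEffects(glme))
```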

Use `fitrm` to fit repeated measures models.

For a repeated measures model with five response measurements and one predictor variable, specify the model formula using Wilkinson notation as follows:

`'y1-y5 ~ x1'`

For a repeated measures model with five response measurements and three predictor variables, plus an interaction between two of the predictor variables, specify the model formula using Wilkinson notation as follows:

`'y1-y5 ~ x1*x2 + x3'`
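The second formula can be tried on synthetic data as sketched below. The table layout (five response columns named `y1` through `y5` followed by three continuous predictors) and all numeric values are assumptions made for illustration; no within-subjects design is specified, so `fitrm` uses its default.

```matlab
rng(4)   % reproducible synthetic data, for illustration only
n = 40;
x1 = randn(n,1);
x2 = randn(n,1);
x3 = randn(n,1);

% Five repeated measures per subject sharing the same mean structure.
mu = 1 + 0.5*x1 - 0.3*x2 + 0.2*x3 + 0.4*x1.*x2;
Y = repmat(mu,1,5) + 0.5*randn(n,5);

t = array2table([Y x1 x2 x3], ...
    'VariableNames',{'y1','y2','y3','y4','y5','x1','x2','x3'});

% y1-y5 selects the five response columns; x1*x2 expands to
% x1 + x2 + x1:x2, and x3 is added as a main effect.
rm = fitrm(t,'y1-y5 ~ x1*x2 + x3');

disp(rm.Coefficients)   % one column of coefficients per response
```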

[1] Wilkinson, G. N., and C. E. Rogers. "Symbolic description of factorial models for analysis of variance." *J. Royal Statistical Society, Series C (Applied Statistics)* 22, pp. 392–399, 1973.