anova

Analysis of variance (ANOVA) results

    Description

    An anova object contains the results of a one-, two-, or N-way ANOVA. Use the properties of an anova object to determine if the means in a set of response data differ with respect to the values (levels) of a factor or multiple factors. The object properties include information about the coefficient estimates, ANOVA model fit to the response data, and factors used to perform the analysis.

    Creation

    Description

    aov = anova(y) performs a one-way ANOVA and returns an anova object for the response data in the matrix y. Each column of y is treated as a different factor value.

    aov = anova(factors,y) performs a one-, two-, or N-way ANOVA and returns an anova object for the response data in the vector y. The argument factors specifies the number of factors and their values.

    aov = anova(tbl,y) uses the variables in the table tbl as factors for the response data in the vector y. Each table variable corresponds to a factor.

    aov = anova(tbl,responseVarName) uses the variables in tbl as factors and response data. The responseVarName argument specifies which variable contains the response data.

    aov = anova(tbl,formula) specifies the ANOVA model in Wilkinson notation. The terms of formula use only the variable names in tbl.

    aov = anova(___,Name=Value) specifies additional options using one or more name-value arguments. For example, you can specify which factors are categorical or random, and specify the sum of squares type.

    Input Arguments

    y

    Response data, specified as a matrix or a numeric vector.

    • If y is a matrix, anova treats each column of y as a separate factor value in a one-way ANOVA. In this design, the function evaluates whether the population means of the columns are equal. Use this design when you want to perform a one-way ANOVA on data that is divided equally among the groups (balanced ANOVA).

      Example of the sample input argument Y in a matrix form, illustrating how anova treats each column of y as a separate group

    • If y is a numeric vector, you must also specify either the factors or tbl input argument. For a one-way ANOVA, factors is a cell array of character vectors or a vector in which each element represents the factor value of the corresponding element in y.

      Example of the sample data input argument y and the factors input argument g. Each element in g represents the factor value of the corresponding element in y.

    • For an N-way ANOVA, factors is a cell array of vectors in which each cell is treated as a separate factor. Alternatively, for an N-way ANOVA, you can provide a table tbl in which each variable is treated as a separate factor. Use this design when you want to perform a two- or N-way ANOVA, or when factor values correspond to different numbers of observations in y (unbalanced ANOVA).

    Note

    The anova function ignores NaN values, <undefined> values, empty characters, and empty strings in y. If factors or tbl contains NaN or <undefined> values, or empty characters or strings, the function ignores the corresponding observations in y. The ANOVA is balanced if each factor value has the same number of observations after the function disregards empty or NaN values. Otherwise, the function performs an unbalanced ANOVA.

    Data Types: single | double
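The matrix form and the vector form of y describe the same balanced one-way design, which can be checked directly. The following is a minimal sketch that assumes the popcorn.mat sample data used in the examples on this page; the variable names g, aovMatrix, and aovVector are illustrative and do not come from the reference text.

```matlab
% Sketch: a matrix y and a stacked vector y with a grouping vector
% describe the same balanced one-way design.
load popcorn.mat                  % 6-by-3 matrix, one column per brand

aovMatrix = anova(popcorn);       % each column is treated as one factor value

y = popcorn(:);                   % stack the columns into an 18-by-1 vector
g = repelem((1:3)',6);            % column index (factor value) of each element of y
aovVector = anova(g,y);           % the same design in vector form

% Compare the sums of squares in the two ANOVA tables.
sMatrix = stats(aovMatrix);
sVector = stats(aovVector);
maxDiff = max(abs(sMatrix.SumOfSquares - sVector.SumOfSquares));
disp(maxDiff)
```

The vector form is the one to reach for when the groups have different numbers of observations, because a matrix cannot represent that case without padding.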

    factors

    Factors and factor values for the ANOVA, specified as a numeric, logical, categorical, string, or character vector, or a cell array of vectors. Factors and factor values are sometimes called grouping variables and group names, respectively.

    For a one-way ANOVA, factors is a vector or cell array of character vectors in which each element represents the factor value of the observation in y at the same position. The anova function groups observations in y by their factor values during the ANOVA. The length of factors must be the same as the length of y.

    Example of the sample data input argument y and the factors input argument g. Each element in g represents a factor value of the corresponding element in y.

    For a two- or N-way ANOVA, factors is a cell array of vectors in which each cell corresponds to a different factor. Each vector contains the values of the corresponding factor and must have the same length as y. Factor values are associated with observations in y by their index.

    y  = [y1, y2, y3, y4, y5, ..., yN]
    g1 = {'A','A','C','B','B',...,'D'}
    g2 = [1, 2, 1, 3, 1, ..., 2]
    g3 = {'hi','mid','low','mid','hi',...,'low'}

    If factors contains NaN values, anova ignores the corresponding observations in y.

    For more information on factors, see Grouping Variables.

    Note

    If factors or tbl contains NaN values, <undefined> values, empty characters, or empty strings, the anova function ignores the corresponding observations in y. The ANOVA is balanced if each factor value has the same number of observations after the function disregards empty or NaN values. Otherwise, the function performs an unbalanced ANOVA.

    Example: [1,2,1,3,1,...,3,1]

    Example: ["white","red","white",...,"black","red"]

    Example: school=["Springfield","Springfield","Springfield","Arlington","Springfield","Arlington","Arlington"]; monthnumber=[6,12,1,9,4,6,2]; factors={school,monthnumber};

    Data Types: single | double | logical | categorical | char | string | cell

    tbl

    Factors, factor values, and response data, specified as a table. The variables of tbl can contain numeric, logical, categorical, character vector, or string elements, or cell arrays of character vectors. When you specify tbl, you must also specify the response data y, responseVarName, or formula.

    • If you specify the response data in y, the table variables represent only the factors for the ANOVA. A factor value in a variable of tbl corresponds to the observation in y at the same position. tbl must have the same number of rows as the length of y. If tbl contains NaN values, then anova ignores the corresponding observations in y.

    • If you do not specify y, you must indicate which variable in tbl contains the response data by using the responseVarName or formula input argument. You can also choose a subset of factors in tbl to use in the ANOVA by setting the name-value argument FactorNames. The anova function associates the values of the factor variables in tbl with the response data in the same row.

    Example: mountain=table(altitude,temperature,soilpH); anova(mountain,"soilpH")

    Data Types: table

    responseVarName

    Name of the response data, specified as a string scalar or character vector. responseVarName indicates which variable in tbl contains the response data. When you specify responseVarName, you must also specify the tbl input argument.

    Example: "r"

    Data Types: char | string

    formula

    ANOVA model, specified as a string scalar or a character vector in Wilkinson notation. anova supports the use of parentheses and commas to specify nested factors in formula. For example, you can specify that factor f1 is nested inside factor f2 by including the term f1(f2) in formula. To specify that f1 is nested inside two factors, f2 and f3, include the term f1(f2,f3). When you specify formula, you must also specify tbl.

    Example: "r ~ f1 + f2 + f3 + f1:f2:f3"

    Example: "MPG ~ Origin + Model(Origin)"

    Data Types: char | string

    Name-Value Arguments

    Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

    Example: anova(factors,y,CategoricalFactors=[1 2],FactorNames=["school" "major" "age"],ResponseName="GPA") specifies the first two factors in factors as categorical, the factor names as "school", "major", and "age", and the name of the response variable as "GPA".

    CategoricalFactors

    Factors to treat as categorical, specified as a numeric, logical, or string vector, or a cell array of character vectors. When CategoricalFactors is set to the default value "all", the anova function treats all factors as categorical.

    Specify CategoricalFactors as one of the following:

    • A numeric vector with indices between 1 and N, where N is the number of factor variables. The anova function treats factors with indices in CategoricalFactors as categorical. The index of a factor is the order in which it appears in the columns of matrix y, the cells of factors, or the columns of tbl.

    • A logical vector of length N, where a true entry means that the corresponding factor is categorical.

    • A string vector or cell array of factor names. The factor names must match the names in tbl or FactorNames.

    Example: CategoricalFactors=["Location" "Smoker"]

    Example: CategoricalFactors=[1 3 4]

    Data Types: single | double | logical | char | string | cell

    FactorNames

    Factor names, specified as a string vector or a cell array of character vectors.

    • If you specify tbl in the call to anova, FactorNames must be a subset of the table variables in tbl. anova uses only the factors specified in FactorNames. In this case, the default value of FactorNames is the collection of names of the factor variables in tbl.

    • If you specify the matrix y or factors in the call to anova, you can specify any names for FactorNames. In this case, the default value of FactorNames is ["Factor1","Factor2",…,"FactorN"], where N is the number of factors.

    When you specify formula, anova ignores FactorNames.

    Example: FactorNames=["time","latitude"]

    Data Types: char | string | cell

    ModelSpecification

    Type of ANOVA model to fit, specified as one of the options in the following table or an integer, string scalar, character vector, or terms matrix. The default value for ModelSpecification is "linear".

    Option                 Terms Included in ANOVA Model
    "linear" (default)     Main effect (linear) terms
    "interactions"         Main effect and pairwise interaction terms
    "purequadratic"        Main effect and squared main effect terms. All factors must be continuous to use this option. Set CategoricalFactors = [] to specify all factors as continuous.
    "quadratic"            Main effect, squared main effect, and pairwise interaction terms. All factors must be continuous to use this option.
    "polyIJK"              Polynomial terms up to degree I for the first factor, degree J for the second factor, and so on. The degree of an interaction term cannot exceed the maximum exponent of a main term. You must specify a degree for each factor.
    "full"                 Main effect and all interaction terms

    To include all main effects and interaction levels up to the kth level, set ModelSpecification equal to k. When ModelSpecification is an integer, the maximum level of an interaction term in the ANOVA model is the minimum between ModelSpecification and the number of factors.

    If you specify formula, anova ignores ModelSpecification.

    You can also specify the terms of an ANOVA regression model using one of the following:

    • Double or single terms matrix, T, with a column for each factor. Each term in the ANOVA model is a product corresponding to a row of T. The row elements are the exponents of their corresponding factors. For example, T(i,:) = [1 2 1] means that term i is (Factor1)(Factor2)^2(Factor3). Because the anova function automatically includes a constant term in the ANOVA model, you do not need to include a row of zeros in the terms matrix.

    • Character vector or string scalar formula in Wilkinson notation, representing one or more terms. anova supports the use of parentheses and commas to specify nested factors, as described in formula. The formula must use names contained in FactorNames, ResponseName, or table variable names if tbl is specified.
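As an illustration of the terms-matrix form, the sketch below builds a matrix for two main effects plus their interaction and compares the fit with the "interactions" option. It assumes the popcorn.mat sample data and the brand and popper-type labels used in the examples on this page; the variable names T, aovTerms, and aovNamed are hypothetical.

```matlab
% Sketch: specify main effects and a pairwise interaction with a terms matrix.
load popcorn.mat
y = popcorn(:);
brand = [repmat("Gourmet",6,1);repmat("National",6,1);repmat("Generic",6,1)];
poppertype = repmat([repmat("Air",3,1);repmat("Oil",3,1)],3,1);
factors = {brand,poppertype};
names = ["Brand" "PopperType"];

% One row per term, one column per factor; each entry is an exponent.
% Row 1 is Brand, row 2 is PopperType, row 3 is Brand:PopperType.
T = [1 0; 0 1; 1 1];

aovTerms = anova(factors,y,FactorNames=names,ModelSpecification=T);
aovNamed = anova(factors,y,FactorNames=names,ModelSpecification="interactions");

% With two factors, both specifications should describe the same model.
sTerms = stats(aovTerms);
sNamed = stats(aovNamed);
maxDiff = max(abs(sTerms.SumOfSquares - sNamed.SumOfSquares));
disp(maxDiff)
```

A terms matrix is mostly useful when the model is generated programmatically; for hand-written models, a Wilkinson formula is usually easier to read.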

    Example: ModelSpecification="poly3212"

    Example: ModelSpecification=3

    Example: ModelSpecification="r ~ c1*c2"

    Example: ModelSpecification=[0 0 0;1 0 0;0 1 0;0 0 1]

    Data Types: single | double | char | string

    RandomFactors

    Factors to treat as random rather than fixed, specified as a numeric, logical, or string vector, or a cell array of character vectors. The anova function treats an interaction term as random if it contains at least one random factor. The default value is [], meaning all factors are fixed. To specify all factors as random, set RandomFactors to "all".

    Specify RandomFactors as one of the following:

    • A numeric vector with indices between 1 and N, where N is the number of factor variables. The anova function treats factors with indices in RandomFactors as random. The index of a factor is the order in which it appears in the columns of matrix y, the cells of factors, or the columns of tbl.

    • A logical vector of length N, where a true entry means that the corresponding factor is random.

    • A string vector or cell array of factor names. The factor names must match the names in tbl or FactorNames.

    Example: RandomFactors=[1]

    Example: RandomFactors=[1 0 0]

    Data Types: single | double | logical | char | string | cell

    ResponseName

    Name of the response variable, specified as a string scalar or a character vector. If you specify responseVarName or formula, anova ignores ResponseName.

    Example: ResponseName="soilpH"

    Data Types: char | string

    SumOfSquaresType

    Type of sum of squares used to perform the ANOVA, specified as "three", "two", "one", or "hierarchical". For a model containing main effects but no interactions, the value of SumOfSquaresType influences the computations only when the data is unbalanced.

    The sum of squares of a term (SSTerm) is defined as the reduction in the sum of squares error (SSE) obtained by adding the term to a model that excludes it. The formula for the sum of squares of a term Term has the form

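    SS_{Term} = \underbrace{\sum_{i=1}^{n} \left( y_i - f_{excl}(g_1,\ldots,g_N) \right)^2}_{SSE_{f_{excl}}} - \underbrace{\sum_{i=1}^{n} \left( y_i - f_{incl}(g_1,\ldots,g_N) \right)^2}_{SSE_{f_{incl}}}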

    where n is the number of observations, yi are the response data, g1,...,gN are the factors used to perform the ANOVA, fexcl is a model that excludes Term, and fincl is a model that includes Term. Both fexcl and fincl are specified by SumOfSquaresType. The variables SSEfexcl and SSEfincl are the sum of squares errors for fexcl and fincl, respectively. You can specify fexcl and fincl using one of the options for SumOfSquaresType described in the following table.

    Option                 Type of Sum of Squares
    "three" (default)      fincl is the full ANOVA model specified in the property Formula. fexcl is a model composed of all terms in fincl except Term. The model fexcl has the same sigma-restricted coding as fincl. This type of sum of squares is known as Type III.
    "two"                  fexcl is a model composed of all terms in the ANOVA model specified in the property Formula that do not contain Term. If Term is a continuous term, then powers of Term are treated as separate terms that do not contain Term. fincl is a model composed of Term and all the terms in fexcl. This type of sum of squares is known as Type II.
    "one"                  fexcl is a model composed of all the terms that precede Term in the ANOVA model specified in the property Formula. fincl is a model composed of Term and all the terms in fexcl. This type of sum of squares is known as Type I.
    "hierarchical"         fexcl and fincl are defined as in Type II, except powers of Term are treated as terms that contain Term.

    Example: SumOfSquaresType="hierarchical"

    Data Types: char | string
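The definition of the sum of squares of a term as a reduction in SSE can be reproduced by fitting the excluding and including models separately. This is a sketch under stated assumptions: it uses the balanced popcorn.mat sample data with a main-effects model, a case in which the sum of squares types are expected to agree, and the names aovIncl, aovExcl, and ssBrand are illustrative.

```matlab
% Sketch: sum of squares of Brand as the drop in SSE when Brand is added.
load popcorn.mat
y = popcorn(:);
brand = [repmat("Gourmet",6,1);repmat("National",6,1);repmat("Generic",6,1)];
poppertype = repmat([repmat("Air",3,1);repmat("Oil",3,1)],3,1);

% f_incl: model that includes the term Brand.
aovIncl = anova({brand,poppertype},y,FactorNames=["Brand" "PopperType"]);

% f_excl: model that excludes the term Brand.
aovExcl = anova(poppertype,y,FactorNames="PopperType");

% SS_Term = SSE(f_excl) - SSE(f_incl), using the SSE entry of Metrics.
ssBrand = aovExcl.Metrics.SSE - aovIncl.Metrics.SSE;
disp(ssBrand)
```

The result can be compared with the Brand row of the two-way ANOVA table in the examples, which reports a sum of squares of 15.75.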

    Properties

    CategoricalFactors

    This property is read-only.

    Indices of categorical factors, specified as a numeric vector. This property is set by the CategoricalFactors name-value argument.

    Data Types: double

    Coefficients

    This property is read-only.

    Fitted ANOVA model coefficients, specified as a double vector. The anova function expands each categorical factor into F dummy variables, where F is the number of values for the factor. Each dummy variable is fit with a different coefficient during the ANOVA. Continuous factors have coefficients that are constant across factor values.

    For example, let y be a set of response data and factor1 be a continuous factor. Let factor2 be a categorical factor with values value1, value2, and value3. The formula "y ~ 1 + factor1 + factor2" expands to "y ~ 1 + factor1 + (factor2==value1) + (factor2==value2) + (factor2==value3)" and anova fits the expanded formula with coefficients.

    Data Types: single | double

    ExpandedFactorNames

    This property is read-only.

    Names of coefficients, specified as a string vector of names. The anova function expands each categorical factor into F dummy variables, where F is the number of values for the factor. The vector ExpandedFactorNames contains the name of each dummy variable. For more information, see Coefficients.

    Data Types: string

    FactorNames

    This property is read-only.

    Names of the factors used to fit the ANOVA model, specified as a string vector of names. This property is set by the tbl input argument or the FactorNames name-value argument.

    Data Types: string

    Factors

    This property is read-only.

    Names and values of the factors used to fit the ANOVA model, specified as a table. The names of the table variables are the factor names, and each variable contains the values of its corresponding factor. If the factors used to fit the model are not given as a table, anova converts them into a table with one column per factor.

    This property is set by one of the following:

    • tbl input argument

    • Matrix y input argument together with the FactorNames name-value argument

    • Vector y input argument together with the factors input argument and the FactorNames name-value argument

    Data Types: table

    Formula

    This property is read-only.

    ANOVA model, specified as a LinearFormulaWithNesting object. This property is set by the formula input argument or the ModelSpecification name-value argument.

    Metrics

    Model metrics, specified as a table. The table Metrics has these variables:

    • MSE: Mean squared error.

    • RMSE: Root mean squared error, which is the square root of MSE.

    • SSE: Sum of squares of the error.

    • SSR: Sum of squares regression.

    • SST: Total sum of squares.

    • RSquared: Coefficient of determination, also known as R^2.

    • AdjustedRSquared: R^2 value, adjusted for the number of coefficients. This value is given by the formula R_{adj}^2 = 1 - \frac{(n-1)\,SSE}{(n-p)\,SST}, where n is the number of observations and p is the number of coefficients. A higher value indicates a better fit for the ANOVA model.

    Data Types: table
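As a worked instance of the adjusted R^2 formula, the sketch below plugs in the values from the two-way popcorn ANOVA table in the examples (SSE = 1.75, SST = 22, 18 observations). The coefficient count p = 4 is an assumption: it counts the intercept, two independent Brand coefficients, and one independent PopperType coefficient, which is consistent with the 14 error degrees of freedom in that table.

```matlab
% Sketch: adjusted R^2 from the two-way popcorn ANOVA table values.
n   = 18;      % number of observations
p   = 4;       % assumed number of independent coefficients (n minus error DF)
SSE = 1.75;    % Error row, SumOfSquares
SST = 22;      % Total row, SumOfSquares

R2    = 1 - SSE/SST;                       % ordinary R^2
R2adj = 1 - ((n-1)*SSE)/((n-p)*SST);       % adjusted for the number of coefficients

fprintf("R^2 = %.4f, adjusted R^2 = %.4f\n",R2,R2adj)
```

The adjusted value is always at most the ordinary R^2, and the gap widens as p grows relative to n.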

    NumObservations

    This property is read-only.

    Number of observations used to fit the ANOVA model, specified as a positive integer.

    Data Types: double

    RandomFactors

    This property is read-only.

    Indices of random factors, specified as a numeric vector. This property is set by the RandomFactors name-value argument.

    Data Types: double

    Residuals

    This property is read-only.

    Residual values, specified as an n-by-2 table, where n is the number of observations. Residuals has two variables:

    • Raw contains the observed minus fitted values.

    • Pearson contains the raw residuals divided by the root mean squared error (RMSE).

    Data Types: table
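The relationship between the two residual variables can be checked numerically. The following sketch assumes the popcorn.mat sample data and relies only on the Residuals and Metrics properties as they are described on this page; the name pearsonCheck is illustrative.

```matlab
% Sketch: Pearson residuals are the raw residuals scaled by the RMSE.
load popcorn.mat
aov = anova(popcorn);

res  = aov.Residuals;            % table with the variables Raw and Pearson
rmse = aov.Metrics.RMSE;         % root mean squared error of the fit

pearsonCheck = res.Raw ./ rmse;  % recompute the Pearson residuals by hand
maxDiff = max(abs(res.Pearson - pearsonCheck));
disp(maxDiff)
```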

    SumOfSquaresType

    This property is read-only.

    Type of sum of squares used when fitting the ANOVA model, specified as "three", "two", "one", or "hierarchical". This property is set by the SumOfSquaresType name-value argument.

    Data Types: string

    ResponseName

    This property is read-only.

    Name of the response variable, specified as a string scalar or character vector. This property is set by the responseVarName input argument or the ResponseName name-value argument.

    Data Types: char | string

    Y

    This property is read-only.

    Response data used to fit the ANOVA model, specified as a numeric vector. This property is set by the y input argument, or the tbl input argument together with the responseVarName input argument.

    Data Types: single | double

    Object Functions

    boxchart             Box chart (box plot) for analysis of variance (ANOVA)
    groupmeans           Mean response estimates for analysis of variance (ANOVA)
    multcompare          Multiple comparison of means for analysis of variance (ANOVA)
    plotComparisons      Interactive plot of multiple comparisons of means for analysis of variance (ANOVA)
    stats                Analysis of variance (ANOVA) table
    varianceComponent    Variance component estimates for analysis of variance (ANOVA)

    Examples

    Load popcorn yield data.

    load popcorn.mat

    The columns of the 6-by-3 matrix popcorn contain popcorn yield observations in cups for three different brands. Perform a one-way ANOVA to test the null hypothesis that the popcorn yield is not affected by the brand of popcorn.

    aov = anova(popcorn)
    aov = 
    1-way anova, constrained (Type III) sums of squares.
    
    Y ~ 1 + Factor1
    
                   SumOfSquares    DF    MeanSquares     F        pValue  
                   ____________    __    ___________    ____    __________
    
        Factor1       15.75         2        7.875      18.9    7.9603e-05
        Error          6.25        15      0.41667                        
        Total            22        17                                     
    
    
      Properties, Methods
    
    
    

    aov is an anova object that contains the results of the one-way ANOVA.

    The Factor1 row of the ANOVA table shows statistics for the model term Factor1, and the Error row shows statistics for the residual variation that the model does not explain. The sum of squares and the degrees of freedom are given in the SumOfSquares and DF columns, respectively. The Total degrees of freedom is the total number of observations minus one, which is 18 – 1 = 17. The Factor1 degrees of freedom is the number of factor values minus one, which is 3 – 1 = 2. The Error degrees of freedom is the total degrees of freedom minus the Factor1 degrees of freedom, which is 17 – 2 = 15.

    The mean squares, given in the MeanSquares column, are calculated with the formula SumOfSquares/DF. The F-statistic is the ratio of the mean squares, which is 7.875/0.41667 = 18.9. Under the null hypothesis, the F-statistic follows an F-distribution with degrees of freedom 2 and 15. The p-value is the probability of observing an F-statistic at least as large as 18.9 under that distribution, and is calculated using its cumulative distribution function (cdf). The p-value for the F-statistic is small enough that the null hypothesis can be rejected at the 0.01 significance level. Therefore, the brand of popcorn has a significant effect on the popcorn yield.
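The table entries can be reproduced by hand from the sums of squares and degrees of freedom. The sketch below is illustrative rather than a description of how anova computes its output; it uses the Statistics and Machine Learning Toolbox function fcdf with the 'upper' option to evaluate the upper tail of the F-distribution.

```matlab
% Sketch: recompute the mean squares, F-statistic, and p-value
% from the one-way popcorn ANOVA table.
ssFactor = 15.75;   dfFactor = 2;     % Factor1 row
ssError  = 6.25;    dfError  = 15;    % Error row

msFactor = ssFactor/dfFactor;         % mean square for Factor1
msError  = ssError/dfError;           % mean square for Error
F = msFactor/msError;                 % ratio of the mean squares

% Upper-tail probability of the F(2,15) distribution at F.
p = fcdf(F,dfFactor,dfError,'upper');

fprintf("F = %.4g, p = %.5g\n",F,p)
```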

    Load popcorn yield data.

    load popcorn.mat

    The columns of the 6-by-3 matrix popcorn contain popcorn yield observations in cups for the brands Gourmet, National, and Generic. The first three rows of the matrix correspond to popcorn that was popped with an oil popper, and the last three rows correspond to popcorn that was popped with an air popper.

    Create string vectors containing factor values for the brand and popper type. Use the function repmat to repeat copies of strings.

    brand = [repmat("Gourmet",6,1);repmat("National",6,1);repmat("Generic",6,1)];
    poppertype = [repmat("Air",3,1);repmat("Oil",3,1);repmat("Air",3,1);repmat("Oil",3,1);repmat("Air",3,1);repmat("Oil",3,1)];
    factors = {brand,poppertype};

    Perform a two-way ANOVA to test the null hypothesis that the popcorn yield is not affected by the brand of popcorn or the type of popper.

    aov = anova(factors,popcorn(:),FactorNames=["Brand" "PopperType"])
    aov = 
    2-way anova, constrained (Type III) sums of squares.
    
    Y ~ 1 + Brand + PopperType
    
                      SumOfSquares    DF    MeanSquares     F       pValue  
                      ____________    __    ___________    ___    __________
    
        Brand            15.75         2       7.875        63         1e-07
        PopperType         4.5         1         4.5        36    3.2548e-05
        Error             1.75        14       0.125                        
        Total               22        17                                    
    
    
      Properties, Methods
    
    
    

    aov is an anova object containing the results of the two-way ANOVA. The small p-values indicate that both the brand and popper type have a statistically significant effect on the popcorn yield.

    Compute the mean response estimates to see which brand and popper type produce the most popcorn.

    groupmeans(aov,["Brand" "PopperType"])
    ans=6×6 table
          Brand       PopperType    Mean      SE       MeanLower    MeanUpper
        __________    __________    ____    _______    _________    _________
    
        "Gourmet"       "Air"       5.75    0.16667     5.0329       6.4671  
        "National"      "Air"       4.25    0.16667     3.5329       4.9671  
        "Generic"       "Air"        3.5    0.16667     2.7829       4.2171  
        "Gourmet"       "Oil"       6.75    0.16667     6.0329       7.4671  
        "National"      "Oil"       5.25    0.16667     4.5329       5.9671  
        "Generic"       "Oil"        4.5    0.16667     3.7829       5.2171  
    
    

    The table shows the mean response estimates with their standard error and 95% confidence bounds. The mean response estimates indicate that the Gourmet brand popped in an oil popper yields the most popcorn.

    Load the patient sample data.

    load patients.mat

    Create a table of factors from the Age and Smoker variables.

    tbl = table(Age,Smoker,VariableNames=["Age" "SmokingStatus"]);

    The factor SmokingStatus is a randomly sampled categorical factor, and Age is a continuous factor. Perform a two-way ANOVA to test the null hypothesis that systolic blood pressure is not affected by age or smoking status.

    aov = anova(tbl,Systolic,CategoricalFactors=2,RandomFactors=2)
    aov = 
    2-way anova, constrained (Type III) sums of squares.
    
    Y ~ 1 + Age + SmokingStatus
    
                         SumOfSquares    DF    MeanSquares      F         pValue  
                         ____________    __    ___________    ______    __________
    
        Age                 37.562        1      37.562       1.6577       0.20098
        SmokingStatus       2182.9        1      2182.9       96.337    3.3613e-16
        Error                 2198       97      22.659                           
        Total               4461.2       99                                       
    
    
      Properties, Methods
    
    
    

    aov is an anova object that contains the results of the two-way ANOVA. The p-value for Age is larger than 0.05, so at the 0.05 significance level there is not enough evidence to reject the null hypothesis that age has no effect on systolic blood pressure. SmokingStatus has a p-value smaller than 0.05, indicating that smoking status has a statistically significant effect on systolic blood pressure.

    To investigate whether the variability of the random factor SmokingStatus has an effect on the SmokingStatus mean square, use the object functions varianceComponent and stats.

    v = varianceComponent(aov)
    v=2×3 table
                         VarianceComponent    VarianceComponentLower    VarianceComponentUpper
                         _________________    ______________________    ______________________
    
        SmokingStatus          48.31                  9.0308                    49707         
        Error                 22.659                  17.425                    30.68         
    
    
    [~,ems] = stats(aov)
    ems=3×5 table
                           Type              ExpectedMeanSquares            MeanSquaresDenominator    DFDenominator    FDenominator
                         ________    ___________________________________    ______________________    _____________    ____________
    
        Age              "fixed"     "5135.47*Q(Age)+V(Error)"                      22.659                  97          MS(Error)  
        SmokingStatus    "random"    "44.7172*V(SmokingStatus)+V(Error)"            22.659                  97          MS(Error)  
        Error            "random"    "V(Error)"                                                                                    
    
    

    Inserting the VarianceComponent values into the SmokingStatus formula for ExpectedMeanSquares gives 44.7172*48.3098+22.6594 = 2.1829e+03, which matches the SmokingStatus mean square in the ANOVA table. To see how much the variance component of SmokingStatus affects the expected mean squares, divide the SmokingStatus term of ExpectedMeanSquares by ExpectedMeanSquares to get 44.7172*48.3098/2.1829e+03 = 0.9896. This calculation shows that the SmokingStatus variance component accounts for almost 99% of the SmokingStatus expected mean squares.
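The arithmetic in this paragraph can be written out as a short script. The numbers are copied from the varianceComponent and stats output above; the variable names are illustrative.

```matlab
% Sketch: expected mean squares for SmokingStatus and the share
% contributed by its variance component.
k        = 44.7172;    % coefficient of V(SmokingStatus) in ExpectedMeanSquares
vSmoking = 48.3098;    % variance component of SmokingStatus
vError   = 22.6594;    % variance component of Error

emsSmoking = k*vSmoking + vError;        % expected mean squares
fraction   = (k*vSmoking)/emsSmoking;    % share due to SmokingStatus

fprintf("EMS = %.1f, fraction = %.4f\n",emsSmoking,fraction)
```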

    Load data of the results for five exams taken by 120 students.

    load examgrades.mat

    Create a table with variables for the math, biology, history, literature, and multisubject comprehensive exams.

    subject = ["math" "biology" "history" "literature" "comprehensive"];
    grades = table(grades(:,1),grades(:,2),grades(:,3),grades(:,4),grades(:,5),VariableNames=subject)
    grades=120×5 table
        math    biology    history    literature    comprehensive
        ____    _______    _______    __________    _____________
    
         65       77         69           75             69      
         61       74         70           66             68      
         81       80         71           74             79      
         88       76         80           88             79      
         69       77         74           69             76      
         89       93         78           77             80      
         55       64         60           50             63      
         84       83         80           77             78      
         86       75         81           87             79      
         84       82         86           92             85      
         71       70         73           81             79      
         81       88         80           79             83      
         84       78         80           74             80      
         81       77         81           83             79      
         78       66         90           84             75      
         67       74         73           76             72      
          ⋮
    
    

    Perform a four-way ANOVA for the continuous factors math, biology, history, and literature, and the response data comprehensive.

    aov = anova(grades,"comprehensive",CategoricalFactors = [])
    aov = 
    N-way anova, constrained (Type III) sums of squares.
    
    comprehensive ~ 1 + math + biology + history + literature
    
                      SumOfSquares    DF     MeanSquares      F         pValue  
                      ____________    ___    ___________    ______    __________
    
        math             58.973         1      58.973       6.1964      0.014231
        biology          100.35         1      100.35       10.544     0.0015275
        history          243.89         1      243.89       25.626    1.5901e-06
        literature       152.22         1      152.22       15.994    0.00011269
        Error            1094.5       115      9.5173                           
        Total              3291       119                                       
    
    
      Properties, Methods
    
    
    

    aov is an anova object that contains the results of the four-way ANOVA. The p-values of all the factors are smaller than 0.05, indicating that each subject exam can be used to predict a student's grade on the comprehensive exam. Display the estimated coefficients of the ANOVA model.

    coef = aov.Coefficients
    coef = 5×1
    
       21.9901
        0.0997
        0.1805
        0.2563
        0.1701
    
    

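    The first coefficient is the intercept, and the remaining coefficients correspond to math, biology, history, and literature, in the order of the terms in the model formula. Among the four factors, the coefficient for the history exam is the largest; therefore, a one-point increase in the history score produces the largest increase in the predicted value of comprehensive.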
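    As a rough illustration of how the coefficients combine, the following sketch computes the predicted comprehensive score for a hypothetical student who scores 80 on every subject exam. It assumes the coefficients are ordered as intercept, math, biology, history, literature (the order of the terms in the model formula) and uses the rounded values from the display, so the result is approximate.

```matlab
% Rounded coefficient estimates copied from the displayed output.
% Assumed order: intercept, math, biology, history, literature.
coef = [21.9901; 0.0997; 0.1805; 0.2563; 0.1701];

% Hypothetical subject exam scores (not taken from the data set).
scores = [80 80 80 80];

% Linear prediction: intercept plus the weighted sum of the scores.
predicted = coef(1) + scores*coef(2:end);
fprintf("Predicted comprehensive score: %.2f\n",predicted)
```

    With these rounded values, the prediction is approximately 78.52.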

    Load popcorn yield data.

    load popcorn.mat

    The columns of the 6-by-3 matrix popcorn contain popcorn yield observations for the brands Gourmet, National, and Generic. The first three rows of the matrix correspond to popcorn that was popped with an oil popper, and the last three rows correspond to popcorn that was popped with an air popper.

    Create a table containing variables representing the brand, popper type, and popcorn yield by using the repmat and table functions.

    brand = [repmat("Gourmet",6,1);repmat("National",6,1);repmat("Generic",6,1)];
    poppertype = [repmat("oil",3,1);repmat("air",3,1);repmat("oil",3,1);repmat("air",3,1);repmat("oil",3,1);repmat("air",3,1)];
    tbl = table(brand,poppertype,popcorn(:),VariableNames=["Brand" "PopperType" "PopcornYield"]);
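    The expression popcorn(:) stacks the columns of the matrix one after another, which is why the brand labels are repeated in blocks of six. A small sketch with a hypothetical 6-by-3 matrix M (standing in for popcorn) shows the ordering:

```matlab
% Hypothetical stand-in for the 6-by-3 popcorn matrix.
M = reshape(1:18,6,3);

% The colon operator stacks columns: column 1, then column 2, then column 3.
v = M(:);

% The first six stacked elements are the first column (the first brand),
% and the last six are the third column (the third brand).
firstBlockIsColumn1 = isequal(v(1:6),M(:,1));
lastBlockIsColumn3 = isequal(v(13:18),M(:,3));
```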

    Perform a two-way ANOVA to test the null hypothesis that the popcorn yield is the same across the three brands and the two popper types. Specify the ANOVA model formula using Wilkinson notation.

    aovLinear = anova(tbl,"PopcornYield ~ Brand + PopperType")
    aovLinear = 
    2-way anova, constrained (Type III) sums of squares.
    
    PopcornYield ~ 1 + Brand + PopperType
    
                      SumOfSquares    DF    MeanSquares     F       pValue  
                      ____________    __    ___________    ___    __________
    
        Brand            15.75         2       7.875        63         1e-07
        PopperType         4.5         1         4.5        36    3.2548e-05
        Error             1.75        14       0.125                        
        Total               22        17                                    
    
    
      Properties, Methods
    
    
    

    aovLinear is an anova object that contains the results of the two-way ANOVA. The ANOVA model for aovLinear is linear and does not include an interaction term. The small p-values indicate that both the brand and popper type have a significant effect on the popcorn yield.

    To investigate whether the interaction between the brand and popper type has a significant effect on the popcorn yield, perform a two-way ANOVA with a model that contains the interaction term Brand:PopperType.

    aovInteraction = anova(tbl,"PopcornYield ~ Brand + PopperType + Brand:PopperType")
    aovInteraction = 
    2-way anova, constrained (Type III) sums of squares.
    
    PopcornYield ~ 1 + Brand*PopperType
    
                            SumOfSquares    DF    MeanSquares     F        pValue  
                            ____________    __    ___________    ____    __________
    
        Brand                    15.75       2        7.875      56.7     7.679e-07
        PopperType                 4.5       1          4.5      32.4    0.00010037
        Brand:PopperType      0.083333       2     0.041667       0.3       0.74622
        Error                   1.6667      12      0.13889                        
        Total                       22      17                                     
    
    
      Properties, Methods
    
    
    

    The ANOVA model for the anova object aovInteraction includes the interaction term Brand:PopperType. The p-value for the Brand:PopperType term is larger than 0.05. Therefore, not enough evidence exists to conclude that the brand and popper type have an interaction effect on the popcorn yield.
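    The model line in the display, PopcornYield ~ 1 + Brand*PopperType, uses the Wilkinson shorthand in which A*B stands for A + B + A:B. As a sketch, and assuming the popcorn.mat sample data is available, the same model can be requested with the shorthand directly. The names aovStar and sseStar are introduced here only for illustration.

```matlab
load popcorn.mat

% Rebuild the table of brand, popper type, and yield.
% Per the description of the data, rows 1-3 are oil and rows 4-6 are air.
brand = [repmat("Gourmet",6,1);repmat("National",6,1);repmat("Generic",6,1)];
poppertype = repmat([repmat("oil",3,1);repmat("air",3,1)],3,1);
tbl = table(brand,poppertype,popcorn(:), ...
    VariableNames=["Brand" "PopperType" "PopcornYield"]);

% Brand*PopperType expands to Brand + PopperType + Brand:PopperType.
aovStar = anova(tbl,"PopcornYield ~ Brand*PopperType");

% The error sum of squares should match the interaction model (about 1.6667).
sseStar = aovStar.Metrics.SSE;
```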

    The Metrics property of an anova object provides statistics about the fit of the ANOVA model. To determine which model is a better fit for the response data, display the Metrics property of aovLinear and aovInteraction.

    aovLinear.Metrics
    ans=1×7 table
         MSE      RMSE      SSE      SSR     SST    RSquared    AdjustedRSquared
        _____    _______    ____    _____    ___    ________    ________________
    
        0.125    0.35355    1.75    20.25    22     0.92045         0.88731     
    
    
    aovInteraction.Metrics
    ans=1×7 table
          MSE       RMSE       SSE       SSR      SST    RSquared    AdjustedRSquared
        _______    _______    ______    ______    ___    ________    ________________
    
        0.13889    0.37268    1.6667    20.333    22     0.92424         0.78535     
    
    

    The metrics tables show that the mean squared error (MSE) is slightly smaller for the linear model than for the interaction model. The adjusted R-squared value is higher for the linear model. Together, these metrics suggest that the linear model is a better fit for the popcorn data than the interaction model.
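    Several of the metrics follow directly from the sums of squares and the error degrees of freedom in the ANOVA table. The sketch below recomputes MSE, RMSE, SSR, and R-squared for the linear model from the displayed values. It does not attempt to reproduce the AdjustedRSquared column, whose degrees-of-freedom convention is not described here.

```matlab
% Values copied from the aovLinear ANOVA table and Metrics display.
SSE = 1.75;      % error sum of squares
SST = 22;        % total sum of squares
dfError = 14;    % error degrees of freedom

MSE = SSE/dfError;         % mean squared error
RMSE = sqrt(MSE);          % root mean squared error
SSR = SST - SSE;           % regression sum of squares
RSquared = 1 - SSE/SST;    % proportion of variation explained by the model

fprintf("MSE = %.3f, RMSE = %.5f, SSR = %.2f, R^2 = %.5f\n", ...
    MSE,RMSE,SSR,RSquared)
```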

    Load the sample car data.

    load carbig.mat

    The variable Model contains data for the car model, and the variable Origin contains data for the country in which the car is manufactured. Convert Model and Origin from character arrays with trailing whitespace to string vectors.

    Model = strtrim(string(Model));
    Origin = strtrim(string(Origin));

    The variable MPG contains mileage data for the cars. Create a table containing data for the model, country of origin, and mileage of the cars manufactured in Japan and the United States.

    idxJapanUSA = (Origin=="Japan"|Origin=="USA");
    tbl = table(Model(idxJapanUSA),Origin(idxJapanUSA),MPG(idxJapanUSA),VariableNames=["Origin" "Model" "MPG"]);

    Japan and the United States each manufacture a unique set of models. Therefore, the factor Model is nested in the factor Origin. Perform a two-way, nested ANOVA to test the null hypothesis that the car mileage is the same between the models and countries of origin.

    aov = anova(tbl,"MPG ~ Origin + Model(Origin)")
    aov = 
    2-way anova, constrained (Type III) sums of squares.
    
    MPG ~ 1 + Model(Origin) + Origin
    
                         SumOfSquares    DF     MeanSquares      F         pValue  
                         ____________    ___    ___________    ______    __________
    
        Model(Origin)            0         0           0            0           NaN
        Origin               18873       244      77.347       10.138    3.0582e-25
        Error               633.26        83      7.6296                           
        Total                19506       327                                       
    
    
      Properties, Methods
    
    
    

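    The small p-value for the Origin term indicates that the null hypothesis for that term can be rejected at the 1% significance level. The Model(Origin) term, however, has zero degrees of freedom and a NaN p-value, so the table provides no test for it. The 244 degrees of freedom in the Origin row show that the table variable named Origin holds the car model data: the table call assigns the names "Origin" and "Model" to the Model and Origin data, respectively. To test the intended nesting of models within countries of origin, swap the first two names in VariableNames and rerun the analysis.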
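    Before fitting a nested model, it can help to check which variable holds which data and whether each level of the nested factor occurs under only one level of the nesting factor. The following sketch works on the raw carbig.mat variables rather than on tbl. The names models, origins, nOriginsPerModel, and isNested are introduced here for illustration.

```matlab
load carbig.mat
Model = strtrim(string(Model));
Origin = strtrim(string(Origin));

% Keep cars from Japan and the USA that have a recorded mileage.
idx = (Origin=="Japan" | Origin=="USA") & ~isnan(MPG);
models = Model(idx);
origins = Origin(idx);

% Count the distinct levels of each factor.
nOrigins = numel(unique(origins));
nModels = numel(unique(models));

% For each model, count the number of distinct origins it appears under.
[~,~,modelIdx] = unique(models);
[~,~,originIdx] = unique(origins);
nOriginsPerModel = accumarray(modelIdx,originIdx,[],@(v) numel(unique(v)));

% Model is nested in Origin if every model has exactly one origin.
isNested = all(nOriginsPerModel == 1);
fprintf("%d origins, %d models, nested = %d\n",nOrigins,nModels,isNested)
```

    A factor with only two levels, such as the country of origin here, contributes a single degree of freedom, so a large DF value in its row of the ANOVA table is a sign that the variable names and data are mismatched.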

    Algorithms

    ANOVA partitions the total variation in the response data into two components:

    • Variation in the relationship between the factor data and the response data, as described by the ANOVA model. This variation is known as the sum of squares regression (SSR). The SSR is represented by the equation $\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2$, where $n$ is the number of observations in the sample, $\hat{y}_i$ is the predicted value of observation $i$, and $\bar{y}$ is the sample mean.

    • Variation in the data due to the ANOVA model error term, known as the sum of squares error (SSE). The SSE is represented by the equation $\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$, where $y_i$ is the value of observation $i$.

    With the above partitioning, the total sum of squares (SST) is represented by

    $$\underbrace{\sum_{i=1}^{n}(y_i-\bar{y})^2}_{\text{SST}} = \underbrace{\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2}_{\text{SSR}} + \underbrace{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}_{\text{SSE}}$$
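    The partition of the total sum of squares can be checked numerically. The sketch below uses a small hypothetical one-way data set (three groups of three observations) and takes the group means as the predicted values, which is what a one-way ANOVA model with a single categorical factor predicts. All data and variable names are illustrative.

```matlab
% Hypothetical response data and group labels (three groups of three).
y = [4 5 6 7 8 9 1 2 3]';
g = [1 1 1 2 2 2 3 3 3]';

% Predicted value of each observation is the mean of its group.
groupMeans = accumarray(g,y,[],@mean);
yhat = groupMeans(g);
ybar = mean(y);

SST = sum((y - ybar).^2);      % total sum of squares
SSR = sum((yhat - ybar).^2);   % sum of squares regression
SSE = sum((y - yhat).^2);      % sum of squares error

fprintf("SST = %g, SSR = %g, SSE = %g\n",SST,SSR,SSE)
```

    For this data, SST is 60, SSR is 54, and SSE is 6, so SST equals SSR + SSE.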

    The anova function calculates the sum of squares of a term (SSTerm) in the ANOVA model by measuring the reduction in the SSE when the term is added to a comparison model. The comparison model is given by aov.SumOfSquaresType (see SumOfSquaresType for more information).

    ANOVA uses SSE and SSTerm to perform an F-test. For categorical main effects, the null hypothesis is that the term's coefficient is the same across all groups. For continuous and interaction terms, the null hypothesis is that the term's coefficient is zero. A zero coefficient means that the value of the term does not have an effect on the response data. The F-statistic is calculated as

    $$F = \frac{\text{SSTerm}/\text{dfTerm}}{\text{SSE}/\text{dfError}} = \frac{\text{MSTerm}}{\text{MSError}}$$

    In the above formula, dfTerm is the degrees of freedom of a term, dfError is the degrees of freedom of the error, and MSTerm and MSError are the mean squares of the term and error, respectively.
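    As a worked check of the F-statistic, the following sketch recomputes F and the p-value for the Brand term of the aovLinear popcorn example from the values displayed in its ANOVA table. It assumes the fcdf function from Statistics and Machine Learning Toolbox is available for the upper-tail probability of the F-distribution.

```matlab
% Values copied from the aovLinear ANOVA table (Brand and Error rows).
ssTerm = 15.75;   dfTerm = 2;
ssError = 1.75;   dfError = 14;

% Mean squares and their ratio.
msTerm = ssTerm/dfTerm;
msError = ssError/dfError;
F = msTerm/msError;

% Probability that an F(dfTerm,dfError) variable exceeds the computed F.
p = fcdf(F,dfTerm,dfError,'upper');
fprintf("F = %g, p = %.4g\n",F,p)
```

    The results, F = 63 and p = 1e-07, agree with the Brand row of the table.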

    The anova function displays a component ANOVA table with rows for the model terms and error. The columns of the ANOVA table are described as follows:

    Column          Definition
    ____________    __________________________________________________________

    SumOfSquares    Sum of squares
    DF              Degrees of freedom
    MeanSquares     Mean squares, which is the ratio SumOfSquares/DF
    F               F-statistic, which is the ratio of the source mean square to the error mean square
    pValue          p-value, which is the probability that the F-statistic, as computed under the null hypothesis, can take a value larger than the computed test-statistic value. anova derives this probability from the cdf of the F-distribution.

    References

    [1] Wackerly, D. D., W. Mendenhall, III, and R. L. Scheaffer. Mathematical Statistics with Applications, 7th ed. Belmont, CA: Brooks/Cole, 2008.

    [2] Dunn, O. J., and V. A. Clark. Applied Statistics: Analysis of Variance and Regression. Hoboken, NJ: John Wiley & Sons, Inc., 1974.

    Version History

    Introduced in R2022b