Fixed Effects Design Matrix Must be of full column rank with multiple categorical predictors

I am probably doing something very dumb, however I cannot figure out my mistake.
I am trying to regress out some predictors from a data set -- I have two categorical predictors, A1 and A2 in a table, something like this:
It seems obvious to me that A1 and A2 are linearly independent. They are also linearly independent from the intercept, which I believe should be a categorical variable that looks like ones(1,11)? But regardless, I want the global mean to not be removed from everything, so I don't include an intercept in the model.
Then, if I run something like this:
lme = fitlme(tbl, 'values ~ A1 + A2 - 1', 'DummyVarCoding', 'full') % tbl: the table holding values, A1 and A2
I always get the same error :
Error using classreg.regr.lmeutils.StandardLinearLikeMixedModel/validateInputs (line 229)
Fixed Effects design matrix X must be of full column rank.
I don't understand why this is happening -- and probably this shows that I have a pretty big misunderstanding of what the dummy variables actually are.
However, if I run two fitlme's -- one on the subset A1==1 and one on A1==0, they both work, which just super confuses me.

Answers (1)

Ive J on 29 Jan 2022
The error is self-explanatory: the cause is the full dummy-variable coding you're using (why?). See https://mathworks.com/help/stats/dummy-indicator-variables.html
Note that the error has nothing to do with mixed-model design. Consider this example:
n = 100; % sample size
tab = table(randn(n,1), categorical(randi([0 1], n, 1)), ...
categorical(randi([0, 1], n, 1)),...
'VariableNames', {'value', 'A1', 'A2'});
mdl1 = fitlm(tab, 'value ~ A1 + A2 - 1', 'DummyVarCoding', 'full') % design matrix is rank deficient
Warning: Regression design matrix is rank deficient to within machine precision.
mdl1 =
Linear regression model:
    value ~ A1 + A2

Estimated Coefficients:
            Estimate       SE        tStat      pValue
            _________    _______    ________    _______
    A1_0     -0.20234    0.20399    -0.99191    0.32373
    A1_1            0          0         NaN        NaN
    A2_0    -0.045804    0.17202    -0.26627     0.7906
    A2_1     0.097693    0.18145     0.53839    0.59155

Number of observations: 100, Error degrees of freedom: 97
Root Mean Squared Error: 1.02
R-squared: 0.0145, Adjusted R-Squared: -0.00585
F-statistic vs. constant model: 0.712, p-value = 0.493
So, what happened? Let's construct the design matrix:
X = [dummyvar(tab.A1), dummyvar(tab.A2)]; % DummyVarCoding -> full
disp(rank(X)) % 3 < size(X, 2) --> 3 < 4 --> rank deficient
3
% what about when considering them alone?
disp(rank(X(:, 1:2))) % full rank
2
disp(rank(X(:, 3:4))) % full rank
2
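To spell out why the two blocks are fine alone but not together: under full coding, each predictor's indicator columns add up to a column of ones, so the sum of the A1 columns equals the sum of the A2 columns. A minimal sketch of that check, assuming the same kind of random two-level data as above (the fixed seed and the variable names s1, s2 and c are only for illustration):

```matlab
rng(0);                                  % fixed seed, chosen only for reproducibility
n  = 100;
A1 = categorical(randi([0 1], n, 1));
A2 = categorical(randi([0 1], n, 1));
X  = [dummyvar(A1), dummyvar(A2)];       % full coding: one indicator column per level

% Every observation belongs to exactly one level of each predictor,
% so each block of indicator columns sums to a column of ones.
s1 = sum(X(:, 1:2), 2);
s2 = sum(X(:, 3:4), 2);
disp(isequal(s1, s2, ones(n, 1)))        % 1

% Hence (A1_0 + A1_1) - (A2_0 + A2_1) = 0: an exact linear dependency.
c = [1; 1; -1; -1];
disp(norm(X * c))                        % 0
```

This is the same dependency that an intercept would create with a single fully coded predictor; dropping the intercept removes only one of the two redundant columns of ones.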
We can approximately find the problematic variable:
[~, R] = qr(X, 0);
find(abs(diag(R)) < 1e-6)
ans = 4
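Note that column 4 is flagged only because it comes last: the dependency involves all four columns, and a different column order would flag a different one. As a rough cross-check (a sketch, not part of the original answer), rref can report which columns form an independent set and how the remaining one is built from them. The seed and the names pivotCols, freeCols and w are assumptions for illustration:

```matlab
rng(0);                                       % fixed seed, chosen only for reproducibility
n  = 100;
A1 = categorical(randi([0 1], n, 1));
A2 = categorical(randi([0 1], n, 1));
X  = [dummyvar(A1), dummyvar(A2)];            % full coding for both predictors

[R, pivotCols] = rref(X);                     % pivotCols: a maximal set of independent columns
freeCols = setdiff(1:size(X, 2), pivotCols);  % columns that depend on the pivot columns
w = R(1:numel(pivotCols), freeCols);          % weights expressing freeCols via pivotCols

% The free column is reproduced (up to round-off) by the pivot columns.
residual = norm(X(:, freeCols) - X(:, pivotCols) * w);
disp(freeCols)
disp(w.')
disp(residual)
```

The nonzero entries of w show which of the other columns take part in the dependency, which is more informative than a single "problematic" index.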
Therefore, don't set 'DummyVarCoding' to 'full' in such cases. The default, 'reference', drops one level per predictor and avoids the dependency.
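For comparison, here is a minimal sketch of the same kind of fit with the default coding, assuming the random example table from above (the fixed seed and the names mdl2, D, E and Xref are illustrative). The hand-built matrix Xref is only meant to mirror what reference coding produces: an intercept plus one indicator per non-reference level.

```matlab
rng(0);                                        % fixed seed, chosen only for reproducibility
n   = 100;
tab = table(randn(n, 1), categorical(randi([0 1], n, 1)), ...
            categorical(randi([0 1], n, 1)), ...
            'VariableNames', {'value', 'A1', 'A2'});

mdl2 = fitlm(tab, 'value ~ A1 + A2');          % default coding, intercept kept

% By-hand equivalent: drop the first (reference) level of each predictor.
D = dummyvar(tab.A1);
E = dummyvar(tab.A2);
Xref = [ones(n, 1), D(:, 2), E(:, 2)];

disp(rank(Xref))                               % 3, equal to the number of columns
disp(mdl2.NumEstimatedCoefficients)            % 3, no coefficient is dropped
```

With this coding the intercept is the fitted value for the reference cell (A1 = 0, A2 = 0), and the other coefficients are offsets from it, so it does not by itself represent the global mean.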
  1 Comment
Laurie König on 28 Nov 2024
Hi there, may I ask a follow-up question? I am running into a similar problem. I also have two categorical predictors, but each has three groups (0, 1, 2). I have included them as categorical variables in the formula, which leads to reference coding. My variables are called word_cat and attribute.
When I run the regression model, I see the following output. Could you give me a hint as to why the parameters can be estimated for some interaction terms but not for others, even though all 3 groups are present in the data and the two predictors are not correlated?
                              Estimate    SE        tStat      pValue
    word_cat_1                -31.78      3.6778    -8.6411    3.0585e-17
    word_cat_2                -15.24      3.6778    -4.1438    3.7843e-05
    attribute_1               -28.71      3.6778    -7.8063    1.866e-14
    attribute_2                 1.49      3.1851    0.46781    0.64005
    word_cat_1:attribute_1     50.81      5.2012      9.769    2.3292e-21
    word_cat_2:attribute_1         0           0        NaN    NaN
    word_cat_1:attribute_2         0           0        NaN    NaN
    word_cat_2:attribute_2     30.46      4.8653     6.2607    6.2802e-10
