Why is the calculated R-squared different between the built-in fit function and the EzyFit function (from File Exchange)?

Sorry, I think I have overwritten a new question over an old one... is there a way to restore the old question?

Accepted Answer

John D'Errico
John D'Errico on 22 May 2023
Edited: John D'Errico on 22 May 2023
Do you understand that R^2 is not valid when computed for a model with no constant term? There are variations of R^2 that are more appropriate for models with no constant term.
Did you notice that R^2 for one of those computations was negative? That clearly suggests a problem. That your model has NO constant term also suggests why there is a problem.
That your data contains one outlier that would heavily influence the fit is another problem. I won't get into that at all.
Simple one-number measures like R^2 are a problem. They are a far larger problem if you don't understand what they are telling you. And if you are worried about R^2 at all on a problem with no constant term, then you don't understand R^2.
Let me spend some time writing and explaining...
x1 = [
1734.46674110029
1721.86718990168
1456.18495599912
1748.16876863704
1262.25401459449
1584.17734249873
1859.79267910209
864.395721875175
1952.12705609501
1503.45890484099
1976.60164096334
3862.90470480267
1914.88763478115
1373.87007296104
1826.95766710343
1515.55469767365
1765.08584136511
1318.10668583756];
y1 = [
4289.03428582923
2246.49016736711
1540.98595650498
2038.68253981628
3541.64494736076
4039.09183624669
3602.87690315152
1747.01379244271
4285.91071657769
2935.38381778387
2432.46183991586
4121.49502991896
2455.73593671295
4682.44200564969
3633.35882024405
1763.8042370884
1803.2292759675
3724.18628227003];
Essentially, R^2 is a very simple measure that compares the variability in the data itself, IF we had essentially no model at all. We can get that from the variance.
var(y1)
ans = 1.1056e+06
Note that the variance SUBTRACTS OFF THE MEAN OF THE DATA. It implicitly assumes the model for this process is a constant model, with Gaussian noise added. So the implicit model in that variance computation is just
y = a0 + noise
And we can recover the best least square estimate of a0 from the mean.
a0 = mean(y1)
a0 = 3.0491e+03
But what did you fit? You tried to fit a model that lacks a constant term.
y = a1*x
We can get that directly from backslash, or we can use fit. Since we will do these computations essentially by hand, I'll use backslash.
format long g
a1 = x1\y1
a1 =
1.62650678745712
I can also use fit though, just to convince you that backslash was correct.
mdl = fittype('a1*x1','indep','x1');
[fittedmdl,G] = fit(x1,y1,mdl)
Warning: Start point not provided, choosing random start point.
fittedmdl =
General model:
    fittedmdl(x1) = a1*x1
Coefficients (with 95% confidence bounds):
    a1 = 1.627  (1.289, 1.964)
G = struct with fields:
           sse: 26350446.5536834
       rsquare: -0.401930869730069
           dfe: 17
    adjrsquare: -0.401930869730069
          rmse: 1245.00050918212
fittedmdl.a1
ans =
1.6265067878172
So we get the same value. fit should actually be a little less accurate here, because fit uses an iterative procedure; the slop lies in the convergence tolerance.
You will notice that fit returns a NEGATIVE R^2. Again, that should be a hint. NO CONSTANT TERM.
Now, what does R^2 tell us? R^2 compares how well the current model does in terms of reducing the variability in the data, compared to no model more complex than assuming the process is simply a constant one, plus noise.
SSbase = sum((y1 - mean(y1)).^2)
SSbase =
18795824.4751091
SSmdl = sum((y1 - a1*x1).^2)
SSmdl =
26350446.5536833
As you can see, the base sum of squares, where I subtracted off only the mean, is SMALLER than the sum of squares when I subtracted off the estimate from this model. The R^2 computation is now a simple one.
R2bad = 1 - SSmdl/SSbase
R2bad =
-0.401930869730068
So again, a negative R^2 tells us that your model does not fit the data better than if you had just used a constant approximation for the process.
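As a side note, a negative R^2 for a no-constant fit is easy to reproduce on made-up data. Here is a minimal cross-check sketched in Python with NumPy (an illustration with hypothetical data, not part of the original MATLAB thread): the true process has a large constant term, so forcing the line through the origin does worse than a flat line at the mean.

```python
import numpy as np

# Hypothetical data: weak slope, large offset (the true model HAS a constant term).
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 50)
y = 100.0 + 0.5 * x + rng.normal(0, 2, size=50)

# Least-squares fit through the origin: minimize sum((y - a1*x)^2) over a1.
a1 = np.dot(x, y) / np.dot(x, x)

ss_mdl = np.sum((y - a1 * x) ** 2)     # residual SS of the no-constant fit
ss_base = np.sum((y - y.mean()) ** 2)  # SS around the mean (constant model)

r2 = 1 - ss_mdl / ss_base
print(r2)  # negative: the origin-constrained line is worse than a flat line
```

The sign of R^2 here flips for exactly the reason described above: ss_mdl exceeds ss_base, so the ratio is greater than 1.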
Finally, we might consider whether a better measure (for THIS process) is how well the fit reduces the simple sum of squares of your data, had we not subtracted off the mean.
R2nocon = 1 - SSmdl/sum(y1.^2)
R2nocon =
0.858439152113143
This assumes that the default model for the process is
y = noise
with a presumed mean of zero. And that is probably what ezyfit computed, since it apparently knew the model has no constant term.
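The no-constant variant can be cross-checked the same way. Below is a small Python sketch (hypothetical data; this mirrors the formula above and makes no claim about EzyFit's actual implementation): for a least-squares line through the origin, the normal equations make the residual orthogonal to the fit, so this version of R^2 always lands in [0, 1].

```python
import numpy as np

# Uncentered R^2 uses sum(y^2) as the baseline, i.e. the implicit reference
# model is y = noise with a presumed mean of zero.
rng = np.random.default_rng(1)
x = rng.uniform(1, 10, 40)
y = 3.0 * x + rng.normal(0, 1, 40)

a1 = np.dot(x, y) / np.dot(x, x)       # least squares through the origin
ss_res = np.sum((y - a1 * x) ** 2)

r2_uncentered = 1 - ss_res / np.sum(y ** 2)

# Orthogonality of residual and fit gives sum(y^2) = sum((a1*x)^2) + ss_res,
# so this ratio can never exceed 1 and never drops below 0.
print(r2_uncentered)
```

This is why the uncentered definition is the one that stays meaningful for models without an intercept.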
Finally, I'll plot the various models.
plot(x1,y1,'mo')
hold on
plot(fittedmdl,'r')
yline(mean(y1),'b')
The blue horizontal line is a model of the process where there is no signal at all, just random noise. In fact, that actually fits the data better in terms of explaining the sum of squares of errors, compared to the linear fit with no constant term.
In the end, mono-numerosis is a BAD thing. NEVER RELY ON A SIMPLE NUMBER TO TELL YOU IF YOUR MODEL IS ANY GOOD, certainly not if you don't understand the number in the first place. I would even go further, to tell you to rely on your eyes and your brain, NOT on any number. If the fit looks right, it is right. At the very least, think about what you are doing. Is the fit adequate for what you need?
  1 Comment
Sim
Sim on 22 May 2023
Edited: Sim on 23 May 2023
Thanks a lot @John D'Errico for your great answer!
I tried to understand and reproduce what you did, performing the fitting/linear regression for the following cases:
  • (a) only slope, with the original equation for R^2 (done manually)
  • (b) only slope, with the corrected equation for R^2 (done manually)
  • (c) slope and intercept, with the original equation for R^2 (done manually)
  • (d) this is an additional case, that I add here, just to compare to the other cases: slope and intercept detected automatically and R^2 calculated automatically with "fitlm" (done automatically)
clc;close all;
% input
x1 = [
1734.46674110029
1721.86718990168
1456.18495599912
1748.16876863704
1262.25401459449
1584.17734249873
1859.79267910209
864.395721875175
1952.12705609501
1503.45890484099
1976.60164096334
3862.90470480267
1914.88763478115
1373.87007296104
1826.95766710343
1515.55469767365
1765.08584136511
1318.10668583756];
y1 = [
4289.03428582923
2246.49016736711
1540.98595650498
2038.68253981628
3541.64494736076
4039.09183624669
3602.87690315152
1747.01379244271
4285.91071657769
2935.38381778387
2432.46183991586
4121.49502991896
2455.73593671295
4682.44200564969
3633.35882024405
1763.8042370884
1803.2292759675
3724.18628227003];
% (1) Manual fitting/regression
% find slopes (and intercept)
b1 = x1\y1; % <-- (a & b) only slope
yfit1 = b1*x1;
x2 = [ones(length(x1),1) x1];
b2 = x2\y1; % <-- (c) slope and intercept
yfit3 = x2*b2;
% draw a plot
hold on
scatter(x1,y1,[],'b','filled')
plot(x1,yfit1,'-.k')
plot(x1,yfit3,'-r')
hold off
xlabel('x1')
ylabel('y1')
legend('Data','(a & b) only slope','(c) slope and intercept','Location','best');
% calculate R^2 and RMSE statistics
Rsq1 = 1 - sum((y1 - yfit1).^2)/sum((y1 - mean(y1)).^2); % <-- (a) only slope, with original equation for R^2
MSE1 = sum((y1 - yfit1) .^ 2) / numel(y1);
RMSE1 = sqrt(MSE1);
Rsq2 = 1 - sum((y1 - yfit1).^2)/sum((y1 ).^2); % <-- (b) only slope, with the corrected equation for R^2
MSE2 = MSE1; % <-- both MSE and RMSE are the same as in the previous case since they do not depend on mean(y1)
RMSE2 = RMSE1;
Rsq3 = 1 - sum((y1 - yfit3).^2) /sum((y1 - mean(y1)).^2); % <-- (c) slope and intercept, with original equation for R^2
MSE3 = sum((y1 - yfit3) .^ 2) / numel(y1);
RMSE3 = sqrt(MSE3);
% make a summary
t1 = table({'a'},{'NaN'},b1, Rsq1, MSE1, RMSE1);
t2 = table({'b'},{'NaN'},b1, Rsq2, MSE2, RMSE2);
t3 = table({'c'},b2(1),b2(2),Rsq3, MSE3, RMSE3);
t1.Properties.VariableNames = {'case','intercept','slope','R^2','MSE','RMSE'};
t2.Properties.VariableNames = t1.Properties.VariableNames;
t3.Properties.VariableNames = t1.Properties.VariableNames;
[t1;t2;t3]
ans = 3×6 table
     case      intercept        slope      R^2         MSE         RMSE
    _____    ______________    ______    ________    __________    ______
    {'a'}    {'NaN'       }    1.6265    -0.40193    1.4639e+06    1209.9
    {'b'}    {'NaN'       }    1.6265     0.85844    1.4639e+06    1209.9
    {'c'}    {[2.2105e+03]}    0.4832    0.076757    9.6406e+05    981.87
% (2) matlab function for linear regression
mdl = fitlm(x1,y1) % <-- (d) slope, intercept and R^2 automatically calculated
mdl =
Linear regression model:
    y ~ 1 + x1

Estimated Coefficients:
                   Estimate      SE        tStat      pValue
                   ________    _______    ______    ________
    (Intercept)     2210.5      767.46    2.8802    0.010878
    x1              0.4832     0.41896    1.1533     0.26571

Number of observations: 18, Error degrees of freedom: 16
Root Mean Squared Error: 1.04e+03
R-squared: 0.0768, Adjusted R-Squared: 0.0191
F-statistic vs. constant model: 1.33, p-value = 0.266
figure
plotAdded(mdl)
In summary:
  • Yes, in my case, with the original definition of R^2, I got R^2 = -0.4 because the variance around the least-squares fit line is greater than the variance around the mean. This is clear, but I did not know that it was related to the lack of intercept, and I did not understand why Ezyfit was giving a different result.
  • I did not know that R^2 loses its meaning if there is no intercept. I understood that, for that case, we need to use a slightly different definition for R^2, i.e.
Rsq2 = 1 - sum((y1 - yfit1).^2)/sum((y1 ).^2); % <-- (b) only slope, with the corrected equation for R^2
  • However, I still do not understand why mean(y) cannot be subtracted in that case. Something is explained in Removal of statistically significant intercept term increases R^2 in linear model, but I have not grasped the idea yet.
  • I understood that, for my case, and by using the corrected equation for R^2, i.e. the one where mean(y) is not subtracted, I would get "y = noise" as the default model. However, if I do not subtract mean(y) in the equation for R^2, it would be equivalent to saying that mean(y)=0. But mean(y) is around 3000, as you have shown in your plot. I do not understand this part well.
  • It might be that EzyFit calculates R^2 = 0.84396 without subtracting mean(y) in the equation for R^2, even though, with that small modification, we get R^2 = 0.85844. A slight difference.
  • Cases (c) and (d) produce the same output, i.e. same slope, intercept and R^2. That means that the embedded function "fitlm" considers - obviously - the intercept in the calculation of the linear regression. OK, good to know.
  • Besides R^2, I could use other statistical metrics, such as MSE and RMSE, since they do not depend on mean(y), but only on the variance around the least-squares fit line. Yes, by adding the intercept in the fitting/linear regression, the MSE and its square root decrease, which is a good sign. However, how should I interpret them...?
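To see concretely why subtracting mean(y) is only justified when the model has an intercept, here is a small cross-check sketched in Python with made-up data (an addition for illustration, not part of the original thread): with an intercept the residuals sum to zero, so the variance decomposition behind the classic R^2 holds exactly; without one, it fails.

```python
import numpy as np

# Hypothetical data whose true model has both slope and intercept.
rng = np.random.default_rng(2)
x = rng.uniform(1, 10, 30)
y = 2.0 * x + 5.0 + rng.normal(0, 1, 30)

# Fit WITH an intercept. The column of ones forces the residuals to sum to
# zero, which is what makes the ANOVA identity
#   sum((y-mean)^2) = sum((yhat-mean)^2) + sum((y-yhat)^2)
# hold, so the centered R^2 really is a "fraction of variance explained".
X = np.column_stack([np.ones_like(x), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
yhat = X @ b

lhs = np.sum((y - y.mean()) ** 2)
rhs = np.sum((yhat - y.mean()) ** 2) + np.sum((y - yhat) ** 2)
print(abs(np.sum(y - yhat)))  # ~0: residuals sum to zero
print(abs(lhs - rhs))         # ~0: identity holds

# Fit WITHOUT an intercept: the residuals need not sum to zero, the identity
# breaks, and 1 - SSres/SStot can no longer be read as a variance fraction.
a1 = np.dot(x, y) / np.dot(x, x)
res = y - a1 * x
print(np.sum(res))            # generally nonzero
```

In short, centering by mean(y) is bookkeeping that relies on the intercept soaking up the mean of the residuals; remove the intercept and that bookkeeping no longer balances.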


More Answers (0)
