I wanted to apply the chi-squared test and get the p-value, but the p-value computed with MATLAB's chi2cdf function is always zero.

% Example data matrix (2000 rows and 9 columns)
data_matrix = randi([0, 10], 2000, 9); % Replace this with your actual data
% Example empirical frequency (a vector with 9 elements)
empirical_frequency = [10, 20, 30, 40, 50, 60, 70, 80, 90]; % Replace this with your actual empirical frequency
% Initialize vectors to store results
chi_squared_results = zeros(2000, 1);
p_values = zeros(2000, 1);
for i = 1:2000
% Select the data for row i
row_i = data_matrix(i, :);
% Calculate the chi-squared statistic manually
chi_squared = sum((row_i - empirical_frequency).^2 ./ empirical_frequency);
% Determine the degrees of freedom (df)
df = length(row_i) - 1;
% Calculate the p-value using the chi-squared distribution
p = 1 - chi2cdf(chi_squared, df);
% Store the results in vectors
chi_squared_results(i) = chi_squared;
p_values(i) = p;
end
unique(p_values)
ans = 0
The problem is that the p-value computed from chi2cdf is always 0.

Accepted Answer

dpb
dpb on 2 Nov 2023
Edited: dpb on 2 Nov 2023
% Example data matrix (2000 rows and 9 columns)
data_matrix = randi([0, 10], 2000, 9); % Replace this with your actual data
% Example empirical frequency (a vector with 9 elements)
empirical_frequency = [10, 20, 30, 40, 50, 60, 70, 80, 90]; % Replace this with your actual empirical frequency
% Initialize vectors to store results
chi_squared_results = zeros(2000, 1);
p_values = zeros(2000, 1);
for i = 1:2000
% Select the data for row i
row_i = data_matrix(i, :);
% Calculate the chi-squared statistic manually
chi_squared = sum((row_i - empirical_frequency).^2 ./ empirical_frequency);
% Determine the degrees of freedom (df)
df = length(row_i) - 1;
% Calculate the p-value using the chi-squared distribution
p = 1 - chi2cdf(chi_squared, df);
% Store the results in vectors
chi_squared_results(i) = chi_squared;
p_values(i) = p;
end
histogram(chi_squared_results)
%unique(p_values)
[min(chi_squared_results) max(chi_squared_results)]
ans = 1×2
321.3518 421.4471
chi2cdf(ans, df)
ans = 1×2
1 1
What would you expect when comparing a random vector of values between 0 and 10 against an expected cumulative distribution frequency of 10:10:90?
As the above indicates, the minimum chi-square statistic calculated was about 321. That is so far outside the range of a realistic test statistic that the cdf differs from 1 by less than a double can resolve, so it is returned as identically 1 and the p-value comes out as 0. Try something more like
row_i=randi([0, 100], 1, 9) % test vector between 0-100 instead of 0-10
row_i = 1×9
97 27 9 47 15 34 43 54 38
chi_squared = sum((row_i - empirical_frequency).^2 ./ empirical_frequency)
chi_squared = 859.9504
p = 1 - chi2cdf(chi_squared, df)
p = 0
That's still way out of reason; by chance, the largest value in this vector (97, essentially the full cdf value) landed in the first element, so it's not exactly surprising that it ends up with an identically zero estimate.
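As an aside, the exact zero is partly a floating-point effect: once chi2cdf rounds to 1, the subtraction 1 - chi2cdf(...) has nothing left to return. The sketch below uses the 'upper' option of chi2cdf to compute the tail probability directly, with the statistic and degrees of freedom copied from the run above. It only shows that the tail is tiny rather than literally zero; it does not make the comparison any more reasonable.

```matlab
% Sketch: compute the upper-tail p-value directly instead of 1 - cdf.
% The statistic and df are copied from the thread; chi2cdf requires the
% Statistics and Machine Learning Toolbox.
chi_squared = 859.9504;   % statistic from the unsorted random vector
df = 8;                   % 9 bins - 1

p_naive = 1 - chi2cdf(chi_squared, df);        % cdf rounds to 1, so this is exactly 0
p_upper = chi2cdf(chi_squared, df, 'upper');   % tail computed directly: tiny but nonzero

fprintf('1 - cdf : %g\n', p_naive);
fprintf('upper   : %g\n', p_upper);
```

Either way, a statistic this large says the observed vector is nowhere near the expected one.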
Now, keep the same vector but sort it to get what could be an approximation to a cdf...
row_i=sort(row_i)
row_i = 1×9
9 15 27 34 38 43 47 54 97
chi_squared = sum((row_i - empirical_frequency).^2 ./ empirical_frequency)
chi_squared = 26.7983
p = 1 - chi2cdf(chi_squared, df)
p = 7.6598e-04
Now, the above random vector starts out not too bad in comparison to expected, but it has several quite low values in the 50:80 range that make it not fit all that well. At least it's computable.
figure
plot(empirical_frequency,sort(row_i))
xlabel('Expected'), ylabel('Observed')
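To see where the statistic for the sorted vector comes from, one option is to look at each bin's contribution before summing. This is a minimal sketch reusing the sorted vector and expected values from above; the variable name contrib is introduced here for illustration only.

```matlab
% Sketch: per-bin contributions to the chi-squared statistic for the
% sorted vector from the thread.
empirical_frequency = 10:10:90;
row_i = [9 15 27 34 38 43 47 54 97];

% Same formula as above, kept per bin instead of summed immediately
contrib = (row_i - empirical_frequency).^2 ./ empirical_frequency;
chi_squared = sum(contrib);

% One row per bin: expected, observed, contribution
disp(table(empirical_frequency(:), row_i(:), contrib(:), ...
    'VariableNames', {'Expected', 'Observed', 'Contribution'}))
fprintf('chi-squared = %.4f\n', chi_squared);
```

The bins with expected values 50 through 80 account for most of the total, which matches the low values noted above.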
  1 Comment
dpb
dpb on 2 Nov 2023
Edited: dpb on 3 Nov 2023
The random vector from the last run above made such a good illustration that I didn't want to lose it, so I didn't actually rerun the code to plot the observed versus expected. Here it is hard-coded instead...
row_i=[97 27 9 47 15 34 43 54 38];
empirical_frequency=[10:10:90];
subplot(3,1,1)
hold on
plot([0 empirical_frequency 100],[0 empirical_frequency 100],'k-')
plot(empirical_frequency,row_i,'b*-')
xlim([0 100]), ylim([0 100]), box on
legend('reference','random','location','north')
subplot(3,1,2)
hold on
plot([0 empirical_frequency 100],[0 empirical_frequency 100],'k-')
plot(empirical_frequency,sort(row_i),'r*-')
xlim([0 100]), ylim([0 100]), box on
legend('reference','sorted','location','northwest')
subplot(3,1,3)
hold on
row_i=[9 15 27 34 53 57 78 78 97 ];
chi_squared = sum((row_i - empirical_frequency).^2 ./ empirical_frequency)
chi_squared = 4.3887
df=numel(row_i)-1;
p = 1 - chi2cdf(chi_squared, df)
p = 0.8205
plot([0 empirical_frequency 100],[0 empirical_frequency 100],'k-')
plot(empirical_frequency,row_i,'g*-')
xlim([0 100]), ylim([0 100]), box on
legend('reference','adjusted','location','northwest')
Finally, if we take a set of data that actually does roughly follow the path of the empirical cdf then, by golly, we get a chi-square statistic indicating that the set of observations can't really be ruled out as having come from the parent distribution. As noted, the "corrections" made to the sorted random vector were to raise the 5th through 8th values to be roughly in line with the expected ones, so the deviations from empirical are no longer nearly so large.
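For the original 2000-row problem, the loop can also be written without an explicit for. The sketch below assumes R2016b or later (implicit expansion of the row vector across the matrix) and uses the 'upper' option of chi2cdf so that very small p-values are not rounded to zero by the subtraction. The random matrix is placeholder data, as in the question.

```matlab
% Sketch: vectorized form of the loop from the question.
% Assumes R2016b+ (implicit expansion) and the Statistics and Machine
% Learning Toolbox (chi2cdf). Data are random placeholders.
rng(0);                                    % fixed seed, only for repeatability
data_matrix = randi([0, 100], 2000, 9);    % replace with your actual data
empirical_frequency = 10:10:90;            % expected values, one per column

% Row-wise chi-squared statistic: sum over columns (dimension 2)
chi_squared_results = sum((data_matrix - empirical_frequency).^2 ...
                          ./ empirical_frequency, 2);
df = size(data_matrix, 2) - 1;

% Upper-tail probability for every row at once
p_values = chi2cdf(chi_squared_results, df, 'upper');
```

The results are 2000-by-1 column vectors, the same shape the loop version fills in.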
