Histograms - why does the smallest binsize always give the smallest mean integrated squared error?

Hi all,
I have a bit of a specialised question involving histograms and mean integrated squared error (MISE). I want to find the 'best' way to construct a histogram for some data using a quantifiable method. I thought that MISE would provide a good way to do this as I can simulate data and then compare different histograms to the underlying probability distribution.
However, surprisingly (to me at least) I keep finding that histograms made with tiny bins always have a smaller MISE than ones with larger bins, even though the latter seem to reflect the data more accurately.
For an example I have made some mock code (below) which just simulates some random numbers from a normal distribution, bins them into a histogram and compares this histogram to the actual underlying distribution.
If we look at MISE with respect to bin size we get this:
So the tiny bins have the smallest error. However, if we plot some examples:
The red line is the actual probability distribution, the blue line is a histogram built using the smallest bin size (with the smallest MISE), and the green line is a histogram built using a larger bin size (with a larger MISE, but it 'looks' closer to the real distribution).
So what's going on? Is this just a property of MISE or am I making a mistake? Some areas where I think I might have made a mistake:
  1. Before calculating MISE I normalise both distributions so that each sums to 1. This makes sense to me, as I should be comparing probability distributions, but maybe they should be normalised in a different manner?
  2. Sometimes when MISE is expressed there is also an 'Expected Value' coefficient which I have not been able to identify. This seems to be an average, but an average of what? I think this might fix the problem by scaling the MISE according to the average bin contents but I'm not sure how to apply it.
Any help would be greatly appreciated,
NP.
% distribution mean
mu = 0;
% distribution standard deviation
std = 2;
% binsizes we want to test
bin_size = 0.1:0.1:3;
% random values from this distribution
vals = normrnd(mu,std,100,1);
% preallocate
mise_values = NaN(length(bin_size),1);
% run through every bin size
for bb = 1:length(bin_size)
    % values to evaluate distributions at
    xi = -10:bin_size(bb):10;
    % histogram of values
    kpdf = histcounts(vals,xi);
    % locations of bin centers (so PDF will match histogram)
    xi2 = movmean(xi,2,'Endpoints','discard');
    % underlying probability distribution
    updf = normpdf(xi2,mu,std);
    % mean integrated squared error between the histogram and the PDF
    % first normalise both
    updf = updf ./ nansum(updf);
    kpdf = kpdf ./ nansum(kpdf);
    % calculate MISE
    mise_values(bb) = sum( sum( (updf - kpdf).^2 ) ) .* bin_size(bb);
end
% plot MISE vs bin size
figure
scatter(bin_size,mise_values,'k');
refline
% plot different distributions
figure
xi = -10:0.1:10;
plot(xi,normpdf(xi,mu,std),'r'); hold on;
% plot 'best' binsize
xi2 = movmean(xi,2,'Endpoints','discard');
f = histcounts(vals,xi);
plot(xi2,f./nansum(f),'b')
% plot a better one
xi = -10:0.8:10;
xi2 = movmean(xi,2,'Endpoints','discard');
f = histcounts(vals,xi);
plot(xi2,f./nansum(f),'g')

Accepted Answer

John D'Errico on 23 Jul 2020
Edited: John D'Errico on 23 Jul 2020
Your error is a subtle one, but it is important to understand why it happens.
x = randn(1000,1);
xi = -5:0.1:5;
histogram(x,xi,'Normalization','pdf')
hold on
fplot(@(x) normpdf(x))
xi2 = movmean(xi,2,'Endpoints','discard');
f = histcounts(x,xi);
plot(xi2,f./nansum(f),'r')
legend('histogram - pdf normalization','true pdf','histogram - relative counts')
The red curve at the bottom (look carefully, it is hard to see there) is the one you plotted. It is a simple relative number of counts per bin, so normalized to sum to 1. However, a pdf is normalized to have unit area.
Instead, see the difference here:
figure
dx = 0.1;
plot(xi2,f./nansum(f)/dx,'r')
hold on
fplot(@(x) normpdf(x))
legend('histogram - pdf normalization','true pdf')
Do you see the difference? I used your same data, but now the histogram is properly normalized, in a way that is consistent with a pdf.
While you think it makes sense for the simple frequency histogram to sum to 1, it was NOT normalized to INTEGRATE to have an area of 1. That only happened when I scaled it by dividing by dx.
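The distinction between "sums to 1" and "integrates to 1" can be checked numerically. The following is only a sketch: the seed and the variable names (rel, dens, area_rel, area_dens) are chosen here for illustration and are not part of the code above.

```matlab
% Sketch only: compare "sums to 1" with "integrates to 1" for one histogram.
rng(1);                        % arbitrary seed, only for repeatability
x  = randn(1000,1);            % standard normal sample, as in the answer
dx = 0.1;                      % bin width
xi = -5:dx:5;                  % bin edges
f  = histcounts(x,xi);         % raw counts per bin

rel  = f ./ sum(f);            % relative counts: sum to 1
dens = rel ./ dx;              % density estimate: integrates to 1

sum_rel   = sum(rel);          % approximately 1
area_rel  = sum(rel)  * dx;    % approximately dx (0.1), so not a pdf
area_dens = sum(dens) * dx;    % approximately 1, consistent with a pdf

fprintf('sum(rel) = %.3f, area(rel) = %.3f, area(dens) = %.3f\n', ...
    sum_rel, area_rel, area_dens);
```

Because the area under the relative-count curve equals the bin width, shrinking the bins shrinks that whole curve toward zero, which is why it sits at the bottom of the first plot.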
As far as the smaller bin size being better, that should just reflect the idea that a smaller bin size can better approximate the true distribution.
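To see how the error behaves with bin size once the histogram is normalized to unit area, the loop from the question can be rerun along these lines. This is a minimal sketch under some assumptions: it uses randn and the explicit Gaussian formula in place of normrnd and normpdf so that it runs without the Statistics Toolbox, the names sigma and ise_values are introduced here, and the true pdf is evaluated only at the bin centers, so the integral is a coarse Riemann sum. It measures the squared error for this one sample only.

```matlab
% Sketch: integrated squared error of an area-normalized histogram
% for a range of bin sizes, using a single simulated sample.
rng(0);                              % arbitrary seed for repeatability
mu    = 0;                           % distribution mean
sigma = 2;                           % standard deviation (name avoids shadowing std)
vals  = mu + sigma*randn(100,1);     % 100 draws from N(mu, sigma^2)

bin_size   = 0.1:0.1:3;              % bin sizes to test
ise_values = NaN(numel(bin_size),1); % preallocate

for bb = 1:numel(bin_size)
    h       = bin_size(bb);
    edges   = -10:h:10;                      % bin edges
    centers = edges(1:end-1) + h/2;          % bin centers
    counts  = histcounts(vals,edges);

    % histogram scaled so that its area (not its sum) equals 1
    kpdf = counts ./ (sum(counts) * h);

    % true normal pdf at the bin centers, left unnormalized
    updf = exp(-(centers - mu).^2 ./ (2*sigma^2)) ./ (sigma*sqrt(2*pi));

    % Riemann-sum approximation of the integrated squared error
    ise_values(bb) = sum((kpdf - updf).^2) * h;
end

figure
plot(bin_size,ise_values,'ko-')
xlabel('bin size'); ylabel('integrated squared error')
```

Note that the true pdf is not renormalized here: it already integrates to 1, so only the histogram needs rescaling.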
  1 Comment
Neuropragmatist on 23 Jul 2020
Of course, this is exactly the problem! This is extremely helpful, thank you. Now, when I normalise for integration the MISE is actually high for small bins and then decreases to a plateau of nice bin sizes.
Do you also have any insight on the 'expected value' in the MISE formula?
I have seen a few papers where this was omitted, so I'm not sure what purpose it serves.
Thanks,
NP.
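For reference on the 'expected value' asked about above, the usual textbook definition of MISE can be written as follows. The notation is an assumption made here, not taken from the thread: \hat{f}_h is the histogram density estimate with bin width h, f is the true pdf, and the expectation is over repeated random samples of the same size.

```latex
% ISE: error of the estimate built from one particular sample
\mathrm{ISE}(h) = \int \bigl(\hat{f}_h(x) - f(x)\bigr)^2 \, dx

% MISE: the ISE averaged over all possible samples
\mathrm{MISE}(h) = \mathbb{E}\!\left[ \int \bigl(\hat{f}_h(x) - f(x)\bigr)^2 \, dx \right]

% Equivalent split into integrated squared bias and integrated variance
\mathrm{MISE}(h) = \int \bigl(\mathbb{E}[\hat{f}_h(x)] - f(x)\bigr)^2 \, dx
                 + \int \operatorname{Var}\bigl[\hat{f}_h(x)\bigr] \, dx
```

Under this reading, a value computed from a single simulated dataset is an ISE. Averaging it over many independently simulated datasets would approximate the MISE, which is presumably why the expectation is dropped when only one sample is considered.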
