Fitting probability distributions to the data (allfitdist)

Hello
I have found the amazing script allfitdist, which fits all valid parametric distributions to the data and sorts them using a metric (e.g. BIC or AIC). A blog post gives an example for a normal distribution:
% Create a normally distributed (mu: 5, sigma: 3) random data set
x = normrnd(5, 3, 1e4, 1);
% Compute and plot results. The results are sorted by "Bayesian information
% criterion".
[D, PD] = allfitdist(x, 'PDF');
When I run this, I get Rayleigh as the winning distribution. In the blog post they got the normal distribution (as expected). The plot I'm getting is in the attachment.
Does anybody see where this problem comes from? I have the newest version of MATLAB. Is that the problem?

1 Comment

I had the same problem and I switched from 2016b to 2012b. In 2012b the code gives results that seem to be OK.


 Accepted Answer

This is probably the result of a change in default behavior between MATLAB versions. You get Rayleigh because the log likelihood for the Rayleigh distribution comes back as a complex number, which shouldn't happen. Go to line 212 of the script and add this there:
if ~isreal(NLL)
    error('complex NLL'); % reject this distribution
end
Now the program will catch the complex log likelihood and disregard the distribution like it should.
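To illustrate why the guard matters, here is a minimal sketch that does not depend on allfitdist. It assumes the complex value comes from taking the logarithm of a negative quantity: the Rayleigh distribution is only defined for non-negative data, and a normal sample with mu 5 and sigma 3 contains some negative values. The helper name isBadNLL is hypothetical and not part of the script.

```matlab
% The logarithm of a negative number is complex in MATLAB.
v = log(-1);                      % 0 + 3.1416i
disp(isreal(v));                  % 0 (false)

% Hypothetical helper: true if a scalar NLL is complex, NaN or infinite.
isBadNLL = @(nll) ~isreal(nll) || ~isfinite(nll);

disp(isBadNLL(v));                % 1: complex
disp(isBadNLL(Inf));              % 1: infinite
disp(isBadNLL(NaN));              % 1: not a number
disp(isBadNLL(12.5));             % 0: an ordinary real value passes
```

Combining both conditions in one test would reject complex and non-finite values in the same place.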

7 Comments

Thank you so much, now it works. :) In the blog post they got the logistic distribution as the third and generalized extreme value as the fourth best distribution. I'm getting the generalized extreme value as third and logistic as fourth best distribution, please see my attached plot.
Do you see why this is?
Edit: There is a second problem. As you can see in my plot, the normal distribution is not visible at all. I think I have found the reason: the tlocationscale fit is visually almost identical to the normal fit and is drawn on top of it. How can I adapt the script so that the better-fitting distributions (e.g. normal) are drawn on top of the others (e.g. tlocationscale)?
The first problem is simply randomness. Every call to this line draws a new random sample:
x = normrnd(5, 3, 1e4, 1);
Run the demo 20 times and you will catch runs in which logistic beats generalized extreme value.
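A rough way to check this without rerunning the whole script is sketched below. It fits only the two distributions in question with fitdist and compares their BIC over 20 seeded runs. The BIC is computed from its standard definition, 2*NLL + k*log(n); I assume, but have not verified, that this matches what allfitdist does internally. The variable names are made up for this sketch.

```matlab
% Count how often the logistic fit has a lower (better) BIC than the
% generalized extreme value fit across several random samples.
nRuns = 20;
n = 1e4;
logisticWins = 0;
for k = 1:nRuns
    rng(k);                                   % fixed seed, so each run is repeatable
    x = normrnd(5, 3, n, 1);
    pdLog = fitdist(x, 'Logistic');
    pdGev = fitdist(x, 'GeneralizedExtremeValue');
    bicLog = 2*negloglik(pdLog) + pdLog.NumParameters*log(n);
    bicGev = 2*negloglik(pdGev) + pdGev.NumParameters*log(n);
    logisticWins = logisticWins + (bicLog < bicGev);
end
fprintf('Logistic beat GEV in %d of %d runs\n', logisticWins, nRuns);
```

Calling rng with a fixed seed before normrnd also makes the ranking reproducible from one run to the next.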
The second is due to the plotting order. If we reverse it, the legend order reverses as well. If you don't mind that, add the following around line 334 (approximate, since we have changed the file a bit), right below where it says:
max_num_dist=4;
Add:
Dindex = min([length(D),max_num_dist]);
D = fliplr(D(1:Dindex));
PD = fliplr(PD(1:Dindex));
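The effect of these lines can be sketched on a plain cell array of names. The ranking below is hypothetical and only stands in for the sorted output of the script. In a 2-D axes, objects plotted later are drawn on top of earlier ones, so the best fit has to come last.

```matlab
% Hypothetical ranking, best fit first (stand-in for the sorted D / PD).
names = {'normal', 'tlocationscale', 'generalized extreme value', ...
         'logistic', 'gamma'};
max_num_dist = 4;

% Keep the best max_num_dist entries, then reverse their order.
Dindex = min([length(names), max_num_dist]);
plotOrder = fliplr(names(1:Dindex));

disp(plotOrder{1});      % logistic: plotted first, ends up at the bottom
disp(plotOrder{end});    % normal: plotted last, ends up on top
```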
Thank you again for the great help. Now everything works very well. I only have a final small question:
In the script there is example 2 and example 3:
% Example 2
data = nbinrnd(20,.3,1e4,1);
values=unique(data); freq=histc(data,values);
[D PD] = allfitdist(values,'NLogL','frequency',freq,'PDF','DISCRETE');
% Example 3
data=geornd(.7,1e4,1); %Random from Geometric
[D PD]= allfitdist(data,'PDF','DISCRETE');
In Example 2 the frequencies are given, while in Example 3 they are not. When I omit the frequencies in Example 2, it does not work. When I give the frequencies in Example 3, it still works.
When do I have to give the frequencies for discrete distributions? Or should I always give them?
Let's break it down:
In both cases the data set has 10000 points.
In the third example you pass the data itself, all 10000 points, and the function fits a distribution based on how many times each value is observed.
In the second example you find the unique values of the 10000 points and pass those instead. So even if, say, 56 is observed 147 times, it appears only once in the input. Your input, otherwise 10000 long, is now only, say, 103 long. To account for that, you pass a second argument called frequency that tells the function how many times each unique value is observed. For example, instead of passing [56 56 56 56 56 ...], you tell the function "56, observed 147 times". If you pass the unique values without a frequency argument, the function assumes every value occurs only once, hence the different output.
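The relationship between the two forms can be checked on a tiny made-up sample without calling allfitdist at all. This is only a sketch: the six numbers stand in for the 10000 points, and repelem needs R2015a or later.

```matlab
data   = [3; 1; 3; 2; 3; 1];        % small stand-in for the full sample
values = unique(data);              % [1; 2; 3]
freq   = histc(data, values);       % [2; 1; 3], one count per unique value

% Expanding each value by its frequency recovers the (sorted) raw data.
expanded = repelem(values, freq);   % [1; 1; 2; 3; 3; 3]
disp(isequal(expanded, sort(data)));    % 1 (true)
disp(sum(freq) == numel(data));         % 1 (true)
```

So values together with freq carry the same information as data, while values alone describe a sample in which every value occurs once.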
Thanks a lot for the detailed explanation. So both examples are just the same (the same result) but the notation is different (i.e. the way to get the result)?
No. Here is a modified version that uses the same data and gets the same results with the two different notations:
% Example 2
data = nbinrnd(20,.3,1e4,1);
values=unique(data); freq=histc(data,values);
[D PD] = allfitdist(values,'NLogL','frequency',freq,'PDF','DISCRETE'); % unique values plus their frequencies
% Example 3
[D PD] = allfitdist(data,'NLogL','PDF','DISCRETE'); % raw data
Thank you again. I have just encountered the following warning message: Maximum likelihood estimation did not converge. Iteration limit exceeded. This appeared e.g. for tlocationscale. Of course such a fit is bad. How can I exclude it? I mean, if this warning appears, I just want to throw an error.
Second, I currently check whether the negative log likelihood is complex or infinite, and if so I throw an error. Are there other checks I should do?
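One possible approach, offered only as a sketch, is to clear lastwarn before the fit and inspect it afterwards, turning any warning into an error. It assumes the script wraps each fit in try/catch so that an error drops the distribution, as the accepted answer describes. The error identifiers and the variable rejected are made up for this example, and fitdist is called directly instead of through allfitdist.

```matlab
x = normrnd(5, 3, 1e4, 1);
rejected = false;
try
    lastwarn('');                          % clear the last-warning state
    pd = fitdist(x, 'tLocationScale');
    [msg, id] = lastwarn;                  % non-empty if the fit issued a warning
    if ~isempty(msg)
        error('sketch:fitWarning', 'Fit issued a warning (%s): %s', id, msg);
    end
    NLL = negloglik(pd);
    if ~isreal(NLL) || ~isfinite(NLL)      % complex, NaN or infinite
        error('sketch:badNLL', 'NLL is complex, NaN or infinite');
    end
catch err
    rejected = true;                       % the distribution would be skipped
    fprintf('Distribution skipped: %s\n', err.message);
end
```

This rejects a fit on any warning, not only the convergence one. To be more selective, compare id against the identifier of the specific warning you see on your system.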


More Answers (0)

Asked: 3 Mar 2016

Commented: 5 Mar 2017
