Fit a statistical distribution to truncated data

30 views (last 30 days)
I have a "truncated dataset" and I would need to infer the distribution that most likely fits the data. Even though I have a "truncated dataset", instead of a "full dataset", I think that the best fitting distribution would be that one that could describe the "full dataset". This best-fitting distribution would be something like what is depicted by the blue line in this plot:
Do you have any comment, suggestion, or idea on how to get that blue line ?
When I tried to reproduce - with the fitdist function - the blue line in the above-mentioned figure, i.e. the best-fitting distribution as if I had the "full dataset", I was not successful. Here below you can find a comparison between the fitdist applied to the "full dataset" and the "truncated dataset", having both the same "origin", i.e. makedist('Normal','mu',3).
% (1) from a normal probability distribution, i.e. "makedist('Normal','mu',3)",
% create:
% (i) a "full dataset" and
% (ii) a set of "truncated data"
pd = makedist('Normal','mu',3);
t = truncate(pd,3,inf);
data_full = random(pd,10000,1);
data_trunc = random(t,10000,1);
% (2) fit the normal distribution to
% (i) the "full dataset"
% (ii) the set of "truncated data"
pd_fit_full = fitdist(data_full,'normal');
pd_fit_trunc = fitdist(data_trunc,'normal');
% (3) plot
% (i.a) the "histogram of the full dataset" (from the "full dataset")
% (i.b) the density function corresponding to the distribution that fits the "full dataset"
% (ii.a) the "truncated histogram" (from the "truncated data")
% (ii.b) the density function corresponding to the distribution that fits the "truncated histogram"
xgrid = linspace(0,100,1000)';
hold on
histogram(data_full,100,'Normalization','pdf','facecolor','red')
line(xgrid,pdf(pd_fit_full,xgrid),'Linewidth',2,'color','red')
histogram(data_trunc,100,'Normalization','pdf','facecolor','blue')
line(xgrid,pdf(pd_fit_trunc,xgrid),'Linewidth',2,'color','blue')
hold off
xlim([0 10])

Accepted Answer

Jeff Miller
Jeff Miller on 20 Jun 2023
If you would like to fit a variety of truncated distributions in addition to the normal, you might find Cupid helpful. For instance, here's an example with a 2-parameter Weibull:
pd = makedist('Weibull','a',3,'b',5);
t = truncate(pd,3,inf);
data_trunc = random(t,10000,1);
% Lower cutoff of 3 is known. Start with
% any reasonable guesses for the Weibull parameters--here, 2 & 2.
fittedDist = TruncatedXlow(Weibull2(2,2),3);
% Now estimate the Weibull parameters by maximum likelihood,
% allowing for the truncation.
fittedDist.EstML(data_trunc);
xgrid = linspace(0,100,1000)';
figure
histogram(data_trunc,100,'Normalization','pdf','facecolor','blue')
line(xgrid,fittedDist.PDF(xgrid),'Linewidth',2,'color','red')
xlim([2.5 6])
  1 Comment
Sim
Sim on 20 Jun 2023
Edited: Sim on 20 Jun 2023
Thanks a lot to everyone, @Jeff Miller, @the cyclist, @Torsten, for your replies and suggestions!! They are all great solutions, and I found the @Jeff Miller's one, probably, the closest one to my needs. I explain you the reason why. Even though I asked something about the fitting of a normal distribution to a normally distributed truncated dataset (that I created artificially), my real cases involve a variety of distributions (often unknown to me) that can be "detected" by the Cupid tool. Therefore, to my understanding, Cupid could be seen as an extension or generalisation of the solutions proposed by @the cyclist (i.e. Fitting a truncated normal (Gaussian) distribution) and @Torsten.
After having downloaded the Cupid's zip folder, I run the @Jeff Miller solution, which works well:
addpath('.../Cupid-master')
pd = makedist('Weibull','a',3,'b',5);
t = truncate(pd,3,inf);
data_trunc = random(t,10000,1);
% Lower cutoff of 3 is known. Start with
% any reasonable guesses for the Weibull parameters--here, 2 & 2.
fittedDist = TruncatedXlow(Weibull2(2,2),3);
% Now estimate the Weibull parameters by maximum likelihood,
% allowing for the truncation.
fittedDist.EstML(data_trunc);
xgrid = linspace(0,100,1000)';
figure
histogram(data_trunc,100,'Normalization','pdf','facecolor','blue')
line(xgrid,fittedDist.PDF(xgrid),'Linewidth',2,'color','red')
xlim([2.5 6])

Sign in to comment.

More Answers (2)

Torsten
Torsten on 19 Jun 2023
Edited: Torsten on 19 Jun 2023
Why should it be justified to fit a dataset of a truncated normal by a normal distribution ?
pd_fit_trunc = fitdist(data_trunc,'normal');
First complete the data set "data_trunc" by reflection at x = 3 such that it becomes distributed according to a normal distribution. Then you can fit it by a normal distribution:
% (1) from a normal probability distribution, i.e. "makedist('Normal','mu',3)",
% create:
% (i) a "full dataset" and
% (ii) a set of "truncated data"
pd = makedist('Normal','mu',3);
t = truncate(pd,3,inf);
data_full = random(pd,10000,1);
data_trunc = random(t,10000,1);
data_trunc = [data_trunc;-(data_trunc-3)+3];
% (2) fit the normal distribution to
% (i) the "full dataset"
% (ii) the set of "truncated data"
pd_fit_full = fitdist(data_full,'normal');
pd_fit_trunc = fitdist(data_trunc,'normal');
% (3) plot
% (i.a) the "histogram of the full dataset" (from the "full dataset")
% (i.b) the density function corresponding to the distribution that fits the "full dataset"
% (ii.a) the "truncated histogram" (from the "truncated data")
% (ii.b) the density function corresponding to the distribution that fits the "truncated histogram"
xgrid = linspace(0,100,1000)';
hold on
histogram(data_full,100,'Normalization','pdf','facecolor','red')
line(xgrid,pdf(pd_fit_full,xgrid),'Linewidth',2,'color','red')
histogram(data_trunc,100,'Normalization','pdf','facecolor','blue')
line(xgrid,pdf(pd_fit_trunc,xgrid),'Linewidth',2,'color','blue')
hold off
xlim([0 10])
  1 Comment
Sim
Sim on 20 Jun 2023
Edited: Sim on 20 Jun 2023
@Torsten thanks a lot!! It is a very good answer, but then, I was thinking, how can we really now how is the distribution below the last known value, i.e. 3 ? (I got this doubt, thanks to your nice answer, thanks a lot!)
I mean, Yes, it works if you already know that the underlying distribution is a normal one.... but what if you do not have any previous knowledge?
For example, in my case, the real one, and not this "toy-one", I do not have a real knowledge, and I use the best fitting distribution tool to understand approximately how data could be distributed, at least for the known part, i.e. the over the value 3...

Sign in to comment.


the cyclist
the cyclist on 19 Jun 2023
I haven't used it much, but this File Exchange submission seems to do a pretty good job.
pd = makedist('Normal','mu',3);
t = truncate(pd,3,inf);
data_trunc = random(t,10000,1);
[norm_trunc, phat, phat_ci] = fitdist_ntrunc(data_trunc, [3, Inf]);
xgrid = linspace(0,100,1000)';
figure
histogram(data_trunc,100,'Normalization','pdf','facecolor','blue')
line(xgrid,norm_trunc(xgrid,phat(1),phat(2)),'Linewidth',2,'color','red')
xlim([0 10])
  1 Comment
Sim
Sim on 20 Jun 2023
@the cyclist Thanks a lot! ......But.....does that work also for other distributions? (Yes, in my question I asked about a set of data that are normally distributed, but it was just as to make it simple...actually I might have a variety of distributions....)

Sign in to comment.

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!