kstest - normal?

Ian on 31 Mar 2011
Hi, I am confused by the description of the 'kstest' function. Usually '1' means true and '0' means false, and the purpose of this function is to test whether or not a set of data is normally distributed. However, from what I gather from the description, '0' is returned when the data is normally distributed, and '1' is returned when the data is not normally distributed.
Is this the correct interpretation? The example is also a little confusing:
x = -2:1:4
x = -2 -1 0 1 2 3 4
[h,p,k,c] = kstest(x,[],0.05,0)
h = 0
p = 0.13632
k = 0.41277
c = 0.48342
These data are linear, not normally distributed. Yet kstest returns '0', which seems to mean it classifies these data as normal. Is this a limitation of the kstest with small data samples?
From what I read, the resolution is to use the 'smaller' or 'larger' tag to correct for this problem, but is there any clear cut-off between what counts as 'smaller' and what counts as 'larger'?
Lastly, suppose I were to use this test in a publication and report that our data were 'normal' (the function returned 0) or could not be classified as 'normal' (the function returned 1), and that I used the 'smaller' or 'larger' tags. How does that change the name of the test? It can't be the same test if it returns different values. How would I explain this?

Accepted Answer

Andrew Newell on 31 Mar 2011
Your example (taken from the documentation) "illustrates the difficulty of testing normality in small samples." If you plot
normplot(x)
you'll see that the deviations from a standard normal distribution occur in the two outer points. It doesn't take a lot more data to get a reasonable result, though:
x = -2:0.5:4;
[h,p,k,c] = kstest(x,[],0.05,0)
h = 1
p = 0.0245
k = 0.3947
c = 0.3614
Keep in mind, too, their comment about the Lilliefors test - it is more likely to be the one you want.
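For reference, a minimal sketch of how the Lilliefors test could be applied to that larger sample (lillietest estimates the mean and variance from the data, so the result should still be read with the small sample size in mind):
% Rough sketch: Lilliefors test on the same larger sample.
% Unlike kstest against a standard normal, lillietest estimates the
% mean and variance from the data itself, so x need not be standardized.
x = -2:0.5:4;
[h,p] = lillietest(x)   % h = 0 means normality is not rejected at the 5% level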
  2 Comments
the cyclist on 31 Mar 2011
Andrew, I think you meant "normplot(x)" rather than "normpdf(x)" here.
Andrew Newell on 31 Mar 2011
Oops!


More Answers (2)

the cyclist on 31 Mar 2011
Ian,
There are lots and lots of things that need to be addressed here. I'll try to cover as much as I can.
First, in your little example, you only have seven data points. Therefore, the statistical test you are applying has very little power to distinguish between normal and non-normal distributions. Note, though, that if you added even one more point, x = -2:1:5, the K-S test would reject the null hypothesis. I hope that the real study you are planning to submit has more data than this!
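For example, using the same call signature as in the documentation example (treat the exact numbers as illustrative, since they can vary slightly by MATLAB version):
% Sketch: repeat the two-sided test with one extra point.
x = -2:1:5;                    % eight points instead of seven
[h,p] = kstest(x,[],0.05,0)    % TAIL = 0 is the two-sided test
% If the point above holds, h should come back as 1 (null hypothesis rejected).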
The test certainly does not "classify these data as normal"! It fails to reject the hypothesis that the data are normally distributed. That's an important distinction. Given this dataset, you should not say your data are normal.
The data [-2 -1 0 1 2 3 4] are not, in and of themselves, "linear". They are seven data points that you just happen to know you generated linearly.
The resolution of this issue is not to use the additional arguments "larger" or "smaller". Those arguments are more related to one's expectation that the distribution being sampled is skewed toward one side or the other of normal. I don't think those are relevant here. (But, the way it would be described, if it were relevant, would be to say you used a one-sided KS test rather than two-sided.)
There are other tests of normality that may also be useful to you: jbtest and lillietest.
I would say that if it is important to distinguish normality, then, sadly, you do not have enough data to do so confidently.
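If you do want to try the other tests mentioned above, a rough sketch on the same seven points would look like the following (assuming jbtest accepts a sample this small; with so little data any of these tests has low power, and they may warn that the p-value lies outside their tabulated range):
% Sketch: another normality test on the seven-point example (illustrative only).
x = -2:1:4;
h_jb = jbtest(x)   % Jarque-Bera test, based on the sample skewness and kurtosis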
  6 Comments
N on 29 Jan 2020
On a side note related to the definition of the tails:
  • when using 'Tail' set to 'smaller' we are testing whether the distribution is left skewed
  • when using 'Tail' set to 'larger' we are testing whether the distribution is right skewed
Is this correct?
the cyclist on 30 Jan 2020
% Set random number seed to default
rng default
% Generate data that is clearly shifted larger than standard normal
% (I'm not sure I would refer to this as "right skewed", but I think this is what you mean.)
N = 1000;
x = randn(N,1) + 5;
% Null hypothesis that the distribution is larger than standard normal is NOT rejected
h_larger = kstest(x,'Tail',"larger")
% Null hypothesis that the distribution is unequal to standard normal IS rejected
h_unequal = kstest(x,'Tail',"unequal")
% Null hypothesis that the distribution is smaller than standard normal IS rejected
h_smaller = kstest(x,'Tail',"smaller")



Matt Tearle on 31 Mar 2011
The output is the more likely hypothesis, not a true/false. Hence, h = 0 means the null hypothesis (H0), which is that the data come from the assumed distribution.
The smaller/larger options are for performing one-sided tests, e.g. if your data came from a normal distribution with a positive mean.
Other than that, see Andrew's answer. In particular, look at lillietest and jbtest.
  2 Comments
the cyclist on 31 Mar 2011
h=0 does not mean that the null hypothesis is the more likely hypothesis. It means only that the null hypothesis cannot be rejected at the specified level of confidence.
Matt Tearle on 31 Mar 2011
Yes, but given that it returns a single value, 0 or 1, I was trying to find a way to convey that this return value is the "decision" (H0 or H1), rather than a true/false.

