School of Population Health


Appendix 7. Assessing statistical significance

Although the objective of SPAN is not to test for statistical significance - rather it is an exploratory data analysis tool - the question of statistical significance of a partition is often posed.

P-values are output by various SPAN procedures, for example those associated with frequency tables and the Rank attributes procedure. These come from standard χ² or F tests. However, their validity is sometimes questioned when a partition is the outcome of an extensive search. The argument is that if you search hard enough you will eventually come up with something statistically significant. The counterargument is: why should the fact that a partition has arisen from a search make any difference to interpreting the meaning or importance of the result? It could be considered illogical that what we think about the significance of a partition should depend on how much searching was done to find it.

One way to test statistical significance of a partition is to generate the sampling distribution of the measure of effectiveness that arises from "random partitions" of the data. If the effectiveness of a derived best partition is in the extreme tail of the sampling distribution, it suggests that the partition is unlikely to be one of the population of random partitions.

A7.1 Non-randomness of generated partitions


The Process:Random facility (see 16.6) allows the random partition sampling distribution to be generated by repeatedly dividing the sample in two at random. By comparing the random partition distribution of the effectiveness measure G against the distribution of G in a SPAN search, some indication can be obtained of whether the search has generated a set of random partitions.
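As a rough illustration of what repeatedly dividing the sample in two at random might look like, the sketch below builds a random partition distribution in Python. It is not SPAN's implementation: the function names, the made-up 0/1 outcome data and the stand-in effectiveness measure (absolute difference in mean outcome between the two sides) are all assumptions made for illustration, and SPAN's own G is not reproduced here.

```python
import random

def stand_in_effectiveness(outcomes, in_first_side):
    """Hypothetical stand-in for G: absolute difference in the mean
    outcome between the two sides of a partition (not SPAN's measure)."""
    a = [y for y, flag in zip(outcomes, in_first_side) if flag]
    b = [y for y, flag in zip(outcomes, in_first_side) if not flag]
    if not a or not b:          # degenerate split: one side is empty
        return 0.0
    return abs(sum(a) / len(a) - sum(b) / len(b))

def random_partition_distribution(outcomes, m, rng):
    """Return m effectiveness values, one per random two-way split."""
    values = []
    for _ in range(m):
        # Assign each observation to one side of the partition at random.
        flags = [rng.random() < 0.5 for _ in outcomes]
        values.append(stand_in_effectiveness(outcomes, flags))
    return sorted(values)

rng = random.Random(0)                               # fixed seed for repeatability
outcomes = [rng.randint(0, 1) for _ in range(200)]   # made-up 0/1 outcomes
random_g = random_partition_distribution(outcomes, 1000, rng)
```

The sorted list `random_g` plays the role of the empirical sampling distribution of G under random partitioning.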

Suppose the size of the SPAN search just done is N and the effectiveness values are ranked G(1),...,G(N). A similar set of G values, say G(1)′,...,G(M)′, for M random partitions can be generated. A test of the null hypothesis that the G(1),...,G(N) values are random can be obtained by comparing the empirical distribution of G(1)′,...,G(M)′ against that of G(1),...,G(N). Formally a Kolmogorov-Smirnov test can be done, though usually a visual comparison of distribution or survival curves (given that N and M are usually large) provides a good indication of randomness.
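The comparison of the two empirical distributions can be sketched as follows. This is a minimal, standard-library computation of the two-sample Kolmogorov-Smirnov statistic (the largest vertical gap between the two empirical distribution functions); the function and argument names are hypothetical, and the sketch does not show how SPAN itself performs the comparison. It returns the statistic only, not a P-value.

```python
from bisect import bisect_right

def ks_two_sample_statistic(search_g, random_g):
    """Largest absolute gap between the empirical distribution
    functions of the search G values and the random-partition G values."""
    xs = sorted(search_g)
    ys = sorted(random_g)
    n, m = len(xs), len(ys)
    d = 0.0
    # Both empirical CDFs are step functions that change only at data
    # points, so checking every observed value is sufficient.
    for v in xs + ys:
        fx = bisect_right(xs, v) / n   # fraction of search values <= v
        fy = bisect_right(ys, v) / m   # fraction of random values <= v
        d = max(d, abs(fx - fy))
    return d
```

A statistic near 0 indicates that the two distributions are close; a value near 1 indicates that they barely overlap.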

A7.2 Significance of the best partition


Although the above procedure provides a general test of randomness, an alternative view is to focus on the significance of the particular partition with G value G*, say. If f(G) is the empirical sampling distribution of G based on M random partitions, the significance of G* can be assessed with respect to f(G) by simply counting how many random partitions exceed G* and dividing by M to give a P-value. This is given by Exceedances > best G P-value in SPAN output.
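The counting rule just described amounts to a one-line calculation. The sketch below is illustrative only; the function name and the example values are made up.

```python
def exceedance_p_value(g_star, random_g):
    """Proportion of the M random-partition G values that exceed G*."""
    exceedances = sum(1 for g in random_g if g > g_star)
    return exceedances / len(random_g)

# Made-up example: 2 of the 5 random partitions exceed G* = 0.35.
p = exceedance_p_value(0.35, [0.1, 0.2, 0.3, 0.4, 0.5])
```

With M random partitions the smallest non-zero P-value that can be estimated this way is 1/M.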

If G* is the best partition in a search of size N, that is G(N), then the problem alluded to above is raised. How should the significance of G(N) be assessed with respect to f(G)? There is no clear answer, but there are two possible approaches:

First, you can simply obtain the P-value of G(N) with respect to f(G) as above for G*, disregarding the fact that it was the best of a search.

From an alternative viewpoint, similar to the multiple comparisons difficulty, the above P-value is invalid. As G(N) arose from a search of N partitions, in which it emerged as the largest G, it should be compared against the Nth order statistic distribution of G, that is, the distribution of the largest G in repeated searches of size N. To empirically assess the significance in this way requires more work. For example, if N = 500, you would require M = 50000 random partitions to obtain a sample of 100 G(500)'s. The number that exceeded the actual G(500) would estimate the required P-value, as given by On 100 extrema ....
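The order-statistic comparison can be sketched as below: the random-partition G values are cut into consecutive blocks of size N, the largest value in each block is taken as one simulated "best of a search", and the P-value is the proportion of those maxima that exceed the observed best G. The function name is hypothetical, and the sketch assumes the random G values are supplied in the order they were generated (not sorted); it is not a description of SPAN's internals.

```python
def extrema_p_value(best_g, random_g, n):
    """Compare best_g against the maxima of consecutive blocks of n
    random-partition G values (random_g must be in generation order)."""
    blocks = len(random_g) // n
    if blocks == 0:
        raise ValueError("need at least n random partitions")
    # One maximum per simulated search of size n.
    maxima = [max(random_g[i * n:(i + 1) * n]) for i in range(blocks)]
    exceedances = sum(1 for g in maxima if g > best_g)
    return exceedances / blocks

# Made-up example: 12 random G values, searches of size 3 -> 4 maxima.
values = [1, 5, 2, 3, 9, 4, 0, 2, 1, 7, 8, 6]
p = extrema_p_value(6, values, 3)   # maxima are 5, 9, 2, 8
```

With N = 500 and M = 50000 the same function would yield 100 block maxima, so the estimated P-value would be resolved only to the nearest 0.01.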

Alternatively, extreme value theory can be invoked: if the order statistic distribution is assumed to follow Gumbel's Type III extreme value distribution - which may be a good approximation - a P-value can be obtained from the (N-1)/Nth and (Ne-1)/Neth percentiles of f(G) (see Mood, Graybill and Boes, Introduction to the Theory of Statistics, McGraw-Hill). This procedure is also used in SPAN, the percentiles being estimated non-parametrically from the cumulative sampling distribution. The result is output as Non-parametric extreme value statistic P-value.

 
