School of Population Health

Appendix 3. Partition criteria

 

Referring to the Criteria dialog, there are 8 options for assessing partitions:


A3.1 Within MSE

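This is the measure described in reference 2, equation (7), viz., with n_A and n_{A-} observations in A and A- and n = n_A + n_{A-} in all,

$$G = i(A \cup A^-) - \left[ \frac{n_A}{n}\, i(A) + \frac{n_{A^-}}{n}\, i(A^-) \right] \qquad (1)$$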

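where i is impurity measured by mean square error, for example, in A, with n_A observations y_j having mean \bar{y}_A,

$$i(A) = \frac{1}{n_A} \sum_{j \in A} (y_j - \bar{y}_A)^2$$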

where i(A ∪ A-) refers to the impurity of the whole sample. This measure can be calculated for interval, nominal or binary Y. However, note that for nominal Y, the category codes are treated simply as numeric values, so that the index may not have a sensible interpretation.

For binary Y, G is the same as the Gini index of diversity (see below), apart from a factor of 2.
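The factor of 2 can be checked numerically. The sketch below assumes that G is the whole-sample impurity minus the size-weighted impurities of A and A-, and that the binary Gini diversity is 1 - p^2 - (1 - p)^2. The function names and the data are illustrative only and are not part of SPAN.

```python
def mse_impurity(ys):
    """Mean square error of the values about their own mean."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)


def within_mse(y_in_a, y_not_a):
    """Assumed form: whole-sample impurity minus size-weighted group impurities."""
    whole = list(y_in_a) + list(y_not_a)
    n = len(whole)
    weighted = (len(y_in_a) / n) * mse_impurity(y_in_a) \
        + (len(y_not_a) / n) * mse_impurity(y_not_a)
    return mse_impurity(whole) - weighted


def gini_binary(ys):
    """Gini diversity 1 - p^2 - (1 - p)^2 for a 0/1 outcome."""
    p = sum(ys) / len(ys)
    return 1.0 - p ** 2 - (1.0 - p) ** 2


# Hypothetical binary outcomes in A and A-.
y_in_a = [1, 1, 1, 0]
y_not_a = [0, 0, 1, 0]

print("Within MSE G:", within_mse(y_in_a, y_not_a))
# For a 0/1 outcome the MSE impurity is p(1 - p), half the Gini diversity.
print("Gini:", gini_binary(y_in_a), "2 x MSE:", 2 * mse_impurity(y_in_a))
```

Because the relation holds for each impurity term separately, it carries over to any measure built as a linear combination of impurities.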

A3.2 Subgroup MSE

This is the measure described in reference 2 that assesses the effectiveness of a partition in terms of its constituent subgroups. Specifically:

 

where

 

and where impurity i is mean square error.

This measure can be used for any Y, as for the Within MSE measure. Note that it is possible for negative values of G to arise with this index.

My experience with it now is that it is best avoided. It may give subgroups that are reasonably pure with respect to the outcome, but, apparently paradoxically, not necessarily good overall purity in terms of A and A-.

A3.3 Entropy

This is as in equation (1) but with an entropy measure instead of mean square error measuring impurity. It is allowed for nominal outcomes, with up to 10 categories (represented by the digits 0, 1, ..., 9).

Suppose there are k categories and p = (p1,...,pk) are the probabilities assigned to each. Then i() is defined as

$$i(p) = -\sum_{j=1}^{k} p_j \log p_j$$

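The entropy measure is then, with n_A, n_{A-} and n the numbers of observations in A, A- and the whole sample,

$$G = i(p) - \left[ \frac{n_A}{n}\, i(p_A) + \frac{n_{A^-}}{n}\, i(p_{A^-}) \right] \qquad (3)$$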

where p is the distribution in the whole sample, pA in A and pA- in A-. Note that the entropy measure does not extend to equation (2), that is, you cannot use a subgroup purity measure with other than mean square error for impurity.
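As a rough illustration of the entropy measure, the sketch below assumes natural logarithms, sample proportions for the p's, and a measure of the same form as the within MSE measure (whole-sample impurity minus the size-weighted impurities of A and A-). The function names and data are hypothetical.

```python
import math


def entropy(p):
    """Entropy -sum p_j log p_j; empty categories contribute zero."""
    return -sum(pj * math.log(pj) for pj in p if pj > 0)


def proportions(ys, k):
    """Sample proportions of the categories 0..k-1."""
    return [sum(1 for y in ys if y == j) / len(ys) for j in range(k)]


def entropy_measure(y_in_a, y_not_a, k):
    """Assumed form: i(p) minus size-weighted i(pA) and i(pA-)."""
    whole = list(y_in_a) + list(y_not_a)
    n = len(whole)
    weighted = (len(y_in_a) / n) * entropy(proportions(y_in_a, k)) \
        + (len(y_not_a) / n) * entropy(proportions(y_not_a, k))
    return entropy(proportions(whole, k)) - weighted


# A partition that separates the two categories perfectly ...
perfect = entropy_measure([0, 0], [1, 1], k=2)
# ... and one that leaves both subgroups as mixed as the whole sample.
useless = entropy_measure([0, 1], [0, 1], k=2)
print(perfect, useless)
```

Under these assumptions a perfect split recovers the whole-sample entropy, and a split that changes nothing scores zero.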

A3.3.1 Prior probabilities

In general the p's in the above are the sample proportions in the data. However, you can specify the p's to account for prior probabilities. Suppose p1,...,pk are specified prior probabilities for each category of the outcome. Then you can work out the first term i(p) in (3) with these values. The pA and pA- are then calculated from the data and the pj's. Specifically, using Bayes' formula, the jth element of pA is

$$p_{A,j} = \frac{p_j\, p(A \mid j)}{\sum_{l=1}^{k} p_l\, p(A \mid l)}$$

where p(A|j) is the proportion of the data in category j that falls in A.

There are three radio buttons to assign priors in the Criteria dialog box: Data priors are the sample proportions in the whole data; Equal priors assign pj = 1/k; User priors are values input by the user.
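The prior adjustment can be sketched as follows. The code applies Bayes' formula to hypothetical data: the prior for each category is multiplied by the proportion of that category falling in A, and the result is normalised. The function name and the data are assumptions for illustration, not SPAN's own routine.

```python
def distribution_in_a(y, in_a, priors):
    """Distribution of the outcome within A for given prior probabilities.

    y      : category codes 0..k-1, one per observation
    in_a   : True where the observation belongs to A
    priors : prior probability of each category
    """
    joint = []
    for j, prior in enumerate(priors):
        n_j = sum(1 for yj in y if yj == j)
        n_j_in_a = sum(1 for yj, a in zip(y, in_a) if yj == j and a)
        p_a_given_j = n_j_in_a / n_j if n_j else 0.0
        joint.append(prior * p_a_given_j)
    total = sum(joint)  # zero only if A is empty under these priors
    return [v / total for v in joint]


# Hypothetical data: four observations in category 0, two in category 1.
y = [0, 0, 0, 0, 1, 1]
in_a = [True, False, False, False, True, True]

data_priors = [4 / 6, 2 / 6]   # sample proportions in the whole data
equal_priors = [1 / 2, 1 / 2]  # pj = 1/k

print(distribution_in_a(y, in_a, data_priors))
print(distribution_in_a(y, in_a, equal_priors))
```

With data priors the result is simply the sample proportions within A; with equal priors the rarer category is given more weight.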

A3.4 Quality index QI(r)

This is the quality index QI(r) of a 2 by 2 table as described by Kraemer (reference 4, equation 6.1). It is only applicable for binary Y. The relative cost r is set by the Costs that are assigned, so the Specify Costs button must be checked in the Criteria dialog. The r value is calculated as a proportion of the excess costs. For example, for the following specified costs the excess cost for outcome 1 is (5-1)=4 and for outcome 0 it is (10-2)=8, so that r = 4/(8+4) = 0.33.
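The arithmetic of this example can be written out as a small sketch. The parameter names are hypothetical; the text only states that each outcome has a higher and a lower cost whose difference is the excess cost.

```python
def relative_cost(high_cost_1, low_cost_1, high_cost_0, low_cost_0):
    """Relative cost r: excess cost for outcome 1 over total excess cost."""
    excess_1 = high_cost_1 - low_cost_1
    excess_0 = high_cost_0 - low_cost_0
    return excess_1 / (excess_0 + excess_1)


# Costs from the example: 5 and 1 for outcome 1, 10 and 2 for outcome 0.
r = relative_cost(5, 1, 10, 2)
print(round(r, 2))
```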

 

A3.5 Chi-square

This is the usual chi-square statistic, not corrected for continuity. It can only be used for a Y that is defined as nominal, ordinal or binary in the control file. Accordingly, the categories of Y must be coded with a single digit and are limited to 10 possible values (see 9.3.1).

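Note that for a 2 by 2 table, chi-square is the well known quantity

$$\chi^2 = \frac{N (ad - bc)^2}{A\,B\,C\,D}$$

where a, b, c, d are the counts in the four cells, N = a + b + c + d, and A, B, C, D are the marginal totals, with A and B the sums in A and A- and C and D the sums in the two groups of Y. If you use this measure and the balancing option with g = 1, the statistic is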

 

since C and D are fixed. In other words, the measure is equivalent to |ad-bc| which has sometimes been suggested as an effectiveness measure.

Note also that as χ²/N = φ², where φ is the so-called phi-coefficient, using χ² is equivalent to using φ², which is itself the same thing as the multiple correlation coefficient and is a good prognostic discriminator for binary outcomes (see Buyse, M. Statistics in Medicine, 18, 271-274 (2000)).
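A short sketch of these relations, using the standard uncorrected 2 by 2 formula N(ad - bc)^2 / (ABCD). It assumes a and b are the cells in A, c and d the cells in A-, and a and c the cells in the first group of Y. The table is hypothetical, and the squared correlation is computed between the two 0/1 indicators (membership of A, and membership of the first group of Y).

```python
def chi_square_2x2(a, b, c, d):
    """Uncorrected chi-square for a 2 by 2 table with cells a, b / c, d."""
    n = a + b + c + d
    row_a, row_not_a = a + b, c + d   # sums in A and A-
    col_1, col_2 = a + c, b + d       # sums in the two groups of Y
    return n * (a * d - b * c) ** 2 / (row_a * row_not_a * col_1 * col_2)


def phi_squared(a, b, c, d):
    """Squared phi-coefficient, chi-square divided by N."""
    return chi_square_2x2(a, b, c, d) / (a + b + c + d)


def squared_correlation(a, b, c, d):
    """Squared Pearson correlation between the two 0/1 indicators."""
    n = a + b + c + d
    p_x, p_y = (a + b) / n, (a + c) / n
    covariance = a / n - p_x * p_y
    return covariance ** 2 / (p_x * (1 - p_x) * p_y * (1 - p_y))


table = (30, 10, 20, 40)  # hypothetical counts a, b, c, d
print(chi_square_2x2(*table))
print(phi_squared(*table), squared_correlation(*table))
```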

A3.6 Odds ratio (Bayes)

This is only applicable to binary Y and is the quantity

$$\frac{(a + 1)(d + 1)}{(b + 1)(c + 1)}$$

where a,b,c,d are counts in the 4 cells. This is a Bayes estimator in the sense that augmenting each cell by unity is equivalent to prior information of one observation in each cell.
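A minimal sketch of such an estimator, assuming it is the ordinary cross-product ratio ad/bc with every cell count increased by one. That form is inferred from the description of augmenting each cell by unity; the function name and counts are hypothetical.

```python
def bayes_odds_ratio(a, b, c, d):
    """Cross-product odds ratio with one observation added to each cell (assumed form)."""
    return ((a + 1) * (d + 1)) / ((b + 1) * (c + 1))


# The augmented ratio stays finite even when a cell is empty.
print(bayes_odds_ratio(30, 10, 20, 40))
print(bayes_odds_ratio(5, 0, 2, 7))
```

One practical consequence of the augmentation is that partitions producing an empty cell can still be ranked.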

A3.7 Log-rank statistic

This is the log-rank statistic for testing differences between two survival functions. The outcome measure may be censored. If it is, you must have an attribute that signifies censoring. You will be shown a menu of all created attributes and asked to pick the one that indicates censoring. None of the other effectiveness measures allow for censoring. SPAN takes no account of possible tied values in the computation of log-rank.
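For orientation, the following sketch computes the standard two-sample log-rank chi-square for A versus A-, assuming all times are distinct so that no tie handling is needed. It is not SPAN's own code, and the argument names are hypothetical: `events` is True for an observed event and False for a censored time.

```python
def log_rank(times, events, in_a):
    """Two-sample log-rank chi-square, assuming no tied times."""
    records = sorted(zip(times, events, in_a))
    at_risk = len(records)
    at_risk_a = sum(1 for a in in_a if a)
    observed = expected = variance = 0.0
    for _, event, a in records:
        if event:
            share = at_risk_a / at_risk      # fraction of the risk set in A
            observed += 1.0 if a else 0.0    # events actually seen in A
            expected += share                # events expected in A
            variance += share * (1.0 - share)
        # Events and censored observations both leave the risk set.
        at_risk -= 1
        if a:
            at_risk_a -= 1
    return (observed - expected) ** 2 / variance  # needs at least one informative event


# Hypothetical data: A holds the observations at times 1, 3 and 5.
times = [1, 2, 3, 4, 5, 6]
events = [True] * 6
in_a = [True, False, True, False, True, False]
print(log_rank(times, events, in_a))
```

Exchanging A and A- leaves the statistic unchanged, since the observed and expected counts in the two groups sum to the same total.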

When log-rank is selected SPAN enters a mode in which incidence rates and incidence rate ratios are output rather than means.

You cannot have a multivariate log-rank measure.

A3.8 Gini diversity

This is as in equation (1) but with a Gini index of diversity measure instead of mean square error measuring impurity. It is allowed for nominal outcomes, with up to 10 categories (represented by digits 0 to 9).

Suppose there are k categories and pj, j = 1,...,k, are the probabilities assigned to each. Then i() is defined as

$$i(p) = 1 - \sum_{j=1}^{k} p_j^2$$

Note that this measure does not extend to equation (2), that is, you cannot use a subgroup purity measure with other than mean square error for impurity.

The index can be specified with user defined prior probabilities, as for the Entropy index (see A3.3.1).
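A brief sketch of the Gini version, assuming the usual definition of Gini diversity (one minus the sum of squared probabilities), sample proportions for the p's, and a measure formed as the whole-sample impurity minus the size-weighted impurities of A and A-. The names and counts are illustrative.

```python
def gini(p):
    """Gini index of diversity, 1 - sum of squared probabilities."""
    return 1.0 - sum(pj ** 2 for pj in p)


def gini_measure(counts_in_a, counts_not_a):
    """Assumed form: i(p) minus size-weighted i(pA) and i(pA-).

    Each argument lists the number of observations per category.
    """
    n_a, n_not_a = sum(counts_in_a), sum(counts_not_a)
    n = n_a + n_not_a
    p = [(x + y) / n for x, y in zip(counts_in_a, counts_not_a)]
    p_a = [x / n_a for x in counts_in_a]
    p_not_a = [y / n_not_a for y in counts_not_a]
    return gini(p) - ((n_a / n) * gini(p_a) + (n_not_a / n) * gini(p_not_a))


# Three categories; A captures most of category 0 (hypothetical counts).
print(gini_measure([8, 1, 1], [2, 4, 4]))
# A perfect two-category split recovers the whole-sample diversity of 0.5.
print(gini_measure([5, 0], [0, 5]))
```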

A3.9 Directional v. Non-directional indices

Note that, with the exception of the Odds ratio and Quality indices, all the effectiveness criteria are non-directional, in the sense that, for example, a partition A = {x = 0}, for a binary variable x, will score precisely the same as A = {x = 1}. That is, non-directional effectiveness measures do not explicitly assess the SPAN paradigm: "A corresponds to high Y and A- to low Y" (see 3). However, provided positive attributes are appropriately constructed, so that they are indicative of high Y, the situation where the best partition is the reverse of the SPAN paradigm is unlikely to occur, unless the synergistic effect of a combination of positive attributes produces a complete reversal of the individual effects.

Note, however, that when ranking univariate partitions (see 12) with a non-directional measure, partitions that are the reverse of the SPAN paradigm may score well. For example, in the extreme situation in which two attributes {x = 0} and {x = 1} are created and each assigned a positive designator (which can be achieved by having consecutive lines x b 1 and x b 0 in the control file), both attributes will be tied on the ranking procedure.
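The distinction can be seen on a single hypothetical 2 by 2 table. In the sketch below, exchanging A and A- leaves the uncorrected chi-square unchanged but inverts the odds ratio. The odds ratio form with one added to each cell is an assumption, and both functions are illustrative rather than SPAN's own.

```python
def chi_square_2x2(a, b, c, d):
    """Uncorrected chi-square for cells a, b (in A) and c, d (in A-)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def bayes_odds_ratio(a, b, c, d):
    """Odds ratio with one observation added to each cell (assumed form)."""
    return ((a + 1) * (d + 1)) / ((b + 1) * (c + 1))


table = (30, 10, 20, 40)          # a, b in A; c, d in A-
swapped = table[2:] + table[:2]   # the same data with A and A- exchanged

# Non-directional: the two scores are identical.
print(chi_square_2x2(*table), chi_square_2x2(*swapped))
# Directional: the reversed partition scores the reciprocal.
print(bayes_odds_ratio(*table), bayes_odds_ratio(*swapped))
```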

A3.10 Multiple Y measures

When a multivariate set of outcomes is selected, say Y = (Y1,Y2,...,Yk), the multiple effectiveness measure is the sum of the individual measures of the Yr's. For example, if Gr is the measure for Yr, the multiple measure is

$$G = \sum_{r=1}^{k} G_r \qquad (4)$$

If the individual Yr are measured on quite different scales it is sometimes sensible to re-scale the Yr by dividing each by its overall sample standard deviation. For measures based on mean square error, this is equivalent to attaching a weight to each term of the sum in (4), that is, forming

$$G_w = \sum_{r=1}^{k} w_r G_r$$

where w_r = 1/s_r^2 is the inverse sample variance of Yr. When multiple Y is selected you can tick the inverse variance weighting in the Criteria dialog box.
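The effect of the weighting can be sketched as follows, assuming the individual measures Gr have already been computed and that the sample variance uses the n - 1 divisor (the divisor SPAN uses is not stated here). All names and numbers are hypothetical.

```python
def sample_variance(ys):
    """Sample variance with the n - 1 divisor (an assumption)."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / (len(ys) - 1)


def multiple_measure(measures, outcomes=None):
    """Sum of the individual measures, optionally inverse-variance weighted.

    measures : one effectiveness measure G_r per outcome
    outcomes : the observed values of each Y_r; if given, each G_r is
               weighted by 1 / (sample variance of Y_r)
    """
    if outcomes is None:
        return sum(measures)
    return sum(g / sample_variance(ys) for g, ys in zip(measures, outcomes))


# Two outcomes on very different scales (hypothetical values).
outcomes = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
measures = [0.5, 50.0]

print(multiple_measure(measures))            # dominated by the second outcome
print(multiple_measure(measures, outcomes))  # both outcomes contribute equally
```

Without weights the outcome with the larger scale dominates the sum; with inverse variance weights the two contribute on a comparable footing.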

When Gr in (4) is the within MSE in (1), it can be shown that maximising (4) is equivalent to maximising the Euclidean distance metric

 

where

 

is the total distance in the k-dimensional space between observations, and DA and DA- are corresponding measures in A and A-. In other words, using multiple Y can be considered as a means to form two "clusters" in k-dimensional space, A and A-, that are homogeneous with respect to Y. The clusters are defined by attributes rather than, as in a conventional cluster analysis, only by their observation numbers.

There are certain restrictions on specifying multivariate Y: you cannot use the Gini or Entropy measures with user specified probabilities and you cannot use the log-rank criterion.
