## School of Population Health

# Frequently asked questions

- » What is SPAN?
- » What operating systems support SPAN?
- » What is a simple strategy to generate a partition?
- » How many attributes should I include in a search?
- » The windows to view output and data seem awkward. Why?
- » How can cross-validation be done?
- » How can I create an ROC curve?
- » How do I set the complexity penalising parameter b?
- » What is the point of the balancing parameter g?
- » Can SPAN read character data?
- » Does SPAN read other data formats?
- » Why can't I close windows?
- » How are the ordinal classifier procedures in Statistics in Medicine (1999) done?
- » How does SPAN compare to CART?
- » What is the difference between specifying multiple attributes from one line of a control file and by separate lines?
- » But I don't want to specify an attribute as either positive or negative?
- » What are rectangle diagrams and how do I make them?
- » How can I use SPAN for Boolean algebra?
- » How can I use SPAN to do a tree-based analysis?
- » Which measure of effectiveness should I use?
- » Does SPAN handle censored survival data?
- » How large can data sets be?
- » How can I test the "statistical significance" of a partition?

## What is SPAN?

SPAN is a program designed, ostensibly, to implement Search Partition Analysis procedures. Key references are: Marshall, RJ. Statistics in Medicine; 5, 517-526(1986); 14, 2645-2659 (1995); and 18, 2723-2736 (1999).

Statistics in Medicine, 18, 2723-2736 (1999) (118.5KB PDF)

These are statistical procedures to give decision and classification rules based on Boolean combinations of predictors. The program does, however, have other statistical analysis features. For example, tree-based analysis, ROC curves, scaled rectangle diagrams, mosaic plots, distributions, scatter diagrams, rotating 3D plots.

## What operating systems support SPAN?

SPAN runs under Windows systems but there are no Mac or Unix versions.

## What is a simple strategy to generate a partition?

Suppose you wish to develop a two-class classifier from a set of binary predictors. You have data on the two states in variable *Y*, coded 0 and 1, and *X*_{1},*X*_{2},...,*X*_{20} are twenty binary predictors. You need to create attributes from the predictors in a control file, each with a line like:

X4 b 1=yes

ie, a value 1 represents a *yes* to whatever *X*_{4} represents. The simple way to do a SPAN analysis, with no cross-validation, is to select a *Y* from the *Y* dropdown menu. Once done do *Search:Go *to initiate the search. This will run a search with the default settings in the *Criteria*, *Rank*, *Strategy* and *Extent* drop down menus. The defaults consist of: the Within Mean Square error criterion of effectiveness, utilising the best m=10 attributes, automatically deciding on "positivity" of attributes, iterative mode for partition generation, up to extent parameters *p*_{1}...*p _{q}* = 2 2

If you want to change the default settings, go to each of these dialogs and change the parameters as required. You need to select a partition effectiveness criterion from the criteria dialog.

## How many attributes should I include in a search?

The attribute set for a search is set in the *Strategy* dialog box. The number of attributes selected, in conjunction with the parameters in the *Extent* dialog box, determines how big a search will be. For *m* = 10 attributes and *p*_{1}...*p _{q}* = 2 1 with no floating cuts, the search size is 910. Increase to

*m*= 12 and the search size becomes 1596. If you make

*p*

_{1}...

*p*= 2 2 and corresponding figures are 2080 and 4434. For an indication of search sizes, see this table in the SPAN manual.

_{q}Generally speaking, use *m* £ 15.

## The windows to view output and data seem awkward. Why?

The *View* facility lets you look and the output log, data and other files. The output log has its own format and gives colour coding. The *Vie*w browser is somewhat primitive and is a DIY thing of the author's.

You can, of course, always invoke your own browser to view these files, which are just plain text. The *View* facility of the .LOG file has a benefit that certain key text lines are highlighted.

## How can cross-validation be done?

Cross validation refers to a technique for estimating error rates, using sub-samples that are not used in partition construction. At its simplest, splitting data into a training sample, to obtain a partition, and a test sample to test the partition, is a validation method.

So-called *V*-fold cross-validation works as follows: Divide the data into *V* mutually exclusive sub-samples. Remove one of these samples and estimate a partition on the remainder and then test out the partition on the removed sample. This is repeated for each of the *V* sub-samples giving *V* estimates of misclassification, each not necessarily from the same partition.

It is often suggested that *V* = 10 is a good choice. Then you have one tenth of the sample for test and the other 9/10ths for training. Suppose *L* denotes the whole sample and the vth subsample (for training) is denoted *L _{v}* and

*L*¢ is the remainder (for testing). Suppose, based on

_{v}*L*¢

*, the generalised error measure (see manual section 16.5) is*

_{v}*E*¢

*, using the*

_{v}*n*¢ observations in

_{v}*L*¢. Then the cross-validated error is weighted average of

_{v}*E*¢

*with weights*

_{v}*n*¢

_{v}In SPAN cross-validation can be achieved by using the *Y:By subgroup/Cross-validate* facility in the *Y* drop down. The *By* groups define the samples used to obtain the *V* partitions. For each By-group the data that are excluded are automatically used as the test portion. The use of the _i_ special variable (see manual section 9.9) that indexes the observations can be used to do this. For example, if *N* = 100, and there is one observation per record of data (i.e. data are not grouped) the line

_i_ 20,0 40,20 60,40 80,60 100,80

effectively forms five overlapping "training samples" each of size 80 for 5- fold cross validation: 20,0 is *L*_{1}, the set of observations excluding those indexed 1 to 20, 40,20 is *L*_{2} the set excluding those indexed 21 to 40 and so on.

Note that if data are in grouped form, this method will not necessarily produce equal sized training samples.

If the observations in a data set are in a structured order, you can use the random permutation index _r_ instead of _i_ to obtain random cross-validation samples. The shorthand cvx operator can be used to generate cross-validation group. For example use:

_r_ cv5

to create subgroups for 5-fold cross validation with 5 randomly chosen subgroups according to the quintiles of _r_. i.e. the line is equivalent to

_r_ p20,p0 p40,p20 p60,p40 p80,p60 p100,p80

The resulting cross-validated estimate of misclassification is based on the misclassification rates of *V* (possibly)* different *partitions. However, the estimate is, it is assumed, a valid misclassification rate of the partition you are interested in, that is, that constructed from the whole sample. This is partition is easily got once the cross-validation run is done by de-selection *Y:By subgroup/Cross-validate.*

## How can I create an ROC curve?

An ROC (Receiver Operating Characteristic) curve is produced using *Rank:ROC*. The function is not enabled unless the *Y* variable, which corresponds to the variable that defines the "disease" and "no disease" categories of the ROC, is binary.

The variable that defines the "test" (eg blood pressure) must be one of the other variables and you must specify a sequence of cutpoints for it in the control file. SPAN does not automatically generate cutpoints or use every unique data value as a plotting position. Plotting positions correspond to attributes created in the control file.

The "multiple cuts" (see 9.2.3) syntax is useful for specifying (up to 50) cutpoints:

BP >(80-120)2

specifies cutpoints 80, 82, 84 etc.

## How do I set the complexity penalising parameter b?

Partitions are penalised according to how complex they are by calculating *G*-b*c*, for effectiveness measure *G* and complexity *c*. If b = 0 no penalising is done. When, during a search, values of *G* are plotted against *c*, the optimal partition is effectively that which is furthest from a line drawn with slope b.

In SPAN, b can be set automatically by so requesting in the *Criteria* dialog. It also can be user specified in the *Criteria* dialog or it can be altered in the Searching window, where a red line is drawn with slope *b*, by clicking on the Beta+ and Beta- buttons to increment or decrement b (Note that this feature only works when the *Options: Monitor iterations* is checked.)

The automatic setting feature is done as described in 15.4 of the SPAN manual.

## What is the point of the balancing parameter g?

The purpose of the balancing is to try and prevent a partition into *A* and *A*- which is imbalanced, in the sense, that the number of observations in *A* and *A*- are grossly disparate. It is achieved by multiplying the effectiveness measure (Gini, entropy, or whatever) by

(*p _{A}p_{A}*-)**g

where *p _{A}* and

*p*- are the proportions in

_{A}*A*and

*A*-.

For case-control data _{g} = 1 has a theoretical justification (see Marshall, Statistics in Medicine, 14, 2645-60, 1996).

## Can SPAN read character data?

Yes, but only the initial character of the character string is read and used to define a category. So that data for a race variable with values "European" and "Estonian" would be treated as the same value, so care is required in choosing words to represent categories. Also the case of the initial character is ignored, lower case is converted to upper. Categorical states are numerically encoded when the data are read in. The numerical codes are used in some calculations. Encoding is as follows : A=10, B=11,....Z=36.

## Does SPAN read other data formats?

No. Data is read into SPAN from plain ASCII text files. Can be fixed format or free, space or comma delimited. Read the section on "Data Preparation" in the manual.

## Why can't I close windows?

The program is written using Visual Fortran and its "QuickWin" procedures to create the user interface. These do not give full Windows potential, though they are perfectly adequate. One feature of QuickWin is that the X close button in a window opened by SPAN is disabled.

## How are the ordinal classifier procedures in Statistics in Medicine (1999) done?

In this paper , it was suggested that by collapsing categories of an ordinal variable, a multiway partition could be formed.

Statistics in Medicine, 18, 2723-2736 (1999) (118.5KB PDF)

Suppose the outcome IGT (impaired glucose tolerance, as in the paper) is ordinal with states 0, 1, 2. The analysis can be done by selecting IGT as the Y variable. Because it will be seen as having just 3 categories a "Discrete Y" dialog will open. Selecting the "Ordinal" option will enable the ordinal analysis. Two consecutive (serial) searches when "Search" will be done combining states 0,1 against 2 and then 0 against 1,2 forcing the result of the first search into the second as described in the paper. Options for re-ordering states and doing a "descending" or "ascending" analysis can be done in the "Discrete Y" dialog.

## How does SPAN compare to CART?

I originally developed SPAN because of the limitations I perceived in tree-based analysis.

A critique of trees, in Journal of Clinical Epidemiology (2001) (92.4KB PDF)

Despite the criticisms, trees generally work quite well, in terms of their ability to correctly classify objects, even if the combinations used for classification are not always interpretable or meaningful.

A comparison of classification error rates by SPAN and tree, and other methods, can be found in:

Comparison of misclassification rates of search partition analysis and other classification methods (2006) (96.8KB PDF)

Generally SPAN performs reasonably well. SPAN has advantages, but also some disadvantages.

### The advantages

- The search is not local; it is achieved globally within the framework of a particular class of binary partitions.
- It is less likely to produce combinations that are counter-intuitive, because the class of binary partitions is based on the idea of combining together positive attributes.

### The disadvantages

- It applies to binary partitions and the two- way classification problem. Extending the method to multi-way partitions is feasible (see eg, ordinary classifier procedures above) but not well developed.
- SPAN requires more computer time.

## What is the difference between specifying multiple attributes from one line of a control file and by separate lines?

For example, for a variable AGE, you could have the one line

AGE >20 >30 >40

which generates three AGE attributes. Or you could have three separate lines

AGE >20

AGE >30

AGE >40

which also generates the same three attributes.

In the former case, only one of the three can be chosen to be included in a search attribute set and you have to specify which (from the *Extent* dialog). If you run *Rank* the the best of the three (the optimum), in terms of partition effectiveness, becomes the currently designated AGE attribute. However, you can let AGE "float" so that all possible AGE cutpoints are tried whenever AGE occurs in a Boolean expression.

In the latter case, the three attributes are treated as distinct and so you could include all three in the search attribute set. Different combinations of the AGE attributes would be tried. Of course, some of these combinations may simplify, for example (AGE>20 & AGE>30) = AGE>30, so that it may be inefficient to specify attributes in this way. However, in certain cases it may not. See the next FAQ, for example.

## But I don't want to specify an attribute as either positive or negative?

One objection to SPAN that is sometimes raised is that it is restrictive to have to define an attribute as positive. In different combinations, it is argued, an attribute may be positive, in others it may be negative. For example, AGE<20 may be positive for low birth weight (that is, increases the chances) but in conjunction with another attribute, that is also positive for low birthweight, perhaps RACE=blue, it might *reduce* the chance of a low birth weight baby. Reversals of this sort may sometimes occur, but are generally unlikely. If anticipated, they can be accommodated by specifying the attribute, and its complement, to *both* be positive. For example in the above example, the control file lines

AGE >20

AGE <20

would create positive attributes AGE >20 and AGE <= 20. Including both in the search attribute set would allow potential direction reversals to be generated. From experience, this device only makes a search unnecessarily more extensive.

## What are rectangle diagrams and how do I make them?

The *Process:Rectangle diagrams* function produces mosaic plots or scaled rectangle diagrams for a partition. Scaled rectangle diagrams are an invention of the author and are useful for displaying attributes and the extent to which they overlap. See:

Displaying clinical data relationships using scaled rectangle diagrams (2000) (509.5KB PDF)

Scaled rectangle diagrams can be used to visualize clinical and epidemiological data (2005) (599.3KB PDF)

In SPAN, the *subgroups* of a partition define the rectangles. So you can create a rectangle diagram for any combination of attributes by using the manual facility to create a partition.

For example, Agresti gives the 2^{4} data on premarital sex, divorce, extramarital sex and gender:

1 0 0 1 68

1 0 0 0 130

1 0 1 1 17

1 0 1 0 4

1 1 0 1 60

1 1 0 0 42

1 1 1 1 28

1 1 1 0 11

0 0 0 1 214

0 0 0 0 322

0 0 1 1 36

0 0 1 0 4

0 1 0 1 54

0 1 0 0 25

0 1 1 1 17

0 1 1 0 4

the last column being a frequency count, the first four being binary indicators of gender, premarital sex, extramarital sex and divorce. Read the data from file *sexy.dat* using the .spn file:

Extra-marital sex data.

GENDER PRESEX EXTRASEX DIVORCE FREQ

*

GENDER b 1=men

PRESEX b 1=yes

EXTRASEX b 1=yes

DIVORCE b 1=yes

*

free

sexy.dat

To create a diagram of the four attributes GENDER, PRESEX, EXTRASEX and DIVORCE go to *Process:Rectangles *and enter the partition (*GENDER*=yes) or (*PRESEX*=yes) or (*EXTRASEX*=yes) or (*DIVORCE*=yes) manually (which is achieved by entering (2)(3)(4)(5) in the Manual dialog).

This will create a 4- rectangle diagram displaying the intersection of the four attributes, in effect, a representation of the 24 table.

## How can I use SPAN for Boolean algebra?

SPAN does Boolean algebra but subject to certain restrictions. SPAN only simplifies Boolean expressions that are in already in canonical (normal) form.

SPAN usually runs in the context of data analysis, but you can use it without any data to do Boolean algebra. For example, using control (.SPN) file:

Control file for Boolean algebra only

A B C D X

*

A b 1

A b 0

B b 1

B b 0

C b 1

C b 0

D b 1

X i <50

X i >75 >50

*

will throw up error messages about there being no data, but you can still do Boolean algebra via the *Process:Create attribute *feature.

## How can I use SPAN to do a tree-based analysis?

SPAN has an "add-on" feature to create classification and regression trees. Tree splits are generated according to attributes specified in the control file. Choose a splitting criterion from the *Criteria* dialog and check *Tree in the Strategy* dialog. The search is done at each tree node to whatever parameters have been set in the *Extent* dialog, so allowing "Boolean" node splits. You may want to specify *q*=1, *p*_{1}=1 to dis-allow Boolean splits.

To create the tree, initiate the search with *Search:Go!*.

The Tree is displayed initially with just a split at the root node. Click on Auto grow in the *Tree* window grow the tree according to a stopping rule based on size of a node and the P-value associated with the split. The stopping rule values are set in the *Options* dialog. You can expand the tree by clicking a terminal node and prune it by clicking a non-terminal node.

You cannot do tree-based cross- validation because SPAN utilises its by-group facility for both creating trees (as trees are an afterthought add-on) and for its cross-validation procedures (see manual section 13.7).

## Which measure of effectiveness should I use?

Effectiveness is the criteria that is used to measure how well a binary partition splits the data into two groups. There are 8 choices in the *Criteria* dialog box: *Within MSE*, *Subgroup MSE*, *Entropy*, *Quality index*, *Chi-square*, *Odds ratio*, *Log-rank*, *Gini diversity*.

To choose an effectiveness measure you have to consider the nature of the outcome variable *Y*:

If *Y* is categorical and can take 3, or possibly more, values the only sensible choices are: entropy, chi-square and Gini diversity. These do not assume an ordinal or interval structure to *Y*. The Gini and Entropy measures allow you to specify user defined priors on *Y*. This is not allowed with any other option.

If *Y* is categorical with only *two* values (i.e binary) you can use the above three measures and, in addition, the quality index and Odds ratio (Bayes) measures come into effect.

If *Y* is a continuous measure the Within MSE, Subgroup MSE and log-rank measures are appropriate. The log-rank is for survival data and allows data to be censored. No other measure allows censoring.

The choice between the Within MSE and Subgroup MSE may depend on the objective of the analysis - the latter was introduced with the intention that the purity of subgroups could be achieved. Whether it achieves this is something that needs investigating.

## Does SPAN handle censored survival data?

Yes. If the outcome Y is censored (say a time variable), the log- rank criterion for partition effectiveness can be used. The indicator of censoring has to be a created SPAN attribute. When the log-rank criteria is used SPAN automatically enters "epidemiology" mode and outputs incidence rates assuming Y is a time to event measure.

## How large can data sets be?

In theory there is no limit - data arrays are dimensioned dynamically to accommodate whatever data is read in. In practice, with very large data sets the computation burden of scanning through the data for each partition may be prohibitive. The largest data set I have analysed consisted of over 50,000 records and creating 180 attributes.

If the outcome variable is is binary or discrete with just a few categories, data are grouped internally using a "turbo" facility and savings are considerable. In the 50,000 data set, the data on 10 attributes and a binary outcome collapsed to just a few hundred groups and presented no computational problem.(see SPAN manual Appendix 4. Computational notes). In practice it is better to collapse data beforehand and then read into SPAN in grouped form. For example, using the Stata collapse command.

## How can I test the "statistical significance" of a partition?

Statistical significance of a partition is often not an issue when SPAN is used to create a decision rule - the pertinent question is whether the rule is a good classifier. However, it is often useful to determine whether a rule is statistically significant. For a discussion of the issues, see the SPAN manual Appendix 7.

The *Process:Random* feature allows randomisation tests of statistical significance to be done. Effectively the feature generates binary partitions of the data at random and constructs the (null) sampling distribution of the measure of effectiveness. The effectiveness measure of a given partition can be compared against it to determine whether it could itself be a random partition.