What is SPAN?
SPAN is a program designed, ostensibly, to implement Search Partition Analysis procedures.
Key references are: Marshall, RJ. Statistics in Medicine; 5, 517-526(1986); 14,
2645-2659 (1995); and 18, 2723-2736 (1999).
Statistics in Medicine, 18, 2723-2736 (1999)(118.5KB PDF)
These are statistical procedures to give decision and classification rules based
on Boolean combinations of predictors. The program does, however, have other
statistical analysis features. For example, tree-based analysis, ROC curves,
scaled rectangle diagrams, mosaic plots, distributions, scatter diagrams,
rotating 3D plots.
Top
What operating systems support SPAN?
SPAN runs under Windows systems but there are no Mac or Unix versions.
Top
What is a simple strategy to generate a partition?
Suppose you wish to develop a two-class classifier from a set of binary predictors.
You have data on the two states in variable Y, coded 0 and 1, and X1,X2,...,X20
are
twenty binary predictors. You need to create attributes from the predictors in a
control file, each with a line like:
X4 b 1=yes
ie, a value 1 represents a yes to whatever X4 represents. The simple way to do
a SPAN analysis, with no cross-validation, is to select a Y from the Y dropdown
menu. Once done do Search:Go to initiate the search. This will run a search with
the default settings in the Criteria, Rank, Strategy
and Extent drop down menus.
The defaults consist of: the Within Mean Square error criterion of effectiveness,
utilising the best m=10 attributes, automatically deciding on "positivity" of attributes,
iterative mode for partition generation, up to extent parameters p1...pq = 2 2
If you want to change the default settings, go to each of these dialogs and change
the parameters as required. You need to select a partition effectiveness criterion
from the criteria dialog.
Top
How many attributes should I include in a search?
The attribute set for a search is set in the Strategy dialog box. The number of
attributes selected, in conjunction with the parameters in the Extent
dialog box,
determines how big a search will be. For m = 10 attributes
and p1...pq
= 2 1 with no floating cuts, the search size is 910. Increase to m = 12 and the
search size becomes 1596. If you make p1...pq = 2 2 and corresponding figures are 2080 and 4434.
For an indication of search sizes, see this table
in the SPAN manual.
Generally speaking, use m £ 15.
Top
The windows to view output and data seem awkward. Why?
The View facility lets you look and the output log, data and other files.
The output log has its own format and gives colour coding. The View browser is somewhat
primitive and is a DIY thing of the author's.
You can, of course, always invoke your own browser to view these files, which are
just plain text. The View facility of the .LOG file has a benefit that certain key
text lines are highlighted.
Top
How can cross-validation be done?
Cross validation refers to a technique for estimating error rates, using sub-samples
that are not used in partition construction. At its simplest, splitting data into
a training sample, to obtain a partition, and a test sample to test the partition,
is a validation method.
So-called V-fold cross-validation works as follows: Divide the data into
V mutually
exclusive sub-samples. Remove one of these samples and estimate a partition on the
remainder and then test out the partition on the removed sample. This is repeated
for each of the V sub-samples giving V estimates of misclassification, each not
necessarily from the same partition.
It is often suggested that V = 10 is a good choice. Then you have one tenth of the
sample for test and the other 9/10ths for training. Suppose L denotes the whole
sample and the vth subsample (for training) is denoted Lv and Lv¢
is the remainder (for testing). Suppose, based on L¢v,
the generalised error measure (see
manual section 16.5) is E¢v, using
the nv¢ observations in Lv¢.
Then the cross-validated error is weighted average of E¢v
with weights nv¢
Manual section 16.5
In SPAN cross-validation can be achieved by using the Y:By subgroup/Cross-validate
facility in the Y drop down. The By groups define the samples
used to obtain the V partitions. For each By-group the data that are excluded are automatically used
as the test portion. The use of the _i_ special variable (see manual section 9.9)
that indexes the observations can be used to do this. For example, if N = 100, and
there is one observation per record of data (i.e. data are not grouped) the line
Manual section 9.9
_i_ 20,0 40,20 60,40 80,60 100,80
effectively forms five overlapping "training samples" each of size 80 for 5- fold
cross validation: 20,0 is L1, the set of observations
excluding those indexed 1 to 20, 40,20 is L2 the set excluding those indexed 21 to 40 and so on.
Note that if data are in grouped form, this method will not necessarily produce
equal sized training samples.
If the observations in a data set are in a structured order, you can use the random
permutation index _r_ instead of _i_ to obtain random cross-validation samples.
The shorthand cvx
operator can be used to generate cross-validation group. For example use:
_r_ cv5
to create subgroups for 5-fold cross validation with 5 randomly chosen subgroups
according to the quintiles of _r_. i.e. the line is equivalent to
_r_ p20,p0 p40,p20 p60,p40 p80,p60 p100,p80
The resulting cross-validated estimate of misclassification is based on the misclassification
rates of V (possibly) different partitions. However, the estimate is, it
is assumed, a valid misclassification rate of the partition you are interested in,
that is, that constructed from the whole sample. This is partition is easily got
once the cross-validation run is done by de-selection Y:By subgroup/Cross-validate.
Top
How can I create an ROC curve?
An ROC (Receiver Operating Characteristic) curve is produced using Rank:ROC.
The
function is not enabled unless the Y variable, which corresponds to the variable
that defines the "disease" and "no disease" categories of the ROC, is binary.
The variable that defines the "test" (eg blood pressure) must be one of the other
variables and you must specify a sequence of cutpoints for it in the control file.
SPAN does not automatically generate cutpoints or use every unique data value as
a plotting position. Plotting positions correspond to attributes created in the
control file.
The "multiple cuts" (see 9.2.3) syntax is useful for specifying (up to 50) cutpoints:
BP >(80-120)2
specifies cutpoints 80, 82, 84 etc.
Top
How do I set the complexity penalising parameter b?
Partitions are penalised according to how complex they are by calculating G-bc, for effectiveness measure G
and complexity c. If b = 0 no penalising is done. When,
during a search, values of G are plotted against
c, the optimal partition is effectively
that which is furthest from a line drawn with slope
b.
In SPAN, b can be set automatically by
so requesting in the Criteria dialog. It also can be user specified in
the Criteria dialog or it can be altered in the Searching window, where
a red line is drawn with slope b, by clicking on the
Beta+ and Beta- buttons to
increment or decrement b (Note that this
feature only works when the Options: Monitor iterations is checked.)
The automatic setting feature is done as described in
15.4 of the SPAN manual.
Top
What is the point of the balancing parameter g?
The purpose of the balancing is to try and prevent a partition into A and
A- which
is imbalanced, in the sense, that the number of observations in
A and A- are grossly
disparate. It is achieved by multiplying the effectiveness measure (Gini, entropy,
or whatever) by
(pApA-)**g
where pA and
pA- are the proportions in A and A-.
For case-control data g = 1 has a theoretical justification (see Marshall, Statistics
in Medicine, 14, 2645-60, 1996).
Top
Can SPAN read character data?
Yes, but only the initial character of the character string is read and used to
define a category. So that data for a race variable with values "European" and "Estonian"
would be treated as the same value, so care is required in choosing words to represent
categories. Also the case of the initial character is ignored,
lower case is converted to upper. Categorical states are numerically encoded when
the data are read in. The numerical codes are used in some calculations. Encoding
is as follows : A=10, B=11,....Z=36.
Top
Does SPAN read other data formats?
No. Data is read into SPAN from plain ASCII text files. Can be fixed format
or free, space or comma delimited. Read the section on "Data Preparation" in the
manual.
Top
Why can't I close windows?
The program is written using Visual Fortran and its "QuickWin" procedures to create
the user interface. These do not give full Windows potential, though they are perfectly
adequate. One feature of QuickWin is that the X close button in a window opened
by SPAN is disabled.
Top
How are the ordinal classifier procedures in Statistics in Medicine (1999) done?
In this paper , it was suggested that
by collapsing categories of an ordinal variable, a multiway partition could be formed.
Statistics in Medicine, 18, 2723-2736 (1999)(118.5KB PDF)
Suppose the outcome IGT (impaired glucose tolerance, as in the paper) is ordinal
with states 0, 1, 2. The analysis
can be done
by selecting IGT as the Y variable. Because it will be seen as having
just 3 categories a "Discrete Y" dialog will open. Selecting the "Ordinal"
option will enable the ordinal analysis. Two consecutive (serial) searches
when "Search" will be done combining states 0,1 against 2 and then 0 against
1,2 forcing the result of the first search into the second as described in the paper.
Options for re-ordering states and doing a "descending" or "ascending" analysis
can be done in the "Discrete Y" dialog.
Top
How does SPAN compare to CART?
I originally developed SPAN because of the limitations I perceived in tree-based
analysis.
A critique of trees, in Journal of Clinical Epidemiology (2001)(92.4KB PDF)
Despite the criticisms, trees generally work quite well, in terms of their ability
to correctly classify objects, even if the combinations used for classification
are not always interpretable or meaningful.
A comparison of classification error rates by SPAN and tree, and other methods,
can be found in:
Comparison of misclassification rates of search partition analysis and other classification methods (2006)(96.8KB PDF)
Generally SPAN performs reasonably well. SPAN has advantages, but also some disadvantages.
The advantages
- The search is not local; it is achieved globally within the framework of a particular
class of binary partitions.
- It is less likely to produce combinations that are counter-intuitive, because the
class of binary partitions is based on the idea of combining together positive attributes.
Its disadvantages
- It applies to binary partitions and the two- way classification problem. Extending
the method to multi-way partitions is feasible (see eg, ordinal classifier procedures
above) but not well developed.
- SPAN requires more computer time.
Top
What is the difference between specifying multiple attributes from one line of a
control file and by separate lines?
For example, for a variable AGE, you could have the one line
AGE >20 >30 >40
which generates three AGE attributes. Or you could have three separate lines
AGE >20
AGE >30
AGE >40
which also generates the same three attributes.
In the former case, only one of the three can be chosen to be included in a search
attribute set and you have to specify which (from the Extent dialog). If
you run Rank the the best of the three (the optimum), in terms of partition effectiveness,
becomes the currently designated AGE attribute. However, you can let AGE "float"
so that all possible AGE cutpoints are tried whenever AGE occurs in a Boolean expression.
In the latter case, the three attributes are treated as distinct and so you could
include all three in the search attribute set. Different combinations of the AGE
attributes would be tried. Of course, some of these combinations may simplify, for
example (AGE>20 & AGE>30) = AGE>30,
so that it may be inefficient to
specify attributes in this way. However, in certain cases it may not. See the next
FAQ, for example.
Top
But I don't want to specify an attribute as either positive or negative?
One objection to SPAN that is sometimes raised is that it is restrictive to have
to define an attribute as positive. In different combinations, it is argued, an
attribute may be positive, in others it may be negative. For example,
AGE<20
may be positive for low birth weight (that is, increases the chances) but in conjunction
with another attribute, that is also positive for low birthweight, perhaps RACE=blue,
it might reduce the chance of a low birth weight baby. Reversals of this sort may
sometimes occur, but are generally unlikely. If anticipated, they can be accommodated
by specifying the attribute, and its complement, to both be positive. For example
in the above example, the control file lines
AGE >20
AGE <20
would create positive attributes AGE >20
and AGE <= 20. Including both in
the search attribute set would allow potential direction reversals to be generated.
From experience, this device only makes a search unnecessarily more extensive.
Top
What are rectangle diagrams and how do I make them?
The Process:Rectangle diagrams function produces mosaic plots or scaled
rectangle diagrams for a partition. Scaled rectangle diagrams are an invention of
the author and are useful for displaying attributes and the extent to which they
overlap. See:
Displaying clinical data relationships using scaled rectangle diagrams (2000)(509.5KB PDF)
Scaled rectangle diagrams can be used to visualize clinical and epidemiological data (2005)(599.3KB PDF)
In SPAN, the subgroups of a partition define the rectangles. So you can create a
rectangle diagram for any combination of attributes by using the manual facility
to create a partition.
For example, Agresti gives the 24
data on premarital sex, divorce, extramarital
sex and gender:
1 0 0 1 68
1 0 0 0 130
1 0 1 1 17
1 0 1 0 4
1 1 0 1 60
1 1 0 0 42
1 1 1 1 28
1 1 1 0 11
0 0 0 1 214
0 0 0 0 322
0 0 1 1 36
0 0 1 0 4
0 1 0 1 54
0 1 0 0 25
0 1 1 1 17
0 1 1 0 4
the last column being a frequency count, the first four being binary indicators
of gender, premarital sex, extramarital sex and divorce. Read the data from file
sexy.dat using the .spn file:
Extra-marital sex data.
GENDER PRESEX EXTRASEX DIVORCE FREQ
*
GENDER b 1=men
PRESEX b 1=yes
EXTRASEX b 1=yes
DIVORCE b 1=yes
*
free
sexy.dat
To create a diagram of the four attributes GENDER, PRESEX, EXTRASEX and DIVORCE
go to Process:Rectangles and enter the partition (GENDER=yes)
or (PRESEX=yes) or (EXTRASEX=yes) or (DIVORCE=yes) manually (which is achieved by entering (2)(3)(4)(5)
in the Manual dialog).
This will create a 4- rectangle diagram displaying the intersection of the four
attributes, in effect, a representation of the 24
table.
Top
How can I use SPAN for Boolean algebra?
SPAN does Boolean algebra but subject to certain restrictions. SPAN only simplifies
Boolean expressions that are in already in canonical (normal) form.
SPAN usually runs in the context of data analysis, but you can use it without any
data to do Boolean algebra. For example, using control (.SPN) file:
Control file for Boolean algebra only
A B C D X
*
A b 1
A b 0
B b 1
B b 0
C b 1
C b 0
D b 1
X i <50
X i >75 >50
*
will throw up error messages about there being no data, but you can still do Boolean
algebra via the Process:Create attribute feature.
Top
How can I use SPAN to do a tree-based analysis?
SPAN has an "add-on" feature to create classification and regression trees. Tree
splits are generated according to attributes specified in the control file. Choose
a splitting criterion from the Criteria dialog and check Tree in the Strategy
dialog.
The search is done at each tree node to whatever parameters have been set in the Extent dialog, so allowing "Boolean" node splits. You may want to specify q=1, p1=1
to dis-allow Boolean splits.
To create the tree, initiate the search with Search:Go!.
The Tree is displayed initially with just a split at the root node. Click on Auto grow in the Tree window grow the tree according to a stopping rule based on size
of a node and the P-value associated with the split. The stopping rule values are
set in the Options
dialog. You can expand the tree by clicking a terminal node and
prune it by clicking a non-terminal node.
You cannot do tree-based cross- validation because SPAN utilises its by-group facility
for both creating trees (as trees are an afterthought add-on) and for its cross-validation
procedures (see manual section 13.7).
Top
Which measure of effectiveness should I use?
Effectiveness is the criteria that is used to measure how well a binary partition
splits the data into two groups. There are 8 choices in the Criteria dialog box: Within MSE, Subgroup MSE, Entropy, Quality index,
Chi-square, Odds ratio, Log-rank, Gini diversity.
To choose an effectiveness measure you have to consider the nature of the outcome
variable Y:
If Y is categorical and can take 3, or possibly more, values the only sensible choices
are: entropy, chi-square and Gini diversity. These do not assume an ordinal or interval
structure to Y. The Gini and Entropy measures allow you to specify user defined
priors on Y. This is not allowed with any other option.
If Y is categorical with only two values (i.e binary) you can use the above three
measures and, in addition, the quality index and Odds ratio (Bayes) measures come
into effect.
If Y is a continuous measure the Within MSE, Subgroup MSE and log-rank measures
are appropriate. The log-rank is for survival data and allows data to be censored.
No other measure allows censoring.
The choice between the Within MSE and Subgroup MSE may depend on the objective of
the analysis - the latter was introduced with the intention that the purity of subgroups
could be achieved. Whether it achieves this is something that needs investigating.
Top
Does SPAN handle censored survival data?
Yes. If the outcome Y is censored (say a time variable), the log- rank criterion for partition effectiveness
can be used. The indicator of censoring has to be a created SPAN attribute.
When the log-rank criteria is used SPAN automatically enters
"epidemiology" mode and outputs incidence rates assuming Y is a time to event measure.
Top
How large can data sets be?
In theory there is no limit - data arrays are dimensioned dynamically to accommodate
whatever data is read in. In practice, with very large data sets the computation
burden of scanning through the data for each partition may be prohibitive. The largest
data set I have analysed consisted of over 50,000 records and creating 180 attributes.
If the outcome variable is is binary or discrete with just a few categories, data
are grouped internally using a "turbo" facility and savings are considerable. In
the 50,000 data set, the data on 10 attributes and a binary outcome collapsed to
just a few hundred groups and presented no computational problem. (See
SPAN manual
Appendix 4. Computational notes) . In practice it is better to collapse data
beforehand and then read into SPAN in grouped form. For example, using the Stata
collapse command.
SPAN manual
Appendix 4. Computational notes
Top
How can I test the "statistical significance" of a partition?
Statistical significance of a partition is often not an issue when SPAN is used
to create a decision rule - the pertinent question is whether the rule is a good
classifier. However, it is often useful to determine whether a rule is statistically
significant. For a discussion of the issues, see the SPAN manual Appendix 7.
The Process:Random feature allows randomisation tests of statistical significance
to be done. Effectively the feature generates binary partitions of the data at random
and constructs the (null) sampling distribution of the measure of effectiveness.
The effectiveness measure of a given partition can be compared against it to determine
whether it could itself be a random partition.
Top