The Process menu is used to analyse a particular partition, which may be
the optimal one from the result of a Search, or it may be manually entered or derived
from a control file specification. Clicking an item of the menu invokes the Select Partition
dialog to select, or make, a partition to be "processed":
The first, Optimal complexity penalised, selects the best partition from the last
search. Choose from best complexity opens a further dialog giving the
best partition at each complexity for the last search done (that is, the points on
the complexity hull). Manual creates a partition "manually", as described in
16.9. The Current
option uses a partition previously selected or created by the other three options.
Control file combination allows Boolean combinations that have been written
in the control file to be analysed (see 8.8).
The Rectangle, Distribution, Tree, and Statistics
items each do an analysis for a specific dependent variable. If you have previously
specified a multiple Y dependent variable, you will be presented with a
menu to choose a specific Y, as all analysis (except fine tuning) done
under Process is for a uni-variate outcome. If you do not have multiple
Y, the currently defined Y will be used, but you can go to the Y
menu to change it if required.
16.1 Rectangle Diagrams
Two types of diagram can be produced: scaled rectangle diagrams (SRDs) and mosaic
plots. Mosaic plots are a standard way of viewing categorical data. Details of SRDs,
which give a new way to visualise categorical data, can be found in reference 8.
An SRD is like a Venn diagram but using rectangles rather than circles and with
rectangles, and overlapping cells, scaled according to frequency.
For a partition of size p_1, ..., p_q an SRD or mosaic can be drawn to represent
the q subgroups of the associated disjunctive normal form (dnf) that is
represented by the "process" partition. A diagram can be created for either
the dnf representation of A or of A-; a dialog is presented for this choice. SRDs can be drawn for q = 1, 2, ..., 6 subgroups. When q ≥ 7 but no more than 10, a q = 6 diagram is produced
from the six largest subgroups, in terms of the number of
observations in each. The remaining q-6 are dropped. For q in excess
of 10, diagrams will not be drawn. Mosaic
plots can only be drawn for q up to 4.
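The limits on q described above can be summarised in a short sketch. This is only an illustration of the stated rules, not SPAN's own code; the function name `diagram_plan` and the returned dictionary keys are hypothetical.

```python
def diagram_plan(subgroup_sizes):
    """Return which subgroups an SRD would show and whether a mosaic is allowed.

    subgroup_sizes: list of observation counts, one per subgroup (q entries).
    """
    q = len(subgroup_sizes)
    if q < 1:
        raise ValueError("need at least one subgroup")
    # Subgroup indices ordered from largest to smallest frequency.
    order = sorted(range(q), key=lambda i: subgroup_sizes[i], reverse=True)
    if q <= 6:
        srd = sorted(order)        # all subgroups are drawn
    elif q <= 10:
        srd = sorted(order[:6])    # only the six largest are kept
    else:
        srd = []                   # no diagram for q > 10
    return {"srd_subgroups": srd, "mosaic_allowed": q <= 4}
```

For example, with seven subgroups the smallest one is dropped from the SRD and no mosaic plot is offered.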
Different ways to "fit" an SRD are allowed. Initially a Rectangle Diagram construction
dialog will be shown:

In this dialog an initial configuration method can be selected
which sets a way to position the rectangles for the optimisation fitting method.
The Optimisation criterion for fitting can also be chosen as either log-likelihood,
least squares or least absolute differences (see below).
There are two penalising parameters which affect the optimisation criterion.
One is a cell thinness parameter to allow penalising for long skinny rectangles since diagrams with skinny
rectangles are harder to interpret.
The other penalises for absent cells, that is, cells that contain observations
but are not represented on the diagram.
The Permute check box re-orders the rectangles. With a Random initial configuration,
this will have no effect, but it may affect the other possibilities for the initial configuration.
There is also a scale parameter, which is usually one; it is the parameter
α described below and applies a power scaling
to frequencies.
Mouse clicks allowed in Rectangle window
As soon as an SRD is created, the Rectangle Diagram window is active for mouse
input. There are a number of buttons which allow changes to the configuration in
different ways, and also allow switching to a mosaic plot. Most of these buttons
are disabled in Mosaic plot mode, where they have no relevance.
Further, left clicking on an
edge of one of the SRD rectangles brings it to the front, that is, on top of the other
rectangles. Initially the rectangles are drawn with the largest on the bottom and
the smallest on the top. Right clicking on an
edge renders the rectangle "invisible". Although it is not drawn it is still "there".
To restore it, right click again on any edge.
If you click inside one of the rectangles a pointer line
is drawn from outside the square with a label indicating its name. A second click of the left mouse
button will increase the intensity of the colour and a right click will decrease
it.
Here is an example of an SRD showing three subgroups. The colours used
are the default colours; you can change to black and white with the Mono button,
or re-colour with the Recol button. Other shadings are possible with the Shade button.
The top row buttons are:
Esc releases expected mouse input
from the diagram (clicking the right mouse button is equivalent). In Mosaic mode,
Esc takes you back to the active scaled rectangle diagram.
----------------------
Mono alternates between colour and
monochrome.
---------------------
Skew produces a 3D view of the configuration.
The entire bitmapped
image is skewed. Other options do not work on the skewed image; clicking a button will revert to a flat image, which can then
be skewed. (In fact, though, the skew option is an add-on gimmick that I do not entirely endorse)
---------------------
Label allows cells to be labelled
in different ways:
Cell numbers gives the count in each cell; Cell error gives the discrepancy between
count and actual area. Standardised residuals are Pearson chi-square residuals.
Mean of Y (incidence rate) adds the mean value of Y in each cell. If the log-rank
criterion is selected, incidence rates are added instead. Relative risk and odds ratios
are for binary outcomes, giving risk relative to the "white space", that is, the absence of attributes.
Cell codes adds codes. The code 1**4, for example, refers to the cell in subgroups 1 and 4 and not in
2 and not in 3. Clicking on Code will also tell you whether there are Absent cells,
that is, cells for which there is data (f_i > 0) but no area on the diagram (a_i = 0) to represent them. It is quite likely
that absent cells will be present for q ≥
5. There may also be No data cells, that is, cells drawn on the diagram
but with no observations (zero frequency).
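As a rough illustration of the cell codes and of the Absent and No data cells described above, the sketch below decodes a code such as 1**4 and classifies cells from their frequency and drawn area. It assumes that each position in the code corresponds to one subgroup, with `*` meaning "not in that subgroup"; the function names and the dictionary layout are hypothetical and not part of SPAN.

```python
def decode_cell(code):
    """Split a cell code such as '1**4' into (members, non_members)."""
    members, non_members = [], []
    for position, char in enumerate(code, start=1):
        if char == "*":
            non_members.append(position)   # observation is NOT in this subgroup
        else:
            members.append(int(char))      # observation IS in this subgroup
    return members, non_members


def classify_cells(freq, area):
    """Return (absent, no_data) cell codes.

    freq: dict mapping cell code -> observed frequency f_i
    area: dict mapping cell code -> drawn area a_i
    """
    # Absent: data exists (f_i > 0) but the cell has no area (a_i = 0).
    absent = [c for c in freq if freq[c] > 0 and area.get(c, 0) == 0]
    # No data: the cell is drawn (a_i > 0) but holds no observations.
    no_data = [c for c in area if area[c] > 0 and freq.get(c, 0) == 0]
    return absent, no_data
```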
----------------------
Nums adds the observed frequency fi
of each cell of the diagram.
----------------------
Error adds the component of E from each cell of the diagram. These
are the values:
100(a_i - f_i^α)
A negative value shows the cell to be too small, a positive value too large.
The quantities a_i,
f_i and
α are defined below.
----------------------
Indep produces a diagram based on expected cell numbers assuming the characteristics
of the sub-groups are independent. The result of a chi-square test of independence
is output. (Note: it is not a test of independence of attributes, unless the sub-group
rectangles are defined by single attributes.)
-----------------------
Shade produces a dialog for options
on shading of the currently displayed diagram:
The default is "Subgroup colours". Adding shadow attempts (not very successfully) to
show the rectangles as superimposed tiles; "Cross-hatching" produces cross-hatched
shading. This is not very successful either, as cross-hatching seldom is (given
the nature of computer screen pixel representation).
"Fuzzy 99% edges" is an experimental way of smudging the edges of the
rectangles according to the sampling variability of the counts in each rectangle.
That is, the rectangle at the outer edge of the blur represents the upper 99% CI
of the binomial count while the inner edge of the blur represents the lower CI.
"Translucent" shows the rectangles
as semi-transparent tiles of coloured glass and works reasonably well. Here is an
example:
"Mean of Y (or incidence rate)" shading (the Log-rank criterion gives Incidence rate) produces
shaded cells running from yellow through to deep red according to the value of Y in
each cell. The levels of the shading can be automatically determined or
user specified; a dialog is presented for user-entered values. Also, clicking on
the highest (or lowest) level boxes of the legend will expand or decrease the level
spacing.
"Standardised residuals" show Pearson residuals for each cell from assumption
of independence i.e. (O-E)/sqrt(E), where O and E are observed and expected cell
counts.
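The residual (O-E)/sqrt(E) can be illustrated for the simplest case of two subgroups, where the cells form a 2 by 2 table. The sketch below is a generic calculation under the independence assumption, not SPAN's code; the function name is hypothetical.

```python
from math import sqrt


def pearson_residuals(table):
    """Pearson residuals (O - E) / sqrt(E) for a two-way table of counts.

    E is the expected count under independence: row total * column total / n.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    residuals = []
    for i, row in enumerate(table):
        out = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            out.append((observed - expected) / sqrt(expected))
        residuals.append(out)
    return residuals
```

A negative residual marks a cell with fewer observations than independence would predict, a positive one a cell with more.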
----------------------
Recol This button is used to re-colour
the configuration. If the Attribute colour shading is used, it changes the
colours by selecting other colours (slightly pastel shades) at random.
If you Save a diagram the colours will be saved with it, so you can get back to
a set of colours you like. The Undo button reverts to the previous colours. You can also get
back to the Default colours via the Subgroup and Subgroup+Shadow
options of the
Shade dialog.
If you have "Translucent" shading, the Recol button
re-orders the colours of the translucent cells and may improve the appearance.
Note also that the intensity of the Subgroup colours can be reduced or increased
by a second click inside the subgroup rectangle after clicking on it to add the pointer
arrow. The left mouse button increases the tonal intensity; the right button decreases
it.
----------------------
Nudge gives a slight perturbation
of the rectangle coordinates and begins Powell's iterative procedure (see below)
again. This facility is introduced, since by experience I have found that Powell's
iterative procedure can become stuck at a local minimum. A
Nudge may produce a diagram with lower E; but it also may make
matters worse. Or it may make no difference at all. The
Undo button (see below) allows you to get back to the previous configuration
if you wish. Nudge may also provide a more aesthetically pleasing or easy-to-see
diagram. Alternatives to nudge to produce a new configuration are the
>next and
Renew buttons.
-----------------------
Renew This takes you back to the
Rectangle diagram construction dialog box.
----------------------
Mosaic/SRD Toggles between an SRD
representation and a mosaic plot. Mosaic plots can only be done for q up to
4.
----------------------
Undo You can get back to the previous
configuration with this button. The button only allows one step back. Once
used, the button immediately changes to Redo
to take you back to the configuration just "undone".
----------------------
NoEnc
This removes the boundary enclosing rectangle and optimises fit ignoring the whitespace.
----------------------
>next This jumps to the next possibility
for a starting configuration. The next possibility will be whatever initial configuration
type has been chosen in the Rectangle diagram construction
dialog box (see above)
----------------------
Save Saves the current configuration
coordinates in a file with a .srd extension. The saved configuration can be retrieved
by going to Renew and selecting Saved from the Rectangle diagram Construction dialog box. The
Saved item presents a dialog to Open a saved .srd file. Note that a .srd
file contains only the rectangle coordinates of the current configuration. If you change
the input partition,
the chosen saved configuration will still be applied and may not produce a sensible representation.
If the input partition has a different q from
that saved in the .srd file, a fix-up will be attempted.
-----------------------
+ and - buttons. These increase and decrease the size of the configuration.
16.1.1 Rectangle construction method
SPAN creates rectangle
diagrams by beginning with an initial configuration and applying an iterative procedure
to minimise the discrepancy between area and cell frequency. The initial configuration
is determined by a certain method and a permutation of the groups (see below). Powell's
minimisation algorithm is used to minimise a measure, D, of the difference
between cell frequencies and area. There are three measures:
Least absolute difference
Least squares
Log-likelihood
where C = 2^q
is the number of cells, f_i is the observed frequency in cell i, a_i is the geometric
area of region i, a = Σ a_i, and f
= Σ f_i^α
is the total metric. Usually α = 1, but
it can be altered in the Rectangle Diagram construction dialog.
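The formulas for the three measures are not reproduced in this text. Purely as an illustration, the sketch below compares area proportions a_i/a with scaled frequency proportions f_i^α/f using forms that are common for such criteria. These forms are an assumption and may differ from the expressions SPAN actually uses; the function name and dictionary keys are hypothetical.

```python
from math import inf, log


def fit_measures(freq, area, alpha=1.0):
    """Assumed forms of three area-versus-frequency discrepancy measures."""
    scaled = [f ** alpha for f in freq]           # power scaling of frequencies
    f_total = sum(scaled)
    a_total = sum(area)
    target = [s / f_total for s in scaled]        # proportions the areas should match
    drawn = [a / a_total for a in area]           # proportions actually drawn
    # Least absolute difference (assumed form).
    lad = sum(abs(d - t) for d, t in zip(drawn, target))
    # Least squares (assumed form).
    lsq = sum((d - t) ** 2 for d, t in zip(drawn, target))
    # Log-likelihood style discrepancy (assumed form, multinomial deviance shape).
    dev = 0.0
    for d, t in zip(drawn, target):
        if t > 0:
            # An absent cell (data but zero area) makes the measure infinite.
            dev += inf if d == 0 else t * log(t / d)
    return {"least_absolute": lad, "least_squares": lsq, "log_likelihood": dev}
```

All three values are zero when every area is exactly proportional to its scaled frequency.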
An initial analysis is automatically performed with default "method" and "permutation"
of the rectangle order and with the log-likelihood measure for D.
By experience
I have found that D can be "flat" over regions of the parameter space (i.e. the
coordinates of the rectangles) and the optimal configuration may not be achieved.
You can nudge the configuration in the hope of improvement. Or a new
analysis can be tried when you click on
Renew in the rectangle diagram window, allowing
you to change the method(s) and permutation(s) used to establish an initial configuration
and D in
the Rectangle Diagram construction dialog box.
The overall % error of the displayed subgroup rectangle diagram is computed as E,
which is the same as the unweighted D, expressed as a percentage. It is reported
for each diagram, as in the examples above. E_max is the largest
component over i.
The Fit P value that is reported together with E at the head of
a diagram (see example) is the P-value associated with a chi-square test of the
agreement between frequency and area.
Note that the enclosing unit rectangle of a diagram represents the entire sample
(or a sub-sample, if using by-groups). This means that the "white space" may represent
not only observations that are not in any of the rectangles, but also observations
for which the partition could not be evaluated due to missing attributes.
16.2 List
This option will produce a list of identifiers of the observations in A
and A- and in the intersections. For example, if #(1,2)=5,
that is, there are 5 observations in the intersection of subgroups
1 and 2, it will be followed by a list of the identifiers of these 5 observations.
The item is only active if an identifier _id_ has been specified in the control
file.
16.3 Distribution
The Process:Distribution item will create a graphic of the distribution of the dependent
variable in A and A- , the two plots being superimposed so you
can see the discrimination of the partition. As in 10.4, there are five types of
plot available: empirical probability density, bar-chart of relative frequencies,
bar chart of absolute frequencies, survival plot and distribution function. You will
be required to select which you want. If you select empirical probability density,
one of the two curves will first be drawn and the
Band+ and Band- buttons
appear to allow adjustment of the bandwidth smoother. When you press
Esc to end the adjustment, the second curve appears; it is calculated with the
same bandwidth as the first.
The frequency of missing values is shown on the distribution diagram to the left
of the frequency/probability axis.
Survival and distribution function plots are non-parametric. The Kaplan-Meier (product-limit) formula is used if the log-rank criterion has been selected and
there is censoring (see A3.7).
16.4 Statistics
This option will give you statistics associated with a partition. The information
is also written to the output log which is automatically displayed. (Other items
of the processing menu also write the information to the log, but do not display
the log window; you need to use View to display).
16.4.1 2x2 table statistics
If Y is discrete a frequency table of A,A- (as columns) by the
dependent variable (as rows) is produced. If there are two values of Y
various 2 by 2 table statistics will be shown:
These statistics are:
- Positive and negative predictive values treating A and A- as positive and negative
"tests" respectively, and the right hand column of the table as a positive outcome.
Confidence intervals for a proportion are done by Wilson's method.
- Sensitivity and specificity again treating
A and A- as positive and negative
"tests".
- Quality indices QI(0), QI(1) and
QI(0.5), as defined in Kraemer (reference 4). Confidence intervals
are calculated using standard error formulæ in Marshall (reference
5). Var(O) is an
estimate of the variance of O=log(QI(0)/QI(1)) and is useful for calculating the variance of the
bias formula in reference 5.
- The odds ratio of the table. To avoid division by zero, 0.5 is always inserted in
empty cells. Confidence intervals are calculated by the logit method.
- The relative risk, treating A (second column) as the "risk factor (exposure)" and
the second row as the "disease". Confidence intervals are calculated by the logit
method.
- If you have a by-group in place, the odds ratios and relative risks of each table
are combined by the Mantel-Haenszel and logit methods. 95%
confidence intervals are calculated for the pooled logit estimates.
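Some of the 2 by 2 statistics listed above can be sketched with textbook formulas. The code below is an illustration only: the cell names tp, fp, fn, tn and a, b, c, d, the function names, and the reading of "0.5 is inserted in empty cells" as replacing a zero count by 0.5 are all assumptions, and SPAN's own output may be computed differently.

```python
from math import exp, log, sqrt

Z95 = 1.959963984540054  # normal quantile for a 95% interval


def wilson_interval(successes, n, z=Z95):
    """Wilson score confidence interval for a proportion."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def predictive_summary(tp, fp, fn, tn):
    """Sensitivity, specificity and predictive values, with A as the positive test."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


def odds_ratio_logit(a, b, c, d, z=Z95):
    """Odds ratio (a*d)/(b*c) with a logit-method interval.

    Zero cells are replaced by 0.5 (assumed reading of the rule in the text).
    """
    a, b, c, d = (x if x > 0 else 0.5 for x in (a, b, c, d))
    estimate = (a * d) / (b * c)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # standard error of log odds ratio
    return estimate, exp(log(estimate) - z * se), exp(log(estimate) + z * se)
```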
16.4.2 Chi-square, generalised error etc
Whether or not the table is 2 by 2 or k by 2, the following is also output:
- The value of the chosen effectiveness measure, not complexity penalised,
is output. The word Balanced is prefixed
if γ > 0 (see
11.2).
- The Penalised effectiveness measure
for the given beta and for the complexity
of the partition.
- Correlation. This is the product-moment
correlation coefficient between the dependent variable and a binary variable that
is 0 or 1 depending on whether an observation is in A or A-.
In this calculation the dependent variable is treated numerically as its actual
value.
- Chi-sq statistic and
Phi-coefficient. The chi-square statistic
χ² of the k by 2 frequency table and associated P-value (not
corrected for continuity). The phi-coefficient is
φ = √(χ²/n)
where χ² is the chi-square statistic of
the frequency table and n the sample size. When the frequency table is 2 by 2, the
correlation and phi-coefficient are equal (in absolute value).
(Note that chi-square for general cross-tabulations can be obtained from Y:Scatterplot)
- Generalised error. The generalised
measure of error, E, is a measure of the discrepancy of the values
of Y in A- and in A from the minimum and maximum respectively,
that is,
If, for example, Y is binary
then E is the misclassification rate of the 2 by 2 table.
If Y has k categories taking values 0,1,..., k-1 and
A- and A are coded 0 and 1 then
where {n_ij}
is the 2 by k table of counts.
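The chi-square statistic and phi-coefficient listed above, and their agreement with the correlation for a 2 by 2 table, can be checked with a short sketch. This is a generic calculation and not SPAN's code; the function names are hypothetical, and the P-value is omitted because it needs the chi-square distribution.

```python
from math import sqrt


def chi_square(table):
    """Uncorrected chi-square statistic of a k by 2 table; returns (chi2, n)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n


def phi_coefficient(table):
    """phi = sqrt(chi2 / n)."""
    chi2, n = chi_square(table)
    return sqrt(chi2 / n)


def correlation_2x2(table):
    """Product-moment correlation of two 0/1 variables from their 2 by 2 counts."""
    (a, b), (c, d) = table
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))
```

For any 2 by 2 table with non-zero margins, `phi_coefficient(table)` equals the absolute value of `correlation_2x2(table)` up to rounding.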
The Statistics option will give a breakdown of the subgroups of A
and of A- :
The mean and standard deviation (SD)
of the dependent variable in each subgroup, treated strictly as its numerical value
and disregarding whether the variable is nominal or ordinal, is shown, as well as
the number in the subgroup and the number that are unique (U)
to the subgroup, in the sense that they do not appear in any other of the subgroups.
If the dependent variable is binary the mean is, of course, a proportion.
16.4.3 Risk matrix
The Process:Statistics option also writes a risk matrix to the
output log (only if the Detailed output option is on). If there are q
subgroups, I_1,
I_2, ..., I_q, in A and q′ subgroups, I′_1, I′_2, ...,
I′_q′, in A-,
an element of the risk matrix is a measure of the risk of subgroup I_i in A relative to subgroup I′_j in A-.
In addition, the risk measure is calculated for each of I_1, I_2, ...,
I_q versus A-,
and for each of the subgroups I′_1, I′_2, ..., I′_q′ versus
A, as well as A- versus A. The matrix is therefore of
dimension (q+1)(q′+1).
For example, if Y is an interval variable the mean difference
is output, for each pairing. A 95% confidence interval is also output.
On output, a separate row is used for each element of the matrix. For example:
Here row 3 2- refers to the relative
risk between subgroup 3 of A and subgroup
2 of A-.
When Y is binary, a matrix of risk differences, relative risks (risk ratios)
and odds ratios is output. Confidence intervals are calculated for the odds ratios
and relative risks by the logit method.
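The pairings that make up the risk matrix described above can be enumerated to confirm the stated number of elements, (q+1)(q′+1). The sketch is illustrative only; the tuple labels are hypothetical and merely echo the "3 2-" row style of the output.

```python
def risk_matrix_rows(q, q_prime):
    """List the (A-side, A- side) pairings of the risk matrix.

    q: number of subgroups in A; q_prime: number of subgroups in A-.
    """
    rows = []
    # Each subgroup of A against each subgroup of A-.
    for i in range(1, q + 1):
        for j in range(1, q_prime + 1):
            rows.append((str(i), f"{j}-"))
    # Each subgroup of A against the whole of A-.
    rows += [(str(i), "A-") for i in range(1, q + 1)]
    # The whole of A against each subgroup of A-.
    rows += [("A", f"{j}-") for j in range(1, q_prime + 1)]
    # The whole of A against the whole of A-.
    rows.append(("A", "A-"))
    return rows
```

With q = 3 and q′ = 2 this gives 12 rows, including the pairing of subgroup 3 of A with subgroup 2 of A-.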
16.4.4 Log-rank: incidence rates
If the Log-rank effectiveness criterion has been specified, the Process:Statistics feature
will also output incidence rates and incidence rate ratios.
As SPAN attempts to partition
such that A is associated with large Y, which in this case will
be greater survival than in A-, incidence rates will be expected to be
greater in A- (and its subgroups) than in A. In consequence, incidence
rate ratios are output comparing rates in A- (the numerator) to rates in
A (the denominator). Also, when log-rank is specified, Harrell's c-index
for concordance of predicted and actual survival times is output. This is computed
by pairwise comparison of observations and may be slow. You can escape the
computation with Search:End.
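The pairwise nature of the c-index, and why it can be slow for large samples, is visible in the following minimal sketch of the standard definition. It assumes that a higher predicted value means longer predicted survival and, as a simplification, skips pairs with tied observed times; SPAN's treatment of ties and censoring may differ, and the function name is hypothetical.

```python
def harrell_c(times, events, predicted):
    """Harrell's concordance index by comparison of all pairs (O(n^2)).

    times: observed times; events: 1 if the event occurred, 0 if censored;
    predicted: higher value = longer predicted survival (assumption).
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(i + 1, n):
            if times[i] == times[j]:
                continue                     # simplification: skip tied times
            shorter, longer = (i, j) if times[i] < times[j] else (j, i)
            if not events[shorter]:
                continue                     # shorter time censored: not comparable
            comparable += 1
            if predicted[shorter] < predicted[longer]:
                concordant += 1.0            # prediction agrees with outcome
            elif predicted[shorter] == predicted[longer]:
                concordant += 0.5            # tied prediction counts as half
    return concordant / comparable if comparable else float("nan")
```

When the prediction is simply membership of A (1) or A- (0), many pairs have tied predictions and each contributes one half.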
16.4.5 Table of Prior adjusted pseudo counts
If prior probabilities are user specified (in the Criteria dialog) a table of pseudo
counts is also output. This table is based on adjusted values of the cell counts,
adjusted to ensure that row sums are in proportion to the specified prior distribution.
Cell entries of row j, corresponding to category j of the Y
variable with specified prior probability p_j, are multiplied
by p_j/(n_j/n), where n_j is the total of row j and n is the overall total.
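The adjustment can be sketched as follows, taking n_j to be the total of row j and n the overall total, which is what makes the adjusted row sums proportional to the priors. The function name is hypothetical and the code is only an illustration of the rule stated above.

```python
def prior_adjusted_counts(table, priors):
    """Multiply each row j of a table of counts by p_j / (n_j / n).

    table: rows are categories of Y, columns are A- and A (or similar).
    priors: one prior probability p_j per row.
    """
    n = sum(sum(row) for row in table)
    adjusted = []
    for row, p_j in zip(table, priors):
        n_j = sum(row)
        factor = p_j / (n_j / n)
        adjusted.append([cell * factor for cell in row])
    return adjusted
```

After adjustment each row sums to n times its prior, while the proportions within a row are unchanged.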
16.5 Random
This option generates binary partitions of the data at random and thereby allows
randomisation tests of a partition. Whether an individual is assigned to A
or A- is done by calling a random number generator. The purpose of the
routine is to assess the statistical significance of regular partitions by posing
the question: "could the partition, or partitions, have arisen by chance?". (More
details are in Appendix 7)
When the routine is entered you select a partition for analysis, as with other items
of the Process menu. The chosen partition is the one to be tested for significance.
You will be shown a Random partitions dialog box:
This gives the number of partitions generated on the last search, say N, and the
measure of effectiveness of the chosen partition (test
G). The statistical significance of test
G can be assessed by generating the sampling distribution of G
from random partitions of the data.
To generate a random partition requires specifying a probability p_A that an observation will lie in A.
Specifying p_A
can be done in three ways, chosen by the appropriate radio button:
- Mimic split distribution. Sample p_A from the frequency distribution of the empirical
values of p_A that occurred
in the previous search of size N.
- Random U(0,1) split distribution. Sample p_A from a uniform distribution over 0 to 1.
- Fixed % split distribution. Fix p_A at some value common to all partitions. You need
to enter an associated percentage.
As option 1 produces splits with similar proportions in A and A-
to those in the search just done, it is probably to be preferred.
With a given p_A and n observations, a partition with [n p_A]
observations in A will be generated, by sampling with probabilities conditional
on the pool of remaining unassigned observations.
You will need to enter the number of random partitions to generate. This can be
set to zero if you just wish to view the distribution of G in searches of
the data. The default value is 1000.
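The generation of one random partition, as described above, might be sketched like this. The sketch treats the bracket in [n p_A] as the integer part and draws the members of A without replacement, which matches sampling from the pool of remaining unassigned observations. Both readings are assumptions, as are the function names and the method labels "mimic", "uniform" and "fixed".

```python
import random


def draw_p_a(method, rng, search_splits=None, fixed_percent=None):
    """Choose the probability of membership of A by one of the three options."""
    if method == "mimic":
        # Sample from the proportions observed in the previous search.
        return rng.choice(search_splits)
    if method == "uniform":
        return rng.random()                  # U(0,1)
    if method == "fixed":
        return fixed_percent / 100.0
    raise ValueError(f"unknown method: {method}")


def random_partition(n, p_a, rng):
    """Return a list of n booleans, True where the observation falls in A."""
    size_a = int(n * p_a)                    # [n p_A], assumed to be the integer part
    in_a = set(rng.sample(range(n), size_a)) # draw without replacement
    return [i in in_a for i in range(n)]
```

A randomisation test would repeat `random_partition` the requested number of times, compute G for each partition, and compare the resulting distribution with the G of the partition under test.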
The survival distribution of G for random partitions as well as the actual
value of the G's on the previous search will be produced as follows:
The Random sampling distribution is
the distribution of G by simulated partitions. The distribution of generated
criteria on the last search is the Search distribution
with G_test being the chosen partition
to test. Note that the search distribution and the
G_test value are not the complexity penalised values.
16.6 Tree
If a tree has been produced by Strategy:Tree (see
13.5), you can view
it and, if required, prune or expand it with this option, without having to recreate
the tree as in 13.5.
If you have generated a tree from a training sample (produced by Edit:Split sample)
then switching to the test sample (using Process:Test sample) and running
Process:Tree will "drop" the test sample down the tree.
16.7 Test[training] sample
If the data have been randomly split into a test and training sample (see
7.3),
the Process:Test [training] sample item will be enabled in the Processing
menu to let you switch to analysing the test sample. If you switch to the test
sample, the next time Process is called the item becomes Training sample,
so that you can switch back and forth.
16.8 Create added attribute
You can use the Process menu to create an attribute. It will be assigned
a number prefixed by &, i.e. &1, &2, &3 etc.
The created attributes and variables are not physically added to the data set; they
are lost when you exit SPAN or when a different data set is re-read
from File:Read .SPN file.
However, you can save the Boolean combination of the added attribute to the control
file, and the attribute will be recreated immediately when you next read in the data from
the control file.
The added attribute value is added for all observations of the input data set, even if,
for example, you have restricted analysis to a subset, using Strategy:Constrain
the X space.
The number of added attributes is limited by the parameter
MAXADD (see Appendix 4). Once you reach this number
all but the last 10 created added attributes will be deleted and lost before proceeding.
16.9 Manual partitions [Simplifying Boolean expressions]
To input your own Manual partition you will be shown a dialog box with
a list of attributes with a number assigned to each. You have to fill in a
partition expression in disjunctive normal form (dnf).
The dnf expression must be in the form of lists of attribute numbers grouped by parentheses
as follows
(list_1) (list_2) ...
where list_1, list_2, ... are lists
of attribute numbers. To enter the complement of a listed attribute, enter the negative
of its number.
You can type in the expression directly or use the Add button to
insert an attribute number that is highlighted.
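To make the dnf notation concrete, the sketch below parses an expression of parenthesised attribute numbers and tests whether an observation falls in the resulting partition. It is not SPAN's parser: the function names are hypothetical, and the separator between numbers inside a list (spaces or commas) is an assumption, so the code accepts either.

```python
import re


def parse_dnf(expression):
    """Turn '(1 -3) (2)' into [[1, -3], [2]].

    Each parenthesised list is a conjunction (AND) of attributes;
    the lists are combined by OR. A negative number means the complement.
    """
    groups = re.findall(r"\(([^()]*)\)", expression)
    return [[int(tok) for tok in re.findall(r"-?\d+", g)] for g in groups]


def in_partition(dnf, attributes):
    """attributes: dict mapping attribute number -> True/False for one observation."""
    def literal(k):
        value = attributes[abs(k)]
        return value if k > 0 else not value
    # In A if every literal of at least one list holds.
    return any(all(literal(k) for k in conj) for conj in dnf)
```

For instance, `(1 -3) (2)` reads as "(attribute 1 and not attribute 3) or (attribute 2)".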
The above dialog creates the following partition
A=(SEX_M) or (AGP>0.82) or (AGP>1.03) or (AGP>1.44)
which simplifies to
A=(SEX_M) or (AGP>0.82)
If the Representation of A- box is ticked, the partition determined by
the combination is assumed to represent the complement A-.
The manual input facility can be used to simplify Boolean expressions, as in the
example above. Enter a partition in manual form and it will automatically be simplified
if Option:Full Boolean reduction is checked. If Boolean simplifications are not wanted (which is sometimes useful
for creating Rectangle diagrams, for example
if you wanted to include the three nested rectangles for levels of AGP in the above
example), ensure Option:Attributes are primitive
is checked.