School of Population Health

16. Process

 

The Process menu is used to analyse a particular partition, which may be the optimal one from the result of a Search, or it may be manually entered or derived from a control file specification.  Clicking an item of the menu invokes the  Select Partition dialog to select, or make, a partition to be "processed":

 

The first option, Optimal complexity penalised, selects the best partition from the last search. Choose from best complexity opens a subsequent dialog giving the best partition at each complexity for the last search done (that is, the points on the complexity hull). Manual creates a partition "manually", as described in 16.9. The Current option uses a partition previously selected or created by the other three options. Control file combination allows Boolean combinations that have been written in the control file to be analysed (see 8.8).

The Rectangle, Distribution, Tree, and Statistics items each do an analysis for a specific dependent variable. If you have previously specified a multiple Y dependent variable, you will be presented with a menu to choose a specific Y, as all analysis done under Process (except fine tuning) is for a univariate outcome. If you do not have multiple Y, the currently defined Y will be used, but you can go to the Y menu to change it if required.

16.1 Rectangle Diagrams 

Two types of diagram can be produced: scaled rectangle diagrams (SRDs) and mosaic plots. Mosaic plots are a standard way of viewing categorical data. Details of SRDs, which give a new way to visualise categorical data, can be found in reference 8. An SRD is like a Venn diagram, but uses rectangles rather than circles, with the rectangles and their overlapping cells scaled according to frequency.

For a partition of size p_1, ..., p_q an SRD or mosaic plot can be drawn to represent the q subgroups of the associated disjunctive normal form (dnf) that is represented by the "process" partition. A diagram can be created for the dnf representation of either A or A-; a dialog is presented for this choice. SRDs can be drawn for q = 1, 2, ..., 6 subgroups. When q ≥ 7 but is less than or equal to 10, a q = 6 diagram is produced from the six largest subgroups, in terms of the number of observations in each. The remaining q-6 are dropped. For q in excess of 10, diagrams will not be drawn. Mosaic plots can only be drawn for q up to 4.
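The rule for which subgroups reach the diagram can be sketched as follows. This is only an illustration of the rule as stated above, not SPAN's own code; the function name and the list-of-counts input are assumptions made for the example.

```python
def subgroups_to_draw(sizes, plot="srd"):
    """Return the indices of the subgroups that would appear in a diagram.

    sizes: number of observations in each of the q subgroups (hypothetical input).
    plot:  "srd" or "mosaic".
    """
    q = len(sizes)
    if plot == "mosaic":
        # Mosaic plots can only be drawn for q up to 4.
        return list(range(q)) if q <= 4 else []
    if q > 10:
        # No SRD is drawn when q exceeds 10.
        return []
    if q <= 6:
        return list(range(q))
    # 7 <= q <= 10: keep the six largest subgroups, drop the remaining q-6.
    largest = sorted(range(q), key=lambda i: sizes[i], reverse=True)[:6]
    return sorted(largest)
```

For example, with eight subgroups of sizes 5, 9, 1, 7, 3, 8, 2, 6 the two smallest (sizes 1 and 2) are dropped.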

Different ways to "fit" an SRD are allowed. Initially a Rectangle Diagram construction dialog will be shown:

 

 

In this dialog an initial configuration method can be selected, which sets the way the rectangles are positioned before the optimisation fitting method is applied. The optimisation criterion for fitting can also be chosen as either log-likelihood, least squares or least absolute differences (see below).

There are two penalising parameters which affect the optimisation criterion. One is a cell thinness parameter, which penalises long skinny rectangles, since diagrams with skinny rectangles are harder to interpret. The other penalises absent cells, that is, cells that contain observations but are not represented on the diagram.

The Permute check box re-orders the rectangles. With a Random initial configuration this will have no effect, but it may affect the other possibilities for initial configuration.

There is also a scale parameter, which is usually one. This is the parameter α described in 16.1.1, which applies a power scaling to the frequencies.

 

Mouse clicks allowed in Rectangle window

As soon as an SRD is created, the Rectangle Diagram window is active for mouse input. There are a number of buttons which allow changes to the configuration in different ways, and also allow switching to a mosaic plot. Most of these buttons are disabled in Mosaic plot mode, where they have no relevance.

Further, left clicking on an edge of one of the SRD rectangles brings it to the front, that is, on top of the other rectangles. Initially the rectangles are drawn with the largest on the bottom and the smallest on the top. Right clicking on an edge renders the rectangle "invisible". Although it is not drawn it is still "there". To restore it, right click again on any edge.

If you click inside one of the rectangles a pointer line is drawn from outside the square, with a label indicating its name. A second click of the left mouse button will increase the intensity of the colour and a right click will decrease it.

Here is an example of an SRD showing three subgroups. The colours used are the default colours; you can change to black and white with the Mono button, or re-colour with the Recol button. Other shadings are possible with the Shade button.

 

 

 

The top row buttons are:

Esc releases expected mouse input from the diagram (clicking the right mouse button is equivalent). In Mosaic mode, Esc takes you back to the active scaled rectangle diagram.  

----------------------

Mono alternates between colour and monochrome. 

---------------------

Skew produces a 3D view of the configuration.  The entire bitmapped image is skewed.  Other options do not work on the skewed image; clicking a button will revert to a flat image, which can then be skewed. (In fact, though, the skew option is an add-on gimmick that I do not entirely endorse) 

---------------------

Label Allows cells to be labelled in different ways:

 

Cell numbers gives the count in each cell; cell error gives the discrepancy between each count and the actual area. Standardised residuals are Pearson chi-square residuals. Mean of Y (incidence rate) adds the mean value of Y in each cell; if the log-rank criterion is selected, incidence rates are added instead. Relative risks and odds ratios are for binary outcomes, giving risk relative to the "white space", that is, the absence of attributes.

Cell codes adds codes. The code 1**4, for example, refers to the cell that is in subgroups 1 and 4 and not in 2 or 3. Clicking on Code will also tell you whether there are Absent cells, that is, cells for which there is data (f_i > 0) but no area on the diagram (a_i = 0) to represent them. It is quite likely that absent cells will be present for q ≥ 5. There may also be No data cells, that is, cells drawn on the diagram but with no observations (zero frequency).
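As an illustration of how a cell code can be read, here is a minimal sketch. It assumes one character per subgroup (so q of at most 9), with position i showing the digit i when the cell lies inside subgroup i and * otherwise; the function names are hypothetical and are not part of SPAN.

```python
def decode_cell_code(code):
    """Split a cell code such as '1**4' into (inside, outside) subgroup numbers."""
    inside, outside = [], []
    for position, ch in enumerate(code, start=1):
        if ch == "*":
            outside.append(position)   # cell is not in this subgroup
        else:
            inside.append(position)    # cell is in this subgroup
    return inside, outside


def cell_code(membership):
    """Build the code from a list of booleans, one per subgroup."""
    return "".join(str(i) if member else "*"
                   for i, member in enumerate(membership, start=1))
```

With this reading, decode_cell_code("1**4") gives subgroups 1 and 4 as "inside" and subgroups 2 and 3 as "outside".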

----------------------

Nums adds the observed frequency f_i of each cell of the diagram.

----------------------

Error adds the component of E from each cell of the diagram. These are the values:

100(a_i - f_i^α)

A negative value shows the cell to be too small, a positive value too large. The quantities a_i, f_i and α are defined below.

----------------------

Indep produces a diagram based on expected cell numbers, assuming the characteristics of the subgroups are independent. The result of a chi-square test of independence is output. (Note: it is not a test of independence of attributes, unless the subgroup rectangles are defined by single attributes.)
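One way to picture the expected cell numbers is the sketch below. It assumes that membership of the q subgroups is mutually independent, so that each of the 2^q cells has expected count n times the product of the relevant membership proportions; whether SPAN computes them in exactly this way is an assumption. The Pearson residual (O-E)/sqrt(E) used by the Standardised residuals label is included for comparison. Function names are illustrative only.

```python
from itertools import product
from math import sqrt


def independence_expected(n, subgroup_props):
    """Expected count for every membership cell under independence.

    n:              total number of observations
    subgroup_props: proportion of observations in each subgroup
    Returns a dict keyed by tuples of 0/1 flags, one flag per subgroup.
    """
    expected = {}
    for cell in product([0, 1], repeat=len(subgroup_props)):
        prob = 1.0
        for member, p in zip(cell, subgroup_props):
            prob *= p if member else (1.0 - p)
        expected[cell] = n * prob
    return expected


def pearson_residual(observed, expected):
    """Standardised (Pearson) residual of one cell: (O - E) / sqrt(E)."""
    return (observed - expected) / sqrt(expected)
```

The cell keyed by all zeros corresponds to the "white space" of the diagram.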

-----------------------

Shade produces a dialog for options on shading of the currently displayed diagram:

 

The default is "Subgroup colours". Adding shadow attempts (not very successfully) to show the rectangles as superimposed tiles; "Cross-hatching" produces cross-hatched shading. This is not very successful either, as cross-hatching seldom is (given the nature of computer screen pixel representation).

"Fuzzy 99% edges" is an experimental way of smudging the edges of the rectangles according to the sampling variability of the counts in each rectangle. That is, the rectangle at the outer edge of the blur represents the upper 99% confidence limit of the binomial count, while the inner edge of the blur represents the lower limit.

"Translucent" shows the rectangles as semi-transparent tiles of coloured glass and works reasonably well. Here is an example:

 

"Mean of Y (or incidence rate)" shading (the log-rank criterion gives the incidence rate) produces shaded cells running from yellow through to deep red according to the value of Y in each cell. The levels of the shading can be determined automatically or specified by the user; a dialog is presented for user-entered values. Also, clicking on the highest (or lowest) level box of the legend will expand or decrease the level spacing.

"Standardised residuals" show Pearson residuals for each cell from assumption of independence i.e. (O-E)/sqrt(E), where O and E are observed and expected cell counts.

----------------------

Recol This button is used to re-colour the configuration. If the Attribute colour shading is used, it changes the colours by selecting other colours (slightly pastel shades) at random. If you Save a diagram the colours will be saved with it, so you can in effect save a set of colours you like and get back to them. The Undo button reverts to the previous colours. You can also get back to the Default colours via the Subgroup and Subgroup+Shadow options of the Shade dialog. If you have "Translucent" shading, the Recol button re-orders the colours of the translucent cells and may improve the appearance.

Note also that the intensity of the Subgroup colours can be reduced or increased by a second click inside the subgroup rectangle after clicking on it to add the pointer arrow. The left mouse button increases the tonal intensity; the right button decreases it.

----------------------

Nudge gives a slight perturbation of the rectangle coordinates and begins Powell's iterative procedure (see below) again. This facility is introduced, since by experience I have found that Powell's iterative procedure can become stuck at a local minimum. A Nudge may produce a diagram with lower E; but it also may make matters worse. Or it may make no difference at all. The Undo button (see below) allows you to get back to the previous configuration if you wish. Nudge may also provide a more aesthetically pleasing or easy-to-see diagram. Alternatives to nudge to produce a new configuration are the >next and Renew buttons.

-----------------------

Renew This takes you back to the Rectangle diagram construction  dialog box.

----------------------

Mosaic/SRD Toggles between an SRD representation and a mosaic plot. Mosaic plots can only be done for q up to 4.

----------------------

Undo You can get back to the previous configuration with this button. The button only allows one step back. Once used, the button immediately changes to Redo, which takes you back to the configuration just "undone".

----------------------

NoEnc  This removes the boundary enclosing rectangle and optimises fit ignoring the whitespace. 

----------------------

>next This jumps to the next possibility for a starting configuration. The next possibility will be whatever initial configuration type has been chosen in the Rectangle diagram construction  dialog box (see above)

----------------------

Save Saves the current configuration coordinates in a file with a .srd extension. The saved configuration can be retrieved by going to Renew and selecting Saved from the Rectangle diagram construction dialog box. The Saved item presents a dialog to Open a saved .srd file. Note that a .srd file contains only the rectangle coordinates of the current configuration. If you change the input partition, the chosen saved configuration will still be applied and may not produce a sensible representation. If the input partition has a different q from that saved in the .srd file, a fix-up will be attempted.

-----------------------

+ and - buttons. These increase and decrease the size of the configuration.

16.1.1 Rectangle construction method

SPAN creates rectangle diagrams by beginning with an initial configuration and applying an iterative procedure to minimise the discrepancy between area and cell frequency. The initial configuration is determined by a certain method and a permutation of the groups (see below). Powell's minimisation algorithm is used to minimise a measure, D, of the difference between cell frequencies and area. There are three measures:

Least absolute difference

 

Least squares

 

Log-likelihood

 

where C = 2^q is the number of cells, f_i is the observed frequency in cell i, a_i is the geometric area of region i, a = Σ a_i, and f = Σ f_i^α is the total metric. Usually α = 1, but this can be altered in the dialog below.

An initial analysis is automatically performed with default "method" and "permutation" of the rectangle order and with the log-likelihood measure for D.

By experience I have found that D can be "flat" over regions of the parameter space (i.e. the coordinates of the rectangles) and the optimal configuration may not be achieved. You can nudge  the configuration in the hope of improvement. Or a new analysis can be tried when you click on Renew in the rectangle diagram window, allowing you to change the method(s) and permutation(s) used to establish an initial configuration and D in the   Rectangle Diagram construction dialog box.

The overall % error, E, of the displayed subgroup rectangle diagram is computed as

 

which is the same as unweighted D, expressed as a percentage. It is reported for each diagram, as in the examples above. E_max is the largest component over i.

The Fit P value that is reported together with E at the head of a diagram (see example) is the P-value associated with a chi-square test of the agreement between frequency and area.

Note that the enclosing unit rectangle of a diagram represents the entire sample (or a sub-sample, if using by-groups). This means that the "white space" may represent not only observations that are not in any of the rectangles, but also observations for which the partition could not be evaluated due to missing attributes.

16.2 List

This option will produce a list of identifiers of the observations in A and A- and in the intersections. For example, if #( 1,2)=5, that is, there are 5 observations in the intersection of subgroups 1 and 2, it will be followed by a list of the identifiers of these 5 observations.  The item is only active if an identifier  _id_ has been specified in the control file.

16.3 Distribution

The Process:Distribution item will create a graphic of the distribution of the dependent variable in A and A- , the two plots being superimposed so you can see the discrimination of the partition. As in 10.4, there are five types of plot available: empirical probability density, bar-chart of relative frequencies, bar chart of absolute frequencies, survival plot and distribution function. You will be required to select which you want. If you select empirical probability density, one of the two curves will first be drawn and the Band+ and Band- buttons appear to allow adjustment of the bandwidth smoother. When you Esc the adjustment the second curve appears; it is calculated with the same bandwidth as the first.

The frequency of missing values is shown on the distribution diagram to the left of the frequency/probability axis.

Survival and distribution function plots are  non-parametric. The Kaplan-Meier (product-limit) formula is used if the log-rank criterion has been selected and there is censoring (see A3.7).
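For readers unfamiliar with the product-limit formula, a minimal sketch is given here. It is the textbook Kaplan-Meier estimate, written for illustration; the function name and the input layout (parallel lists of times and event flags) are assumptions, and it is not taken from SPAN's source.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times:  observed time for each observation
    events: 1 if the event was observed at that time, 0 if censored
    Returns a list of (time, survival probability) at each distinct event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather every observation tied at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Both events and censored observations leave the risk set.
        at_risk -= removed
    return curve
```

Censored observations contribute to the number at risk up to their censoring time but do not produce a step in the curve.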

16.4 Statistics

This option will give you statistics associated with a partition. The information is also written to the output log which is automatically displayed. (Other items of the processing menu also write the information to the log, but do not display the log window; you need to use View to display).

16.4.1 2x2 table statistics

If Y is discrete a frequency table of A,A- (as columns) by the dependent variable (as rows) is produced. If there are two values of Y various 2 by 2 table statistics will be shown:

 

These statistics are:

  1. Positive and negative predictive values treating A and A- as positive and negative "tests" respectively, and the right hand column of the table as a positive outcome. Confidence intervals for a proportion are done by Wilson's method.
  2. Sensitivity and specificity again treating A and A- as positive and negative "tests".
  3. Quality indices QI(0), QI(1) and QI(0.5), as defined in Kraemer (reference 4). Confidence intervals are calculated using the standard error formulæ in Marshall (reference 5). Var(O) is an estimate of the variance of O = log(QI(0)/QI(1)) and is useful for calculating the variance of the bias formula in reference 5.
  4. The odds ratio of the table. To avoid division by zero, 0.5 is always inserted in empty cells. Confidence intervals are calculated by the logit method.
  5. The relative risk, treating A (second column) as the "risk factor (exposure)" and the second row as the "disease". Confidence intervals are calculated by the logit method.
  6. If you have a by-group in place, the odds ratios and relative risks of each table are also combined by the Mantel-Haenszel and logit methods. 95% confidence intervals are calculated for the pooled logit estimates.

16.4.2 Chi-square, generalised error etc

Whether or not the table is 2 by 2 or k by 2, the following is also output:

  1. The value of the chosen effectiveness measure, not complexity penalised. The word Balanced is prefixed if γ > 0 (see 11.2).
  2. The Penalised effectiveness measure for given beta and for the complexity of the partition.
  3. Correlation. This is the product-moment correlation coefficient between the dependent variable and a binary variable that is 0 or 1 depending on whether an observation is in A or A-. In this calculation the dependent variable is treated numerically as its actual value.
  4. Chi-sq statistic and Phi-coefficient. The chi-square statistic χ² of the k by 2 frequency table and associated P-value (not corrected for continuity). The phi-coefficient is φ = √(χ²/n), where χ² is the chi-square statistic of the frequency table and n the sample size. When the frequency table is 2 by 2, the correlation and phi-coefficient are equal (in absolute value).

    (Note that chi-square for general cross-tabulations can be obtained from Y:Scatterplot)
  5. Generalised error. The generalised measure of error, E, is a measure of the discrepancy of the values of Y in A- and in A from the minimum and maximum respectively, that is,

     


    If, for example, Y is binary then E is the misclassification rate of the 2 by 2 table.

    If Y has k categories taking values 0,1,..., k-1 and A- and A are coded 0 and 1 then

     

    where {ni,j} is the 2 by k table of counts.
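The chi-square statistic and phi-coefficient of item 4 can be illustrated with a short sketch. It uses the ordinary Pearson chi-square (no continuity correction) and assumes every row and column of the table has a non-zero total; the function names are illustrative only.

```python
from math import sqrt


def chi_square(table):
    """Pearson chi-square statistic of a frequency table (list of rows).

    Returns (statistic, n), where n is the total count.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    return statistic, n


def phi_coefficient(table):
    """Phi-coefficient: square root of chi-square divided by sample size."""
    statistic, n = chi_square(table)
    return sqrt(statistic / n)
```

For the 2 by 2 table with rows (10, 20) and (30, 40), phi is about 0.089, which matches the absolute value of the product-moment correlation between the row and column indicators.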

The Statistics option will give a breakdown of the subgroups of A and of A- :

 

The mean and standard deviation (SD) of the dependent variable in each subgroup, treated strictly as its numerical value and disregarding whether the variable is nominal or ordinal, is shown, as well as the number in the subgroup and the number that are unique (U) to the subgroup, in the sense that they do not appear in any other of the subgroups. If the dependent variable is binary the mean is, of course, a proportion.

16.4.3 Risk matrix

The Process:Statistics option also writes a risk matrix to the output log (only if the Detailed output option is on). If there are q subgroups, I_1, I_2, ..., I_q, in A and q′ subgroups, I′_1, I′_2, ..., I′_q′, in A-, an element of the risk matrix is a measure of the risk of subgroup I_i in A relative to subgroup I′_j in A-. In addition, the risk measure is calculated for each of I_1, I_2, ..., I_q versus A-, and for each of I′_1, I′_2, ..., I′_q′ versus A, as well as A- versus A. The matrix is therefore of dimension (q+1) by (q′+1). For example, if Y is an interval variable the mean difference

 

is output, for each pairing. A 95% confidence interval is also output.

On output, a separate row is used for each element of the matrix. For example:

 

Here row 3 2- refers to the relative risk between subgroup 3 of A and subgroup 2 of A-.

When Y is binary, a matrix of risk differences, relative risks (risk ratios) and odds ratios is output. Confidence intervals are calculated for the odds ratios and relative risks by the logit method.

16.4.4 Log-rank: incidence rates

If the Log-rank effectiveness criterion has been specified, the Process:Statistics feature will also output incidence rates and incidence rate ratios.

 

As SPAN attempts to partition such that A is associated with large Y, which in this case means greater survival than in A-, incidence rates will be expected to be greater in A- (and its subgroups) than in A. In consequence, incidence rate ratios are output comparing rates in A- (the numerator) to rates in A (the denominator). Also, when log-rank is specified, Harrell's c-index for concordance of predicted and actual survival times is output. This is computed by pairwise comparison of observations and may be slow. You can escape the computation with Search:End.
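The reason the c-index can be slow is that every pair of observations is examined. The sketch below shows one common form of Harrell's c-index; the handling of tied predictions (counted as one half) and of tied times (ignored) is an assumption, as are the function name and inputs, and SPAN's own computation may differ in these details.

```python
def harrell_c(times, events, scores):
    """Concordance index by pairwise comparison.

    times:  observed survival or censoring time
    events: 1 if the event was observed, 0 if censored
    scores: predicted survival, higher meaning longer predicted survival
    Assumes at least one usable pair exists.
    """
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is usable when i has an observed event before j's time.
            if events[i] == 1 and times[i] < times[j]:
                usable += 1
                if scores[i] < scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5
    return concordant / usable
```

The double loop makes the cost grow with the square of the number of observations, which is consistent with the warning above.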

16.4.5 Table of Prior adjusted pseudo counts

If prior probabilities are user specified (in the Criteria dialog) a table of pseudo counts is also output. This table is based on adjusted values of the cell counts, adjusted to ensure that row sums are in proportion to the specified prior distribution. Cell entries of row j, corresponding to category j of the Y variable with specified prior probability p_j, are multiplied by p_j/(n_j/n), where n_j is the total of row j and n is the overall total.
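The adjustment can be sketched in a few lines. The function name and the list-of-rows table layout are assumptions for illustration; n_j is taken to be the total of row j and n the grand total.

```python
def prior_adjusted_counts(table, priors):
    """Scale each row j of a count table by p_j / (n_j / n).

    table:  list of rows, one row per category of Y
    priors: specified prior probability p_j for each row
    """
    n = sum(sum(row) for row in table)
    adjusted = []
    for row, p_j in zip(table, priors):
        n_j = sum(row)
        factor = p_j / (n_j / n)
        adjusted.append([count * factor for count in row])
    return adjusted
```

After the adjustment each row sums to n times p_j, so the row sums are in proportion to the prior distribution while the total n is unchanged when the priors sum to one.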

16.5 Random

This option generates binary partitions of the data at random and thereby allows randomisation tests of a partition. Whether an individual is assigned to A or A- is done by calling a random number generator. The purpose of the routine is to assess the statistical significance of regular partitions by posing the question: "could the partition, or partitions, have arisen by chance?". (More details are in Appendix 7)

When the routine is entered you select a partition for analysis, as with other items of the Process menu. The chosen partition is the one to be tested for significance. You will be shown a Random partitions dialog box:

 

This gives the number of partitions generated on the last search, say N, and the measure of effectiveness of the chosen partition (test G). The statistical significance of test G can be assessed by generating the sampling distribution of G from random partitions of the data.

To generate a random partition requires specifying a probability pA that an observation will lie in A. Specifying pA can be done in three ways chosen by the appropriate radio button:

  1. Mimic split distribution. Sample pA from the frequency distribution of the empirical pA's that occurred in the previous search of size N.
  2. Random U(0,1) split distribution. Sample pA from a uniform distribution over 0 to 1.
  3. Fixed % split distribution. Fix pA at some value common to all partitions. You need to enter an associated percentage.

As option 1 produces splits with similar proportions in A and A- as in the search just done, it is probably to be preferred.

With a given pA and n observations, a partition with [npA] observations in A will be generated, by sampling with probabilities conditional on the pool of remaining unassigned observations.
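A minimal sketch of the generation step follows. It reads [npA] as the integer part of n times pA and uses sampling without replacement to pick the members of A; the function names, the method labels and the use of Python's random module are all assumptions made for illustration.

```python
import random


def draw_p_a(method, rng, empirical=None, fixed_percent=None):
    """Choose pA by one of the three split distributions."""
    if method == "mimic":
        # Sample from the pA values that occurred in the previous search.
        return rng.choice(empirical)
    if method == "uniform":
        return rng.random()
    if method == "fixed":
        return fixed_percent / 100.0
    raise ValueError("unknown method: " + method)


def random_partition(n, p_a, rng):
    """Return a list of n booleans, True meaning the observation is in A."""
    size_a = int(n * p_a)  # [n pA], taken here as the integer part
    in_a = set(rng.sample(range(n), size_a))
    return [i in in_a for i in range(n)]
```

A typical use would be to call draw_p_a once per random partition, build the partition, and evaluate G on it to accumulate the sampling distribution.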

You will need to enter the number of random partitions to generate. This can be set to zero if you just wish to view the distribution of G in searches of the data. The default value is 1000.

The survival distribution of G for random partitions as well as the actual value of the G's on the previous search will be produced as follows:

 

The Random sampling distribution is the distribution of G by simulated partitions. The distribution of generated criteria on the last search is the Search distribution with G_test being the chosen partition to test. Note that the search distribution and the G_test value are not the complexity penalised values.

16.6 Tree

If a tree has been produced by Strategy:Tree (see 13.5), you can use this option to view it and, if required, prune or expand it, without having to recreate the tree as in 13.5.

If you have generated a tree from a training sample (produced by Edit:Split sample) then switching to the test sample (using Process:Test sample) and running Process:Tree will "drop" the test sample down the tree.

16.7 Test[training] sample

If the data have been randomly split into a test and training sample (see 7.3), the Process: Test [training] sample item will be enabled in the Processing menu to allow you to change to analyse the test sample. If you switch to the test sample, the next time Process is called the item becomes Training sample so that you can switch back and forth in this way.

16.8 Create added attribute

You can use the Process menu to create an attribute. It will be assigned a number prefixed by &, i.e. &1, &2, &3 etc.

The created attributes and variables are not physically added to the data set; they are lost when you exit SPAN or when a different data set is re-read from File:Read .SPN file. However, you can save the Boolean combination of the added attribute to the control file, and the attribute will be immediately recreated when you next read in the data from the control file.

The added attribute value is added for all observations of the input data set, even if, for example, you have restricted analysis to a subset, using Strategy:Constrain the X space.

The number of added attributes is limited by the parameter MAXADD (see Appendix 4).  Once you reach this number all but the last 10 created added attributes will be deleted and lost before proceeding.

 

16.9 Manual partitions [Simplifying Boolean expressions]

To input your own Manual partition you will be shown a dialog box with a list of attributes, with a number assigned to each. You have to fill in a partition expression in disjunctive normal form (dnf).

 

The dnf expression must be in the form of lists of attribute numbers grouped by parentheses as follows

(list_1 ) (list_2 )......

where list_1, list_2, ... are lists of attribute numbers. To enter the complement of a listed attribute, enter the negative of its attribute number.

You can type in the expression directly or use the Add button to insert an attribute number that is highlighted.
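To make the notation concrete, the sketch below parses an expression of this form and evaluates it for one observation. It assumes that attribute numbers within a list are separated by spaces or commas, which is not stated above, and the function names are hypothetical.

```python
import re


def parse_dnf(expression):
    """Turn '(1 -3) (2)' into [[1, -3], [2]]."""
    groups = re.findall(r"\(([^()]*)\)", expression)
    return [[int(tok) for tok in re.split(r"[,\s]+", group.strip()) if tok]
            for group in groups]


def in_partition(dnf, attributes):
    """True if the observation satisfies any list (subgroup) of the dnf.

    attributes: dict mapping attribute number to True/False for one observation.
    A negative number in a list means the complement of that attribute.
    """
    def literal(number):
        value = attributes[abs(number)]
        return value if number > 0 else not value

    return any(all(literal(number) for number in group) for group in dnf)
```

Each parenthesised list is an "and" of its attributes, and the lists are joined by "or", so (1 -3) (2) reads as "(attribute 1 and not attribute 3) or attribute 2".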

The above dialog creates the following partition

 

A=(SEX_M) or (AGP>0.82) or (AGP>1.03) or (AGP>1.44)

which simplifies to

A=(SEX_M) or (AGP>0.82) 

If the Representation of A- box is ticked, the partition determined by the combination is assumed to represent the complement A-.

The manual input facility can be used to simplify Boolean expressions, as in the example above. Enter a partition in manual form and it will automatically be simplified if Option:Full Boolean reduction is checked. If Boolean simplifications are not wanted (which is sometimes useful for creating Rectangle diagrams, for example if you wanted to include the three nested rectangles for levels of AGP in above example) ensure Option:Attributes are primitive is checked.
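The simplification in the example above is an instance of absorption: any observation with AGP>1.03 also has AGP>0.82, so the larger thresholds add nothing to the union. A minimal sketch of this one rule, for subgroups that each consist of a single condition, is given here. It is not SPAN's reduction algorithm, which handles general Boolean expressions; the data layout and function name are assumptions.

```python
def simplify_threshold_terms(terms):
    """Reduce an 'or' of single-condition subgroups.

    terms: list whose items are either an attribute name such as 'SEX_M'
           or a tuple (variable, threshold) meaning variable > threshold.
    For each variable only the lowest threshold is kept, since it absorbs
    the higher ones. Named attributes are de-duplicated and listed first.
    """
    lowest = {}
    names = []
    for term in terms:
        if isinstance(term, tuple):
            variable, threshold = term
            if variable not in lowest or threshold < lowest[variable]:
                lowest[variable] = threshold
        elif term not in names:
            names.append(term)
    return names + [(variable, t) for variable, t in lowest.items()]
```

Applied to SEX_M, AGP>0.82, AGP>1.03 and AGP>1.44, this leaves SEX_M and AGP>0.82, as in the example.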
