The Search:Go! item sets a search going in accordance with the parameters
determined in Y, Criteria, and Strategy
menus and dialogs.
15.1 Search
During a search a record is kept of the best partition at each complexity that is
generated and of the "top 10", but no record is kept of every generated
partition; if a partition does not supplant the current best partition of the same
complexity, it is discarded.
In general, search methodology has two fundamental maxims ensure every stone is
turned and ensure no stone is turned more than once. The former
is ensured by defining explicitly what is a stone, that is, by setting
the extent of the search parameters in Strategy and Extent. The
"lock and key" algorithm (Appendix 1) for enumerating regular partitions has been
extensively tested and will not fail to "turn every stone".
Appendix 1
Ensuring that "no stone is turned more than once" is not automatically guaranteed,
as it is possible that in certain circumstances after Boolean algebraic simplifications
the same partition is re-generated.
As no record is kept of partitions that have been generated, no test is made to
see whether a partition has previously been generated. However, there are certain
tests that SPAN invokes to skip partitions that are repeated to avoid unnecessary
work (Appendix 6).
It is possible that one partition may have identical effectiveness and complexity,
that is, it is "tied" with another. In this case SPAN has two options for which
to choose: either take the first found or the last found. The option is set by the
Options menu (see 17.9)
There is no limit on the size of the search. The program will just keep on keeping
on (thanks Bob) until the search ends or you intervene to halt it.
15.2 During the search
The I'm Thinking window will display progress of the search.
If the Option:Display Graphics while searching is checked the
Searching window is show and paused between iterations. A search may be interrupted by as described in
15.7.
15.3 The end of a search: complexity hull
When a search is ended (whether completed or is user terminated), or at the end
of each iteration, a complexity hull is drawn on the complexity versus effectiveness
diagram:
The trace drawn is the (upper) hull of the points generated on the search. The optimal
point is marked (in red) (the point with greatest vertical distance from the line
with slope b) and the partition to which
it corresponds is output in the Partition window. The circle points that
are the maxima for each generated complexity that do not lie on the hull.
The Partition window displays the optimal partition of the search. While
the Searching window is active for mouse input you can click on the points
of the complexity hull. The partition corresponding to the clicked point is written
in the Partition window. You can also adjust the
b penalty parameter, as described below (15.4).
In iterative mode a new
complexity versus effectiveness diagram is generated for each iteration. The optimal
partition on each iteration is lodged in memory as an added attribute and can be
accessed for analysis by Process.
The end of the search will also display the Search Summary table:
This gives the number of partitions generated, number of iterations, complexity
of the final optimal partition and its penalised measure of effectiveness. Also
the resubstitution measure of generalised misclassification error is given. In this
example, there are no by-groups and the data is not split into test/training sample.
If a split has been made into test and training sample (using Edit:Split sample)
the training sample is used to develop the optimal
partition, which is then applied
to the test sample. The Search summary table then also computes the test sample
estimate of misclassification and the sample size of the test portion of the data.
15.4 Penalty parameter setting
You can adjust the b penalty parameter.
A red line though the origin is is drawn with slope equal to the penalty parameter
b and, by clicking on
Beta+ and Beta- in the
top left hand corner of the window, b is
adjusted. Note that the new value becomes the default for subsequent searches.
You can also set b via the Criteria dialog
box.
If the Criteria dialog box has "automatic" entered for
b, its value is determined automatically as a fraction of the slope of
a line from the origin to the first point on the complexity hull at complexity c*
and effectiveness G*, that is, b
= f ×G*/c*. Usually, but not necessarily, c*
= 1. In iterative mode (see below), this value of
b is obtained on the first iteration and retained for subsequent iterations.
When automatic is used the retained
partitions in the Top 10 are not complexity penalised.
The fraction f = 0.05 has, from experience, been found appropriate and
is what is used.
15.5 Searching in iterative mode
In iterative mode, each iteration is a new search and a new complexity versus effectiveness
plot is generated. These are unseen unless the Option:Display Graphics while searching
is checked
If the best partition on the first iteration is a single attribute then the iterative
process will be deemed to have converged, as further iterations cannot lead to improvement.
15.6 Search interrupts
You can interrupt or abandon a search with Search menu items
as follows:
Search: Go!
This sets a search going (including a search to generate a tree)
Search: Pause
This temporarily suspends a search. It does not allow you to carry out any other
SPAN functions while paused. But you can invoke other software.
Search: Continue
This continues a search that has been temporarily suspended with Search:Pause.
Search: Escape
Escapes from a search but may not terminate the search. Specifically, if searching
with a specified p1,..pq, it abandons the
search with those parameters. Also, in iterative mode, will allow the search to
continue to the next iteration. However, in tree mode, this does abandon the entire
tree search.
Search: End
Terminates a search completely, including subsequent iterations if in iterative
mode.
15.7 Search with by-group/cross-validate
When you do Search with a Y:By-group/cross-validate in place a
separate search is done for each by-group and the optimal partition for that by-group
is then applied to the remainder, the test sample, that is not in the by group.
Both the re-substitution misclassification error (Resub_error)
and the misclassification on the test (Test_error)
are calculated and output when the search is completed over all by-groups in the
Search summary table:
Here there are 5 by-groups, so
effectively doing 5-fold cross-validation. In the
first by-group after 2 iterations and generating 5914 partitions an optimal partition
of complexity 3 was found. The re-substitution error is 14.1% on the by-group used
to generate the partition. When this partition is applied to the sample of 39 not
in the by-group the misclassification error was 28.2%. The cross-validated error
is averaged over the by-groups is 21.03%. To obtain the partition for which 21.03%
is supposed to be a measure of misclassification error, another search must be done
with Y:By-group/cross-validate unchecked.
Formally, a measure of error, say Ev¢, is computed for the portion of the data
that is not in by-group v for the optimal partition in by-group
v. It is accumulated into a cross-validated measure of error:
where nv¢ is the sample size not in by-group v.
15.8 Search in tree mode
When Strategy:tree has been checked, a tree is constructed down to only five
levels, the final level comprising 32 terminal nodes. A Tree Auto Stopping
rule
dialog box is presented which prunes the tree according to the rules.
This dialog sets the minumim size for a terminal node, and the minumum size of the
parent of a terminal node. For example, in the dialog, if a node has fewer than
40 observations, no attempt is made to split it. (When the splitting criterion is
log-rank and Y variable is survival time these numbers refer to number of
cases). Also a critical z value must be achieved for a split. Finally, if Y is nominal
and the predicted class of the child nodes are the same, a parent node is not split,
if Don't split if predicted same class is checked. If the outcome
is not discrete, a terminal node is highlighted and considered "high risk" if the
mean value at the node (or incidence rate) exceeds the mean value of the whole sample.
In this case if the Don't split if predicted same class is checked a node
that has the same level of risk is not split.
The dialog is also used to exit the tree window with EXIT TREE.
Here is an example of a tree:
The boxes at the terminal nodes are drawn in area proportional to the sample size
of the node. The statistics at each terminal node give the frequency of the categories
Y, if Y is discrete, otherwise the mean of Y is written. The bold
(blue) frequency is the most frequent category at the node (i.e. the Bayes rule
for classification). The classification table and Error are (re-substitution) estimates
based on tree and the Bayes rule classification. The terminal boxes are shaded according
to the value of the outcome variable. Deepest red the highest, palest yellow lowest.
Pruning and expansion of the tree can be done with the mouse. Clicking on a terminal
node (specifically where the branch meets the node rectangle) expands the tree to
give its two child nodes. Clicking on a non-terminal node (that is where a branch
splits) prunes the branches so that the clicked node becomes a terminal node.
The Rules button invokes the Tree Auto Stopping Rule dialog again to changes the rules, as required.
It also allows Exit from the tree window. When you Exit the tree the collective partition of
the unioned highlighted nodes is evaluated and simplified. It becomes the "Current"
partition for the Process menu items.
The
buttons + and - reduce or expand the size of the diagram, while the "up", "down",
"left", "right" buttons, move the image. "Undo" undoes these re-sizings.
The "Auto" re-grows a tree according to the criteria in the Stopping Rules dialog
(if it has been manually pruned or grown).
Terminal nodes can be highlighted (green border in colour as above) by clicking
inside the terminal node box.
The Mono button changes from colour to black and white.
[Back to table of contents]