Bookmark and Share Print this page
School of Population Health 15. Search

 

The Search:Go! item sets a search going in accordance with the parameters determined in Y, Criteria, and Strategy  menus and dialogs.

15.1 Search

During a search a record is kept of the best partition at each complexity that is generated and of the "top 10", but no record is kept of every generated partition; if a partition does not supplant the current best partition of the same complexity, it is discarded.

In general, search methodology has two fundamental maxims ensure every stone is turned and ensure no stone is turned more than once. The former is ensured by defining explicitly what is a stone, that is, by setting the extent of the search parameters in Strategy and Extent. The "lock and key" algorithm (Appendix 1) for enumerating regular partitions has been extensively tested and will not fail to "turn every stone".

Appendix 1

Ensuring that "no stone is turned more than once" is not automatically guaranteed, as it is possible that in certain circumstances after Boolean algebraic simplifications the same partition is re-generated. As no record is kept of partitions that have been generated, no test is made to see whether a partition has previously been generated. However, there are certain tests that SPAN invokes to skip partitions that are repeated to avoid unnecessary work (Appendix 6).

It is possible that one partition may have identical effectiveness and complexity, that is, it is "tied" with another. In this case SPAN has two options for which to choose: either take the first found or the last found. The option is set by the Options menu (see 17.9)

There is no limit on the size of the search. The program will just keep on keeping on (thanks Bob) until the search ends or you intervene to halt it.

15.2 During the search

The I'm Thinking  window will display progress of the search.  If the Option:Display Graphics while searching is checked  the Searching window is show and paused between iterations.  A search may be interrupted by as described in  15.7.

15.3 The end of a search: complexity hull

When a search is ended (whether completed or is user terminated), or at the end of each iteration, a complexity hull is drawn on the complexity versus effectiveness diagram:

 

The trace drawn is the (upper) hull of the points generated on the search. The optimal point is marked (in red) (the point with greatest vertical distance from the line with slope b) and the partition to which it corresponds is output in the Partition window. The circle points that are the maxima for each generated complexity that do not lie on the hull.

The Partition window displays the optimal partition of the search. While the Searching window is active for mouse input you can click on the points of the complexity hull. The partition corresponding to the clicked point is written in the Partition window. You can also adjust the b penalty parameter, as described below (15.4). In iterative mode a new complexity versus effectiveness diagram is generated for each iteration. The optimal partition on each iteration is lodged in memory as an added attribute and can be accessed for analysis by Process.

The end of the search will also display the Search Summary table:

 

This gives the number of partitions generated, number of iterations, complexity of the final optimal partition and its penalised measure of effectiveness. Also the resubstitution measure of generalised misclassification error is given. In this example, there are no by-groups and the data is not split into test/training sample. If a split has been made into test and training sample (using Edit:Split sample) the training sample is used to develop the optimal partition, which is then applied to the test sample. The Search summary table then also computes the test sample estimate of misclassification and the sample size of the test portion of the data.

 

15.4 Penalty parameter setting

You can adjust the b penalty parameter. A red line though the origin is is drawn with slope equal to the penalty parameter b and, by clicking on Beta+ and Beta- in the top left hand corner of the window, b is adjusted. Note that the new value becomes the default for subsequent searches.

You can also set b via the Criteria dialog box.

If the Criteria dialog box has "automatic" entered for b, its value is determined automatically as a fraction of the slope of a line from the origin to the first point on the complexity hull at complexity c* and effectiveness G*, that is, b = f ×G*/c*. Usually, but not necessarily, c* = 1. In iterative mode (see below), this value of b is obtained on the first iteration and retained for subsequent iterations. When automatic is used the retained partitions in the Top 10 are not complexity penalised.

The fraction f = 0.05 has, from experience, been found appropriate and is what is used.

15.5 Searching in iterative mode

In iterative mode, each iteration is a new search and a new complexity versus effectiveness plot is generated. These are unseen unless the Option:Display Graphics while searching is checked 

If the best partition on the first iteration is a single attribute then the iterative process will be deemed to have converged, as further iterations cannot lead to improvement.

15.6 Search interrupts

You can interrupt or abandon a search with  Search  menu items  as follows:

Search: Go!

This sets a search going (including a search to generate a tree)

Search: Pause

This temporarily suspends a search. It does not allow you to carry out any other SPAN functions while paused. But you can invoke other software.

Search: Continue

This continues a search that has been temporarily suspended with Search:Pause.

Search: Escape

Escapes from a search but may not terminate the search. Specifically, if searching with a specified p1,..pq, it abandons the search with those parameters. Also, in iterative mode, will allow the search to continue to the next iteration. However, in tree mode, this does abandon the entire tree search.

Search: End

Terminates a search completely, including subsequent iterations if in iterative mode.

15.7 Search with by-group/cross-validate

When you do Search with a Y:By-group/cross-validate in place a separate search is done for each by-group and the optimal partition for that by-group is then applied to the remainder, the test sample, that is not in the by group. Both the re-substitution misclassification error (Resub_error) and the misclassification on the test (Test_error) are calculated and output when the search is completed over all by-groups in the Search summary table:

 

Here there are 5 by-groups, so effectively doing 5-fold cross-validation. In the first by-group after 2 iterations and generating 5914 partitions an optimal partition of complexity 3 was found. The re-substitution error is 14.1% on the by-group used to generate the partition. When this partition is applied to the sample of 39 not in the by-group the misclassification error was 28.2%. The cross-validated error is averaged over the by-groups is 21.03%. To obtain the partition for which 21.03% is supposed to be a measure of misclassification error, another search must be done with Y:By-group/cross-validate unchecked.

Formally, a measure of error, say Ev¢, is computed for the portion of the data that is not in by-group v for the optimal partition in by-group v. It is accumulated into a cross-validated measure of error:

 

where nv¢ is the sample size not in by-group v.

15.8 Search in tree mode

When Strategy:tree has been checked, a tree  is constructed down to only five levels, the final level comprising 32 terminal nodes. A Tree Auto Stopping rule dialog box is presented which  prunes the tree according to the rules.

 

 

This dialog sets the minumim size for a terminal node, and the minumum size of the parent of a terminal node. For example, in the dialog, if a node has fewer than 40 observations, no attempt is made to split it. (When the splitting criterion is log-rank and Y variable is survival time  these numbers refer to number of cases). Also a critical z value must be achieved for a split. Finally, if Y is nominal and the predicted class of the child nodes are the same, a parent node is not split, if Don't split if predicted same class is checked.  If the outcome is not discrete, a terminal node is highlighted and considered "high risk" if the  mean value at the node (or incidence rate) exceeds the mean value of the whole sample.  In this case if the Don't split if predicted same class is checked a node that has the same level of risk is not split.

The dialog is also used to exit the tree window with EXIT TREE.

 Here is an example of a tree:

 

The boxes at the terminal nodes are drawn in area proportional to the sample size of the node. The statistics at each terminal node give the frequency of the categories Y, if Y is discrete, otherwise the mean of Y is written. The bold (blue) frequency is the most frequent category at the node (i.e. the Bayes rule for classification). The classification table and Error are (re-substitution) estimates based on tree and the Bayes rule classification. The terminal boxes are shaded according to the value of the outcome variable. Deepest red the highest, palest yellow lowest.

Pruning and expansion of the tree can be done with the mouse. Clicking on a terminal node (specifically where the branch meets the node rectangle) expands the tree to give its two child nodes. Clicking on a non-terminal node (that is where a branch splits) prunes the branches so that the clicked node becomes a terminal node.

The Rules button invokes the Tree Auto Stopping Rule dialog again to changes the rules, as required. It also allows Exit from the tree window.  When you Exit  the tree the collective partition of the unioned highlighted nodes is evaluated and simplified. It becomes the "Current" partition for the Process menu items.

The buttons + and - reduce or expand the size of the diagram, while the "up", "down", "left", "right" buttons, move the image. "Undo" undoes these re-sizings.

The "Auto" re-grows a tree according to the criteria in the Stopping Rules dialog (if it has been manually pruned or grown).

Terminal nodes can be highlighted (green border in colour as above) by clicking inside the terminal node box.

The Mono button changes from colour to black and white.

 

 

[Back to table of contents]



Please give us your feedback or ask us a question

This message is...


My feedback or question is...


My email address is...

(Only if you need a reply)