School of Population Health


Appendix 4. Computational notes

 

A4.1 Turbo facility and timings


The program will automatically invoke a "turbo" facility to speed up the computations if Option:Turbo on is checked. This is done by pre-processing the data and collapsing identical observations into groups, defined by the distinct values of Y (nY of them, say) and the m binary elements of the attribute set. The process is only invoked if nY < 10 and m + nY - 1 ≤ 15; otherwise the storage required to implement the collapse becomes heavy and the benefits are minimal.

However, the computational gains can be enormous when the data set is large. For example, one test data set with about 50,000 observations and binary Y collapsed to just 279 distinct groups, out of the 2^15 = 32,768 possible combinations with m + nY - 1 = 15. The effective number of records was therefore 279 rather than 50,000.
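As a rough illustration of the rule above, the following Python sketch checks whether the collapse would be invoked for a given nY and m, and reproduces the count of possible combinations quoted in the example. The function names are hypothetical and this is not SPAN's own Fortran code; it only restates the condition as given in the text.

```python
def turbo_eligible(n_y: int, m: int) -> bool:
    """Condition stated in the text: nY < 10 and m + nY - 1 <= 15."""
    return n_y < 10 and m + n_y - 1 <= 15


def combination_bound(n_y: int, m: int) -> int:
    """Number of possible combinations as quoted in the text: 2^(m + nY - 1)."""
    return 2 ** (m + n_y - 1)


if __name__ == "__main__":
    # The example in the text: binary Y (nY = 2) with m + nY - 1 = 15, so m = 14.
    n_y, m = 2, 14
    print(turbo_eligible(n_y, m))     # True
    print(combination_bound(n_y, m))  # 32768
    # Reduction in effective records for the quoted test data set.
    print(round(50_000 / 279))        # 179
```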

Searches with large sample sizes (e.g. 50,000) can be slow for any Y that has more than a few distinct values. The turbo procedure may also not be invoked when the cuts of an attribute are allowed to float (see 14.4), because each possible cut is counted in defining m. For instance, when creating a tree, where cuts would normally float, the turbo facility may not be invoked.

If you have a very large data set for which records are not unique, it is a good idea to pre-process the data by collapsing it into grouped form and using the _freq_ input count variable.
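The pre-processing suggested above can be sketched in Python as follows. This is a minimal illustration, assuming each record is a sequence of values. The function name collapse_records and the record layout are hypothetical, and the count is simply appended as a final field to play the role of _freq_. How the grouped file is then declared to SPAN is not covered here.

```python
from collections import Counter


def collapse_records(records):
    """Collapse identical records into one row each, with a trailing count.

    The trailing count plays the role of the _freq_ variable (assumed layout).
    Rows are returned in order of first appearance.
    """
    counts = Counter(tuple(r) for r in records)
    return [row + (freq,) for row, freq in counts.items()]


if __name__ == "__main__":
    # Four raw records, three of which are identical.
    raw = [(1, 0, 1), (1, 0, 1), (0, 1, 1), (1, 0, 1)]
    for row in collapse_records(raw):
        print(row)
```

The counts in the grouped rows should sum to the original number of records, which is a simple check that nothing was lost in the collapse.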

Another factor that affects the time for a search is the complexity of the Boolean expressions that define the partition, since complex expressions take additional time to manipulate. If an iterative search does not converge and continues to generate more and more complex partitions, performance will slow. In this case, try increasing the complexity-penalising parameter.

A4.2 Program limits


The amount of data that the program can handle is determined by setting dynamic array dimensions at run time. The control and data files are first scanned to establish the amount of storage required. There are, however, certain internal arrays which are not dynamically dimensioned and which are not data dependent. Error messages or crashes will occur if these are exceeded. The specific items to watch for are set by Fortran PARAMETER statements at compilation and are:

  1. The MAXADD parameter. This refers to the number of new added attributes that can be created at run time. Its value is MAXADD=100. In iterative mode a new attribute is added at each iteration, so this number may be reached if many searches are done. If so, all but the last 10 added attributes will be deleted before processing continues.
  2. The MAXP parameter. This is the largest value that p_i or q may take in the representation of a disjunctive normal form Boolean expression. Note that this is after Boolean manipulations and substitution and expansion of added attributes. The parameter may be exceeded in manipulation of very complex Boolean expressions. Its value is MAXP=100.
  3. The NBESTX and NBESTXX parameters. These are effectively the sizes of the stacks for optimal and sub-optimal partitions (see 16). The first is the largest complexity that can be accommodated. Its value is NBESTX=25. The second, NBESTXX, is usually 10+1, giving the "Top 10" partitions.
  4. The NYMAX parameter. This gives the largest number of Y variables that can be specified for a multivariate effectiveness measure (see A3.10), i.e. the upper limit for k in equation (4) (section A3.10). Its value is NYMAX=5.
  5. The MAXBYG parameter. This gives the maximum number of groups allowed for the By Group/Cross validation facility. Its value is MAXBYG=20.
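The MAXADD behaviour described in item 1 can be pictured with the short Python sketch below. It is an illustration only, not SPAN's Fortran code: the function name add_attribute is hypothetical, and it assumes that trimming happens at the moment a new attribute is requested while the store is already full, which the text does not spell out.

```python
MAXADD = 100    # limit quoted in the text
KEEP_LAST = 10  # "all but the last 10 added attributes will be deleted"


def add_attribute(added, new_attr):
    """Return the list of added attributes after adding new_attr.

    Assumption: when the store already holds MAXADD attributes, only the
    most recent KEEP_LAST are kept before the new one is appended.
    """
    if len(added) >= MAXADD:
        added = added[-KEEP_LAST:]
    return added + [new_attr]


if __name__ == "__main__":
    store = [f"attr{i}" for i in range(MAXADD)]  # a full store
    store = add_attribute(store, "attr_new")
    print(len(store))  # 11
```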

If these bounds are exceeded, SPAN error messages will be returned. The program may not proceed or, if it does, it may subsequently crash, and other SPAN or Fortran run-time messages will be issued.

Users wishing to increase these limits can request a new version of SPAN as required.

A4.3 Error Messages


Certain traps are in place in SPAN to detect errors, and messages are issued when they are triggered. Unforeseen errors (see also A4.4 Bugs below) may occur. I would be grateful if users would inform me of any such errors, with details of the circumstances in which they occurred.

The program is compiled to detect array bound overflow. If array bound error messages occur, please report the details to me.

If an error consistently occurs, try downloading the latest version of SPAN (see 2.2) and re-running.

A4.4 Bugs


SPAN undoubtedly has bugs. Please report problems to me.

Known bugs include "QuickWin Internal error-unexpected error", accompanied by a "file "E:\forrtl\...." message. These bugs seem to be machine-dependent. They are not programming errors but stem from the QuickWin shell in which the SPAN Fortran is embedded. They do not appear to affect running; they are merely irritating interrupts. I have been unable to nail them because they happen apparently at random and do not arise on my two personal machines (running Windows XP and Vista). There is material on the Internet about these problems with QuickWin.

Other bugs may include sudden appearance of a blank "Graphics" window.

A4.5 Limitations


All the design, programming and development of SPAN has been done by me, Roger Marshall. Lack of time and stamina, but mainly of Windows programming know-how, stops me from making some of the improvements I would like and from overcoming some of SPAN's limitations and idiosyncrasies, which are:

  • Inability to read data from other systems.
  • A .SPN file must be created to input data and create attributes. There is no "wizard" control file creation facility.
  • Windows are not dynamically linked and some are only active to accept mouse input when created.
  • Windows are not kept; a new tree analysis for example overwrites a previous Tree window.
  • Windows cannot be closed once open.
  • Scrolling the output log and data sets using the View facility is awkward.
  • There is no good spreadsheet layout for browsing (or editing) data.
  • The Help:Topics facility is fairly crude.
  • Cross validation for trees is not allowed.
  • There are no toolbars.
  • Running SPAN twice. If SPAN is already running, you cannot simultaneously fire it up again in the same directory; the second one will fail.
  • There is only a limited facility to save data and created attributes.
  • There is no "save session" facility to get back to where you were.
