A4.1 Turbo facility and timings
The program will automatically invoke a "turbo" facility to speed the computations
if Option:Turbo on is checked.
This is done by pre-processing the data and collapsing it into groups of identical
observations, defined by the number of distinct values of Y, say nY, and the m
binary elements of the attribute set. The process is only invoked if nY < 10 and
m + nY - 1 ≤ 15; otherwise the storage needed to implement the collapse becomes heavy
and the benefits are minimal.
However, the computational gains can be enormous when the data set is
large. For example, one test data set with about 50,000 observations and binary
Y collapsed to just 279 distinct groups, of the 2^15 = 32768 possible combinations
with m + nY - 1 = 15. The effective number of records was therefore 279 rather
than 50,000.
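The condition and the arithmetic of the example can be checked with a few lines of Python. This is an illustrative sketch only, not SPAN's Fortran code; the function name turbo_eligible is an assumption made for the example.

```python
def turbo_eligible(n_y, m):
    """Condition stated above: nY < 10 and m + nY - 1 <= 15."""
    return n_y < 10 and m + n_y - 1 <= 15

# The test case in the text: binary Y (nY = 2) with m + nY - 1 = 15, so m = 14.
n_y, m = 2, 14

print(turbo_eligible(n_y, m))   # True
print(2 ** (m + n_y - 1))       # 32768 possible combinations
print(round(50_000 / 279))      # 179, the approximate reduction in records
```

Under these figures the search works on roughly 179 times fewer records than the raw data set contains.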
Searches with large sample sizes (e.g. 50,000) can be slow for any Y
that has more than a few distinct values. The procedure may also not be invoked
when the cuts of an attribute are allowed to float (see 14.4), because each possible
cut counts towards m. For instance, when creating a tree, where you would normally
float cuts, the turbo facility may not be invoked.
If you have a very large data set for which records are not unique, it is a good
idea to pre-process the data by collapsing it into grouped form and using the _freq_
input count variable.
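A minimal sketch of such pre-processing, done outside SPAN, is given below. It assumes the raw data are in a comma-separated file with a header row; the column names y, a1 and a2 and the helper group_rows are hypothetical, and the layout SPAN actually requires is whatever your control file defines. Only the _freq_ variable name comes from the text above.

```python
import csv
import io
from collections import Counter

def group_rows(rows):
    """Collapse identical data rows and append a _freq_ count column."""
    header, *body = rows
    counts = Counter(tuple(r) for r in body)
    grouped = [list(key) + [str(n)] for key, n in sorted(counts.items())]
    return header + ["_freq_"], grouped

# Toy stand-in for a raw data file (hypothetical columns).
raw = io.StringIO("y,a1,a2\n1,0,1\n0,1,1\n1,0,1\n1,0,1\n0,1,1\n")
header, grouped = group_rows(list(csv.reader(raw)))

# Write the grouped form; a real run would write to a file instead.
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(header)
writer.writerows(grouped)
print(out.getvalue())
```

Here five raw records collapse to two grouped records with counts 2 and 3.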
Another factor that affects the time for a search is the complexity of the Boolean
expressions that define the partition; the extra time for complex partitions arises
mainly in manipulating these expressions. If an iterative search does not converge
and continues to generate more and more complex partitions, performance will slow.
In this case try increasing the complexity-penalising parameter.
A4.2 Program limits
The amount of data that the program can handle is determined by setting dynamic
array dimensions at run time. The control and data files are first scanned to establish
the amount of storage required. There are, however, certain internal arrays which
are not dynamically dimensioned and which are not data dependent. Error messages
or crashes will occur if these are exceeded. The specific items to watch for are
set by Fortran PARAMETER statements at compilation and are:
- The MAXADD parameter. This is the number of new added attributes that can be
created at run time. Its value is MAXADD=100. In iterative mode a new attribute
is added at each iteration, so this number may be reached if many searches are
done. If so, all but the last 10 added attributes will be deleted before processing
continues.
- The MAXP parameter. This is the largest value that p_i or q may take in the
representation of a Boolean expression in disjunctive normal form. Note that
this applies after Boolean manipulation and after substitution and expansion of
added attributes. The parameter may be exceeded in the manipulation of very complex
Boolean expressions. Its value is MAXP=100.
- The NBESTX and NBESTXX parameters. These are effectively the size of the stacks
for optimal and sub-optimal partitions (see 16). The first is the largest complexity
that can be accommodated. Its value is NBESTX=25. The second, NBESTXX, is usually
10+1, giving the "Top 10" partitions.
- The NYMAX parameter. This gives the largest number of Y variables that
can be specified for a multivariate effectiveness measure (see A3.10), i.e. the
upper limit for k in equation (4) (section A3.10). Its value is NYMAX=5.
- The MAXBYG parameter. This gives the maximum number of groups allowed for the
By Group/Cross validation facility. Its value is MAXBYG=20.
If these bounds are exceeded SPAN error messages will be returned. The program may
not proceed or, if it does, may subsequently crash and other SPAN or Fortran run-time
messages will be issued.
Users wishing to increase these limits can request a new version of SPAN as required.
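The MAXADD behaviour described above can be pictured with a short Python sketch. It is an illustration only, not SPAN's Fortran code: the function add_attribute and the constant KEEP are hypothetical, and the exact moment at which SPAN triggers the deletion is an assumption (here, when the store is full and another attribute is about to be added).

```python
MAXADD = 100   # limit on added attributes, as stated above
KEEP = 10      # number of most recent added attributes retained

def add_attribute(added, new_attr):
    """Append new_attr, first pruning to the last KEEP entries if full."""
    if len(added) >= MAXADD:
        del added[:-KEEP]   # delete all but the last KEEP attributes
    added.append(new_attr)
    return added

added = []
for i in range(101):        # one attribute per iteration of a long search
    add_attribute(added, f"add{i}")

print(len(added))           # 11: ten retained plus the newest
print(added[0], added[-1])  # add90 add100
```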
A4.3 Error Messages
Certain traps are in place in SPAN to detect errors and messages are issued. Unforeseen
errors (see also A4.4 Bugs below) may occur. I would be
grateful if users inform me of their occurrence with details of the circumstances
in which they occurred.
The program is compiled to detect array bound overflow. If array bound error messages
occur, please report the details to me.
If an error consistently occurs, try downloading the latest version of SPAN (see
2.2) and re-running.
A4.4 Bugs
SPAN undoubtedly has bugs. Please report problems to me.
Known bugs include "QuickWin Internal error-unexpected error", accompanied by a
"file "E:\forrtl\...." message. These bugs seem to be machine-dependent. They are
not programming errors but stem from the QuickWin shell in which the SPAN Fortran is
embedded. They do not appear to affect running; they are merely irritating interruptions.
I have been unable to nail them down because they happen apparently at random and
do not arise on my two personal machines (running Windows XP and Vista). There is
stuff on the Internet about these problems with QuickWin.
Other bugs may include sudden appearance of a blank "Graphics" window.
A4.5 Limitations
All the design, programming and development of SPAN has been done
by me, Roger Marshall. Lack of time and stamina, but mainly lack of Windows programming
know-how, stops me from making some of the improvements I would like and overcoming
some of SPAN's limitations and idiosyncrasies, which are:
- Inability to read data from other systems.
- A .SPN file must be created to input data and create attributes. There is no "wizard"
control file creation facility.
- Windows are not dynamically linked and some are only active to accept mouse input
when created.
- Windows are not kept; a new tree analysis for example overwrites a previous Tree
window.
- Windows cannot be closed once open.
- Scrolling the output log and data sets using the View facility is
awkward.
- There is no good spreadsheet layout for browsing (or editing) data.
- The Help:Topics facility is fairly crude.
- Cross validation for trees is not allowed.
- There are no toolbars.
- Running SPAN twice. If SPAN is already running, you cannot simultaneously fire it
up again in the same directory; the second one will fail.
- There is only a limited facility to save data and created attributes.
- You cannot "save session" to get back to where you were.