SPAN creates attributes from instructions in the control file. Only limited facilities
exist to create attributes interactively (see
10.2 and 16.10). In the example (8.1) attribute
creating lines are shown as those between the * delimiters. The order of the attribute
creating lines is unimportant and there does not have to be an attribute line for
each listed variable. Conversely, there can be more than one attribute line per variable.
In general an attribute, or a string of attributes, associated with a variable can
be created by a single line of the control file as follows:
name [type] c1 c2 ...
where:
- name is the name of a listed variable, or a special variable
- type is the type of variable (nominal, interval, ordinal, binary)
- c1 c2 ... is a sequence of attribute criteria.
Each is either a cutpoint, or pair of cutpoints, for an interval or ordinal variable,
or a combination of categories for a nominal variable. Henceforth, ci will
be referred to by the generic term cut
of the variable, with the term cutpoint for
the specific reference to the cut associated with an interval variable.
SPAN will form as many attributes for the variable name as there are ci values.
Details of the above structure for interval variables are given below in
9.2 and,
for nominal variables, in 9.3.
It is important to appreciate that the attributes that are specified in
the control file become the default positive attributes (see
12.2). A quick way
to create attributes initially is with the
_all_ special attribute (see 9.9.4).
9.1 Attribute representation
Attributes are represented by a string of up to 28 characters. For example, DIABP<=106.0
represents the attribute: diastolic blood pressure less than or equal to 106. Up
to 8 characters name the variable from which the attribute is derived (see 8.2), the remaining 20 give the value (or set of values) of the variable, or an attached
label, together with a connective (of one or two characters) which may be either:
The meaning of the last two is explained below (see 9.2.1).
9.2 Interval and ordinal variables
If a type designator is omitted the variable is assumed to be interval or ordinal
and the cuts are interpreted accordingly. Specifying the type
designator as either i (interval) or o (ordinal) is equivalent to omitting it. In terms of analysis,
SPAN actually makes no distinction between interval and ordinal specifications.
The cuts c1 c2 ... are each either a single number, optionally prefixed
by <, =, or >, or a pair of numbers specifying a range of values.
In the example (8.1), the interval variable FTV is specified with the line:
FTV >0 >1 >2
which has the three cuts >0, >1 and >2. For each cutpoint a separate attribute
will be formed. In this case the attributes FTV>0, FTV>1, and FTV>2 will be
created. You can omit the > sign so that the line:
FTV 0 1 2
has precisely the same effect.
If you use the < sign instead, as follows
FTV <0 <1 <2
the attributes FTV<=0, FTV<=1, and FTV<=2 will be created. These are, of
course, the reverse of those above. The way the attributes are specified in the
control file determines the default positive attributes (see 12.2). Note that
< is always interpreted as less than or equal (<=).
An equal prefix is also allowed. For example,
FTV =1
creates the attribute FTV=1.
You can mix the prefixes:
FTV <0 =1 >2
creates the positive attributes FTV<=0,
FTV=1, and
FTV>2.
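The prefix rules above can be illustrated with a short Python sketch. It is not part of SPAN; the function name attribute_names is hypothetical, and the sketch assumes that the cut values are echoed as typed (SPAN itself may format numbers differently, for example 106.0) and that only single-number cuts appear on the line.

```python
def attribute_names(line):
    """Return the attribute names a single-cutpoint control line would create.

    Hypothetical helper: handles only single-number cuts with an optional
    <, = or > prefix, and an optional i or o type designator.
    """
    name, *cuts = line.split()
    # An i (interval) or o (ordinal) designator is equivalent to omitting it.
    if cuts and cuts[0] in ("i", "o"):
        cuts = cuts[1:]
    names = []
    for cut in cuts:
        if cut[0] in "<=>":
            prefix, value = cut[0], cut[1:]
        else:
            prefix, value = ">", cut  # an omitted prefix means >
        # < is always interpreted as less than or equal.
        connective = "<=" if prefix == "<" else prefix
        names.append(f"{name}{connective}{value}")
    return names


print(attribute_names("FTV 0 1 2"))
print(attribute_names("FTV <0 =1 >2"))
```

The first call gives the same names as FTV >0 >1 >2, which is the equivalence described above.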
9.2.1 Range attributes
By specifying a single number as a cutpoint, as above, the above/below dichotomy
is used for the attribute. However, it is possible to also specify attributes by
a range of values of the variable, using "range cutpoints". This can be done by
representing a cutpoint by a pair of numbers, separated only by a comma (no spaces).
For example,
AGE 20,25
would create the attribute 20 < AGE <= 25. In SPAN the created attribute would be
represented with the (] connective as AGE(]20,25. The complement has the ](
connective, AGE](20,25.
In the above, the attribute AGE(]20,25 is inclusive. If a smaller number follows
a bigger number, the range is exclusive. For example
AGE 25,20
forms the attribute AGE > 25 or AGE <= 20, that is, outside the interval specified
above. It is represented as AGE(]25,20 and is formally identical to AGE](20,25.
A string of range cutpoints is allowable, possibly mixed with single cutpoints.
For example,
AGE <20 20,25 25,30 >30
would create the mutually exclusive attributes AGE<=20, AGE(]20,25 (20 < AGE <= 25),
AGE(]25,30 (25 < AGE <= 30) and AGE>30.
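As an illustration of how single and range cutpoints partition a variable, the following Python sketch evaluates whether a value possesses the attribute defined by one cut. It is not SPAN code; the function name holds is hypothetical, and raising an error for a range whose two numbers are equal is an assumption, since that case is not described here.

```python
def holds(value, cut):
    """Return True if value possesses the attribute defined by cut.

    Hypothetical helper. cut is either a single cutpoint such as "<20",
    "=1", ">30" or "30", or a range cutpoint such as "20,25".
    """
    if "," in cut:
        a, b = (float(part) for part in cut.split(","))
        if a < b:
            return a < value <= b          # inclusive range (]a,b
        if a > b:
            return value > a or value <= b  # exclusive range: outside (]b,a
        raise ValueError("equal range limits are not covered by this sketch")
    prefix = cut[0] if cut[0] in "<=>" else ">"  # omitted prefix means >
    number = float(cut.lstrip("<=>"))
    if prefix == "<":
        return value <= number  # < is always read as <=
    if prefix == "=":
        return value == number
    return value > number


cuts = ["<20", "20,25", "25,30", ">30"]
for age in (18, 20, 22, 25, 27, 30, 31):
    print(age, [cut for cut in cuts if holds(age, cut)])
```

Each age in the loop satisfies exactly one of the four cuts, which is what "mutually exclusive" means for the line AGE <20 20,25 25,30 >30.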
9.2.2 Percentile and mean cutpoints
Mean cutpoints
You can also specify a cutpoint to be the mean of the variable by writing
mean for
the cut:
AGE mean
which will use the mean of AGE as an above/below cutpoint. If the cuts are
omitted for a variable, the mean is assumed to be the cutpoint, by default.
Percentile cutpoints
It is possible to determine attributes as cutpoints corresponding to percentiles
by specifying the cutpoint as Px, where x is a percentage. In
the example (8.1)
the line
AGE P20 P40 P60 P80
will create attributes corresponding to the 20%, 40%, 60%, and 80% percentiles of
AGE. The computer generated attributes are represented by, for example,
AGE>P20.
As data are finite and may have tied values, it is generally not possible to fix
a data value that has the exact specified percentage. The rule used to establish
percentiles is as follows. Suppose the distinct values of the named variable in
ascending rank order are x(1), x(2), ..., x(m) with corresponding cumulative
frequencies r(1), r(2), ..., r(m), and let n = r(m) be the total number of values.
If the 100a% percentile is requested, the value assigned is the value of x(i)
such that r(i)/n <= a and r(i+1)/n > a. The actual percentile corresponding to
this value is 100r(i)/n and may therefore not be exactly as requested; you can
check the discrepancy by browsing the statistics section of the output, where
Actual and Nominal percentages and corresponding percentiles are given.
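The percentile rule can be sketched in Python as follows. This is an illustration of the rule as stated, not SPAN's own code; the function name span_percentile is hypothetical, and returning None when even the smallest value has a cumulative proportion above the request is an assumption, since that case is not covered by the rule.

```python
from collections import Counter


def span_percentile(values, pct):
    """Apply the stated rule for the pct% percentile (pct = 100a).

    Returns (x(i), actual percentage) for the largest i with r(i)/n <= a,
    or None if no distinct value satisfies the condition (an assumption).
    """
    n = len(values)
    counts = Counter(values)
    cumulative = 0
    chosen = None
    for x in sorted(counts):             # distinct values in ascending order
        cumulative += counts[x]          # r(i), the cumulative frequency
        if cumulative * 100 <= pct * n:  # r(i)/n <= a, in integer arithmetic
            chosen = (x, 100 * cumulative / n)
        else:
            break                        # r(i+1)/n > a, so stop here
    return chosen


data = [1, 2, 2, 3, 4, 5, 5, 5, 6, 7]
print(span_percentile(data, 50))
print(span_percentile(data, 60))
```

With these ten values, a request for the 60% percentile is assigned the same data value as the 50% percentile, because the tied value 5 takes the cumulative proportion from 50% straight to 80%. This is the kind of discrepancy between nominal and actual percentages described above.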
Note that the attribute remains specified in terms of its percentile format. For
example if P50 is specified for AGE, and the median value is actually
53 years,
by the above rule, the attribute is retained as AGE>P50 in the output log.
You can prefix the P with a < or >, as for non-percentile cutpoints. If omitted,
the default > is assumed so that
AGE >P20 >P40 >P60 >P80
is equivalent to the line above.
Range percentiles are allowable. For example,
AGE P5,P95
would create an attribute "between the 5th and 95th percentile". Note that you cannot
mix percentile cutpoints with numeric ones on the same control file line. For example,
AGE i 19 P60 is not allowed.
9.2.3 Multiple cuts
Instead of writing out a string of cuts such as
AGE 20 25 30 35 40 45 50
you can use the shorthand
AGE (20-50)5
where 20-50 is the range of values and 5 is the increment. In general the syntax is
name (c1-ck)z
which will generate the cuts c1, c1+z, c1+2z, ... up to the largest such value
not exceeding ck.
There is a maximum of 50 allowed cuts.
The syntax can be used for percentile cuts. For example,
AGE P(10-90)10
is equivalent to
AGE P10 P20 P30 P40 P50 P60 P70 P80 P90
A directional indicator is also allowed which assigns the same direction
to all cutpoints:
AGE <p (10-90) 10
This is equivalent to
AGE <P10 <P20 <P30 <P40 <P50 <P60 <P70 <P80 <P90
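The shorthand can be expanded mechanically, as in the Python sketch below. It is illustrative only; the function name expand_cuts is hypothetical, and the sketch assumes whole-number limits and increments, ignores spaces inside the shorthand, and treats an upper or lower case P alike.

```python
import re

MAX_CUTS = 50  # the stated maximum number of allowed cuts


def expand_cuts(spec):
    """Expand a shorthand such as "(20-50)5" or "<P(10-90)10" into cuts.

    Hypothetical helper; integer values only (an assumption).
    """
    compact = spec.replace(" ", "")
    match = re.fullmatch(r"([<=>]?)([Pp]?)\((\d+)-(\d+)\)(\d+)", compact)
    if match is None:
        raise ValueError(f"not a multiple-cut shorthand: {spec!r}")
    prefix, percentile, start, stop, step = match.groups()
    start, stop, step = int(start), int(stop), int(step)
    if step == 0:
        raise ValueError("the increment must be positive")
    values = range(start, stop + 1, step)  # c1, c1+z, ... not exceeding ck
    if len(values) > MAX_CUTS:
        raise ValueError(f"more than {MAX_CUTS} cuts requested")
    return [f"{prefix}{percentile.upper()}{v}" for v in values]


print("AGE", *expand_cuts("(20-50)5"))
print("AGE", *expand_cuts("<p (10-90) 10"))
```

The two printed lines correspond to the written-out forms of the AGE examples in this section.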
9.3 Nominal or binary variables
To create attributes for a nominal variable, a type designator must be included.
It can be the single character b (for binary), or a string of characters beginning
with n, for nominal, followed by a string of numerals that represent all the possible
categories (the universe) of the nominal variable, as described below (9.3.1).
The
universe is required for Boolean manipulations of attributes. The cuts c1 , c2 etc
are each a string of category combinations of the variable, as described below (9.3.1).
9.3.1 Category representation
If a variable is nominal SPAN expects to know the possible values it takes - its
universe - in order that Boolean expressions can be properly evaluated. These are attached
to the nominal type designator
n. In the example (8.1) n123 for the variable RACE means RACE
can take the 3 possible values 1, 2 and 3. Categories are assumed to be represented
by a single digit, which effectively limits the number of categories to 10, i.e. 0 to 9.
The ci value is a cut of the variable representing a combination of categories.
It is a concatenated string of individual categories. For example, setting c1 equal
to 124 would form an attribute indicating: either category 1, category 2 or category
4. The category values are also concatenated in the computer representation of the
attribute.
In the example (8.1)
the first attribute creating line for RACE
RACE n123 1 12
creates the two attributes
RACE=1 and RACE=12, the latter denoting either of the categories
1 and 2. As the possible categories of RACE are 1, 2 and 3 (signified by the universe
designator n123), the complement attributes are determined to be RACE=23 and RACE=3
respectively.
If you omit the universe categories of a variable, for example by just putting
n rather than n123, SPAN assumes the universe to be the set of allowed digits for
nominal variables, that is, 0, 1, ..., 9. So specifying just n is interpreted as
n0123456789.
If a b type designator is specified, the universe is assumed to be 0 and 1. That
is, a binary b type designator is equivalent to n01. The
b designator should not
be used for two-category variables coded other than 0 and 1; use the n descriptor
in this case.
You can use the universe descriptor to automatically exclude observations. For example,
if RACE takes values 1,2 and 3 in the data, and you use
n12, all values of RACE
equal to 3 are ignored - they are missing attributes (see 17.7).
Character data can be accommodated in the same way. For example for a gender variable
with possible values F and M a male attribute could be created:
SEX nMF M=male
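The role of the universe in forming complements can be illustrated with a small Python sketch. It is not SPAN code; the function names universe_of and complement are hypothetical, labels such as =male are assumed to have been stripped from the cut beforehand, and rejecting a cut that uses a category outside the universe is an assumption.

```python
DEFAULT_UNIVERSE = "0123456789"  # a bare n is interpreted as n0123456789


def universe_of(type_designator):
    """Return the category universe implied by a b or n... type designator."""
    if type_designator == "b":
        return "01"  # b is equivalent to n01
    if type_designator.startswith("n"):
        return type_designator[1:] or DEFAULT_UNIVERSE
    raise ValueError(f"not a nominal or binary designator: {type_designator!r}")


def complement(cut, type_designator):
    """Return the categories of the universe that are not in cut."""
    universe = universe_of(type_designator)
    unknown = set(cut) - set(universe)
    if unknown:
        raise ValueError(f"categories outside the universe: {sorted(unknown)}")
    return "".join(c for c in universe if c not in cut)


print("RACE=" + complement("1", "n123"))   # complement of RACE=1
print("RACE=" + complement("12", "n123"))  # complement of RACE=12
print("SEX=" + complement("M", "nMF"))     # complement of the male attribute
```

Because each category is a single character, set operations on the characters of the universe string are enough to form any complement.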
9.4 Labels
Labels can be assigned to cuts by adding an equal symbol and a short character string
label. For example, the line
AGE >60=old
attaches the label old to over 60s. On output the attribute would be read as AGE_old
rather than AGE>60. The complement
would read as AGE^old. There is no facility
to attach any other label to the complement.
However, if a label is either yes or no, the implicit opposite label (no or yes,
respectively) is attached to the complement. So, in the example (8.1)
an attribute SMOKE=yes is created and the complement is the implicit SMOKE=no.
Other built in opposites are low/high and male/female.
The label must be a single word with no spaces. If it is blank, that is, nothing
follows the equal sign, the label is ignored. This is sometimes useful if
the name given to the variable is actually the name you want for the attribute
it represents. For example, if you have a binary variable
Smokers then putting
Smokers b 1=
creates an attribute called Smokers.
Note that an attribute specification will usually have a different interpretation
depending on the type designation given to the variable. For example suppose a variable
X takes values just 0 and 1 in the data. Then
X 1=yes
and
X b 1=yes
do not have the same effect. In the former X is assumed to be interval and the
attribute is defined as X>1, for which there are no observations.
9.5 Direction of association
SPAN assumes all attributes defined in the control file are positive with respect
to whatever outcome (Y) variable is to be specified. The Rank:Positive
(see 12.2)
item can be used to swap attributes from positive to negative and vice-versa.
When doing a search the Strategy:Decide positivity box can be used to
allow SPAN to decide the direction of association.
9.6 Multiple attributes
SPAN allows more than one attribute creating line per variable. In the example
(8.1), two attribute creating lines are specified for the variable RACE,
generating the attributes RACE=1 and RACE=12 from the first line and RACE=13, a
combination of categories 1 and 3, from the second. These three attributes could
also have been created by the single line
RACE n 1 12 13
but, by creating the attribute RACE=13
on a separate line, SPAN recognises the
attribute as distinct in the sense that
RACE=13 and one of
RACE=1 and RACE=12 can
be distinct elements when conducting a search.
9.7 Using SPAN to edit or create a control file
Using Edit <file>.spn or Edit file from the Edit menu invokes the Microsoft text
editor Notepad, and any file, including a .SPN file, can be edited or created with
it. SPAN does not have its own internal editing facility, nor is there a "Wizard"
control file maker (at present).
9.8 Missing value attributes
It is sometimes useful to investigate whether "missingness" of a variable is related
to an outcome. This can be done by including "missing" as a value in creating an
attribute.
9.8.1 Nominal variables
If a variable is nominal, then including the missing value indicator "-" in a category
string uses the missing value as an attribute value. For example, suppose a variable
x can take the values 0 and 1, and any negative value denotes a missing value. Then using
the control file line
x n01- -
would create a missing value attribute x = -,
with complement, non-missing, x = 01.
Or, you could ensure that 0 and missing are combined by using
x n01- 0-
creating the attribute x = 0-
and the complement x = 1. In both these cases there
are no missing attribute values, even though the variable values are missing, as
missing is a characteristic of the attribute itself.
In contrast, if you do not specify a missing value either in the type designator
or category combination, missing attributes will be created if a variable is missing.
For example, with
x n01 1
a missing value of x will assign a missing value to the attribute. But, if you use,
instead,
x n01- 1
all missing values of x are allocated to the complement x = 0- as - is included in
the universe.
You can attach a label to missing nominal values, by using, for example,
x n01- -=missed
9.8.2 Interval variables
You can include the missing value as one of the cutpoints of the interval variable.
But in order to determine that the missing value is to be utilised as an actual
value you must include an interval type designator and postfix it with a
-, that is, use i- (which can be thought of as a means to include missing in the universe
of
x, which is otherwise implicitly a positive number). For example,
x i- -1 2
would create attribute x <= -1
(indicating missing), with complement x >
-1,
that is any non-missing, as well as the attribute
x <= 2 with complement x >2.
Here x <= 2 would include the missing values. Without the - postfix to i
the missing values would be ignored so that x
<= -1 would have no observations and missing values would not be included
in x <= 2.
9.9 Special attributes
9.9.1 The null attribute
SPAN creates, automatically, a null attribute. This is an attribute which is not
possessed by any of the observations. It is represented by .=null, with complement
.^=null which is possessed by all the observations. The purpose of this attribute
is to conveniently represent null Boolean expressions (see A5.5).
9.9.2 Index attribute
You can optionally create attributes determined in terms of the record number of
the data. For example, if you want one attribute to indicate the first 100 records
of data and another to index records 50 to 150, you can do this with the line
_i_ <100 49,150
A special variable _i_ is created, running from 1 to n and indexing the record
number, together with the attributes _i_<=100 and _i_(]49,150.
9.9.3 Random attribute
You can optionally create attributes determined in terms of a random permutation
of the record numbers of the data by using the special variable
_r_.
For example,
_r_ 100
would create a special variable _r_
which runs from 1 to n and is a random permutation {r1, r2,...,rn} of record number indices {1,2, ... ,n}. The above line would
effectively create an attribute which represents a random sub-sample of 100 records.
The permutation is non-repeatable, that is, the seed for the random number generator
is set by the computer clock.
This facility is quite powerful as it can be used to randomly divide data, as an
alternative to Edit:Split sample (see 7.3). For example, in a data set of 300
observations you could create three random subsamples each of size 100 with
_r_ <100 100,200 200
Or you could create multiple random overlapping sub-samples:
_r_ 10,90 20,100 30,110
The first would be a split of the data into those with indices r11,...,r90, the
second r21,...,r100 and the third r31,...,r110.
Alternatively with
_r_ 20,0 40,20 60,40 80,60 100,80
you would also form five overlapping sub-samples. The first is all observations
that exclude indices r1,...,r20, the next all observations excluding r21,...,r40
and so on. This construction is useful for cross-validation and can be simplified
further (see 9.9.5).
If there were 100 observations, the five sub-samples become the five training samples
in 5-fold cross-validation. See FAQ (How can cross-validation be done?).
The shorthand cvx operator can be used to generate cross-validation groups. For example,
use:
_r_ cv5
to create subgroups for 5-fold cross-validation with 5 randomly chosen subgroups
according to the quintiles of _r_, i.e. the line is equivalent to
_r_ p20,p0 p40,p20 p60,p40 p80,p60 p100,p80
The resulting cross-validated estimate of misclassification is based on the misclassification
rates of V (possibly different) partitions, where V is the number of cross-validation
groups. However, the estimate is, it is assumed, a valid misclassification rate of
the partition you are interested in, that is, that constructed from the whole sample.
This partition is easily obtained once the cross-validation run is done, by
deselecting Y:By subgroup/Cross-validate.
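The way the exclusive percentile ranges of _r_ yield cross-validation training samples can be sketched in Python. This is an illustration, not SPAN's implementation. The function name cv_training_samples and the seed argument are hypothetical (SPAN itself seeds from the computer clock), and the sketch assumes that the 100j/k percentile of a permutation of 1 to n is the whole-number part of nj/k, with p0 taken as 0.

```python
import random


def cv_training_samples(n, k, seed=None):
    """Return k training samples (lists of record numbers) for k-fold CV.

    Hypothetical helper mimicking _r_ cvk: record i receives a random rank
    r(i); training sample j keeps the records whose rank lies OUTSIDE the
    range (lo, hi] given by consecutive percentiles of the ranks.
    """
    rng = random.Random(seed)  # seed is only for repeatable illustration
    ranks = list(range(1, n + 1))
    rng.shuffle(ranks)         # ranks[i] is the random rank of record i + 1
    bounds = [n * j // k for j in range(k + 1)]  # assumed percentile values
    samples = []
    for j in range(1, k + 1):
        lo, hi = bounds[j - 1], bounds[j]
        # Exclusive range "hi,lo": rank <= lo or rank > hi.
        samples.append(
            [i + 1 for i, rank in enumerate(ranks) if rank <= lo or rank > hi]
        )
    return samples


folds = cv_training_samples(100, 5, seed=1)
print([len(sample) for sample in folds])
```

With 100 observations and 5 groups, each training sample holds 80 records and every record is left out of exactly one of them, so the left-out records form the 5 test groups.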
9.9.4 _all_ attribute
You can specify the same attribute construction for all listed variables that are
input using the _all_ attribute. For example, if all the variables are binary you
can do:
_all_ b 1=yes
which will create an attribute with the label "yes" for all variables. If you have mixed
type variables you can create attributes according to above/below mean value of
each variable with
_all_ mean
The _all_ designator is useful to quickly create attributes.
9.9.5 The cross-validation specification
As noted the line
_r_ p20,p0 p40,p20 p60,p40 p80,p60 p100,p80
will create 5 overlapping attributes which can be used for cross-validation. This
rather awkward construction can be replaced with:
_r_ cv5
allowing 5-group cross-validation based on percentiles of _r_. Or if you wish to
create attributes for 10-group cross-validation you can do
_r_ cv10
The construction allows the possibilities cv2, cv3, ..., cv10 only. Division into
subgroups can be done on the basis of any variable, e.g.
_i_ cv5
is allowable, so is
age cv5
9.10 Continuation indicator
You can continue an attribute creating line on to the next line with a
! indicator.
For example:
AGE 20 30 40 50 !
60 70 80 90
is the same as
AGE 20 30 40 50 60 70 80 90
You can only continue on up to two lines.
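If control file lines are processed outside SPAN, continuation lines can be joined first, as in this Python sketch. The function name join_continuations is hypothetical, and the sketch does not enforce the two-line limit mentioned above.

```python
def join_continuations(lines):
    """Join each line ending in the ! indicator with the line that follows.

    Hypothetical helper; it does not check SPAN's limit on continuations.
    """
    joined = []
    pending = ""
    for line in lines:
        text = line.rstrip()
        if text.endswith("!"):
            # Drop the indicator and keep a single separating space.
            pending += text[:-1].rstrip() + " "
        else:
            joined.append(pending + text.lstrip() if pending else text)
            pending = ""
    if pending:  # a trailing ! with nothing after it
        joined.append(pending.rstrip())
    return joined


print(join_continuations(["AGE 20 30 40 50 !", "60 70 80 90"]))
```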