School of Population Health

9. Control file: creating attributes

SPAN creates attributes from instructions in the control file. Only limited facilities exist to create attributes interactively (see 10.2 and 16.10). In the example (8.1) attribute creating lines are shown as those between the * delimiters. The order of the attribute creating lines is unimportant and there does not have to be an attribute line for each listed variable. Conversely, there can be more than one attribute line per variable.


In general an attribute, or a string of attributes, associated with a variable can be created by a single line of the control file as follows:

name [type] c1 c2...

where:

  • name is the name of a listed variable, or a special variable
  • type is the type of variable (nominal, interval, ordinal, binary)
  • c1 c2... is a sequence of attribute criteria.

Each criterion ci is either a cutpoint, or pair of cutpoints, for an interval or ordinal variable, or a combination of categories for a nominal variable. Henceforth, ci will be referred to by the generic term cut of the variable; the term cutpoint is reserved for a cut of an interval variable.

SPAN will form as many attributes for the variable name as there are ci values. Details of the above structure for interval variables are given below in 9.2 and, for nominal variables, in 9.3.

It is important to appreciate that the attributes specified in the control file become the default positive attributes (see 12.2). A quick way to create attributes initially is with the _all_ special attribute (see 9.9.4).

9.1 Attribute representation

Attributes are represented by a string of up to 28 characters. For example, DIABP<=106.0 represents the attribute: diastolic blood pressure less than or equal to 106. Up to 8 characters name the variable from which the attribute is derived (see 8.2); the remaining 20 give the value (or set of values) of the variable, or an attached label, together with a connective (of one or two characters), which may be any of:


<= > ^= = (] ](

    

The meaning of the last two is explained below (see 9.2.1).

9.2 Interval and ordinal variables

If a type designator is omitted the variable is assumed to be interval or ordinal and the cuts are interpreted accordingly. Specifying the type designator as either i (interval) or o (ordinal) is equivalent to omitting it. In terms of analysis, SPAN actually makes no distinction between interval and ordinal specifications.

The cuts c1 c2... are each either a single number, optionally prefixed by a <, =, or >, or a pair of numbers specifying a range of values.

In the example (8.1), the interval variable FTV is specified with the line:

FTV >0 >1 >2

which has the three cuts, >0, >1 and >2. For each cutpoint a separate attribute will be formed. In this case the attributes FTV>0, FTV>1, and FTV>2 will be created. You can omit the > sign so that the line:

FTV 0 1 2

has precisely the same effect.

If you use the < sign instead, as follows

FTV <0 <1 <2

the attributes FTV<=0, FTV<=1, and FTV<=2 will be created. These are, of course, the reverse of those above. The attributes as specified in the control file become the default positive attributes (see 12.2). Note that < is always interpreted as less than or equal (<=).


An equal prefix is also allowed. For example,

FTV =1

creates the attribute FTV=1.

You can mix the prefixes:

FTV <0 =1 >2

creates the positive attributes FTV<=0, FTV=1, and FTV>2.
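The prefix rules above can be sketched in a few lines of Python. This is only an illustration of the naming convention described in this section, not SPAN's own code; the function name interval_attribute_names is hypothetical.

```python
def interval_attribute_names(name, cuts):
    """Return the attribute names formed from single-number cuts.

    Hypothetical helper: mirrors the prefix rules in the text only.
    """
    names = []
    for cut in cuts:
        if cut.startswith("<"):
            # '<' is always interpreted as less than or equal
            names.append(f"{name}<={cut[1:]}")
        elif cut.startswith("="):
            names.append(f"{name}={cut[1:]}")
        elif cut.startswith(">"):
            names.append(f"{name}>{cut[1:]}")
        else:
            # no prefix: '>' is assumed
            names.append(f"{name}>{cut}")
    return names


print(interval_attribute_names("FTV", ["<0", "=1", ">2"]))
# ['FTV<=0', 'FTV=1', 'FTV>2']
```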

9.2.1 Range attributes

By specifying a single number as a cutpoint, as above, the above/below dichotomy is used for the attribute. However, it is possible to also specify attributes by a range of values of the variable, using "range cutpoints". This can be done by representing a cutpoint by a pair of numbers, separated only by a comma (no spaces). For example,

AGE 20,25

would create the attribute 20 < AGE ≤ 25. In SPAN the created attribute would be represented with the (] connective as AGE(]20,25. The complement has the ]( connective, AGE](20,25.

In the above, the attribute AGE(]20,25 is inclusive. If a smaller number follows a bigger number, the range is exclusive. For example

AGE 25,20

forms the attribute written as 25 < AGE ≤ 20, that is, AGE outside the interval specified above (AGE ≤ 20 or AGE > 25). It is represented as AGE(]25,20 and is formally identical to AGE](20,25. A string of range cutpoints is allowable, possibly mixed with single cutpoints, for example,

AGE <20 20,25 25,30 >30

would create the mutually exclusive attributes AGE<=20, AGE(]20,25 (20 < AGE ≤ 25), AGE(]25,30 (25 < AGE ≤ 30) and AGE>30.
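As an illustration of how a range cut is evaluated, the following Python sketch tests whether a value possesses the attribute formed from a pair of cutpoints. The function name in_range_cut is hypothetical, and the case of two equal cutpoints is not described in this section, so the sketch simply rejects it.

```python
def in_range_cut(x, first, second):
    """Evaluate the range cut 'first,second' for the value x.

    Hypothetical helper illustrating the (] and ]( connectives.
    """
    if first == second:
        raise ValueError("equal cutpoints are not covered by this sketch")
    if first < second:
        # inclusive range: first < x <= second, e.g. AGE(]20,25
        return first < x <= second
    # reversed pair: everything outside (second, first], e.g. AGE(]25,20
    return not (second < x <= first)


print(in_range_cut(25, 20, 25))  # True: 20 < 25 <= 25
print(in_range_cut(20, 20, 25))  # False: 20 is not above the lower cutpoint
print(in_range_cut(20, 25, 20))  # True: 20 lies outside (20, 25]
```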

9.2.2 Percentile and mean cutpoints

Mean cutpoints

You can also specify a cutpoint to be the mean of the variable by writing mean for the cut:

AGE mean

which uses the mean of AGE as an above/below cutpoint. If a cut is omitted from a variable, the mean is assumed to be the cutpoint, by default.

Percentile cutpoints

It is possible to determine attributes as cutpoints corresponding to percentiles by specifying the cutpoint as Px, where x is a percentage. In the example (8.1) the line

AGE P20 P40 P60 P80

will create attributes corresponding to the 20%, 40%, 60%, and 80% percentiles of AGE. The computer generated attributes are represented by, for example, AGE>P20. As data are finite and may have tied values, it is generally not possible to find a data value that has exactly the specified percentage. The rule used to establish percentiles is as follows. Suppose the distinct values of the named variable in ascending rank order are x(1), x(2), ..., x(m) with corresponding cumulative frequencies r(1), r(2), ..., r(m), where n = r(m) is the number of observations. If the 100a% percentile is requested, the value assigned is the value x(i) such that r(i)/n ≤ a and r(i+1)/n > a. The actual percentile corresponding to this value is 100r(i)/n and may therefore not be exactly as requested; you can check the discrepancy by browsing the statistics section of the output, where Actual and Nominal percentages and corresponding percentiles are given.

Note that the attribute remains specified in terms of its percentile format. For example if P50 is specified for AGE, and the median value is actually 53 years, by the above rule, the attribute is retained as AGE>P50 in the output log.
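The percentile rule can be sketched in Python as follows. This is an illustration of the rule as stated above, not SPAN's implementation; the function name span_percentile and the sample data are hypothetical, and returning None when even the smallest value exceeds the requested percentage is an assumption, since that case is not described here.

```python
from collections import Counter


def span_percentile(values, pct):
    """Return (value, actual percentage) for the requested percentile.

    Picks the largest distinct value x(i) whose cumulative frequency
    r(i) satisfies r(i)/n <= pct/100. Returns None if no value does
    (an assumption; the text does not cover this case).
    """
    n = len(values)
    counts = Counter(values)
    cumulative = 0
    chosen = None
    for x in sorted(counts):
        cumulative += counts[x]
        # compare r(i)/n <= pct/100 without dividing, to avoid rounding
        if cumulative * 100 <= pct * n:
            chosen = (x, 100 * cumulative / n)
        else:
            break
    return chosen


ages = [20, 20, 30, 40, 50, 50, 50, 60, 70, 80]  # made-up data with ties
print(span_percentile(ages, 50))  # (40, 40.0): nominal 50%, actual 40%
print(span_percentile(ages, 20))  # (20, 20.0): nominal and actual agree
```

In the first call the tie at 50 pushes the cumulative percentage from 40% straight to 70%, so the value 40 is assigned and the actual percentile differs from the nominal one.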

You can prefix the P with a < or >, as for non-percentile cutpoints. If omitted, the default > is assumed so that

AGE >P20 >P40 >P60 >P80

is equivalent to the line above.

Range percentiles are allowable. For example,

AGE P5,P95

would create an attribute "between the 5th and 95th percentile". Note that you cannot mix percentile cutpoints with numeric ones on the same control file line. For example, AGE i 19 P60 is not allowed.

9.2.3 Multiple cuts

Instead of writing out a string of cuts such as

AGE 20 25 30 35 40 45 50

you can use the shorthand

AGE (20-50)5

where 20-50 is the range of values and 5 is the increment. In general the syntax is

name (c1-ck)z

which will generate cuts c1, c1+z, c1+2z, ..., c for c ≤ ck, that is, up to the largest such value not exceeding ck. A maximum of 50 cuts is allowed.

The syntax can be used for percentile cuts. For example,

AGE P(10-90)10

is equivalent to

AGE P10 P20 P30 P40 P50 P60 P70 P80 P90

A directional indicator is also allowed which assigns the same direction to all cutpoints:

AGE <P(10-90)10

This is equivalent to

AGE <P10 <P20 <P30 <P40 <P50 <P60 <P70 <P80 <P90
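The shorthand expansion can be illustrated with a short Python sketch. It assumes integer values, a token written without internal spaces, and that the limit of 50 applies to the cuts generated by one shorthand token; none of these is confirmed by this section, and the function name expand_cuts is hypothetical.

```python
import re


def expand_cuts(token):
    """Expand a shorthand such as '(20-50)5' or '<P(10-90)10' into cuts.

    Hypothetical helper: integer values only, no spaces inside the token.
    """
    match = re.fullmatch(r"([<>]?)([Pp]?)\((\d+)-(\d+)\)(\d+)", token)
    if match is None:
        raise ValueError(f"not a shorthand cut specification: {token!r}")
    direction, pflag, start, stop, step = match.groups()
    start, stop, step = int(start), int(stop), int(step)
    if step <= 0:
        raise ValueError("the increment must be positive")
    # generate c1, c1+z, c1+2z, ... while the value does not exceed ck
    cuts = [f"{direction}{pflag.upper()}{c}" for c in range(start, stop + 1, step)]
    if len(cuts) > 50:
        raise ValueError("more than 50 cuts requested")
    return cuts


print(expand_cuts("(20-50)5"))
# ['20', '25', '30', '35', '40', '45', '50']
print(expand_cuts("<P(10-90)10"))
# ['<P10', '<P20', '<P30', '<P40', '<P50', '<P60', '<P70', '<P80', '<P90']
```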

9.3 Nominal or binary variables

To create attributes for a nominal variable, a type designator must be included. It can be the single character b (for binary), or a string of characters beginning with n, for nominal, followed by a string of numerals that represent all the possible categories (the universe) of the nominal variable, as described below (9.3.1). The universe is required for Boolean manipulations of attributes. The cuts c1, c2, etc. are each a string of category combinations of the variable, as described below (9.3.1).

9.3.1 Category representation

If a variable is nominal, SPAN expects to know the possible values it takes, its universe, in order that Boolean expressions can be properly evaluated. These are attached to the nominal type designator n. In the example (8.1) n123 for the variable RACE means RACE can take 3 possible values: 1, 2 and 3. Categories are assumed to be represented by a single digit, which effectively limits the number of categories to 10, i.e. 0 to 9.


The ci value is a cut of the variable representing a combination of categories. It is a concatenated string of individual categories. For example, setting c1 equal to 124 would form an attribute indicating: either category 1, category 2 or category 4. The category values are also concatenated in the computer representation of the attribute.

In the example (8.1) the first attribute creating line for RACE

RACE n123 1 12

creates the two attributes RACE=1 and RACE=12, the latter denoting either of the categories 1 and 2. As the possible categories of RACE are 1 2 3 (signified by the universe designator n123), the complement attributes are determined to be RACE=23 and RACE=3 respectively.

If you omit the universe categories of a variable, for example by just putting n rather than n123, SPAN assumes the universe to be the set of allowed digits for nominal variables, that is, 0, 1, ..., 9. So specifying just n is interpreted as n0123456789.

If a b type designator is specified, the universe is assumed to be 0 and 1. That is, a binary b type designator is equivalent to n01. The b designator should not be used for two-category variables coded other than 0 and 1; use the n descriptor in this case.

You can use the universe descriptor to automatically exclude observations. For example, if RACE takes values 1,2 and 3 in the data, and you use n12, all values of RACE equal to 3 are ignored - they are missing attributes (see 17.7).
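The way the universe determines the complement of a nominal attribute can be sketched in Python. The helper names universe_of and complement are hypothetical, and the sketch covers only the designators described in this section (b, n, and n followed by the category characters).

```python
ALL_DIGITS = "0123456789"


def universe_of(designator):
    """Return the universe implied by a nominal or binary type designator."""
    if designator == "b":
        return "01"  # b is equivalent to n01
    if designator.startswith("n"):
        # a bare n means every allowed digit, 0 to 9
        return designator[1:] or ALL_DIGITS
    raise ValueError(f"not a nominal or binary designator: {designator!r}")


def complement(designator, cut):
    """Return the categories of the universe that are not in the cut."""
    return "".join(c for c in universe_of(designator) if c not in cut)


print(complement("n123", "1"))   # 23
print(complement("n123", "12"))  # 3
print(complement("b", "1"))      # 0
```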

Character data can be accommodated in the same way. For example, for a gender variable with possible values F and M, a male attribute could be created:

SEX nMF M=male

9.4 Labels

Labels can be assigned to cuts by adding an equal symbol and a short character string label. For example, the line

AGE >60=old

attaches the label old to over 60s. On output the attribute would be read as AGE_old rather than AGE>60. The complement would read as AGE^old. There is no facility to attach any other label to the complement.

However, if a label is either no or yes, implicit opposite yes and no labels, respectively, are attached to the complement. So, in the example (8.1) an attribute SMOKE=yes is created and the complement is the implicit SMOKE=no. Other built in opposites are low/high and male/female.

The label must be a single word with no spaces.  If it is blank, that is, nothing follows the equal sign, the label is ignored. This is sometimes useful if  the name given to the variable is actually the name you want for  the attribute it represents. For example, if you have a binary variable  Smokers then putting

                 Smokers b 1= 

creates an attribute called Smokers.

Note that an attribute specification will usually have a different interpretation depending on the type designation given to the variable. For example suppose a variable X takes values just 0 and 1 in the data. Then

X  1=yes      and      X  b 1=yes

do not have the same effect. In the former X is assumed to be interval and the attribute is defined as X>1 for which there are no observations.

9.5 Direction of association

SPAN assumes all attributes defined in the control file are positive with respect to whatever outcome (Y) variable is to be specified. The Rank:Positive (see 12.2) item can be used to swap attributes from positive to negative and vice-versa.

When doing a search the Strategy:Decide positivity box can be used to  allow SPAN to decide the direction of association.

9.6 Multiple attributes

SPAN allows more than one attribute creating line per variable. In the example (8.1), two attribute creating lines are specified for the variable RACE, generating the attributes RACE=1 and RACE=12 from the first line and RACE=13, a combination of categories 1 and 3, from the second. These three attributes could also have been created by the single line


RACE n 1 12 13

but, by creating the attribute RACE=13 on a separate line, SPAN recognises the attribute as distinct in the sense that RACE=13 and one of RACE=1 and RACE=12 can be distinct elements when conducting a search.

9.7 Using SPAN to edit or create a control file

Using Edit <file>.spn or Edit file from the Edit menu invokes the Microsoft text editor Notepad and any file, including a .SPN file can be edited or created with it. SPAN does not have its own internal editing facility and neither is there a "Wizard" control file maker (at present).

9.8 Missing value attributes

It is sometimes useful to investigate whether "missingness" of a variable is related to an outcome. This can be done by including "missing" as a value in creating an attribute.

9.8.1 Nominal variables

If a variable is nominal, then including the missing value indicator "-" in a category string uses the missing value as an attribute value. For example, suppose a variable x can take the values 0 and 1, and any negative value denotes a missing value. Then using the control file line

x n01- -

would create a missing value attribute x = -, with complement, non-missing, x = 01. Or, you could ensure that 0 and missing are combined by using

x n01- 0-

creating the attribute x = 0- and the complement x = 1. In both these cases there are no missing attribute values, even though the variable values are missing, as missing is a characteristic of the attribute itself.

In contrast, if you do not specify a missing value either in the type designator or category combination, missing attributes will be created if a variable is missing. For example, with

x n01 1

a missing value of x will assign a missing value to the attribute. But, if you use, instead,

x n01- 1

all missing values of x are allocated to the complement x = 0- as - is included in the universe.

You can attach a label to missing nominal values, by using, for example,

x n01- -=missed
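The treatment of missing values for nominal variables can be summarised by a three-valued evaluation: the attribute is possessed, not possessed, or itself missing. The Python sketch below illustrates this reading of the section; the function name nominal_attribute and the use of None for a missing data value and for a missing attribute are assumptions made for the example.

```python
def nominal_attribute(value, universe, cut):
    """Evaluate a nominal attribute for one observation.

    value    -- a single-character category, or None if missing in the data
    universe -- the characters after n in the type designator, e.g. '01-'
    cut      -- the category combination, e.g. '0-'
    Returns True or False, or None when the attribute itself is missing.
    """
    symbol = "-" if value is None else value
    if symbol not in universe:
        # outside the universe: the observation has a missing attribute
        return None
    return symbol in cut


print(nominal_attribute(None, "01", "1"))    # None: attribute is missing
print(nominal_attribute(None, "01-", "1"))   # False: falls in complement 0-
print(nominal_attribute(None, "01-", "-"))   # True: missingness attribute
```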

9.8.2 Interval variables

You can include the missing value as one of the cutpoints of the interval variable. But in order to determine that the missing value is to be utilised as an actual value you must include an interval type designator and postfix it with a -, that is, use i- (which can be thought of as a means to include missing in the universe of x, which is otherwise implicitly a positive number). For example,

x i- -1 2

would create attribute x <= -1 (indicating missing), with complement x > -1, that is any non-missing, as well as the attribute x <= 2 with complement x >2. Here x <= 2 would include the missing values. Without the - postfix to i the missing values would be ignored so that x <= -1 would have no observations and missing values would not be included in x <= 2.

9.9 Special attributes

9.9.1 The null attribute

SPAN creates, automatically, a null attribute. This is an attribute which is not possessed by any of the observations. It is represented by .=null, with complement .^=null which is possessed by all the observations. The purpose of this attribute is to conveniently represent null Boolean expressions (see A5.5).

9.9.2 Index attribute

You can optionally create attributes determined in terms of the record number of the data. For example, suppose you want one attribute to indicate the first 100 records of data and another to index records 50 to 150. You can do this with the line

_i_ <100 49,150

A special variable _i_ is created, running from 1 to n and indexing the record number, and the attributes _i_<=100 and _i_(]49,150 are formed.

9.9.3 Random attribute

You can optionally create attributes determined in terms of a random permutation of the record numbers of the data by using the special variable _r_.

For example,

_r_ 100

would create a special variable _r_ which runs from 1 to n and is a random permutation {r1, r2,...,rn} of record number indices {1,2, ... ,n}. The above line would effectively create an attribute which represents a random sub-sample of 100 records. The permutation is non-repeatable, that is, the seed for the random number generator is set by the computer clock.

This facility is quite powerful as it can be used to randomly divide data, as an alternative to Edit:Split sample (see 7.3). For example, in a data set of 300 observations you could create three random subsamples each of size 100 with


_r_ <100 100,200 200

Or you could create multiple random overlapping sub-samples:

_r_ 10,90 20,100 30,110

The first would be a split of the data into those with indices r11,..., r90, the second r21,..., r100 and the third r31,...,r110.

Alternatively with

_r_ 20,0 40,20 60,40 80,60 100,80

you would also form five overlapping sub-samples. The first is all observations that exclude indices r1,...,r20, the next all observations excluding r21,..., r40 and so on. This construction is useful for cross-validation and can be simplified further (see 9.9.5).

If there were 100 observations, the five sub-samples become the five training samples in 5-fold cross-validation. See FAQ (How can cross-validation be done?).
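The construction of training samples from a random permutation can be sketched in Python. This is an illustration only: the function name training_samples is hypothetical, the number of records is assumed to be divisible by the number of groups, and the optional seed is added for reproducibility in the example, whereas SPAN itself seeds from the computer clock.

```python
import random


def training_samples(n, folds=5, seed=None):
    """Return one list of record numbers per fold, each excluding a block of _r_.

    Hypothetical helper; assumes n is divisible by folds.
    """
    r = list(range(1, n + 1))
    random.Random(seed).shuffle(r)  # r[k] plays the role of _r_ for record k+1
    size = n // folds
    samples = []
    for f in range(folds):
        lower, upper = f * size, (f + 1) * size
        # reversed range cut 'upper,lower': keep records with _r_ outside (lower, upper]
        kept = [rec for rec, rv in enumerate(r, start=1) if not (lower < rv <= upper)]
        samples.append(kept)
    return samples


samples = training_samples(100, folds=5, seed=1)
print([len(s) for s in samples])  # [80, 80, 80, 80, 80]
```

Each training sample omits a different block of 20 records, and the five omitted blocks together cover every record exactly once.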

The shorthand cvx operator can be used to generate cross-validation groups. For example, use:

_r_ cv5

to create subgroups for 5-fold cross-validation, with 5 randomly chosen subgroups according to the quintiles of _r_. That is, the line is equivalent to

_r_ p20,p0 p40,p20 p60,p40 p80,p60 p100,p80

The resulting cross-validated estimate of misclassification is based on the misclassification rates of V (possibly) different partitions, one for each of the V cross-validation groups. However, the estimate is assumed to be a valid misclassification rate of the partition you are interested in, that is, the one constructed from the whole sample. This partition is easily obtained once the cross-validation run is done, by de-selecting Y:By subgroup/Cross-validate.

9.9.4 _all_ attribute

You can specify the same attribute construction for all listed variables that are input using the _all_ attribute. For example, if all the variables are binary you can do:

_all_ b 1=yes

which will create an attribute with the label "yes" for all variables. If you have mixed type variables you can create attributes according to above/below mean value of each variable with

_all_ mean

The _all_ designator is useful to quickly create attributes.

9.9.5 The cross-validation specification

As noted the line

_r_ p20,p0 p40,p20 p60,p40 p80,p60 p100,p80

will create 5 overlapping attributes which can be used for cross-validation. This rather awkward construction can be replaced with:

_r_ cv5

allowing 5-group cross-validation based on percentiles of _r_. Or if you wish to create attributes for 10-group cross-validation you can do

_r_ cv10

The construction allows the possibilities cv2, cv3, ..., cv10 only. Division into subgroups can be done on the basis of any variable, e.g.

_i_ cv5

is allowable, so is

age cv5

9.10 Continuation indicator

You can continue an attribute creating line on to the next line with a ! indicator. For example:

AGE 20 30 40 50 !

60 70 80 90

is the same as

AGE 20 30 40 50 60 70 80 90

You can only continue on up to two lines.
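The effect of the continuation indicator can be illustrated with a small Python sketch that joins continued lines before they are interpreted. The function name join_continuations is hypothetical, and the sketch does not enforce the limit on the number of continued lines.

```python
def join_continuations(lines):
    """Join control file lines that end with the ! continuation indicator."""
    joined = []
    pending = ""
    for line in lines:
        text = line.strip()
        if text.endswith("!"):
            # drop the indicator and carry the text over to the next line
            pending += text[:-1].strip() + " "
        else:
            joined.append((pending + text).strip())
            pending = ""
    if pending:
        # a trailing ! with nothing after it: keep what was collected
        joined.append(pending.strip())
    return joined


print(join_continuations(["AGE 20 30 40 50 !", "60 70 80 90"]))
# ['AGE 20 30 40 50 60 70 80 90']
```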
