School of Population Health

10. Y Value

 

The Y menu lets you select the dependent variable(s) for the analysis. It also provides scatter and distribution plots, dichotomising and transforming of variables, and sub-group (by-group) analyses, which enable cross-validation.

10.1 OK var

When you have defined the dependent variable(s), using Select Y (below), this item fixes the selection. If you have selected more than one dependent variable, the menu item changes to OK Multiple Y: vars for a list of variables.

10.2 Distribution

This item creates a graphic of the (univariate) distribution of Y. There are five options: Empirical pdf (probability density function), Bar chart frequency, Bar chart relative frequency, Survival and Empirical cdf (cumulative distribution function) plot. These options are activated by buttons in the Distribution window:

 

Empirical pdf

The empirical probability density plot is a kernel density estimate, using a normal bandwidth smoother. It is only appropriate for interval data and will be disallowed if there are fewer than 10 distinct values of the variable. The effect of censoring, if data are censored, is ignored. The width of the normal smoother can be adjusted by clicking on the Band+ and Band- buttons that become active in the window. Here is an example:
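The following is a minimal sketch of a kernel density estimate with a normal (Gaussian) kernel, written in Python to illustrate what the Band+ and Band- buttons change. It is not SPAN's own code: the function names, the sample data and the bandwidth values are assumptions made for illustration only.

```python
import math

def gaussian_kde(values, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    density = []
    for g in grid:
        total = sum(math.exp(-0.5 * ((g - v) / bandwidth) ** 2) for v in values)
        density.append(norm * total)
    return density

def can_plot_pdf(values, min_distinct=10):
    """Mirror the rule in the text: at least 10 distinct values are needed."""
    return len(set(values)) >= min_distinct

# Hypothetical sample; censoring is ignored, as stated in the text.
sample = [1.2, 1.9, 2.4, 2.5, 3.1, 3.3, 3.8, 4.0, 4.6, 5.2, 5.9]
grid = [i * 0.1 for i in range(-50, 121)]  # evaluation points from -5.0 to 12.0

narrow = gaussian_kde(sample, grid, bandwidth=0.3)  # less smoothing
wide = gaussian_kde(sample, grid, bandwidth=1.0)    # more smoothing
```

A wider band gives a smoother, flatter curve and a narrower band follows the data more closely. In both cases the area under the curve stays close to 1.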

 

Bar Chart

The Bar_n and Bar_% options produce a bar chart of counts or percentages respectively, with one bar for each unique value; no grouping is done initially. Once the chart appears, the top buttons change to Bin+ and Bin-, which allow grouping into bins. The bin widths are determined automatically; Bin+ increases and Bin- decreases the number of bins.

 

Survival, empirical cdf

Survival and empirical cdf plots are estimated non-parametrically. If the variable is censored (which is only allowed when the log-rank effectiveness criterion has been selected, see A3.7), the Kaplan-Meier product-limit method is used. To obtain a Kaplan-Meier plot you must already have selected log-rank in the Criteria dialog box, which allows you to set the censoring attribute. Survival is simply 1 - empirical cdf.
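As an illustration of the product-limit method, here is a short Python sketch of a Kaplan-Meier estimate. It is a textbook version, not SPAN's implementation. The function name and the data are hypothetical, and the sketch assumes the usual convention that observations censored at time t are still counted as at risk at t.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    events[i] is True if the i-th time is an observed event and
    False if it is censored.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect everything that happens at time t.
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                deaths += 1
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk  # product-limit step
            curve.append((t, surv))
        n_at_risk -= removed
    return curve

# Hypothetical data: the second observation at time 3 is censored.
censored_curve = kaplan_meier([2, 3, 3, 5, 8], [True, True, False, True, True])

# With no censoring the estimate reduces to 1 - empirical cdf.
uncensored_curve = kaplan_meier([1, 2, 3, 4], [True, True, True, True])
```

For the uncensored data the survival values step down by 1/4 at each time, which is exactly 1 minus the empirical cdf.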

Note that if Y is multivariate, the distribution is done only for the first variable in the list of Y.

The number and % missing data are shown on distributions.

10.3 Scatterplot/crosstab

This item opens a dialog used to produce a scatter diagram. The dialog lets you choose the X and Y variables to be plotted. An example scatter diagram is as follows:

 

The scatter window becomes active for mouse input and a series of buttons is shown. These act as follows:

Missd - missing values (if there are any) are shown on the plot to the left of the Y axis or below the X axis.

X->Y - interchanges the X and Y axes.

+/- Pts - increases or decreases the size of the plotting positions.

Z - allows a third variable or attribute to be displayed, as described more fully below.

Dense - allows the intensity or density of data to be visualised. Plotting positions that overlay existing plotting positions gradually increase in tonal intensity. Here is an example:

 

 

JittX, JittY - jitters the points in either the X or Y direction.

Box - if one of the X or Y variables is detected as being discrete and the other continuous, a box plot is produced, like this for the above data:

 

 

 

If data are grouped, the plotting position is scaled according to the size of the groups, i.e. such that the area of the square is proportional to FREQ (see 8.3). Also, if both variables are seen as discrete, rectangles are drawn scaled according to the number of observations. For example:

 

In this case a 2-way cross-tabulation is also written to the output log with associated chi-square statistic and P-value (not continuity corrected).
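For reference, the Pearson chi-square statistic without continuity correction can be sketched in Python as follows. This is a generic textbook calculation, not SPAN's code. The function name and the example table are hypothetical, and the sketch assumes every row and column total is non-zero.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a 2-way table.

    table is a list of rows of observed counts. No continuity correction
    is applied. Assumes all row and column totals are non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df

# Hypothetical 2x2 cross-tabulation of two discrete variables.
stat, df = chi_square([[10, 20], [20, 10]])
```

The P-value is the upper tail probability of a chi-square distribution with df degrees of freedom, evaluated at the statistic.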

10.3.1 Z: Adding a third variable. Rotating 3D plots

A third variable can be included in the scatter function in two ways. First, plotting points can be distinguished according to the presence or absence of an attribute; the attribute values are distinguished by colour, red or black. Second, the third variable can be included as a third axis in a 3D rotating plot. If you click the Z button on a 2D scatter diagram, a "Third variable Z" dialog appears. For a rotating plot, choose a "Z variable for rotating plot" from those listed. Buttons on the scatter window will then change to allow animation of the 3D plot. Here is an example, with density shading:

 


If all three variables X, Y and Z are detected to be discrete, a cube with volume proportional to the number of observations is produced:

 



10.4 Transform

Transform allows you to create new variables by some simple operations:

 

You need to highlight Y1 and Y2 and choose how they are to be combined. The name of the new variable must also be entered. For log and square root transformations, you need only highlight Y1. If either Y1 or Y2 is missing (negative), the result is assumed to be missing too.

It is possible that the difference Y1-Y2 and the log operations will produce negative values, which SPAN would normally interpret as missing. To avoid this a constant will be added. A message informing of this, and the size of the constant, will be issued.

If the denominator in the division operation Y1/Y2 is zero, the result is assumed to be missing. Similarly for log(Y1) when Y1 is zero.
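The missing-value rules above can be sketched in Python as follows. This is only an illustration, not SPAN's code. The function names are hypothetical, and the size of the added constant is an assumption: the text does not say how SPAN chooses it, so the sketch simply shifts the smallest result up to zero.

```python
MISSING = -1  # assumption: any negative value stands for "missing"

def is_missing(v):
    return v < 0

def divide(y1, y2):
    """Y1/Y2: missing if either operand is missing or the denominator is zero."""
    out = []
    for a, b in zip(y1, y2):
        if is_missing(a) or is_missing(b) or b == 0:
            out.append(MISSING)
        else:
            out.append(a / b)
    return out

def difference(y1, y2):
    """Y1-Y2 with a constant added so that valid results are never negative.

    Returns the new values and the constant used. The choice of constant
    (minus the smallest difference) is an assumption of this sketch.
    """
    raw = [None if is_missing(a) or is_missing(b) else a - b
           for a, b in zip(y1, y2)]
    valid = [d for d in raw if d is not None]
    shift = -min(valid) if valid and min(valid) < 0 else 0
    return [MISSING if d is None else d + shift for d in raw], shift

ratio = divide([4, -1, 6, 3], [2, 5, 0, -1])
diff, constant = difference([5, 2, -1], [3, 6, 1])
```

Here the second, third and fourth ratios are missing, and the differences 2 and -4 are shifted by 4 so that they remain distinguishable from the missing third value.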

Y1 o Y2 is a transformation of binary Y1 and Y2 that assigns values (a,b,c) as follows: (0,0)=a, (1,0)=b, (1,1)=c, where the values a, b and c are entered by the user in the right-hand panel.

perm Y1 creates a random re-ordering (permutation) of the values of Y1. The new variable is statistically independent of all other variables.  
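A random permutation of this kind can be sketched in Python as below. The function name and the seed parameter are hypothetical and are not part of SPAN.

```python
import random

def permute(values, seed=None):
    """Return a randomly re-ordered copy of values; the input is left unchanged."""
    rng = random.Random(seed)
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled

y1 = [3, 1, 4, 1, 5, 9, 2, 6]
y1_perm = permute(y1, seed=1)
```

The permuted variable keeps exactly the same set of values as Y1, so its distribution is unchanged; only the pairing of values with observations is randomised.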

For each transformation a new variable is created. There is no mechanism to save the derived variables.

10.4.1 Creating attributes from binary transformations


The transformation Y1=a,b.. forms a binary variable from Y1 that has value 1 if Y1 takes any of the values a,b... Also, a<Y1<=b forms a binary variable which has value 1 on the range a to b. For both these transformations a corresponding attribute can optionally also be formed, in response to a question dialog which appears automatically. Attributes formed this way are "primitive" in the sense that the variable from which they are formed is not remembered when it comes to manipulating Boolean expressions.
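The two binary transformations can be sketched in Python as follows. The function names are hypothetical, and the treatment of missing values is not covered because the text does not describe it.

```python
def in_set(y1, values):
    """Y1=a,b..: 1 where Y1 takes any of the listed values, otherwise 0."""
    return [1 if v in values else 0 for v in y1]

def in_range(y1, a, b):
    """a<Y1<=b: 1 where Y1 lies above a and at or below b, otherwise 0."""
    return [1 if a < v <= b else 0 for v in y1]

set_indicator = in_set([1, 2, 3, 4], {2, 4})
range_indicator = in_range([10, 20, 30, 40], 20, 40)
```

Note that the range is open at the lower end and closed at the upper end, so in the second example the value 20 is coded 0 and the value 40 is coded 1.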

10.5 Select Y

This option lets you pick new Y variable(s). A list of variables will appear and variables can be picked from it.

You can select more than one variable, so allowing a multivariate outcome measure (see 11.5). In this case the correlation matrix of the selected variables is written to the output log.  This feature is still experimental; in most cases it should not be used.

10.6 By-group/cross-validate

Separate analyses in different groups can be carried out with this option, which effectively allows cross-validation. A By analysis dialog box will appear:

 

By-groups are defined in two possible ways: by value and by attribute, as indicated by radio buttons.

By value

In this option by-groups are defined in terms of distinct values of a selected variable of the data set. There will be as many groups as there are distinct values of the variable. A variable needs to be picked from the displayed menu. If there are more than 20 distinct values, no by-group analysis is allowed. With this option by-groups are mutually exclusive.

By attribute

In this option by-groups are defined according to attributes formed from a selected variable. For example, suppose a control file line for a variable X is X i >20 >40. SPAN creates attributes X>20 and X>40, and a by-group is identified for each of these two attributes. As in this example, the by-groups formed from attributes are not necessarily mutually exclusive.

When you choose by attribute and there is more than one line in the control file defining attributes for the selected variable, an Attribute groups box appears. This lists the lines of the control file that specify attributes for the variable. You need to choose a line to determine the by-groups.

Cross-validation

The by-group facility allows cross-validation to be accomplished. When a Search: Go is done with by-group selected, a separate search is done for each by-group. The optimal partition found in each search is then immediately applied to the data not in that search's by-group, to establish the misclassification error on data not used to generate the partition. These misclassification rates are then averaged.
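The procedure can be sketched in Python as follows. This is an outline of the idea only, not SPAN's search. The threshold_search function is a hypothetical stand-in for Search: Go that picks a single cut-point on one variable, and the data are invented. By-groups are assumed to be defined by value, so they are mutually exclusive.

```python
def threshold_search(train):
    """Toy stand-in for a search: the cut-point t with fewest training errors.

    Each row is (group, x, y); the rule predicts y = 1 when x > t.
    """
    best_t, best_err = None, None
    for t in sorted({x for _, x, _ in train}):
        err = sum(1 for _, x, y in train if (1 if x > t else 0) != y)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t

def misclassified(t, row):
    _, x, y = row
    return (1 if x > t else 0) != y

def cross_validate(rows, search, is_error):
    """Search within each by-group, test on the rest, average the error rates."""
    groups = sorted({g for g, _, _ in rows})
    rates = []
    for g in groups:
        train = [r for r in rows if r[0] == g]
        test = [r for r in rows if r[0] != g]
        if not test:  # a single group leaves nothing to test on
            continue
        rule = search(train)
        errors = sum(1 for r in test if is_error(rule, r))
        rates.append(errors / len(test))
    return sum(rates) / len(rates) if rates else None

rows = [("A", 1, 0), ("A", 2, 0), ("A", 3, 1), ("A", 4, 1),
        ("B", 1, 0), ("B", 2, 0), ("B", 3, 1), ("B", 4, 0)]
mean_error = cross_validate(rows, threshold_search, misclassified)
```

In this example the rule found in group A misclassifies one of the four observations in group B, and the rule found in group B misclassifies none in group A, giving an averaged misclassification rate of 0.125.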
