Chapter 4 Explanatory Variables and Regression
This chapter deals with the various forms the explanatory variables can take and their possible interplay. In principle, this applies to all forms of regression analysis, but here we concentrate on survival models.
Explanatory variables, or covariates, may be of essentially two different types, continuous and discrete. The discrete type usually takes only a finite number of distinct values and is called a factor in R. A special case of a factor is one that takes only two distinct values, say 0 and 1. Such a factor is called an indicator, because we can let the value 1 indicate the presence of a certain property and 0 denote its absence. To summarize, there are three kinds:
- Covariate: taking values in an interval (age, blood pressure, weight).
- Factor: taking a finite number of values (civil status, occupation, cohort).
- Indicator: a factor taking two values (sex, born in Umeå).
Usually, we do not distinguish between factors and indicators; in what follows, they are all treated as factors.
4.1 Continuous Covariates
We use the qualifier continuous to stress that factors are excluded, because the term covariate is often used as a synonym for explanatory variable.
The values taken by a continuous covariate are ordered, and by model definition the effect on the response is ordered in the same (or the reverse) order. The values taken by a factor, on the other hand, are unordered (although they may be declared as ordered in R).
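As a minimal illustration of that parenthetical remark (the variable name `ses` is made up for this sketch), a factor can be declared as ordered in R:

```r
# Factor levels are unordered by default; ordered = TRUE imposes an ordering
ses <- factor(c("low", "high", "middle", "low"),
              levels = c("low", "middle", "high"),
              ordered = TRUE)
ses
## [1] low    high   middle low
## Levels: low < middle < high
```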
4.2 Factor Covariates
An explanatory variable that can take only a finite (usually small) number of distinct values is called a categorical variable. In the R language, it is called a factor. Examples of such variables are gender, socio-economic status, and birthplace. Students of statistics have long been taught to create dummy variables in such situations, in the following way:
- Given a categorical variable \(F\) with \(k+1\) levels \((f_0, f_1, f_2, \ldots, f_k)\),
- create \(k\) indicator ("dummy") variables \((I_1, I_2, \ldots, I_k)\).

The level \(f_0\) is the reference category, characterized by the fact that all indicator variables are zero for an individual at this level. Generally, for level \(f_i\), \(i = 1, \ldots, k\), the indicator \(I_i\) is one and the rest are zero. In other words, for a single individual, at most one indicator is one, and the rest are zero.
In R, there is no need to explicitly create dummy variables; it is done behind the scenes by the functions `factor` and `as.factor`.
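To see what R does behind the scenes, one can inspect the design matrix directly; a minimal sketch with a made-up three-level factor:

```r
# A three-level factor; the first level ("low") becomes the reference category
f <- factor(c("low", "middle", "high", "middle"),
            levels = c("low", "middle", "high"))
model.matrix(~ f)  # the columns fmiddle and fhigh are the two dummy variables
```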
Note that a factor with two levels, i.e., an indicator variable, can always be treated as a continuous covariate, if coded numerically (e.g., 0 and 1).
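For instance (a sketch; the data frame `dat` is made up for this illustration):

```r
# A two-level factor recoded as a numeric 0/1 indicator covariate
dat <- data.frame(sex = factor(c("male", "female", "female")))
dat$female <- as.numeric(dat$sex == "female")  # 0 = male, 1 = female
```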
Consider a demographic example, the influence of mother's age (a continuous covariate) on infant mortality. It is considered well-known that a young mother means a high risk for the infant, and likewise that an old mother means a high risk, compared to "in-between-aged" mothers. So the risk order is not the same (or the reverse) as the age order.
One solution (not necessarily the best) to this problem is to factorize: Let, for instance,
\[\begin{equation*} \mbox{mother's age} = \left\{\begin{array}{ll} \mbox{low}, & 15 < \mbox{age} \le 25 \\ \mbox{middle}, & 25 < \mbox{age} \le 35 \\ \mbox{high}, & 35 < \mbox{age} \le 50 \end{array} \right. \end{equation*}\]
In this layout, there will be two parameters measuring the deviation from the reference category, which will be the first category by default.
In R, this is easily achieved with the aid of the `cut` function. It works like this, illustrated with the data set `child`:
```r
age.group <- cut(child$m.age, c(15, 25, 35, 51))
table(age.group, useNA = "ifany")
```

```
age.group
(15,25] (25,35] (35,51] 
   3917   13805    8852 
```
Note that the created intervals by default are closed to the right and open to the left. This has consequences for how observations exactly on a boundary are treated: they belong to the lower-valued interval. The argument `right` in the call to `cut` can be used to switch this behavior the other way around.
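For instance (a small sketch; the name `age.group2` is just illustrative):

```r
# With right = FALSE the intervals become [15,25), [25,35), [35,51),
# so a boundary value such as 25 now falls in the higher interval
age.group2 <- cut(child$m.age, c(15, 25, 35, 51), right = FALSE)
```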
Note further that values falling below the smallest boundary (15 in our example) or above the largest (51) are reported as missing values (`NA` in R terminology, Not Available).
For further information about the use of the `cut` function, see its help page.
An infant mortality analysis with the created factor covariate may look like this:
```r
ch <- child
ch$age.group <- age.group
ch <- age.window(ch, c(0, 1)) # Right censor at age one.
fit <- coxreg(Surv(enter, exit, event) ~ age.group, data = ch)
print(summary(fit), short = TRUE)
```
```
Covariate            Mean    Coef  Rel.Risk  S.E.    LR p
age.group                                            0.008
        (15,25]     0.147   0         1 (reference)
        (25,35]     0.522  -0.056     0.945  0.064
        (35,51]     0.332   0.090     1.094  0.067
```
Obviously, it is safest for the infants whose mothers' ages lie in the span 25–35 years.\(\ \Box\)
4.3 Interactions
The meaning of interaction between two explanatory variables is described by an example, walking through the three possible combinations of covariate types.
We return to the `oldmort` data set, slightly modified (see Table ??), where the first five of a total of 6495 rows are shown.
The sampling frame is a rectangle in the Lexis diagram, see Figure 4.1. It can be described as all persons born between January 1, 1775 and January 1, 1821, and present in the solid rectangle at some moment.
FIGURE 4.1: Old age mortality sampling frame, Lexis diagram.
There are four possible explanatory variables behind survival after age 60 in the example.
\[\begin{eqnarray*}
\mbox{farmer} & = & \left\{ \begin{array}{ll} \mbox{no} & \\ \mbox{yes} & \left(e^{\alpha}\right) \end{array} \right. \\
\mbox{sex} & = & \left\{ \begin{array}{ll} \mbox{male} & \\ \mbox{female} & \left(e^{\beta}\right) \end{array} \right. \\
\mbox{birthdate } (x_1) & & \mspace{90mu} \left(e^{\gamma_1 x_1}\right) \\
\mbox{IMR at birth } (x_2) & & \mspace{90mu} \left(e^{\gamma_2 x_2}\right)
\end{eqnarray*}\]

The first two are factors, and the last two are continuous covariates. We go through the three distinct combinations and illustrate the meaning of interaction in each. \(\ \Box\)
4.3.1 Two factors
Assume that we, in the oldmort example, only have the factors `sex` and `farmer` at our disposal as explanatory variables. We may fit a Cox regression model like this:
```r
fit <- coxreg(Surv(enter, exit, event) ~ sex + farmer,
              data = om)
```
The output from `coxreg` (the model fit) is saved in the object `fit`, which is investigated first by the function `summary`. Here it performs successive \(\chi^2\) tests of the significance of the estimated regression coefficients (the null hypothesis in each case being that the true coefficient is zero).
```r
print(summary(fit), short = TRUE)
```

```
Covariate            Mean    Coef  Rel.Risk  S.E.    LR p
sex                                                  0.000
        male        0.407   0         1 (reference)
        female      0.593  -0.232     0.793  0.048
farmer                                               0.025
        no          0.772   0         1 (reference)
        yes         0.228  -0.127     0.880  0.058
```
The \(p\)-value corresponding to the coefficient for `sex` is very small, meaning that there is a highly significant difference in mortality between women and men. The difference in mortality between farmers and non-farmers is also statistically significant, but with a larger \(p\)-value.
Next, we want to see the size of the difference in mortality between women and men, and between farmers and non-farmers, so look at the parameter estimates, especially in the column “Rel. Risk.” It tells us that female mortality is 79.3 per cent of the male mortality, and that the farmer mortality is 88.0 per cent of the non-farmer mortality. Furthermore, this model is multiplicative (additive on the log scale), so we can conclude that the mortality of a female farmer is \(0.793 \times 0.880 \times 100\) = 69.8 per cent of that of a male non-farmer.
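This multiplicative calculation can be checked directly from the fitted model; a small sketch (assuming that `coef` extracts the estimated coefficients from the `coxreg` fit, as it does for most model objects in R):

```r
# Female farmer vs. male non-farmer: both indicators are 1, so the
# log relative risk is the sum of the two coefficients
exp(coef(fit))       # the individual relative risks, approx. 0.793 and 0.880
exp(sum(coef(fit)))  # the combined relative risk, approx. 0.698
```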
We can also illustrate the sex difference graphically by a stratified analysis; see Figure 4.2.
```r
par(las = 1)
fit2 <- coxreg(Surv(enter, exit, event) ~ strata(sex) + farmer,
               data = om)
plot(fit2, fun = "cumhaz", xlim = c(60, 85),
     lty = 1:2, xlab = "Age")
abline(h = 0)
```
FIGURE 4.2: Cumulative hazards for men and women.
Obviously the proportionality assumption is well met.
An obvious question to ask is: Is the sex difference in mortality the same among farmers as among non-farmers? In other words: Is there an interaction between sex and farmer/non-farmer regarding the effect on mortality?
In order to test for an interaction in R, the plus (\(+\)) sign in the formula is changed to a multiplication sign (\(*\)):
```r
fit4 <- coxreg(Surv(enter, exit, event) ~ sex * farmer,
               data = om)
print(x <- summary(fit4), short = TRUE)
```
```
Single term deletions

Model:
Surv(enter, exit, event) ~ sex * farmer
           Df   AIC LRT Pr(>Chi)
<none>        25912             
sex:farmer  1 25916 6.4    0.011

Covariate            Mean    Coef  Rel.Risk  S.E.  Wald p
sex
        male        0.407   0         1 (reference)
        female      0.593  -0.297     0.743  0.054   0.000
farmer
        no          0.772   0         1 (reference)
        yes         0.228  -0.252     0.777  0.076   0.001
sex:farmer
     female:yes             0.291     1.338  0.114   0.011
```
Note first that the output looks different when interactions are present: the first part is a chi-square test of the significance of the interaction(s); then the usual table of parameter estimates is presented, with one difference: the Wald, and not the LR, \(p\)-values are given. The small LR \(p\)-value (1.1 per cent) indicates that there is a statistically significant (at the 5 per cent level) interaction effect. The size of the interaction effect is seen in the table of estimated coefficients. Note that the reference is a non-farming man, so the "main" effect for `sex == female` in fact compares a non-farming woman to a non-farming man, with the result that the woman has a mortality that is 74.3 per cent of that of the corresponding man. When comparing a farming woman to a farming man, we have to take the interaction into account, and the result is that the farming woman has a risk that is \(0.743 \times 1.338 \times 100 = 99.4\) per cent of the risk for the farming man. Conclusion: among farmers, the sex difference in mortality is negligible, but in the rest of the population it is large. How the causality works here is, of course, a completely different matter. We may guess that farmers are often married, for instance; we do not find any unmarried female farmers in our data.
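The farming-woman calculation can be verified from the printed output above; a small arithmetic check:

```r
# Farming woman vs. farming man: sex main effect plus the interaction term
exp(-0.297 + 0.291)  # approx. 0.994
# Equivalently, on the relative-risk scale:
0.743 * 1.338        # approx. 0.994
```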
In cases like this, it is often best to perform separate analyses for women and men, or, alternatively, for farmers and non-farmers. Do that as an exercise and interpret your results.
4.3.2 One factor and one continuous covariate
Now the covariates `sex` (factor) and `birthdate` (continuous) are in focus. First the model without interaction is fitted.
```r
fit5 <- coxreg(Surv(enter, exit, event) ~ sex + birthdate,
               data = om)
print(summary(fit5), short = TRUE)
```
```
Covariate            Mean    Coef  Rel.Risk  S.E.    LR p
sex                                                  0.000
        male        0.407   0         1 (reference)
        female      0.593  -0.207     0.813  0.047
birthdate        1802.908  -0.006     0.994  0.004   0.135
```
Obviously `sex` is statistically significant, while `birthdate` is not. We go directly to the estimated coefficients after adding an interaction term to the model:
```r
fit6 <- coxreg(Surv(enter, exit, event) ~ sex * birthdate,
               data = om)
round(summary(fit6)$coefficients[, 1:2], 5)
```
```
                        coef  exp(coef)
sexfemale            9.12285 9162.24554
birthdate           -0.00327    0.99673
sexfemale:birthdate -0.00519    0.99483
```
What is going on here? The main effect of `sex` is huge! The answer lies in the interpretation of the coefficients:

(i) To evaluate the effect on mortality for an individual with given values of the covariates, we must involve the interaction coefficient. For instance, a female born on January 2, 1801 (`birthdate` = 1801.003) will have the relative risk

\[\exp(9.12285 - 0.00327 \times 1801.003 - 0.00519 \times 1801.003) \approx 0.0022\]

compared to a man with `birthdate` = 0, the reference values! Obviously, it is ridiculous to extrapolate the model 1800 years back in time, but this is nevertheless the correct way to interpret the coefficients. So, at `birthdate` = 0, women's mortality is 9162 times that of men!
(ii) With interactions involved, it is strongly recommended to center continuous covariates. In our example, we could do that by subtracting 1810 (say) from `birthdate`. That makes `sex = "male"`, `birthdate = 1810` the new reference values.
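The centering and refit might look like this (a sketch; the object name `fit7` is our choice, mirroring the call that produced `fit6`):

```r
om$birthdate <- om$birthdate - 1810  # center at 1810
fit7 <- coxreg(Surv(enter, exit, event) ~ sex * birthdate,
               data = om)
round(summary(fit7)$coefficients[, 1:2], 5)
```

The result is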
```
                        coef exp(coef)
sexfemale           -0.26429   0.76775
birthdate           -0.00327   0.99673
sexfemale:birthdate -0.00519   0.99483
```
This is more reasonable-looking: at the beginning of the year 1810, women had a mortality of 76.8 per cent of that of men.
To summarize: the effects can be illustrated by two curves, one for men and one for women. The coefficients for `sex`, 0 and \(-0.26429\), are the respective intercepts, and \(-0.00327\) and \(-0.00846 = -0.00327 - 0.00519\) are the slopes. See Figure 4.3 for a graphical illustration.
FIGURE 4.3: The effect of birthdate on relative mortality for men and women. Reference: Men, birthdate = 1810.
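A figure of this kind can be sketched directly from the estimated coefficients, assuming birthdate is centered at 1810 and ranges over roughly 1775–1821 as in the sampling frame:

```r
bd <- seq(1775, 1821, length.out = 101) - 1810   # centered birthdate
plot(bd + 1810, exp(-0.00327 * bd), type = "l", ylim = c(0.6, 1.2),
     xlab = "Birthdate", ylab = "Relative mortality")   # men
lines(bd + 1810, exp(-0.26429 - 0.00846 * bd), lty = 2) # women
abline(h = 1, lty = 3)  # reference: men, birthdate = 1810
```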
4.3.3 Two continuous covariates
In our example data, the covariates `birthdate` and `imr.birth` are continuous, and we get
```r
fit8 <- coxreg(Surv(enter, exit, event) ~ birthdate * imr.birth,
               data = om)
(res <- round(summary(fit8)$coefficients[, 1:2], 5))
```
```
                        coef exp(coef)
birthdate           -0.03024   0.97021
imr.birth            0.02004   1.02024
birthdate:imr.birth  0.00158   1.00158
```
The interpretation of the coefficients is as follows: the reference point is `birthdate = 0` (1810) and `imr.birth = 0`, and if `birthdate = x` and `imr.birth = y`, then the relative risk is
\[ \exp(-0.03024 x + 0.02004 y + 0.00158 x y),\]
analogous to the calculation in the case with one continuous and one factor covariate.
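As a small sketch, this relative risk can be computed for any combination \((x, y)\) with a hypothetical helper function built from the printed coefficients:

```r
# Relative risk at birthdate = x (centered) and imr.birth = y,
# relative to the reference point (0, 0); coefficients taken from fit8
relrisk <- function(x, y) exp(-0.03024 * x + 0.02004 * y + 0.00158 * x * y)
relrisk(10, 15)  # e.g., birthdate 1820 and an IMR at birth of 15
```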
4.4 Interpretation of Parameter Estimates
In the proportional hazards model, the parameter estimates are logarithms of risk ratios relative to the baseline hazard. The precise interpretations of the coefficients for the two types of covariates are discussed below. The conclusion is that \(e^\beta\) has a more direct and intuitive interpretation than \(\beta\) itself.
4.4.1 Continuous covariate
If \(x\) is a continuous covariate, and \(h(t; x) = h_0(t)e^{\beta x}\), then
\[\begin{equation*} \frac{h(t; x+1)}{h(t; x)} = \frac{h_0(t)e^{\beta (x+1)}}{h_0(t)e^{\beta x}} = e^\beta, \end{equation*}\]

so the risk increases by a factor \(e^\beta\) when \(x\) is increased by one unit. In other words, \(e^\beta\) is a relative risk (or a hazard ratio, which is often the preferred term in certain professions).
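For instance, taking the `birthdate` coefficient \(\beta = -0.006\) from the model `fit5` above:

\[ e^{-0.006} \approx 0.994, \qquad e^{10 \times (-0.006)} = e^{-0.06} \approx 0.942, \]

so each one-year increase in birthdate multiplies the hazard by about 0.994 (a 0.6 per cent decrease), and a ten-year increase multiplies it by about 0.942 (a 5.8 per cent decrease).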
4.4.2 Factor
For a factor covariate, in the usual coding with a reference category, \(e^\beta\) is the relative risk compared to that reference category.
4.5 Model Selection
In regression modeling, there are often several competing models for describing the data. In general, there are no strict rules for "correct selection". However, for nested models there are some formal guidelines; a likelihood ratio comparison is sketched below. For a precise definition of nesting, see Appendix A.
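For nested models fitted with `coxreg`, a likelihood ratio test can be carried out by hand; a sketch assuming, as in the eha package, that the fitted object stores its log likelihoods in the component `loglik` (with the maximized value being the largest):

```r
# Is farmer needed, given sex? Compare two nested Cox models with an LR test.
fit.small <- coxreg(Surv(enter, exit, event) ~ sex, data = om)
fit.large <- coxreg(Surv(enter, exit, event) ~ sex + farmer, data = om)
lrt <- 2 * (max(fit.large$loglik) - max(fit.small$loglik))
pchisq(lrt, df = 1, lower.tail = FALSE)  # p-value; df = 1 added parameter
```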
4.5.1 Model selection in general
Some general advice regarding model selection is given here.
- Remember, there are no true models, only some useful ones. This statement is attributed to G.E.P. Box.
- More than one model may be useful.
- Keep important covariates in the model.
- Avoid automatic stepwise procedures!
- If interaction terms are present, the corresponding main terms must be there. (There are some exceptions to this rule, see for instance the example about infant and maternal mortality in Chapter 10.)