Chapter 4 Explanatory Variables and Regression
This chapter deals with the various forms the explanatory variables can take and their possible interplay. In principle, this applies to all forms of regression analysis, but here we concentrate on survival models.
Explanatory variables, or covariates, may be of essentially two different types, continuous and discrete. The discrete type usually takes only a finite number of distinct values and is called a factor in R. A special case of a factor is one that takes only two distinct values, say 0 and 1. Such a factor is called an indicator, because we can let the value 1 indicate the presence of a certain property and 0 denote its absence. To summarize, there are three kinds:
- Covariate: taking values in an interval (age, blood pressure, weight).
- Factor: taking a finite number of values (civil status, occupation, cohort).
- Indicator: a factor taking two values (sex, born in Umeå).
Usually, we do not distinguish between factors and indicators; in what follows, they are all treated as factors.
4.1 Continuous Covariates
We use the qualifier continuous to stress that factors are excluded, because the term covariate is often used as a synonym for explanatory variable.
The values taken by a continuous covariate are ordered, and by model definition the effect on the response is ordered in the same (or the reverse) order. The values taken by a factor, on the other hand, are unordered (although they may be declared as ordered in R).
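As a minimal illustration of that parenthetical remark (the variable name `ses` is made up for this sketch), a factor can be declared as ordered in R:

```r
# Factor levels are unordered by default; ordered = TRUE imposes an ordering
ses <- factor(c("low", "high", "middle", "low"),
              levels = c("low", "middle", "high"),
              ordered = TRUE)
ses
## [1] low    high   middle low
## Levels: low < middle < high
```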
4.2 Factor Covariates
An explanatory variable that can take only a finite (usually small) number of distinct values is called a categorical variable. In the R language, it is called a factor. Examples of such variables are gender, socio-economic status, and birthplace. Students of statistics have long been taught to create dummy variables in such situations, in the following way:
- Given a categorical variable \(F\) with \(k+1\) levels \((f_0, f_1, f_2, \ldots, f_k)\),
- create \(k\) indicator ("dummy") variables \((I_1, I_2, \ldots, I_k)\).

The level \(f_0\) is the reference category, characterized by the fact that all indicator variables are zero for an individual at this level. Generally, for level \(f_i\), \(i = 1, \ldots, k\), the indicator \(I_i\) is one and the rest are zero. In other words, for a single individual, at most one indicator is one, and the rest are zero.
In R, there is no need to explicitly create dummy variables; it is done behind the scenes by the functions `factor` and `as.factor`.
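To see what R does behind the scenes, one can inspect the design matrix directly; a minimal sketch with a made-up three-level factor:

```r
# A three-level factor; the first level ("low") becomes the reference category
f <- factor(c("low", "middle", "high", "middle"),
            levels = c("low", "middle", "high"))
model.matrix(~ f)  # the columns fmiddle and fhigh are the two dummy variables
```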
Note that a factor with two levels, i.e., an indicator variable, can always be treated as a continuous covariate, if coded numerically (e.g., 0 and 1).
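For instance (a sketch; the data frame `dat` is made up for this illustration):

```r
# A two-level factor recoded as a numeric 0/1 indicator covariate
dat <- data.frame(sex = factor(c("male", "female", "female")))
dat$female <- as.numeric(dat$sex == "female")  # 0 = male, 1 = female
```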
Consider a demographic example, the influence of mother's age (a continuous covariate) on infant mortality. It is considered well-known that a young mother means a high risk for the infant, and likewise that an old mother means a high risk, compared to "in-between-aged" mothers. So the risk order is not the same (or the reverse) as the age order.
One solution (not necessarily the best) to this problem is to factorize: Let, for instance,
\[\begin{equation*} \mbox{mother's age} = \left\{\begin{array}{ll} \mbox{low}, & 15 < \mbox{age} \le 25 \\ \mbox{middle}, & 25 < \mbox{age} \le 35 \\ \mbox{high}, & 35 < \mbox{age} \le 50 \end{array} \right. \end{equation*}\]
In this layout, there will be two parameters measuring the deviation from the reference category, which will be the first category by default.
In R, this is easily achieved with the aid of the `cut` function. It works like this, illustrated with the data set `child`:
```r
age.group <- cut(child$m.age, c(15, 25, 35, 51))
table(age.group, useNA = "ifany")
```

```
age.group
(15,25] (25,35] (35,51] 
   3917   13805    8852 
```
Note that the created intervals by default are closed to the right and open to the left. This has consequences for how observations exactly on a boundary are treated: they belong to the lower-valued interval. The argument `right` in the call to `cut` can be used to switch this behavior the other way around.
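For instance (a small sketch; the name `age.group2` is just illustrative):

```r
# With right = FALSE the intervals become [15,25), [25,35), [35,51),
# so a boundary value such as 25 now falls in the higher interval
age.group2 <- cut(child$m.age, c(15, 25, 35, 51), right = FALSE)
```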
Note further that values falling below the smallest boundary (15 in our example) or above the largest (51) are reported as missing values (`NA` in R terminology, Not Available).
For further information about the use of the `cut` function, see its help page.
An infant mortality analysis with the created factor covariate may look like this:
```r
ch <- child
ch$age.group <- age.group
ch <- age.window(ch, c(0, 1)) # Right censor at age one.
fit <- coxreg(Surv(enter, exit, event) ~ age.group, data = ch)
print(summary(fit), short = TRUE)
```
```
Covariate            Mean    Coef  Rel.Risk  S.E.    LR p
age.group                                            0.008
        (15,25]     0.147   0         1 (reference)
        (25,35]     0.522  -0.056     0.945  0.064
        (35,51]     0.332   0.090     1.094  0.067
```
Obviously, it is safest for the infants whose mothers' ages lie in the span 25–35 years.\(\ \Box\)
4.3 Interactions
The meaning of interaction between two explanatory variables is described by an example, walking through the three possible combinations of covariate types.
We return to the `oldmort` data set, slightly modified (see Table ??), where the first five of a total of 6495 rows are shown.
The sampling frame is a rectangle in the Lexis diagram, see Figure 4.1. It can be described as all persons born between January 1, 1775 and January 1, 1821, and present in the solid rectangle at some moment.
FIGURE 4.1: Old age mortality sampling frame, Lexis diagram.
There are four possible explanatory variables behind survival after age 60 in the example.
\[\begin{eqnarray*}
\mbox{farmer} & = & \left\{ \begin{array}{ll} \mbox{no} & \\ \mbox{yes} & \left(e^{\alpha}\right) \end{array} \right. \\
\mbox{sex} & = & \left\{ \begin{array}{ll} \mbox{male} & \\ \mbox{female} & \left(e^{\beta}\right) \end{array} \right. \\
\mbox{birthdate } (x_1) & & \mspace{90mu} \left(e^{\gamma_1 x_1}\right) \\
\mbox{IMR at birth } (x_2) & & \mspace{90mu} \left(e^{\gamma_2 x_2}\right)
\end{eqnarray*}\]

The first two are factors, and the last two are continuous covariates. We go through the three distinct combinations and illustrate the meaning of interaction in each. \(\ \Box\)
4.3.1 Two factors
Assume that we, in the oldmort example, only have the factors `sex` and `farmer` at our disposal as explanatory variables. We may fit a Cox regression model like this:
```r
fit <- coxreg(Surv(enter, exit, event) ~ sex + farmer,
              data = om)
```
The output from `coxreg` (the model fit) is saved in the object `fit`, which is investigated first by the function `summary`. Here it performs successive \(\chi^2\) tests of the significance of the estimated regression coefficients (the null hypothesis in each case being that the true coefficient is zero).
```r
print(summary(fit), short = TRUE)
```

```
Covariate            Mean    Coef  Rel.Risk  S.E.    LR p
sex                                                  0.000
        male        0.407   0         1 (reference)
        female      0.593  -0.232     0.793  0.048
farmer                                               0.025
        no          0.772   0         1 (reference)
        yes         0.228  -0.127     0.880  0.058
```
The \(p\)-value corresponding to the coefficient for `sex` is very small, meaning that there is a highly significant difference in mortality between women and men. The difference in mortality between farmers and non-farmers is also statistically significant, but with a larger \(p\)-value.
Next, we want to see the size of the difference in mortality between women and men, and between farmers and non-farmers, so look at the parameter estimates, especially in the column “Rel. Risk.” It tells us that female mortality is 79.3 per cent of the male mortality, and that the farmer mortality is 88.0 per cent of the non-farmer mortality. Furthermore, this model is multiplicative (additive on the log scale), so we can conclude that the mortality of a female farmer is \(0.793 \times 0.880 \times 100\) = 69.8 per cent of that of a male non-farmer.
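This multiplicative calculation can be checked directly from the fitted model; a small sketch (assuming that `coef` extracts the estimated coefficients from the `coxreg` fit, as it does for most model objects in R):

```r
# Female farmer vs. male non-farmer: both indicators are 1, so the
# log relative risk is the sum of the two coefficients
exp(coef(fit))       # the individual relative risks, approx. 0.793 and 0.880
exp(sum(coef(fit)))  # the combined relative risk, approx. 0.698
```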
We can also illustrate the sex difference graphically by a stratified analysis; see Figure 4.2.
```r
par(las = 1)
fit2 <- coxreg(Surv(enter, exit, event) ~ strata(sex) + farmer,
               data = om)
plot(fit2, fun = "cumhaz", xlim = c(60, 85),
     lty = 1:2, xlab = "Age")
abline(h = 0)
```
FIGURE 4.2: Cumulative hazards for men and women.
Obviously the proportionality assumption is well met.
An obvious question to ask is: Is the sex difference in mortality the same among farmers as among non-farmers? In other words: Is there an interaction between sex and farmer/non-farmer regarding the effect on mortality?
In order to test for an interaction in R, the plus (\(+\)) sign in the formula is changed to a multiplication sign (\(*\)):
```r
fit4 <- coxreg(Surv(enter, exit, event) ~ sex * farmer,
               data = om)
print(x <- summary(fit4), short = TRUE)
```
```
Single term deletions

Model:
Surv(enter, exit, event) ~ sex * farmer
           Df   AIC LRT Pr(>Chi)
<none>        25912             
sex:farmer  1 25916 6.4    0.011

Covariate            Mean    Coef  Rel.Risk  S.E.  Wald p
sex
        male        0.407   0         1 (reference)
        female      0.593  -0.297     0.743  0.054   0.000
farmer
        no          0.772   0         1 (reference)
        yes         0.228  -0.252     0.777  0.076   0.001
sex:farmer
     female:yes             0.291     1.338  0.114   0.011
```
Note first that the output looks different when interactions are present: the first part is a chi-square test of the significance of the interaction(s); then the usual table of parameter estimates is presented, with one difference: the Wald, and not the LR, \(p\)-values are given. The small LR \(p\)-value (1.1 per cent) indicates that there is a statistically significant (at the 5 per cent level) interaction effect. The size of the interaction effect is seen in the table of estimated coefficients. Note that the reference is a non-farming man, so the "main" effect for `sex == female` in fact compares a non-farming woman to a non-farming man, with the result that the woman has a mortality that is 74.3 per cent of that of the corresponding man. When comparing a farming woman to a farming man, we have to take the interaction into account, and the result is that the farming woman has a risk that is \(0.743 \times 1.338 \times 100 = 99.4\) per cent of the risk for the farming man. Conclusion: among farmers, the sex difference in mortality is negligible, but in the rest of the population it is large. How the causality works here is, of course, a completely different matter. We may guess that farmers are often married, for instance; we do not find any unmarried female farmers in our data.
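The farming-woman calculation can be verified from the printed output above; a small arithmetic check:

```r
# Farming woman vs. farming man: sex main effect plus the interaction term
exp(-0.297 + 0.291)  # approx. 0.994
# Equivalently, on the relative-risk scale:
0.743 * 1.338        # approx. 0.994
```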
In cases like this, it is often best to perform separate analyses for women and men, or, alternatively, for farmers and non-farmers. Do that as an exercise and interpret your results.
4.3.2 One factor and one continuous covariate
Now the covariates `sex` (factor) and `birthdate` (continuous) are in focus. First the model without interaction is fitted.
```r
fit5 <- coxreg(Surv(enter, exit, event) ~ sex + birthdate,
               data = om)
print(summary(fit5), short = TRUE)
```
```
Covariate            Mean    Coef  Rel.Risk  S.E.    LR p
sex                                                  0.000
        male        0.407   0         1 (reference)
        female      0.593  -0.207     0.813  0.047
birthdate        1802.908  -0.006     0.994  0.004   0.135
```
Obviously `sex` is statistically significant, while `birthdate` is not. We go directly to the estimated coefficients after adding an interaction term to the model:
```r
fit6 <- coxreg(Surv(enter, exit, event) ~ sex * birthdate,
               data = om)
round(summary(fit6)$coefficients[, 1:2], 5)
```
```
                        coef  exp(coef)
sexfemale            9.12285 9162.24554
birthdate           -0.00327    0.99673
sexfemale:birthdate -0.00519    0.99483
```
What is going on here? The main effect of `sex` is huge! The answer lies in the interpretation of the coefficients:

(i) To evaluate the effect on mortality for an individual with given values of the covariates, we must involve the interaction coefficient. For instance, a female born on January 2, 1801 (`birthdate` = 1801.003) will have the relative risk

\[\exp(9.12285 - 0.00327 \times 1801.003 - 0.00519 \times 1801.003) \approx 0.0022\]

compared to a man with `birthdate` = 0, the reference values! Obviously, it is ridiculous to extrapolate the model 1800 years back in time, but this is nevertheless the correct way to interpret the coefficients. So, at `birthdate` = 0, women's mortality is 9162 times that of men!
(ii) With interactions involved, it is strongly recommended to center continuous covariates. In our example, we could do that by subtracting 1810 (say) from `birthdate`. That makes `sex = "male"`, `birthdate = 1810` the new reference values.
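The centering and refit might look like this (a sketch; the object name `fit7` is our choice, mirroring the call that produced `fit6`):

```r
om$birthdate <- om$birthdate - 1810  # center at 1810
fit7 <- coxreg(Surv(enter, exit, event) ~ sex * birthdate,
               data = om)
round(summary(fit7)$coefficients[, 1:2], 5)
```

The result is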
```
                        coef exp(coef)
sexfemale           -0.26429   0.76775
birthdate           -0.00327   0.99673
sexfemale:birthdate -0.00519   0.99483
```
This is more reasonable-looking: at the beginning of the year 1810, women had a mortality of 76.8 per cent of that of men.
To summarize: the effects can be illustrated by two curves, one for men and one for women. The coefficients for `sex`, 0 and \(-0.26429\), are the respective intercepts, and \(-0.00327\) and \(-0.00846 = -0.00327 - 0.00519\) are the slopes. See Figure 4.3 for a graphical illustration.
FIGURE 4.3: The effect of birthdate on relative mortality for men and women. Reference: Men, birthdate = 1810.
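A figure of this kind can be sketched directly from the estimated coefficients, assuming birthdate is centered at 1810 and ranges over roughly 1775–1821 as in the sampling frame:

```r
bd <- seq(1775, 1821, length.out = 101) - 1810   # centered birthdate
plot(bd + 1810, exp(-0.00327 * bd), type = "l", ylim = c(0.6, 1.2),
     xlab = "Birthdate", ylab = "Relative mortality")   # men
lines(bd + 1810, exp(-0.26429 - 0.00846 * bd), lty = 2) # women
abline(h = 1, lty = 3)  # reference: men, birthdate = 1810
```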
4.3.3 Two continuous covariates
In our example data, the covariates `birthdate` and `imr.birth` are continuous, and we get
```r
fit8 <- coxreg(Surv(enter, exit, event) ~ birthdate * imr.birth,
               data = om)
(res <- round(summary(fit8)$coefficients[, 1:2], 5))
```
```
                        coef exp(coef)
birthdate           -0.03024   0.97021
imr.birth            0.02004   1.02024
birthdate:imr.birth  0.00158   1.00158
```
The interpretation of the coefficients is as follows: the reference point is `birthdate = 0` (1810) and `imr.birth = 0`, and if `birthdate = x` and `imr.birth = y`, then the relative risk is
\[ \exp(-0.03024 x + 0.02004 y + 0.00158 x y),\]
analogous to the calculation in the case with one continuous and one factor covariate.
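As a small sketch, this relative risk can be computed for any combination \((x, y)\) with a hypothetical helper function built from the printed coefficients:

```r
# Relative risk at birthdate = x (centered) and imr.birth = y,
# relative to the reference point (0, 0); coefficients taken from fit8
relrisk <- function(x, y) exp(-0.03024 * x + 0.02004 * y + 0.00158 * x * y)
relrisk(10, 15)  # e.g., birthdate 1820 and an IMR at birth of 15
```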
4.4 Interpretation of Parameter Estimates
In the proportional hazards model, the parameter estimates are logarithms of risk ratios relative to the baseline hazard. The precise interpretations of the coefficients for the two types of covariates are discussed below. The conclusion is that \(e^\beta\) has a more direct and intuitive interpretation than \(\beta\) itself.
4.4.1 Continuous covariate
If \(x\) is a continuous covariate, and \(h(t; x) = h_0(t)e^{\beta x}\), then
\[\begin{equation*} \frac{h(t; x+1)}{h(t; x)} = \frac{h_0(t)e^{\beta (x+1)}}{h_0(t)e^{\beta x}} = e^\beta, \end{equation*}\]

so the risk increases by a factor \(e^\beta\) when \(x\) is increased by one unit. In other words, \(e^\beta\) is a relative risk (or a hazard ratio, which is often the preferred term in certain professions).
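For instance, taking the `birthdate` coefficient \(\beta = -0.006\) from the model `fit5` above:

\[ e^{-0.006} \approx 0.994, \qquad e^{10 \times (-0.006)} = e^{-0.06} \approx 0.942, \]

so each one-year increase in birthdate multiplies the hazard by about 0.994 (a 0.6 per cent decrease), and a ten-year increase multiplies it by about 0.942 (a 5.8 per cent decrease).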
4.4.2 Factor
For a factor covariate, in the usual coding with a reference category, \(e^\beta\) is the relative risk compared to that reference category.
4.5 Model Selection
In regression modeling, there are often several competing models for describing the data. In general, there are no strict rules for "correct selection". However, for nested models there are some formal guidelines; a likelihood ratio comparison is sketched below. For a precise definition of nesting, see Appendix A.
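For nested models fitted with `coxreg`, a likelihood ratio test can be carried out by hand; a sketch assuming, as in the eha package, that the fitted object stores its log likelihoods in the component `loglik` (with the maximized value being the largest):

```r
# Is farmer needed, given sex? Compare two nested Cox models with an LR test.
fit.small <- coxreg(Surv(enter, exit, event) ~ sex, data = om)
fit.large <- coxreg(Surv(enter, exit, event) ~ sex + farmer, data = om)
lrt <- 2 * (max(fit.large$loglik) - max(fit.small$loglik))
pchisq(lrt, df = 1, lower.tail = FALSE)  # p-value; df = 1 added parameter
```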
4.5.1 Model selection in general
Some general advice regarding model selection is given here.
- Remember, there are no true models, only some useful ones. This statement is attributed to G.E.P. Box.
- More than one model may be useful.
- Keep important covariates in the model.
- Avoid automatic stepwise procedures!
- If interaction terms are present, the corresponding main terms must be there. (There are some exceptions to this rule, see for instance the example about infant and maternal mortality in Chapter 10.)