Chapter 3 Cox Regression
We move slowly from the two-sample case, via the \(k\)-sample case, to the general regression model. On this route, we start with the log-rank test and end up with Cox regression and the proportional hazards model (Cox 1972).
3.1 Proportional Hazards
The property of proportional hazards is fundamental in Cox regression. It is in fact the essence of Cox’s simple yet ingenious idea. The definition is as follows:
Definition 3.1 (Proportional hazards) If \(h_1(t)\) and \(h_0(t)\) are hazard functions from two separate distributions, we say that they are proportional if
\[\begin{equation} h_1(t) = \psi h_0(t), \quad \text{for all } t \ge 0, \tag{3.1} \end{equation}\] for some positive constant \(\psi\). Further, if (3.1) holds, then the same property holds for the corresponding cumulative hazard functions \(H_1(t)\) and \(H_0(t)\):
\[\begin{equation} H_1(t) = \psi H_0(t), \quad \text{for all } t \ge 0, \tag{3.2} \end{equation}\] with the same proportionality constant \(\psi\) as in (3.1). \(\Box\)

Strictly speaking, the second part of this definition follows easily from the first (and vice versa), so it would be more correct to state one part as a definition and the other as a corollary. The important part of the definition is “\(\textit{for all } t \ge 0\)”: the constant \(\psi\) does not depend on \(t\). Think of the hazard functions as age-specific mortality for two groups, e.g., women and men. It is “well known” that women have lower mortality than men at all ages, so it would be reasonable to assume proportional hazards in that case. It would mean that the female relative advantage is equally large at all ages. See Example 3.1 for a demonstration of this example of proportional hazards.
This is a continuation of Example ??, but here we involve the male mortality alongside the female, see Figure 3.1. We also expand the calendar time period to the 21st century, as far as we currently have data. The reason for this is the random fluctuations at low ages, caused by the extremely low mortality rates there.
FIGURE 3.1: Female and male mortality by age, Sweden 2001-2019.
As you can see, the “common knowledge” of proportional hazards is disproved by the right hand panel in Figure 3.1. On the other hand, the left hand panel gives a rather different impression, and this illustrates the need to choose good approaches to graphical presentation. \(\Box\)
It must be emphasized that the proportional hazards assumption must always be carefully checked. In many situations, it would not be reasonable to assume proportional hazards. If in doubt, check data by plotting the Nelson-Aalen estimates for each group in the same plot. The left hand panel of Figure 3.1 would suit this purpose better if drawn on a log scale, see Figure 3.2.
FIGURE 3.2: Female and male mortality by age, Sweden 2001-2019. Log scale.
The advantage of the log scale is twofold: (i) Small numbers are magnified so you can see them, and (ii) on the log scale, proportional hazards implies a constant vertical distance between the curves, which is easier to see for the human eye.
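Such a graphical check is easy to sketch with the survival package (the two-group data below are made up purely for illustration):

```r
library(survival)

# made-up two-group data, for illustration only
time  <- c(4, 2, 6, 1, 3.5, 5, 3, 6, 1, 2.5)
event <- c(1, 0, 1, 1, 0, 1, 1, 0, 1, 0)
group <- rep(c("A", "B"), each = 5)

fit <- survfit(Surv(time, event) ~ group)
# cumulative hazards for both groups in one plot, log scale on the y axis
plot(fit, fun = "cumhaz", log = "y", lty = 1:2,
     xlab = "Time", ylab = "Cumulative hazard")
```

Under proportional hazards, the two curves should then appear roughly parallel.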
For an example of a perfect fit to the proportional hazards model, see Figure 3.3 (two Weibull hazard functions with the proportionality constant \(\psi = 2\)).
FIGURE 3.3: Two hazard functions that are proportional. The proportionality constant is 2.
The proportionality is often difficult to judge by eye, so in order to make it easier to see, the plot can be made on a log scale as in the right hand panel of Figure 3.3. Note that both dimensions are on a log scale. This type of plot, constructed from empirical data, is called a Weibull plot in reliability applications: If the lines are straight lines, then data are well fitted by a Weibull distribution. Additionally, if the slope of the line is 1 (45 degrees), then an exponential model fits well.
To summarize Figure 3.3: (i) The hazard functions are proportional because on the log-log scale, the vertical distance is constant, (ii) Both hazard functions represent a Weibull distribution, because both lines are straight lines, and (iii) neither represents an exponential distribution, because the slopes are not one. This latter fact may be difficult to see because of the different scales on the axes (the aspect ratio is not one).
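A figure like Figure 3.3 is straightforward to construct in principle; the following sketch (with made-up Weibull parameters) draws two proportional cumulative hazards on a log-log scale and verifies the constant ratio numerically:

```r
# two Weibull cumulative hazards with proportionality constant psi = 2;
# shape and scale are made-up values for illustration
t <- seq(0.1, 10, by = 0.1)
shape <- 1.5
scale <- 5
H0 <- (t / scale)^shape   # Weibull cumulative hazard
H1 <- 2 * H0              # proportional, with psi = 2

# on the log-log scale both curves are straight lines with slope = shape,
# and the vertical distance between them is the constant log(2)
plot(log(t), log(H0), type = "l", xlab = "log(t)", ylab = "log(H)")
lines(log(t), log(H1), lty = 2)

stopifnot(all(abs(H1 / H0 - 2) < 1e-12))  # proportionality holds for all t
```

A slope of exactly 1 on this plot would correspond to the exponential special case (shape = 1).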
Figure 3.4 shows the relationships between the cumulative hazards functions, the density functions, and the survival functions when the hazard functions are proportional. Note that the cumulative hazards functions are proportional by implication, with the same proportionality constant (\(\psi = 2\) in this case). On the other hand, for the density and survival functions, proportionality does not hold; it is in fact theoretically impossible except in the trivial case that the proportionality constant is unity.
FIGURE 3.4: The effect of proportional hazards on the density and survival functions.
3.2 The Log-Rank Test
The log-rank test is a \(k\)-sample test of equality of survival functions. We first look at the two-sample case, that is, \(k = 2\).
Suppose that we have the small data set illustrated in
Figure 3.5. There are two samples, the letters (A, B,
C, D, E) and the numbers (1, 2, 3, 4, 5).
FIGURE 3.5: Two-sample data, the letters (dashed) and the numbers (solid). Circles denote censored observations, plusses events.
The data in Figure 3.5 can be presented in tabular form, see Table 3.1.
| group | time | event |
|---|---|---|
| numbers | 4.0 | TRUE |
| numbers | 2.0 | FALSE |
| numbers | 6.0 | TRUE |
| numbers | 1.0 | TRUE |
| numbers | 3.5 | FALSE |
| letters | 5.0 | TRUE |
| letters | 3.0 | TRUE |
| letters | 6.0 | FALSE |
| letters | 1.0 | TRUE |
| letters | 2.5 | FALSE |
We are interested in investigating whether letters and numbers
have the same survival chances or not. Therefore, the hypothesis
\[\begin{equation*} H_0: \text{No difference in survival between numbers and letters} \end{equation*}\] is formulated. In order to test \(H_0\), we make five tables, one for each observed event time, see Table 3.2, where the first table, relating to failure time \(t_{(1)} = 1\), is shown.
| Deaths | Survivals | Total | |
|---|---|---|---|
| numbers | 1 | 4 | 5 |
| letters | 1 | 4 | 5 |
| Total | 2 | 8 | 10 |
Let us look at the table at failure time \(t_{(1)} = 1\), Table 3.2,
from the viewpoint of the numbers.
- The observed number of deaths among numbers: \(1\).
- The expected number of deaths among numbers: \(2 \times 5 / 10 = 1\).
The expected number is calculated under \(H_0\), i.e., as if there is
no difference between letters and numbers regarding mortality. It is further
assumed that the two margins (Total) are given (fixed).
Then, given that there are two deaths in total and that five of the ten individuals at risk belong to the group numbers,
the expected number of deaths is calculated as above.
This procedure is repeated for each of the five tables, and the results are summarized in Table 3.3.
| Observed | Expected | Difference | Variance | |
|---|---|---|---|---|
| t(1) | 1 | 1.0 | 0.0 | 0.44 |
| t(2) | 0 | 0.5 | -0.5 | 0.25 |
| t(3) | 1 | 0.5 | 0.5 | 0.25 |
| t(4) | 0 | 0.3 | -0.3 | 0.22 |
| t(5) | 1 | 0.5 | 0.5 | 0.25 |
| Sum | 3 | 2.8 | 0.2 | 1.41 |
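The calculations behind Table 3.3 can be reproduced in a few lines of base R. The vectors below encode the data of Table 3.1; note that the unrounded sums give a test statistic of about 0.020, while the rounded values displayed in the table give 0.028:

```r
# the toy data of Table 3.1
time    <- c(4, 2, 6, 1, 3.5, 5, 3, 6, 1, 2.5)
event   <- c(1, 0, 1, 1, 0, 1, 1, 0, 1, 0)
numbers <- rep(c(TRUE, FALSE), each = 5)  # TRUE = group "numbers"

taus <- sort(unique(time[event == 1]))    # the five observed event times
obs <- expe <- v <- numeric(length(taus))
for (i in seq_along(taus)) {
  atrisk <- time >= taus[i]               # the risk set at this event time
  n  <- sum(atrisk)                       # total number at risk
  n1 <- sum(atrisk & numbers)             # at risk in group "numbers"
  d  <- sum(event == 1 & time == taus[i]) # total deaths at this time
  obs[i]  <- sum(event == 1 & time == taus[i] & numbers)
  expe[i] <- d * n1 / n                   # expected deaths, group "numbers"
  # hypergeometric variance of the number of deaths in group "numbers"
  v[i]    <- d * (n - d) * n1 * (n - n1) / (n^2 * (n - 1))
}
Tstat <- sum(obs - expe)^2 / sum(v)       # the log-rank statistic
```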
Finally, the observed test statistic \(T\) is calculated as
\[\begin{equation*}
T = \frac{0.2^2}{1.41} \approx 0.028
\end{equation*}\]
Under the null hypothesis, this is an observed value from a
\(\chi^2(1)\)
distribution, and \(H_0\) should be rejected for large values of
\(T\). Using a level of significance of 5%, the critical value for
\(T\) is 3.84, far from our observed value of 0.028. The
conclusion is
therefore that there is no (statistically significant) difference in
survival chances between letters and numbers.
Note, however, that this result depends on
asymptotic (large sample)
properties, and
in this toy example, these properties are not valid.
For more detail about the underlying theory, see Appendix A.
Let us now look at a real data example, the old age mortality data set
oldmort in eha. See Table 3.4 for a sample of five records with selected columns.
| id | enter | exit | event | sex | civ |
|---|---|---|---|---|---|
| 793001208 | 66.498 | 67.988 | 0 | male | married |
| 793001208 | 67.988 | 72.820 | 0 | male | married |
| 793001208 | 72.820 | 75.542 | 1 | male | widow |
| 793001209 | 66.446 | 76.568 | 1 | female | married |
| 793001210 | 66.446 | 67.936 | 0 | female | married |
We are interested in comparing male and female mortality in the ages 60–85 with a logrank test, and for that purpose we use the logrank function:
The log-rank test
Call:
logrank(Y = Surv(enter, exit, event), group = sex, data = om)
X-squared = 19.1971 , df = 1 , p-value = 1.1789e-05
The result of the logrank test is given by the \(p\)-value \(1.18 \times 10^{-5}\), a very small number. Thus, there is a highly significant difference in mortality between men and women. But how large is the difference? The log-rank test has no answer to this question.
Remember that this result depends on the proportional hazards assumption. We can check it graphically, see Figure 3.6.
FIGURE 3.6: Cumulative hazards for women and men in ages 60-85, 19th century Sweden.
The proportionality assumption seems to be a good description from 60 to 85 years of age, but it is more doubtful in the very high ages (not shown). One reason for this may be that the high-age estimates are based on few observations (most of the individuals in the sample died earlier), so random fluctuations have a large impact in the high ages, but it is also an often observed phenomenon that differences in mortality between groups tend to vanish in the extremely high ages.
3.2.1 Several samples
The result for the two-sample case is easily extended to the \(k\)-sample
case. Instead of one \(2 \times 2\) table per observed event time we get one
\(k\times 2\) table per observed event time and we have to calculate expected
and observed numbers of events for \((k-1)\) groups at each failure time. The
resulting test statistic will have \((k-1)\) degrees of freedom and still be
approximately \(\chi^2\) distributed. This is illustrated with the same data
set, oldmort, as above, but with the covariate civ,
which is a factor with three levels (unmarried, married, widow), instead of sex.
Furthermore, the investigation is limited to male mortality.
The log-rank test
Call:
logrank(Y = Surv(enter, exit, event), group = civ, data = om[om$sex ==
"male", ])
X-squared = 19.8622 , df = 2 , p-value = 4.86372e-05
The degrees of freedom for the score test
are now 2, equal to the number
of levels in civ
minus one. Being unmarried seems to have a great impact on
old age mortality. It is, however, recommended to check the proportionality
assumption graphically, see Figure 3.7.
FIGURE 3.7: Old age male mortality by civil status, cumulative hazards.
There is obviously nothing much that indicates non-proportionality in this case either. Furthermore, the ordering is what one would expect.
We do not go deeper into this matter here, mainly because the logrank test is a special case of Cox regression, which will be described in detail later in this chapter.
3.3 Cox Regression Models
Starting with the definition of proportional hazards in Section 3.1, the concept of Cox regression is introduced in steps.
3.3.1 Two Groups
The definition of proportionality, given in equation (3.1), can equivalently be written as
\[\begin{equation} h_x(t) = \psi^x h_0(t), \quad t > 0, \; x = 0, 1, \; \psi > 0. \tag{3.3} \end{equation}\]
It is easy to see that this “trick” is equivalent to (3.1): When \(x = 0\) it simply says that \(h_0(t) = h_0(t)\), and when \(x = 1\) it says that \(h_1(t) = \psi h_0(t)\). Since \(\psi > 0\), we can calculate \(\beta = \log(\psi)\), and rewrite (3.3) as
\[\begin{equation*} h_x(t) = e^{\beta x} h_0(t), \quad t > 0; \; x = 0, 1; \; -\infty < \beta < \infty, \end{equation*}\] or with a slight change in notation,
\[\begin{equation} h(t; x) = e^{\beta x} h_0(t), \quad t > 0; \; x = 0, 1; \; -\infty < \beta < \infty. \tag{3.4} \end{equation}\]
The sole idea behind this rewriting is to pave the way for the introduction of Cox’s regression model (Cox 1972), which in its elementary form is a proportional hazards model. In fact, we can already interpret equation (3.4) as a Cox regression model with an explanatory variable \(x\) with corresponding regression coefficient \(\beta\) (to be estimated from data). The covariate \(x\) is still only a dichotomous variate, but we will now show how it is possible to generalize this to a situation with explanatory variables of any form. The first step is to go from the two-sample situation to a \(k\)-sample one.
3.3.2 Many Groups
Can we generalize to \((k + 1)\) groups, \(k \ge 2\)? Yes, by expanding the procedure in the previous subsection:
\[\begin{eqnarray*} h_0(t) &\sim& \mbox{group 0} \\ h_1(t) &\sim& \mbox{group 1} \\ \cdots & & \cdots \\ h_k(t) &\sim& \mbox{group k} \end{eqnarray*}\]
The underlying model is: \(h_j(t) = \psi_j h_0(t), \quad t \ge 0\), \(j = 1, 2, \ldots, k\). That is, with \((k+1)\) groups, we need \(k\) proportionality constants \(\psi_1, \ldots, \psi_k\) in order to define proportional hazards. Note also that in this formulation (there are others), one group is “marked” as a reference group, that is, to this group there is no proportionality constant attached. All relations are relative to the reference group. Note also that it doesn’t matter which group is chosen as the reference. This choice does not change the model itself, only its representation.
With \((k + 1)\) groups, we need \(k\) indicators. Let
\[\begin{equation*} \mathbf{x} = (x_{1}, x_{2}, \ldots, x_{k}). \end{equation*}\] Then
\[\begin{eqnarray*} \mathbf{x} = (0, 0, \ldots, 0) & \Rightarrow & \mbox{group 0} \\ \mathbf{x} = (1, 0, \ldots, 0) & \Rightarrow & \mbox{group 1} \\ \mathbf{x} = (0, 1, \ldots, 0) & \Rightarrow & \mbox{group 2} \\ \cdots & & \cdots \\ \mathbf{x} = (0, 0, \ldots, 1) & \Rightarrow & \mbox{group k} \end{eqnarray*}\]
and
\[\begin{equation*} h(t; \mathbf{x}) = h_0(t) \prod_{\ell = 1}^k \psi_{\ell}^{x_{\ell}} = \left\{ \begin{array}{ll} h_0(t), & \mathbf{x} = (0, 0, \ldots, 0) \\ h_0(t) \psi_j, & x_{j} = 1, \quad j = 1, \ldots k \end{array}\right. \end{equation*}\] With \(\psi_j = e^{\beta_j}, \; j = 1, \ldots, k\), we get
\[\begin{equation} h(t; \mathbf{x}) = h_0(t) e^{x_{1}\beta_1 + x_{2} \beta_2 + \cdots + x_{k} \beta_k} = h_0(t) e^{\mathbf{x} \boldsymbol{\beta}}, \tag{3.5} \end{equation}\]
where \(\boldsymbol{\beta} = (\beta_1, \beta_2, \ldots, \beta_k)\).
3.3.3 The General Cox Regression Model
We may now generalize (3.5) by letting the components of \(\mathbf{x}_i\) take any value. Let data and model take the following form:
Data:
\[\begin{equation} (t_{i0}, t_i, d_i, \mathbf{x}_i), \; i = 1, \ldots, n, \end{equation}\]
where \(t_{i0}\) is the left truncation time point (if \(t_{i0} = 0\) for all \(i\), then this variable may be omitted), \(t_i\) is the end time point, \(d_i\) is the “event indicator” (\(1\) or TRUE if event, else \(0\) or FALSE), and \(\mathbf{x}_i\) is a vector of explanatory variables.
Model:
\[\begin{equation} h(t; \mathbf{x}_i) = h_0(t) e^{\mathbf{x}_i \boldsymbol{\beta}}, \quad t > 0. \tag{3.6} \end{equation}\]
This is a regression model where the response variable is \((t_0, t, d)\) (we will call it a survival object) and the explanatory variable is \(\mathbf{x}\), possibly (often) vector valued.
In (3.6) there are two components to estimate, the regression coefficients \(\boldsymbol{\beta}\), and the baseline hazard function \(h_0(t), \; t > 0\). For the former task, the partial likelihood (Cox 1975) is used. See Appendix A for a brief summary.
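As a minimal sketch of fitting model (3.6) in R (here using coxph from the survival package rather than eha, applied to the toy letters/numbers data of Section 3.2):

```r
library(survival)

# the toy letters/numbers data from Section 3.2
dat <- data.frame(
  time  = c(4, 2, 6, 1, 3.5, 5, 3, 6, 1, 2.5),
  event = c(1, 0, 1, 1, 0, 1, 1, 0, 1, 0),
  group = rep(c("numbers", "letters"), each = 5)
)

# partial likelihood estimation of beta; group is an indicator covariate
fit <- coxph(Surv(time, event) ~ group, data = dat)
coef(fit)       # the estimated regression coefficient (log hazard ratio)
exp(coef(fit))  # the estimated proportionality constant psi
```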
3.4 Estimation of the Baseline Cumulative Hazard Function
The usual estimator (continuous time) of the baseline cumulative hazard function is
\[\begin{equation} \hat{H}_0(t) = \sum_{j:t_j \le t} \frac{d_j}{\sum_{m \in R_j} e^{\mathbf{x}_m \hat{\boldsymbol{\beta}}}}, \tag{3.7} \end{equation}\]
where \(d_j\) is the number of events at \(t_j\) and \(\hat{\boldsymbol{\beta}}\) is the result of maximizing the partial likelihood. Note that if \(\hat{\boldsymbol{\beta}} = 0\), this reduces to
\[\begin{equation} \hat{H}_0(t) = \sum_{j:t_j \le t} \frac{d_j}{n_j}, \tag{3.8} \end{equation}\]
the Nelson-Aalen estimator. In (3.8), \(n_j\) is the size of \(R_j\).
In the R package eha, the baseline hazard is estimated at the value zero of all covariates (and at the reference category for a factor covariate). This differs from the practice in some other packages, where an average of the covariate values is chosen as reference. That is generally a bad habit, because the reference value will then vary as subsets of the original data set are taken. Instead, the practitioner must exercise judgment in choosing relevant reference values for (continuous) covariates. There are two instances where these choices make a difference: (i) when interactions are present, main effect estimates will vary with the choice (more about this later), and (ii) in the estimation of baseline hazards and survival functions.
In order to calculate the cumulative hazards function for an individual with a specific covariate vector \(\mathbf{x}\), use the formula
\[\begin{equation*} \hat{H}(t; \mathbf{x}) = \hat{H}_0(t) e^{\mathbf{x}\hat{\boldsymbol{\beta}}}. \end{equation*}\]
The corresponding survival functions may be estimated by the relation
\[\begin{equation*} \hat{S}(t; \mathbf{x}) = \exp\bigl(-\hat{H}(t; \mathbf{x})\bigr) \end{equation*}\]
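As a small numerical sketch of these two formulas (all numbers below are hypothetical, chosen only for illustration):

```r
# hypothetical estimates, for illustration only
H0_t <- 0.25               # baseline cumulative hazard at some time t
x    <- c(1, 0.5)          # a covariate vector
beta <- c(-0.2, 0.1)       # estimated regression coefficients

H_t <- H0_t * exp(sum(x * beta))  # estimated H(t; x)
S_t <- exp(-H_t)                  # estimated S(t; x)
```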
It is also possible to use the terms in the sum (3.7) to build an estimator analogous to the Kaplan-Meier estimator (??). In practice, there is no big difference between the two methods.
3.5 Explanatory Variables
Explanatory variables, or covariates, may be of
essentially two
different types: continuous and discrete. The discrete type
usually takes only a finite number of distinct values and is called a
factor in R. A special case of a factor is one that takes
only two distinct values, say 0 and 1. Such a factor is called an
indicator, because we can let the value 1 indicate the presence of a
certain property and 0 denote its absence. To summarize, there are:
- Covariate: taking values in an interval (e.g., age, blood pressure).
- Factor: taking a finite number of values (e.g., civil status, occupation).
- Indicator: a factor taking two values (e.g., gender).
3.5.1 Continuous Covariates
We use the qualifier continuous to stress that factors are excluded, because often the term covariate is used as a synonym for explanatory variable.
Values taken by a continuous covariate are ordered. The effect on the response is by model definition ordered in the same or reverse order. On the other hand, values taken by a factor are unordered (but may be defined as ordered in R).
3.5.2 Factor Covariates
An explanatory variable that can take only a finite (usually small) number of distinct values is called a categorical variable. In R language, it is called a factor. Examples of such variables are gender, socio-economic status, birth place. Students of statistics have long been taught to create dummy variables in such situations, in the following way:
- Given a categorical variable \(F\) with \((k+1)\) levels \((f_0, f_1, f_2, \ldots, f_k)\),
- Create \(k\) indicator (“dummy”) variables \((I_1, I_2, \ldots, I_k)\).
The level \(f_0\) is the reference category, characterized by all indicator variables being zero for an individual at this level. Generally, for level \(f_i, \; i = 1, \ldots, k\), the indicator variable \(I_i\) is one and the rest are zero. In other words, for a single individual, at most one indicator is one, and the rest are zero.
In R, there is no need to explicitly create dummy variables, it is done
behind the scenes by the functions
factor and
as.factor.
Note that a factor with two levels, i.e., an indicator variable, can always be treated as a continuous covariate, if coded numerically (e.g., 0 and 1).
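The dummy coding can be inspected with model.matrix; a small sketch using the civ levels from the oldmort example:

```r
civ <- factor(c("unmarried", "married", "widow", "married"),
              levels = c("unmarried", "married", "widow"))
# one intercept column plus k = 2 indicator columns; the first level
# ("unmarried") is the reference and gets no column of its own
model.matrix(~ civ)
```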
Consider a demographic example, the influence of mother’s age (a continuous covariate) on infant mortality. It is considered well-known that a young mother means high risk for the infant, and also that an old mother means high risk, compared to “in-between-aged” mothers. So the risk order is not the same (or reverse) as the age order.
One solution (not necessarily the best) to this problem is to factorize: Let, for instance,
\[\begin{equation*} \mbox{mother's age} = \left\{\begin{array}{ll} \mbox{low}, & 15 < \mbox{age} \le 25 \\ \mbox{middle}, & 25 < \mbox{age} \le 35 \\ \mbox{high}, & 35 < \mbox{age} \end{array} \right. \end{equation*}\]
In this layout, there will be two parameters measuring the deviation from the reference category, which will be the first category by default.
In R, this is easily achieved with the aid of the
cut function. It
works like this:
(15,25] (25,35] (35,48]
16 64 20
Note that the created intervals by default are closed to the right and open
to the left. This has consequences for how observations exactly on a
boundary are treated: they belong to the lower-valued interval. The
argument right in the call to cut can be used to switch this
behaviour the other way around.
Note further that values falling below the smallest value (15 in our
example) or above the largest value (48) are reported as missing values
(NA in R terminology, Not Available).
For further information about the use of the cut function, see the
help page. \(\Box\)
3.6 Interactions
The meaning of interaction between two explanatory variables is described by an example, walking through the three possible combinations of covariate types.
We return to the oldmort data set, slightly modified, see Table 3.5,
where the first five of a total of 6495 rows are shown.
| birthdate | enter | exit | event | sex | farmer | imr.birth |
|---|---|---|---|---|---|---|
| 1775.286 | 84.714 | 85 | 0 | female | no | 14.20722 |
| 1775.092 | 84.908 | 85 | 0 | female | yes | 11.97183 |
| 1775.374 | 84.626 | 85 | 0 | female | no | 16.93548 |
| 1775.859 | 84.141 | 85 | 0 | male | yes | 16.93548 |
| 1775.111 | 84.889 | 85 | 0 | female | no | 12.70903 |
The sampling frame is a rectangle in the Lexis diagram, see Figure 3.8. It can be described as all persons born between January 1, 1775 and January 1, 1881, and present in the solid rectangle at some moment.
FIGURE 3.8: Old age mortality sampling frame, Lexis diagram.
There are four possible explanatory variables behind survival after age 60 in the example. \[\begin{eqnarray*} \mbox{farmer} & = & \left\{ \begin{array}{ll} \mbox{no} & \\ \mbox{yes} & \left(e^{\alpha}\right) \end{array} \right. \\ \mbox{sex} & = & \left\{ \begin{array}{ll} \mbox{male} & \\ \mbox{female} & \left(e^{\beta}\right) \end{array} \right. \\ \mbox{birthdate } (x_1) & & \mspace{90mu} \left(e^{\gamma_1 x_1}\right) \\ \mbox{IMR at birth } (x_2) & & \mspace{90mu} \left(e^{\gamma_2 x_2}\right) \\ \end{eqnarray*}\] The first two are factors, and the last two are continuous covariates. We go through the three distinct combinations and illustrate the meaning of interaction in each. \(\Box\)
3.6.1 Two factors
Assume that we, in the oldmort example, only have the factors sex and farmer at our
disposal as explanatory variables. We may fit a Cox regression model like this:
Covariate Mean Coef Rel.Risk S.E. LR p
sex 0.000
male 0.407 0 1 (reference)
female 0.593 -0.232 0.793 0.048
farmer 0.025
no 0.772 0 1 (reference)
yes 0.228 -0.127 0.880 0.058
Events 1823
Total time at risk 37244
Max. log. likelihood -12956
LR test statistic 23.97
Degrees of freedom 2
Overall p-value 6.25076e-06
The output from coxreg (the model fit) is saved in the object fit, which is investigated,
first by the function summary, which here performs successive \(\chi^2\) tests of the significance of
the estimated regression coefficients (the null hypothesis in each case is that the true coefficient
is zero).
The \(p\)-value corresponding to the coefficient for sex is very small, meaning that there
is a highly significant difference in mortality between women and men. The difference in mortality
between farmers and non-farmers is also statistically significant, but with a larger \(p\)-value.
Next, we want to see the size of the difference in mortality between women and men, and between farmers and non-farmers, so look at the parameter estimates, especially in the column “Rel. Risk”. It tells us that female mortality is 79.3 per cent of the male mortality, and that the farmer mortality is 88.0 per cent of the non-farmer mortality. Furthermore, this model is multiplicative (additive on the log scale), so we can conclude that the mortality of a female farmer is \(0.793 \times 0.880 \times 100\) = 69.8 per cent of that of a male non-farmer.
We can also illustrate the sex difference graphically by a stratified analysis, see Figure 3.9.
fit2 <- coxreg(Surv(enter, exit, event) ~ strata(sex) + farmer,
data = om)
plot(fit2, fun = "cumhaz", xlim = c(60, 85),
lty = 1:2, xlab = "Age")
abline(h = 0)
FIGURE 3.9: Cumulative hazards for men and women.
Obviously the proportionality assumption is well met.
An obvious question to ask is: Is the sex difference in mortality the same among farmers as among non-farmers? In other words: Is there an interaction between sex and farmer/non-farmer regarding the effect on mortality?
In order to test for an interaction in R, the plus (\(+\)) sign in the formula is changed to a multiplication sign (\(*\)):
Single term deletions
Model:
Surv(enter, exit, event) ~ sex * farmer
Df AIC LRT Pr(>Chi)
<none> 25912
sex:farmer 1 25916 6.4 0.011
Covariate Mean Coef Rel.Risk S.E. Wald p
sex
male 0.407 0 1 (reference)
female 0.593 -0.297 0.743 0.054 0.000
farmer
no 0.772 0 1 (reference)
yes 0.228 -0.252 0.777 0.076 0.001
sex:farmer
female:yes 0.291 1.338 0.114 0.011
Events 1823
Total time at risk 37244
Max. log. likelihood -12953
LR test statistic 30.37
Degrees of freedom 3
Overall p-value 1.15606e-06
The small \(p\)-value (1.1 per cent) indicates that there is a statistically significant (at the 5 per cent level) interaction effect. The size of the interaction effect is seen in the table of estimated coefficients (rounded to three decimal places):
coef exp(coef)
sexfemale -0.297 0.743
farmeryes -0.252 0.777
sexfemale:farmeryes 0.291 1.338
Interpretation: Among non-farmers (the reference category), the mortality of females is \(100 \times 0.743 = 74.3\) per cent of the mortality of males. However, moving to the farmer group, this ratio must be multiplied by \(1.338\), the interaction effect. The result is that among farmers, the female mortality is \(100 \times 0.743 \times 1.338 \approx 99.4\) per cent of the male mortality.
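Under the model, the two sex ratios follow directly from the coefficients above:

```r
b_sex <- -0.297  # coefficient for sexfemale
b_int <-  0.291  # coefficient for sexfemale:farmeryes

exp(b_sex)          # female/male hazard ratio among non-farmers, about 0.743
exp(b_sex + b_int)  # female/male hazard ratio among farmers, about 0.994
```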
In cases like this, it is often recommended to perform separate analyses for women and men, or, alternatively, for farmers and non-farmers. Do that as an exercise and interpret your results.
3.6.2 One factor and one continuous covariate
Now the covariates sex (factor) and birthdate (continuous) are in focus. First the model without interaction is fitted.
Single term deletions
Model:
Surv(enter, exit, event) ~ sex + birthdate
Df AIC LRT Pr(>Chi)
<none> 25919
sex 1 25936 18.9812 1.32e-05
birthdate 1 25919 2.2364 0.1348
Obviously sex is statistically significant, while birthdate is not. We go directly to the estimated coefficients after adding an interaction term to the model:
fit6 <- coxph(Surv(enter, exit, event) ~ sex * birthdate, data = om)
round(summary(fit6)$coefficients[, 1:2], 5)
                       coef exp(coef)
sexfemale 9.12285 9162.24554
birthdate -0.00327 0.99673
sexfemale:birthdate -0.00519 0.99483
What is going on here? The main effect of sex is huge! The answer lies in the interpretation of the coefficients:
(i) To evaluate the effect on mortality for an
individual with given values of the covariates, you must involve the interaction coefficient. For instance, a female born on January 2, 1801 (birthdate = 1801.003)
will have the relative risk
\[\exp(9.12285 - 0.00327 \times 1801.003 - 0.00519 \times 1801.003) \approx 0.0022\]
compared to a man with birthdate = 0, the reference values! Obviously, it is ridiculous to extrapolate the model 1800 years back in time, but this is nevertheless the correct way to interpret the coefficients. So, at birthdate = 0, women’s mortality is 9162 times that of men!
(ii) With interactions involved, it is strongly recommended to center continuous covariates. In our example, we could do that by subtracting 1810 (say) from birthdate. That makes sex = "male", birthdate = 1810 the new reference values. The result is
coef exp(coef)
sexfemale -0.26429 0.76775
birthdate -0.00327 0.99673
sexfemale:birthdate -0.00519 0.99483
This looks more reasonable: At the beginning of the year 1810, women had a mortality of 76.8 per cent of that of men.
To summarize: The effects can be illustrated by two curves, one for men and one for women. The coefficients for sex, 0 and -0.26429, are the respective intercepts, and -0.00327 and
-0.00846 are the slopes. See Figure 3.10 for a graphical illustration.
FIGURE 3.10: The effect of birthdate on relative mortality for men and women. Reference: Men, birthdate = 1810.
3.6.3 Two continuous covariates
In our example data, the covariates birthdate and imr.birth are continuous, and we get
fit8 <- coxph(Surv(enter, exit, event) ~ birthdate * imr.birth, data = om)
(res <- round(summary(fit8)$coefficients[, 1:2], 5))
                       coef exp(coef)
birthdate -0.03024 0.97021
imr.birth 0.02004 1.02024
birthdate:imr.birth 0.00158 1.00158
The interpretation of the coefficients is: The reference point is birthdate = 0 (the year 1810, after centering) and imr.birth = 0, and if birthdate = x and imr.birth = y, then the relative risk is
\[ \exp(-0.03024 x + 0.02004 y + 0.00158 x y),\]
analogous to the calculation in the case with one continuous and one factor covariate.
3.7 Interpretation of parameter estimates
In the proportional hazards model the parameter estimates are logarithms of risk ratios relative to the baseline hazard. The precise interpretations of the coefficients for the two types of covariates are discussed. The conclusion is that \(e^\beta\) has a more direct and intuitive interpretation than \(\beta\) itself.
3.7.1 Continuous covariate
If \(x\) is a continuous covariate, and \(h(t; x) = h_0(t)e^{\beta x}\), then
\[\begin{equation*} \frac{h(t; x+1)}{h(t; x)} = \frac{h_0(t)e^{\beta (x+1)}} {h_0(t)e^{\beta x}} = e^\beta, \end{equation*}\] so the risk increases by a factor \(e^\beta\) when \(x\) is increased by one unit. In other words, \(e^\beta\) is a relative risk (or a hazard ratio, which is often the preferred term in certain professions).
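For example, with a made-up coefficient \(\beta = 0.05\):

```r
beta <- 0.05             # a hypothetical regression coefficient
exp(beta)                # hazard ratio per one-unit increase in x
100 * (exp(beta) - 1)    # per cent increase in risk, about 5.1
```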
3.7.2 Factor
For a factor covariate, in the usual coding with a reference category, \(e^\beta\) is the relative risk compared to that reference category.
3.8 Proportional hazards in discrete time
In discrete time, the hazard function is, as we saw earlier, a set of conditional probabilities, and so its range is restricted to the interval \((0, 1)\). Therefore, the definition of proportional hazards used for continuous time is impractical; multiplying a probability by a constant may result in a quantity larger than one.
One way of introducing proportional hazards in discrete time is to regard the discreteness as a result of grouping truly continuous time data, for which the proportional hazards assumption holds. For instance, in a follow-up study of human mortality, we may only have data recorded once a year, so life length can only be measured in years. Thus, we assume that there is a true exact life length \(T\), but we can only observe that it falls in an interval \((t_i, t_{i+1})\).
Assume continuous proportional hazards, and a partition of time:
\[ 0 = t_0 < t_1 < t_2 < \cdots < t_k = \infty. \] Then \[\begin{multline}\label{eq:dischaz} P(t_{j-1} \le T < t_j \mid T \ge t_{j-1}; \; \mathbf{x}) = \frac{S(t_{j-1}\mid \mathbf{x}) - S(t_j\mid \mathbf{x})}{S(t_{j-1} \mid \mathbf{x})} \\ = 1 - \frac{S(t_j \mid \mathbf{x})}{S(t_{j-1} \mid \mathbf{x})} = 1 - \left(\frac{S_0(t_j)}{S_0(t_{j-1})}\right)^{\exp(\boldsymbol{\beta}\mathbf{x})} \\ = 1 - (1 - h_j)^{\exp(\boldsymbol{\beta}\mathbf{x})} \end{multline}\] with \(h_j = P(t_{j-1} \le T < t_j \mid T \ge t_{j-1}; \; \mathbf{x} = \mathbf{0})\), \(j = 1, \ldots, k\). We take this relation as the definition of proportional hazards in discrete time.
3.8.1 Logistic regression
It turns out that a proportional hazards model in discrete time, according to this definition, is nothing else than a logistic regression model with the cloglog link (cloglog is short for “complementary log-log”; the link function is \(\log(-\log(1-p))\)). In order to see that, let
\[\begin{equation} (1 - h_j) = \exp(-\exp(\alpha_j)), \; j = 1, \ldots, k \end{equation}\] and
\[\begin{equation*} X_j = \left\{ \begin{array}{ll} 1, & t_{j-1} \le T < t_j \\ 0, & \text{otherwise} \end{array} \right., \quad j = 1, \ldots, k. \end{equation*}\] Then
\[\begin{equation*} \begin{split} P(X_1 = 1; \; \mathbf{x}) &= 1 - \exp\big(-\exp(\alpha_1 + \boldsymbol{\beta}\mathbf{x})\big) \\ P(X_j = 1 \mid X_1 = \cdots = X_{j-1} = 0; \; \mathbf{x}) &= 1 - \exp\big(-\exp(\alpha_j + \boldsymbol{\beta}\mathbf{x})\big), \\ \quad j = 2, \ldots, k. \end{split} \end{equation*}\] This is logistic regression with a cloglog link. Note that extra parameters \(\alpha_1, \ldots, \alpha_k\) are introduced, one for each potential event age. They correspond to the baseline hazard function in continuous time, but are estimated simultaneously with the regression parameters.
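In R, such a model can be fitted with glm. A minimal sketch on simulated data follows; the data layout, one record per individual and interval at risk, and all numeric values are assumptions of the illustration:

```r
## Discrete-time proportional hazards via logistic regression with a
## cloglog link. Data are simulated for illustration only.
set.seed(1)
sim <- data.frame(
  interval = factor(rep(1:3, each = 40)),   # gives the alpha_j's
  x        = rep(0:1, 60),                  # a single covariate
  event    = rbinom(120, size = 1, prob = 0.3)
)
fit <- glm(event ~ interval + x,
           family = binomial(link = "cloglog"), data = sim)
coef(fit)   # '(Intercept)' and 'interval' terms estimate the alpha_j's;
            # 'x' estimates beta
```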
3.9 Model selection
In regression modeling, there are often several competing models for describing data. In general, there are no strict rules for “correct selection”. However, for nested models, there are some formal guidelines. For a precise definition of this concept, see Appendix A.
3.9.1 Model selection in general
Some general advice regarding model selection is given here.
- Remember, there are no true models, only some useful ones. This statement is attributed to G.E.P. Box.
- More than one model may be useful.
- Keep important covariates in the model.
- Avoid automatic stepwise procedures!
- If interaction effects are present, the corresponding main effects must be there.
3.10 Doing it in R
We utilize the male mortality data, Skellefteå 1800–1820, to illustrate the aspects of an ordinary Cox regression. Males aged 20 (exact) are sampled between 1820 and 1840 and followed to death or to reaching the age of 40, when they are right censored. So in effect male mortality in the ages 20–40 is studied. The survival time starts counting at zero for each individual at his twentieth birthday (that is, at exact age 20 years).
There are two R functions that are handy for a quick look at a data frame, str and head. The first gives a summary description of the structure of an R object, often a data frame:
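The listing below comes from a call like the following (assuming, as in the eha package, that the data frame is named mort):

```r
## Load the male mortality data and inspect its structure; the data
## frame 'mort' ships with the eha package.
library(eha)
data(mort)
str(mort)
```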
'data.frame': 1208 obs. of 6 variables:
$ id : int 1 2 3 3 4 5 5 6 7 7 ...
$ enter : num 0 3.48 0 13.46 0 ...
$ exit : num 20 17.6 13.5 20 20 ...
$ event : int 0 1 0 0 0 0 0 0 0 1 ...
$ birthdate: num 1800 1800 1800 1800 1800 ...
$ ses : Factor w/ 2 levels "lower","upper": 2 1 2 1 1 1 2 2 2 1 ...
First, there is information about the data frame: it is a data.frame, with six variables measured on 1208 objects. Then each variable is individually described: name, type, and a few of the first values. The values are usually rounded to a few digits. The Factor line is worth noticing: it is of course a factor covariate, taking two levels, lower and upper. Internally, the levels are coded 1 and 2, respectively. Then (by default) lower (with internal code equal to 1) is the reference category. If desired, this can be changed with the relevel function.
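The second structure listing below shows the result of such a change; the releveled copy can be created like this (a sketch assuming the mort data frame from the eha package):

```r
## Create ses2 as a copy of ses with "upper" as the reference category.
library(eha)
data(mort)
mort$ses2 <- relevel(mort$ses, ref = "upper")
levels(mort$ses)    # "lower" "upper"
levels(mort$ses2)   # "upper" "lower"
```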
'data.frame': 1208 obs. of 7 variables:
$ id : int 1 2 3 3 4 5 5 6 7 7 ...
$ enter : num 0 3.48 0 13.46 0 ...
$ exit : num 20 17.6 13.5 20 20 ...
$ event : int 0 1 0 0 0 0 0 0 0 1 ...
$ birthdate: num 1800 1800 1800 1800 1800 ...
$ ses : Factor w/ 2 levels "lower","upper": 2 1 2 1 1 1 2 2 2 1 ...
$ ses2 : Factor w/ 2 levels "upper","lower": 1 2 1 2 2 2 1 1 1 2 ...
Comparing the information about the variables ses and ses2, you find that the codings, and also the reference categories, are reversed. Otherwise the variables are identical, and any analysis involving either of them as an explanatory variable would lead to identical conclusions, but not identical parameter estimates, see below.
The function head prints the first few lines (observations) of a data frame (there is also a corresponding tail function that prints a few of the last rows).
id enter exit event birthdate ses ses2
1 1 0.000 20.000 0 1800.010 upper upper
2 2 3.478 17.562 1 1800.015 lower lower
3 3 0.000 13.463 0 1800.031 upper upper
4 3 13.463 20.000 0 1800.031 lower lower
5 4 0.000 20.000 0 1800.064 lower lower
6 5 0.000 0.089 0 1800.084 lower lower
Apparently, the variables ses and ses2 are identical, but we know that behind the scenes they are formally different: different coding and (therefore) different reference category. This shows up in analyses involving the two variables, one at a time.
First the analysis is run with ses.
Call:
coxreg(formula = Surv(enter, exit, event) ~ ses + birthdate,
data = mort)
Covariate Mean Coef Rel.Risk S.E. Wald p
ses
lower 0.416 0 1 (reference)
upper 0.584 -0.484 0.616 0.121 0.000
birthdate 1811.640 0.029 1.029 0.011 0.008
Events 276
Total time at risk 17038
Max. log. likelihood -1841.7
LR test statistic 23.08
Degrees of freedom 2
Overall p-value 9.72167e-06
Then the variable ses2 is inserted instead.
Call:
coxreg(formula = Surv(enter, exit, event) ~ ses2 + birthdate,
data = mort)
Covariate Mean Coef Rel.Risk S.E. Wald p
ses2
upper 0.584 0 1 (reference)
lower 0.416 0.484 1.623 0.121 0.000
birthdate 1811.640 0.029 1.029 0.011 0.008
Events 276
Total time at risk 17038
Max. log. likelihood -1841.7
LR test statistic 23.08
Degrees of freedom 2
Overall p-value 9.72167e-06
Notice the difference; it is only formal. The substantive results are completely equivalent.
3.10.1 Likelihood Ratio Test
If we judge the result by the Wald \(p\)-values, it seems as if both variables are highly significant. To be sure, it is advisable to apply the likelihood ratio test. In R, this is easily achieved by applying the drop1 function to the result, res, of the Cox regression.
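The call producing the table below could look like this (a sketch, assuming res holds the coxreg fit of ses and birthdate from the mort data):

```r
## Likelihood ratio tests by single-term deletions from the full model.
library(eha)
data(mort)
res <- coxreg(Surv(enter, exit, event) ~ ses + birthdate, data = mort)
drop1(res, test = "Chisq")
```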
Single term deletions
Model:
Surv(enter, exit, event) ~ ses + birthdate
Df AIC LRT Pr(>Chi)
<none> 3687.3
ses 1 3701.4 16.1095 5.978e-05
birthdate 1 3692.6 7.2752 0.006991
In fact, we have performed two likelihood ratio tests here. First, the effect of removing ses from the full model. The \(p\)-value for this test is very small, \(p = 0.00005978\). The scientific notation, 5.978e-05, means that the decimal point should be moved five steps to the left (that's the minus sign). The reason for this notation is essentially space-saving. Then, the second LR test concerns the absence or presence of birthdate in the full model. This also gives a small \(p\)-value, \(p = 0.006991\). The earlier conclusion is not altered; both variables are highly significant.
3.10.2 The estimated baseline cumulative hazard function
The estimated baseline cumulative hazard function (see Equation (3.7)) is best plotted by the plot command, see Figure 3.11.
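With res as the fitted coxreg object, the figure can be produced along these lines (a sketch; for a coxreg fit, plot shows the estimated baseline cumulative hazard):

```r
## Plot the estimated baseline cumulative hazard from the Cox fit.
library(eha)
data(mort)
res <- coxreg(Surv(enter, exit, event) ~ ses + birthdate, data = mort)
plot(res)   # cf. Figure 3.11
```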
FIGURE 3.11: Estimated cumulative baseline hazard function.
The figure shows the baseline cumulative hazard function for an average individual, i.e., with the value zero on all covariates, when all covariates are centered around their weighted means.
References
Cox, D. R. 1972. “Regression Models and Life Tables.” Journal of the Royal Statistical Society Series B (with Discussion) 34: 187–220.
Cox, D. R. 1975. “Partial Likelihood.” Biometrika 62: 269–76.