Chapter 3 Cox Regression
We move slowly from the two-sample case, via the \(k\)-sample case, to the general regression model. On this route, we start with the log-rank test and end up with Cox regression and the proportional hazards model (Cox 1972).
3.1 Proportional Hazards
The property of proportional hazards is fundamental in Cox regression. It is in fact the essence of Cox’s simple yet ingenious idea. The definition is as follows:
Definition 3.1 (Proportional hazards) If \(h_1(t)\) and \(h_0(t)\) are hazard functions from two separate distributions, we say that they are proportional if
\[\begin{equation} h_1(t) = \psi h_0(t), \quad \text{for all } t \ge 0, \tag{3.1} \end{equation}\] for some positive constant \(\psi\). Further, if (3.1) holds, then the same property holds for the corresponding cumulative hazard functions \(H_1(t)\) and \(H_0(t)\):
\[\begin{equation} H_1(t) = \psi H_0(t), \quad \text{for all } t \ge 0, \tag{3.2} \end{equation}\] with the same proportionality constant \(\psi\) as in (3.1). \(\Box\)

Strictly speaking, the second part of this definition follows easily from the first (and vice versa), so it would be more correct to state one part as a definition and the other as a corollary. The important part of the definition is “\(\textit{for all } t \ge 0\)”: the constant \(\psi\) does not depend on \(t\). Think of the hazard functions as age-specific mortality for two groups, e.g., women and men. It is “well known” that women have lower mortality than men at all ages, so it would be reasonable to assume proportional hazards in that case. It would mean that the female relative advantage is equally large at all ages. See Example 3.1 for a demonstration of this example of proportional hazards.
This is a continuation of Example ??, but here we involve the male mortality alongside the female, see Figure 3.1. We also expand the calendar time period to the 21st century, as far as we currently have data. The reason for this is the random fluctuations at low ages, caused by the extremely low mortality rates there.
FIGURE 3.1: Female and male mortality by age, Sweden 2001-2019.
As you can see, the “common knowledge” of proportional hazards is disproved by the right hand panel in Figure 3.1. On the other hand, the left hand panel gives a rather different impression, and this illustrates the need to choose good approaches to graphical presentation. \(\Box\)
It must be emphasized that the proportional hazards assumption must always be carefully checked. In many situations, it would not be reasonable to assume proportional hazards. If in doubt, check data by plotting the Nelson-Aalen estimates for each group in the same plot. The left hand panel of Figure 3.1 would suit this purpose better if drawn on a log scale, see Figure 3.2.
FIGURE 3.2: Female and male mortality by age, Sweden 2001-2019. Log scale.
The advantage of the log scale is twofold: (i) Small numbers are magnified so you can see them, and (ii) on the log scale, proportional hazards implies a constant vertical distance between the curves, which is easier to see for the human eye.
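Such a graphical check is easy to sketch with the survival package (the two-group data below are made up purely for illustration):

```r
library(survival)

# made-up two-group data, for illustration only
time  <- c(4, 2, 6, 1, 3.5, 5, 3, 6, 1, 2.5)
event <- c(1, 0, 1, 1, 0, 1, 1, 0, 1, 0)
group <- rep(c("A", "B"), each = 5)

fit <- survfit(Surv(time, event) ~ group)
# cumulative hazards for both groups in one plot, log scale on the y axis
plot(fit, fun = "cumhaz", log = "y", lty = 1:2,
     xlab = "Time", ylab = "Cumulative hazard")
```

Under proportional hazards, the two curves should then appear roughly parallel.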
For an example of a perfect fit to the proportional hazards model, see Figure 3.3 (two Weibull hazard functions with the proportionality constant \(\psi = 2\)).
FIGURE 3.3: Two hazard functions that are proportional. The proportionality constant is 2.
The proportionality is often difficult to judge by eye, so in order to make it easier to see, the plot can be made on a log scale as in the right hand panel of Figure 3.3. Note that both dimensions are on a log scale. This type of plot, constructed from empirical data, is called a Weibull plot in reliability applications: If the lines are straight lines, then data are well fitted by a Weibull distribution. Additionally, if the slope of the line is 1 (45 degrees), then an exponential model fits well.
To summarize Figure 3.3: (i) The hazard functions are proportional because on the log-log scale, the vertical distance is constant, (ii) Both hazard functions represent a Weibull distribution, because both lines are straight lines, and (iii) neither represents an exponential distribution, because the slopes are not one. This latter fact may be difficult to see because of the different scales on the axes (the aspect ratio is not one).
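A figure like Figure 3.3 is straightforward to construct in principle; the following sketch (with made-up Weibull parameters) draws two proportional cumulative hazards on a log-log scale and verifies the constant ratio numerically:

```r
# two Weibull cumulative hazards with proportionality constant psi = 2;
# shape and scale are made-up values for illustration
t <- seq(0.1, 10, by = 0.1)
shape <- 1.5
scale <- 5
H0 <- (t / scale)^shape   # Weibull cumulative hazard
H1 <- 2 * H0              # proportional, with psi = 2

# on the log-log scale both curves are straight lines with slope = shape,
# and the vertical distance between them is the constant log(2)
plot(log(t), log(H0), type = "l", xlab = "log(t)", ylab = "log(H)")
lines(log(t), log(H1), lty = 2)

stopifnot(all(abs(H1 / H0 - 2) < 1e-12))  # proportionality holds for all t
```

A slope of exactly 1 on this plot would correspond to the exponential special case (shape = 1).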
Figure 3.4 shows the relationships between the cumulative hazards functions, the density functions, and the survival functions when the hazard functions are proportional. Note that the cumulative hazards functions are proportional by implication, with the same proportionality constant (\(\psi = 2\) in this case). On the other hand, for the density and survival functions, proportionality does not hold; it is in fact theoretically impossible except in the trivial case that the proportionality constant is unity.
FIGURE 3.4: The effect of proportional hazards on the density and survival functions.
3.2 The Log-Rank Test
The log-rank test is a \(k\)-sample test of equality of survival functions. We first look at the two-sample case, that is, \(k = 2\).
Suppose that we have the small data set illustrated in
Figure 3.5. There are two samples, the letters (A, B,
C, D, E) and the numbers (1, 2, 3, 4, 5).
FIGURE 3.5: Two-sample data, the letters (dashed) and the numbers (solid). Circles denote censored observations, plusses events.
The data in Figure 3.5 can be presented in tabular form, see Table 3.1.
| group | time | event |
|---|---|---|
| numbers | 4.0 | TRUE |
| numbers | 2.0 | FALSE |
| numbers | 6.0 | TRUE |
| numbers | 1.0 | TRUE |
| numbers | 3.5 | FALSE |
| letters | 5.0 | TRUE |
| letters | 3.0 | TRUE |
| letters | 6.0 | FALSE |
| letters | 1.0 | TRUE |
| letters | 2.5 | FALSE |
We are interested in investigating whether letters and numbers
have the same survival chances or not. Therefore, the hypothesis
\[\begin{equation*} H_0: \text{No difference in survival between numbers and letters} \end{equation*}\] is formulated. In order to test \(H_0\), we make five tables, one for each observed event time, see Table 3.2, where the first table, relating to failure time \(t_{(1)} = 1\), is shown.
| Deaths | Survivals | Total | |
|---|---|---|---|
| numbers | 1 | 4 | 5 |
| letters | 1 | 4 | 5 |
| Total | 2 | 8 | 10 |
Let us look at the table at failure time \(t_{(1)} = 1\), Table 3.2,
from the viewpoint of the numbers.
- The observed number of deaths among numbers: \(1\).
- The expected number of deaths among numbers: \(2 \times 5 / 10 = 1\).
The expected number is calculated under \(H_0\), i.e., as if there is
no difference between letters and numbers regarding mortality. It is further
assumed that the two margins (Total) are given (fixed).
Then, given that there are two deaths in total and that five of the ten individuals at risk belong to the group numbers,
the expected number of deaths is calculated as above.
This procedure is repeated for each of the five tables, and the results are summarized in Table 3.3.
| Observed | Expected | Difference | Variance | |
|---|---|---|---|---|
| t(1) | 1 | 1.0 | 0.0 | 0.44 |
| t(2) | 0 | 0.5 | -0.5 | 0.25 |
| t(3) | 1 | 0.5 | 0.5 | 0.25 |
| t(4) | 0 | 0.3 | -0.3 | 0.22 |
| t(5) | 1 | 0.5 | 0.5 | 0.25 |
| Sum | 3 | 2.8 | 0.2 | 1.41 |
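The calculations behind Table 3.3 can be reproduced in a few lines of base R. The vectors below encode the data of Table 3.1; note that the unrounded sums give a test statistic of about 0.020, while the rounded values displayed in the table give 0.028:

```r
# the toy data of Table 3.1
time    <- c(4, 2, 6, 1, 3.5, 5, 3, 6, 1, 2.5)
event   <- c(1, 0, 1, 1, 0, 1, 1, 0, 1, 0)
numbers <- rep(c(TRUE, FALSE), each = 5)  # TRUE = group "numbers"

taus <- sort(unique(time[event == 1]))    # the five observed event times
obs <- expe <- v <- numeric(length(taus))
for (i in seq_along(taus)) {
  atrisk <- time >= taus[i]               # the risk set at this event time
  n  <- sum(atrisk)                       # total number at risk
  n1 <- sum(atrisk & numbers)             # at risk in group "numbers"
  d  <- sum(event == 1 & time == taus[i]) # total deaths at this time
  obs[i]  <- sum(event == 1 & time == taus[i] & numbers)
  expe[i] <- d * n1 / n                   # expected deaths, group "numbers"
  # hypergeometric variance of the number of deaths in group "numbers"
  v[i]    <- d * (n - d) * n1 * (n - n1) / (n^2 * (n - 1))
}
Tstat <- sum(obs - expe)^2 / sum(v)       # the log-rank statistic
```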
Finally, the observed test statistic \(T\) is calculated as
\[\begin{equation*}
T = \frac{0.2^2}{1.41} \approx 0.028
\end{equation*}\]
Under the null hypothesis, this is an observed value from a
\(\chi^2(1)\)
distribution, and \(H_0\) should be rejected for large values of
\(T\). Using a level of significance of 5%, the critical value for
\(T\) is 3.84, far from our observed value of 0.028. The
conclusion is
therefore that there is no (statistically significant) difference in
survival chances between letters and numbers.
Note, however, that this result depends on
asymptotic (large sample)
properties, and
in this toy example, these properties are not valid.
For more detail about the underlying theory, see Appendix A.
Let us now look at a real data example, the old age mortality data set
oldmort in eha. See Table 3.4 for a sample of five records with selected columns.
| id | enter | exit | event | sex | civ |
|---|---|---|---|---|---|
| 793001208 | 66.498 | 67.988 | 0 | male | married |
| 793001208 | 67.988 | 72.820 | 0 | male | married |
| 793001208 | 72.820 | 75.542 | 1 | male | widow |
| 793001209 | 66.446 | 76.568 | 1 | female | married |
| 793001210 | 66.446 | 67.936 | 0 | female | married |
We are interested in comparing male and female mortality in the ages 60–85 with a logrank test, and for that purpose we use the logrank function:
The log-rank test
Call:
logrank(Y = Surv(enter, exit, event), group = sex, data = om)
X-squared = 19.1971 , df = 1 , p-value = 1.1789e-05
The result of the logrank test is given by the \(p\)-value \(1.18 \times 10^{-5}\), a very small number. Thus, there is a highly significant difference in mortality between men and women. But how large is the difference? The log-rank test has no answer to this question.
Remember that this result depends on the proportional hazards assumption. We can check it graphically, see Figure 3.6.
FIGURE 3.6: Cumulative hazards for women and men in ages 60-85, 19th century Sweden.
The proportionality assumption seems to be a good description from 60 to 85 years of age, but it is more doubtful in the very high ages (not shown). One reason for this may be that the high-age estimates are based on few observations (most of the individuals in the sample died earlier), so random fluctuations have a large impact in the high ages, but it is also an often observed phenomenon that differences in mortality between groups tend to vanish in the extremely high ages.
3.2.1 Several samples
The result for the two-sample case is easily extended to the \(k\)-sample
case. Instead of one \(2 \times 2\) table per observed event time we get one
\(k\times 2\) table per observed event time and we have to calculate expected
and observed numbers of events for \((k-1)\) groups at each failure time. The
resulting test statistic will have \((k-1)\) degrees of freedom and still be
approximately \(\chi^2\) distributed. This is illustrated with the same data
set, oldmort, as above, but with the covariate civ,
which is a factor with three levels (unmarried, married, widow), instead of sex.
Furthermore, the investigation is limited to male mortality.
The log-rank test
Call:
logrank(Y = Surv(enter, exit, event), group = civ, data = om[om$sex ==
"male", ])
X-squared = 19.8622 , df = 2 , p-value = 4.86372e-05
The degrees of freedom for the score test
are now 2, equal to the number
of levels in civ
minus one. Being unmarried seems to have a great impact on
old age mortality. It is, however, recommended to check the proportionality
assumption graphically, see Figure 3.7.
FIGURE 3.7: Old age male mortality by civil status, cumulative hazards.
There is obviously nothing much that indicates non-proportionality in this case either. Furthermore, the ordering is what one would expect.
We do not go deeper into this matter here, mainly because the logrank test is a special case of Cox regression, which will be described in detail later in this chapter.
3.3 Cox Regression Models
Starting with the definition of proportional hazards in Section 3.1, the concept of Cox regression is introduced in steps.
3.3.1 Two Groups
The definition of proportionality, given in equation (3.1), can equivalently be written as
\[\begin{equation} h_x(t) = \psi^x h_0(t), \quad t > 0, \; x = 0, 1, \; \psi > 0. \tag{3.3} \end{equation}\]
It is easy to see that this “trick” is equivalent to (3.1): When \(x = 0\) it simply says that \(h_0(t) = h_0(t)\), and when \(x = 1\) it says that \(h_1(t) = \psi h_0(t)\). Since \(\psi > 0\), we can calculate \(\beta = \log(\psi)\), and rewrite (3.3) as
\[\begin{equation*} h_x(t) = e^{\beta x} h_0(t), \quad t > 0; \; x = 0, 1; \; -\infty < \beta < \infty, \end{equation*}\] or with a slight change in notation,
\[\begin{equation} h(t; x) = e^{\beta x} h_0(t), \quad t > 0; \; x = 0, 1; \; -\infty < \beta < \infty. \tag{3.4} \end{equation}\]
The sole idea behind this rewriting is to pave the way for the introduction of Cox’s regression model (Cox 1972), which in its elementary form is a proportional hazards model. In fact, we can already interpret equation (3.4) as a Cox regression model with an explanatory variable \(x\) with corresponding regression coefficient \(\beta\) (to be estimated from data). The covariate \(x\) is still only a dichotomous variate, but we will now show how it is possible to generalize this to a situation with explanatory variables of any form. The first step is to go from the two-sample situation to a \(k\)-sample one.
3.3.2 Many Groups
Can we generalize to \((k + 1)\) groups, \(k \ge 2\)? Yes, by expanding the procedure in the previous subsection:
\[\begin{eqnarray*} h_0(t) &\sim& \mbox{group 0} \\ h_1(t) &\sim& \mbox{group 1} \\ \cdots & & \cdots \\ h_k(t) &\sim& \mbox{group k} \end{eqnarray*}\]
The underlying model is: \(h_j(t) = \psi_j h_0(t), \quad t \ge 0\), \(j = 1, 2, \ldots, k\). That is, with \((k+1)\) groups, we need \(k\) proportionality constants \(\psi_1, \ldots, \psi_k\) in order to define proportional hazards. Note also that in this formulation (there are others), one group is “marked” as a reference group, that is, to this group there is no proportionality constant attached. All relations are relative to the reference group. Note also that it doesn’t matter which group is chosen as the reference. This choice does not change the model itself, only its representation.
With \((k + 1)\) groups, we need \(k\) indicators. Let
\[\begin{equation*} \mathbf{x} = (x_{1}, x_{2}, \ldots, x_{k}). \end{equation*}\] Then
\[\begin{eqnarray*} \mathbf{x} = (0, 0, \ldots, 0) & \Rightarrow & \mbox{group 0} \\ \mathbf{x} = (1, 0, \ldots, 0) & \Rightarrow & \mbox{group 1} \\ \mathbf{x} = (0, 1, \ldots, 0) & \Rightarrow & \mbox{group 2} \\ \cdots & & \cdots \\ \mathbf{x} = (0, 0, \ldots, 1) & \Rightarrow & \mbox{group k} \end{eqnarray*}\]
and
\[\begin{equation*} h(t; \mathbf{x}) = h_0(t) \prod_{\ell = 1}^k \psi_{\ell}^{x_{\ell}} = \left\{ \begin{array}{ll} h_0(t), & \mathbf{x} = (0, 0, \ldots, 0) \\ h_0(t) \psi_j, & x_{j} = 1, \quad j = 1, \ldots k \end{array}\right. \end{equation*}\] With \(\psi_j = e^{\beta_j}, \; j = 1, \ldots, k\), we get
\[\begin{equation} h(t; \mathbf{x}) = h_0(t) e^{x_{1}\beta_1 + x_{2} \beta_2 + \cdots + x_{k} \beta_k} = h_0(t) e^{\mathbf{x} \boldsymbol{\beta}}, \tag{3.5} \end{equation}\]
where \(\boldsymbol{\beta} = (\beta_1, \beta_2, \ldots, \beta_k)\).
3.3.3 The General Cox Regression Model
We may now generalize (3.5) by letting the components of \(\mathbf{x}_i\) take any value. Let data and model take the following form:
Data:
\[\begin{equation} (t_{i0}, t_i, d_i, \mathbf{x}_i), \; i = 1, \ldots, n, \end{equation}\]
where \(t_{i0}\) is the left truncation time point (if \(t_{i0} = 0\) for all \(i\), then this variable may be omitted), \(t_i\) is the end time point, \(d_i\) is the “event indicator” (\(1\) or TRUE if event, else \(0\) or FALSE), and \(\mathbf{x}_i\) is a vector of explanatory variables.
Model:
\[\begin{equation} h(t; \mathbf{x}_i) = h_0(t) e^{\mathbf{x}_i \boldsymbol{\beta}}, \quad t > 0. \tag{3.6} \end{equation}\]
This is a regression model where the response variable is \((t_0, t, d)\) (we will call it a survival object) and the explanatory variable is \(\mathbf{x}\), possibly (often) vector valued.
In (3.6) there are two components to estimate, the regression coefficients \(\boldsymbol{\beta}\), and the baseline hazard function \(h_0(t), \; t > 0\). For the former task, the partial likelihood (Cox 1975) is used. See Appendix A for a brief summary.
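As a minimal sketch of fitting model (3.6) in R (here using coxph from the survival package rather than eha, applied to the toy letters/numbers data of Section 3.2):

```r
library(survival)

# the toy letters/numbers data from Section 3.2
dat <- data.frame(
  time  = c(4, 2, 6, 1, 3.5, 5, 3, 6, 1, 2.5),
  event = c(1, 0, 1, 1, 0, 1, 1, 0, 1, 0),
  group = rep(c("numbers", "letters"), each = 5)
)

# partial likelihood estimation of beta; group is an indicator covariate
fit <- coxph(Surv(time, event) ~ group, data = dat)
coef(fit)       # the estimated regression coefficient (log hazard ratio)
exp(coef(fit))  # the estimated proportionality constant psi
```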
3.4 Estimation of the Baseline Cumulative Hazard Function
The usual estimator (continuous time) of the baseline cumulative hazard function is
\[\begin{equation} \hat{H}_0(t) = \sum_{j:t_j \le t} \frac{d_j}{\sum_{m \in R_j} e^{\mathbf{x}_m \hat{\boldsymbol{\beta}}}}, \tag{3.7} \end{equation}\]
where \(d_j\) is the number of events at \(t_j\) and \(\hat{\boldsymbol{\beta}}\) is the result of maximizing the partial likelihood. Note that if \(\hat{\boldsymbol{\beta}} = 0\), this reduces to
\[\begin{equation} \hat{H}_0(t) = \sum_{j:t_j \le t} \frac{d_j}{n_j}, \tag{3.8} \end{equation}\]
the Nelson-Aalen estimator. In (3.8), \(n_j\) is the size of \(R_j\).
In the R package eha, the baseline hazard is estimated at the value zero of all covariates (and at the reference category for a factor covariate). This differs from the practice in some other packages, where an average of the covariate values is chosen as reference. That is generally a bad habit, because the reference value will then vary as subsets of the original data set are taken. Instead, the practitioner must exercise judgment in choosing relevant reference values for (continuous) covariates. There are two instances where these choices make a difference: (i) when interactions are present, main effect estimates will vary with the choice (more about this later), and (ii) in the estimation of baseline hazards and survival functions.
In order to calculate the cumulative hazards function for an individual with a specific covariate vector \(\mathbf{x}\), use the formula
\[\begin{equation*} \hat{H}(t; \mathbf{x}) = \hat{H}_0(t) e^{\mathbf{x}\hat{\boldsymbol{\beta}}}. \end{equation*}\]
The corresponding survival functions may be estimated by the relation
\[\begin{equation*} \hat{S}(t; \mathbf{x}) = \exp\bigl(-\hat{H}(t; \mathbf{x})\bigr) \end{equation*}\]
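As a small numerical sketch of these two formulas (all numbers below are hypothetical, chosen only for illustration):

```r
# hypothetical estimates, for illustration only
H0_t <- 0.25               # baseline cumulative hazard at some time t
x    <- c(1, 0.5)          # a covariate vector
beta <- c(-0.2, 0.1)       # estimated regression coefficients

H_t <- H0_t * exp(sum(x * beta))  # estimated H(t; x)
S_t <- exp(-H_t)                  # estimated S(t; x)
```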
It is also possible to use the terms in the sum (3.7) to build an estimator analogous to the Kaplan-Meier estimator (??). In practice, there is no big difference between the two methods.
3.5 Explanatory Variables
Explanatory variables, or covariates, may be of
essentially two
different types: continuous and discrete. The discrete type
usually takes only a finite number of distinct values and is called a
factor in R. A special case of a factor is one that takes
only two distinct values, say 0 and 1. Such a factor is called an
indicator, because we can let the value 1 indicate the presence of a
certain property and 0 denote its absence. To summarize, there are:
- Covariate: taking values in an interval (e.g., age, blood pressure).
- Factor: taking a finite number of values (e.g., civil status, occupation).
- Indicator: a factor taking two values (e.g., gender).
3.5.1 Continuous Covariates
We use the qualifier continuous to stress that factors are excluded, because often the term covariate is used as a synonym for explanatory variable.
Values taken by a continuous covariate are ordered. The effect on the response is by model definition ordered in the same or reverse order. On the other hand, values taken by a factor are unordered (but may be defined as ordered in R).
3.5.2 Factor Covariates
An explanatory variable that can take only a finite (usually small) number of distinct values is called a categorical variable. In R language, it is called a factor. Examples of such variables are gender, socio-economic status, birth place. Students of statistics have long been taught to create dummy variables in such situations, in the following way:
- Given a categorical variable \(F\) with \((k+1)\) levels \((f_0, f_1, f_2, \ldots, f_k)\),
- Create \(k\) indicator (“dummy”) variables \((I_1, I_2, \ldots, I_k)\).
The level \(f_0\) is the reference category, characterized by all indicator variables being zero for an individual at this level. Generally, for level \(f_i, \; i = 1, \ldots, k\), the indicator variable \(I_i\) is one and the rest are zero. In other words, for a single individual, at most one indicator is one, and the rest are zero.
In R, there is no need to explicitly create dummy variables, it is done
behind the scenes by the functions
factor and
as.factor.
Note that a factor with two levels, i.e., an indicator variable, can always be treated as a continuous covariate, if coded numerically (e.g., 0 and 1).
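The dummy coding can be inspected with model.matrix; a small sketch using the civ levels from the oldmort example:

```r
civ <- factor(c("unmarried", "married", "widow", "married"),
              levels = c("unmarried", "married", "widow"))
# one intercept column plus k = 2 indicator columns; the first level
# ("unmarried") is the reference and gets no column of its own
model.matrix(~ civ)
```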
Consider a demographic example, the influence of mother’s age (a continuous covariate) on infant mortality. It is considered well-known that a young mother means high risk for the infant, and also that an old mother means high risk, compared to “in-between-aged” mothers. So the risk order is not the same (or reverse) as the age order.
One solution (not necessarily the best) to this problem is to factorize: Let, for instance,
\[\begin{equation*} \mbox{mother's age} = \left\{\begin{array}{ll} \mbox{low}, & 15 < \mbox{age} \le 25 \\ \mbox{middle}, & 25 < \mbox{age} \le 35 \\ \mbox{high}, & 35 < \mbox{age} \end{array} \right. \end{equation*}\]
In this layout, there will be two parameters measuring the deviation from the reference category, which will be the first category by default.
In R, this is easily achieved with the aid of the
cut function. It
works like this:
(15,25] (25,35] (35,48]
16 64 20
Note that the created intervals by default are closed to the right and open
to the left. This has consequences for how observations exactly on a
boundary are treated: they belong to the lower-valued interval. The
argument right in the call to cut can be used to switch this
behaviour the other way around.
Note further that values falling below the smallest value (15 in our
example) or above the largest value (48) are reported as missing values
(NA in R terminology, Not Available).
For further information about the use of the cut function, see the
help page. \(\Box\)
3.6 Interactions
The meaning of interaction between two explanatory variables is described by an example, walking through the three possible combinations of covariate types.
We return to the oldmort data set, slightly modified, see Table 3.5,
where the first five of a total of 6495 rows are shown.
| birthdate | enter | exit | event | sex | farmer | imr.birth |
|---|---|---|---|---|---|---|
| 1775.286 | 84.714 | 85 | 0 | female | no | 14.20722 |
| 1775.092 | 84.908 | 85 | 0 | female | yes | 11.97183 |
| 1775.374 | 84.626 | 85 | 0 | female | no | 16.93548 |
| 1775.859 | 84.141 | 85 | 0 | male | yes | 16.93548 |
| 1775.111 | 84.889 | 85 | 0 | female | no | 12.70903 |
The sampling frame is a rectangle in the Lexis diagram, see Figure 3.8. It can be described as all persons born between January 1, 1775 and January 1, 1881, and present in the solid rectangle at some moment.
FIGURE 3.8: Old age mortality sampling frame, Lexis diagram.
There are four possible explanatory variables behind survival after age 60 in the example. \[\begin{eqnarray*} \mbox{farmer} & = & \left\{ \begin{array}{ll} \mbox{no} & \\ \mbox{yes} & \left(e^{\alpha}\right) \end{array} \right. \\ \mbox{sex} & = & \left\{ \begin{array}{ll} \mbox{male} & \\ \mbox{female} & \left(e^{\beta}\right) \end{array} \right. \\ \mbox{birthdate } (x_1) & & \mspace{90mu} \left(e^{\gamma_1 x_1}\right) \\ \mbox{IMR at birth } (x_2) & & \mspace{90mu} \left(e^{\gamma_2 x_2}\right) \\ \end{eqnarray*}\] The first two are factors, and the last two are continuous covariates. We go through the three distinct combinations and illustrate the meaning of interaction in each. \(\Box\)
3.6.1 Two factors
Assume that we, in the oldmort example, only have the factors sex and farmer at our
disposal as explanatory variables. We may fit a Cox regression model like this:
Covariate Mean Coef Rel.Risk S.E. LR p
sex 0.000
male 0.407 0 1 (reference)
female 0.593 -0.232 0.793 0.048
farmer 0.025
no 0.772 0 1 (reference)
yes 0.228 -0.127 0.880 0.058
Events 1823
Total time at risk 37244
Max. log. likelihood -12956
LR test statistic 23.97
Degrees of freedom 2
Overall p-value 6.25076e-06
The output from coxreg (the model fit) is saved in the object fit, which is investigated,
first by the function summary, which here performs successive \(\chi^2\) tests of the significance of
the estimated regression coefficients (the null hypothesis in each case is that the true coefficient
is zero).
The \(p\)-value corresponding to the coefficient for sex is very small, meaning that there
is a highly significant difference in mortality between women and men. The difference in mortality
between farmers and non-farmers is also statistically significant, but with a larger \(p\)-value.
Next, we want to see the size of the difference in mortality between women and men, and between farmers and non-farmers, so look at the parameter estimates, especially in the column “Rel. Risk”. It tells us that female mortality is 79.3 per cent of the male mortality, and that the farmer mortality is 88.0 per cent of the non-farmer mortality. Furthermore, this model is multiplicative (additive on the log scale), so we can conclude that the mortality of a female farmer is \(0.793 \times 0.880 \times 100\) = 69.8 per cent of that of a male non-farmer.
We can also illustrate the sex difference graphically by a stratified analysis, see Figure 3.9.
fit2 <- coxreg(Surv(enter, exit, event) ~ strata(sex) + farmer,
data = om)
plot(fit2, fun = "cumhaz", xlim = c(60, 85),
lty = 1:2, xlab = "Age")
abline(h = 0)
FIGURE 3.9: Cumulative hazards for men and women.
Obviously the proportionality assumption is well met.
An obvious question to ask is: Is the sex difference in mortality the same among farmers as among non-farmers? In other words: Is there an interaction between sex and farmer/non-farmer regarding the effect on mortality?
In order to test for an interaction in R, the plus (\(+\)) sign in the formula is changed to a multiplication sign (\(*\)):
Single term deletions
Model:
Surv(enter, exit, event) ~ sex * farmer
Df AIC LRT Pr(>Chi)
<none> 25912
sex:farmer 1 25916 6.4 0.011
Covariate Mean Coef Rel.Risk S.E. Wald p
sex
male 0.407 0 1 (reference)
female 0.593 -0.297 0.743 0.054 0.000
farmer
no 0.772 0 1 (reference)
yes 0.228 -0.252 0.777 0.076 0.001
sex:farmer
female:yes 0.291 1.338 0.114 0.011
Events 1823
Total time at risk 37244
Max. log. likelihood -12953
LR test statistic 30.37
Degrees of freedom 3
Overall p-value 1.15606e-06
The small \(p\)-value (1.1 per cent) indicates that there is a statistically significant (at the 5 per cent level) interaction effect. The size of the interaction effect is seen in the table of estimated coefficients (rounded to three decimal places):
coef exp(coef)
sexfemale -0.297 0.743
farmeryes -0.252 0.777
sexfemale:farmeryes 0.291 1.338
Interpretation: Among non-farmers (the reference category), the mortality of females is \(100 \times 0.743 = 74.3\) per cent of the mortality of males. However, moving to the farmer group, this ratio must be multiplied by \(1.338\), the interaction effect. The result is that among farmers, the female mortality is \(100 \times 0.743 \times 1.338 \approx 99.4\) per cent of the male mortality.
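Under the model, the two sex ratios follow directly from the coefficients above:

```r
b_sex <- -0.297  # coefficient for sexfemale
b_int <-  0.291  # coefficient for sexfemale:farmeryes

exp(b_sex)          # female/male hazard ratio among non-farmers, about 0.743
exp(b_sex + b_int)  # female/male hazard ratio among farmers, about 0.994
```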
In cases like this, it is often recommended to perform separate analyses for women and men, or, alternatively, for farmers and non-farmers. Do that as an exercise and interpret your results.
3.6.2 One factor and one continuous covariate
Now the covariates sex (factor) and birthdate (continuous) are in focus. First the model without interaction is fitted.
Single term deletions
Model:
Surv(enter, exit, event) ~ sex + birthdate
Df AIC LRT Pr(>Chi)
<none> 25919
sex 1 25936 18.9812 1.32e-05
birthdate 1 25919 2.2364 0.1348
Obviously sex is statistically significant, while birthdate is not. We go directly to the estimated coefficients after adding an interaction term to the model:
fit6 <- coxph(Surv(enter, exit, event) ~ sex * birthdate, data = om)
round(summary(fit6)$coefficients[, 1:2], 5)
                       coef exp(coef)
sexfemale 9.12285 9162.24554
birthdate -0.00327 0.99673
sexfemale:birthdate -0.00519 0.99483
What is going on here? The main effect of sex is huge! The answer lies in the interpretation of the coefficients:
(i) To evaluate the effect on mortality for an
individual with given values of the covariates, you must involve the interaction coefficient. For instance, a female born on January 2, 1801 (birthdate = 1801.003)
will have the relative risk
\[\exp(9.12285 - 0.00327 \times 1801.003 - 0.00519 \times 1801.003) \approx 0.0022\]
compared to a man with birthdate = 0, the reference values! Obviously, it is ridiculous to extrapolate the model 1800 years back in time, but this is nevertheless the correct way to interpret the coefficients. So, at birthdate = 0, women’s mortality is 9162 times that of men!
(ii) With interactions involved, it is strongly recommended to center continuous covariates. In our example, we could do that by subtracting 1810 (say) from birthdate. That makes sex = "male", birthdate = 1810 the new reference values. The result is
coef exp(coef)
sexfemale -0.26429 0.76775
birthdate -0.00327 0.99673
sexfemale:birthdate -0.00519 0.99483
This looks more reasonable: At the beginning of the year 1810, women had a mortality of 76.8 per cent of that of men.
To summarize: The effects can be illustrated by two curves, one for men and one for women. The coefficients for sex, 0 and -0.26429, are the respective intercepts, and -0.00327 and
-0.00846 are the slopes. See Figure 3.10 for a graphical illustration.
FIGURE 3.10: The effect of birthdate on relative mortality for men and women. Reference: Men, birthdate = 1810.
3.6.3 Two continuous covariates
In our example data, the covariates birthdate and imr.birth are continuous, and we get
fit8 <- coxph(Surv(enter, exit, event) ~ birthdate * imr.birth, data = om)
(res <- round(summary(fit8)$coefficients[, 1:2], 5))
                       coef exp(coef)
birthdate -0.03024 0.97021
imr.birth 0.02004 1.02024
birthdate:imr.birth 0.00158 1.00158
The interpretation of the coefficients is: The reference point is birthdate = 0 (the year 1810, after centering) and imr.birth = 0, and if birthdate = x and imr.birth = y, then the relative risk is
\[ \exp(-0.03024 x + 0.02004 y + 0.00158 x y),\]
analogous to the calculation in the case with one continuous and one factor covariate.
3.7 Interpretation of parameter estimates
In the proportional hazards model the parameter estimates are logarithms of risk ratios relative to the baseline hazard. The precise interpretations of the coefficients for the two types of covariates are discussed. The conclusion is that \(e^\beta\) has a more direct and intuitive interpretation than \(\beta\) itself.
3.7.1 Continuous covariate
If \(x\) is a continuous covariate, and \(h(t; x) = h_0(t)e^{\beta x}\), then
\[\begin{equation*} \frac{h(t; x+1)}{h(t; x)} = \frac{h_0(t)e^{\beta (x+1)}} {h_0(t)e^{\beta x}} = e^\beta, \end{equation*}\] so the risk increases by a factor \(e^\beta\) when \(x\) is increased by one unit. In other words, \(e^\beta\) is a relative risk (or a hazard ratio, which is often the preferred term in certain professions).
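For example, with a made-up coefficient \(\beta = 0.05\):

```r
beta <- 0.05             # a hypothetical regression coefficient
exp(beta)                # hazard ratio per one-unit increase in x
100 * (exp(beta) - 1)    # per cent increase in risk, about 5.1
```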
3.7.2 Factor
For a factor covariate, in the usual coding with a reference category, \(e^\beta\) is the relative risk compared to that reference category.
3.8 Proportional hazards in discrete time
In discrete time, the hazard function is, as we saw earlier, a set of conditional probabilities, and so its range is restricted to the interval \((0, 1)\). Therefore, the definition of proportional hazards used for continuous time is impractical; multiplying a probability by a constant may result in a quantity larger than one.
One way of introducing proportional hazards in discrete time is to regard the discreteness as a result of grouping truly continuous time data, for which the proportional hazards assumption holds. For instance, in a follow-up study of human mortality, we may only have data recorded once a year, so life length can only be measured in years. Thus, we assume that there is a true exact life length \(T\), but we can only observe that it falls in an interval \((t_i, t_{i+1})\).
Assume continuous proportional hazards, and a partition of time:
\[ 0 = t_0 < t_1 < t_2 < \cdots < t_k = \infty. \] Then \[\begin{multline}\label{eq:dischaz} P(t_{j-1} \le T < t_j \mid T \ge t_{j-1}; \; \mathbf{x}) = \frac{S(t_{j-1}\mid \mathbf{x}) - S(t_j\mid \mathbf{x})}{S(t_{j-1} \mid \mathbf{x})} \\ = 1 - \frac{S(t_j \mid \mathbf{x})}{S(t_{j-1} \mid \mathbf{x})} = 1 - \left(\frac{S_0(t_j)}{S_0(t_{j-1})}\right)^{\exp(\boldsymbol{\beta}\mathbf{x})} \\ = 1 - (1 - h_j)^{\exp(\boldsymbol{\beta}\mathbf{x})} \end{multline}\] with \(h_j = P(t_{j-1} \le T < t_j \mid T \ge t_{j-1}; \; \mathbf{x} = \mathbf{0})\), \(j = 1, \ldots, k\). We take this relation as the definition of proportional hazards in discrete time.
3.8.1 Logistic regression
It turns out that a proportional hazards model in discrete time, according to this definition, is nothing else than a logistic regression model with the cloglog link (cloglog is short for “complementary log-log”; the link function is \(\log(-\log(1-p))\)). In order to see that, let
\[\begin{equation} (1 - h_j) = \exp(-\exp(\alpha_j)), \; j = 1, \ldots, k \end{equation}\] and
\[\begin{equation*} X_j = \left\{ \begin{array}{ll} 1, & t_{j-1} \le T < t_j \\ 0, & \text{otherwise} \end{array} \right., \quad j = 1, \ldots, k. \end{equation*}\] Then
\[\begin{equation*} \begin{split} P(X_1 = 1; \; \mathbf{x}) &= 1 - \exp\big(-\exp(\alpha_1 + \boldsymbol{\beta}\mathbf{x})\big) \\ P(X_j = 1 \mid X_1 = \cdots = X_{j-1} = 0; \; \mathbf{x}) &= 1 - \exp\big(-\exp(\alpha_j + \boldsymbol{\beta}\mathbf{x})\big), \\ \quad j = 2, \ldots, k. \end{split} \end{equation*}\] This is logistic regression with a cloglog link. Note that extra parameters \(\alpha_1, \ldots, \alpha_k\) are introduced, one for each potential event age. They correspond to the baseline hazard function in continuous time, but are estimated simultaneously with the regression parameters.
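In R, such a model can be fitted with glm. A minimal sketch on simulated data follows; the data layout, one record per individual and interval at risk, and all numeric values are assumptions of the illustration:

```r
## Discrete-time proportional hazards via logistic regression with a
## cloglog link. Data are simulated for illustration only.
set.seed(1)
sim <- data.frame(
  interval = factor(rep(1:3, each = 40)),   # gives the alpha_j's
  x        = rep(0:1, 60),                  # a single covariate
  event    = rbinom(120, size = 1, prob = 0.3)
)
fit <- glm(event ~ interval + x,
           family = binomial(link = "cloglog"), data = sim)
coef(fit)   # '(Intercept)' and 'interval' terms estimate the alpha_j's;
            # 'x' estimates beta
```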
3.9 Model selection
In regression modeling, there are often several competing models for describing data. In general, there are no strict rules for “correct selection”. However, for nested models, there are some formal guidelines. For a precise definition of this concept, see Appendix A.
3.9.1 Model selection in general
Some general advice regarding model selection is given here.
- Remember, there are no true models, only some useful ones. This statement is attributed to G.E.P. Box.
- More than one model may be useful.
- Keep important covariates in the model.
- Avoid automatic stepwise procedures!
- If interaction effects are present, the corresponding main effects must be there.
3.10 Doing it in R
We utilize the male mortality data, Skellefteå 1800–1820, to illustrate the aspects of an ordinary Cox regression. Males aged 20 (exact) are sampled between 1820 and 1840 and followed to death or to reaching the age of 40, when they are right censored. So in effect male mortality in the ages 20–40 is studied. The survival time starts counting at zero for each individual at his twentieth birthday (that is, at exact age 20 years).
There are two R functions that are handy for a quick look at a data frame, str and head. The first gives a summary description of the structure of an R object, often a data frame:
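The listing below comes from a call like the following (assuming, as in the eha package, that the data frame is named mort):

```r
## Load the male mortality data and inspect its structure; the data
## frame 'mort' ships with the eha package.
library(eha)
data(mort)
str(mort)
```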
'data.frame': 1208 obs. of 6 variables:
$ id : int 1 2 3 3 4 5 5 6 7 7 ...
$ enter : num 0 3.48 0 13.46 0 ...
$ exit : num 20 17.6 13.5 20 20 ...
$ event : int 0 1 0 0 0 0 0 0 0 1 ...
$ birthdate: num 1800 1800 1800 1800 1800 ...
$ ses : Factor w/ 2 levels "lower","upper": 2 1 2 1 1 1 2 2 2 1 ...
First, there is information about the data frame: it is a data.frame, with six variables measured on 1208 objects. Then each variable is individually described: name, type, and a few of the first values. The values are usually rounded to a few digits. The Factor line is worth noticing: it is of course a factor covariate, taking two levels, lower and upper. Internally, the levels are coded 1 and 2, respectively. Then (by default) lower (with internal code equal to 1) is the reference category. If desired, this can be changed with the relevel function.
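The second structure listing below shows the result of such a change; the releveled copy can be created like this (a sketch assuming the mort data frame from the eha package):

```r
## Create ses2 as a copy of ses with "upper" as the reference category.
library(eha)
data(mort)
mort$ses2 <- relevel(mort$ses, ref = "upper")
levels(mort$ses)    # "lower" "upper"
levels(mort$ses2)   # "upper" "lower"
```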
'data.frame': 1208 obs. of 7 variables:
$ id : int 1 2 3 3 4 5 5 6 7 7 ...
$ enter : num 0 3.48 0 13.46 0 ...
$ exit : num 20 17.6 13.5 20 20 ...
$ event : int 0 1 0 0 0 0 0 0 0 1 ...
$ birthdate: num 1800 1800 1800 1800 1800 ...
$ ses : Factor w/ 2 levels "lower","upper": 2 1 2 1 1 1 2 2 2 1 ...
$ ses2 : Factor w/ 2 levels "upper","lower": 1 2 1 2 2 2 1 1 1 2 ...
Comparing the information about the variables ses and ses2, you find that the codings, and also the reference categories, are reversed. Otherwise the variables are identical, and any analysis involving either of them as an explanatory variable would lead to identical conclusions, but not identical parameter estimates, see below.
The function head prints the first few lines (observations) of a data frame (there is also a corresponding tail function that prints a few of the last rows).
id enter exit event birthdate ses ses2
1 1 0.000 20.000 0 1800.010 upper upper
2 2 3.478 17.562 1 1800.015 lower lower
3 3 0.000 13.463 0 1800.031 upper upper
4 3 13.463 20.000 0 1800.031 lower lower
5 4 0.000 20.000 0 1800.064 lower lower
6 5 0.000 0.089 0 1800.084 lower lower
Apparently, the variables ses and ses2 are identical, but we know that behind the scenes they are formally different: different coding and (therefore) different reference category. This shows up in analyses involving the two variables, one at a time.
First the analysis is run with ses.
Call:
coxreg(formula = Surv(enter, exit, event) ~ ses + birthdate,
data = mort)
Covariate Mean Coef Rel.Risk S.E. Wald p
ses
lower 0.416 0 1 (reference)
upper 0.584 -0.484 0.616 0.121 0.000
birthdate 1811.640 0.029 1.029 0.011 0.008
Events 276
Total time at risk 17038
Max. log. likelihood -1841.7
LR test statistic 23.08
Degrees of freedom 2
Overall p-value 9.72167e-06
Then the variable ses2 is inserted instead.
Call:
coxreg(formula = Surv(enter, exit, event) ~ ses2 + birthdate,
data = mort)
Covariate Mean Coef Rel.Risk S.E. Wald p
ses2
upper 0.584 0 1 (reference)
lower 0.416 0.484 1.623 0.121 0.000
birthdate 1811.640 0.029 1.029 0.011 0.008
Events 276
Total time at risk 17038
Max. log. likelihood -1841.7
LR test statistic 23.08
Degrees of freedom 2
Overall p-value 9.72167e-06
Notice the difference; it is only formal. The substantive results are completely equivalent.
3.10.1 Likelihood Ratio Test
If we judge the result by the Wald \(p\)-values, it seems as if both variables are highly significant. To be sure, it is advisable to apply the likelihood ratio test. In R, this is easily achieved by applying the drop1 function to the result, res, of the Cox regression.
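The call producing the table below could look like this (a sketch, assuming res holds the coxreg fit of ses and birthdate from the mort data):

```r
## Likelihood ratio tests by single-term deletions from the full model.
library(eha)
data(mort)
res <- coxreg(Surv(enter, exit, event) ~ ses + birthdate, data = mort)
drop1(res, test = "Chisq")
```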
Single term deletions
Model:
Surv(enter, exit, event) ~ ses + birthdate
Df AIC LRT Pr(>Chi)
<none> 3687.3
ses 1 3701.4 16.1095 5.978e-05
birthdate 1 3692.6 7.2752 0.006991
In fact, we have performed two likelihood ratio tests here. First, the effect of removing ses from the full model. The \(p\)-value for this test is very small, \(p = 0.00005978\). The scientific notation, 5.978e-05, means that the decimal point should be moved five steps to the left (that's the minus sign). The reason for this notation is essentially space-saving. Then, the second LR test concerns the absence or presence of birthdate in the full model. This also gives a small \(p\)-value, \(p = 0.006991\). The earlier conclusion is not altered; both variables are highly significant.
3.10.2 The estimated baseline cumulative hazard function
The estimated baseline cumulative hazard function (see Equation (3.7)) is best plotted by the plot command, see Figure 3.11.
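With res as the fitted coxreg object, the figure can be produced along these lines (a sketch; for a coxreg fit, plot shows the estimated baseline cumulative hazard):

```r
## Plot the estimated baseline cumulative hazard from the Cox fit.
library(eha)
data(mort)
res <- coxreg(Surv(enter, exit, event) ~ ses + birthdate, data = mort)
plot(res)   # cf. Figure 3.11
```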
FIGURE 3.11: Estimated cumulative baseline hazard function.
The figure shows the baseline cumulative hazard function for an average individual, i.e., with the value zero on all covariates, when all covariates are centered around their weighted means.
References
Cox, D. R. 1972. “Regression Models and Life Tables.” Journal of the Royal Statistical Society Series B (with Discussion) 34: 187–220.
Cox, D. R. 1975. “Partial Likelihood.” Biometrika 62: 269–76.