Chapter 6 More on Cox Regression
Vital concepts like time-dependent covariates, communal covariates, handling of ties, model checking, sensitivity analysis, etc., are introduced in this chapter.
6.1 Time-Varying Covariates
Only a special kind of time-varying covariate can be handled in R by the packages eha and survival, namely so-called piecewise constant functions. How this is done is best described by an example.
The covariate (factor) civil status (called civst in the R data frame) is an explanatory variable in a mortality study, and it changes value from 0 (unmarried) to 1 (married) at marriage. How should this be coded in the data file? The solution is to create two records (for individuals that marry), each with a fixed value of civst:
- Original record: \((t_0, t, d, x(s), t_0 < s \le t)\), married at time \(T\), \(t_0 < T < t\):
\[ \text{civst}(s) = \left\{\begin{array}{ll} \text{unmarried} , & s < T \\ \text{married}, & s \ge T \end{array}\right. \]
First new record: \((t_0, T, 0, \text{unmarried})\), always censored.
Second new record: \((T, t, d, \text{married})\).
With \(T = 30\), the data file will contain two records as shown in Table ??.
In this way, a time-varying covariate can always be handled by utilizing left truncation and right censoring. See also Figure 6.1 for an illustration.
FIGURE 6.1: A time-varying covariate (unmarried or married). Right censored and left truncated at age 30 (at marriage).
And note that this situation is formally equivalent to a situation with two individuals, one unmarried and one married. The first one is right censored at exactly the same time as the second one enters the study (left truncation). \(\ \Box\)
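A minimal sketch in R of this record splitting, with hypothetical numbers (entry at age 25, marriage at \(T = 30\), death at 35):

## the original record (enter = 25, exit = 35, event = 1) is replaced by:
rec1 <- data.frame(enter = 25, exit = 30, event = 0, civst = "unmarried")
rec2 <- data.frame(enter = 30, exit = 35, event = 1, civst = "married")
rbind(rec1, rec2)   # two rows, left truncated and right censored at T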
A word of caution: Marriage status may be interpreted as an internal covariate, i.e., the change of marital status is individual and may depend on health status. For instance, maybe only healthy persons get married, so the risk is that health status acts as a confounder in the relation between marriage and survival. Generally, the use of time-dependent internal covariates is dangerous, and one should always think of possible confounding or reverse causality when allowing for them.
6.2 Communal Covariates
Communal (external) covariates are covariates that vary in time outside any individual and are common to all individuals at each specific calendar time. In the econometric literature, such a variable is often called exogenous. It can be viewed as a special case of a time-varying covariate, but without the danger of reverse causality discussed above.
For some 19th century parishes in southern Sweden, yearly time series of food prices are available. In this example we use deviation from the log trend in annual rye prices, shown in Figure 6.2.
FIGURE 6.2: Log rye price deviations from trend, 19th century southern Sweden.
The idea behind this choice of transformation is to focus on deviations
from the “normal” as a proxy for short-term variation in economic stress.
We illustrate the idea with a subset of the built-in data set scania, and we show the information for individual No. 1.
id enter exit event birthdate sex ses
29 1 50 59.242 1 1781.454 male lower
Now, to add the food prices as a time-varying covariate, use the function make.communal. We show what happens to individual No. 1.
scand <- make.communal(scand, logrye[, 2, drop = FALSE],
                       start = 1801.75)
scand[scand$id == 1, ]
enter exit event birthdate id sex ses foodprices
1 50.000 50.296 0 1781.454 1 male lower 0.126
2 50.296 51.296 0 1781.454 1 male lower 0.423
3 51.296 52.296 0 1781.454 1 male lower -0.019
4 52.296 53.296 0 1781.454 1 male lower -0.156
5 53.296 54.296 0 1781.454 1 male lower -0.177
6 54.296 55.296 0 1781.454 1 male lower -0.085
7 55.296 56.296 0 1781.454 1 male lower -0.007
8 56.296 57.296 0 1781.454 1 male lower 0.104
9 57.296 58.296 0 1781.454 1 male lower 0.118
10 58.296 59.242 1 1781.454 1 male lower -0.197
A new variable, foodprices, is added, and each individual's duration is split into intervals of length one year (except possibly the first and last). This is the way to treat foodprices as a time-varying covariate. The analysis is then straightforward.
fit <- coxreg(Surv(enter, exit, event) ~ ses + sex + foodprices,
              data = scand)
| Covariate | level | W_mean | Coef | HR | SE | LR_p |
|---|---|---|---|---|---|---|
| ses | | | | | | 0.6869 |
| | upper | 0.202 | 0 | 1 | | |
| | lower | 0.798 | -0.0304 | 0.9701 | 0.0752 | |
| sex | | | | | | 0.5004 |
| | male | 0.504 | 0 | 1 | | |
| | female | 0.496 | 0.0411 | 1.0419 | 0.0610 | |
| foodprices | | 0.002 | 0.3206 | 1.3779 | 0.1816 | 0.0775 |
| Max Log Likelihood | | | | | | -7184.1 |
Socio-economic status and sex do not matter much, but food prices do; the effect is almost significant at the 5 per cent level. See Table 6.1. \(\ \Box\)
6.3 Tied Event Times
Tied event times cannot, in theory, occur with continuous-time data, but it is impossible to measure durations and life lengths with infinite precision. Data are always more or less rounded, and tied event times occur frequently in practice. This may cause problems (biased estimates) if ties occur too frequently. There are a few ways to handle tied data; the so-called exact method considers all possible permutations of the tied event times in each risk set. It works as shown in the following example.
\(R_i = \{1, 2, 3\}\), 1 and 2 are events; two possible orderings:
\[\begin{eqnarray*} L_i(\boldsymbol{\beta}) & = & \frac{\psi(1)}{\psi(1) + \psi(2) + \psi(3)} \times \frac{\psi(2)}{\psi(2) + \psi(3)} \\ & + & \frac{\psi(2)}{\psi(1) + \psi(2) + \psi(3)} \times \frac{\psi(1)}{\psi(1) + \psi(3)} \\ & = & \frac{\psi(1)\psi(2)}{\psi(1) + \psi(2) + \psi(3)} \left\{\frac{1}{\psi(2) + \psi(3)} + \frac{1}{\psi(1) + \psi(3)}\right\} \end{eqnarray*}\] \(\ \Box\)
The main drawback with the exact method is that it easily becomes unacceptably slow, because of the huge number of permutations that may be necessary to consider. It is, however, available as an option in the function coxph in the survival package.
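As a sketch, using the birth-interval data frame first created below, the exact method would be requested through the ties argument (older versions of survival call it method):

fit.x <- survival::coxph(Surv(next.ivl, event) ~ year + age,
                         data = first, ties = "exact")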
Fortunately, there are a few excellent approximations, most notably Efron's, which is the default method in most survival packages in R. Another common approximation, due to Breslow, is the default in some statistical software and can also be chosen in the eha and survival packages in R. Finally, there is no cost involved in using these approximations in the case of no ties at all; they will all give identical results.
With too many ties in the data, there is always the possibility of using discrete-time methods. One simple way of doing that is to use the option method = "ml" in the coxreg function in the R package eha.
We look at the lengths of birth intervals between first and second births for married women in 19th century northern Sweden, a subset of the data set fert, available in the eha package. Four runs with coxreg are performed, with all the possible ways of treating ties.
library(eha)
first <- fert[fert$parity == 1, ]
## Default method (not necessary to specify method = "efron"):
fit.e <- coxreg(Surv(next.ivl, event) ~ year + age, data = first,
                method = "efron")
## Breslow:
fit.b <- coxreg(Surv(next.ivl, event) ~ year + age, data = first,
                method = "breslow")
## The hybrid mppl:
fit.mp <- coxreg(Surv(next.ivl, event) ~ year + age, data = first,
                 method = "mppl", coxph = FALSE)
## True discrete:
fit.ml <- coxreg(Surv(next.ivl, event) ~ year + age, data = first,
                 method = "ml", coxph = FALSE)
Then we compose a table of the four results, see Table ??.
There is almost no difference in the results between the four methods. For the exact method mentioned above, the number of permutations is \(\sim 7 \times 10^{192}\), which is impossible to handle.\(\ \Box\)
With the help of the function risksets in eha, it is easy to check the composition of risk sets in general and the frequency of ties in particular.
rs <- risksets(Surv(first$next.ivl, first$event))
tt <- table(rs$n.event)
tt
1 2 3 4 5 6 7 9
369 199 133 60 27 14 2 2
This output says that 369 risk sets have only one event each (no ties), 199 risk sets have exactly two events each, etc.
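For instance, the fraction of risk sets that contain tied event times is obtained directly:

mean(rs$n.event > 1)   # proportion of risk sets with at least one tie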
The object rs is a list with seven components. Two of them are n.event, which counts the number of events in each risk set, and size, which gives the size (number of individuals under observation) of each risk set. Both these vectors are ordered by risktimes, another component of the list. For the rest of the components, see the help page for risksets.
It is easy to produce the Nelson-Aalen plot with the output from risksets, see Figure 6.3.
par(las = 1)
plot(rs$risktimes, cumsum(rs$n.event / rs$size), type = "s",
xlab = "Duration (years)", ylab = "Cum. hazards")
abline(h = 0)
FIGURE 6.3: Nelson-Aalen plot with risksets.
One version of the corresponding survival function is shown in Figure 6.4.
par(las = 1)
sur <- exp(-cumsum(rs$n.event / rs$size))
plot(rs$risktimes, sur, type = "s",
xlab = "Duration (years)", ylab = "Surviving fraction")
abline(h = 0)
FIGURE 6.4: Survival plot with the aid of the function risksets.
6.4 Stratification
Stratification means that data are split up into groups called strata, and a separate partial likelihood function is created for each stratum, but with common values of the regression parameters corresponding to the common explanatory variables. In the estimation, these partial likelihoods are multiplied together, and the product is treated as a likelihood function. Thus there is one restriction on the parameters: they are the same across strata.
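In symbols, with strata \(s = 1, \ldots, S\), the function that is maximized is
\[\begin{equation*}
L(\boldsymbol{\beta}) = \prod_{s=1}^{S} L_s(\boldsymbol{\beta}),
\end{equation*}\]
where \(L_s\) is the ordinary partial likelihood formed within stratum \(s\); each stratum carries its own baseline hazard function, while \(\boldsymbol{\beta}\) is common to all strata.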
There are typically two reasons for stratification. First, if the proportionality assumption does not hold for a factor covariate, a way out is to stratify on it. Second, a factor may have too many levels, so that it is inappropriate to treat it as an ordinary factor. This argument is similar to the one about using a frailty model (Chapter 10). Stratification is also a useful tool with matched data, see Chapter 11.
When a factor does not produce proportional hazards between categories, stratify on the categories. For tests of the proportionality assumption, see Section 6.7.1.
For the birth interval data, we stratify on ses in the Cox regression:
library(eha)
source("R/fit.out.R")
fert1 <- fert[fert$parity == 1, ]
levels(fert1$parish) <- c("Jörn", "Norsjö", "Skellefteå")
fert1$parish <- relevel(fert1$parish, ref = "Skellefteå")
fert1$age.25 <- fert1$age - 25 # Center
fert1$year.1860 <- fert1$year - 1860 # Center
fit <- coxreg(Surv(next.ivl, event) ~ age.25 + year.1860 +
              prev.ivl + parish + strata(ses),
              data = fert1)
fit.out(fit, caption = "Birth intervals, 19th century Skellefteå.",
        label = "strafr6")
TABLE 6.2: Birth intervals, 19th century Skellefteå.

| Covariate | level | W_mean | Coef | HR | SE | LR_p |
|---|---|---|---|---|---|---|
| age.25 | | 2.151 | -0.0227 | 0.9776 | 0.0057 | 0.0001 |
| year.1860 | | -1.336 | -0.0007 | 0.9993 | 0.0022 | 0.7504 |
| prev.ivl | | 1.309 | -0.2245 | 0.7989 | 0.0292 | 0.0000 |
| parish | | | | | | 0.0093 |
| | Skellefteå | 0.937 | 0 | 1 | | |
| | Jörn | 0.010 | 0.6196 | 1.8581 | 0.2144 | |
| | Norsjö | 0.053 | 0.1869 | 1.2055 | 0.1087 | |
| Max Log Likelihood | | | | | | -9047.3 |
The results in Table 6.2 show that calendar time is unimportant, while mother's age and the length of the previous interval (from marriage to first birth) both matter: the birth intensity decreases with both. Also note that the continuous covariates are centered around reasonable values in their respective ranges. This does not affect the rest of the table in this case, but with interactions involving either covariate it would.
Next we plot the fit, see Figure 6.5. The non-proportionality pattern is clearly visible. Note the line par(lwd = 1.5): it magnifies the line widths by 50 per cent in the plot, compared to the default. The values on the \(y\) axis refer to a 25-year-old mother giving birth in Skellefteå in 1860.
par(lwd = 1.5, cex = 0.8)
plot(fit, ylab = "Cumulative hazards", xlab = "Years")
FIGURE 6.5: Cumulative hazards by socio-economic status, second birth interval.
6.5 Sampling of Risk Sets
Some of the work by Langholz and Borgan (1995) is implemented in eha. The idea behind sampling of risk sets is that with huge data sets, each risk set will contain a few (often only one) events but many survivors. From an efficiency point of view, not much is lost by using only a random fraction of the survivors in each risk set. The gain is of course computational speed and memory savings. How it is done in eha is shown by an example.
For the male mortality data set, we compare a full analysis with one where only four survivors are sampled in each risk set.
fit <- coxreg(Surv(enter, exit, event) ~ ses,
              data = mort)
fit.4 <- coxreg(Surv(enter, exit, event) ~ ses,
                max.survs = 4, data = mort)
f1 <- coefficients(summary(fit))[c(1, 3)]
f4 <- coefficients(summary(fit.4))[c(1, 3)]
out <- rbind(f1, f4)
colnames(out) <- c("Coef", "se(Coef)")
rownames(out) <- c("Original", "Sample")
round(out, 4)
Coef se(Coef)
Original -0.4795 0.1207
Sample -0.5442 0.1357
The results are comparable and a gain in execution time was noted. Of course, with the small data sets we work with here, the difference is of no practical importance.\(\ \Box\)
It is worth noting that the function risksets has an argument max.survs which, when set, samples survivors for each risk set in the data. The output can then be used as input to coxreg; see the relevant help pages for more information.
6.6 Residuals
Residuals are generally the key to judging how well a model fits the data. It is easy to show how this works with linear regression.
Figure 6.6 shows a scatter plot of bivariate data and a fitted straight line (left panel) and residuals versus fitted values (right panel).
FIGURE 6.6: Residuals in linear regression.
The residuals are the vertical distances between the line and the points.\(\ \Box\)
Unfortunately, this kind of simple and intuitive graph does not exist for results from a Cox regression analysis. However, there are a few ideas how to accomplish something similar. The main idea goes as follows. If \(T\) is a survival time, and \(S(t)\) the corresponding survivor function (continuous time), then it is a mathematical fact that \(U = S(T)\) is uniformly distributed on the interval \((0, 1)\). Continuing, this implies that \(-\log(U)\) is exponentially distributed with rate (parameter) 1.
It thus follows that \(H(T)\), where \(H\) is the cumulative hazard function of \(T\), is exponentially distributed with rate 1, since \(H(t) = -\log S(t)\). This motivates the Cox-Snell residuals: given a sample \(T_1, \ldots, T_n\) of survival times,
\[\begin{equation} r_{Ci} = \hat{H}(T_i; \mathbf{x}_i) = \exp(\boldsymbol{\beta} \mathbf{x}_i)\hat{H}_0(T_i),\quad i = 1, \ldots, n, \end{equation}\] which should behave like a censored sample from an exponential distribution with parameter 1, if the model is good. If, for some \(i\), \(T_i\) is censored, so is the residual.
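A sketch of how this can be checked in R, assuming a fitted coxph model fit on a data frame dat with event indicator event (names hypothetical), and using the relation \(r_{Ci} = \delta_i - r_{Mi}\) to the martingale residuals defined in the next subsection:

rC <- dat$event - residuals(fit, type = "martingale")
## treated as censored survival data, the residuals should look like a
## unit-rate exponential sample: -log of their Kaplan-Meier estimate
## should follow a straight line through the origin with slope 1
sfit <- survival::survfit(Surv(rC, dat$event) ~ 1)
plot(sfit$time, -log(sfit$surv), type = "s",
     xlab = "Cox-Snell residual", ylab = "Cumulative hazard")
abline(0, 1, lty = 2)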
6.6.1 Martingale residuals
A drawback with the Cox-Snell residuals is that they contain censored values. An attempt to correct the censored residuals leads to the so-called martingale residuals. The idea is simple; it builds on the "lack-of-memory" property of the exponential distribution: the expected value is 1, and by adding 1 to the censored residuals, they become predictions (estimates) of the corresponding uncensored values. Then finally a twist is added: subtract 1 from all values and multiply by \(-1\), leading to the definition of the martingale residuals,
\[\begin{equation} r_{Mi} = \delta_i - r_{Ci}, i = 1, \ldots, n. \end{equation}\] where \(\delta_i\) is the indicator of event for the \(i\):th observation. These residuals may be interpreted as the Observed minus the Expected number of events for each individual. Direct plots of the martingale residuals tend to be less informative, especially with large data sets.
The data set kidney from the survival package is used. For more information about it, read its help page in R. For each patient, two survival times are recorded, but only one will be used here. That is accomplished with the function duplicated.
library(survival)
head(kidney)
id time status age sex disease frail
1 1 8 1 28 1 Other 2.3
2 1 16 1 28 1 Other 2.3
3 2 23 1 48 2 GN 1.9
4 2 13 0 48 2 GN 1.9
5 3 22 1 32 1 Other 1.2
6 3 28 1 32 1 Other 1.2
k1 <- kidney[!duplicated(kidney$id), ]
fit <- coxreg(Surv(time, status) ~ disease + age + sex,
              data = k1)
The extractor function residuals is used to extract the martingale residuals from the fit; see Figure 6.7.
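A sketch of how such a figure can be produced from the fit above:

rr <- residuals(fit)   # martingale residuals
plot(rr, ylab = "Martingale residuals")
abline(h = 0)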
FIGURE 6.7: Martingale residuals from a fit of the kidney data.
With a large data set, the plot will be hard to interpret, see Figure 6.8, which shows the martingale residuals from a fit of the male mortality (mort) data in the package eha.
FIGURE 6.8: Martingale residuals from a fit of the male mortality data.
It is clear that the residuals are grouped in two clusters. The positive ones are connected to the observations with an event, while the negative ones correspond to the censored observations. It is hard to draw any conclusions from plots like this one. However, the residuals are used in functions that evaluate goodness-of-fit, for instance the function cox.zph in the survival package. \(\ \Box\)
6.7 Checking Model Assumptions
6.7.1 Proportionality
The proportionality assumption is extremely important to check in the proportional hazards models. We have already seen how violations of the assumption can be dealt with (stratification).
We exemplify with the fert data set in the eha package.
In order to keep observations reasonably independent, we concentrate on one specific birth interval per woman, the interval between the second and third births. That is, our sample consists of all women with at least two births, and we monitor each woman from the day of her second delivery until the third or, if that never happens, throughout her observable fertile period, i.e., to age 50 or to loss of follow-up. The latter results in a right-censored interval.
fert2 <- fert[fert$parity == 2, ]
Then we save the fit from a Cox regression performed by the function coxph in the survival package (important!),
fit <- survival::coxph(Surv(next.ivl, event) ~ ses + age +
                       year + parish, data = fert2)
and check the proportionality assumption. This is done with the function cox.zph in the survival package.
prop.full <- survival::cox.zph(fit)
prop.full
chisq df p
ses 9.76 3 0.0207
age 7.14 1 0.0075
year 2.34 1 0.1259
parish 1.39 2 0.4984
GLOBAL 18.73 7 0.0091
In the table above, look first at the last row, GLOBAL. It is the result of a test of the global null hypothesis that proportionality holds. The small \(p\)-value tells us that we have a big problem with the proportionality assumption. Looking further up, we see two possible problems, the variables age and ses. Unfortunately, age is a continuous variable. A categorical variable can be stratified upon, but a continuous one we have to categorize first.
Let us first investigate the effect of stratifying on ses,
fit1 <- coxph(Surv(next.ivl, event) ~ strata(ses) +
              age + year + parish, data = fert2)
and plot the result, see Figure 6.9.
FIGURE 6.9: Stratification on socio-economic status, third birth interval data.
The non-proportionality is visible: some curves cross. The test of proportionality of the stratified model shows that we still have a problem with age.
fit1.zph <- survival::cox.zph(fit1)
print(fit1.zph)
chisq df p
age 7.85 1 0.0051
year 1.61 1 0.2045
parish 1.47 2 0.4793
GLOBAL 9.19 4 0.0564
Since age is a continuous covariate, we may have to categorize it. First its distribution is checked with a histogram, see Figure 6.10.
hist(fert2$age, main = "", xlab = "age")
FIGURE 6.10: The distribution of mother’s age, third birth interval data.
The range of age may reasonably be split into four equal-length intervals with the cut function,
fert2$qage <- cut(fert2$age, 4)
fit2 <- survival::coxph(Surv(next.ivl, event) ~ strata(ses) +
                        qage + year + parish, data = fert2)
and then the proportionality assumption is tested again.
fit2.zph <- survival::cox.zph(fit2)
fit2.zph
chisq df p
qage 7.589 3 0.055
year 0.758 1 0.384
parish 1.052 2 0.591
GLOBAL 8.536 6 0.201
The result is now reasonable, with the high age group deviating slightly. This is not so strange; fertility becomes essentially zero in the upper forties, and therefore it is natural that the proportionality assumption is violated there. One way of dealing with the problem, apart from stratification, would be to analyze the age groups separately, as sketched below. \(\ \Box\)
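A sketch of the latter alternative, separate analyses by age group (reusing the variables created above):

fits <- lapply(split(fert2, fert2$qage), function(dd)
    survival::coxph(Surv(next.ivl, event) ~ ses + year + parish,
                    data = dd))
lapply(fits, coef)   # one coefficient vector per age group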
6.7.2 Log-linearity
In the Cox regression model, the effect of a covariate on the hazard is log-linear, that is, it affects the log hazard linearly. Sometimes a covariate needs to be transformed to fit into this pattern. The first question is then: should the covariate be transformed in the analysis? A graphical check can be made by plotting the martingale residuals from a null fit against the covariate, see Figure 6.11.
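A sketch of how such a figure can be produced, with simulated data in which the true effect is quadratic (all names and numbers here are illustrative):

set.seed(1)
n <- 500
x <- runif(n, -1, 1)
time <- rexp(n, rate = exp(2 * x^2))   # hazard increases with x^2
event <- as.numeric(time < 2)          # right censoring at time 2
time <- pmin(time, 2)
fit <- survival::coxph(Surv(time, event) ~ 1)   # the null fit
plot(x, residuals(fit), xlab = "x", ylab = "Martingale residuals")
lines(lowess(x, residuals(fit)))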
FIGURE 6.11: Plot of residuals against the explanatory variable \(x\).
The function lowess fits a "smooth" curve to a scatterplot.
The curvature is clearly visible; let us see what happens if we make the plot with \(x^2\) instead of \(x\). (Note that we are now cheating!) See Figure 6.12.
par(las = 1)
plot(x^2, residuals(fit), xlab = expression(x^2),
     ylab = "Martingale residuals")
lines(lowess(x^2, residuals(fit)))
FIGURE 6.12: Plot of residuals against the explanatory variable \(x^2\).
This plot indicates more clearly that an increasing \(x^2\) is associated with an increasing risk of the event occurring.
6.8 Fixed Study Period Survival
Sometimes survival up to some fixed time is of interest. For instance, in medicine five-year survival may be used as a measure of the success of a treatment. The measure is then the probability for a patient to survive five years after treatment.
This is simple to implement: it is a binary experiment, where for each patient survival or not is recorded (plus covariates, such as treatment). The only complicating factor is incomplete observations, i.e., right censoring; in that case it is recommended to use ordinary Cox regression anyway. Otherwise, logistic regression with a suitable link function works well. Of course, as mentioned earlier, the cloglog link is what gets you closest to the proportional hazards model, but usually it makes very little difference. See Kalbfleisch and Prentice (2002), pages 334–336, for slightly more detail.
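A minimal sketch with hypothetical data (dead5 indicating death within five years, treat a treatment factor), using the cloglog link:

dd <- data.frame(dead5 = rbinom(200, 1, 0.3),
                 treat = gl(2, 100, labels = c("control", "treated")))
fit5 <- glm(dead5 ~ treat, data = dd,
            family = binomial(link = "cloglog"))
summary(fit5)$coefficients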
6.9 Left or Right Censored Data
Sometimes the available data are either left or right censored, i.e., for each individual \(i\) we only know whether it survived time \(t_i\) or not, giving data \[(t_i, \delta_i), \quad i = 1, \ldots, n, \] with \(\delta_i = 1\) if individual \(i\) survived (died after \(t_i\)) and \(\delta_i = 0\) if not (died before \(t_i\)). It is still possible to estimate the survivor function \(S\) nonparametrically. The likelihood function is \[\begin{equation*} L(S) = \prod_{i=1}^n \left\{S(t_i)\right\}^{\delta_i} \left\{ 1 - S(t_i)\right\}^{1 - \delta_i}. \end{equation*}\] How to maximize this under the restrictions \(t_i < t_j \Rightarrow S(t_i) \ge S(t_j)\) with the EM algorithm is shown by Groeneboom and Wellner (1992); a direct solution is sketched below.
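A sketch in R, on hypothetical (simulated) data, assuming the standard result that the maximizer of \(L(S)\) is given by isotonic regression of \(1 - \delta_i\) on \(t_i\) (here via the base R function isoreg):

## hypothetical data: t = inspection time, delta = 1 if alive at t
set.seed(2)
t <- sort(runif(100, 0, 3))
delta <- as.numeric(rexp(100) > t)
## the NPMLE of F = 1 - S is the nondecreasing fit to 1 - delta:
Fhat <- isoreg(t, 1 - delta)$yf
plot(t, 1 - Fhat, type = "s", xlab = "t", ylab = "Estimated S(t)")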
6.10 The Weird Bootstrap
In small to medium-sized samples, the estimated parameters and their standard errors may not be so reliable, and in such cases the bootstrap technique can be useful. There are a couple of ways to implement it in Cox regression; here the weird bootstrap is presented.
The resampling procedure is done risk set by risk set. If a specific risk set of size \(n\) contains \(d\) events, \(d\) "new" events are drawn from the \(n\) members of the risk set, without replacement, and with probabilities proportional to their scores. Then the regression parameter is estimated and saved, forming one bootstrap replicate. This procedure is repeated a suitable number of times.
The “weird” part of the resampling procedure is the fact that it is performed independently over risk sets, that is, one individual who is present in two risk sets may well have an event (for instance, die) in both risk sets. That is, the risk sets are treated as independent entities in this respect. See Andersen et al. (1993) for more detail.
The weird bootstrap is implemented in the coxreg function in the eha package.
As an example, child mortality in 19th century Sweden is analyzed with
respect to the sex difference in mortality.
fit <- coxreg(Surv(enter, exit, event) ~ sex, boot = 300,
              data = child)
b_hat <- fit$coefficients[1]
b_se <- sqrt(fit$var[1, 1])
b_sample <- fit$bootstrap[1, ]
This fitting procedure takes a fairly long time, around 19 minutes on a well-equipped laptop (2018). The bootstrap result is a matrix where the number of rows is equal to the number of estimated parameters (one in this case), and the number of columns is equal to the number of bootstrap replicates (300 in this case). The child data set is fairly large, with 26574 children and 5616 deaths. The number of risk sets is 2497, with sizes between 20000 and 27000, and up to 60 events in one risk set.
Now it is up to the user to do whatever she usually does with bootstrap samples. For instance, a start could be to get a grip of the distribution of the bootstrap sample around the parameter estimate, see Figure 6.13.
par(las = 1)
plot(density(b_sample - b_hat, bw = 0.005),
     xlab = "Deviation", main = "")
abline(h = 0, v = 0)
FIGURE 6.13: Bootstrap sample distribution around the estimate.
The mean of the bootstrap sample is \(-0.084\) and its sample standard deviation is 0.019. We can compare with the output of the standard Cox regression, see Table 6.3.
| Covariate | level | W_mean | Coef | HR | SE | LR_p |
|---|---|---|---|---|---|---|
| sex | | | | | | 0.0018 |
| | male | 0.510 | 0 | 1 | | |
| | female | 0.490 | -0.0833 | 0.9201 | 0.0267 | |
| Max Log Likelihood | | | | | | -56509.9 |
The mean is very close to the actual estimate, but the bootstrap standard error is slightly smaller than the estimated standard error.
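A percentile-type bootstrap confidence interval for the coefficient is, for instance, obtained directly from the sample:

quantile(b_sample, probs = c(0.025, 0.975))   # 95% percentile interval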
For more on the bootstrap, see Efron and Tibshirani (1993), Hinkley (1988), and Hall and Wilson (1991).