Chapter 8 Parametric Models
We have already seen one example of a parametric survival model, the piecewise constant hazards model in the previous chapter. Apart from that, we have so far studied nonparametric methods for survival analysis, a tradition that has its roots in medical (cancer) research. In technical applications, on the other hand, parametric models dominate; of special interest is the Weibull model. The textbook by is a good source for the study of parametric survival models. More technical detail about parametric distributions is found in Appendix B.
Three kinds of parametric models are considered here: proportional hazards and accelerated failure time models in continuous time, and proportional hazards models in discrete time.
8.1 Proportional Hazards Models
A proportional hazards family of distributions is generated from one specific continuous distribution by multiplying the hazard function of that distribution by a strictly positive constant, and letting that constant vary over the full positive real line. So, if \(h_0\) is the hazard function corresponding to the generating distribution, a hazard function \(h_1\) belongs to the family if

\[\begin{equation*} h_1(t) = c h_0(t) \quad \text{for some } c > 0, \; \text{and all } t > 0. \end{equation*}\]

Note that it is possible to choose any hazard function (on the positive real line) as the generating function. The resulting proportional hazards class of distributions may or may not be a well-recognized family of distributions.
In R, eha is one of the packages that can fit parametric proportional hazards models. In the following subsections, the possibilities are examined.

The parametric distributions that naturally can be used as the baseline distribution in the function phreg are the Weibull, extreme value, and Gompertz distributions. The piecewise constant hazards model is treated by the functions pchreg (individual data) and tpchreg (tabular data). The lognormal and loglogistic distributions are also included as possible choices; they allow for hazard functions that first increase to a maximum and then decrease, whereas the other distributions all have monotone hazard functions. However, since these two families are not closed under proportional hazards without artificially adding a third, "proportionality," parameter, they are not discussed here (regard these possibilities as experimental). It is better to combine the lognormal and loglogistic distributions with accelerated failure time modeling, where they fit in naturally.
See Figure 8.1 for Weibull and Gompertz hazard functions with selected parameter values.
FIGURE 8.1: Selected hazard functions.
We note in passing that the fourth case, the Gompertz model with negative rate parameter, does not represent a true survival distribution, because the hazard function decreases too fast: There will be a positive probability of eternal life.
Experience shows that the Gompertz distribution fits adult mortality very well, in the ages 30 to 85, say. The modeling of mortality from birth to early adulthood, on the other hand, is demanding since the typical hazard function for all these ages is U-shaped with high infant mortality and relatively low child mortality. Since both the Weibull and the Gompertz distributions have a monotone hazard function, neither is suitable to fit the mortality of the full human life span. However, both distributions are suitable for fitting shorter pieces of the life span, and for longer spans there are two possibilities, a nonparametric model (Cox regression) and a piecewise constant hazards model, where the former can be seen as a limiting case of the latter. More about that later.
8.1.1 The Weibull model
The family of Weibull distributions may be defined by the family of hazard functions
\[\begin{equation} h(t; p, \lambda) = \frac{p}{\lambda} \biggl(\frac{t}{\lambda}\biggr)^{p-1}, \quad t, p, \lambda > 0. \tag{8.1} \end{equation}\]
If we start with
\[ h_1(t) = t^{p-1}, \quad p > 0, \; t > 0 \]
for a fixed \(p\), and generate a proportional hazards family from there,
\[\begin{equation*} h_c(t) = c h_1(t), \quad c, t > 0, \end{equation*}\]
we get
\[\begin{equation}\label{eq:6weibdist} h_c(t) = c t^{p-1} = \frac{p}{\lambda} \biggl(\frac{t}{\lambda}\biggr)^{p-1} \end{equation}\]
by setting
\[ c = \frac{p}{\lambda^p}, \]
which shows that for each fixed \(p > 0\), a proportional hazards family is generated by varying \(\lambda\) in equation (8.1). On the other hand, if we pick two members from the family defined by equation (8.1) with different values of \(p\), their hazard functions are not proportional.

To summarize, the Weibull family of distributions is not one proportional hazards family, but a collection of proportional hazards families, indexed by \(p > 0\). It is true, though, that each of these proportional hazards families stays within the Weibull family of distributions.
The proportional hazards regression model with a Weibull baseline distribution is obtained by multiplying (8.1) by \(\exp(\beta x)\):
\[\begin{equation} h(t; x, \lambda, p, \beta) = \frac{p}{\lambda} \biggl(\frac{t}{\lambda}\biggr)^{p-1} e^{\beta x}, \quad t > 0. \tag{8.2} \end{equation}\]
The function phreg in the package eha fits models given by equation (8.2) by default.
8.1.2 The Gompertz distribution
The Gompertz families of distributions are defined in essentially two ways in the R package eha: the rate and the canonical representations. The reason for this duality is that the families need to be represented differently depending on whether proportional hazards or accelerated failure time models are under consideration. In the proportional hazards case the rate formulation is used; it is characterized by an exponentially increasing hazard function with fixed rate \(r\):
\[\begin{equation} h(t; p, r) = p e^{r t}, \quad p, t > 0; -\infty < r < \infty. \end{equation}\]
As noted earlier, when \(r < 0\) the hazard function decreases "too fast" to define a proper survival function, and \(r = 0\) gives the exponential distribution as a special case. For each fixed \(r\), the family of distributions indexed by \(p > 0\) constitutes a proportional hazards family, and the corresponding regression model is written as
\[\begin{equation} h(t; x, p, r, \beta) = p e^{r t} e^{\beta x}, \quad t > 0. \end{equation}\]
8.1.3 Application
The data set oldmort in the R package eha contains life histories of people born between June 28, 1765 and December 31, 1820 in Skellefteå, followed from their 60th birthday to their 100th, or until death. The data set is described in detail in Chapter 1. The variable enter is the age at the start of the given interval, and exit contains the age at the end of the interval. We need to calculate follow-up time since age 60, so a new data frame, olm, is created as a copy of oldmort, and then 60 is subtracted from enter and exit.
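In code, the copy-and-shift step might look like this (a sketch, assuming the eha package and its oldmort data are available):

```r
library(eha)   # provides the oldmort data

olm <- oldmort                # work on a copy, keep the original intact
olm$enter <- olm$enter - 60   # follow-up time since age 60
olm$exit  <- olm$exit - 60
```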
See the result in Table ??, where the most relevant variables for our purpose are shown. The variable names are more or less self-explanatory: enter and exit are time in years since the sixtieth birthday, age is age at the start of follow-up (the original enter variable), and event is an indicator of death at the duration given by exit.
To have something to compare with, a Cox regression is performed first, see Table 8.1. Two covariates are included in the model, sex and region. Both are categorical: region with three categories, town (reference), industry, and rural, and sex with female as reference category.
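A fit like the one summarized in Table 8.1 might be produced with eha's coxreg (a sketch; the printed table layout itself comes from the book's own summary code):

```r
library(eha)

fit.cox <- coxreg(Surv(enter - 60, exit - 60, event) ~ sex + region,
                  data = oldmort)
summary(fit.cox)
```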
Covariate | level | W_mean | Coef | HR | SE | LR_p |
---|---|---|---|---|---|---|
sex | | | | | | 0.0001 |
 | female | 0.594 | 0 | 1 | | |
 | male | 0.406 | 0.1867 | 1.2052 | 0.0459 | |
region | | | | | | 0.0013 |
 | town | 0.111 | 0 | 1 | | |
 | industry | 0.326 | 0.2120 | 1.2361 | 0.0849 | |
 | rural | 0.563 | 0.0516 | 1.0530 | 0.0827 | |
Max Log Likelihood | | -13563.1 | | | | |
Then a Weibull model is fitted, see Table 8.2.
    oldmort$sex <- relevel(oldmort$sex, ref = "female")
    fit <- phreg(Surv(enter - 60, exit - 60, event) ~ sex + region,
                 dist = "weibull", data = oldmort)
Covariate | level | W_mean | Coef | HR | SE | LR_p |
---|---|---|---|---|---|---|
sex | | | | | | 0.0005 |
 | female | 0.594 | 0 | 1 | | |
 | male | 0.406 | 0.1587 | 1.1720 | 0.0458 | |
region | | | | | | 0.0001 |
 | town | 0.111 | 0 | 1 | | |
 | industry | 0.326 | 0.2491 | 1.2829 | 0.0848 | |
 | rural | 0.563 | 0.0525 | 1.0539 | 0.0826 | |
Max Log Likelihood | | -7421.1 | | | | |
A closer look at the estimated regression coefficients shows that they are not very close; in Table ?? they are put side by side for easier comparison.
Let us compare the estimated cumulative baseline hazard functions, see Figure 8.2.
FIGURE 8.2: Baseline cumulative hazards for Cox and Weibull regressions.
This is not a good fit; it seems as if the Weibull hazard cannot grow fast enough. A better approach is to fit a Gompertz distribution and check the parameter and baseline hazard estimates, see Figure 8.3 and Table ??.
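The Gompertz fit might be obtained as follows (a sketch, using the same shifted time scale and assuming eha is loaded):

```r
library(eha)

fit.g <- phreg(Surv(enter - 60, exit - 60, event) ~ sex + region,
               dist = "gompertz", data = oldmort)
```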
FIGURE 8.3: Baseline cumulative hazards for Cox and Gompertz regressions.
The Gompertz model fits the baseline hazard very well up until duration 30 (age 90), but after that the exponential growth slows down. The early growth rate is 9.5 per cent per year.
The result of fitting the Gompertz model is shown in Table 8.3.
Covariate | level | W_mean | Coef | HR | SE | LR_p |
---|---|---|---|---|---|---|
sex | | | | | | 0.0000 |
 | female | 0.594 | 0 | 1 | | |
 | male | 0.406 | 0.1897 | 1.2089 | 0.0458 | |
region | | | | | | 0.0015 |
 | town | 0.111 | 0 | 1 | | |
 | industry | 0.326 | 0.2066 | 1.2295 | 0.0849 | |
 | rural | 0.563 | 0.0468 | 1.0480 | 0.0826 | |
Max Log Likelihood | | -7280.9 | | | | |
The Gompertz and Cox models are very close, both regarding regression parameter estimates and baseline hazard functions.
8.1.4 The parametric model with left truncation
The data set oldmort contains left-truncated life histories as a consequence of using age as the time scale. In the presentation above we chose to change the time scale so that the origin was age 60 (sharp). This is of no importance when fitting the semiparametric Cox regression: an additive change of time scale only shifts the estimated cumulative hazards along the x-axis.
But for parametric models it matters, and we illustrate it by repeating the previous Weibull, Gompertz, and Cox regression analyses with the original time scale. The consequence is that focus is shifted from a conditional (on survival until 60) analysis to an unconditional, where the baseline hazard and regression coefficients are estimated for the full life span (0–100 years of age).
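The refits on the original age scale might look like this (a sketch, assuming eha is loaded):

```r
library(eha)

## Same models as before, but on the original (age) time scale:
fit.w.age <- phreg(Surv(enter, exit, event) ~ sex + region,
                   dist = "weibull", data = oldmort)
fit.g.age <- phreg(Surv(enter, exit, event) ~ sex + region,
                   dist = "gompertz", data = oldmort)
fit.c.age <- coxreg(Surv(enter, exit, event) ~ sex + region,
                    data = oldmort)
```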
Compare these fitted coefficients with the earlier ones in Table ??. For the Gompertz and Cox regression models the coefficients are identical, while they differ for the Weibull distribution.
Let us look at the baseline cumulative hazards, see Figure 8.4, where the graph is cut at ages 60 and 80 for clarity. Notice the value at Time = 60: The parametric models are “extrapolating” back to time at birth, and so the estimates do not represent the cumulative hazards of the conditional distribution, given survival to age 60. This has an impact on the estimates of the regression coefficients in the case of the Weibull distribution, because the conditional is not Weibull even though the unconditional is. This phenomenon does not apply to the Gompertz distribution, for which the conditional distribution is again Gompertz with the same rate, but with a different level.
FIGURE 8.4: Gompertz vs Cox and Weibull vs Cox estimated cumulative hazards functions.
From Figure 8.4 it appears as if we can get the conditional cumulative hazards simply by subtracting the value at age 60, \(H(60)\), from the whole curves, and that is in fact correct. In the Gompertz case, it would simply recover Figure 8.3, but the Weibull case is different: Starting with Figure 8.2 and adding the adjusted Weibull curve from Figure 8.4 we get Figure 8.5.
FIGURE 8.5: Comparison of the conditional and unconditional Weibull models.
Obviously, the conditional Weibull distribution fits data much better than the unconditional one. The comparison with the Gompertz distribution is shown in Figure 8.6.
FIGURE 8.6: Comparison of the Gompertz and conditional Weibull models.
It seems as if the conditional Weibull model fits the data as well as or even better than the Gompertz model. The latter grows too fast at the very highest ages, an observation made in many recent studies (Rootzén and Zholud 2017; Lenart and Vaupel 2017; Barbi et al. 2018; Broström 2019).
8.1.5 The piecewise constant proportional hazards model
The piecewise constant hazard (pch) model can always be used in the fitting process with good results. There is, however, a moment of uncertainty in the process: how should time be cut into pieces, and how many pieces should there be? Two possible strategies are (i) equally spaced cut points, and (ii) relatively more cut points where there are many deaths, that is, where the hazard function is expected to be steep.
The oldmort data set spans a time interval of 40 years, and we know that mortality in the age interval 60–100 increases almost exponentially with age, suggesting more cut points at the high ages. Against that is the fact that at very high ages, say 90–100, few observations remain; most individuals have already died.
We may start with eight intervals of equal length, 60–65, 65–70, …, 95–100, and fit a pch model with the aid of the function pchreg in eha. The result is shown in Table ??.
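The fit just described might be produced like this (a sketch; it assumes pchreg takes the interval boundaries, on the shifted time scale, through its cuts argument):

```r
library(eha)

## Eight five-year intervals on the duration scale: 0-5, 5-10, ..., 35-40.
fit.pch <- pchreg(Surv(enter - 60, exit - 60, event) ~ sex + region,
                  cuts = seq(0, 40, by = 5), data = oldmort)
```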
Then the baseline cumulative hazards are compared to the one from the Cox regression fit in Figure 8.7.
FIGURE 8.7: Piecewise constant cumulative hazards, old age mortality.
As expected, the fit is good. It is also obvious that the estimated regression parameters do not vary much between the studied models.
The piecewise constant model works well with this data set, but its real strength is its flexibility and speed with huge data sets. Its full potential is reached by first tabulating the data with the eha function toTpch. An illustration with the oldmort data set:
    olmtab <- toTpch(Surv(enter, exit, event) ~ sex + region,
                     cuts = c(seq(60, 85, by = 5), 100), data = oldmort)
The resulting table (first five rows) is shown in Table ??.
The original data set has 6495 rows (observations), while the created table has only 36 rows. The latter is analyzed via the function tpchreg, see Table 8.4, which is identical to Table ??.
    fit.tpch <- tpchreg(oe(event, exposure) ~ sex + region,
                        data = olmtab, time = age)
Covariate | level | W_mean | Coef | HR | SE | LR_p |
---|---|---|---|---|---|---|
sex | | | | | | 0.0001 |
 | female | 0.594 | 0 | 1 | | |
 | male | 0.406 | 0.1811 | 1.1985 | 0.0458 | |
region | | | | | | 0.0009 |
 | town | 0.111 | 0 | 1 | | |
 | industry | 0.326 | 0.2237 | 1.2507 | 0.0849 | |
 | rural | 0.563 | 0.0598 | 1.0616 | 0.0826 | |
Max Log Likelihood | | -7295.8 | | | | |
Restricted mean | | 17.216 | | | | |
8.1.6 Testing the proportional hazards assumption
The pch model is well suited for a formal test of the proportional hazards assumption, but some trickery is needed with the eha package. It is best shown by example, and we continue by utilizing the newly created data table olmtab.
By omitting the time argument in the call to tpchreg, an exponential (constant hazards) model is fitted, and the variable age is free to be included as a covariate.
    fit.tpch <- tpchreg(oe(event, exposure) ~ age * (sex + region),
                        data = olmtab)
    (dr <- drop1(fit.tpch, test = "Chisq"))
    Single term deletions

    Model:
    oe(event, exposure) ~ age * (sex + region)
               Df   AIC    LRT Pr(>Chi)
    <none>        14614
    age:sex     5 14610  7.009  0.21997
    age:region 10 14614 20.108  0.02825
The age interaction with sex is far from significant, while the effect of region on mortality seems to vary significantly with age. This can be illustrated by performing an analysis stratified by region, with the time = age argument included in the usual way, and then plotting the hazard functions for the three strata, see Figure 8.8.
    fit.str <- tpchreg(oe(event, exposure) ~ sex + strata(region),
                       time = age, data = olmtab)
FIGURE 8.8: Hazard functions for three regions, old age mortality, Skellefteå 1860–80.
The deviating region is town. It is also the smallest region, with no registered deaths above age 90 and only 4.7 person-years there. We also note that mortality in ages 85–90 is highest in the town, while at younger ages the town region has the lowest mortality. The zero mortality above age 90 is simply an artefact of very few observed person-years.
Let us perform the same exercise with a larger data set, the Swedish population 1968–2019. (See also Example 3.1 in Chapter 3.) We check the hypothesis of proportional hazards between women and men, the question is: Is the female advantage of the same relative size in all ages?
    sp <- swepop
    sp$deaths <- swedeaths$deaths
Then
    fit.swr <- tpchreg(oe(deaths, pop) ~ strata(sex) + I(year - 2000),
                       last = 101, time = age, data = sp)
    rr.sex <- exp(tpchreg(oe(deaths, pop) ~ sex + I(year - 2000),
                          last = 101, time = age,
                          data = sp)$coefficients[1])
    cumhaz <- hazards(fit.swr, cum = TRUE)   # cumulative hazards
    haz <- hazards(fit.swr, cum = FALSE)     # hazards
    op <- par(mfrow = c(1, 2))
    plot(haz$x, haz$y[2, ] / haz$y[1, ], type = "l", ylim = c(1, 3),
         xlab = "Age", ylab = "Hazard Ratio")
    abline(h = 1)
    abline(h = rr.sex, lty = 2)
    text(5, 1.65, "PH")
    plot(cumhaz$x, cumhaz$y[2, ] / cumhaz$y[1, ], type = "l",
         ylim = c(1, 3),
         xlab = "Age", ylab = "Cumulative Hazard Ratio")
    abline(h = 1)
    abline(h = rr.sex, lty = 2)
    text(5, 1.65, "PH")
FIGURE 8.9: Mortality ratio for men vs. women, Sweden 1968–2019.
    par(op)
We note two things: (i) The variation around the proportional hazards estimate (PH) is huge and (ii) the smoothing effect of accumulation is large, which we should keep in mind as a warning when trying to judge proportionality in graphs of cumulative hazards functions.
8.1.7 Choosing the best parametric proportional hazards model
For modeling survival data with parametric proportional hazards models, the distributions of the function phreg in the package eha are available. How to select a suitable parametric model is shown by a couple of examples using the now familiar data set oldmort.
Remember that by design, individuals are followed from the day they are aged 60. In order to calculate the follow-up times, we usually subtract 60 from the two columns enter and exit. Otherwise, when specifying a parametric survivor distribution, it would in fact correspond to a left-truncated (at 60) distribution. However, for a Cox regression, this makes no difference.
    om <- oldmort
    fm <- as.formula("Surv(enter, exit, event) ~ sex + region")
    fm0 <- as.formula("Surv(enter - 60, exit - 60, event) ~ sex + region")
    fit.w <- phreg(fm, data = oldmort)
    o.w <- extractAIC(fit.w)[2]
    fit.w0 <- phreg(fm0, data = oldmort)
    o.w0 <- extractAIC(fit.w0)[2]
Here we applied a Weibull baseline distribution (the default distribution in phreg; by specifying nothing, the Weibull is chosen). Note also the way we can store a formula for future use: this is particularly useful when the same model will be fitted several times while changing some attribute, like the baseline distribution.
Now let us repeat this with the pch, gompertz, and ev alternatives, for both the unconditional and the conditional approach. Then we compare the AICs and choose the distribution with the smallest value.
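One way to collect the AICs is a small loop over the phreg distributions (a sketch; the pch case is fitted separately with pchreg and is not shown here):

```r
library(eha)

fm  <- as.formula("Surv(enter, exit, event) ~ sex + region")
fm0 <- as.formula("Surv(enter - 60, exit - 60, event) ~ sex + region")

## AIC on both time scales for each baseline distribution:
dists <- c("weibull", "gompertz", "ev")
aics <- sapply(dists, function(d) {
  c(age      = extractAIC(phreg(fm,  data = oldmort, dist = d))[2],
    duration = extractAIC(phreg(fm0, data = oldmort, dist = d))[2])
})
aics
```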
First we note that only for the Weibull distribution does it make a difference which time scale (age or duration) is used. The Gompertz distribution gives the best fit, but the conditional Weibull (duration since 60 as time scale) is very close. We also saw this graphically earlier (Figure 8.6).
The reason that the pch model fares so badly is that it is penalized for the large number of parameters (eight) it uses to estimate the baseline hazard.
8.2 Accelerated Failure Time Models
The accelerated failure time (AFT) model is best described through relations between survivor functions. For instance, comparing two groups:
- Group 0: \(P(T \ge t) = S_0(t)\) (control group)
- Group 1: \(P(T \ge t) = S_0(\phi t)\) (treatment group)
The model says that treatment accelerates failure time by the factor \(\phi\). If \(\phi < 1\), treatment is good (prolongs life), otherwise bad. Another interpretation is that the median life length is multiplied by \(1/\phi\) by treatment.
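The median-scaling interpretation can be checked numerically with a hypothetical baseline distribution (a base-R sketch; the Weibull shape, scale, and \(\phi\) below are illustrative choices, not values from the text):

```r
## Hypothetical baseline: Weibull with shape 2 and scale 10 (illustrative only).
S0 <- function(t) pweibull(t, shape = 2, scale = 10, lower.tail = FALSE)

phi <- 0.5                    # acceleration factor; phi < 1 prolongs life
S1 <- function(t) S0(phi * t) # treatment-group survivor function

m0 <- 10 * sqrt(log(2))       # median of the baseline distribution
## The treated median is m0 / phi, i.e. the median is multiplied by 1 / phi:
c(S0(m0), S1(m0 / phi))       # both equal 0.5
```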
In Figure 8.10 the difference between the accelerated failure time and the proportional hazards models concerning the hazard functions is illustrated.
FIGURE 8.10: Proportional hazards (left) and accelerated failure time model (right). The baseline distribution is Loglogistic with shape 5 (dashed).
The AFT hazard is not only multiplied by 2; it is also shifted to the left because the process time is accelerated. Note how the hazards in the AFT case converge as time increases. This is usually a sign of the suitability of an AFT model.
8.2.1 The AFT regression model
If \(T\) has survivor function \(S(t)\) and \(T_c = T/c\), then \(T_c\) has survivor function \(S(ct)\). Then, if \(Y = \log(T)\) and \(Y_c = \log(T_c)\), the following relation holds:
\[\begin{equation*} Y_c = Y - \log(c). \end{equation*}\]
Renaming \(Y\) as \(\epsilon\) and \(Y_c\) as \(Y\), and setting \(\log(c) = -\boldsymbol{\beta} \mathbf{x}\), this can be written in familiar form:
\[\begin{equation*} Y = \boldsymbol{\beta} \mathbf{x} + \epsilon, \end{equation*}\]
i.e., an ordinary linear regression model for the log survival times. In the absence of right censoring and left truncation, this model can be estimated by least squares. However, the presence of these forms of incomplete data makes it necessary to rely on maximum likelihood methods. In R, the function aftreg in the package eha and the function survreg in the package survival perform the task of fitting AFT models. The package flexsurv (Jackson 2016) also has useful functionality in this area.
Besides differing parameterizations, the main difference between aftreg and survreg is that the latter does not allow for left-truncated data. One reason for this is that left truncation is a much harder problem to deal with in AFT models than in proportional hazards models.
The reason is that, with a time-varying covariate \(z(t), t \ge 0\), the AFT model is
\[\begin{equation*} S(t; z) = S_0\biggl(\int_0^t \exp\bigl(\beta z(s)\bigr)ds\biggr), \end{equation*}\]
and it is required that \(z(s)\) is known for all \(s\), \(0 \le s \le t\). With an observation left truncated at \(t_0\), say, \(z(s)\) is unknown for \(0 \le s < t_0\). In eha, this is solved by assuming that \(z(s) = z(t_0)\), \(0 \le s < t_0\).
A detailed description of the implementation of the AFT models in eha is found in Appendix B.
8.2.2 AFT modeling in R
We repeat the examples from the proportional hazards section, but with AFT models instead.
For a description of this data set, see above. Here we fit an AFT model with the Weibull distribution. This should be compared to the proportional hazards model with the Weibull distribution, see Table 8.2.
    source("R/fit.out.R")
    fm <- as.formula("Surv(enter - 60, exit - 60, event) ~ sex + region")
    fit.w1 <- aftreg(fm, id = id, data = oldmort)
Note carefully the inclusion of the argument id; it is necessary when some individuals are represented by more than one record in the data file.
Covariate | level | W_mean | Coef | lifeAcc | SE | LR_p |
---|---|---|---|---|---|---|
sex | | | | | | 0.0005 |
 | female | 0.594 | 0 | 1 | | |
 | male | 0.406 | 0.0922 | 1.0966 | 0.0265 | |
region | | | | | | 0.0000 |
 | town | 0.111 | 0 | 1 | | |
 | industry | 0.326 | 0.1512 | 1.1632 | 0.0495 | |
 | rural | 0.563 | 0.0319 | 1.0325 | 0.0481 | |
Max Log Likelihood | | -7415.1 | | | | |
Note that the "Max log likelihood," −7415.1, is not exactly the same as in the proportional hazards case, and the reason is the presence of the time-varying covariate region. With no time-varying covariates, the AFT and the PH Weibull models are equivalent. \(\ \Box\)
The data set fert in the R package eha contains birth intervals for married women in 19th century Skellefteå. It is described in detail in Chapter 1. Here only the intervals starting with the first birth for each woman are considered. First the data are extracted and examined.
Some women never got a second child, for instance the first woman (id = 1) above.
It turns out, as is reasonably expected, that the hazard function for time to the second birth first increases to a maximum and then decreases. Indeed, we can make a crude check of that by estimating the hazard function without covariates with the piecewise constant hazards model, which in principle is a nonparametric approach if the time span is cut into small enough pieces. So we cut the first eight years into half-year-long pieces, see Figure 8.11.
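A crude estimate of this kind might be obtained as follows (a sketch; it assumes f12 has been created as described and that pchreg accepts a vector of half-year boundaries in cuts):

```r
library(eha)

## Crude hazard: no covariates, half-year pieces over the first eight years.
fit.pw <- pchreg(Surv(next.ivl, event) ~ 1,
                 cuts = seq(0, 8, by = 0.5), data = f12)
plot(fit.pw, fn = "haz", main = "",
     xlab = "Years", ylab = "Hazard")
```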
FIGURE 8.11: Estimated hazard function for waiting time between first and second birth.
There are two possible candidates for a baseline distribution with the right shape of hazard function: the lognormal and the loglogistic distributions. \(\ \Box\)
8.2.3 The Lognormal model
The family of lognormal distributions is characterized by the fact that taking the natural logarithm of a random variable from the family gives a random variable from the family of Normal distributions. Both the hazard and the survivor functions lack closed forms.
The data are already extracted and examined in the previous example. Now, the accelerated failure time (AFT) lognormal regression model is examined.
    f12$year.1860 <- f12$year - 1860
    fit.lognorm <- aftreg(Surv(next.ivl, event) ~ age + year.1860 + ses,
                          data = f12, dist = "lognormal")
Note that the argument id is not used here; that is because there are no time-varying covariates for next.ivl. The id variable in the data set refers to the mother's id and not to specific birth intervals. The data set f12 is a subset of fert; it only includes intervals starting with the first birth.
Covariate | level | W_mean | Coef | lifeAcc | SE | LR_p |
---|---|---|---|---|---|---|
age | | 27.247 | -0.0180 | 0.9822 | 0.0027 | 0.0000 |
year.1860 | | -1.041 | 0.0000 | 1.0000 | 0.0010 | 0.9964 |
ses | | | | | | 0.1218 |
 | farmer | 0.458 | 0 | 1 | | |
 | unknown | 0.178 | -0.0634 | 0.9386 | 0.0336 | |
 | upper | 0.022 | 0.0377 | 1.0384 | 0.0821 | |
 | lower | 0.341 | -0.0487 | 0.9525 | 0.0271 | |
Max Log Likelihood | | -2426.2 | | | | |
The interpretation of the regression coefficients is different from the PH case: Exponentiated (lifeAcc) they measure how much time to event is accelerated. For instance, if mother’s age is increased by one year, the waiting time to the next birth is accelerated by the factor 0.9822, that is, slowed down slightly.
It may be more natural to use expected life as a comparison rather than accelerated time, even if one is the reverse of the other. In aftreg, it is possible to choose param = "lifeExp" for that purpose.
    fit.lognorm2 <- aftreg(Surv(next.ivl, event) ~ age + year.1860 + ses,
                           data = f12, dist = "lognormal",
                           param = "lifeExp")
Covariate | level | W_mean | Coef | lifeExp | SE | LR_p |
---|---|---|---|---|---|---|
age | | 27.247 | 0.0180 | 1.0181 | 0.0027 | 0.0000 |
year.1860 | | -1.041 | -0.0000 | 1.0000 | 0.0010 | 0.9964 |
ses | | | | | | 0.1218 |
 | farmer | 0.458 | 0 | 1 | | |
 | unknown | 0.178 | 0.0634 | 1.0654 | 0.0336 | |
 | upper | 0.022 | -0.0377 | 0.9630 | 0.0821 | |
 | lower | 0.341 | 0.0487 | 1.0499 | 0.0271 | |
Max Log Likelihood | | -2426.2 | | | | |
As can be seen by comparing Tables 8.7 and 8.6, the only difference is the signs of the estimated coefficients. And therefore, “lifeAcc = 1 / lifeExp.” The simple interpretation of the “lifeExp” parameter is that it multiplies the expected value for the corresponding category. In Figure 8.12 the estimated baseline hazard function is shown.
FIGURE 8.12: Estimated lognormal baseline hazard function for length of birth intervals.
It was created using
    plot(fit.lognorm, fn = "haz", main = "",
         xlab = "Years", ylab = "Hazards")
\(\ \Box\)
8.2.4 The loglogistic model
The family of loglogistic distributions is characterized by the fact that taking the natural logarithm of a random variable from the family gives a random variable from the family of Logistic distributions. Contrary to the Lognormal case, both the hazard and the survivor functions have closed forms.
The data are already extracted and examined in the previous example. Now, the accelerated failure time (AFT) loglogistic regression model is examined.
    fit.loglogist <- aftreg(Surv(next.ivl, event) ~ age + year.1860 + ses,
                            data = f12, dist = "loglogistic")
Covariate | level | W_mean | Coef | lifeAcc | SE | LR_p |
---|---|---|---|---|---|---|
age | | 27.247 | -0.0139 | 0.9862 | 0.0025 | 0.0000 |
year.1860 | | -1.041 | 0.0002 | 1.0002 | 0.0009 | 0.8194 |
ses | | | | | | 0.2056 |
 | farmer | 0.458 | 0 | 1 | | |
 | unknown | 0.178 | -0.0293 | 0.9711 | 0.0293 | |
 | upper | 0.022 | 0.0978 | 1.1028 | 0.0747 | |
 | lower | 0.341 | -0.0327 | 0.9679 | 0.0237 | |
Max Log Likelihood | | -2292 | | | | |
Which is the better fit, the lognormal or the loglogistic? Let us compare the maximized log likelihoods: For the Lognormal it is −2426.23, and for the Loglogistic it is −2292.01, so it seems as if the Loglogistic fit is clearly better. Note, though, that this is not a formal statistical test, it is a comparison by the Akaike criterion (Akaike 1974). This conclusion is also partly supported by a comparison of the two estimated baseline hazard functions (Figures 8.13 and 8.12) with the estimated crude hazard function in Figure 8.11.
FIGURE 8.13: Estimated loglogistic baseline hazard function for length of birth intervals.
8.2.5 The Gompertz model
The Gompertz distribution is special in that it can be fit into both the AFT and the PH framework. Of course, as we have seen, this also holds for the Weibull distribution in a trivial way, the AFT and the PH models are the same, but for the Gompertz distribution, the AFT and PH approaches yield different models.
For the AFT framework to be in place in the Gompertz case, the distribution needs to be formulated with a rather unfamiliar parameterization, called the canonical parameterization in the package eha. It works as follows. The standard definition of the Gompertz hazard function is
\[\begin{equation*} h_r(t; (\alpha, \beta)) = \alpha \exp(\beta t), \quad t > 0; \; \alpha > 0, \; -\infty < \beta < \infty, \end{equation*}\]
and it is called the rate parameterization in eha.
As noted earlier, in order for \(h_r\) to determine a proper survival distribution, it
must be required that \(\beta \ge 0\). It was also noted that when \(\beta = 0\), the
resulting distribution is exponential.
The canonical definition of the Gompertz hazard function is given by
\[\begin{equation*} h_c(t; (\tau, \sigma)) = \frac{\tau}{\sigma} \exp\biggl(\frac{t}{\sigma}\biggr), \quad t > 0; \; \tau, \sigma > 0. \end{equation*}\]
The transition from \(h_r\) to \(h_c\) is given by \(\sigma = 1 / \beta, \, \tau = \alpha / \beta\), and note that this implies that the rate in the canonical form must be strictly positive. Furthermore, the exponential special case now appears in the limit as \(\sigma \rightarrow \infty\). In practice this means that when the baseline hazard is only weakly increasing, \(\sigma\) is very large and numerical problems are likely to occur in the estimation process.
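The parameter transition can be expressed as a small base-R helper (a sketch; rate_to_canonical is a hypothetical name for illustration, not an eha function):

```r
## Hypothetical helper: rate parameters (alpha, beta) -> canonical (tau, sigma).
rate_to_canonical <- function(alpha, beta) {
  stopifnot(beta > 0)   # the canonical form requires an increasing hazard
  c(tau = alpha / beta, sigma = 1 / beta)
}

h_rate      <- function(t, alpha, beta) alpha * exp(beta * t)
h_canonical <- function(t, tau, sigma) (tau / sigma) * exp(t / sigma)

## The two parameterizations give identical hazards:
p <- rate_to_canonical(alpha = 0.02, beta = 0.095)
all.equal(h_rate(30, 0.02, 0.095),
          h_canonical(30, p[["tau"]], p[["sigma"]]))
```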
The conclusion of all this is that the AFT Gompertz model is suitable in situations where the intensity of an event is clearly increasing with time. A good example is adult mortality.
We repeat the PH analysis in Table 8.3, but with the AFT model, see Table 8.9.
Covariate | level | W_mean | Coef | lifeExp | SE | LR_p |
---|---|---|---|---|---|---|
sex | | | | | | 0.0008 |
 | female | 0.594 | 0 | 1 | | |
 | male | 0.406 | -0.0649 | 0.9371 | 0.0192 | |
region | | | | | | 0.0071 |
 | town | 0.111 | 0 | 1 | | |
 | industry | 0.326 | -0.0960 | 0.9084 | 0.0391 | |
 | rural | 0.563 | -0.0457 | 0.9554 | 0.0385 | |
Max Log Likelihood | | -7288.3 | | | | |
The expected remaining life at 60 for a man living in the town is 16 years, and for a woman living in the town the expected remaining life is \(1.067 \times 16 = 17\) years.
8.3 Proportional Hazards or AFT Model?
The problem of choosing between a proportional hazards and an accelerated failure time model (everything else equal) can be solved by comparing the AIC (Leeuw 1992; Akaike 1974) of the models. Since the numbers of parameters are equal in the two cases, this amounts to comparing the maximized likelihoods. For instance, in the case with old age mortality:
Let us see what happens with the Gompertz AFT model:
Exactly the same procedure as with the Weibull distribution, but we have to
specify the Gompertz distribution in the call (remember, the Weibull
distribution is the default choice, both for phreg
and aftreg
).
Comparing the corresponding result for the proportional hazards and the AFT models with the Gompertz distribution, we find that the maximized log likelihood in the former case is −7280.9, compared to −7288.3 for the latter. This indicates that the proportional hazards model fit is better. Note however that we cannot formally test the proportional hazards hypothesis; the two models are not nested.
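The AIC bookkeeping behind this comparison can be sketched in a few lines; the log likelihoods are those quoted above, while the common parameter count k is an assumed illustrative value (its exact value cancels in the comparison):

```r
## AIC = -2 * logLik + 2 * k; with equal k, comparing AIC is the
## same as comparing maximized log likelihoods.
aic <- function(logL, k) -2 * logL + 2 * k
logL.ph  <- -7280.9   # Gompertz proportional hazards fit (from the text)
logL.aft <- -7288.3   # Gompertz AFT fit (from the text)
k <- 5                # assumed common number of parameters
aic(logL.ph, k) < aic(logL.aft, k)   # TRUE: the PH model is preferred
```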
8.4 Discrete Time Models
There are two ways of looking at discrete duration data: either time is truly discrete, for instance the number of trials until an event occurs, or discreteness is an approximation caused by the rounding of continuous time data. In a sense all data are discrete, because it is impossible to measure anything on a continuous scale with infinite precision, but from a practical point of view it is reasonable to say that data are discrete when tied events occur embarrassingly often.
8.4.1 Data formats: wide and long
When working with register data, time is often measured in years which
makes it necessary and convenient to work with discrete models. A typical
data format is the so-called wide format, where there is one record
(row) per individual, and measurements for many years. We have so far only
worked with the long format. The data sets created by survSplit
are in long format; there is one record per individual and age category.
The R work horse in switching back and forth between the long and wide
formats is the function reshape
.
It may look confusing at first, but
if data follow some simple rules, it is quite easy to use reshape
.
The function reshape
is typically used with longitudinal data,
where there are several measurements at different time points for each
individual. If all the data for one individual are registered within one record
(row), we say that data are in wide format, and if there is one record (row)
per time point (several records per individual), data are in long format. Using
wide format, the rule is that time-varying variable names must end in a
numeric value indicating at which time the measurement was taken. For
instance, if the variable civ
(civil status) is noted at times 1, 2,
and 3, there must be variables named civ.1, civ.2, civ.3
,
respectively. It is optional to use any separator between the base
name (civ
) and the time, but it should be one character or empty. The
“.” is what reshape
expects by default, so using that form
simplifies coding somewhat.
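As a toy illustration of this naming rule (the data are made up), consider a wide data frame where civ is measured at times 1, 2, and 3:

```r
wide <- data.frame(id  = 1:2,
                   sex = c("male", "female"),
                   civ.1 = c("unmarried", "married"),
                   civ.2 = c("married",   "married"),
                   civ.3 = c("married",   "widow"))
## Variable names civ.1, civ.2, civ.3 follow the "base name + time" rule,
## so reshape can guess both the base name and the time values.
long <- reshape(wide, direction = "long", idvar = "id",
                varying = c("civ.1", "civ.2", "civ.3"))
long[order(long$id, long$time), c("id", "sex", "time", "civ")]
```

With the default separator ".", reshape splits the varying names into the base name civ and the time values, creating the columns civ and time in the long data frame.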
We start by creating an example data set as an illustration. This is
accomplished by starting off with the data set oldmort
in eha
and “trimming” it.
data(oldmort)
om <- oldmort[oldmort$enter == 60, ]
om <- age.window(om, c(60, 70))
om$m.id <- om$f.id <- om$imr.birth <- om$birthplace <- NULL
om$birthdate <- om$ses.50 <- NULL
om1 <- survival::survSplit(om, cut = 61:69, start = "enter",
                           end = "exit", event = "event",
                           episode = "agegrp")
om1$agegrp <- factor(om1$agegrp, labels = 60:69)
om1 <- om1[order(om1$id, om1$enter), ]
rownames(om1) <- 1:NROW(om1)
om1$id <- as.numeric(as.factor(om1$id))
head(om1)
id sex civ region enter exit event agegrp
1 1 male widow rural 60 61.000 0 60
2 1 male widow rural 61 62.000 0 61
3 1 male widow rural 62 63.000 0 62
4 1 male widow rural 63 63.413 0 63
5 2 female widow industry 60 61.000 0 60
6 2 female widow industry 61 62.000 0 61
This is the long format: each individual has as many records as “presence
ages.” For instance, person No. 1 has four records, for the ages 60–63.
The maximum possible number of records for one individual is 10. We can check
the distribution of the number of records per person with the function
tapply
:
recs <- tapply(om1$id, om1$id, length)
table(recs)
recs
1 2 3 4 5 6 7 8 9 10
400 397 351 315 307 250 192 208 146 657
It is easier to get to grips with the distribution with a graph, in this case a barplot, see Figure 8.14.
FIGURE 8.14: Barplot of the number of records per person.
Now, let us turn om1
into a data frame in wide format. This is
done with the function reshape
. First we remove the redundant
variables enter
and exit
.
om1$exit <- om1$enter <- NULL
om2 <- reshape(om1, v.names = c("event", "civ", "region"),
               idvar = "id", direction = "wide",
               timevar = "agegrp")
names(om2)
[1] "id" "sex" "event.60" "civ.60" "region.60"
[6] "event.61" "civ.61" "region.61" "event.62" "civ.62"
[11] "region.62" "event.63" "civ.63" "region.63" "event.64"
[16] "civ.64" "region.64" "event.65" "civ.65" "region.65"
[21] "event.66" "civ.66" "region.66" "event.67" "civ.67"
[26] "region.67" "event.68" "civ.68" "region.68" "event.69"
[31] "civ.69" "region.69"
Here there are two time-fixed variables, id
and sex
, and three
time-varying variables, event
, civ
, and region
. The
time-varying variables have a suffix of the type .xx
, where xx
varies
from 60 to 69.
This is how data in wide format usually show up; the suffix may start with
something other than .
, but it must be a single character, or nothing.
The real problem is how to switch from wide format to long, because our
survival analysis tools want it that way. The solution is to use
reshape
again, but with other specifications.
om3 <- reshape(om2, direction = "long", idvar = "id",
               varying = 3:32)
head(om3)
id sex time event civ region
1.60 1 male 60 0 widow rural
2.60 2 female 60 0 widow industry
3.60 3 male 60 0 married town
4.60 4 male 60 0 married town
5.60 5 female 60 0 married town
6.60 6 female 60 0 married town
A new variable time
is created, which goes from 60 to 69, one
step for each of the ages. We would
like to have the file sorted primarily by id
and secondarily by time.
om3 <- om3[order(om3$id, om3$time), ]
om3[1:11, ]
id sex time event civ region
1.60 1 male 60 0 widow rural
1.61 1 male 61 0 widow rural
1.62 1 male 62 0 widow rural
1.63 1 male 63 0 widow rural
1.64 1 male 64 NA <NA> <NA>
1.65 1 male 65 NA <NA> <NA>
1.66 1 male 66 NA <NA> <NA>
1.67 1 male 67 NA <NA> <NA>
1.68 1 male 68 NA <NA> <NA>
1.69 1 male 69 NA <NA> <NA>
2.60 2 female 60 0 widow industry
Note that all individuals got 10 records here, even those who are observed for fewer years. Individual No. 1 is only observed for the ages 60–63, and the next six records are redundant; they will not be used in an analysis if kept, so from a practical point of view it is a good idea to remove them.
NROW(om3)
[1] 32230
om3 <- om3[!is.na(om3$event), ]
NROW(om3)
[1] 17434
The data frame shrank to about half its original size. First, let us summarize the data.
summary(om3)
id sex time event
Min. : 1 male : 7045 Min. :60.00 Min. :0.00000
1st Qu.: 655 female:10389 1st Qu.:61.00 1st Qu.:0.00000
Median :1267 Median :63.00 Median :0.00000
Mean :1328 Mean :63.15 Mean :0.02587
3rd Qu.:1948 3rd Qu.:65.00 3rd Qu.:0.00000
Max. :3223 Max. :69.00 Max. :1.00000
civ region
unmarried: 1565 town :2485
married :11380 industry:5344
widow : 4489 rural :9605
The key variables in the discrete time analysis are event
and
time
. For the baseline hazard we need one parameter per value of
time
, so it is practical to transform the continuous variable time
to a factor.
om3$time <- as.factor(om3$time)
summary(om3)
id sex time event
Min. : 1 male : 7045 60 :3223 Min. :0.00000
1st Qu.: 655 female:10389 61 :2823 1st Qu.:0.00000
Median :1267 62 :2426 Median :0.00000
Mean :1328 63 :2075 Mean :0.02587
3rd Qu.:1948 64 :1760 3rd Qu.:0.00000
Max. :3223 65 :1453 Max. :1.00000
(Other):3674
civ region
unmarried: 1565 town :2485
married :11380 industry:5344
widow : 4489 rural :9605
The summary now produces a frequency table for time
.
Note that we always want our data to be in long format before we start the analysis, so the important lesson here was how to go from wide to long. You may find the tidyr package (Wickham 2021) useful if you encounter “untidy” data and want to tidy up.
8.4.2 Binomial regression with glm
For a given time point and a given individual, the response is whether an
event has occurred or not; that is, it is modeled as a
Bernoulli outcome, which is a special
case of the binomial distribution.
The discrete time analysis may now be performed in several ways. Most
straightforward is to run a
binomial regression with event
as
response through the basic glm
function with
family = binomial(link = cloglog)
. The so-called
cloglog (complementary log-log) link is
used in order to preserve the proportional hazards property of the underlying,
real world, continuous time model.
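The reasoning can be sketched as follows: if the continuous-time hazard is \(h(t) = h_0(t)\exp(\beta x)\), the probability of an event in a given interval, conditional on reaching it, is \(1 - \exp(-H_0\exp(\beta x))\), where \(H_0\) is the integrated baseline hazard over the interval. The cloglog of that probability is therefore linear in \(x\) with slope \(\beta\). A small numerical check with arbitrary illustrative values:

```r
cloglog <- function(p) log(-log(1 - p))
H0   <- 0.05                      # integrated baseline hazard, one interval
beta <- -0.36                     # illustrative regression coefficient
p0 <- 1 - exp(-H0)                # event probability when x = 0
p1 <- 1 - exp(-H0 * exp(beta))    # event probability when x = 1
cloglog(p1) - cloglog(p0)         # equals beta: PH is preserved
```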
fit.glm <- glm(event ~ sex + civ + region + time,
               family = binomial(link = cloglog), data = om3)
summary(fit.glm)
Call:
glm(formula = event ~ sex + civ + region + time, family = binomial(link = cloglog),
data = om3)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.4263 -0.2435 -0.2181 -0.1938 2.9503
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.42133 0.21167 -16.164 < 2e-16
sexfemale -0.36332 0.09927 -3.660 0.000252
civmarried -0.33683 0.15601 -2.159 0.030847
civwidow -0.19008 0.16810 -1.131 0.258175
regionindustry 0.10784 0.14443 0.747 0.455264
regionrural -0.22423 0.14034 -1.598 0.110080
time61 0.04895 0.18505 0.265 0.791366
time62 0.46005 0.17400 2.644 0.008195
time63 0.05586 0.20200 0.277 0.782134
time64 0.46323 0.18883 2.453 0.014162
time65 0.31749 0.20850 1.523 0.127816
time66 0.45582 0.21225 2.148 0.031751
time67 0.91511 0.19551 4.681 2.86e-06
time68 0.68549 0.22575 3.036 0.002394
time69 0.60539 0.24904 2.431 0.015064
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4186.8 on 17433 degrees of freedom
Residual deviance: 4125.2 on 17419 degrees of freedom
AIC: 4155.2
Number of Fisher Scoring iterations: 7
This output is not so pleasant (we do not want the time
estimates printed),
but we can anyway see that females (as
usual) have lower mortality than males, that the married are better off than
the unmarried, and that the regional differences are perhaps not so large. To get a
better understanding of the statistical significance of the findings we run
drop1
on the fit:
drop1(fit.glm, test = "Chisq")
Single term deletions
Model:
event ~ sex + civ + region + time
Df Deviance AIC LRT Pr(>Chi)
<none> 4125.2 4155.2
sex 1 4138.5 4166.5 13.302 0.0002651
civ 2 4130.2 4156.2 4.978 0.0829956
region 2 4135.8 4161.8 10.526 0.0051810
time 9 4161.2 4173.2 36.005 3.956e-05
Mildly surprisingly, civil status
is not that statistically
significant, while region
(and the other variables) is. The strong
significance of the time variable is of course expected; mortality is
expected to increase with increasing age.
8.4.3 Survival analysis with coxreg
By some data manipulation we can also use the function coxreg
in the package
eha
for the analysis. For
that to succeed we need intervals as responses, and the way of doing that
is to add two “fake” variables, exit
and enter
. The latter must be
slightly smaller than the former:
om3$exit <- as.numeric(as.character(om3$time))
om3$enter <- om3$exit - 0.1
cap <- "Old age mortality, discrete time analysis."
fit.ML <- coxreg(Surv(enter, exit, event) ~ sex + civ + region,
                 method = "ml", data = om3, coxph = FALSE)
fit.out(fit.ML, caption = cap, label = "mlreg8")
| Covariate | level | W_mean | Coef | HR | SE | LR_p |
|---|---|---|---|---|---|---|
| sex | | | | | | 0.0003 |
| | male | 0.404 | 0 | 1 | | |
| | female | 0.596 | -0.3633 | 0.6954 | 0.0993 | |
| civ | | | | | | 0.0830 |
| | unmarried | 0.090 | 0 | 1 | | |
| | married | 0.653 | -0.3368 | 0.7140 | 0.1560 | |
| | widow | 0.257 | -0.1901 | 0.8269 | 0.1681 | |
| region | | | | | | 0.0052 |
| | town | 0.143 | 0 | 1 | | |
| | industry | 0.307 | 0.1078 | 1.1139 | 0.1445 | |
| | rural | 0.551 | -0.2242 | 0.7991 | 0.1404 | |
| Max Log Likelihood | | | | | | -2062.6 |
The result is shown in Table 8.10, and plots of the cumulative hazards and the survival function are easily achieved, see Figures 8.15 and 8.16.
FIGURE 8.15: The cumulative hazards, from the coxreg fit.
FIGURE 8.16: The survival function, from the coxreg fit.
Finally, the proportional hazards assumption can be tested in the discrete
time framework by creating an interaction between time
and the
covariates in question. This is possible using glm
.
fit2.glm <- glm(event ~ (sex + civ + region) * time,
                family = binomial(link = cloglog),
                data = om3)
drop1(fit2.glm, test = "Chisq")
Single term deletions
Model:
event ~ (sex + civ + region) * time
Df Deviance AIC LRT Pr(>Chi)
<none> 4077.1 4197.1
sex:time 9 4088.2 4190.2 11.116 0.2679
civ:time 18 4099.5 4183.5 22.425 0.2137
region:time 18 4093.7 4177.7 16.600 0.5508
There is no sign of non-proportionality, that is, no significant interaction with time
.