Chapter 8 Parametric Models
We have already seen one example of a parametric survival model, the piecewise constant hazards model in the previous chapter. Apart from that, we have so far studied nonparametric methods for survival analysis, a tradition that has its roots in medical (cancer) research. In technical applications, on the other hand, parametric models dominate; of special interest is the Weibull model. The textbook by is a good source for the study of parametric survival models. More technical detail about parametric distributions is found in Appendix B.
Three kinds of parametric models are considered here: proportional hazards and accelerated failure time models in continuous time, and proportional hazards models in discrete time.
8.1 Proportional Hazards Models
A proportional hazards family of distributions is generated from one specific continuous distribution by multiplying the hazard function of that distribution by a strictly positive constant, and letting that constant vary over the full positive real line. So, if \(h_0\) is the hazard function corresponding to the generating distribution, a hazard function \(h_1\) belongs to the family if

\[\begin{equation*} h_1(t) = c h_0(t) \quad \text{for some } c > 0, \; \text{and all } t > 0. \end{equation*}\]

Note that it is possible to choose any hazard function (on the positive real line) as the generating function. The resulting proportional hazards class of distributions may or may not be a well-recognized family of distributions.
In R, eha is one of the packages that can fit parametric proportional hazards models. In the following subsections, the possibilities are examined.

The parametric distributions that naturally can be used as the baseline distribution in the function phreg are the Weibull, extreme value, and Gompertz distributions. The piecewise constant hazards model is treated by the functions pchreg (individual data) and tpchreg (tabular data). The lognormal and loglogistic distributions are also included as possible choices; they allow for hazard functions that first increase to a maximum and then decrease, whereas the other distributions all have monotone hazard functions. However, since these two families are not closed under proportional hazards without artificially adding a third, "proportionality," parameter, they are not discussed here (regard these possibilities as experimental). It is better to combine the lognormal and loglogistic distributions with accelerated failure time modeling, where they fit in naturally.
See Figure 8.1 for Weibull and Gompertz hazard functions with selected parameter values.
FIGURE 8.1: Selected hazard functions.
We note in passing that the fourth case, the Gompertz model with negative rate parameter, does not represent a true survival distribution, because the hazard function decreases too fast: There will be a positive probability of eternal life.
Experience shows that the Gompertz distribution fits adult mortality very well, in the ages 30 to 85, say. The modeling of mortality from birth to early adulthood, on the other hand, is demanding since the typical hazard function for all these ages is U-shaped with high infant mortality and relatively low child mortality. Since both the Weibull and the Gompertz distributions have a monotone hazard function, neither is suitable to fit the mortality of the full human life span. However, both distributions are suitable for fitting shorter pieces of the life span, and for longer spans there are two possibilities, a nonparametric model (Cox regression) and a piecewise constant hazards model, where the former can be seen as a limiting case of the latter. More about that later.
8.1.1 The Weibull model
The family of Weibull distributions may be defined by the family of hazard functions
\[\begin{equation} h(t; p, \lambda) = \frac{p}{\lambda} \biggl(\frac{t}{\lambda}\biggr)^{p-1}, \quad t, p, \lambda > 0. \tag{8.1} \end{equation}\]
If we start with
\[ h_1(t) = t^{p-1}, \quad p > 0, \; t > 0 \]
for a fixed \(p\), and generate a proportional hazards family from there,
\[\begin{equation*} h_c(t) = c h_1(t), \quad c, t > 0, \end{equation*}\]
we get
\[\begin{equation}\label{eq:6weibdist} h_c(t) = c t^{p-1} = \frac{p}{\lambda} \biggl(\frac{t}{\lambda}\biggr)^{p-1} \end{equation}\]
by setting
\[ c = \frac{p}{\lambda^p}, \]
which shows that for each fixed \(p > 0\), a proportional hazards family is generated by varying \(\lambda\) in equation (8.1). On the other hand, if we pick two members from the family defined by equation (8.1) with different values of \(p\), their hazard functions are not proportional.

To summarize, the Weibull family of distributions is not one proportional hazards family, but a collection of proportional hazards families, indexed by \(p > 0\). It is true, though, that each of these proportional hazards families stays within the Weibull family of distributions.
The proportional hazards regression model with a Weibull baseline distribution is obtained by multiplying (8.1) by \(\exp(\beta x)\):
\[\begin{equation} h(t; x, \lambda, p, \beta) = \frac{p}{\lambda} \biggl(\frac{t}{\lambda}\biggr)^{p-1} e^{\beta x}, \quad t > 0. \tag{8.2} \end{equation}\]
The function phreg in the package eha fits models given by equation (8.2) by default.
8.1.2 The Gompertz distribution
The Gompertz families of distributions are defined in essentially two ways in the R package eha: the rate and the canonical representations. The reason for this duality is that the families need to be represented differently depending on whether proportional hazards or accelerated failure time models are under consideration. In the proportional hazards case the rate formulation is used; it is characterized by an exponentially increasing hazard function with fixed rate \(r\):
\[\begin{equation} h(t; p, r) = p e^{r t}, \quad p, t > 0; -\infty < r < \infty. \end{equation}\]
As noted earlier, when \(r < 0\) the hazard function decreases "too fast" to define a proper survival function, and \(r = 0\) gives the exponential distribution as a special case. For each fixed \(r\), the family of distributions indexed by \(p > 0\) constitutes a proportional hazards family, and the corresponding regression model is written as
\[\begin{equation} h(t; x, p, r, \beta) = p e^{r t} e^{\beta x}, \quad t > 0. \end{equation}\]
8.1.3 Application
The data set oldmort in the R package eha contains life histories of people born between June 28, 1765 and December 31, 1820 in Skellefteå, followed from their 60th birthday to their 100th, or until death. The data set is described in detail in Chapter 1. The variable enter is the age at the start of the given interval, and exit contains the age at the end of the interval. We need to calculate follow-up time since age 60, so a new data frame, olm, is created as a copy of oldmort, and then 60 is subtracted from enter and exit.
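In code, the copy-and-shift step might look like this (a sketch, assuming the eha package and its oldmort data are available):

```r
library(eha)   # provides the oldmort data

olm <- oldmort                # work on a copy, keep the original intact
olm$enter <- olm$enter - 60   # follow-up time since age 60
olm$exit  <- olm$exit - 60
```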
See the result in Table ??, where the most relevant variables for our purpose are shown. The variable names are more or less self-explanatory: enter and exit are time in years since the sixtieth birthday, age is age at the start of follow-up (the original enter variable), and event is an indicator of death at the duration given by exit.
To have something to compare with, a Cox regression is performed first, see Table 8.1. Two covariates are included in the model, sex and region. Both are categorical: region with three categories, town (reference), industry, and rural, and sex with female as reference category.
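A fit like the one summarized in Table 8.1 might be produced with eha's coxreg (a sketch; the printed table layout itself comes from the book's own summary code):

```r
library(eha)

fit.cox <- coxreg(Surv(enter - 60, exit - 60, event) ~ sex + region,
                  data = oldmort)
summary(fit.cox)
```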
Covariate | level | W_mean | Coef | HR | SE | LR_p |
---|---|---|---|---|---|---|
sex | | | | | | 0.0001 |
 | female | 0.594 | 0 | 1 | | |
 | male | 0.406 | 0.1867 | 1.2052 | 0.0459 | |
region | | | | | | 0.0013 |
 | town | 0.111 | 0 | 1 | | |
 | industry | 0.326 | 0.2120 | 1.2361 | 0.0849 | |
 | rural | 0.563 | 0.0516 | 1.0530 | 0.0827 | |
Max Log Likelihood | | -13563.1 | | | | |
Then a Weibull model is fitted, see Table 8.2.
    oldmort$sex <- relevel(oldmort$sex, ref = "female")
    fit <- phreg(Surv(enter - 60, exit - 60, event) ~ sex + region,
                 dist = "weibull", data = oldmort)
Covariate | level | W_mean | Coef | HR | SE | LR_p |
---|---|---|---|---|---|---|
sex | | | | | | 0.0005 |
 | female | 0.594 | 0 | 1 | | |
 | male | 0.406 | 0.1587 | 1.1720 | 0.0458 | |
region | | | | | | 0.0001 |
 | town | 0.111 | 0 | 1 | | |
 | industry | 0.326 | 0.2491 | 1.2829 | 0.0848 | |
 | rural | 0.563 | 0.0525 | 1.0539 | 0.0826 | |
Max Log Likelihood | | -7421.1 | | | | |
A closer look at the estimated regression coefficients shows that they are not very close; in Table ?? they are put side by side for easier comparison.
Let us compare the estimated cumulative baseline hazard functions, see Figure 8.2.
FIGURE 8.2: Baseline cumulative hazards for Cox and Weibull regressions.
This is not a good fit; it seems as if the Weibull hazard cannot grow fast enough. A better approach is to fit a Gompertz distribution and check the parameter and baseline hazard estimates, see Figure 8.3 and Table ??.
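The Gompertz fit might be obtained as follows (a sketch, using the same shifted time scale and assuming eha is loaded):

```r
library(eha)

fit.g <- phreg(Surv(enter - 60, exit - 60, event) ~ sex + region,
               dist = "gompertz", data = oldmort)
```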
FIGURE 8.3: Baseline cumulative hazards for Cox and Gompertz regressions.
The Gompertz model fits the baseline hazard very well up until duration 30 (age 90), but after that the exponential growth slows down. The early growth rate is 9.5 per cent per year.
The result of fitting the Gompertz model is shown in Table 8.3.
Covariate | level | W_mean | Coef | HR | SE | LR_p |
---|---|---|---|---|---|---|
sex | | | | | | 0.0000 |
 | female | 0.594 | 0 | 1 | | |
 | male | 0.406 | 0.1897 | 1.2089 | 0.0458 | |
region | | | | | | 0.0015 |
 | town | 0.111 | 0 | 1 | | |
 | industry | 0.326 | 0.2066 | 1.2295 | 0.0849 | |
 | rural | 0.563 | 0.0468 | 1.0480 | 0.0826 | |
Max Log Likelihood | | -7280.9 | | | | |
The Gompertz and Cox models are very close, both regarding regression parameter estimates and baseline hazard functions.
8.1.4 The parametric model with left truncation
The data set oldmort contains left-truncated life histories as a consequence of using age as the time scale. In the presentation above we chose to change the time scale so that the origin was age 60 (sharp). This is of no importance when fitting the semiparametric Cox regression: an additive change of time scale only shifts the estimated cumulative hazards along the x-axis.
But for parametric models it matters, and we illustrate it by repeating the previous Weibull, Gompertz, and Cox regression analyses with the original time scale. The consequence is that focus is shifted from a conditional (on survival until 60) analysis to an unconditional, where the baseline hazard and regression coefficients are estimated for the full life span (0–100 years of age).
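The refits on the original age scale might look like this (a sketch, assuming eha is loaded):

```r
library(eha)

## Same models as before, but on the original (age) time scale:
fit.w.age <- phreg(Surv(enter, exit, event) ~ sex + region,
                   dist = "weibull", data = oldmort)
fit.g.age <- phreg(Surv(enter, exit, event) ~ sex + region,
                   dist = "gompertz", data = oldmort)
fit.c.age <- coxreg(Surv(enter, exit, event) ~ sex + region,
                    data = oldmort)
```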
Compare these fitted coefficients with the earlier ones in Table ??. For the Gompertz and Cox regression models the coefficients are identical, while they differ for the Weibull distribution.
Let us look at the baseline cumulative hazards, see Figure 8.4, where the graph is cut at ages 60 and 80 for clarity. Notice the value at Time = 60: The parametric models are “extrapolating” back to time at birth, and so the estimates do not represent the cumulative hazards of the conditional distribution, given survival to age 60. This has an impact on the estimates of the regression coefficients in the case of the Weibull distribution, because the conditional is not Weibull even though the unconditional is. This phenomenon does not apply to the Gompertz distribution, for which the conditional distribution is again Gompertz with the same rate, but with a different level.
FIGURE 8.4: Gompertz vs Cox and Weibull vs Cox estimated cumulative hazards functions.
From Figure 8.4 it appears as if we can get the conditional cumulative hazards simply by subtracting the value at age 60, \(H(60)\), from the whole curves, and that is in fact correct. In the Gompertz case, it would simply recover Figure 8.3, but the Weibull case is different: Starting with Figure 8.2 and adding the adjusted Weibull curve from Figure 8.4 we get Figure 8.5.
FIGURE 8.5: Comparison of the conditional and unconditional Weibull models.
Obviously, the conditional Weibull distribution fits data much better than the unconditional one. The comparison with the Gompertz distribution is shown in Figure 8.6.
FIGURE 8.6: Comparison of the Gompertz and conditional Weibull models.
It seems as if the conditional Weibull model fits the data as well as or even better than the Gompertz model. The latter grows too fast at the very highest ages, an observation made in many recent studies (Rootzén and Zholud 2017; Lenart and Vaupel 2017; Barbi et al. 2018; Broström 2019).
8.1.5 The piecewise constant proportional hazards model
The piecewise constant hazard (pch) model can always be used in the fitting process with good results. There is, however, a moment of uncertainty in the process: how should time be cut into pieces, and how many pieces should there be? Two possible strategies are (i) equally spaced cut points, and (ii) relatively more cut points where there are many deaths, that is, where the hazard function is expected to be steep.
The oldmort data set spans a time interval of 40 years, and we know that mortality in the age interval 60–100 increases almost exponentially with age, suggesting more cut points at the high ages. Against that is the fact that at very high ages, say 90–100, few observations remain; most individuals have already died.
We may start with eight intervals of equal length, 60–65, 65–70, …, 95–100, and fit a pch model with the aid of the function pchreg in eha. The result is shown in Table ??.
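The fit just described might be produced like this (a sketch; it assumes pchreg takes the interval boundaries, on the shifted time scale, through its cuts argument):

```r
library(eha)

## Eight five-year intervals on the duration scale: 0-5, 5-10, ..., 35-40.
fit.pch <- pchreg(Surv(enter - 60, exit - 60, event) ~ sex + region,
                  cuts = seq(0, 40, by = 5), data = oldmort)
```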
Then the baseline cumulative hazards are compared to the one from the Cox regression fit in Figure 8.7.
FIGURE 8.7: Piecewise constant cumulative hazards, old age mortality.
As expected, the fit is good. It is also obvious that the estimated regression parameters do not vary much between the studied models.
The piecewise constant model works well with this data set, but its real strength is its flexibility and speed with huge data sets. Its full potential is reached by first tabulating the data with the eha function toTpch. An illustration with the oldmort data set:
    olmtab <- toTpch(Surv(enter, exit, event) ~ sex + region,
                     cuts = c(seq(60, 85, by = 5), 100), data = oldmort)
The resulting table (first five rows) is shown in Table ??.
The original data set has 6495 rows (observations), while the created table has only 36 rows. The latter is analyzed via the function tpchreg, see Table 8.4, which is identical to Table ??.
    fit.tpch <- tpchreg(oe(event, exposure) ~ sex + region,
                        data = olmtab, time = age)
Covariate | level | W_mean | Coef | HR | SE | LR_p |
---|---|---|---|---|---|---|
sex | | | | | | 0.0001 |
 | female | 0.594 | 0 | 1 | | |
 | male | 0.406 | 0.1811 | 1.1985 | 0.0458 | |
region | | | | | | 0.0009 |
 | town | 0.111 | 0 | 1 | | |
 | industry | 0.326 | 0.2237 | 1.2507 | 0.0849 | |
 | rural | 0.563 | 0.0598 | 1.0616 | 0.0826 | |
Max Log Likelihood | | -7295.8 | | | | |
Restricted mean | | 17.216 | | | | |
8.1.6 Testing the proportional hazards assumption
The pch model is well suited for a formal test of the proportional hazards assumption, but some trickery is needed with the eha package. It is best shown by example, and we continue by utilizing the newly created data table olmtab.
By omitting the time argument in the call to tpchreg, an exponential (constant hazards) model is fitted, and the variable age is free to be included as a covariate.
    fit.tpch <- tpchreg(oe(event, exposure) ~ age * (sex + region),
                        data = olmtab)
    (dr <- drop1(fit.tpch, test = "Chisq"))
    Single term deletions

    Model:
    oe(event, exposure) ~ age * (sex + region)
               Df   AIC    LRT Pr(>Chi)
    <none>        14614
    age:sex     5 14610  7.009  0.21997
    age:region 10 14614 20.108  0.02825
The age interaction with sex is far from significant, while the effect of region on mortality seems to vary significantly with age. This can be illustrated by performing an analysis stratified by region, with the time = age argument included in the usual way, and then plotting the hazard functions for the three strata, see Figure 8.8.
    fit.str <- tpchreg(oe(event, exposure) ~ sex + strata(region),
                       time = age, data = olmtab)
FIGURE 8.8: Hazard functions for three regions, old age mortality, Skellefteå 1860–80.
The deviating region is town. It is also the smallest region, with no registered deaths above age 90 and only 4.7 person-years there. We also note that mortality in ages 85–90 is highest in the town, while at younger ages the town region has the lowest mortality. The zero mortality above age 90 is simply an artefact of very few observed person-years.
Let us perform the same exercise with a larger data set, the Swedish population 1968–2019. (See also Example 3.1 in Chapter 3.) We check the hypothesis of proportional hazards between women and men, the question is: Is the female advantage of the same relative size in all ages?
    sp <- swepop
    sp$deaths <- swedeaths$deaths
Then
    fit.swr <- tpchreg(oe(deaths, pop) ~ strata(sex) + I(year - 2000),
                       last = 101, time = age, data = sp)
    rr.sex <- exp(tpchreg(oe(deaths, pop) ~ sex + I(year - 2000),
                          last = 101, time = age,
                          data = sp)$coefficients[1])
    cumhaz <- hazards(fit.swr, cum = TRUE)   # cumulative hazards
    haz <- hazards(fit.swr, cum = FALSE)     # hazards
    op <- par(mfrow = c(1, 2))
    plot(haz$x, haz$y[2, ] / haz$y[1, ], type = "l", ylim = c(1, 3),
         xlab = "Age", ylab = "Hazard Ratio")
    abline(h = 1)
    abline(h = rr.sex, lty = 2)
    text(5, 1.65, "PH")
    plot(cumhaz$x, cumhaz$y[2, ] / cumhaz$y[1, ], type = "l",
         ylim = c(1, 3),
         xlab = "Age", ylab = "Cumulative Hazard Ratio")
    abline(h = 1)
    abline(h = rr.sex, lty = 2)
    text(5, 1.65, "PH")
FIGURE 8.9: Mortality ratio for men vs. women, Sweden 1968–2019.
    par(op)
We note two things: (i) The variation around the proportional hazards estimate (PH) is huge and (ii) the smoothing effect of accumulation is large, which we should keep in mind as a warning when trying to judge proportionality in graphs of cumulative hazards functions.
8.1.7 Choosing the best parametric proportional hazards model
For modeling survival data with parametric proportional hazards models, the distributions of the function phreg in the package eha are available. How to select a suitable parametric model is shown by a couple of examples using the now familiar data set oldmort.
Remember that by design, individuals are followed from the day they are aged 60. In order to calculate the follow-up times, we usually subtract 60 from the two columns enter and exit. Otherwise, when specifying a parametric survivor distribution, it would in fact correspond to a left-truncated (at 60) distribution. However, for a Cox regression, this makes no difference.
    om <- oldmort
    fm <- as.formula("Surv(enter, exit, event) ~ sex + region")
    fm0 <- as.formula("Surv(enter - 60, exit - 60, event) ~ sex + region")
    fit.w <- phreg(fm, data = oldmort)
    o.w <- extractAIC(fit.w)[2]
    fit.w0 <- phreg(fm0, data = oldmort)
    o.w0 <- extractAIC(fit.w0)[2]
Here we applied a Weibull baseline distribution (the default distribution in phreg; by specifying nothing, the Weibull is chosen). Note also the way we can store a formula for future use: this is particularly useful when the same model will be fitted several times while changing some attribute, like the baseline distribution.
Now let us repeat this with the pch, gompertz, and ev alternatives, for both the unconditional and the conditional approach. Then we compare the AICs and choose the distribution with the smallest value.
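One way to collect the AICs is a small loop over the phreg distributions (a sketch; the pch case is fitted separately with pchreg and is not shown here):

```r
library(eha)

fm  <- as.formula("Surv(enter, exit, event) ~ sex + region")
fm0 <- as.formula("Surv(enter - 60, exit - 60, event) ~ sex + region")

## AIC on both time scales for each baseline distribution:
dists <- c("weibull", "gompertz", "ev")
aics <- sapply(dists, function(d) {
  c(age      = extractAIC(phreg(fm,  data = oldmort, dist = d))[2],
    duration = extractAIC(phreg(fm0, data = oldmort, dist = d))[2])
})
aics
```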
First we note that only for the Weibull distribution does it make a difference which time scale (age or duration) is used. The Gompertz distribution gives the best fit, but the conditional Weibull (duration since 60 as time scale) is very close. We also saw this graphically earlier (Figure 8.6).
The reason that the pch model fares so badly is that it is penalized for the large number of parameters (eight) it uses to estimate the baseline hazard.
8.2 Accelerated Failure Time Models
The accelerated failure time (AFT) model is best described through relations between survivor functions. For instance, comparing two groups:
- Group 0: \(P(T \ge t) = S_0(t)\) (control group)
- Group 1: \(P(T \ge t) = S_0(\phi t)\) (treatment group)
The model says that treatment accelerates failure time by the factor \(\phi\). If \(\phi < 1\), treatment is good (prolongs life), otherwise bad. Another interpretation is that the median life length is multiplied by \(1/\phi\) by treatment.
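The median-scaling interpretation can be checked numerically with a hypothetical baseline distribution (a base-R sketch; the Weibull shape, scale, and \(\phi\) below are illustrative choices, not values from the text):

```r
## Hypothetical baseline: Weibull with shape 2 and scale 10 (illustrative only).
S0 <- function(t) pweibull(t, shape = 2, scale = 10, lower.tail = FALSE)

phi <- 0.5                    # acceleration factor; phi < 1 prolongs life
S1 <- function(t) S0(phi * t) # treatment-group survivor function

m0 <- 10 * sqrt(log(2))       # median of the baseline distribution
## The treated median is m0 / phi, i.e. the median is multiplied by 1 / phi:
c(S0(m0), S1(m0 / phi))       # both equal 0.5
```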
In Figure 8.10 the difference between the accelerated failure time and the proportional hazards models concerning the hazard functions is illustrated.
FIGURE 8.10: Proportional hazards (left) and accelerated failure time model (right). The baseline distribution is Loglogistic with shape 5 (dashed).
The AFT hazard is not only multiplied by 2; it is also shifted to the left because the process time is accelerated. Note how the hazards in the AFT case converge as time increases. This is usually a sign of the suitability of an AFT model.
8.2.1 The AFT regression model
If \(T\) has survivor function \(S(t)\) and \(T_c = T/c\), then \(T_c\) has survivor function \(S(ct)\). Then, if \(Y = \log(T)\) and \(Y_c = \log(T_c)\), the following relation holds:
\[\begin{equation*} Y_c = Y - \log(c). \end{equation*}\]
Renaming \(Y\) as \(\epsilon\) and \(Y_c\) as \(Y\), and setting \(\log(c) = -\boldsymbol{\beta} \mathbf{x}\), this can be written in familiar form:
\[\begin{equation*} Y = \boldsymbol{\beta} \mathbf{x} + \epsilon, \end{equation*}\]
i.e., an ordinary linear regression model for the log survival times. In the absence of right censoring and left truncation, this model can be estimated by least squares. However, the presence of these forms of incomplete data makes it necessary to rely on maximum likelihood methods. In R, the function aftreg in the package eha and the function survreg in the package survival perform the task of fitting AFT models. The package flexsurv (Jackson 2016) also has useful functionality in this area.
Besides differing parameterizations, the main difference between aftreg and survreg is that the latter does not allow for left-truncated data. One reason for this is that left truncation is a much harder problem to deal with in AFT models than in proportional hazards models.
The reason is that, with a time-varying covariate \(z(t), t \ge 0\), the AFT model is
\[\begin{equation*} S(t; z) = S_0\biggl(\int_0^t \exp\bigl(\beta z(s)\bigr)ds\biggr), \end{equation*}\]
and it is required that \(z(s)\) is known for all \(s\), \(0 \le s \le t\). With an observation left truncated at \(t_0\), say, \(z(s)\) is unknown for \(0 \le s < t_0\). In eha, this is solved by assuming that \(z(s) = z(t_0)\), \(0 \le s < t_0\).
A detailed description of the implementation of the AFT models in eha is found in Appendix B.
8.2.2 AFT modeling in R
We repeat the examples from the proportional hazards section, but with AFT models instead.
For a description of this data set, see above. Here we fit an AFT model with the Weibull distribution. This should be compared to the proportional hazards model with the Weibull distribution, see Table 8.2.
    source("R/fit.out.R")
    fm <- as.formula("Surv(enter - 60, exit - 60, event) ~ sex + region")
    fit.w1 <- aftreg(fm, id = id, data = oldmort)
Note carefully the inclusion of the argument id; it is necessary when some individuals are represented by more than one record in the data file.
Covariate | level | W_mean | Coef | lifeAcc | SE | LR_p |
---|---|---|---|---|---|---|
sex | | | | | | 0.0005 |
 | female | 0.594 | 0 | 1 | | |
 | male | 0.406 | 0.0922 | 1.0966 | 0.0265 | |
region | | | | | | 0.0000 |
 | town | 0.111 | 0 | 1 | | |
 | industry | 0.326 | 0.1512 | 1.1632 | 0.0495 | |
 | rural | 0.563 | 0.0319 | 1.0325 | 0.0481 | |
Max Log Likelihood | | -7415.1 | | | | |
Note that the "Max log likelihood," −7415.1, is not exactly the same as in the proportional hazards case, and the reason is the presence of the time-varying covariate region. With no time-varying covariates, the AFT and the PH Weibull models are equivalent. \(\ \Box\)
The data set fert in the R package eha contains birth intervals for married women in 19th century Skellefteå. It is described in detail in Chapter 1. Here only the intervals starting with the first birth for each woman are considered. First the data are extracted and examined.
Some women never got a second child, for instance the first woman (id = 1) above.
It turns out, as is reasonably expected, that the hazard function for time to the second birth first increases to a maximum and then decreases. Indeed, we can make a crude check of that by estimating the hazard function without covariates with the piecewise constant hazards model, which in principle is a nonparametric approach if the time span is cut into small enough pieces. So we cut the first eight years into half-year-long pieces, see Figure 8.11.
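A crude estimate of this kind might be obtained as follows (a sketch; it assumes f12 has been created as described and that pchreg accepts a vector of half-year boundaries in cuts):

```r
library(eha)

## Crude hazard: no covariates, half-year pieces over the first eight years.
fit.pw <- pchreg(Surv(next.ivl, event) ~ 1,
                 cuts = seq(0, 8, by = 0.5), data = f12)
plot(fit.pw, fn = "haz", main = "",
     xlab = "Years", ylab = "Hazard")
```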
FIGURE 8.11: Estimated hazard function for waiting time between first and second birth.
There are two possible candidates for a baseline distribution with the right shape of hazard function: the lognormal and the loglogistic distributions. \(\ \Box\)
8.2.3 The Lognormal model
The family of lognormal distributions is characterized by the fact that taking the natural logarithm of a random variable from the family gives a random variable from the family of Normal distributions. Both the hazard and the survivor functions lack closed forms.
The data are already extracted and examined in the previous example. Now, the accelerated failure time (AFT) lognormal regression model is examined.
    f12$year.1860 <- f12$year - 1860
    fit.lognorm <- aftreg(Surv(next.ivl, event) ~ age + year.1860 + ses,
                          data = f12, dist = "lognormal")
Note that the argument id is not used here; that is because there are no time-varying covariates for next.ivl. The id variable in the data set refers to the mother's id and not to specific birth intervals. The data set f12 is a subset of fert; it only includes intervals starting with the first birth.
Covariate | level | W_mean | Coef | lifeAcc | SE | LR_p |
---|---|---|---|---|---|---|
age | | 27.247 | -0.0180 | 0.9822 | 0.0027 | 0.0000 |
year.1860 | | -1.041 | 0.0000 | 1.0000 | 0.0010 | 0.9964 |
ses | | | | | | 0.1218 |
 | farmer | 0.458 | 0 | 1 | | |
 | unknown | 0.178 | -0.0634 | 0.9386 | 0.0336 | |
 | upper | 0.022 | 0.0377 | 1.0384 | 0.0821 | |
 | lower | 0.341 | -0.0487 | 0.9525 | 0.0271 | |
Max Log Likelihood | | -2426.2 | | | | |
The interpretation of the regression coefficients is different from the PH case: Exponentiated (lifeAcc) they measure how much time to event is accelerated. For instance, if mother’s age is increased by one year, the waiting time to the next birth is accelerated by the factor 0.9822, that is, slowed down slightly.
It may be more natural to use expected life as a comparison rather than accelerated time, even if one is the reverse of the other. In aftreg, it is possible to choose param = "lifeExp" for that purpose.
    fit.lognorm2 <- aftreg(Surv(next.ivl, event) ~ age + year.1860 + ses,
                           data = f12, dist = "lognormal",
                           param = "lifeExp")
Covariate | level | W_mean | Coef | lifeExp | SE | LR_p |
---|---|---|---|---|---|---|
age | | 27.247 | 0.0180 | 1.0181 | 0.0027 | 0.0000 |
year.1860 | | -1.041 | -0.0000 | 1.0000 | 0.0010 | 0.9964 |
ses | | | | | | 0.1218 |
 | farmer | 0.458 | 0 | 1 | | |
 | unknown | 0.178 | 0.0634 | 1.0654 | 0.0336 | |
 | upper | 0.022 | -0.0377 | 0.9630 | 0.0821 | |
 | lower | 0.341 | 0.0487 | 1.0499 | 0.0271 | |
Max Log Likelihood | | -2426.2 | | | | |
As can be seen by comparing Tables 8.7 and 8.6, the only difference is the signs of the estimated coefficients. And therefore, “lifeAcc = 1 / lifeExp.” The simple interpretation of the “lifeExp” parameter is that it multiplies the expected value for the corresponding category. In Figure 8.12 the estimated baseline hazard function is shown.
FIGURE 8.12: Estimated lognormal baseline hazard function for length of birth intervals.
It was created using
    plot(fit.lognorm, fn = "haz", main = "",
         xlab = "Years", ylab = "Hazards")
\(\ \Box\)
8.2.4 The loglogistic model
The family of loglogistic distributions is characterized by the fact that taking the natural logarithm of a random variable from the family gives a random variable from the family of Logistic distributions. Contrary to the Lognormal case, both the hazard and the survivor functions have closed forms.
The data are already extracted and examined in the previous example. Now, the accelerated failure time (AFT) loglogistic regression model is examined.
    fit.loglogist <- aftreg(Surv(next.ivl, event) ~ age + year.1860 + ses,
                            data = f12, dist = "loglogistic")
Covariate | level | W_mean | Coef | lifeAcc | SE | LR_p |
---|---|---|---|---|---|---|
age | | 27.247 | -0.0139 | 0.9862 | 0.0025 | 0.0000 |
year.1860 | | -1.041 | 0.0002 | 1.0002 | 0.0009 | 0.8194 |
ses | | | | | | 0.2056 |
 | farmer | 0.458 | 0 | 1 | | |
 | unknown | 0.178 | -0.0293 | 0.9711 | 0.0293 | |
 | upper | 0.022 | 0.0978 | 1.1028 | 0.0747 | |
 | lower | 0.341 | -0.0327 | 0.9679 | 0.0237 | |
Max Log Likelihood | | -2292 | | | | |
Which is the better fit, the lognormal or the loglogistic? Let us compare the maximized log likelihoods: For the Lognormal it is −2426.23, and for the Loglogistic it is −2292.01, so it seems as if the Loglogistic fit is clearly better. Note, though, that this is not a formal statistical test, it is a comparison by the Akaike criterion (Akaike 1974). This conclusion is also partly supported by a comparison of the two estimated baseline hazard functions (Figures 8.13 and 8.12) with the estimated crude hazard function in Figure 8.11.
FIGURE 8.13: Estimated loglogistic baseline hazard function for length of birth intervals.
8.2.5 The Gompertz model
The Gompertz distribution is special in that it can be fit into both the AFT and the PH framework. Of course, as we have seen, this also holds for the Weibull distribution in a trivial way, the AFT and the PH models are the same, but for the Gompertz distribution, the AFT and PH approaches yield different models.
For the AFT framework to be in place in the Gompertz case, the distribution needs to be formulated with a rather unfamiliar parameterization, called the canonical parameterization in the package eha. It works as follows. The standard definition of the Gompertz hazard function is
\[\begin{equation*} h_r(t; (\alpha, \beta)) = \alpha \exp(\beta t), \quad t > 0; \; \alpha > 0, \; -\infty < \beta < \infty, \end{equation*}\]
and it is called the rate parameterization in eha.
As noted earlier, in order for \(h_r\) to determine a proper survival distribution, it
must be required that \(\beta \ge 0\). It was also noted that when \(\beta = 0\), the
resulting distribution is exponential.
The canonical definition of the Gompertz hazard function is given by
\[\begin{equation*} h_c(t; (\tau, \sigma)) = \frac{\tau}{\sigma} \exp\biggl(\frac{t}{\sigma}\biggr), \quad t > 0; \; \tau, \sigma > 0. \end{equation*}\]
The transition from \(h_r\) to \(h_c\) is given by \(\sigma = 1 / \beta, \, \tau = \alpha / \beta\), and note that this implies that the rate in the canonical form must be strictly positive. Furthermore, the exponential special case now appears in the limit as \(\sigma \rightarrow \infty\). In practice this means that when the baseline hazard is only weakly increasing, \(\sigma\) is very large and numerical problems are likely to occur in the estimation process.
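The parameter transition can be expressed as a small base-R helper (a sketch; rate_to_canonical is a hypothetical name for illustration, not an eha function):

```r
## Hypothetical helper: rate parameters (alpha, beta) -> canonical (tau, sigma).
rate_to_canonical <- function(alpha, beta) {
  stopifnot(beta > 0)   # the canonical form requires an increasing hazard
  c(tau = alpha / beta, sigma = 1 / beta)
}

h_rate      <- function(t, alpha, beta) alpha * exp(beta * t)
h_canonical <- function(t, tau, sigma) (tau / sigma) * exp(t / sigma)

## The two parameterizations give identical hazards:
p <- rate_to_canonical(alpha = 0.02, beta = 0.095)
all.equal(h_rate(30, 0.02, 0.095),
          h_canonical(30, p[["tau"]], p[["sigma"]]))
```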
The conclusion of all this is that the AFT Gompertz model is suitable in situations where the intensity of an event is clearly increasing with time. A good example is adult mortality.
We repeat the PH analysis in Table 8.3, but with the AFT model, see Table 8.9.
Covariate | level | W_mean | Coef | lifeExp | SE | LR_p |
---|---|---|---|---|---|---|
sex | | | | | | 0.0008 |
 | female | 0.594 | 0 | 1 | | |
 | male | 0.406 | -0.0649 | 0.9371 | 0.0192 | |
region | | | | | | 0.0071 |
 | town | 0.111 | 0 | 1 | | |
 | industry | 0.326 | -0.0960 | 0.9084 | 0.0391 | |
 | rural | 0.563 | -0.0457 | 0.9554 | 0.0385 | |
Max Log Likelihood | | -7288.3 | | | | |
The expected remaining life at 60 for a man living in the town is 16 years, and for a woman living in the town the expected remaining life is \(1.067 \times 16 = 17\) years.
8.3 Proportional Hazards or AFT Model?
The problem of choosing between a proportional hazards and an accelerated failure time model (everything else equal) can be solved by comparing the AIC (Leeuw 1992; Akaike 1974) of the models. Since the numbers of parameters are equal in the two cases, this amounts to comparing the maximized likelihoods. For instance, in the case with old age mortality:
Let us see what happens with the Gompertz AFT model:
Exactly the same procedure as with the Weibull distribution, but we have to
specify the Gompertz distribution in the call (remember, the Weibull
distribution is the default choice, both for phreg
and aftreg
).
Comparing the corresponding result for the proportional hazards and the AFT models with the Gompertz distribution, we find that the maximized log likelihood in the former case is −7280.9, compared to −7288.3 for the latter. This indicates that the proportional hazards model fit is better. Note however that we cannot formally test the proportional hazards hypothesis; the two models are not nested.
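The AIC bookkeeping behind this comparison can be sketched in a few lines; the log likelihoods are those quoted above, while the common parameter count k is an assumed illustrative value (its exact value cancels in the comparison):

```r
## AIC = -2 * logLik + 2 * k; with equal k, comparing AIC is the
## same as comparing maximized log likelihoods.
aic <- function(logL, k) -2 * logL + 2 * k
logL.ph  <- -7280.9   # Gompertz proportional hazards fit (from the text)
logL.aft <- -7288.3   # Gompertz AFT fit (from the text)
k <- 5                # assumed common number of parameters
aic(logL.ph, k) < aic(logL.aft, k)   # TRUE: the PH model is preferred
```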
8.4 Discrete Time Models
There are two ways of looking at discrete duration data: either time is truly discrete, for instance the number of trials until an event occurs, or discreteness is an approximation caused by the rounding of continuous time data. In a sense all data are discrete, because it is impossible to measure anything on a continuous scale with infinite precision, but from a practical point of view it is reasonable to say that data are discrete when tied events occur embarrassingly often.
8.4.1 Data formats: wide and long
When working with register data, time is often measured in years which
makes it necessary and convenient to work with discrete models. A typical
data format is the so-called wide format, where there is one record
(row) per individual, and measurements for many years. We have so far only
worked with the long format. The data sets created by survSplit
are in long format; there is one record per individual and age category.
The R work horse in switching back and forth between the long and wide
formats is the function reshape
.
It may look confusing at first, but
if data follow some simple rules, it is quite easy to use reshape
.
The function reshape
is typically used with longitudinal data,
where there are several measurements at different time points for each
individual. If all the data for one individual are registered within one record
(row), we say that data are in wide format, and if there is one record (row)
per time point (several records per individual), data are in long format. Using
wide format, the rule is that time-varying variable names must end in a
numeric value indicating at which time the measurement was taken. For
instance, if the variable civ
(civil status) is noted at times 1, 2,
and 3, there must be variables named civ.1, civ.2, civ.3
,
respectively. It is optional to use any separator between the base
name (civ
) and the time, but it should be one character or empty. The
“.” is what reshape
expects by default, so using that form
simplifies coding somewhat.
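As a toy illustration of this naming rule (the data are made up), consider a wide data frame where civ is measured at times 1, 2, and 3:

```r
wide <- data.frame(id  = 1:2,
                   sex = c("male", "female"),
                   civ.1 = c("unmarried", "married"),
                   civ.2 = c("married",   "married"),
                   civ.3 = c("married",   "widow"))
## Variable names civ.1, civ.2, civ.3 follow the "base name + time" rule,
## so reshape can guess both the base name and the time values.
long <- reshape(wide, direction = "long", idvar = "id",
                varying = c("civ.1", "civ.2", "civ.3"))
long[order(long$id, long$time), c("id", "sex", "time", "civ")]
```

With the default separator ".", reshape splits the varying names into the base name civ and the time values, creating the columns civ and time in the long data frame.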
We start by creating an example data set as an illustration. This is
accomplished by starting off with the data set oldmort
in eha
and “trimming” it.
data(oldmort)
om <- oldmort[oldmort$enter == 60, ]
om <- age.window(om, c(60, 70))
om$m.id <- om$f.id <- om$imr.birth <- om$birthplace <- NULL
om$birthdate <- om$ses.50 <- NULL
om1 <- survival::survSplit(om, cut = 61:69, start = "enter",
                           end = "exit", event = "event",
                           episode = "agegrp")
om1$agegrp <- factor(om1$agegrp, labels = 60:69)
om1 <- om1[order(om1$id, om1$enter), ]
rownames(om1) <- 1:NROW(om1)
om1$id <- as.numeric(as.factor(om1$id))
head(om1)
id sex civ region enter exit event agegrp
1 1 male widow rural 60 61.000 0 60
2 1 male widow rural 61 62.000 0 61
3 1 male widow rural 62 63.000 0 62
4 1 male widow rural 63 63.413 0 63
5 2 female widow industry 60 61.000 0 60
6 2 female widow industry 61 62.000 0 61
This is the long format: each individual has as many records as “presence
ages.” For instance, person No. 1 has four records, for the ages 60–63.
The maximum possible number of records for one individual is 10. We can check
the distribution of the number of records per person with the function
tapply
:
recs <- tapply(om1$id, om1$id, length)
table(recs)
recs
1 2 3 4 5 6 7 8 9 10
400 397 351 315 307 250 192 208 146 657
It is easier to get to grips with the distribution with a graph, in this case a barplot, see Figure 8.14.
FIGURE 8.14: Barplot of the number of records per person.
Now, let us turn om1
into a data frame in wide format. This is
done with the function reshape
. First we remove the redundant
variables enter
and exit
.
om1$exit <- om1$enter <- NULL
om2 <- reshape(om1, v.names = c("event", "civ", "region"),
               idvar = "id", direction = "wide",
               timevar = "agegrp")
names(om2)
[1] "id" "sex" "event.60" "civ.60" "region.60"
[6] "event.61" "civ.61" "region.61" "event.62" "civ.62"
[11] "region.62" "event.63" "civ.63" "region.63" "event.64"
[16] "civ.64" "region.64" "event.65" "civ.65" "region.65"
[21] "event.66" "civ.66" "region.66" "event.67" "civ.67"
[26] "region.67" "event.68" "civ.68" "region.68" "event.69"
[31] "civ.69" "region.69"
Here there are two time-fixed variables, id
and sex
, and three
time-varying variables, event
, civ
, and region
. The
time-varying variables have a suffix of the type .xx
, where xx
varies
from 60 to 69.
This is how data in wide format usually show up; the suffix may start with
something other than .
, but it must be a single character, or nothing.
The real problem is how to switch from wide format to long, because our
survival analysis tools want it that way. The solution is to use
reshape
again, but with other specifications.
om3 <- reshape(om2, direction = "long", idvar = "id",
               varying = 3:32)
head(om3)
id sex time event civ region
1.60 1 male 60 0 widow rural
2.60 2 female 60 0 widow industry
3.60 3 male 60 0 married town
4.60 4 male 60 0 married town
5.60 5 female 60 0 married town
6.60 6 female 60 0 married town
A new variable time
is created, which goes from 60 to 69, one
step for each of the ages. We would
like to have the file sorted primarily by id
and secondarily by time.
om3 <- om3[order(om3$id, om3$time), ]
om3[1:11, ]
id sex time event civ region
1.60 1 male 60 0 widow rural
1.61 1 male 61 0 widow rural
1.62 1 male 62 0 widow rural
1.63 1 male 63 0 widow rural
1.64 1 male 64 NA <NA> <NA>
1.65 1 male 65 NA <NA> <NA>
1.66 1 male 66 NA <NA> <NA>
1.67 1 male 67 NA <NA> <NA>
1.68 1 male 68 NA <NA> <NA>
1.69 1 male 69 NA <NA> <NA>
2.60 2 female 60 0 widow industry
Note that all individuals got 10 records here, even those who are observed for fewer years. Individual No. 1 is only observed for the ages 60–63, and the next six records are redundant; they will not be used in an analysis if kept, so from a practical point of view it is a good idea to remove them.
NROW(om3)
[1] 32230
om3 <- om3[!is.na(om3$event), ]
NROW(om3)
[1] 17434
The data frame shrank to about half its original size. First, let us summarize the data.
summary(om3)
id sex time event
Min. : 1 male : 7045 Min. :60.00 Min. :0.00000
1st Qu.: 655 female:10389 1st Qu.:61.00 1st Qu.:0.00000
Median :1267 Median :63.00 Median :0.00000
Mean :1328 Mean :63.15 Mean :0.02587
3rd Qu.:1948 3rd Qu.:65.00 3rd Qu.:0.00000
Max. :3223 Max. :69.00 Max. :1.00000
civ region
unmarried: 1565 town :2485
married :11380 industry:5344
widow : 4489 rural :9605
The key variables in the discrete time analysis are event
and
time
. For the baseline hazard we need one parameter per value of
time
, so it is practical to transform the continuous variable time
to a factor.
om3$time <- as.factor(om3$time)
summary(om3)
id sex time event
Min. : 1 male : 7045 60 :3223 Min. :0.00000
1st Qu.: 655 female:10389 61 :2823 1st Qu.:0.00000
Median :1267 62 :2426 Median :0.00000
Mean :1328 63 :2075 Mean :0.02587
3rd Qu.:1948 64 :1760 3rd Qu.:0.00000
Max. :3223 65 :1453 Max. :1.00000
(Other):3674
civ region
unmarried: 1565 town :2485
married :11380 industry:5344
widow : 4489 rural :9605
The summary now produces a frequency table for time
.
Note that we always want our data to be in long format before we start the analysis, so the important lesson here was how to go from wide to long. You may find the tidyr package (Wickham 2021) useful if you encounter “untidy” data and want to tidy up.
8.4.2 Binomial regression with glm
For a given time point and a given individual, the response is whether an
event has occurred or not; that is, it is modeled as a
Bernoulli outcome, which is a special
case of the binomial distribution.
The discrete time analysis may now be performed in several ways. Most
straightforward is to run a
binomial regression with event
as
response through the basic glm
function with
family = binomial(link = cloglog)
. The so-called
cloglog (complementary log-log) link is
used in order to preserve the proportional hazards property of the underlying,
real world, continuous time model.
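The reasoning can be sketched as follows: if the continuous-time hazard is \(h(t) = h_0(t)\exp(\beta x)\), the probability of an event in a given interval, conditional on reaching it, is \(1 - \exp(-H_0\exp(\beta x))\), where \(H_0\) is the integrated baseline hazard over the interval. The cloglog of that probability is therefore linear in \(x\) with slope \(\beta\). A small numerical check with arbitrary illustrative values:

```r
cloglog <- function(p) log(-log(1 - p))
H0   <- 0.05                      # integrated baseline hazard, one interval
beta <- -0.36                     # illustrative regression coefficient
p0 <- 1 - exp(-H0)                # event probability when x = 0
p1 <- 1 - exp(-H0 * exp(beta))    # event probability when x = 1
cloglog(p1) - cloglog(p0)         # equals beta: PH is preserved
```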
fit.glm <- glm(event ~ sex + civ + region + time,
               family = binomial(link = cloglog), data = om3)
summary(fit.glm)
Call:
glm(formula = event ~ sex + civ + region + time, family = binomial(link = cloglog),
data = om3)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.4263 -0.2435 -0.2181 -0.1938 2.9503
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.42133 0.21167 -16.164 < 2e-16
sexfemale -0.36332 0.09927 -3.660 0.000252
civmarried -0.33683 0.15601 -2.159 0.030847
civwidow -0.19008 0.16810 -1.131 0.258175
regionindustry 0.10784 0.14443 0.747 0.455264
regionrural -0.22423 0.14034 -1.598 0.110080
time61 0.04895 0.18505 0.265 0.791366
time62 0.46005 0.17400 2.644 0.008195
time63 0.05586 0.20200 0.277 0.782134
time64 0.46323 0.18883 2.453 0.014162
time65 0.31749 0.20850 1.523 0.127816
time66 0.45582 0.21225 2.148 0.031751
time67 0.91511 0.19551 4.681 2.86e-06
time68 0.68549 0.22575 3.036 0.002394
time69 0.60539 0.24904 2.431 0.015064
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4186.8 on 17433 degrees of freedom
Residual deviance: 4125.2 on 17419 degrees of freedom
AIC: 4155.2
Number of Fisher Scoring iterations: 7
This output is not so pleasant (we do not want the time
estimates printed),
but we can anyway see that females (as
usual) have lower mortality than males, that the married are better off than
the unmarried, and that the regional differences are perhaps not so large. To get a
better understanding of the statistical significance of the findings we run
drop1
on the fit:
drop1(fit.glm, test = "Chisq")
Single term deletions
Model:
event ~ sex + civ + region + time
Df Deviance AIC LRT Pr(>Chi)
<none> 4125.2 4155.2
sex 1 4138.5 4166.5 13.302 0.0002651
civ 2 4130.2 4156.2 4.978 0.0829956
region 2 4135.8 4161.8 10.526 0.0051810
time 9 4161.2 4173.2 36.005 3.956e-05
Mildly surprisingly, civil status
is not that statistically
significant, while region
(and the other variables) is. The strong
significance of the time variable is of course expected; mortality is
expected to increase with increasing age.
8.4.3 Survival analysis with coxreg
By some data manipulation we can also use the function coxreg
in the package
eha
for the analysis. For
that to succeed we need intervals as responses, and the way of doing that
is to add two “fake” variables, exit
and enter
. The latter must be
slightly smaller than the former:
om3$exit <- as.numeric(as.character(om3$time))
om3$enter <- om3$exit - 0.1
cap <- "Old age mortality, discrete time analysis."
fit.ML <- coxreg(Surv(enter, exit, event) ~ sex + civ + region,
                 method = "ml", data = om3, coxph = FALSE)
fit.out(fit.ML, caption = cap, label = "mlreg8")
| Covariate | level | W_mean | Coef | HR | SE | LR_p |
|---|---|---|---|---|---|---|
| sex | | | | | | 0.0003 |
| | male | 0.404 | 0 | 1 | | |
| | female | 0.596 | -0.3633 | 0.6954 | 0.0993 | |
| civ | | | | | | 0.0830 |
| | unmarried | 0.090 | 0 | 1 | | |
| | married | 0.653 | -0.3368 | 0.7140 | 0.1560 | |
| | widow | 0.257 | -0.1901 | 0.8269 | 0.1681 | |
| region | | | | | | 0.0052 |
| | town | 0.143 | 0 | 1 | | |
| | industry | 0.307 | 0.1078 | 1.1139 | 0.1445 | |
| | rural | 0.551 | -0.2242 | 0.7991 | 0.1404 | |
| Max Log Likelihood | | | | | | -2062.6 |
The result is shown in Table 8.10, and plots of the cumulative hazards and the survival function are easily achieved, see Figures 8.15 and 8.16.
FIGURE 8.15: The cumulative hazards, from the coxreg fit.
FIGURE 8.16: The survival function, from the coxreg fit.
Finally, the proportional hazards assumption can be tested in the discrete
time framework by creating an interaction between time
and the
covariates in question. This is possible using glm
.
fit2.glm <- glm(event ~ (sex + civ + region) * time,
                family = binomial(link = cloglog),
                data = om3)
drop1(fit2.glm, test = "Chisq")
Single term deletions
Model:
event ~ (sex + civ + region) * time
Df Deviance AIC LRT Pr(>Chi)
<none> 4077.1 4197.1
sex:time 9 4088.2 4190.2 11.116 0.2679
civ:time 18 4099.5 4183.5 22.425 0.2137
region:time 18 4093.7 4177.7 16.600 0.5508
There is no sign of non-proportionality, that is, no significant interaction with time
.