6.3 Choosing the best parametric model
For modeling survival data with parametric proportional hazards models, the distributions of the function phreg in the package eha are available. How to select a suitable parametric model is shown by a couple of examples.
6.3.1 Old age mortality
The data set oldmort in the R package eha contains life histories for people aged 60 and above in the years 1860–1880, i.e., 21 years. The data come from the Demographic Data Base at Umeå University, Umeå, Sweden, and cover the sawmill district of Sundsvall in the middle of Sweden. This was one of the largest sawmill districts in Europe in the late 19th century. The town Sundsvall is located in the district, which also contains a rural area, where farming was the main occupation.
## id enter exit event
## Min. :765000603 Min. :60.00 Min. : 60.00 Mode :logical
## 1st Qu.:797001170 1st Qu.:60.00 1st Qu.: 63.88 FALSE:4524
## Median :804001545 Median :60.07 Median : 68.51 TRUE :1971
## Mean :803652514 Mean :64.07 Mean : 69.89
## 3rd Qu.:812001564 3rd Qu.:66.88 3rd Qu.: 74.73
## Max. :826002672 Max. :94.51 Max. :100.00
##
## birthdate m.id f.id sex
## Min. :1765 Min. : 6039 Min. : 2458 male :2884
## 1st Qu.:1797 1st Qu.:766000610 1st Qu.:763000610 female:3611
## Median :1805 Median :775000742 Median :772000649
## Mean :1804 Mean :771271398 Mean :762726961
## 3rd Qu.:1812 3rd Qu.:783000743 3rd Qu.:780001077
## Max. :1820 Max. :802000669 Max. :797001468
## NA's :3155 NA's :3310
## civ ses.50 birthplace imr.birth region
## unmarried: 557 middle : 233 parish:3598 Min. : 4.348 town : 657
## married :3638 unknown:2565 region:1503 1st Qu.:12.709 industry:2214
## widow :2300 upper : 55 remote:1394 Median :14.234 rural :3624
## farmer :1562 Mean :15.209
## lower :2080 3rd Qu.:17.718
## Max. :31.967
##
A short explanation of the variables: id is an identification number. It connects records (follow-up intervals) that belong to one person. m.id, f.id are mother’s and father’s identification numbers, respectively. enter, exit and event are the components of the survival object, i.e., start of interval, end of interval, and an indicator of whether the interval ends with a death, respectively.
Note that by design, individuals are followed from the day they are aged 60. In order to calculate the follow-up times, we should subtract 60 from the two columns enter and exit. Otherwise, when specifying a parametric survivor distribution, it would in fact correspond to a left-truncated (at 60) distribution. However, for a Cox regression, this makes no difference.
The covariate (factor) ses.50 is the socio-economic status at (approximately) age 50. The largest category is unknown. We have chosen to include it as a level of its own, rather than treating it as missing. One reason is that we cannot assume missing at random here. On the contrary, a missing value strongly indicates that the person belongs to a lower class, or was born early.
om <- oldmort
om$Y <- Surv(om$enter - 60, om$exit - 60, om$event)
fit.w <- phreg(Y ~ sex + civ + birthplace, data = om)
fit.w## Call:
## phreg(formula = Y ~ sex + civ + birthplace, data = om)
##
## Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
## sex
## male 0.406 0 1 (reference)
## female 0.594 -0.248 0.780 0.048 0.000
## civ
## unmarried 0.080 0 1 (reference)
## married 0.530 -0.456 0.634 0.081 0.000
## widow 0.390 -0.209 0.811 0.079 0.008
## birthplace
## parish 0.574 0 1 (reference)
## region 0.226 0.055 1.056 0.055 0.321
## remote 0.200 0.062 1.064 0.059 0.299
##
## log(scale) 2.639 0.048 0.000
## log(shape) 0.526 0.019 0.000
##
## Events 1971
## Total time at risk 37824
## Max. log. likelihood -7410
## LR test statistic 55.10
## Degrees of freedom 5
## Overall p-value 1.24244e-10
Note how we introduced the new variable \(Y\); the sole purpose of this is to get more compact formulas in the modeling.
Here we applied a Weibull baseline distribution (the default distribution in phreg; by specifying nothing, the Weibull is chosen). Now let us repeat this with all the distributions in the phreg package.
ln <- phreg(Y ~ sex + civ + birthplace, data = om,
dist = "lognormal")
ll <- phreg(Y ~ sex + civ + birthplace, data = om,
dist = "loglogistic")
g <- phreg(Y ~ sex + civ + birthplace, data = om,
dist = "gompertz")
ev <- phreg(Y ~ sex + civ + birthplace, data = om,
dist = "ev")Then we compare the maximized log-likelihoods and choose the distribution with the largest value.
xx <- c(fit.w$loglik[2], ln$loglik[2], ll$loglik[2],
g$loglik[2], ev$loglik[2])
names(xx) <- c("w", "ln", "ll", "g", "ev")
xx## w ln ll g ev
## -7409.954 -7419.544 -7409.954 -7273.701 -7310.142
The Gompertz (g) distribution gives the largest value of the maximized log likelihood. Let us graphically inspect the fit, see Figure 6.10.
FIGURE 6.10: Graphical fit: Gompertz baseline versus the nonparametric (Cox regression, dashed).
The fit is obviously great during the first 30 years (ages 60–90), but fails somewhat thereafter. One explanation to “the failure” is that after age 90, not so many persons are still at risk (alive); another is that maybe the mortality levels off at very high ages (on a very high level, of course).
The Gompertz hazard function is shown in Figure 6.11.
FIGURE 6.11: Estimated Gompertz hazard function for old age mortality data.
It is an exponentially increasing function of time; the Gompertz model usually fits old age mortality very well.
The \(p\)-values measure the significance of a group compared to the reference category, but we would also like to have an overall \(p\)-value for the covariates (factors) as such, i.e., answer the question “does birthplace matter?”. This can be achieved by running an analysis without birthplace:
## Call:
## phreg(formula = Y ~ sex + civ, data = om)
##
## Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
## sex
## male 0.406 0 1 (reference)
## female 0.594 -0.249 0.780 0.047 0.000
## civ
## unmarried 0.080 0 1 (reference)
## married 0.530 -0.455 0.634 0.081 0.000
## widow 0.390 -0.207 0.813 0.079 0.008
##
## log(scale) 2.625 0.047 0.000
## log(shape) 0.524 0.019 0.000
##
## Events 1971
## Total time at risk 37824
## Max. log. likelihood -7410.8
## LR test statistic 53.49
## Degrees of freedom 3
## Overall p-value 1.44385e-11
Then we compare the max log likelihoods: -7409.954 and -7410.763, respectively. The test statistic is 2 times the difference, 1.618. Under the null hypothesis of no birthplace effect on mortality, this test statistic has an approximate \(\chi^2\) distribution with 2 degrees of freedom. The degrees of freedom is the number of omitted parameters in the reduced model, two in this case. This because the factor birthplace has three levels.
There is a much simpler way of doing this, and that is to use the function
drop1. As its name may suggest, it drops one variable at a time and
reruns the fit, calculating the max log likelihoods and differences as above.
## Single term deletions
##
## Model:
## Y ~ sex + civ + birthplace
## Df AIC LRT Pr(>Chi)
## <none> 14830
## sex 1 14855 27.048 1.985e-07 ***
## civ 2 14866 39.989 2.073e-09 ***
## birthplace 2 14828 1.618 0.4453
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As you can see, we do not only recover the test statistic for birthplace, we get the corresponding tests for all the involved covariates. Thus, it is obvious that both sex and civ have a statistically very significant effect on old age mortality, while birthplace does not mean much. Note that these effects are measured in the presence of the other two variables.
We can plot the baseline distribution by
with the result shown in Figure 6.12.
FIGURE 6.12: Baseline distribution of remaining life at 60 for an average person. Weibull model.
We can see one advantage with parametric models here: They make it possible to estimate the hazard and density functions. This is much trickier with a semi-parametric model like the Cox proportional hazards model. Another question is, of course, how well the Weibull model fits the data. One way to graphically check this is to fit a Cox regression model and compare the two cumulative hazards plots. This is done by using the function check.dist:
The result is shown in Figure 6.13.
FIGURE 6.13: Baseline distribution of remaining life at 60 for an average person. Weibull model.
Obviously, the fit is not good; the Weibull model cannot capture the fast rise of the hazard by age. An exponentially increasing hazard function may be needed, so let us try the Gompertz distribution:
## Call:
## phreg(formula = Y ~ sex + civ + birthplace, data = om, dist = "gompertz")
##
## Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
## sex
## male 0.406 0 1 (reference)
## female 0.594 -0.246 0.782 0.047 0.000
## civ
## unmarried 0.080 0 1 (reference)
## married 0.530 -0.405 0.667 0.081 0.000
## widow 0.390 -0.265 0.767 0.079 0.001
## birthplace
## parish 0.574 0 1 (reference)
## region 0.226 0.065 1.067 0.055 0.240
## remote 0.200 0.085 1.089 0.059 0.150
##
## log(scale) 2.364 0.032 0.000
## log(shape) -1.181 0.109 0.000
##
## Events 1971
## Total time at risk 37824
## Max. log. likelihood -7273.7
## LR test statistic 45.51
## Degrees of freedom 5
## Overall p-value 1.14164e-08
One sign of a much better fit is the larger value of the maximized log likelihood, -7273.701 versus -7409.954 for the Weibull fit. The comparison to the Cox regression fit is given by
with the result shown in Figure~.
FIGURE 6.14: Check of a Gompertz fit. The solid line is the cumulative hazards function from the Gompertz fit, and the dashed line is the fit from a Cox regression.
This is a much better fit. The deviation that starts above 30 (corresponding to the age 90) is no big issue; in that range few people are still alive and the random variation takes over. Generally, it seems as if the Gompertz distribution fits old age mortality well.
The plots of the baseline Gompertz distribution is shown in Figure 6.15.
FIGURE 6.15: Baseline distribution of remaining life at 60 for an average person. Gompertz model.
Compare to the corresponding fit to the Weibull model, Figure 6.12.