6.3 Choosing the best parametric model

For modeling survival data with parametric proportional hazards models, the distributions of the function phreg in the package eha are available. How to select a suitable parametric model is shown by a couple of examples.

6.3.1 Old age mortality

The data set oldmort in the R package eha contains life histories for people aged 60 and above in the years 1860–1880, i.e., 21 years. The data come from the Demographic Data Base at Umeå University, Umeå, Sweden, and cover the sawmill district of Sundsvall in the middle of Sweden. This was one of the largest sawmill districts in Europe in the late 19th century. The town Sundsvall is located in the district, which also contains a rural area, where farming was the main occupation.

require(eha)
data(oldmort)
summary(oldmort)
##        id                enter            exit          event        
##  Min.   :765000603   Min.   :60.00   Min.   : 60.00   Mode :logical  
##  1st Qu.:797001170   1st Qu.:60.00   1st Qu.: 63.88   FALSE:4524     
##  Median :804001545   Median :60.07   Median : 68.51   TRUE :1971     
##  Mean   :803652514   Mean   :64.07   Mean   : 69.89                  
##  3rd Qu.:812001564   3rd Qu.:66.88   3rd Qu.: 74.73                  
##  Max.   :826002672   Max.   :94.51   Max.   :100.00                  
##                                                                      
##    birthdate         m.id                f.id               sex      
##  Min.   :1765   Min.   :     6039   Min.   :     2458   male  :2884  
##  1st Qu.:1797   1st Qu.:766000610   1st Qu.:763000610   female:3611  
##  Median :1805   Median :775000742   Median :772000649                
##  Mean   :1804   Mean   :771271398   Mean   :762726961                
##  3rd Qu.:1812   3rd Qu.:783000743   3rd Qu.:780001077                
##  Max.   :1820   Max.   :802000669   Max.   :797001468                
##                 NA's   :3155        NA's   :3310                     
##         civ           ses.50      birthplace     imr.birth           region    
##  unmarried: 557   middle : 233   parish:3598   Min.   : 4.348   town    : 657  
##  married  :3638   unknown:2565   region:1503   1st Qu.:12.709   industry:2214  
##  widow    :2300   upper  :  55   remote:1394   Median :14.234   rural   :3624  
##                   farmer :1562                 Mean   :15.209                  
##                   lower  :2080                 3rd Qu.:17.718                  
##                                                Max.   :31.967                  
## 

A short explanation of the variables: id is an identification number. It connects records (follow-up intervals) that belong to one person. m.id, f.id are mother’s and father’s identification numbers, respectively. enter, exit and event are the components of the survival object, i.e., start of interval, end of interval, and an indicator of whether the interval ends with a death, respectively.

Note that by design, individuals are followed from the day they are aged 60. In order to calculate the follow-up times, we should subtract 60 from the two columns enter and exit. Otherwise, when specifying a parametric survivor distribution, it would in fact correspond to a left-truncated (at 60) distribution. However, for a Cox regression, this makes no difference.

The covariate (factor) ses.50 is the socio-economic status at (approximately) age 50. The largest category is unknown. We have chosen to include it as a level of its own, rather than treating it as missing. One reason is that we cannot assume missing at random here. On the contrary, a missing value strongly indicates that the person belongs to a lower class, or was born early.

om <- oldmort
om$Y <- Surv(om$enter - 60, om$exit - 60, om$event)    
fit.w <- phreg(Y ~ sex + civ + birthplace, data = om)
fit.w
## Call:
## phreg(formula = Y ~ sex + civ + birthplace, data = om)
## 
## Covariate          W.mean      Coef Exp(Coef)  se(Coef)    Wald p
## sex 
##             male    0.406     0         1           (reference)
##           female    0.594    -0.248     0.780     0.048     0.000 
## civ 
##        unmarried    0.080     0         1           (reference)
##          married    0.530    -0.456     0.634     0.081     0.000 
##            widow    0.390    -0.209     0.811     0.079     0.008 
## birthplace 
##           parish    0.574     0         1           (reference)
##           region    0.226     0.055     1.056     0.055     0.321 
##           remote    0.200     0.062     1.064     0.059     0.299 
## 
## log(scale)                    2.639               0.048     0.000 
## log(shape)                    0.526               0.019     0.000 
## 
## Events                    1971 
## Total time at risk         37824 
## Max. log. likelihood       -7410 
## LR test statistic         55.10 
## Degrees of freedom        5 
## Overall p-value           1.24244e-10

Note how we introduced the new variable \(Y\); the sole purpose of this is to get more compact formulas in the modeling.

Here we applied a Weibull baseline distribution (the default distribution in phreg; by specifying nothing, the Weibull is chosen). Now let us repeat this with all the distributions in the phreg package.

ln <- phreg(Y ~ sex + civ + birthplace, data = om, 
            dist = "lognormal")
ll <- phreg(Y ~ sex + civ + birthplace, data = om, 
            dist = "loglogistic")
g <- phreg(Y ~ sex + civ + birthplace, data = om, 
           dist = "gompertz")
ev <- phreg(Y ~ sex + civ + birthplace, data = om, 
            dist = "ev")

Then we compare the maximized log-likelihoods and choose the distribution with the largest value.

xx <- c(fit.w$loglik[2], ln$loglik[2], ll$loglik[2], 
        g$loglik[2], ev$loglik[2])
names(xx) <- c("w", "ln", "ll", "g", "ev")
xx
##         w        ln        ll         g        ev 
## -7409.954 -7419.544 -7409.954 -7273.701 -7310.142

The Gompertz (g) distribution gives the largest value of the maximized log likelihood. Let us graphically inspect the fit, see Figure 6.10.

fit.c <- coxreg(Y ~ sex + civ + birthplace, data = om)
check.dist(fit.c, g)
Graphical fit: Gompertz baseline versus the nonparametric (Cox regression, dashed).

FIGURE 6.10: Graphical fit: Gompertz baseline versus the nonparametric (Cox regression, dashed).

The fit is obviously great during the first 30 years (ages 60–90), but fails somewhat thereafter. One explanation to “the failure” is that after age 90, not so many persons are still at risk (alive); another is that maybe the mortality levels off at very high ages (on a very high level, of course).

The Gompertz hazard function is shown in Figure 6.11.

plot(g, fn = "haz")
Estimated Gompertz hazard function for old age mortality data.

FIGURE 6.11: Estimated Gompertz hazard function for old age mortality data.

It is an exponentially increasing function of time; the Gompertz model usually fits old age mortality very well.

The \(p\)-values measure the significance of a group compared to the reference category, but we would also like to have an overall \(p\)-value for the covariates (factors) as such, i.e., answer the question “does birthplace matter?”. This can be achieved by running an analysis without birthplace:

fit.w1 <- phreg(Y ~ sex + civ, data = om) 
fit.w1
## Call:
## phreg(formula = Y ~ sex + civ, data = om)
## 
## Covariate          W.mean      Coef Exp(Coef)  se(Coef)    Wald p
## sex 
##             male    0.406     0         1           (reference)
##           female    0.594    -0.249     0.780     0.047     0.000 
## civ 
##        unmarried    0.080     0         1           (reference)
##          married    0.530    -0.455     0.634     0.081     0.000 
##            widow    0.390    -0.207     0.813     0.079     0.008 
## 
## log(scale)                    2.625               0.047     0.000 
## log(shape)                    0.524               0.019     0.000 
## 
## Events                    1971 
## Total time at risk         37824 
## Max. log. likelihood      -7410.8 
## LR test statistic         53.49 
## Degrees of freedom        3 
## Overall p-value           1.44385e-11

Then we compare the max log likelihoods: -7409.954 and -7410.763, respectively. The test statistic is 2 times the difference, 1.618. Under the null hypothesis of no birthplace effect on mortality, this test statistic has an approximate \(\chi^2\) distribution with 2 degrees of freedom. The degrees of freedom is the number of omitted parameters in the reduced model, two in this case. This because the factor birthplace has three levels.

There is a much simpler way of doing this, and that is to use the function drop1. As its name may suggest, it drops one variable at a time and reruns the fit, calculating the max log likelihoods and differences as above.

drop1(fit.w, test = "Chisq")
## Single term deletions
## 
## Model:
## Y ~ sex + civ + birthplace
##            Df   AIC    LRT  Pr(>Chi)    
## <none>        14830                     
## sex         1 14855 27.048 1.985e-07 ***
## civ         2 14866 39.989 2.073e-09 ***
## birthplace  2 14828  1.618    0.4453    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As you can see, we do not only recover the test statistic for birthplace, we get the corresponding tests for all the involved covariates. Thus, it is obvious that both sex and civ have a statistically very significant effect on old age mortality, while birthplace does not mean much. Note that these effects are measured in the presence of the other two variables.

We can plot the baseline distribution by

plot(fit.w)

with the result shown in Figure 6.12.

Baseline distribution of remaining life at 60 for an average  person. Weibull model.

FIGURE 6.12: Baseline distribution of remaining life at 60 for an average person. Weibull model.

We can see one advantage with parametric models here: They make it possible to estimate the hazard and density functions. This is much trickier with a semi-parametric model like the Cox proportional hazards model. Another question is, of course, how well the Weibull model fits the data. One way to graphically check this is to fit a Cox regression model and compare the two cumulative hazards plots. This is done by using the function check.dist:

The result is shown in Figure 6.13.

Baseline distribution of remaining life at 60 for an average person. Weibull model.

FIGURE 6.13: Baseline distribution of remaining life at 60 for an average person. Weibull model.

Obviously, the fit is not good; the Weibull model cannot capture the fast rise of the hazard by age. An exponentially increasing hazard function may be needed, so let us try the Gompertz distribution:

fit.g <- phreg(Y ~ sex + civ + birthplace, data = om, 
               dist = "gompertz")
fit.g
## Call:
## phreg(formula = Y ~ sex + civ + birthplace, data = om, dist = "gompertz")
## 
## Covariate          W.mean      Coef Exp(Coef)  se(Coef)    Wald p
## sex 
##             male    0.406     0         1           (reference)
##           female    0.594    -0.246     0.782     0.047     0.000 
## civ 
##        unmarried    0.080     0         1           (reference)
##          married    0.530    -0.405     0.667     0.081     0.000 
##            widow    0.390    -0.265     0.767     0.079     0.001 
## birthplace 
##           parish    0.574     0         1           (reference)
##           region    0.226     0.065     1.067     0.055     0.240 
##           remote    0.200     0.085     1.089     0.059     0.150 
## 
## log(scale)                    2.364               0.032     0.000 
## log(shape)                   -1.181               0.109     0.000 
## 
## Events                    1971 
## Total time at risk         37824 
## Max. log. likelihood      -7273.7 
## LR test statistic         45.51 
## Degrees of freedom        5 
## Overall p-value           1.14164e-08

One sign of a much better fit is the larger value of the maximized log likelihood, -7273.701 versus -7409.954 for the Weibull fit. The comparison to the Cox regression fit is given by

check.dist(fit, fit.g)

with the result shown in Figure~.

Check of a Gompertz fit. The solid line is the cumulative hazards function from the Gompertz fit, and the dashed line is the fit from a Cox regression.

FIGURE 6.14: Check of a Gompertz fit. The solid line is the cumulative hazards function from the Gompertz fit, and the dashed line is the fit from a Cox regression.

This is a much better fit. The deviation that starts above 30 (corresponding to the age 90) is no big issue; in that range few people are still alive and the random variation takes over. Generally, it seems as if the Gompertz distribution fits old age mortality well.

The plots of the baseline Gompertz distribution is shown in Figure 6.15.

Baseline distribution of remaining life at 60 for an average person. Gompertz model.

FIGURE 6.15: Baseline distribution of remaining life at 60 for an average person. Gompertz model.

Compare to the corresponding fit to the Weibull model, Figure 6.12.