6.2 Comparing the Weibull and Lognormal fits

From Figures 6.2 and 6.3 it is clear that at least one of the fits must be poor: the Weibull cumulative hazards curve is convex while the lognormal one is concave, and the two estimated hazard functions are very different in shape. There are two direct ways of comparing the fits.

The first is to compare the maximized log likelihoods: -3140.04 for the Weibull fit and -2584.13 for the lognormal fit. The rule is that the largest value wins, so the lognormal model is the clear winner. Note that we do not claim to perform a formal test here; a likelihood ratio test would require the two models to be nested, which is not the case. Since the two models have the same number of parameters to estimate, this comparison is equivalent to using the AIC.
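Spelled out in a minimal sketch (using the log likelihood values reported above; the parameter count k = 7 is illustrative, assuming five regression coefficients plus two distributional parameters):

```r
## With equal parameter counts, the AIC ordering reduces to the
## log likelihood ordering.
ll.w  <- -3140.04            # Weibull, maximized log likelihood
ll.ln <- -2584.13            # lognormal, maximized log likelihood
k <- 7                       # illustrative parameter count
aic.w  <- 2 * k - 2 * ll.w   # 6294.08
aic.ln <- 2 * k - 2 * ll.ln  # 5182.26
aic.ln < aic.w               # TRUE: the lognormal model is preferred
```

Since k is the same in both AIC expressions, it cancels in the comparison, which is why the log likelihoods can be compared directly here.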

The second method of comparison is graphical: we plot the parametric cumulative hazard functions against the nonparametric estimate from a Cox regression fit and judge which looks closer. We start by fitting a Cox model with the same covariates:

fit.cr <- coxreg(Y ~ age + year + ses, data = f12)

Then the function check.dist (from the eha package) is called. We start with the Weibull fit:

check.dist(fit.cr, fit.w)

FIGURE 6.4: Check of the Weibull model, birth intervals.

This fit (Figure 6.4) looks very poor. One reason is that the longest waiting time for an observed event is about 12 years, while some women simply do not have more than one child, resulting in a long waiting time that ends in censoring. The time span from around 12 to 25 years contains no events, so the nonparametric estimate of the hazard function is zero there. The Weibull distribution cannot cope with that.

One possibility is to fit a Weibull distribution to the first 12 years of duration only. Note below that after calling age.window, \(Y\) must be recreated! This is the danger of the “lazy” approach.

f12$enter <- rep(0, NROW(f12))
f12.cens <- age.window(f12, c(0, 12), 
                       surv = c("enter", "next.ivl", "event"))
f12.cens$Y <- Surv(f12.cens$enter, f12.cens$next.ivl, 
                   f12.cens$event)
fit.wc <- phreg(Y ~ age + year + ses, data = f12.cens)
fit.c <- coxreg(Y ~ age + year + ses, data = f12.cens)
fit.wc
## Call:
## phreg(formula = Y ~ age + year + ses, data = f12.cens)
## 
## Covariate          W.mean      Coef Exp(Coef)  se(Coef)    Wald p
## age                27.267    -0.050     0.951     0.006     0.000 
## year             1858.953     0.002     1.002     0.002     0.237 
## ses 
##           farmer    0.456     0         1           (reference)
##          unknown    0.182    -0.262     0.770     0.070     0.000 
##            upper    0.022    -0.204     0.816     0.173     0.239 
##            lower    0.340    -0.077     0.926     0.056     0.171 
## 
## log(scale)                    3.050               2.428     0.209 
## log(shape)                    0.454               0.016     0.000 
## 
## Events                    1656 
## Total time at risk          4311 
## Max. log. likelihood      -2927.7 
## LR test statistic         106.81 
## Degrees of freedom        5 
## Overall p-value           0

Compare this Weibull fit with the one we had without cutting off durations at 12. The comparison with the nonparametric cumulative hazards function is shown in Figure 6.5.

check.dist(fit.c, fit.wc)

FIGURE 6.5: Check of the Weibull model, birth intervals censored at 12.

This was not much better. Let us see what happens with the lognormal distribution.

We continue using the data censored at 12 and refit the lognormal model.

fit.lnc <- phreg(Y ~ age + year + ses, data = f12.cens, 
                 dist = "lognormal")

Then, in Figure 6.6 we check the fit against the nonparametric fit again.

check.dist(fit.c, fit.lnc)

FIGURE 6.6: Check of the lognormal model, birth intervals censored at 12.

It is not much better. In fact, this empirical distribution has features that the ordinary parametric distributions cannot cope with: its hazard function starts off at zero for almost one year (birth intervals are rarely shorter than one year), then grows rapidly, and soon decreases towards zero again.

There is, however, one parametric distribution that is flexible enough: the piecewise constant hazards distribution.

6.2.1 The piecewise constant hazards (pch) model

The pch distribution is flexible because you can add as many parameters as you want. We will try it on the birth interval data. We start off with just four intervals, and the non-censored data set.
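In symbols, with cut points \(0 = c_0 < c_1 < \cdots < c_k\) and \(c_{k+1} = \infty\) (here \(c_1, c_2, c_3 = 4, 8, 12\), giving \(k + 1 = 4\) intervals), the baseline hazard is

\[
h_0(t) = h_j, \qquad c_{j-1} < t \le c_j, \quad j = 1, \ldots, k + 1,
\]

so each interval contributes one baseline parameter \(h_j\), and the model can be made as flexible as the data allow by adding cut points.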

fit.pch <- phreg(Surv(next.ivl, event) ~ age + year + ses, 
                 data = f12, dist = "pch", cuts = c(4, 8, 12))
fit.c <- coxreg(Surv(next.ivl, event) ~ age + year + ses, 
                data = f12)
fit.pch
## Call:
## phreg(formula = Surv(next.ivl, event) ~ age + year + ses, data = f12, 
##     dist = "pch", cuts = c(4, 8, 12))
## 
## Covariate          W.mean      Coef Exp(Coef)  se(Coef)    Wald p
## age                27.151    -0.031     0.969     0.006     0.000 
## year             1858.664     0.001     1.001     0.002     0.622 
## ses 
##           farmer    0.449     0         1           (reference)
##          unknown    0.190    -0.114     0.892     0.070     0.102 
##            upper    0.024    -0.033     0.968     0.173     0.849 
##            lower    0.336    -0.051     0.950     0.056     0.361 
## 
## 
## Events                    1657 
## Total time at risk        4500.5 
## Max. log. likelihood      -3170.2 
## LR test statistic         39.21 
## Degrees of freedom        5 
## Overall p-value           2.15744e-07

Then we check the fit in Figure 6.7.

check.dist(fit.c, fit.pch)

FIGURE 6.7: Check of the pch model, uncensored birth intervals.

This is not good enough. We need shorter intervals close to zero, so

fit.pch <- phreg(Surv(next.ivl, event) ~ age + year + ses, 
                 data = f12, dist = "pch", cuts = 1:13)
fit.pch
## Call:
## phreg(formula = Surv(next.ivl, event) ~ age + year + ses, data = f12, 
##     dist = "pch", cuts = 1:13)
## 
## Covariate          W.mean      Coef Exp(Coef)  se(Coef)    Wald p
## age                27.151    -0.040     0.961     0.005     0.000 
## year             1858.664     0.002     1.002     0.002     0.372 
## ses 
##           farmer    0.449     0         1           (reference)
##          unknown    0.190    -0.113     0.893     0.070     0.106 
##            upper    0.024    -0.006     0.994     0.173     0.973 
##            lower    0.336    -0.070     0.932     0.056     0.212 
## 
## 
## Events                    1657 
## Total time at risk        4500.5 
## Max. log. likelihood      -2306.3 
## LR test statistic         67.87 
## Degrees of freedom        5 
## Overall p-value           2.83995e-13

This gives us 14 intervals with constant baseline hazard, that is, 14 parameters to estimate for the baseline hazard function alone! But the fit is good, see Figure 6.8. In fact, it is almost perfect!

check.dist(fit.c, fit.pch)

FIGURE 6.8: Check of the pch model with thirteen constant hazard intervals, uncensored birth intervals.

One of the advantages with parametric models is that it is easy to study and plot the hazard function, and not only the cumulative hazards function, which is the dominant tool for (graphical) analysis in the case of the nonparametric model of Cox regression. For the piecewise constant model just fitted we get the estimated hazard function in Figure 6.9.

plot(fit.pch, fn = "haz")

FIGURE 6.9: The estimated hazard function for the length of birth intervals, piecewise constant hazards distribution.

The peak is reached during the third year of waiting, and after that the intensity of giving birth drops fast, and it becomes zero after 12 years.

What are the implications of the choice of parametric model for the estimates of the regression parameters? Let us compare the regression parameter estimates under the three distributional assumptions, Weibull, lognormal, and pch, together with the Cox fit. See Table 6.1.

TABLE 6.1: Parameter estimates under different distributional assumptions for the birth intervals data.

             Cox    Pch  Weibull  Lognormal
age        0.961  0.961    0.961      0.003
year       1.002  1.002    1.005      0.955
sesunknown 0.897  0.893    0.698      1.003
sesupper   1.018  0.994    0.683      0.764
seslower   0.930  0.932    0.908      0.792

We can see that the differences are not large. However, the two fits closest to the “truth”, the Cox model and the pch model, are very close to each other, while the other two deviate more. So an appropriate choice of parametric model does seem to matter, contrary to the “common knowledge” that a good fit of the baseline hazard is unimportant when interest lies only in the regression parameters.

6.2.2 Testing the proportionality assumption with the Pch model

One advantage of the piecewise constant hazards model is that it makes it easy to test the assumption of proportional hazards. This cannot be done directly with the phreg function, but there is a simple alternative. Before showing how, we first demonstrate how to utilize Poisson regression when analyzing the pch model, by reanalyzing the birth intervals data.

First, the data set is split at the cut points into one record per interval. This is done with the aid of the survSplit function in the survival package.

f12 <- fert[fert$parity == 1, ]
f12$enter <- 0 # 0 expands to a vector
f12.split <- survSplit(f12, cut = 1:13, start = "enter", 
                       end = "next.ivl", event = "event", 
                       episode = "ivl")
head(f12.split)
##   id parity age year prev.ivl    ses parish enter next.ivl event ivl
## 1  1      1  25 1826    0.411 farmer    SKL     0        1     0   1
## 2  1      1  25 1826    0.411 farmer    SKL     1        2     0   2
## 3  1      1  25 1826    0.411 farmer    SKL     2        3     0   3
## 4  1      1  25 1826    0.411 farmer    SKL     3        4     0   4
## 5  1      1  25 1826    0.411 farmer    SKL     4        5     0   5
## 6  1      1  25 1826    0.411 farmer    SKL     5        6     0   6

To see better what happened, we sort the new data frame by id and enter.

f12.split <- f12.split[order(f12.split$id, f12.split$enter), ]
head(f12.split)
##   id parity age year prev.ivl    ses parish enter next.ivl event ivl
## 1  1      1  25 1826    0.411 farmer    SKL     0        1     0   1
## 2  1      1  25 1826    0.411 farmer    SKL     1        2     0   2
## 3  1      1  25 1826    0.411 farmer    SKL     2        3     0   3
## 4  1      1  25 1826    0.411 farmer    SKL     3        4     0   4
## 5  1      1  25 1826    0.411 farmer    SKL     4        5     0   5
## 6  1      1  25 1826    0.411 farmer    SKL     5        6     0   6

We see that a new variable, ivl, has been created. It tells us which interval each record belongs to. You may notice that ivl happens to equal enter + 1 here; this is because our intervals start at 0, 1, 2, …, 13, as given by cuts above.

In the Poisson regression to follow, we model the probability that \(event = 1\). This probability depends on the length of the studied interval, here \(next.ivl - enter\), which for our data mostly equals one. What we need is the logarithm of this difference, entered as an offset in the model.
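The equivalence behind this trick can be written out. A record with exposure \(\Delta = next.ivl - enter\) in interval \(j\), constant hazard \(\lambda_j\), and event indicator \(d \in \{0, 1\}\) contributes the likelihood factor

\[
\lambda_j^{d} \, e^{-\lambda_j \Delta},
\]

which is proportional, as a function of \(\lambda_j\), to the Poisson probability of the “count” \(d\) with mean \(\mu = \lambda_j \Delta\):

\[
\frac{\mu^{d} e^{-\mu}}{d!} = \frac{(\lambda_j \Delta)^{d} \, e^{-\lambda_j \Delta}}{d!}.
\]

On the log scale, \(\log \mu = \log \lambda_j + \log \Delta\), which is why \(\log \Delta\) enters as an offset.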

Finally, one parameter per interval is needed, corresponding to the piecewise constant hazards. This is accomplished by including ivl as a factor in the model. Thus we get

f12.split$offs <- log(f12.split$next.ivl - f12.split$enter)
f12.split$ivl <- as.factor(f12.split$ivl)
fit12.pn <- glm(event ~ offset(offs) + 
                age + year + ses + ivl, 
                family = "poisson", data = f12.split)
drop1(fit12.pn, test = "Chisq")
## Single term deletions
## 
## Model:
## event ~ offset(offs) + age + year + ses + ivl
##        Df Deviance    AIC     LRT  Pr(>Chi)    
## <none>      4699.7 8051.7                      
## age     1   4756.2 8106.2   56.48 5.682e-14 ***
## year    1   4700.5 8050.5    0.80    0.3723    
## ses     3   4703.0 8049.0    3.29    0.3484    
## ivl    13   6663.7 9989.7 1964.00 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The piecewise cutting (ivl) is highly statistically significant, but some caution with the interpretation is recommended: the cut generated 13 parameters, and there may be too few events in some intervals. This is easily checked with the function tapply, which is very handy for doing calculations on subsets of a data frame. In this case we want to sum the number of events in each interval:

tapply(f12.split$event, f12.split$ivl, sum)
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14 
##  56 819 585 130  32  16   7   1   3   3   4   0   1   0

A reasonable reaction to this is to restrict attention to the first 11 years; only one birth occurs after a longer waiting time. It is then also reasonable to collapse the intervals in \((7, 11]\) into one. Thus

fc <- age.window(f12.split, c(0, 11), 
                 surv = c("enter", "next.ivl", "event"))
levels(fc$ivl) <- c(0:6, rep("7-11", 7))
tapply(fc$event, fc$ivl, sum)
##    0    1    2    3    4    5    6 7-11 
##   56  819  585  130   32   16    7   11

Note two things here: first the use of the function age.window, and second the levels function. The first makes an “age cut” in the Lexis diagram, i.e., all spells are censored at (exact) age 11. The second is more intricate; factors can be tricky to handle. The problem is that after age.window, the new data frame contains only records from the first eleven intervals, but ivl is still defined as a factor with 14 levels. The call to levels collapses the last seven into “7-11”.

Now we rerun the analysis with the new data frame.

fit <- glm(event ~  offset(offs) + age + year + ses + ivl, 
           family = "poisson", data = fc)
drop1(fit, test = "Chisq")
## Single term deletions
## 
## Model:
## event ~ offset(offs) + age + year + ses + ivl
##        Df Deviance    AIC     LRT  Pr(>Chi)    
## <none>      4696.2 8034.2                      
## age     1   4752.6 8088.6   56.48 5.679e-14 ***
## year    1   4697.0 8033.0    0.82    0.3650    
## ses     3   4699.5 8031.5    3.32    0.3453    
## ivl     7   6492.9 9816.9 1796.78 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This change in categorization does not change the general conclusions about statistical significance at all.

Now let us include interactions with ivl in the model.

fit.ia <- glm(event ~  offset(offs) + (age + year + ses) * ivl, 
              family = "poisson", data = fc)
drop1(fit.ia, test = "Chisq")
## Single term deletions
## 
## Model:
## event ~ offset(offs) + (age + year + ses) * ivl
##          Df Deviance    AIC     LRT Pr(>Chi)  
## <none>        4646.4 8054.4                   
## age:ivl   7   4661.7 8055.7 15.3110  0.03221 *
## year:ivl  7   4651.0 8045.0  4.5477  0.71497  
## ses:ivl  21   4676.2 8042.2 29.7964  0.09616 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There is an apparent interaction with age, which is not very surprising; younger mothers have a longer fertility period ahead of them than older mothers. The other interactions do not seem very important, so we remove them and rerun the model.

fit.ia <- glm(event ~  offset(offs) + year + ses + age * ivl, 
              family = "poisson", data = fc)
drop1(fit.ia, test = "Chisq")
## Single term deletions
## 
## Model:
## event ~ offset(offs) + year + ses + age * ivl
##         Df Deviance    AIC     LRT Pr(>Chi)  
## <none>       4681.1 8033.1                   
## year     1   4682.1 8032.1  0.9707  0.32451  
## ses      3   4684.1 8030.1  3.0028  0.39119  
## age:ivl  7   4696.2 8034.2 15.0437  0.03544 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Let us look at the parameter estimates.

summary(fit.ia)
## 
## Call:
## glm(formula = event ~ offset(offs) + year + ses + age * ivl, 
##     family = "poisson", data = fc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9217  -0.8446  -0.2513   0.5580   3.6734  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -6.720e+00  3.986e+00  -1.686 0.091839 .  
## year         2.109e-03  2.139e-03   0.986 0.324283    
## sesunknown  -1.051e-01  6.976e-02  -1.506 0.132006    
## sesupper    -2.611e-02  1.732e-01  -0.151 0.880140    
## seslower    -7.134e-02  5.609e-02  -1.272 0.203388    
## age         -2.363e-02  2.833e-02  -0.834 0.404334    
## ivl1         2.970e+00  7.777e-01   3.819 0.000134 ***
## ivl2         4.231e+00  7.834e-01   5.401 6.62e-08 ***
## ivl3         4.544e+00  8.657e-01   5.248 1.53e-07 ***
## ivl4         3.174e+00  1.090e+00   2.912 0.003592 ** 
## ivl5         3.505e+00  1.382e+00   2.537 0.011193 *  
## ivl6         5.074e+00  2.374e+00   2.137 0.032604 *  
## ivl7-11      2.028e+00  1.877e+00   1.080 0.279928    
## age:ivl1    -3.914e-07  2.911e-02   0.000 0.999989    
## age:ivl2    -2.267e-02  2.933e-02  -0.773 0.439413    
## age:ivl3    -5.040e-02  3.227e-02  -1.562 0.118364    
## age:ivl4    -3.171e-02  3.947e-02  -0.804 0.421665    
## age:ivl5    -5.681e-02  5.091e-02  -1.116 0.264416    
## age:ivl6    -1.406e-01  9.315e-02  -1.509 0.131199    
## age:ivl7-11 -4.787e-02  6.993e-02  -0.685 0.493601    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 6545.4  on 5201  degrees of freedom
## Residual deviance: 4681.1  on 5182  degrees of freedom
## AIC: 8033.1
## 
## Number of Fisher Scoring iterations: 6

We can see (admittedly not clearly) a tendency toward decreasing intensity in the later intervals with higher age.

This can also be done with phreg, using the Weibull model with the shape fixed at one (i.e., an exponential model).

fit.exp <- phreg(Surv(enter, next.ivl, event) ~  year + ses + 
                 age * ivl, dist = "weibull", shape = 1, 
                 data = fc)
drop1(fit.exp, test = "Chisq")
## Single term deletions
## 
## Model:
## Surv(enter, next.ivl, event) ~ year + ses + age * ivl
##         Df    AIC     LRT Pr(>Chi)  
## <none>     4630.1                   
## year     1 4629.0  0.9707  0.32451  
## ses      3 4627.1  3.0028  0.39119  
## age:ivl  7 4631.1 15.0437  0.03544 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The parameter estimates are now presented like this.

fit.exp
## Call:
## phreg(formula = Surv(enter, next.ivl, event) ~ year + ses + age * 
##     ivl, data = fc, dist = "weibull", shape = 1)
## 
## Covariate          W.mean      Coef Exp(Coef)  se(Coef)    Wald p
## year             1858.968     0.002     1.002     0.002     0.324 
## ses 
##           farmer    0.456     0         1           (reference)
##          unknown    0.181    -0.105     0.900     0.070     0.132 
##            upper    0.022    -0.026     0.974     0.173     0.880 
##            lower    0.340    -0.071     0.931     0.056     0.203 
## age                27.267    -0.024     0.977     0.028     0.404 
## ivl 
##                0    0.420     0         1           (reference)
##                1    0.316     2.970    19.494     0.778     0.000 
##                2    0.119     4.231    68.801     0.783     0.000 
##                3    0.044     4.544    94.030     0.866     0.000 
##                4    0.027     3.174    23.909     1.090     0.004 
##                5    0.019     3.505    33.288     1.382     0.011 
##                6    0.015     5.074   159.762     2.374     0.033 
##             7-11    0.041     2.028     7.602     1.877     0.280 
## age:ivl          
##    :1                        -0.000     1.000     0.029     1.000 
##    :2                        -0.023     0.978     0.029     0.439 
##    :3                        -0.050     0.951     0.032     0.118 
##    :4                        -0.032     0.969     0.039     0.422 
##    :5                        -0.057     0.945     0.051     0.264 
##    :6                        -0.141     0.869     0.093     0.131 
##    :7-11                     -0.048     0.953     0.070     0.494 
## 
## log(scale)                    6.720               3.986     0.092 
## 
##  Shape is fixed at  1 
## 
## Events                    1656 
## Total time at risk        4279.3 
## Max. log. likelihood       -2296 
## LR test statistic         1864.32 
## Degrees of freedom        19 
## Overall p-value           0

A lot of information, but the same content as in the output from the Poisson regression. However, here we also get the exponentiated parameter estimates, in the column Exp(Coef). These numbers are risk ratios (relative risks) and are easier to interpret.

What should be done about the interactions? The simplest way to get an easy-to-interpret result is to categorize age and make separate analyses, one for each age category. Let us do that. First we must decide how to categorize: there should not be too many categories, and not too few mothers (or births) in any category. As a starting point, take three categories by cutting age at (exact) ages 25 and 30:

fc$agegrp <- cut(fc$age, c(0, 25, 30, 100), 
                 labels = c("< 25", "25-30", ">= 30"))
table(fc$agegrp)
## 
##  < 25 25-30 >= 30 
##  2352  1658  1192

Note the use of the function cut. It “cuts” a continuous variable into pieces defined by the second argument. We must give lower and upper bounds; I chose 0 and 100, respectively. They are of course nowhere near realistic ages, but I am certain that no value falls outside the interval \((0, 100)\), and that is what matters here. Optionally, each category can be named with the argument labels.

The tabulation shows that the oldest category contains relatively few mothers. Adding to that, the oldest category will probably also have fewer births. We can check that with the aid of the function tapply.

tapply(fc$event, fc$agegrp, sum)
##  < 25 25-30 >= 30 
##   796   567   293

Now, rerun the last analysis for each agegrp separately. For illustration only, we show how to do it for the first age group:

fit.ses1 <- phreg(Surv(enter, next.ivl, event) ~ year + ses + ivl,
                  dist = "weibull", shape = 1,
                  data = fc[fc$agegrp == "< 25", ])
fit.ses1
## Call:
## phreg(formula = Surv(enter, next.ivl, event) ~ year + ses + ivl, 
##     data = fc[fc$agegrp == "< 25", ], dist = "weibull", shape = 1)
## 
## Covariate          W.mean      Coef Exp(Coef)  se(Coef)    Wald p
## year             1854.228     0.001     1.001     0.003     0.838 
## ses 
##           farmer    0.461     0         1           (reference)
##          unknown    0.256    -0.017     0.983     0.088     0.848 
##            upper    0.019    -0.269     0.764     0.308     0.382 
##            lower    0.264    -0.101     0.904     0.087     0.244 
## ivl 
##                0    0.433     0         1           (reference)
##                1    0.326     2.895    18.076     0.196     0.000 
##                2    0.117     3.644    38.248     0.198     0.000 
##                3    0.037     3.296    27.016     0.227     0.000 
##                4    0.021     2.363    10.618     0.328     0.000 
##                5    0.016     2.082     8.024     0.401     0.000 
##                6    0.012     1.850     6.359     0.486     0.000 
##             7-11    0.039     0.882     2.415     0.450     0.050 
## 
## log(scale)                    4.467               5.435     0.411 
## 
##  Shape is fixed at  1 
## 
## Events                    796 
## Total time at risk        1921.8 
## Max. log. likelihood      -1060.7 
## LR test statistic         873.77 
## Degrees of freedom        11 
## Overall p-value           0

You may try to repeat this for the other age groups and check whether the regression parameters for year and ses differ much. Hint: they do not.