Chapter 5 Poisson Regression

In this chapter Poisson regression is introduced. We start by defining the Poisson distribution, then discuss the connection to Cox regression, and finally the use of Poisson regression with tabular lifetime data.

5.1 The Poisson Distribution

The Poisson distribution is used for count data, i.e., when the outcome may be any of the non-negative integers 0, 1, 2, …, without an upper limit. The probability density function (pdf) \(P\) of a random variable \(X\) following a Poisson distribution is

\[\begin{equation}\label{eq:poisdist} P(X = k) = \frac{\lambda^k}{k!}\exp(-\lambda), \quad \lambda > 0; \; k = 0, 1, 2, \ldots, \end{equation}\]

The parameter \(\lambda\) is both the mean and the variance of the distribution. In Figure 5.1 the pdf is plotted for some values of \(\lambda\).

FIGURE 5.1: The Poisson pdf for different values of the mean.

Note that when \(\lambda\) increases, the distribution looks more and more like a normal distribution.
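
As a small numerical illustration (not taken from the text), the Poisson probabilities for a fairly large mean can be compared with the corresponding normal density; here \(\lambda = 25\), so the matching normal distribution has mean 25 and standard deviation \(\sqrt{25} = 5\).

k <- 20:30
round(cbind(k = k,
            poisson = dpois(k, lambda = 25),
            normal = dnorm(k, mean = 25, sd = 5)), 4)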

In R, the Poisson distribution is represented by four functions, dpois, ppois, qpois, and rpois, representing the probability density function (pdf), the cumulative distribution function (cdf), the quantile function (the inverse of the cdf), and random number generation, respectively. See the help page for the Poisson distribution for more detail. In fact, this scheme is used for all probability distributions available in R.
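
For illustration (these particular calls are not from the text), with \(\lambda = 3\):

dpois(2, lambda = 3)     # pdf: P(X = 2)
ppois(2, lambda = 3)     # cdf: P(X <= 2)
qpois(0.95, lambda = 3)  # quantile: smallest k with P(X <= k) >= 0.95
rpois(5, lambda = 3)     # five random draws from Poisson(3)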

For example, the upper left bar plot in Figure 5.1 is produced in R by

barplot(dpois(0:5, lambda = 0.5), axes = FALSE, 
        main = expression(paste(lambda, " = ", 0.5)))

Note that the function dpois is vectorized:

round(dpois(0:5, lambda = 0.5), 5)
[1] 0.60653 0.30327 0.07582 0.01264 0.00158 0.00016
Example 5.1 (Marital fertility)

As an example where the Poisson distribution may be relevant, we look at the number of children born to a woman after marriage. The data frame fert in eha can be used to calculate the number of births per married woman in Skellefteå during the 19th century; however, this data set contains only marriages with one or more births. Let us instead count the number of births beyond one, collected in a vector called kids.
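
One possible way to construct kids is sketched here. It is only a sketch, resting on the assumption (not stated in the text) that each row of fert is one birth record for the woman identified by id, and that her largest parity value equals her total number of births; check the documentation of fert before relying on it.

library(eha)
## Assumption: max(parity) per woman ('id') equals her number of births.
births <- with(fert, tapply(parity, id, max))
kids <- births - 1       # births beyond the first one
barplot(table(kids))     # bar plot of the counts, cf. Figure 5.2

The result is shown in Figure 5.2.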

FIGURE 5.2: Number of children beyond one for married women with at least one child.

The question is: Does this look like a Poisson distribution? One way of checking is to plot the theoretical distribution with the same mean (the parameter \(\lambda\)) as the sample mean of the data.

lam <- mean(kids)   # sample mean, used as the Poisson parameter
barplot(dpois(0:12, lambda = lam))

The result is shown in Figure 5.3.

FIGURE 5.3: Theoretical Poisson distribution.

Obviously, the fertility data do not follow the Poisson distribution very well. They are in fact over-dispersed compared to the Poisson distribution. A simple way to check this is to calculate the sample mean and variance of the data: if the data come from a Poisson distribution, these numbers should be (theoretically) equal, and in practice reasonably close.

mean(kids)
[1] 4.548682
var(kids)
[1] 8.586838

They are not very close, which is also obvious from comparing the graphs. \(\ \Box\)

5.2 The Connection to Cox Regression

There is an interesting connection between Cox regression and Poisson regression, which can be illustrated as follows.

dat <- data.frame(enter = rep(0, 4), exit = 1:4, 
                  event = rep(1, 4), x = c(0, 1, 0, 1))
dat
  enter exit event x
1     0    1     1 0
2     0    2     1 1
3     0    3     1 0
4     0    4     1 1

We have generated a very simple data set, dat. It consists of four life lengths with a covariate \(x\). A Cox regression, where the covariate \(x\) is related to survival, is performed by

library(eha)
fit <- coxreg(Surv(enter, exit, event) ~ x, data = dat)
print(summary(fit), short = TRUE)
Covariate             Mean       Coef     Rel.Risk   S.E.    LR p
x                     0.600    -0.941     0.390     1.240    0.432 

The baseline hazard is given by the function hazards. By default it calculates the cumulative hazards, but here we need the non-cumulative ones, which is accomplished by specifying cum = FALSE:

haz <- hazards(fit, cum = FALSE)
haz
$`1`
     [,1]      [,2]
[1,]    1 0.3596118
[2,]    2 0.5615528
[3,]    3 0.7192236
[4,]    4 2.5615528

attr(,"class")
[1] "hazdata"

Note that haz is a list with one component per stratum. In this case there is only one stratum, so the list has one component. This component is a matrix with two columns; the first contains the observed distinct failure times, and the second the corresponding estimates of the hazard “atoms.”
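
As a small illustration (not part of the original exposition), the pieces of haz can be picked apart directly; for instance, the cumulative baseline hazard at time 3 is the sum of the atoms at the failure times up to and including 3.

times <- haz[[1]][, 1]   # distinct failure times
atoms <- haz[[1]][, 2]   # hazard atoms
sum(atoms[times <= 3])   # cumulative baseline hazard at time 3, about 1.64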

Now, this can be replicated by Poisson regression! First the data set is expanded around each observed failure time. That is, a snapshot of the risk set at each failure time is created by the function toBinary.

library(eha)
datB <- toBinary(dat)
datB
    event riskset risktime x orig.row
1       1       1        1 0        1
2       0       1        1 1        2
3       0       1        1 0        3
4       0       1        1 1        4
2.1     1       2        2 1        2
3.1     0       2        2 0        3
4.1     0       2        2 1        4
3.2     1       3        3 0        3
4.2     0       3        3 1        4
4.3     1       4        4 1        4

Note that three new “covariates” are created, riskset, risktime, and orig.row. In the Poisson regression to follow, the variable riskset must be included as a factor. The Poisson regression is performed with the glm function:

fit2 <- glm(event ~ riskset + x, family = poisson, 
            data = datB)
(co <- coefficients(fit2))
(Intercept)    riskset2    riskset3    riskset4           x 
 -1.0227302   0.4456807   0.6931472   1.9633438  -0.9406136 
co[2:4] <- co[2:4] + co[1]  # Add the intercept to the riskset effects.
haz.glm <- exp(co[1:4])     # The four hazard atoms.

The parameter estimate corresponding to \(x\) is exactly the same as in the Cox regression. The baseline hazard function is estimated with the aid of the (Intercept) and the riskset estimates, and below we compare it with the hazard estimates from coxreg.

xx <- cbind(haz[[1]], haz.glm)
colnames(xx) <- c("Time", "coxreg", "glm")
xx
            Time    coxreg       glm
(Intercept)    1 0.3596118 0.3596118
riskset2       2 0.5615528 0.5615528
riskset3       3 0.7192236 0.7192236
riskset4       4 2.5615528 2.5615528

The results are identical. This correspondence between Poisson and Cox regression was pointed out by Johansen (1983). However, this approach is not very useful in practical work with medium or large data sets, because the output from toBinary may be huge. For instance, with the data set child, the resulting data frame consists of slightly more than 56 million rows and 11 columns.
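
The following small sketch (not from the text) shows where that size comes from: toBinary produces one row for every (failure time, individual at risk) pair, so the number of rows equals the sum of the risk set sizes over the distinct failure times. For the toy data,

with(dat, sum(sapply(unique(exit[event == 1]),
                     function(t) sum(enter < t & exit >= t))))

gives 10, the number of rows in datB above.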

Note: There is a similar connection to binomial regression: with method = "ml" in coxreg, we can compare its output with that of glm with family = binomial(link = "cloglog").

fit.ml <- coxreg(Surv(enter, exit, event) ~ x, 
                 method = "ml", data = dat)
fit.b <- glm(event ~ riskset + x, 
             family = binomial(link = "cloglog"), 
             data = datB)
rbind(coef(fit.ml),  coef(fit.b)["x"])
             x
[1,] -1.267101
[2,] -1.267101

Identical regression coefficients (and slightly different from what we got from the Poisson approach). More about discrete time models in Chapter 7, “Parametric models.”

5.3 The Connection to the Piecewise Constant Hazard Model

For completeness, the connection between the Poisson distribution and the piecewise constant hazard model is mentioned here. This is explored in detail in Chapter 7, where the Poisson formulation is recast in a survival analysis framework and becomes very straightforward to use.

5.4 Tabular Lifetime Data

Although it is possible to use Poisson regression in place of Cox regression, the most useful application of Poisson regression in survival analysis is with tabular data. We have already seen an example of this in Chapter 2: the life table and its connection to the survival function.

Example 5.2 (Mortality in ages 61–90, Sweden 2019–2020)

We create a data set swe (a subset and combination of swepop and swedeaths in eha), which contains population size and number of deaths by age and sex in Sweden 2019 and 2020. The age span is restricted to ages 61–90. One reason for these choices is to get a grip on the effect of the covid-19 pandemic by comparing mortality in 2020 with that in 2019, the last pre-corona year. Normally, mortality decreases in all age groups from one year to the next, but 2019 to 2020 is different, see Table ??.

The Poisson model is as follows, where \(D_{ij}\) is the number of deaths and \(P_{ij}\) the population size for age \(i\) and sex \(j\), \(i = 61, \ldots, 90\); \(j = \text{female (0)}, \text{male (1)}\), and \(\lambda_{ij}\) is the corresponding mortality:

\[\begin{equation*} D_{ij} \sim \text{Poisson}(\lambda_{ij} P_{ij}), \quad i = 61, 62, \ldots, 90; \; j = 0, 1, \end{equation*}\]

with

\[\begin{equation*} \begin{split} \lambda_{61,j} P_{61,j} &= P_{61,j}\exp(\gamma + \beta * j) \\ &= \exp(\log(P_{61,j}) + \gamma + \beta * j), \quad j = 0, 1, \end{split} \end{equation*}\]

and

\[\begin{equation*} \begin{split} \lambda_{ij} P_{ij} &= P_{ij} \exp(\gamma + \alpha_i + \beta * j) \\ &= \exp(\log(P_{ij}) + \gamma + \alpha_i + \beta * j), \quad i = 62, 63, \ldots, 90; \; j = 0, 1. \end{split} \end{equation*}\]

This calculation shows that the log of the population sizes, \(\log(P_{ij})\), is the correct offset to use in the Poisson regression. First we convert age to a factor (imposing no restrictions such as linearity); then the R function glm (“generalized linear model”) is used to fit a Poisson regression model.

swe$age <- factor(swe$age)
fit <- glm(deaths ~ offset(log(pop)) + year + sex + age, 
           family = poisson, data = swe)
drop1(fit, test = "Chisq")
Single term deletions

Model:
deaths ~ offset(log(pop)) + year + sex + age
       Df Deviance    AIC    LRT  Pr(>Chi)
<none>         152   1260                 
year    1      387   1493    235 < 2.2e-16
sex     1     4653   5760   4501 < 2.2e-16
age    29   117129 118180 116977 < 2.2e-16

The function drop1 is used above to perform a likelihood ratio test for each of the three variables. All three are highly statistically significant, meaning that they are very important in describing how mortality varies with year, sex, and age. To see how, we present (some of) the parameter estimates.

round(summary(fit)$coefficients[c(1:3), ], 3) # First 3 rows.
            Estimate Std. Error  z value Pr(>|z|)
(Intercept)   -5.453      0.029 -190.722        0
year2020       0.084      0.005   15.316        0
sexmen         0.370      0.006   66.995        0
round(summary(fit)$coefficients[c(19:21), ], 3) # Rows 19 to 21 (ages 77-79).
      Estimate Std. Error z value Pr(>|z|)
age77    1.685      0.032  53.329        0
age78    1.791      0.032  56.560        0
age79    1.945      0.031  61.788        0

The results so far tell us that men have a distinctly higher mortality than women, and that mortality steadily increases with age; no surprises so far. More interesting is that the covid-19 year 2020 has a significantly higher mortality than the year 2019. The parameter estimates for age also open up the possibility of simplifying the model by assuming a linear or quadratic effect of age on mortality (a sketch is given below); we leave a closer look at that possibility for later.
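
A minimal sketch of that simplification (not pursued further here, and the variable name age.num is introduced only for illustration): treat age as a numeric variable with a quadratic term instead of a factor, and compare the fits by AIC.

swe$age.num <- as.numeric(as.character(swe$age))  # undo the factor coding
fit.q <- glm(deaths ~ offset(log(pop)) + year + sex + age.num + I(age.num^2),
             family = poisson, data = swe)
## AIC(fit.q) can then be compared with AIC(fit).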

One question remains: Is the 2019 advantage the same, in relative terms, over all ages? To answer that question, we introduce an interaction:

fit1 <- glm(deaths ~ offset(log(pop)) + sex + year * age, 
            family = poisson, data = swe)
drop1(fit1, test = "Chisq")
Single term deletions

Model:
deaths ~ offset(log(pop)) + sex + year * age
         Df Deviance    AIC    LRT Pr(>Chi)
<none>         107.2 1273.6                
sex       1   4606.8 5771.2 4499.6  < 2e-16
year:age 29    152.1 1260.4   44.9  0.03019

Note that the drop1 function does not test the main effects of year and age, only the sex effect and the year:age interaction. This is because, as we saw before, main effects tend to be meaningless in the presence of an interaction involving them. The interaction here is weak (borderline at the 5 per cent level), so treating the 2019 and 2020 hazards as proportional over age is a reasonable approximation. This is most easily judged in graphs. The first plot is based on the estimated Poisson model without interaction. Some calculations are first needed.

beta <- coefficients(fit)[2:3]       # year2020 and sexmen
alpha <- coefficients(fit)[-(2:3)]   # (Intercept) and the age effects
alpha[2:length(alpha)] <- alpha[2:length(alpha)] + alpha[1]  # add the intercept
lambda.2019 <- exp(alpha)            # hazard by age, women, 2019
lambda.2020 <- exp(alpha + beta[1])  # hazard by age, women, 2020
ages <- 61:90                        # the age classes in swe

Then the hazard functions are plotted by

par(las = 1)
plot(ages, lambda.2019, ylim = c(0, 0.15), type = "S",
     xlab = "Age", ylab = "Mortality")
lines(ages, lambda.2020, type = "S", lty = 2)
abline(h = 0)
legend("topleft", legend = c(2019, 2020), lty = 1:2)

Note the argument ylim. It sets the scale on the \(y\) axis, and some trial and error may be necessary to get a good choice. The function lines adds a curve to an already existing plot, and the function abline adds a straight line; the argument h = 0 gives a horizontal line. See the help pages for these functions for further information.

The result is shown in Figure 5.4.

FIGURE 5.4: Model based hazard functions for 2019 and 2020.

We can also make a plot based only on raw data. For that, it is only necessary to calculate the occurrence-exposure rates. An occurrence-exposure rate is simply the number of occurrences of some event divided by the total exposure (waiting) time. In the present example, the mean population size during one year is a very good approximation of the exact total exposure time (in years).

y2019 <- swe[swe$year == 2019, ]
y2019 <- aggregate(y2019[, c("deaths", "pop")], 
                   by = y2019["age"], FUN = sum)  # sum over sex
y2020 <- swe[swe$year == 2020, ]
y2020 <- aggregate(y2020[, c("deaths", "pop")], 
                   by = y2020["age"], FUN = sum)  # sum over sex
rate19 <- y2019$deaths / y2019$pop  # occurrence-exposure rates, 2019
rate20 <- y2020$deaths / y2020$pop  # occurrence-exposure rates, 2020

And finally the plot of the raw death rates is created as follows,

par(las = 1)
plot(ages, rate19, ylim = c(0, 0.045), xlab = "Age", 
     ylab = "Mortality", type = "S")
lines(ages, rate20, lty = 2, type = "S")
abline(h = 0)

with the result as shown in Figure 5.5.

FIGURE 5.5: Hazard functions for 2019 and 2020 based on raw data.

The differences between Figure 5.4 and Figure 5.5 are (i) the model-based hazard curves are higher than the raw ones, because the model has women as the reference category, so Figure 5.4 depicts the conditions for women, and (ii) small deviations from proportionality are visible: the relative difference seems to increase with age. As the likelihood ratio test indicated, however, this deviation is weak and only borderline significant at the 5 per cent level.

A final remark: It was quite tedious, although straightforward, to apply raw Poisson regression to this problem. In Chapter 7, it will be shown how to use the function tpchreg in the eha package for a smooth and simple way of analyzing tabular data.\(\ \Box\)

References

Johansen, S. 1983. “An Extension of Cox’s Regression Model.” International Statistical Review 51: 165–74.