Chapter 5 Poisson Regression
Here Poisson regression is introduced and its connection to Cox regression is discussed. We start by defining the Poisson distribution, then discuss the connection to Cox regression and tabular lifetime data.
5.1 The Poisson Distribution
The Poisson distribution is used for count data, i.e., when the result may be any non-negative integer 0, 1, 2, …, without upper limit. The probability density function (pdf) \(P\) of a random variable \(X\) following a Poisson distribution is
\[\begin{equation}\label{eq:poisdist} P(X = k) = \frac{\lambda^k}{k!}\exp(-\lambda), \quad \lambda > 0; \; k = 0, 1, 2, \ldots. \end{equation}\]
The parameter \(\lambda\) is both the mean and the variance of the distribution. In Figure 5.1 the pdf is plotted for some values of \(\lambda\).
FIGURE 5.1: The Poisson pdf for different values of the mean.
Note that when \(\lambda\) increases, the distribution looks more and more like a normal distribution.
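As a small illustration of this normal approximation (a sketch, not part of the book's code), the Poisson pdf for a fairly large mean can be overlaid with the matching normal density:
lambda <- 25
x <- 0:60
plot(x, dpois(x, lambda = lambda), type = "h",
     xlab = "k", ylab = "P(X = k)")
curve(dnorm(x, mean = lambda, sd = sqrt(lambda)), add = TRUE, lty = 2)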
In R, the Poisson distribution is represented by four functions, dpois, ppois, qpois, and rpois, representing the probability density function (pdf), the cumulative distribution function (cdf), the quantile function (the inverse of the cdf), and random number generation, respectively. See the help page for the Poisson distribution for more detail. In fact, this is the scheme used for all probability distributions available in R.
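For instance (the arguments below are chosen arbitrarily, only to show the calling conventions):
dpois(2, lambda = 0.5)      # P(X = 2)
ppois(2, lambda = 0.5)      # P(X <= 2)
qpois(0.975, lambda = 0.5)  # smallest k such that P(X <= k) >= 0.975
rpois(5, lambda = 0.5)      # five random draws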
For example, the upper left bar plot in Figure 5.1 is produced in R by
barplot(dpois(0:5, lambda = 0.5), axes = FALSE,
        main = expression(paste(lambda, " = ", 0.5)))
Note that the function dpois is vectorized:
round(dpois(0:5, lambda = 0.5), 5)
[1] 0.60653 0.30327 0.07582 0.01264 0.00158 0.00016
As an example where the Poisson distribution may be relevant, we look at
the number of children born to a woman after marriage. The data frame
fert in eha
can be used to calculate the number of births per married
woman in Skellefteå during the 19th century; however, this data set
contains only marriages with one or more births. Let us instead count the
number of births beyond one.
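The exact code is not shown here; a hedged sketch of how such counts could be obtained, assuming that each row of fert is one birth interval for the mother identified by id and that event flags intervals ending with a birth, is:
library(eha)
births <- with(fert, tapply(event, id, sum))  # births per married woman
kids <- births - 1                            # births beyond the first
barplot(table(kids))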
The result is shown in Figure 5.2.
FIGURE 5.2: Number of children beyond one for married women with at least one child.
The question is: Does this look like a Poisson distribution? One way of checking this is to plot the theoretical distribution with the same mean (the parameter \(\lambda\)) as the sample mean in the data.
lam <- mean(kids)
barplot(dpois(0:12, lambda = lam))
The result is shown in Figure 5.3.
FIGURE 5.3: Theoretical Poisson distribution.
Obviously, the fertility data do not follow the Poisson distribution very well. They are in fact over-dispersed compared to the Poisson distribution. A simple way to check this is to calculate the sample mean and variance of the data: if the data come from a Poisson distribution, these numbers should be equal (theoretically), or at least reasonably close.
mean(kids)
[1] 4.548682
var(kids)
[1] 8.586838
They are not very close, which is also obvious from comparing the graphs. \(\ \Box\)
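A slightly more formal check, not used in the example above, is the classical dispersion index test: under the Poisson assumption, \((n - 1) s^2 / \bar{x}\) is approximately chi-square with \(n - 1\) degrees of freedom.
n <- length(kids)
D <- (n - 1) * var(kids) / mean(kids)
pchisq(D, df = n - 1, lower.tail = FALSE)  # a small value indicates overdispersion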
5.2 The Connection to Cox Regression
There is an interesting connection between Cox regression and Poisson regression, which can be illustrated as follows.
dat <- data.frame(enter = rep(0, 4), exit = 1:4,
                  event = rep(1, 4), x = c(0, 1, 0, 1))
dat
  enter exit event x
1     0    1     1 0
2     0    2     1 1
3     0    3     1 0
4     0    4     1 1
We have generated a very simple data set, dat. It consists of four life lengths with a covariate \(x\). A Cox regression, where the covariate \(x\) is related to survival, is given by
library(eha)
fit <- coxreg(Surv(enter, exit, event) ~ x, data = dat)
print(summary(fit), short = TRUE)
Covariate Mean Coef Rel.Risk S.E. LR p
x 0.600 -0.941 0.390 1.240 0.432
The baseline hazard is given by the function hazards. By default it calculates the cumulative hazards, but here we need the non-cumulative ones, which is accomplished by specifying cum = FALSE:
haz <- hazards(fit, cum = FALSE)
haz
$`1`
     [,1]      [,2]
[1,]    1 0.3596118
[2,]    2 0.5615528
[3,]    3 0.7192236
[4,]    4 2.5615528
attr(,"class")
[1] "hazdata"
Note that haz
is a list with one component per
stratum. In this case there is only one stratum, so the list has one
component. This component is a matrix with two columns; the first contains
the observed distinct failure times, and the second the corresponding
estimates of the hazard “atoms.”
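For illustration, the pieces of haz can be picked out directly; the cumulative sums of the atoms should reproduce the default (cumulative) output of hazards:
times <- haz[[1]][, 1]  # the distinct failure times
atoms <- haz[[1]][, 2]  # the corresponding hazard atoms
cumsum(atoms)           # the cumulative hazard at the failure times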
Now, this can be replicated by Poisson regression! First the data set
is summarized around each observed failure time. That is, a snapshot of the
risk set at each failure time is created by the function toBinary
.
library(eha)
datB <- toBinary(dat)
datB
    event riskset risktime x orig.row
1       1       1        1 0        1
2       0       1        1 1        2
3       0       1        1 0        3
4       0       1        1 1        4
2.1     1       2        2 1        2
3.1     0       2        2 0        3
4.1     0       2        2 1        4
3.2     1       3        3 0        3
4.2     0       3        3 1        4
4.3     1       4        4 1        4
Note that three new “covariates” are created, riskset, risktime, and orig.row. In the Poisson regression to follow, the variable riskset must be included as a factor. The Poisson regression is performed with the glm function:
fit2 <- glm(event ~ riskset + x, family = poisson,
            data = datB)
(co <- coefficients(fit2))
(Intercept)    riskset2    riskset3    riskset4           x 
 -1.0227302   0.4456807   0.6931472   1.9633438  -0.9406136 
co[2:4] <- co[2:4] + co[1] # Calculate the hazard atoms.
haz.glm <- exp(co[1:4])
The parameter estimate corresponding to \(x\) is exactly the same as in the Cox regression. The baseline hazard function is estimated with the aid of the (Intercept) and the riskset estimates, and we compare them below with the hazard estimates from coxreg.
xx <- cbind(haz[[1]], haz.glm)
colnames(xx) <- c("Time", "coxreg", "glm")
xx
            Time    coxreg       glm
(Intercept)    1 0.3596118 0.3596118
riskset2       2 0.5615528 0.5615528
riskset3       3 0.7192236 0.7192236
riskset4       4 2.5615528 2.5615528
The results are identical. This correspondence between Poisson and Cox regression was pointed out by
Johansen (1983). However, this approach is not very useful in practical work with medium or large
data sets, because the output from toBinary
may be huge. For instance, with
the data set child
, the resulting data frame consists of slightly more than
56 million rows and 11 columns.
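The size of the toBinary output can in fact be anticipated without creating it, since it contains one row per subject at risk at each distinct event time. A small sketch for dat, reproducing the 10 rows seen above:
event.times <- dat$exit[dat$event == 1]
sum(sapply(event.times, function(t) sum(dat$enter < t & dat$exit >= t)))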
Note: There is a similar connection to binomial regression: with method = "ml" in coxreg, we can compare its output with that of glm with family = binomial and a complementary log-log link.
fit.ml <- coxreg(Surv(enter, exit, event) ~ x,
                 method = "ml", data = dat)
fit.b <- glm(event ~ riskset + x,
             family = binomial(link = "cloglog"),
             data = datB)
rbind(coef(fit.ml), coef(fit.b)["x"])
             x
[1,] -1.267101
[2,] -1.267101
Identical regression coefficients (and slightly different from what we got from the Poisson approach). More about discrete time models can be found in Chapter 7, “Parametric models.”
5.3 The Connection to the Piecewise Constant Hazard Model
For completeness, the connection between the Poisson distribution and the piecewise constant hazard model is mentioned here. It is explored in detail in Chapter 7, where the Poisson formulation is cast into a survival analysis costume and becomes very straightforward to use.
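In brief, the idea is the following: if \(d_j\) events occur during a total exposure (person-time) \(T_j\) in an interval where the hazard is a constant \(\lambda_j\), the likelihood contribution is proportional to a Poisson likelihood,
\[\begin{equation*} L_j(\lambda_j) = \lambda_j^{d_j} \exp(-\lambda_j T_j) \propto \frac{(\lambda_j T_j)^{d_j}}{d_j!} \exp(-\lambda_j T_j), \end{equation*}\]
so the same estimates are obtained by treating \(d_j\) as a Poisson count with mean \(\lambda_j T_j\).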
5.4 Tabular Lifetime Data
Although it is possible to use Poisson regression in place of Cox regression, the most useful application of Poisson regression in survival analysis is with tabular data. We have already seen an example of this in Chapter 2, the life table and its connection to the survival function.
We create a data set swe
(a subset and combination of swepop
and swedeaths
in eha
), which contains population size and number of deaths by age
and sex in Sweden 2019 and 2020. The age span is restricted to ages 61–80.
One reason for these choices is to get a grip on the effect of the covid-19 pandemic on mortality in 2020, compared to the pre-corona year 2019. Normally, mortality decreases in all age groups from one year to the next, but the change from 2019 to 2020 is different, see Table ??.
The Poisson model is as follows, where \(D_{ij}\) is the number of deaths and \(P_{ij}\) the population size for age \(i\) and sex \(j\), with \(i = 61, \ldots, 80\); \(j = \text{female (0)}, \text{male (1)}\), and \(\lambda_{ij}\) the corresponding mortality:
\[\begin{equation*} D_{ij} \sim \text{Poisson}(\lambda_{ij} P_{ij}), \quad i = 61, 62, \ldots, 80; \; j = 0, 1, \end{equation*}\]
with
\[\begin{equation*} \begin{split} \lambda_{61,j} P_{61,j} &= P_{61,j}\exp(\gamma + \beta * j) \\ &= \exp(\log(P_{61,j}) + \gamma + \beta * j), \quad j = 0, 1, \end{split} \end{equation*}\]
and
\[\begin{equation*} \begin{split} \lambda_{ij} P_{ij} &= P_{ij} \exp(\gamma + \alpha_i + \beta * j) \\ &= \exp(\log(P_{ij}) + \gamma + \alpha_i + \beta * j), \quad i = 62, 63, \ldots, 80; \; j = 0, 1. \end{split} \end{equation*}\]
This calculation shows that it is the log of the population sizes,
\(\log(P_{ij})\), that is the correct offset to use in the Poisson regression.
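As a toy numerical illustration of the offset idea (the numbers below are made up), note that the expected count is the population size times the mortality, so \(\log(P_{ij})\) enters the linear predictor with its coefficient fixed at one:
pop <- c(1000, 1500)
eta <- c(-5.0, -4.6)    # hypothetical linear predictors
pop * exp(eta)          # expected numbers of deaths
exp(log(pop) + eta)     # the same thing, written in offset form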
First we want age to be a factor (no restriction such as linearity is imposed); then the R function glm (“generalized linear model”) is used to fit a Poisson regression model.
swe$age <- factor(swe$age)
fit <- glm(deaths ~ offset(log(pop)) + year + sex + age,
           family = poisson, data = swe)
drop1(fit, test = "Chisq")
Single term deletions
Model:
deaths ~ offset(log(pop)) + year + sex + age
       Df Deviance    AIC    LRT  Pr(>Chi)
<none>        152    1260
year    1      387   1493    235 < 2.2e-16
sex     1     4653   5760   4501 < 2.2e-16
age    29   117129 118180 116977 < 2.2e-16
The function drop1 is used above to perform three likelihood ratio tests, one per variable. The result is that all three variables are highly statistically significant, meaning that they are very important in describing the variation in mortality over year, sex, and age. To see how, we present (some of) the parameter estimates.
round(summary(fit)$coefficients[c(1:3), ], 3) # First 3 rows.
            Estimate Std. Error  z value Pr(>|z|)
(Intercept)   -5.453      0.029 -190.722        0
year2020       0.084      0.005   15.316        0
sexmen         0.370      0.006   66.995        0
round(summary(fit)$coefficients[c(19:21), ], 3) # Rows 19 to 21.
      Estimate Std. Error z value Pr(>|z|)
age77    1.685      0.032  53.329        0
age78    1.791      0.032  56.560        0
age79    1.945      0.031  61.788        0
The results so far tell us that males have a distinctly higher mortality than
females, and that mortality steadily increases with age, no
surprises so far. More interesting is to see that the covid-19 year 2020 has
a significantly higher mortality than the year 2019.
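A quick way to put these effects on the relative risk scale, with rough 95 per cent confidence limits (a small sketch based on the fitted model fit above), is:
est <- summary(fit)$coefficients
ci <- cbind(est[2:3, 1],
            est[2:3, 1] - 1.96 * est[2:3, 2],
            est[2:3, 1] + 1.96 * est[2:3, 2])
colnames(ci) <- c("Rel. risk", "lower 95%", "upper 95%")
round(exp(ci), 3)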
The parameter estimates for age also open up the possibility of simplifying the model by assuming a linear or quadratic effect of age on mortality. We leave that possibility for later.
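Even though a fuller treatment is left for later, a minimal sketch of the idea (the variable age.num is introduced here only for illustration) could look as follows:
swe$age.num <- as.numeric(as.character(swe$age))
fit.lin <- glm(deaths ~ offset(log(pop)) + year + sex + age.num,
               family = poisson, data = swe)
anova(fit.lin, fit, test = "Chisq")  # is the factor version of age needed?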
One question remains: Is the 2019 advantage relatively the same over ages? To answer that question, we introduce an interaction:
fit1 <- glm(deaths ~ offset(log(pop)) + sex + year * age,
            family = poisson, data = swe)
drop1(fit1, test = "Chisq")
Single term deletions
Model:
deaths ~ offset(log(pop)) + sex + year * age
         Df Deviance    AIC    LRT Pr(>Chi)
<none>        107.2 1273.6
sex       1  4606.8 5771.2 4499.6  < 2e-16
year:age 29   152.1 1260.4   44.9  0.03019
Note that the drop1 function does not test the main effects year and age separately. This is because, as we saw before, main effects tend to be meaningless in the presence of an interaction. Here the evidence for an interaction is weak (the AIC even favors the model without it), so we can conclude that in a survival model with year as covariate, the proportional hazards assumption is reasonable. This is most easily seen in graphs. The first plot is based on the estimated Poisson model without interaction. Some calculations are first needed.
beta <- coefficients(fit)[2:3]
alpha <- coefficients(fit)[-(2:3)] # Remove sex and year
alpha[2:length(alpha)] <- alpha[2:length(alpha)] + alpha[1]
lambda.2019 <- exp(alpha)
lambda.2020 <- exp(alpha + beta[1])
Then the plot of the hazard functions is produced by
par(las = 1)
plot(ages, lambda.2019, ylim = c(0, 0.15), type = "S",
xlab = "Age", ylab = "Mortality")
lines(ages, lambda.2020, type = "S", lty = 2)
abline(h = 0)
legend("topleft", legend = c(2019, 2020), lty = 1:2)
Note the parameter ylim
. It sets the scale on the \(y\) axis, and some
trial and error may be necessary to get a good choice.
The function lines adds a curve to an already existing plot, and the function abline adds a straight line; the argument h = 0 results in a horizontal line. See the help pages for these functions for further information.
The result is shown in Figure 5.4.
FIGURE 5.4: Model based hazard functions for 2019 and 2020.
We can also make a plot based only on raw data. For that, it is only necessary to calculate the occurrence-exposure rates. An occurrence-exposure rate is simply the number of occurrences (of some event) divided by the total exposure (waiting) time. In the present example, the mean population size during one year is a very good approximation of the exact total exposure time (in years).
y2019 <- swe[swe$year == 2019, ]
y2019 <- aggregate(y2019[, c("deaths", "pop")],
                   by = y2019["age"], FUN = sum)
y2020 <- swe[swe$year == 2020, ]
y2020 <- aggregate(y2020[, c("deaths", "pop")],
                   by = y2020["age"], FUN = sum)
rate19 <- y2019$deaths / y2019$pop
rate20 <- y2020$deaths / y2020$pop
And finally the plot of the raw death rates is created as follows,
par(las = 1)
plot(ages, rate19, ylim = c(0, 0.045), xlab = "Age",
ylab = "Mortality", type = "S")
lines(ages, rate20, lty = 2, type = "S")
abline(h = 0)
with the result as shown in Figure 5.5.
FIGURE 5.5: Hazard functions for 2019 and 2020 based on raw data.
The differences between Figure 5.4 and Figure 5.5 are (i) the model-based hazard curves are higher than the raw ones, because the model has females as reference category, so the figure depicts conditions for females, and (ii) we see small deviations from proportionality; the relative difference seems to increase with age. However, as the likelihood ratio test above indicated, the evidence for this deviation is weak.
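A quick numeric look at the proportionality question can also be had directly from the raw rates, by forming the age-wise ratios of the 2020 and 2019 rates:
round(rate20 / rate19, 2)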
A final remark: It was quite tedious, although straightforward, to apply raw Poisson
regression to this problem. In Chapter 7, it will be shown how to use the function tpchreg
in the eha package for a smooth and simple way of analyzing tabular data.\(\ \Box\)