2.1 Continuous time model descriptions
Traditional statistical model descriptions are based on the density and the cumulative distribution functions. These functions are not so suitable for use with censored and/or truncated data. The survival and hazard functions are better suited, as will be demonstrated here. It should however be acknowledged that all these functions are simple functions of each other, so the information they convey is in principle the same, it is only convenience that determines which one(s) to choose and use in a particular application.
2.1.1 The survival function
We start with a motivating example (2.1), the life table, and its connection to death risks and survival probabilities in a population, see Table 2.3.
Statistics Sweden (SCB) is a government agency that produces statistics and have a coordinating role for the official statistics of Sweden. From their home page (http://www.scb.se) it is possible to download vital population statistics, like population size and number of deaths by age, gender, and year. From such data it is possible to construct a life table.
Two tables are needed, one with population size by age and sex, and one with the number of deaths by the same categories. Tables are available for age intervals of lengths 1, 5, and 10 years. We always prefer data with as much information as possible, so we choose the age interval length one. It is always possible to aggregate later, if that is prefered.
First, the population data is read from the CSV file downloaded from SCB.
pop <- read.table("Data/BE0101A6.csv", header = TRUE,
stringsAsFactors = FALSE,
fileEncoding = "latin1")The argument stringsAsFactors = FALSE is needed in order to avoid character columns
being converted to factors, and fileEncoding is needed to take care of some
strange Swedish characters (å, ä, ö) in variable names and labels.
The result is shown in Table 2.1.
| ålder | kön | X2016 | |
|---|---|---|---|
| 1 | 0 år | män | 60500 |
| 2 | 0 år | kvinnor | 56951 |
| 3 | 1 år | män | 60996 |
| 4 | 1 år | kvinnor | 57254 |
| ….. | ….. | ….. | ….. |
| 199 | 99 år | män | 276 |
| 200 | 99 år | kvinnor | 1094 |
| 201 | 100+ år | män | 324 |
| 202 | 100+ år | kvinnor | 1615 |
Next the table of deaths by age and sex is read from disk.
deaths <- read.table("Data/BE0101D9.csv", header = TRUE,
stringsAsFactors = FALSE,
fileEncoding = "latin1")The result is shown in Table 2.2.
| ålder | kön | X2016 | |
|---|---|---|---|
| 1 | 0 år | män | 157 |
| 2 | 0 år | kvinnor | 135 |
| 3 | 1 år | män | 13 |
| 4 | 1 år | kvinnor | 10 |
| ….. | ….. | ….. | ….. |
| 199 | 99 år | män | 118 |
| 200 | 99 år | kvinnor | 420 |
| 201 | 100+ år | män | 185 |
| 202 | 100+ år | kvinnor | 796 |
We concentrate on female mortality 2016 by combining the two tables after removing all information relating to men (“män”) and translating to English.
names(pop) <- c("age", "sex", "pop")
females <- pop[pop$sex == "kvinnor", c("age", "pop")]
females$age <- 0:100
females$deaths <- deaths$X2016[deaths[, 2] == "kvinnor"]
head(females, 4)## age pop deaths
## 2 0 56951 135
## 4 1 57254 10
## 6 2 57591 6
## 8 3 57536 6
## age pop deaths
## 196 97 2389.0 791
## 198 98 1611.0 577
## 200 99 1094.5 420
## 202 100 1615.0 796
Then we add two columns, derived from the original three.
females$risk <- females$deaths / females$pop
alive <- numeric(101)
alive[1] <- 100000
for (i in 2:101){
alive[i] <- alive[i - 1] * (1 - females$risk[i-1])
}
females$alive <- aliveThe result is shown in Table 2.3.
| age | pop | deaths | risk | alive |
|---|---|---|---|---|
| 0 | 56951 | 135 | 0.00237 | 100000 |
| 1 | 57254 | 10 | 0.00017 | 99763 |
| 2 | 57591 | 6 | 0.00010 | 99746 |
| 3 | 57536 | 6 | 0.00010 | 99735 |
| 4 | 57710 | 7 | 0.00012 | 99725 |
| … | … | … | … | … |
| 96 | 3783 | 1197 | 0.31642 | 8306 |
| 97 | 2389 | 791 | 0.33110 | 5678 |
| 98 | 1611 | 577 | 0.35816 | 3798 |
| 99 | 1094 | 420 | 0.38374 | 2438 |
| 100+ | 1615 | 796 | 0.49288 | 1502 |
It is constructed in the following way. The first column, headed by age, contains the ages in completed years. The second column, headed pop contains the numbers of women in Sweden in the corresponding ages. It is equal to the mean population size over the year. Since population numbers from the SCB are taken at the turn of the years, the numbers in column two are averages of two columns in the SCB file. The third column is the total numbers of deaths in each age category during the year. The fourth and fifth columns are derived from the previous ones. The risk is essentially the ratio between the number of deaths and population size for each age.
Column five, alive, contains the actual construction of the life table. It depicts the actual size of a birth cohort, year by year, exposed to the death risks of column four. It starts with 100000 newborn females. Under the first year of life the death risk is 0.00237, meaning that we expect 0.00237 x 100000 = 237 girls to die the first year. In other words, 100000 - 237 = 99763 will remain alive at the first anniversary (at exact age one year). That is the number in the second row in the fifth column. The calculations then continues in the same way through the last two columns. Note, though, that the last interval (100+) is an open interval, i.e., it has no fixed upper limit.
Note the difference in interpretation between the numbers in the second column (“pop”) and those in the last (“alive”); the former gives the age distribution in a population a given year (referred to as period data), while the latter constitutes , i.e., a group of newborn are followed from birth to death, and the actual size of the (still alive) cohort is recorded age by age. That is, “individuals” in the synthetic cohort live their whole lives under the mortality conditions of the year 2016. Of course no real individual can experience this scenario, hence the name “synthetic cohort”. Nevertheless, it is a useful tool to illustrate the state of affairs regarding mortality a given year.
FIGURE 2.1: Life table and survival function, females, Sweden 2016.
It makes sense to plot the life table, see Figure 2.1. Exactly how this is done will be shown later in the chapter. Median life length is 86.75, see construction in the figure. By dividing all numbers by the original cohort size (in our case 100000), they may be interpreted as probabilities.\(\Box\)
The survival function \(S(t), \; t > 0\), is defined as the probability of surviving past \(t\), \(t > 0\), or with a formula,
\[\begin{equation} S(t) = P(T \ge t), \quad t > 0, \tag{2.1} \end{equation}\]
where \(T\) is the (random) life length under study. The symbol \(P(T \ge t)\) reads “the probability that \(T\) is equal to or larger than \(t\)”; here \(t\) is a fixed number while \(T\) denotes a “random quantity”. In statistical language, \(T\) is called a random variable. In our example, \(T\) is the (future) life length of a randomly chosen newborn female, of course unknown at birth.
We will, for the time being, assume that \(S\) is a ``nice’’ function, smooth and differentiable. That will ensure that the following definitions are unambiguous.
2.1.2 The density function
The density function is defined as minus the derivative of the survival function, or
\[\begin{equation} f(t) = -\frac{d}{dt} S(t), \quad t > 0 \tag{2.2} \end{equation}\]
The intuitive interpretation of this definition is that, for small enough \(s\), the following approximation is good:
\[ P(t_0 \le T < t_0 + s) = S(t_0) - S(t_0 + s) \approx s f(t_0). \] This is illustrated in Figure 2.2, with \(t_0 = 10\) and \(s = 2\). The application is remaining life at age 60 in a nineteenth century setting.
FIGURE 2.2: Interpretation of the density function.
For a short enough interval \((t_0, t_0 +s]\), the probability of an observation falling in that interval is well approximated by the area of a rectangle with sides of lengths \(s\) and \(f(t_0)\), or \(sf(t_0)\). The formal mathematical definition as a limit is given in equation (2.3).
\[\begin{equation} f(t) = \lim_{s \rightarrow 0} \frac{P(t \le T < t + s)}{s}, \quad t > 0. \tag{2.3} \end{equation}\]
2.1.3 The hazard function
The is central to the understanding of survival analysis, so the reader is recommended to get at least an intuitive understanding of it. One way of thinking of it is as an ``instant probability’’; at a given age \(t\), it measures the risk of dying in a short interval \((t, t + s)\) immediately after \(t\), for an individual who still is alive at \(t\).
\[\begin{equation*} \begin{split} h(t) &= \lim_{s \rightarrow 0}\frac{P(t \le T < t + s \mid T \ge t)}{s} \\ &= \lim_{s \rightarrow 0}\frac{S(t) - S(t + s)}{s S(t)} \\ &= \frac{f(t)}{S(t)} \end{split} \end{equation*}\]
Note the difference between the density and the hazard functions. The former is (the limit of) an probability, while the latter is (the limit of) a probability per time unit.
2.1.4 The cumulative hazard function
The cumulative hazard function is defined as the integral of the hazard function,
\[\begin{equation*} H(t) = \int_0^t h(s) ds, \quad t \ge 0. \end{equation*}\] That is, an intuitive interpretation is that the cumulative hazard function successively accumulates the instant risks.
The cumulative hazard function is important because it is fairly easy to estimate nonparametrically (i.e., without any restrictions), in contrast to the hazard and density functions.
Perhaps the simplest continuous life length distribution is the exponential distribution. It is simple because its hazard function is constant:
\[\begin{equation*} h(t) = \lambda, \quad \lambda > 0, \; t \ge 0. \end{equation*}\] From this it is easy to calculate the other functions that characterize the exponential distribution. The cumulative hazards function is
\[\begin{equation*} H(t) = \lambda t, \quad \lambda > 0,\; t \ge 0, \end{equation*}\] the survival function is
\[\begin{equation*} S(t) = e^{-\lambda t}, \quad \lambda > 0,\; t \ge 0, \end{equation*}\] and the density function is
\[\begin{equation*} f(t) = \lambda e^{-\lambda t}, \quad \lambda > 0,\; t \ge 0. \end{equation*}\] The property of constant hazard implies no aging. This is not a realistic property for human mortality, but, as we will see, a useful benchmark, and a useful model for modelling mortality over short time intervals (piece-wise constant hazard). The exponential distribution is described in detail in Appendix B.\(\Box\)