Chapter 2 Single Sample Data

The basic model descriptions of survival data are introduced. Basically, the distribution of survival data may be either continuous or discrete, but the nonparametric estimators of the distributions of survival time are discrete in any case. Therefore, the discrete models are necessary to know in order to understand nonparametric estimation.

In the present chapter, only nonparametric estimation is discussed. This means that no assumptions whatsoever are made about the true underlying distribution. Parametric models are presented in Chapter 8.

2.1 Continuous Time Model Descriptions

Traditional statistical model descriptions are based on the density and the cumulative distribution functions. These functions are not so suitable for use with censored and/or truncated data. The survival and hazard functions are better suited, as will be demonstrated here. It should however be acknowledged that all these functions are simple functions of each other, so the information they convey is in principle the same, it is only convenience that determines which one(s) to choose and use in a particular application.

2.1.1 The survival function

We start with a motivating example (2.1), the life table, and its connection to death risks and survival probabilities in a population, see Table ??.

Example 2.1 (Swedish life tables)

Statistics Sweden (SCB) is a government agency that produces statistics and have a coordinating role for the official statistics of Sweden. From their home page it is possible to download vital population statistics, like population size and number of deaths by age, gender, and year. From such data it is possible to construct life tables.

Two raw tables are needed, one with population size by age, sex, and year, and one with the number of deaths by the same categories. Tables are available for age intervals of lengths 1, 5, and 10 years. We always prefer data with as much information as possible, so we choose the tables with age interval lengths one. It is always possible to aggregate later, if that is preferred.

There are two data sets in the R package eha that contain population size information and information about number of deaths, swepop and swedeaths, respectively. Both cover the years 1969–2020. We use both to construct a data frame females containing population size and number of deaths for females 2020. Below, the first five and last five rows are printed. Note that the last row, with age = 100 in fact includes all ages from 100 and above. It is common that tables like this one ends with an open interval.

library(eha)
females <- swepop[swepop$sex == "women" & swepop$year == 2020, 
                  c("age", "pop")]
females$deaths <- swedeaths[swedeaths$sex == "women" & 
                                swedeaths$year == 2020, "deaths"]
print(females[c(1:5, 97:101), ], row.names = FALSE)
 age     pop deaths
   0 55505.5    112
   1 57158.0      8
   2 58254.5      5
   3 59680.0      3
   4 60075.5      9
  96  3940.5   1256
  97  2768.0    883
  98  2033.5    764
  99  1445.0    581
 100  1933.5   1004

A full view of the population size by age is given in Figure 2.1.

Female population by age, Sweden 2020.

FIGURE 2.1: Female population by age, Sweden 2020.

And the numbers of deaths by age are shown in Figure 2.2.

Female deaths by age, Sweden 2020.

FIGURE 2.2: Female deaths by age, Sweden 2020.

Our primary interest lies in the age specific mortality, which simply is given by dividing the number of deaths by the population size, age by age. We do that, and add one column, derived from the original three, in the following way.

females$risk <- females$deaths / females$pop
alive <- numeric(101)
alive[1] <- 100000
for (i in 2:101){
    alive[i] <- alive[i - 1] * (1 - females$risk[i-1])
}
females$alive <- alive

The result is shown in Table ??.

It is constructed in the following way. The first column, headed by age, contains the ages in completed years. The second column, headed pop contains the numbers of women in Sweden in the corresponding ages. It is equal to the mean population size over the year. Since population numbers from the SCB are taken at the turn of the years, the numbers in column two are averages of two columns in the SCB file. The third column contains the total numbers of deaths in each age category during the year. The fourth and fifth columns are derived from the previous ones. The risk is the ratio between the number of deaths and population size for each age.

Column five, alive, contains the actual construction of the life table. It depicts the actual size of a hypothetical birth cohort, year by year, exposed to the death risks of column four. It starts with 100000 (an arbitrary, but commonly chosen, number) newborn females. Under the first year of life the death risk is 0.00202, meaning that we expect 0.00202 x 100000 = 202 girls to die in the first year. In other words, 99798 girls will remain alive at the first anniversary (at exact age one year). That is the number in the second row in the fifth column. The calculations then continues in the same way through the last two columns. Note, though, that the last interval (100+) is an open interval, i.e., it has no fixed upper limit.

Note the difference in interpretation between the numbers in the second column (“pop”) and those in the last (“alive”); the former gives the age distribution in a population a given year (referred to as period data), while the latter constitutes synthetic cohort data, i.e., a group of newborn are followed from birth to death, and the actual size of the (still alive) cohort is recorded age by age. That is, “individuals” in the synthetic cohort live their whole lives under the mortality conditions of the year 2020. Of course no real individual can experience this scenario, hence the name “synthetic cohort.” Nevertheless, it is a useful measure for illustrating the state of affairs regarding mortality a given year.

Life table and survival function, females, Sweden 2020.

FIGURE 2.3: Life table and survival function, females, Sweden 2020.

It makes sense to plot the life table, see Figure 2.3. Exactly how this is done with individual data will be shown later in the chapter, but here we simply plot alive (divided by 100000) against age. Median life length is 86.8, see construction in the figure. By dividing all numbers by the original cohort size (in our case 100000), they may be interpreted as probabilities.\(\ \Box\)

The survival function \(S(t), \; t > 0\), is defined as the probability of surviving past \(t\), \(t > 0\), or with a formula,

\[\begin{equation} S(t) = P(T \ge t), \quad t > 0, \tag{2.1} \end{equation}\]

where \(T\) is the (random) life length under study. The symbol \(P(T \ge t)\) reads “the probability that \(T\) is equal to or larger than \(t\); here \(t\) is a fixed number while \(T\) denotes a “random quantity.” In statistical language, \(T\) is called a random variable. In our example, \(T\) is the (future) life length of a randomly chosen newborn female, of course unknown at birth.

We will, for the time being, assume that \(S\) is a “nice” function, smooth and differentiable. That will ensure that the following definitions are unambiguous.

2.1.2 The density function

The density function is defined as minus the derivative of the survival function, or

\[\begin{equation} f(t) = -\frac{d}{dt} S(t), \quad t > 0 \tag{2.2} \end{equation}\]

The intuitive interpretation of this definition is that, for small enough \(s\), the following approximation is good:

\[ P(t_0 \le T < t_0 + s) = S(t_0) - S(t_0 + s) \approx s f(t_0). \] This is illustrated in Figure 2.4, with \(t_0 = 10\) and \(s = 2\).

Interpretation of the density function.

FIGURE 2.4: Interpretation of the density function.

For a short enough interval \((t_0, t_0 +s]\), the probability of an observation falling in that interval is well approximated by the area of a rectangle with sides of lengths \(s\) and \(f(t_0)\), or \(sf(t_0)\). The formal mathematical definition as a limit is given in equation (2.3).

\[\begin{equation} f(t) = \lim_{s \rightarrow 0} \frac{P(t \le T < t + s)}{s}, \quad t > 0. \tag{2.3} \end{equation}\]

For the Swedish women 2020, the estimated density function is shown in Figure 2.5.

Density function for the life length of Swedish women 2020.

FIGURE 2.5: Density function for the life length of Swedish women 2020.

2.1.3 The hazard function

The hazard function is central to the understanding of survival analysis, so you are recommended to get at least an intuitive understanding of it. One way of thinking of it is as an ``instant probability’’ (per unit time); at a given age \(t\), it measures the risk of dying in a short interval \((t, t + s)\) immediately after \(t\), for an individual who still is alive at \(t\).

\[\begin{equation*} \begin{split} h(t) &= \lim_{s \rightarrow 0}\frac{P(t \le T < t + s \mid T \ge t)}{s} \\ &= \lim_{s \rightarrow 0}\frac{S(t) - S(t + s)}{s S(t)} \\ &= \frac{f(t)}{S(t)} \end{split} \end{equation*}\]

Note the difference between the density and the hazard functions. The former is (the limit of) an unconditional probability, while the latter is (the limit of) a conditional probability per time unit.

For the Swedish women 2020, the estimated hazard function (age-specific mortality) is shown in Figure 2.6.

Hazard function for the life length of Swedish women 2020.

FIGURE 2.6: Hazard function for the life length of Swedish women 2020.

2.1.4 The cumulative hazard function

The cumulative hazard function is defined as the integral of the hazard function,

\[\begin{equation*} H(t) = \int_0^t h(s) ds, \quad t \ge 0. \end{equation*}\] That is, an intuitive interpretation is that the cumulative hazard function successively accumulates the instant risks.

The cumulative hazard function is important because it is fairly easy to estimate nonparametrically (i.e., without any restrictions or assumptions), in contrast to the hazard and density functions.

For the Swedish women 2020, the estimated cumulative hazard function is shown in Figure 2.7.

Cumulative hazard function for Swedish women 2020.

FIGURE 2.7: Cumulative hazard function for Swedish women 2020.

Example 2.2 (The exponential distribution)

Perhaps the simplest continuous life length distribution is the exponential distribution. It is simple because its hazard function is constant:

\[\begin{equation*} h(t) = \lambda, \quad \lambda > 0, \; t \ge 0. \end{equation*}\] From this it is easy to calculate the other functions that characterize the exponential distribution. The cumulative hazards function is

\[\begin{equation*} H(t) = \lambda t, \quad \lambda > 0,\; t \ge 0, \end{equation*}\] the survival function is

\[\begin{equation*} S(t) = e^{-\lambda t}, \quad \lambda > 0,\; t \ge 0, \end{equation*}\] and the density function is

\[\begin{equation*} f(t) = \lambda e^{-\lambda t}, \quad \lambda > 0,\; t \ge 0. \end{equation*}\] The property of constant hazard implies no aging. This is not a realistic property for human mortality, but, as we will see, a useful benchmark, and a useful tool for modelling mortality over short time intervals (piecewise constant hazard). The exponential distribution is described in detail in Appendix B.\(\ \Box\)

2.2 Discrete Time Models

So far we have assumed, implicitly or explicitly, that time is continuous. We will now introduce discrete time survival models, and the reason is two-fold: (i) Even if data are generated from truly continuous time models, nonparametric estimation of these models will, as will be shown later, give rise to estimators corresponding to a discrete time model. This is an inherent property of nonparametric maximum likelihood estimation. Thus, in order to study the properties of these estimators, we need some knowledge of discrete time models. (ii) Data are discrete, usually through grouping. For instance, life lengths may be measured in full years, introducing tied data.

It is important to realize that in practice all data are discrete. For instance, it is impossible to measure time with infinite precision. Therefore, all data are more or less rounded. If data are so much rounded that the result is heavily tied data, true discrete-data models are called for.

Discrete time models will now be introduced. Let \(R\) be a discrete random variable with

  • support \((r_1, r_2, \ldots, r_k)\) (positive real numbers, usually \(1, 2, \ldots\) or \(0, 1, 2, \ldots\)),
  • probability mass function

\[\begin{equation*} p_i = P(R = r_i), \quad i = 1, \ldots, k, \end{equation*}\] with \(p_i > 0, \quad i = 1, \ldots, k\) and \(\sum_{i=1}^k p_i = 1\).

Then

\[\begin{equation*} F(t) = \sum_{i: r_i \le t} p_i, \quad -\infty < t < \infty, \end{equation*}\]

is the cumulative distribution function, and

\[\begin{equation*} S(t) = \sum_{i: r_i \ge t} p_i, \quad -\infty < t < \infty, \end{equation*}\]

is the survival function.

The discrete time hazard function is defined as

\[\begin{equation} h_i = P(R = r_i \mid R \ge r_i) = \frac{p_i}{\sum_{j=i}^k p_j}, \quad i = 1, \ldots, k. \tag{2.4} \end{equation}\]

Note that here, the hazard at any given time point is a conditional probability, so it must always be bounded to lie between zero and one. In the continuous case, on the other hand, the hazard function may take any positive value. Further note that if, like here, the support is finite, the last “hazard atom” is always equal to one (having lived to the “last station,” one is bound to die).

The system (2.4) of equations has a unique solution, easily found by recursion:

\[\begin{equation} p_i = h_i \prod_{j=1}^{i-1} (1 - h_j), \quad i = 1, \ldots, k. \tag{2.5} \end{equation}\]

From this we get the discrete time survival function at each support point as

\[\begin{equation*} S(r_i ) = \sum_{j = i}^k p_j = \prod_{j = 1}^{i-1} (1-h_j), \quad i = 1, \ldots, k, \end{equation*}\]

and the general definition

\[\begin{equation} S(t) = \prod_{j: r_j < t} \left(1 - h_j\right), \quad t \ge 0 \tag{2.6} \end{equation}\]

It is easily seen that \(S\) is decreasing, \(S(0) = 1\), and \(S(\infty) = 0\), as it should be.

Example 2.3 (The geometric distribution)

The geometric distribution has support on \(\{1, 2, \ldots\}\) (another version also includes zero in the support, this is the case for the one in R), and the hazard function \(h\) is constant:

\[\begin{equation*} h_i = h, \quad 0 < h < 1, \; i = 1, 2, \ldots. \end{equation*}\] Thus, the geometric distribution is the discrete analogue to the exponential distribution in that it implies no aging. The probability mass function is, from ,

\[\begin{equation*} p_i = h (1-h)^{i-1}, i = 1, 2, \ldots, \end{equation*}\] and the survival function becomes, from (2.6),

\[\begin{equation*} S(t) = (1-h)^{[t]}, \quad t > 0, \end{equation*}\] where \([t]\) denotes the largest integer smaller than or equal to \(t\) (rounding towards zero).

2.3 Nonparametric Estimators

As an introductory example, look at an extremely simple data set: \(4\), \(2^*\), \(6\), \(1\), \(3^*\) (starred observations are right censored; in the figure, deaths are marked with \(+\), censored observations with a small circle), see Figure 2.8.

A simple survival data set.

FIGURE 2.8: A simple survival data set.

How should the survival function be estimated? The answer is that we take it in steps. First, the hazard “atoms” are estimated. It is done nonparametrically, and the result as such is not very useful. Its potential lies in that it is used as the building block in constructing estimates of the cumulative hazards and survival functions.

2.3.1 The hazard atoms

In Figure 2.9, the observed event times in Figure 2.8 are marked by the vertical dashed lines at durations 1, 4, and 6, respectively. In the estimation of the hazard atoms, the concept of risk set is of vital importance.

Preliminaries for estimating the hazard function.

FIGURE 2.9: Preliminaries for estimating the hazard function.

The risk set \(R(t)\) at duration \(t, \; t > 0\) is defined mathematically as

\[\begin{equation} R(t) = \{\text{all individuals under observation at $t-$}\}, \tag{2.7} \end{equation}\] or in words, the risk set at time \(t\) consists of all individuals present and under observation just prior to \(t\). The reason we do not say ``present at time \(t\)’’ is that it is vital to include those individuals who have an event or are right censored at the exact time \(t\). In our example, the risk sets \(R(1)\), \(R(4)\), and \(R(6)\) are the interesting ones. They are:

\[\begin{equation*} \begin{split} R(1) &= \{1, 2, 3, 4, 5\} \\ R(4) &= \{1, 3\} \\ R(6) &= \{3\} \\ \end{split} \end{equation*}\]

The estimation of the hazard atoms is simple. First, we assume that the probability of an event at times where no event is observed, is zero. Then, at times where events do occur, we count the number of events and divides that number by the size of the corresponding risk set. The result is shown in (2.8).

\[\begin{equation} \begin{split} \hat{h}(1) &= \frac{1}{5} = 0.2 \\ \hat{h}(4) &= \frac{1}{2} = 0.5 \\ \hat{h}(6) &= \frac{1}{1} = 1 \\ \end{split} \tag{2.8} \end{equation}\]

See also Figure 2.10.

Nonparametric estimation of the hazard function.

FIGURE 2.10: Nonparametric estimation of the hazard function.

As is evident from Figure 2.10, the estimated hazard atoms will be too irregular to be of practical use; they need smoothing. The simplest way of smoothing them is to calculate the cumulative sums, which leads to the Nelson-Aalen estimator (Nelson 1972; Aalen 1978) of the cumulative hazards function, see Section 2.3.2. There are more direct smoothing techniques to get reasonable estimators of the hazard function itself, e.g., kernel estimators , but they will not be discussed here. See e.g. Silverman (1986) for a general introduction to kernel smoothing.

2.3.2 The Nelson-Aalen estimator

The Nelson-Aalen estimator.

FIGURE 2.11: The Nelson-Aalen estimator.

From the theoretical relation we immediately get

\[\begin{equation*} \hat{H}(t) = \sum_{s \le t} \hat{h}(s), \quad t \ge 0, \end{equation*}\]

which is the Nelson-Aalen estimator (Nelson 1972; Aalen 1978), see Figure 2.11. The sizes of the jumps are equal to the heights of the “spikes” in Figure 2.10.

2.3.3 The Kaplan-Meier estimator

From the theoretical relation (2.6) we get

\[\begin{equation} \hat{S}(t) = \prod_{s < t} \bigl(1 - \hat{h}(s)\bigr), \quad t \ge 0, \tag{2.9} \end{equation}\]

see also Figure 2.12. Equation (2.9) may be given a heuristic interpretation: In order to survive time \(t\), one must survive all “spikes” (or shocks) that come before time \(t\). The multiplication principle for conditional probabilities then gives equation (2.9).

The Kaplan-Meier estimator.

FIGURE 2.12: The Kaplan-Meier estimator.

2.4 Doing It in R

We show how to do it in R by using the male mortality data set mort, which is part of the package eha. By loading eha we have access to mort. The R function head is convenient for looking at a few (six by default) rows in a data frame.

library(eha)
head(mort)
  id  enter   exit event birthdate   ses
1  1  0.000 20.000     0  1800.010 upper
2  2  3.478 17.562     1  1800.015 lower
3  3  0.000 13.463     0  1800.031 upper
4  3 13.463 20.000     0  1800.031 lower
5  4  0.000 20.000     0  1800.064 lower
6  5  0.000  0.089     0  1800.084 lower

This is the normal form of storing left truncated and right censored data. Here, males are followed from their twentieth birthdate until death or their fortieth birthdate, whichever comes first. An indicator for death is introduced, called event. The value 1 indicates that the corresponding life time is fully observed, while the value 0 indicates a right censored life time. Another common coding scheme is TRUE and FALSE, respectively.

2.4.1 Nonparametric estimation

In the R package eha, the plot function can be used to plot both Nelson-Aalen (cumulative hazards) and Kaplan-Meier (survival) curves. Here is the code:

par(mfrow = c(1, 2))# Two panels, "one row, two columns".
with(mort, plot(Surv(enter, exit, event), fun = "cumhaz", 
     main = "Cumulativa hazards function",
     xlab = "Duration"))
with(mort, plot(Surv(enter, exit, event),
                main = "Survival function",
                xlab = "Duration"))

and the result is seen in Figure 2.13.

Nelson-Aalen plot (left panel) and Kaplan-Meier plot (right panel), male mortality data.

FIGURE 2.13: Nelson-Aalen plot (left panel) and Kaplan-Meier plot (right panel), male mortality data.

Note the use of the function with; it tells the plot function that it should get its data (enter, exit, event) from the data frame mort. The function Surv, imported from the package survival, creates a “survival object,” which is used in many places. It is for instance the response in all functions that perform regression analysis on survival data.

Note that the “Duration” in Figure 2.13 is duration (in years) since the day each man became twenty years of age. They are followed until death or age forty, whichever comes first. The right hand panel shows that approximately 25 per cent of the men alive at age twenty died before they became forty. Data come from a 19th century Swedish sawmill area.

2.4.2 Parametric estimation

It is also possible to fit a parametric model to data with the aid of the function phreg (it also works with the function aftreg). Just fit a “parametric proportional hazards model with no covariates.”

par(mfrow = c(1, 2), las = 1)
fit.w <- phreg(Surv(enter, exit, event) ~ 1, data = mort)
plot(fit.w, fn = "cum", main = "")
plot(fit.w, fn = "sur", main = "")
Male mortality, Weibull fit.

FIGURE 2.14: Male mortality, Weibull fit.

Note that the default distribution in phreg is the Weibull distribution, which means that if no distribution is specified (through the argument dist), then the Weibull distribution is assumed.

References

Aalen, O. O. 1978. “Nonparametric Inference for a Family of Counting Processes.” Annals of Statistics 6: 701–26.
Nelson, W. 1972. “Theory and Applications of Hazard Plotting for Censored Failure Data.” Technometrics 14: 945–65.
Silverman, B. W. 1986. Density Estimation. Chapman & Hall.