1.1 Survival data

Survival data (survival times) constitute the simplest form of event history data. A survival time is defined as the time it takes for an event to occur, measured from a well-defined start event. Thus, there are three basic elements which must be well defined: a time origin, a scale for measuring time, and an event. The response in a statistical analysis of such data is the exact time elapsed from the time origin to the time at which the event occurs. The challenge, which motivates special methods, is that in most applications this duration cannot be observed exactly.

As an introduction to the kinds of research questions that event history and survival analysis can address, let us look at a data set found in the eha package (Broström 2017) in R (R Core Team 2017).

Example 1.1 (Old age mortality)

The data set oldmort in eha contains survival data from the parish Sundsvall in the mid-east of 19th century Sweden. The name oldmort is an acronym for old age mortality. The source is digitized information from historical parish registers (church books). More information about this can be found on the web page of the Centre for Demographic and Ageing Research at Umeå University (CEDAR).

The sampling was done as follows: Every person who was present, alive, and 60 years of age or above at any time between 1 January 1860 and 31 December 1879 was followed from the entrance age (for most people that would be 60) until the age when last seen, determined by death, out-migration, or survival until 31 December 1879. Those born during the eighteenth century would enter observation at an age above 60, given that they lived long enough, that is, at least until 1 January 1860.
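
The sampling rule can be sketched in a few lines of R. The data frame persons, its columns, and all dates below are hypothetical values made up for illustration; this is a sketch of the rule as described, not the code that produced oldmort.

```r
# Hypothetical persons; all dates are decimal years (made-up values).
persons <- data.frame(
  birth  = c(1795.25, 1805.50, 1812.00, 1790.00),
  death  = c(1865.75, NA, 1885.00, 1858.00),   # NA: no death recorded
  outmig = c(NA, 1870.00, NA, NA)              # NA: never moved out
)
study_start <- 1860   # 1 January 1860
study_end   <- 1880   # end of 31 December 1879

# Observation starts at the 60th birthday or at study start, whichever is later.
first_seen <- pmax(persons$birth + 60, study_start)
# Observation stops at death, out-migration, or study end, whichever is first.
last_seen <- pmin(persons$death, persons$outmig, study_end, na.rm = TRUE)

persons$enter <- first_seen - persons$birth
persons$exit  <- last_seen - persons$birth
# The event indicator is TRUE only if observation ended with a death.
persons$event <- !is.na(persons$death) & persons$death == last_seen

# Keep only those seen alive at age 60 or above inside the study window.
sampled <- persons[last_seen > first_seen, ]
sampled[, c("enter", "exit", "event")]
```

The fourth hypothetical person died in 1858, before the study window opened, and is therefore never sampled. The third was still alive at the end of 1879 and so leaves observation with event equal to FALSE.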

Observation of a person can end in two ways: by death, or by something else (out-migration or the end of the study period). In the first case we say that the event of interest has occurred; in the second case it has not.

After installing the eha package and starting an R session (see Appendix C), the data set becomes available when eha is loaded:

library(eha)

Loading required package: survival

Let us look at the first few lines of oldmort. It is conveniently done with the aid of the R function head:

head(oldmort, 3)

#          id  enter   exit event birthdate m.id f.id    sex       civ  ses.50
# 1 765000603 94.510 95.813  TRUE  1765.490   NA   NA female     widow unknown
# 2 765000669 94.266 95.756  TRUE  1765.734   NA   NA female unmarried unknown
# 3 768000648 91.093 91.947  TRUE  1768.907   NA   NA female     widow unknown
#   birthplace imr.birth   region
# 1     remote  22.20000    rural
# 2     parish  17.71845 industry
# 3     parish  12.70903    rural

The variables in oldmort have the following definitions and interpretations:

id A unique id number for each individual.
enter, exit The start age and stop age for this record (spell). For instance, in row No. 1, individual No. 765000603 enters observation at age 94.51 and exits at age 95.81. Age is calculated as the number of days elapsed since birth, divided by 365.25 to get age in years. The denominator is the average length of a year, taking into account that every fourth year is 366 days long. The first individual was born around July 1, 1765, and so was about 94.5 years of age when the study started. Had this woman died at age 94, she would not have been in our study at all. This property of our sampling procedure is a special case of a phenomenon called length-biased sampling. That is, of those born in the eighteenth century, only those who lived well beyond 60 are included. This bias must be compensated for in the analysis, which is accomplished by conditioning on the fact that these persons were alive on January 1, 1860. This technique is called left truncation.
event A logical variable (taking values TRUE or FALSE) indicating if the exit is a death (TRUE) or not (FALSE). For our first individual, the value is TRUE, indicating that she died at the age of 95.81 years.
birthdate The birth date expressed as the time (in years) elapsed since January 1, year 0 (which by the way does not exist). For instance, the (pseudo) date 1765.490 is really June 27, 1765. The fraction 0.490 is the fraction of the year 1765 that elapsed until the birth of individual No. 765000603.
m.id Mother’s id. It is unknown for all the individuals listed above, which is what the symbol NA (Not Available) indicates. The oldest people in the data set typically have no links to parents.
f.id Father’s id. See m.id.
sex A categorical variable with the levels female and male.
civ Civil status. A categorical variable with three levels: unmarried, married, and widow(er).
ses.50 Socio-economic status (SES) at age 50. Based on occupation information. There is a large proportion of NA (missing values) in this variable. This is quite natural, because this variable was of secondary interest to the record holder (the priest in the parish). The occupation is only noted in connection to a vital event in the family (such as a death, birth, marriage, or in- or out-migration). For those who were above 50 at the start of the period there is no information on SES at 50.
birthplace A categorical variable with two categories, parish and remote, representing born in parish and born outside parish, respectively.
imr.birth A rather specific variable. It measures the infant mortality rate in the birth parish at the time of birth (per cent).
region Present geographical area of residence. The parishes in the region are grouped into three regions, Sundsvall town, rural, and industry. The industry is the sawmill one, which grew rapidly in this area during the late part of the 19th century. The Sundsvall area was in fact one of the largest sawmill areas in Europe at this time.
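
The age and date conventions described for enter, exit, and birthdate can be tried out in base R. The dates below are hypothetical, and decimal_year is a helper written for this sketch, not a function from eha. It implements one possible convention for the fraction of a year; the exact convention used when oldmort was built is not stated here.

```r
# Age in years: days elapsed since birth divided by 365.25.
birth <- as.Date("1801-01-01")   # hypothetical birth date
seen  <- as.Date("1861-01-01")   # hypothetical date of observation
age <- as.numeric(seen - birth) / 365.25
age

# Decimal date: calendar year plus the fraction of that year elapsed
# (hypothetical helper; assumes the fraction is days elapsed / days in year).
decimal_year <- function(d) {
  year <- as.numeric(format(d, "%Y"))
  first_day <- as.Date(paste0(year, "-01-01"))
  next_first_day <- as.Date(paste0(year + 1, "-01-01"))
  year + as.numeric(d - first_day) / as.numeric(next_first_day - first_day)
}
decimal_year(as.Date("1765-07-02"))
```

Subtracting a decimal birth date from a decimal calendar date gives an age on approximately the same scale, which is why enter plus birthdate recovers a calendar date.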

Of special interest is the triple (enter, exit, event), because it represents the response variable, or what can be seen of it. More specifically, the sampling frame is all persons observed to be alive and above 60 years of age between 1 January 1860 and 31 December 1879. The start event for these individuals is their 60th birthday and the stop event is death. Clearly, many individuals in the data set did not die before 1 January 1880, so for them we do not know the full duration between the start and stop events; such individuals are said to be right censored (the exact meaning of which will be given soon). The third component of the survival object (enter, exit, event), namely event, is a logical variable taking the value TRUE if exit is the age at death (the interval ends with a death) and FALSE if the individual is still alive at the age “last seen”.
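
As a minimal sketch of how such a triple becomes a response in R, the function Surv from the survival package (which eha loads) bundles the three components into one object. The three records below are hypothetical, not rows of oldmort.

```r
library(survival)

# Three hypothetical records: ages at entry and exit, and a death indicator.
enter <- c(64.75, 60.00, 60.00)
exit  <- c(70.50, 64.50, 68.00)
event <- c(TRUE, FALSE, FALSE)

# Surv() with three arguments creates a response of type "counting":
# each record is an interval (enter, exit] plus a status flag.
y <- Surv(enter, exit, event)
y
attr(y, "type")
```

When printed, right-censored intervals are marked with a plus sign after the exit value, a reminder that the true age at death lies somewhere beyond it.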

Individuals aged 60 or above between 1 January 1860 and 31 December 1879 are included in the study. Those who were already above 60 at the start date are included only if they did not die between the age of 60 and their age on 1 January 1860. If this is not taken into account, the estimation of mortality will be biased. The proper way of dealing with this problem is to use left truncation, which is indicated by the variable enter. If we look at the first rows of oldmort we see that the values of enter are far above 60; each is the individual's age on 1 January 1860. You can add enter and birthdate for the first six individuals to see that:

oldmort$enter[1:6] + oldmort$birthdate[1:6]

## [1] 1860 1860 1860 1860 1860 1860

The statistical implication (description) of left truncation is that its presence forces the analysis to be conditional on survival up to the age enter.
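
The size of the bias can be illustrated with a small simulation. Everything in this sketch is an assumption made for illustration: lifetimes after 60 are drawn from an exponential distribution, half of the simulated persons are followed from age 60 and the rest are first seen after a random delay, and there is no censoring. It compares a Kaplan-Meier estimate that ignores late entry with one that conditions on survival up to enter.

```r
library(survival)
set.seed(1)

n <- 5000
life <- rexp(n, rate = 0.1)   # simulated years lived after age 60
# Half are followed from age 60 (delay 0); half are first seen later.
delay <- ifelse(runif(n) < 0.5, 0, runif(n, 0, 20))
seen <- life > delay          # only those still alive at first sight are sampled

enter <- delay[seen]
exit  <- life[seen]
event <- rep(TRUE, sum(seen)) # no censoring, to keep the sketch simple

true_median <- log(2) / 0.1   # median of the simulated lifetime distribution

naive    <- survfit(Surv(exit, event) ~ 1)          # ignores late entry
adjusted <- survfit(Surv(enter, exit, event) ~ 1)   # conditions on survival to enter

# Estimated probability of surviving past the true median; 0.5 is the target.
s_naive    <- summary(naive, times = true_median)$surv
s_adjusted <- summary(adjusted, times = true_median)$surv
c(naive = s_naive, adjusted = s_adjusted)
```

Under these assumptions the naive estimate should come out clearly above 0.5, because short-lived persons with a late first sighting never enter the sample, while the estimate that uses enter should lie close to 0.5.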

A final important note: In order to get the actual duration at exit, we must subtract 60 from the value of exit. When we actually perform a survival analysis in R, we should subtract 60 from both enter and exit before we begin. It is not absolutely necessary in the case of Cox regression, because of the flexibility of the baseline hazard in the model (it is in fact left unspecified!). However, for parametric models, it may be important in order to avoid dealing with truncated distributions.
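
The shift of the time origin amounts to one line of R. The data frame spells below is a hypothetical stand-in with the same three column names as oldmort; the same transformation would be applied to the real data.

```r
# Two hypothetical spells on the age scale (years since birth).
spells <- data.frame(
  enter = c(64.75, 60.00),
  exit  = c(70.50, 64.50),
  event = c(TRUE, FALSE)
)

# Move the time origin from birth to the 60th birthday.
spells60 <- transform(spells, enter = enter - 60, exit = exit - 60)
spells60
```

The length of each spell, exit minus enter, is unchanged by the shift; only the origin from which duration is measured moves.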

Now let us think of the research questions that could be answered by analyzing this data set. Since the data contain individual information on the length of life after 60, it is quite natural to study what determines a long life and what are the conditions that are negatively correlated with long life. Obvious questions are: (i) Do women live longer than men? (Yes), (ii) Is it advantageous for a long life to be married? (Yes), (iii) Does socio-economic status play any role for a long life? (Don’t know), and (iv) Does place of birth have any impact on a long life, and if so, is it different for women and men?\(\Box\)

The answers to these, and other, questions will be given later. The methods in later chapters of the book are all illustrated on a few core examples, each of which is introduced in this chapter.

The data set oldmort contains only two states, referred to as Alive and Dead, and one possible transition, from Alive to Dead; see Figure 1.1.

FIGURE 1.1: Survival data.

The ultimate study object in survival analysis is the time it takes from entering state Alive (e.g., becoming 60 years of age) until entering state Dead (e.g., death). This time interval is defined by the exact times of two events, which we may call birth and death, although in practice these two events may be almost any kind of events. Economists, for instance, are interested in the duration of out-of-work spells, where “birth” refers to the event of losing the job, and “death” refers to the event of getting a job. In a clinical trial regarding treatment of cancer, the starting event time may be the time of operation, and the final event time the time of relapse (if any).

References

Broström, G. 2017. eha: Event History Analysis.

R Core Team. 2017. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.