Chapter 1 Event History and Survival Data
The thing that characterizes event history and survival data the most is its dynamic nature. Individuals are followed over time, and during that course, the timings of events of interest are noted. Naturally, things may happen that makes it necessary to interrupt an individual follow-up, such as the individual suddenly disappearing for some reason. With classical statistical tools, such as linear regression, these observations are difficult or impossible to handle in the analysis. The methods discussed in this book aim among other things, at solving such problems.
In this introductory chapter, the special techniques that constitute survival and event history analysis are motivated. The concepts of right censoring and left truncation are defined and discussed. The data sets used throughout the book are also presented in this chapter.
The environment R is freely (under a GPL license) available for download from CRAN. There you will find precompiled versions for Linux, Mac OS, and Microsoft Windows, as well as the full source code, which is open. See Appendix C for more information.
1.1 Survival Data
Survival data (survival times) constitute the simplest form of event history data. A survival time is defined as the time it takes for an event to occur, measured from a well-defined start event. Thus, there are three basic elements which must be well defined: a time origin, a scale for measuring time, and an event. The response in a statistical analysis of such data is the exact time elapsed from the time origin to the time at which the event occurs. The challenge, which motivates special methods, is that in most applications, this duration is often not possible to observe exactly.
As an introduction to the research questions that are suitable for handling with event history and survival analysis, let us look at a data set found in the eha package (Broström 2021) in R (R Core Team 2021).
The data set oldmort in eha contains survival data from the parish Sundsvall in the mid-east of 19th century Sweden. The name oldmort is an acronym for old age mortality. The source is digitized information from historical parish registers, church books. More information about this can be found at the web page of the Centre for Demographic and Ageing Research at Umeå University (CEDAR).
The sampling was done as follows: Every person who was present and alive and 60 years of age or above anytime between January 1, 1860 and December 31, 1879 was followed from the entrance age (for most people that would be 60) until the age when last seen, determined by death, out-migration, or surviving until December 31, 1879. Those born during the 18th century would enter observation at an age above 60, given that they lived long enough, that is at least until January 1, 1860.
Two types of finishing the observation of a person are distinguished: Either it is by death or it is by something else, out-migration or end of study period. In the first case we say that the event of interest has occurred, in the second case not.
After installing the eha package and starting an R session (see Appendix C), the data set is loaded by loading eha as follows.
library(eha)
Let us look at the first few lines of oldmort (slightly manipulated). It is conveniently done with the aid of the R function head:
head(oldmort, 3)
id enter exit event birthdate m.id f.id sex civ ses.50
1 1 94.510 95.813 TRUE 1765.490 NA NA female widow unknown
2 2 94.266 95.756 TRUE 1765.734 NA NA female unmarried unknown
3 3 91.093 91.947 TRUE 1768.907 NA NA female widow unknown
birthplace imr.birth region
1 remote 22.20000 rural
2 parish 17.71845 industry
3 parish 12.70903 rural
Note that the first printed column, without a name, contains the row names. If no row names are explicitly given, they are set to the row numbers. If you don’t want row names printed, you have to use the more cumbersome
print(oldmort[1:3, ], row.names = FALSE)
id enter exit event birthdate m.id f.id sex civ ses.50
1 94.510 95.813 TRUE 1765.490 NA NA female widow unknown
2 94.266 95.756 TRUE 1765.734 NA NA female unmarried unknown
3 91.093 91.947 TRUE 1768.907 NA NA female widow unknown
birthplace imr.birth region
remote 22.20000 rural
parish 17.71845 industry
parish 12.70903 rural
There are several packages for nice printing of tables available on CRAN.
In this book, written
with the aid of the R packages bookdown
(Xie 2016, 2020),
knitr
(Xie 2015, 2021), and kableExtra
(Zhu 2021), outputs of tables
and regression results are first given “asis,” that is, as you will see them
on the computer screen, but later, depending on circumstances, nice printing
is preferred. Code like the following is often used for regression results:
<- function(fit, caption, label){
fit.out if (knitr::is_latex_output()){ # PDF
if (!missing(label)){
<- paste0("tab:", label)
label
}<- drop1(fit, test = "Chisq")
dr ltx(fit, dr = dr, caption = caption, label = label)
else{ # HTML
}<- regtable(summary(fit), digits = 4)
xx <- ncol(xx)
nn <- c("Max Log", "Likelihood", "",
rr round(fit$loglik[2], 1), rep("", nn - 4))
<- rbind(xx, rr)
xx kbl(xx, booktabs = TRUE, caption = caption) %>%
kable_styling(font_size = 12,
full_width = FALSE)
} }
Essentially, if output is a pdf document (the printed book), the function
ltx
(from the package eha) is used, and if output is in HTML format, the
functionality of the packages knitr and kableExtra is used.
For rectangular tables (for instance, data), the code is simpler, we can use kableExtra for both HTML and pdf output:
<- function(tt, caption = "", fs = 11){
tbl library(kableExtra)
kbl(tt, caption = caption, booktabs = TRUE, row.names = FALSE) %>%
kable_styling(full_width = FALSE, font_size = fs,
position = "center")
}
The result (with selected variables and a random sample of records) is shown in Table ??.
The variables in oldmort have the following definitions and interpretations:
- id A unique id number for each individual.
- enter, exit The start age and stop age for this record (spell). For instance, in row No. 1, individual No. 1 enters under observation at age 94.51 and exits at age 95.81. Age is calculated as the number of days elapsed since birth and this number is then divided by 365.25 to get age in years. The denominator is the average length of a year, taking into account that every fourth year is 366 days long. The first individual was born on June 27, 1765, and so almost 95 years of age when the study started. Suppose that this woman had died at age 94; then she had not been in our study at all. This property of our sampling procedure introduces a phenomenon known as length-biased sampling. That is, of those born in the 18th century, only those who live well beyond 60 will be included. This bias must be compensated for in the analysis, and it is accomplished by conditioning on the fact that these persons were alive at January 1, 1860. This technique is called left truncation.
- event A logical variable (taking values TRUE or FALSE) indicating if the exit is a death (TRUE) or not (FALSE). For our first individual, the value is TRUE, indicating that she died at the age of 95.81 years. Another possible coding is to use event = 1 for a death and event = 0 otherwise.
- birthdate The birth date expressed as the time (in years) elapsed since January 1, year 0 (which by the way does not exist). For instance, the (pseudo) date 1765.490 is really June 27, 1765. The fraction 0.490 is the fraction of the year 1765 that elapsed until the birth of individual No. 1. However, here it is printed as a real date.
- m.id Mother’s id. It is unknown for all the individuals listed above. That is the symbol NA, which stands for Not Available. The oldest people in the data set typically have no links to parents.
- f.id Father’s id. See m.id.
- sex A categorical variable with the levels female and male.
- civ Civil status. A categorical variable with three levels; unmarried, married, and widow(er).
- ses.50 Socio-economic status (SES) at age 50. Based on occupation information. There is a large proportion of NA (missing values) in this variable. This is quite natural, because this variable was of secondary interest to the record holder (the priest in the parish). The occupation is only noted in connection to a vital event in the family (such as a death, birth, marriage, or in- or out-migration). For those who were above 50 at the start of the period there is no information on SES at 50.
- birthplace A categorical variable with two categories, parish and remote, representing born in parish and born outside parish, respectively.
- imr.birth A rather specific variable. It measures the infant mortality rate in the birth parish at the time of birth (per cent).
- region Present geographical area of residence. The parishes in the region are grouped into three regions, Sundsvall town, rural, and industry. The industry is the sawmill one, which grew rapidly in this area during the late part of the 19th century. The Sundsvall area was in fact one of the largest sawmill areas in Europe at this time.
Of special interest is the triple (enter, exit, event)
, because it
represents the response variable, or what can be seen of it. More
specifically, the sampling frame is all persons observed to be alive and
above 60 years of age between January 1, 1860 and December 31, 1879. The
start event for these individuals is their 60th anniversary and the stop
event is death. Clearly, many individuals in the data set did not die
before January 1, 1880, so for them we do not know the full duration
between the start and stop events; such individuals are said to be
right censored (the exact meaning of which will be given soon). The
third component in the survival object (enter, exit, event)
,
event
, is a logical variable taking the value TRUE
if exit
is the true duration (the interval ends with a death) and
FALSE
if the individual is still alive at the duration “last seen.”
Individuals aged 60 or above between January 1, 1860 and December 31, 1879
are included in the
study. Those who are above 60 at this start date are included only if they
did not die between
the age of 60 and the age at January 1, 1860. If this is not taken into
account, a bias in the estimation of mortality will result. The proper way
of dealing with this problem is to use left truncation, which is
indicated by the variable enter
. If we look at the first rows of oldmort we
see that the enter
variable is very large; it is the
age for each individual at January 1, 1860. You can add enter
and birthdate
for the first six individuals to see that:
$enter[1:6] + oldmort$birthdate[1:6] oldmort
[1] 1860 1860 1860 1860 1860 1860
The statistical implication (description) of left truncation is that its
presence forces the analysis to be
conditional on survival up to the
age enter
.
A final important note: In order to get the actual duration at exit, we
must subtract 60 from the value of exit
. When we actually perform a
survival analysis in R, we should subtract 60 from both enter
and
exit
before we begin. It is not absolutely necessary in the case of Cox
regression, because of the flexibility of the baseline hazard in the model
(it is in fact left unspecified!). However, for parametric models, it may
be important in order to avoid dealing with truncated distributions.
Now let us think of the research questions that could be answered by analyzing this data set. Since the data contain individual information on the length of life after 60, it is quite natural to study what determines a long life and what are the conditions that are negatively correlated with long life. Obvious questions are: (i) Do women live longer than men? (Yes), (ii) Is it advantageous for a long life to be married? (Yes), (iii) Does socio-economic status play any role for a long life? (Don’t know), and (iv) Does place of birth have any impact on a long life, and if so, is it different for women and men? \(\ \Box\)
The answers to these, and other, questions will be given later. The methods in later chapters of the book are all illustrated on a few core examples. They are all presented a first time in this chapter.
The data set oldmort contained only two states, referred to as Alive and Dead, and one possible transition, from Alive to Dead, see Figure 1.1.
FIGURE 1.1: Survival data.
The ultimate study object in survival analysis is the time it takes from entering state Alive (e.g., becoming 60 years of age) until entering state dead (e.g., death). This time interval is defined by the exact time of two events, which we may call birth and death, although in practice these two events may be almost any kind of events. Economists, for instance, are interested in the duration of out-of-work spells, where “birth” refers to the event of losing the job, and “death” refers to the event of getting a job. In a clinical trial regarding treatment of cancer, the starting event time may be time of operation, and the final event time is time of relapse (if any).
1.2 Right Censoring
When an individual is lost to follow-up, we say that she is right censored, see Figure 1.2.
FIGURE 1.2: Right censoring.
As indicated in Figure 1.2, the true age at death is T, but due to right censoring, the only information available is that death age T is larger than C. The number C is the age at which this individual was last seen. In ordinary, classical regression analysis, such data are difficult, if not impossible, to handle. Discarding such information may introduce bias. The modern theory of survival analysis offers simple ways to deal with right censored data.
A natural question to ask is: If there is right censoring, there should be something called left censoring, and if so, what is it? The answer to that is that yes, left censoring refers to a situation where the only thing known about a death age is that it is less than a certain value \(C\). Note carefully that this is different from left truncation, see Section 1.3.
1.3 Left Truncation
The concept of left truncation, or delayed entry, is well illustrated by the data set oldmort that was discussed in detail in Example 1.1. Please note the difference compared to left censoring. Unfortunately, you may still see scientific articles where these two concepts are mixed up.
It is illustrative to think of the construction of the data set oldmort as a statistical follow-up study, starting on January 1, 1860. At that day, all persons present in the parish and 60 years of age or above, are included in the study. It is decided that the study will end at December 31, 1879, that is, the study period (follow-up time) is 20 years. The interesting event in this study is death. This means that the start event is the sixtieth anniversary of birth and the final event is death. Due to the calendar time constraints (and migration), all individuals will not be observed to die (especially those who live long), and moreover, some individuals will enter the study after the “starting” event, the sixtieth anniversary. A person who enter late, say he is 65 on January 1, 1860, had not been included had he died at age 63 (say). Therefore, in the analysis, we must condition on the fact that he was alive at 65. Another way of saying this is to say that this observation is left truncated at age 65.
People being too young at the start date will be included from the day they reach 60, if that happens before the closing date, December 31, 1879. They are not left truncated, but will have a higher and higher probability of being right censored, the later they enter the study.
1.4 Time Scales
In demographic applications age is often a natural time scale, that is, time is measured from birth. In the old age data just discussed, time was measured from age 60 instead. In a case like this, where there is a common ``late start age’’, it doesn’t matter much, but in other situations it does. Imagine for instance that interest lies in studying the time it takes for a woman to give birth to her first child after marriage. The natural way of measuring time is to start the clock at the day of marriage, but a possible (but not necessarily recommended!) alternative is to start the clock at some (small) common age of the women, for instance at birth. This would give left truncated (at marriage) observations, since women were sampled at marriage. There are two clocks ticking, and you have to make a choice.
Generally, it is important to realize that there often are alternatives, and that the result of an analysis may depend strongly on the choice made.
1.4.1 The Lexis diagram
Two time scales are nearly always present in demographic research: Age (or duration) and calendar time. For instance, an investigation of mortality may be limited in these two directions. In Figure this is illustrated for a study of old age mortality during the years 1829 and 1895. ``Old age mortality’’ is defined as mortality from age 50 and onward to age 100. The Lexis diagram is a way of showing the interplay between the two time scales and (human) life lines. Age moves vertically and calendar time horizontally, which will imply that individual lives will move diagonally, from birth to death, from south-west to north-east, in the Lexis diagram. In our example study, we are only interested in the part of the life lines that appear inside the small rectangle.
FIGURE 1.3: Lexis diagram: time period 1829–1894 and ages 50–100.
Assume that the data set at hand is saved in the text file ‘lex.dat.’ Note that this data set is not part of eha; it is only used here for the illustration of the Lexis diagram.
<- read.table("Data/lex.dat", header = TRUE)
lex lex
id enter exit event birthdate sex
1 1 0 98.314 1 1735.333 male
2 2 0 87.788 1 1750.033 male
3 3 0 71.233 0 1760.003 female
4 4 0 87.965 1 1799.492 male
5 5 0 82.338 1 1829.003 female
6 6 0 45.873 1 1815.329 female
7 7 0 74.112 1 1740.513 female
How do we restrict the data to fit into the rectangle given by the Lexis
diagram in Figure 1.3? With the two functions age.window
and cal.window in eha
, it is easy. The former fixes the ‘age cut’ while the
latter makes the ‘calendar time cut.’
The age cut:
<- age.window(lex, c(50, 100))
lex lex
id enter exit event birthdate sex
1 1 50 98.314 1 1735.333 male
2 2 50 87.788 1 1750.033 male
3 3 50 71.233 0 1760.003 female
4 4 50 87.965 1 1799.492 male
5 5 50 82.338 1 1829.003 female
7 7 50 74.112 1 1740.513 female
Note that individual No. 6 dropped out completely because she died too young. Then the calendar time cut:
<- cal.window(lex, c(1829, 1895))
lex lex
id enter exit event birthdate sex
1 1 93.667 98.314 1 1735.333 male
2 2 78.967 87.788 1 1750.033 male
3 3 68.997 71.233 0 1760.003 female
4 4 50.000 87.965 1 1799.492 male
5 5 50.000 65.997 0 1829.003 female
and here individual No. 7 disappeared because she died before January 1, 1829. Her death date is her birth date plus her age at death, \(1740.513 + 74.112 = 1814.625\), or August 17, 1814.
1.5 Event History Data
Event history data arise, as the name suggests, by following subjects over time and making notes about what happens and when. Usually the interest is concentrated to a few specific kinds of events. The main application in this book is demography and epidemiology, hence events of primary interest are births, deaths, marriages and migration.
As a rather complex example, let us look at marital fertility in 19th century Sweden, see Figure 1.4.
FIGURE 1.4: Marital fertility.
In a marital fertility study, women are typically followed over time from the time of their marriage until the time the marriage is dissolved or her fertility period is over, say at age 50, whichever comes first. The marriage dissolution may be due to the death of the woman or of her husband, or it may be due to a divorce. If the study is limited to a given geographical area, women may get lost to follow-up due to out-migration. This event gives rise to a right-censored observation.
During the follow-up, the exact timings of child births are recorded. Interest in the analysis may lie in investigating which factors, if any, that affect the length of birth intervals. A data set may look like this:
id parity age year next.ivl event prev.ivl ses
1 0 24 1825 0.411 1 NA farmer
1 1 25 1826 22.348 0 0.411 farmer
2 0 18 1821 0.304 1 NA unknown
2 1 19 1821 1.837 1 0.304 unknown
2 2 21 1823 2.546 1 1.837 unknown
2 3 23 1826 2.541 1 2.546 unknown
2 4 26 1828 2.431 1 2.541 unknown
2 5 28 1831 2.472 1 2.431 unknown
2 6 31 1833 3.173 0 2.472 unknown
This is the first 9 rows, corresponding to the first two mothers in the data file. The variable id is mother’s id, a label that uniquely identifies each individual.
A birth interval has a start point (in time) and an end point. These points are the time points of births, except for the first interval, where the start point is time of marriage, and the last interval, which is open to the right. However, the last interval is stopped at the time of marriage dissolution or when the mother becomes 50, whatever comes first. The variable parity is zero for the first interval, between date of marriage and date of first birth, one for the next interval, and so forth. The last (highest) number is thus equal to the total number of births for a woman during her first marriage (disregarding twin births, etc.).
Here is a description variable by variable of the data set.
- id The mother’s unique id.
- parity Order of previous birth, see above for details. Starts at zero.
- age Mother’s age at the event defining the start of the interval.
- year Calendar year for the birth defining the start of the interval.
- next.ivl measures the time in years from the birth at parity to the birth at parity + 1, or, for the woman’s last interval, to the age of right censoring.
- event is an indicator for the interval ending with a birth. It is always equal to 1, except for the last interval, which always has event equal to zero.
- prev.ivl is the length of the interval preceding this one. For the first interval of a woman, it is always NA (Not Available).
- ses Socio-economic status (based on occupation data).
Just to make it clear: The first woman has id 1. She is represented by two records, meaning that she gave birth to one child. She waited 0.411 years from marriage to the first birth, and 22.348 years from the first birth to the second, which never happened. The second woman (2) is represented by seven records, implying that she gave birth to six children. And so on.
Of course, in an analysis of birth intervals we are interested in causal effects; why are some intervals short while others are long? The dependence of the history can be modeled by lengths of previous intervals (for the same mother), parity, survival of earlier births, and so on. Note that all relevant covariate information must refer to the past. More about that later.
The first interval of a woman is different from the others, since it starts with marriage. It therefore makes sense to analyze these intervals separately. The last interval of a woman is also special; it always ends with a right censoring, at the latest when the woman is 50 years of age. You should think of data for a woman generated sequentially in time, starting at the day of her marriage. Follow-up is made to the next birth, as long as she is alive, the marriage is still alive, and she is younger than 50 years of age. If there is no next birth, i.e., she reaches 50, or the marriage is dissolved (most often by death of one of the spouses), the interval is censored at the duration when she still was under observation. Censoring can also occur by emigration, and reaching the end of follow-up, in this case November 5, 1901.\(\ \Box\)
Another useful setup is the illness-death model, see Figure 1.5.
FIGURE 1.5: The illness-death model.
Individuals may move back and forth between the states Healthy and Diseased, and from each of these two states there is a pathway to Dead, which is an absorbing state, meaning that once in that state, you never leave it.\(\ \Box\)
1.6 More Data Sets
A few examples and data sets will be used repeatedly throughout the book, and we give a brief description of them here. They are all available in the R package eha, which is loaded into a running R session by the call
library(eha)
This loads the eha package. In the examples to follow, we assume that this is already done. The main data source is the CEDAR, Umeå University, Sweden. However, one data set is taken from the home page of Statistics Sweden.
This data set is included in the R (R Core Team 2021) package eha
(Broström 2021).
It contains information about 1023 males, age twenty between January 1,
1800 and December 31, 1819, and living in Skellefteå, a parish in the
north-east of Sweden. The total number of records in the data frame
is 1211, that
is, some individuals are represented by more than one record in the
data file. The reason for that is that the socio-economic status
(ses
) is
one of the covariates in the file, and it changes over time. Each time a
change is recorded, a new record is created for that individual, with the
new value of SES. For instance, the third and fourth rows in the data frame are
id enter exit event birthdate ses
3 0.000 13.463 0 1800.031 upper
3 13.463 20.000 0 1800.031 lower
Note that the variable id
is the same (3) for the two records, meaning
that both records are information about individual No. 3. The variable
enter
is age (in years) that has elapsed since the 20th
birth day anniversary, and exit
likewise. The information about him is that he
was born on 1800.031, or January 12, 1800, and he is followed from his
21th birth date, or from January 12, 1820. He is in an upper
socio-economic status until he is 20 + 13.463 = 33.463 years of age,
when he unfortunately is degraded to a lower ses
. He is then
followed until 20 years have elapsed, or until his fortieth birthday. The
variable event
tells us that he is alive we stop observing him. The
value zero indicates that the follow-up ends with right censoring.
In an analysis of male mortality with this data set we could ask whether there is a socio-economic difference in mortality, and also if it changes over time. That would typically be done by Cox regression or by a parametric proportional hazards model. More about that follows in later chapters.
The data set child
(in eha) contains follow-up data of children born in
Skellefteå, northern Sweden, between January 1, 1850 and December 31, 1884. Each
child is followed up to a maximum of fifteen years, and the follow-up is interrupted by death,
loss of follow-up (usually out-migration), or the reach of age fifteen.
These data constitute a cohort and can be analyzed as such. It may be represented in a Lexis diagram as in Figure 1.6.
FIGURE 1.6: Birth cohort 1850–1884, Skellefteå. A child mortality study.
The first few lines are
id m.id sex socBranch birthdate exit event illeg m.age
9 246606 male farming 1853-05-23 15.000 0 no 35.009
150 377744 male farming 1853-07-19 15.000 0 no 30.609
158 118277 male worker 1861-11-17 15.000 0 no 29.320
178 715337 male farming 1872-11-16 15.000 0 no 41.183
263 978617 female worker 1855-07-19 0.559 1 no 42.138
Not that the variable enter
really is redundant here: It is always equal to zero
since all children are followed from birth in Skellefteå.
This data set will be discussed in more detail in a later chapter. It is part of a study of child mortality \(\ \Box\)
This data set is taken from and concerns the interplay between infant and maternal mortality in 19th century Sweden (source: CEDAR, Umeå University, Sweden). More specifically, we are interested in estimating the effect of mother’s death on the infant’s survival chances. Because maternal mortality was rare (around one per 200 births), matching is used. This is performed as follows: for each child experiencing the death of its mother (before age one), two matched controls were selected. The criteria were: same age as the case at the event, same sex, birth year, parish, socio-economic status, marital status of mother. The triplets so created were followed until age one, and eventual deaths of the infants were recorded. The data collected in this way is part of the eha package under the name infants, and the first rows of the data frame are shown here:
id enter exit event mother age sex parish civst ses year
1 55 365 0 dead 26 boy Nedertornea married farmer 1877
1 55 365 0 alive 26 boy Nedertornea married farmer 1870
1 55 365 0 alive 26 boy Nedertornea married farmer 1882
A short description of the variables follows.
- id denotes the id of the mother, 35 in all.
- enter is the age in days of the case, when its mother died.
- exit is the age in days when follow-up ends. It takes the value 365 (one year) for those who survived their first anniversary.
- event indicates whether a death (1) or a survival (0) was observed.
- mother has value dead for all cases and the value alive for the controls.
- age Age of mother at infant’s birth.
- sex Sex of the infant.
- parish Birth parish.
- civst Civil status of mother, married or unmarried.
- ses Socio-economic status, often the father’s, based on registrations of occupation.
- year Calendar year of the birth.
This data set is discussed and analyzed in Chapter 8. \(\ \Box\)
These data sets (swepop
and swedeaths
) are taken from Statistics
Sweden, freely available on
the web site . The aggregated data sets contain yearly
information about population size and number of deaths by sex and age for the
years 1969–2020.
The first and last rows of swepop
for the year 2020 are
age sex year pop
0 women 2020 55505.5
0 men 2020 58980.5
1 women 2020 57158.0
age sex year pop
99 men 2020 367.0
100 women 2020 1933.5
100 men 2020 394.5
and the first and last rows of swedeaths
are
age sex year deaths
0 women 2020 112
0 men 2020 156
1 women 2020 8
age sex year deaths
99 men 2020 178
100 women 2020 1004
100 men 2020 243
The variables have the following meanings.
- pop Average population size 2007 in the age and for the sex given on the same row. The average is based on the population at the beginning and end of the year 2007.
- deaths The observed number of deaths in the age and for the sex given on the same row.
- sex Female or male.
- age Age in completed years.
See Chapter 7 for how to analyze this data set. \(\ \Box\)