1.6 More data sets

A few examples and data sets will be used repeatedly throughout the book, and we give a brief description of them here. They are all available in the R package eha, which is loaded into a running R session by the call

library(eha)

In the examples that follow, we assume that this has already been done. The main data source is CEDAR, Umeå University, Sweden; one data set, however, is taken from the home page of Statistics Sweden.

Example 1.4 (Survival of males aged 20)

This data set is included in the R (R Core Team 2017) package eha (Broström 2017). It contains information about 1023 males, born between January 1, 1800 and December 31, 1819, who were living in Skellefteå, a parish in the north-east of Sweden, at age twenty. The data frame has 1211 records in total, so some individuals are represented by more than one record. The reason is that socio-economic status (ses) is one of the covariates, and it can change over time: each time a change is recorded, a new record is created for that individual with the new value of ses. For instance, the third and fourth rows in the data frame are

options(digits = 7)
mort[3:4, ]
##   id  enter   exit event birthdate   ses
## 3  3  0.000 13.463     0  1800.031 upper
## 4  3 13.463 20.000     0  1800.031 lower

Note that the variable id is the same (3) for the two records, meaning that both records contain information about individual No. 3. The variable enter is the time (in years) elapsed since the twentieth birthday at the start of the record, and exit is the corresponding time at its end. He was born on 1800.031, that is, January 12, 1800, and he is followed from his twentieth birthday, January 12, 1820. He belongs to the upper socio-economic class until he is 20 + 13.463 = 33.463 years of age, when he moves to the lower class. He is then followed until 20 years have elapsed, that is, until his fortieth birthday. The variable event tells us that he is alive when we stop observing him: the value zero indicates that the follow-up ends with right censoring.
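The arithmetic in this description can be checked with a small sketch in base R. The two records are typed in by hand from the printout above, so the eha package is not needed; the names rec, age_in, age_out, total_followup, and birth are made up for this illustration, and the conversion from a decimal year to a calendar date assumes a 365-day year, so the date is approximate.

```r
# The two records for individual No. 3, copied from the printout above
rec <- data.frame(
  id = c(3, 3),
  enter = c(0.000, 13.463),
  exit = c(13.463, 20.000),
  event = c(0, 0),
  birthdate = c(1800.031, 1800.031),
  ses = c("upper", "lower")
)

# Attained age at the start and at the end of each record
rec$age_in <- 20 + rec$enter
rec$age_out <- 20 + rec$exit

# Total follow-up time (in years) over both records
total_followup <- sum(rec$exit - rec$enter)

# Approximate calendar date from the decimal year (assumes 365 days per year)
year <- floor(rec$birthdate[1])
frac <- rec$birthdate[1] - year
birth <- as.Date(paste0(year, "-01-01")) + floor(frac * 365)

print(rec)
print(total_followup)
print(birth)
```

The second record starts exactly where the first one ends, so the two records together describe one uninterrupted follow-up from age 20 to age 40, split only because ses changes.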

In an analysis of male mortality with this data set we could ask whether there is a socio-economic difference in mortality, and also if it changes over time. That would typically be done by Cox regression or by a parametric proportional hazards model. More about that follows in later chapters.

Example 1.5 (Infant mortality)

This data set concerns the interplay between infant and maternal mortality in 19th century Sweden (source: CEDAR, Umeå University, Sweden). More specifically, we are interested in estimating the effect of the mother’s death on the infant’s chances of survival. Because maternal mortality was rare (around one per 200 births), matching is used. This is performed as follows: for each child whose mother died before the child reached age one, two matched controls were selected. The matching criteria were: same age as the case at the time of the mother’s death, same sex, birth year, parish, socio-economic status, and marital status of the mother. The triplets so created were followed until age one, and any deaths of the infants were recorded. The data collected in this way are part of the eha package under the name infants, and the first rows of the data frame are shown here:

head(infants)
##   stratum enter exit event mother age  sex      parish   civst    ses year
## 1       1    55  365     0   dead  26  boy Nedertornea married farmer 1877
## 2       1    55  365     0  alive  26  boy Nedertornea married farmer 1870
## 3       1    55  365     0  alive  26  boy Nedertornea married farmer 1882
## 4       2    13   76     1   dead  23 girl Nedertornea married  other 1847
## 5       2    13  365     0  alive  23 girl Nedertornea married  other 1847
## 6       2    13  365     0  alive  23 girl Nedertornea married  other 1848

A short description of the variables follows.

  • stratum denotes the id of the triplets, 35 in all.
  • enter is the age in days of the case, when its mother died.
  • exit is the age in days when follow-up ends. It takes the value 365 (one year) for those who survived until their first birthday.
  • event indicates whether a death (1) or a survival (0) was observed.
  • mother has value dead for all cases and the value alive for the controls.
  • age: age of the mother at the infant’s birth.
  • sex: sex of the infant.
  • parish: birth parish.
  • civst: civil status of the mother, married or unmarried.
  • ses: socio-economic status, often the father’s, based on registrations of occupation.
  • year: calendar year of the birth.
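As a minimal sketch of how the matched structure can be read off the data, the six rows printed above are typed in by hand below (only the columns needed here), so the code runs without the eha package. The names infants6, days, and expo are made up for this illustration.

```r
# First six rows of the printout above, restricted to the columns used here
infants6 <- data.frame(
  stratum = c(1, 1, 1, 2, 2, 2),
  enter = c(55, 55, 55, 13, 13, 13),
  exit = c(365, 365, 365, 76, 365, 365),
  event = c(0, 0, 0, 1, 0, 0),
  mother = c("dead", "alive", "alive", "dead", "alive", "alive")
)

# Each stratum should hold one case (mother dead) and two controls (mother alive)
print(table(infants6$stratum, infants6$mother))

# Days under observation: from the age at the mother's death to the end of follow-up
infants6$days <- infants6$exit - infants6$enter

# Number of infant deaths and total days at risk, by mother's status
expo <- aggregate(cbind(event, days) ~ mother, data = infants6, FUN = sum)
print(expo)
```

Within a stratum, all three infants enter observation at the same age, namely the age of the case when its mother died, which is what makes the comparison inside a triplet meaningful.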

This data set is discussed and analyzed in Chapter 8. \(\Box\)

Example 1.6 (Old age mortality, tabular data)

This data set is taken from Statistics Sweden and is freely available on its web site. The aggregated data set contains information about population size and number of deaths by sex and age for the ages 61 to 80 in the year 2007.

head(swe07)
##     pop deaths    sex age  log.pop
## 1 63483    286 female  61 11.05853
## 2 63770    309 female  62 11.06304
## 3 64182    317 female  63 11.06948
## 4 63097    366 female  64 11.05243
## 5 61671    387 female  65 11.02957
## 6 57793    419 female  66 10.96462
tail(swe07)
##      pop deaths  sex age  log.pop
## 35 31074    884 male  75 10.34413
## 36 29718    904 male  76 10.29951
## 37 29722   1062 male  77 10.29964
## 38 28296   1112 male  78 10.25048
## 39 27550   1219 male  79 10.22376
## 40 25448   1365 male  80 10.14439

The variables have the following meanings.

  • pop: average population size in 2007 for the age and sex given on the same row, computed from the population at the beginning and at the end of the year.
  • deaths: the observed number of deaths for the age and sex given on the same row.
  • sex: female or male.
  • age: age in completed years.
  • log.pop: the natural logarithm of pop. This variable is used as an offset in a Poisson regression.
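To illustrate the role of log.pop, here is a minimal sketch using only the six rows for females printed by head(swe07) above, typed in by hand so that the eha package is not needed. The names swe6, rate, and fit are made up for this illustration, and the simple log-linear model in age is an assumption chosen for brevity, not necessarily the model used later in the book.

```r
# First six rows of swe07 (females aged 61-66), copied from the printout above
swe6 <- data.frame(
  pop = c(63483, 63770, 64182, 63097, 61671, 57793),
  deaths = c(286, 309, 317, 366, 387, 419),
  age = 61:66,
  log.pop = c(11.05853, 11.06304, 11.06948, 11.05243, 11.02957, 10.96462)
)

# log.pop is the natural logarithm of pop, up to rounding in the printout
print(max(abs(log(swe6$pop) - swe6$log.pop)))

# Crude death rates: deaths per person-year of exposure
swe6$rate <- swe6$deaths / swe6$pop
print(round(swe6$rate, 5))

# Poisson regression of deaths on age with log(pop) as offset, so that
# the coefficients describe the death rate rather than the death count
fit <- glm(deaths ~ age + offset(log.pop), family = poisson, data = swe6)
print(coef(fit))
```

With the offset in place, the exponentiated age coefficient can be read as the multiplicative change in the death rate per year of age in this small extract.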

See Chapter 4 for how to analyze this data set. \(\Box\)

References

Broström, G. 2017. Eha: Event History Analysis.

R Core Team. 2017. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.