1.6 More data sets
A few examples and data sets will be used repeatedly throughout the book, and we give a brief description of them here. They are all available in the R package eha, which is loaded into a running R session by the call library(eha).
In the examples to follow, we assume that this is already done. The main data source is CEDAR, Umeå University, Sweden. However, one data set is taken from the home page of Statistics Sweden.
The first data set concerns male mortality and is included in the R (R Core Team 2017) package eha
(Broström 2017).
It contains information about 1023 males, born between January 1,
1800 and December 31, 1819, observed from age twenty, and living in Skellefteå, a parish in the
north-east of Sweden. The total number of records in the data frame
is 1211, that
is, some individuals are represented by more than one record in the
data file. The reason is that socio-economic status
(ses) is
one of the covariates in the file, and it may change over time. Each time a
change is recorded, a new record is created for that individual, with the
new value of ses. For instance, the third and fourth rows in the data frame are
##   id  enter   exit event birthdate   ses
## 3  3  0.000 13.463     0  1800.031 upper
## 4  3 13.463 20.000     0  1800.031 lower
Note that the variable id is the same (3) for the two records, meaning
that both records carry information about individual No. 3. The variable
enter is the time (in years) that has elapsed since the 20th
birthday when the record starts, and exit
likewise when it ends. The information about him is that he
was born on 1800.031, or January 12, 1800, and he is followed from his
20th birthday, or from January 12, 1820. He is in an upper
socio-economic status until he is 20 + 13.463 = 33.463 years of age,
when he unfortunately is degraded to a lower ses. He is then
followed until 20 years have elapsed, or until his fortieth birthday. The
variable event tells us that he is alive when we stop observing him. The
value zero indicates that the follow-up ends with right censoring.
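The record structure described above can be checked with a few lines of base R. The sketch below re-creates only the two printed records by hand (it does not load the eha data), and the object names rec, age_start, age_end, contiguous, total_time, and birth_day are chosen here for illustration. The conversion of the decimal birth date assumes a 365-day year, which holds for 1800.

```r
# The two records for individual No. 3, typed in from the listing above.
rec <- data.frame(
  id        = c(3, 3),
  enter     = c(0.000, 13.463),
  exit      = c(13.463, 20.000),
  event     = c(0, 0),
  birthdate = c(1800.031, 1800.031),
  ses       = c("upper", "lower")
)

# enter and exit count years since the 20th birthday; add 20 to get age.
age_start <- 20 + rec$enter
age_end   <- 20 + rec$exit

# The records are consecutive: each one starts where the previous one ends.
contiguous <- all(rec$enter[-1] == rec$exit[-nrow(rec)])

# Total follow-up time is the sum of the record lengths.
total_time <- sum(rec$exit - rec$enter)

# Decimal year to calendar date (assumption: 365 days in the year 1800).
yr <- floor(rec$birthdate[1])
birth_day <- as.Date(paste0(yr, "-01-01")) +
  floor((rec$birthdate[1] - yr) * 365)

print(data.frame(ses = rec$ses, age_start, age_end))
cat("contiguous:", contiguous, " total follow-up (years):", total_time, "\n")
print(birth_day)
```

Splitting follow-up at the time of a covariate change keeps the total observation time unchanged; only the value of ses attached to each piece differs.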
In an analysis of male mortality with this data set we could ask whether there is a socio-economic difference in mortality, and also if it changes over time. That would typically be done by Cox regression or by a parametric proportional hazards model. More about that follows in later chapters.
This data set concerns the interplay between infant and maternal mortality in 19th century Sweden (source: CEDAR, Umeå University, Sweden). More specifically, we are interested in estimating the effect of a mother’s death on her infant’s survival chances. Because maternal mortality was rare (around one per 200 births), matching is used. This is performed as follows: for each child experiencing the death of its mother (before age one), two matched controls were selected. The criteria were: same age as the case at the event, same sex, birth year, parish, socio-economic status, and marital status of mother. The triplets so created were followed until age one, and any deaths of the infants were recorded. The data collected in this way are part of the eha package under the name infants, and the first rows of the data frame are shown here:
##   stratum enter exit event mother age  sex      parish   civst    ses year
## 1       1    55  365     0   dead  26  boy Nedertornea married farmer 1877
## 2       1    55  365     0  alive  26  boy Nedertornea married farmer 1870
## 3       1    55  365     0  alive  26  boy Nedertornea married farmer 1882
## 4       2    13   76     1   dead  23 girl Nedertornea married  other 1847
## 5       2    13  365     0  alive  23 girl Nedertornea married  other 1847
## 6       2    13  365     0  alive  23 girl Nedertornea married  other 1848
A short description of the variables follows.
- stratum denotes the id of the triplets, 35 in all.
- enter is the age in days of the case, when its mother died.
- exit is the age in days when follow-up ends. It takes the value 365 (one year) for those who survived until their first birthday.
- event indicates whether a death (1) or a survival (0) was observed.
- mother has value dead for all cases and the value alive for the controls.
- age is the age of the mother at the infant’s birth.
- sex is the sex of the infant.
- parish is the birth parish.
- civst is the civil status of the mother, married or unmarried.
- ses is the socio-economic status, often the father’s, based on registrations of occupation.
- year is the calendar year of the birth.
This data set is discussed and analyzed in Chapter 8. \(\Box\)
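As a quick check of the matched design, the six rows printed above can be typed into a data frame and summarised per triplet. This is only a sketch with base R on those six rows, not on the full infants data, and it keeps just the columns needed for the check. The object names first_rows, per_stratum, and same_entry are illustrative choices.

```r
# First six rows of the listing above (two triplets), selected columns only.
first_rows <- data.frame(
  stratum = c(1, 1, 1, 2, 2, 2),
  enter   = c(55, 55, 55, 13, 13, 13),
  exit    = c(365, 365, 365, 76, 365, 365),
  event   = c(0, 0, 0, 1, 0, 0),
  mother  = c("dead", "alive", "alive", "dead", "alive", "alive")
)

# Follow-up time in days, from entry until death or age one.
first_rows$followup <- first_rows$exit - first_rows$enter

# Cases (mother dead) and controls (mother alive) per triplet.
per_stratum <- table(first_rows$stratum, first_rows$mother)
print(per_stratum)

# Within a triplet, the case and its controls enter at the same age.
same_entry <- tapply(first_rows$enter, first_rows$stratum,
                     function(x) length(unique(x)) == 1)
print(same_entry)

# Number of infant deaths by mother's status in these six rows.
print(xtabs(event ~ mother, data = first_rows))
```

Each triplet should show one case and two controls with a common entry age, which is what the matching on age at the event implies.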
This data set is taken from Statistics Sweden. It is freely available on the Statistics Sweden web site. The aggregated data set contains information about population size and number of deaths by sex and age for the ages 61 to 80 for the year 2007.
##     pop deaths    sex age  log.pop
## 1 63483    286 female  61 11.05853
## 2 63770    309 female  62 11.06304
## 3 64182    317 female  63 11.06948
## 4 63097    366 female  64 11.05243
## 5 61671    387 female  65 11.02957
## 6 57793    419 female  66 10.96462
##      pop deaths  sex age  log.pop
## 35 31074    884 male  75 10.34413
## 36 29718    904 male  76 10.29951
## 37 29722   1062 male  77 10.29964
## 38 28296   1112 male  78 10.25048
## 39 27550   1219 male  79 10.22376
## 40 25448   1365 male  80 10.14439
The variables have the following meanings.
- pop Average population size 2007 in the age and for the sex given on the same row. The average is based on the population at the beginning and end of the year 2007.
- deaths The observed number of deaths in the age and for the sex given on the same row.
- sex Female or male.
- age Age in completed years.
- log.pop The natural logarithm of pop. This variable is used as offset in a Poisson regression.
See Chapter 4 for how to analyze this data set. \(\Box\)
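To make the role of log.pop concrete, the sketch below types in the twelve rows printed above, checks that log.pop agrees with the natural logarithm of pop, computes crude death rates, and fits a Poisson regression with log.pop as offset using glm from base R. Fitting to these twelve rows only shows the mechanics: with females aged 61 to 66 and males aged 75 to 80, the estimates are not meaningful summaries of the full data set. The object names rows12, max_diff, and fit are illustrative.

```r
# The twelve rows shown in the listings above, typed in by hand.
rows12 <- data.frame(
  pop     = c(63483, 63770, 64182, 63097, 61671, 57793,
              31074, 29718, 29722, 28296, 27550, 25448),
  deaths  = c(286, 309, 317, 366, 387, 419,
              884, 904, 1062, 1112, 1219, 1365),
  sex     = factor(rep(c("female", "male"), each = 6)),
  age     = c(61:66, 75:80),
  log.pop = c(11.05853, 11.06304, 11.06948, 11.05243, 11.02957, 10.96462,
              10.34413, 10.29951, 10.29964, 10.25048, 10.22376, 10.14439)
)

# log.pop should equal log(pop) up to the printed precision.
max_diff <- max(abs(log(rows12$pop) - rows12$log.pop))

# Crude death rates: deaths divided by average population size.
rows12$rate <- rows12$deaths / rows12$pop

# Poisson regression for the death counts; the offset log(pop) turns
# the model for counts into a model for rates.
fit <- glm(deaths ~ sex + age + offset(log.pop),
           family = poisson, data = rows12)

print(max_diff)
print(round(rows12$rate, 4))
print(coef(fit))
```

Using the offset rather than a free coefficient for log.pop fixes its coefficient at one, so the remaining coefficients describe log death rates.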
References
Broström, G. 2017. Eha: Event History Analysis.
R Core Team. 2017. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.