Chapter 7 Register-Based Survival Data Models
During the last couple of decades, huge databases with population data have appeared around the world. They are great sources of information for demographic and statistical research, but their size creates challenges for the statistical tools we traditionally use in demographic analyses.
We distinguish between data sources with individual content and sources with tabular data. In both cases, however, we will end up analyzing tables, so we start by describing methods for analyzing tables.
Finally, in this chapter, we show how to combine the technique of communal covariates with tabulation. The problem we solve is that the creation of an individual-based data set with communal covariates usually leads to uncomfortably large data sets; tabulation is often a good solution to this problem.
7.1 Tabular Data
National statistical agencies produce statistics describing the population year by year, in terms of population size by age and sex, number of births by mother's age and sex of the newborn, number of deaths by age and sex, and so on. In the package eha, there are two data sets taken from Statistics Sweden, swepop and swedeaths. The content of swepop is shown in Table ?? (first five rows out of 10504 rows in total).
It contains the average Swedish population size by year, age, and sex, where age is grouped into one-year classes, but the highest age class, labelled 100, is in fact "ages 100 and above." The original table gives the population sizes at the end of each year, but here we have calculated an average size over the year in question by taking the mean of the actual value and the corresponding value from the previous year. The population size can in this way be used as a proxy for the "total time at risk," or exposure.
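As a small illustration of that averaging, a sketch with made-up numbers, where pop_dec31 is a hypothetical vector of year-end counts:

pop_dec31 <- c("1968" = 52000, "1969" = 53346, "1970" = 54211) # made up
## Mean of each year-end value and the previous one:
pop_mean <- (head(pop_dec31, -1) + tail(pop_dec31, -1)) / 2
names(pop_mean) <- names(pop_dec31)[-1] # label by the later year
pop_mean # average population sizes for 1969 and 1970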
The content of swedeaths (first five rows) is shown in Table ??. The (age, sex, year) columns in the two data frames are identical, so we can create a suitable data frame for analysis simply by putting swedeaths$deaths into the data frame swepop and renaming the result sw. The first rows of sw are shown in Table 7.1.
age | sex | year | pop | deaths |
---|---|---|---|---|
0 | women | 1969 | 52673.0 | 491 |
0 | men | 1969 | 55728.0 | 773 |
1 | women | 1969 | 56831.0 | 33 |
1 | men | 1969 | 59924.0 | 45 |
2 | women | 1969 | 58994.5 | 22 |
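For completeness, a sketch of the merge just described (the two data frames are row-aligned, as noted above):

library(eha)
sw <- swepop                   # columns age, sex, year, pop
sw$deaths <- swedeaths$deaths  # row order matches on (age, sex, year)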
The response variable in our analysis is two-dimensional: (deaths, pop), which is on an "occurrence/exposure" form, suitable for analysis with the function tpchreg. The result is a proportional hazards analysis under the assumption of a piecewise constant baseline hazard function:
sw$year.1995 <- sw$year - 1995
system.time(fit <- tpchreg(oe(deaths, pop) ~ sex + year.1995,
                           time = age, last = 101, data = sw))

   user  system elapsed
  0.686   0.000   0.686
Note several things here:
- The fitting procedure is wrapped in a call to the function system.time. This produces the report headed "user system elapsed," giving the time in seconds for the whole operation ("elapsed"), split up by "user" (the time spent on our call) and "system" (time spent by the operating system, for instance reading, writing, and organizing the user operation). The interesting value for us is that the total elapsed time was well under a second. Quite OK for a proportional hazards regression analysis involving 4745063 events.
- The response is oe(deaths, pop), where the function name oe is short for "occurrence/exposure."
- All covariates have a reference value. For factors with the default coding, the reference category is given, but for continuous covariates we need to specify it if we are not satisfied with the default value zero. In the case of the covariate year, we are certainly not satisfied with zero as a reference point, so we choose a suitable value in the range of the observed values. This is done in practice by subtracting the reference value from the corresponding variable, forming year - 1995 in our case. In the output below, this only affects the value of the restricted mean survival, which is calculated for a "reference individual," in this case a woman living her life during the year 1995. There are not many such women, of course; what we are doing here is comparing periods, not cohorts.
- Note the argument last = 101. Since we are fitting a survival distribution that is unbounded in time, we should specify the range of our data. The lower limit is given by min(age) (the argument time), but there is no natural upper limit. The choice of the value for last does not affect parameter estimates and general conclusions; it only (slightly) affects plots in the extreme right tail.
Covariate | level | W_mean | Coef | HR | SE | LR_p |
---|---|---|---|---|---|---|
sex | | | | | | 0.0000 |
 | women | 0.503 | 0 | 1 | | |
 | men | 0.497 | 0.4400 | 1.5527 | 0.0009 | |
year.1995 | | 0.520 | -0.0150 | 0.9851 | 0.0000 | 0.0000 |
Max Log Likelihood | | | | | | -19283066.9 |
Restricted mean survival | | | | | | 80.855 |
Note the “Restr. mean survival”: It gives the expected remaining life at birth for a woman in the year 1995 (the reference values).
The graph of the baseline hazard function (“age-specific mortality”) is shown in Figure 7.1.
FIGURE 7.1: Baseline hazard function, women 1995. Model based.
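Figure 7.1 can be produced along the following lines; this is a sketch, and the plot arguments for tpchreg fits shown here are our assumptions:

plot(fit, fn = "haz", main = "Baseline hazard, women 1995")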
Apart from the first year of life, mortality seems to be practically zero during the first forty years of life, and then to increase exponentially with age.
This is simple enough, but some questions remain unanswered. First, the model assumes proportional hazards, implying for instance that male mortality is around 50 per cent higher than female mortality at all ages. It is fairly simple to organize a formal test of the proportionality assumption, but it is close to meaningless because of the huge amount of data (almost five million deaths; "all" null hypotheses will be rejected). A better alternative is to check the assumption graphically by stratifying on sex.
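A sketch of the stratified fit behind Figures 7.2 and 7.3 (the log-scale plot argument is an assumption):

fit2 <- tpchreg(oe(deaths, pop) ~ strata(sex) + year.1995,
                time = age, last = 101, data = sw)
plot(fit2, fn = "haz")             # Figure 7.2
plot(fit2, fn = "haz", log = "y")  # Figure 7.3, log scale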
FIGURE 7.2: Baseline hazard function, women and men 1995.
Due to the extremely low mortality in early ages, Figure 7.2 is quite non-informative. We try the same plot on a log scale, see Figure 7.3.
FIGURE 7.3: Baseline hazard functions, women and men 1995. Log scale.
The female advantage is largest in early adulthood, after which it slowly decreases. An even clearer picture is shown in Figure 7.4, where the ratio of male to female mortality is plotted.
FIGURE 7.4: Hazard ratio by age, men vs. women 1995.
This graph gives the clearest picture of the relative differences in mortality over the life span. We see, somewhat unexpectedly, that the drop in the ratio is interrupted around age 50 and resumes after age 70. This is not clearly visible in the earlier figures. A possible explanation is that the age interval 20–50 approximately coincides with women's fertile period; that is, during this period the female mortality advantage decreases more than age alone would explain.
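The ratio in Figure 7.4 can also be computed directly from the raw occurrence/exposure rates, without any model; a minimal sketch for the year 1995:

sw95 <- sw[sw$year == 1995, ]
rate <- with(sw95, tapply(deaths, list(age, sex), sum) /
                 tapply(pop, list(age, sex), sum))
plot(0:100, rate[, "men"] / rate[, "women"], type = "l",
     xlab = "Age", ylab = "Hazard ratio, men vs. women")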
This pattern may have changed over the time period 1969–2020. To check this, we split the data by calendar year into four periods. The rise after age 50 seems to vanish with time, according to Figure 7.5.
FIGURE 7.5: Baseline hazard ratios, men vs. women by time period.
Finally, we investigate the development of expected life (\(e_0\)) over time, see Figure 7.6.
FIGURE 7.6: Expected life at birth by sex and year (period data).
The difference in longevity between women and men is slowly decreasing over the time period. Also note the values in the last two years: mortality in Sweden was unexpectedly low in 2019, resulting in high values of expected life, but in 2020 the covid-19 pandemic struck, and expected life dropped by around five months for women and almost nine months for men, compared to 2019.
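Figure 7.6 can be produced by fitting one stratified model per year and extracting the restricted means via summary()$rmean (used later in this chapter); a sketch:

years <- sort(unique(sw$year))
e0 <- matrix(0, length(years), 2,
             dimnames = list(years, c("women", "men"))) # order assumed
for (i in seq_along(years)) {
    f <- tpchreg(oe(deaths, pop) ~ strata(sex), time = age,
                 last = 101, data = sw[sw$year == years[i], ])
    e0[i, ] <- summary(f)$rmean # one value per stratum
}
matplot(years, e0, type = "l", lty = 1:2, col = 1,
        xlab = "Year", ylab = "Expected life at birth")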
7.2 Individual Data
The analysis of individual survival data is what the rest of this book is all about, so what is new here? The answer lies in the size of the data sets, that is, the number of individuals is huge.
As an example, we have mortality data covering all Swedish citizens aged 30 and above for the years 2001–2013. The data may be described as yearly censuses, where all Swedish persons alive on December 31 of a given year are registered together with vital information. The starting point is register data of the form shown in Table ??.
The variables are
- id A unique id number, which allows us to link information from different sources (and years) to one individual.
- year The present year, 2001, \(\ldots\), 2013.
- sex Female or male.
- birthdate A constructed variable. For reasons of privacy, birth time information is only given as year and quarter of the year. The birthdate is therefore created from the year, with the first quarter represented by the decimal part ".125" (roughly February 15), the second by ".375," the third by ".625," and the fourth by ".875" (see the sketch after this list). The birthdate will be off by at most around one and a half months, which is deemed precise enough for our analysis of adult mortality.
- deathdate Given by day in the sources. The missing value symbol (NA) is given to those who are not registered as dead before January 1, 2014. Each date is converted to decimal form.
- civst, ses, region Information about civil status, socioeconomic status, and geographic area, respectively.
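A small sketch of the birthdate construction just described (birthyear and quarter are hypothetical input variables):

birthyear <- 1945; quarter <- 3 # hypothetical: born third quarter of 1945
quarter_mid <- c(0.125, 0.375, 0.625, 0.875)
birthdate <- birthyear + quarter_mid[quarter] # 1945.625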
From this yearly information, survival data are created by following each individual one year ahead, until the next "census." Age at start, enter, is calculated by subtracting birthdate from the date of presence, year + 1. For an individual with a deathdate less than or equal to year + 2 (the date of the next census), the variable exit is calculated as deathdate - birthdate and the variable event is set equal to one; otherwise exit = enter + 1 and event = 0, a censored observation.
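A minimal sketch of this construction for one census year y; the function name and the selection of columns are ours, and cens is assumed to hold the register variables described above:

make_spells <- function(cens, y) {
    enter <- (y + 1) - cens$birthdate # age at the census date
    dead <- !is.na(cens$deathdate) & cens$deathdate <= y + 2
    exit <- ifelse(dead, cens$deathdate - cens$birthdate, enter + 1)
    data.frame(id = cens$id, enter = enter, exit = exit,
               event = as.integer(dead))
}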
The result of this is shown for the same individuals as above in Table ??, with three covariates omitted in order to save space.
This file from the year 2001 (covering the year 2002!) contains 5.8 million individuals and 92 thousand deaths. Performing a Cox regression on such massive data sets (and there are thirteen of them) is time-, software-, and hardware-consuming, but there is a simple remedy: make tables and fit piecewise constant proportional hazards (pch) models.
This is done with the aid of the eha function toTpch. We choose a pch model with hazard constant over the five-year intervals 30–35, 35–40, \(\ldots\), 100–105, replace birthdate by birth year (more suitable as a covariate), and call the new variable cohort. This is done with the aid of the R function floor:
lisa2001$cohort <- floor(lisa2001$birthdate)
The function floor simply removes the decimals from a positive number. Then we apply the function toTpch:
listab2001 <- toTpch(Surv(enter, exit, event) ~ sex + cohort + year +
civst + region + ses,
data = lisa2001,
cuts = c(seq(30, 100, by = 5), 105))
The result is shown in Table ?? (first four rows); it is a table with 35998 rows, one row for each unique combination of the six covariates and each age interval, empty combinations excluded. Note the three created variables: event, the total number of events for each combination of covariates; exposure, the total time at risk for the corresponding combination; and age, the age interval with constant hazard.
In the survival analysis, the pair (event, exposure) is the response, and behind the scenes, Poisson regression is performed with event as response and log(exposure) as offset.
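To make the equivalence concrete, here is a hedged sketch of the explicit Poisson GLM; tpchreg handles this internally and more efficiently, so the glm call is for illustration only:

pfit <- glm(event ~ age + sex + cohort + year + civst + region + ses +
                offset(log(exposure)),
            family = poisson, data = listab2001)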
Finally, thirteen similar tables are created (one for each year 2001–2013) and merged, using the R command rbind.
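A sketch of the merge, assuming the thirteen yearly tables have been collected in a list tabs (a hypothetical name):

listab <- do.call(rbind, tabs)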
A summary of the result gives
- Number of records: 489783.
- Number of deaths: 1.16 million.
- Total time at risk: 76.75 million years.
Now this data set is conveniently analyzed with the aid of the function tpchreg in eha. An example:
listab$year <- listab$year - 2000 # Center!
list2 <- aggregate(cbind(event, exposure) ~ sex + civst + ses +
                   year + age, FUN = sum, data = listab)
fit <- tpchreg(oe(event, exposure) ~ strata(sex) + civst + ses +
               year, data = list2, time = age, last = 105)
xx <- summary(fit)
emen <- round(xx$rmean[1], 2)
ewomen <- round(xx$rmean[2], 2)
Covariate | level | W_mean | Coef | HR | SE | LR_p |
---|---|---|---|---|---|---|
civst | | | | | | 0.0000 |
 | married | 0.512 | 0 | 1 | | |
 | unmarried | 0.264 | 0.4816 | 1.6187 | 0.0032 | |
 | prev.married | 0.224 | 0.3265 | 1.3861 | 0.0023 | |
ses | | | | | | 0.0000 |
 | worker | 0.340 | 0 | 1 | | |
 | elite | 0.115 | -0.3272 | 0.7209 | 0.0056 | |
 | middle | 0.318 | -0.1682 | 0.8452 | 0.0035 | |
 | lower | 0.228 | 0.3629 | 1.4374 | 0.0030 | |
year | | 7.088 | -0.0023 | 0.9977 | 0.0003 | 0.0000 |
Max Log Likelihood | | | | | | -4538559.5 |
Restricted mean survival | | | | | | 52.568 (men); 57.023 (women) |
The result is shown in Table 7.3.
We note a couple of things here:
- The data frame listab contains 489783 records, and a Poisson regression on it might take a long time. Since we did not use the covariates region (a factor with 25 levels) and cohort (92 levels), we could aggregate over these variables, which resulted in the data frame list2 with only 4493 records. This cuts computing time by approximately 99 per cent, or to 0.17 seconds from 17 seconds, without affecting the results. Also note the similarity between the formula arguments of the two functions aggregate and tpchreg: the differences are that in aggregate, the function cbind is used instead of oe, and strata() is left out.
- All p-values are effectively zero, a consequence of the huge amount of data. It is also reasonable to claim that this constitutes a total investigation, and therefore p-values are meaningless.
The restricted mean survival at age 30 for a married worker in the year 2002 is 52.57 years for men and 57.02 years for women; that is, the expected age at death is 82.57 years for men and 87.02 years for women alive at 30.
A graph of the conditional survival functions is shown in Figure 7.7.
FIGURE 7.7: Survival above 30, males and females, married worker, Sweden 2002.
7.3 Communal Covariates and Tabulation
The application of the function make.communal (see the previous chapter) will often result in impractically large data sets, and it is recommended to combine it with the tabulation technique suggested in this chapter. We illustrate it with an investigation of the influence of weather conditions on mortality in two time periods, 1901–1950 and 1990–2014. In the first case, focus is on the effect of low temperature on mortality; the second case has the opposite focus: is an extremely hot summer bad for survival? The geographical area under study is the Swedish town Umeå, situated at the Gulf of Bothnia at latitude 64 degrees north.
7.3.1 Temperature and mortality, Umeå 1901–1950
Daily temperature data from SMHI for central Umeå, starting at November 1, 1858 and ending at September 30, 1979 (this particular weather station was replaced at this last date) are used together with population data from Umeå. Since we attempt to match these data to population data for the time period January 1, 1901–December 31, 1950, we cut the temperature data to the same time period. See Table ??.
library(eha)
temp <- read.table("~/Forskning/Data/ume_temp.csv",
                   header = TRUE,
                   skip = 10, sep = ",")[, 1:3]
## We need only the first three columns
names(temp) <- c("date", "time", "temp")
temp$date <- as.Date(temp$date)
temp$year <- as.numeric(format(temp$date, "%Y"))
temp$month <- as.numeric(format(temp$date, "%m"))
temp <- temp[order(temp$date, temp$time), ]
temp$quarter <- quarters(temp$date)
temp <- temp[temp$year %in% 1901:1950, ]
##
temp$time <- as.factor(temp$time)
temp$quarter <- as.factor(temp$quarter)
Apparently, there are three measurements each day, and the temperature range is (−34.8, 30) degrees Celsius. The variables year, month, and quarter were extracted from the date.
We extract population data for Umeå, 1 Jan 1901 to 31 Dec 1950, and include all ages between 15 and 100, see Table ?? for the first few rows.
There are 186033 records describing 73780 individuals in this selection.
In order to simplify the presentation, only two covariates, sex and urban, are included.
The first question is what to do with the temperature measurements. They span 50 years, with mostly three measurements per day. In this first analysis we focus on the effect of low temperatures on mortality, so we start by taking the maximum of the daily temperatures and then the minimum over each month. We thus end up with one measurement per month and year, in total 600 measurements.
We start by finding the maximum value for each day; the result is shown in Table ??.
## um() and tbl() below are helper functions used for table formatting
## in this book.
idx <- with(temp, tapply(date, date))
maxtemp <- with(temp, tapply(temp, date, max))
temp$maxtemp <- maxtemp[idx]
## Reduce to one measurement per day:
mtemp <- temp[!duplicated(temp$date),
              c("date", "maxtemp", "month", "quarter", "year")]
rownames(mtemp) <- 1:NROW(mtemp)
lmtemp <- head(mtemp, 3)
lmtemp$maxtemp <- um(lmtemp$maxtemp)
tbl(lmtemp, fs = 11,
    caption = "Maximum daily temperature, Umeå 1901--1950.")
The next step is to take minimum over year and month, see Table ??.
atmp <- aggregate(list(temp = mtemp$maxtemp),
                  by = mtemp[, c("month", "year")],
                  FUN = min)
hatmp <- head(atmp, 3)
hatmp$temp <- um(hatmp$temp)
tbl(hatmp, fs = 11,
    caption = "Monthly minimum of daily maximum temperature, Umeå 1901--1950.")
Now we can apply make.communal
to our data and split spells by year and month.
The result is shown in Table ??.
atmp$yearmonth <- atmp$year * 100 + atmp$month
## A trick to get format 190101
comtime <- system.time(
    nt <- make.communal(ume, atmp["yearmonth"],
                        start = 1901, period = 1/12))
nt$enter <- round(nt$enter, 3)
nt$exit <- round(nt$exit, 3)
oj <- nt$enter >= nt$exit - 0.0001 # Too short interval!
nt$enter[oj] <- nt$exit[oj] - 0.001 # Break accidental ties
nt$year <- nt$yearmonth %/% 100 # Restore 'year'.
nt$month <- nt$yearmonth %% 100 # Restore 'month'.
saveRDS(nt, file = "mytables/nt.rds")
nt <- nt[nt$month %in% c(12, 1, 2), ]
tbl(head(nt[, -8], 3), fs = 10,
    caption = "Population data by year and month, Umeå 1901--1950.")
Total time for making communal: 25 seconds.
The next step is to read the temperature from atmp into nt, see Table ?? (selected variables).
indx <- match(nt$yearmonth, atmp$yearmonth)
nt$temp <- atmp$temp[indx]
This data frame contains 2708779 records, maybe too much for a comfortable Cox regression. Let’s see what happens (Table 7.4):
ptm <- proc.time()
fit <- coxreg(Surv(enter, exit, event) ~
                  sex + socStatus + temp, data = nt)
ctime <- proc.time() - ptm
fit.out(fit, label = "coxph7",
        caption = "Mortality and temperature, Umeå 1901--1950.")
Covariate | level | W_mean | Coef | HR | SE | LR_p |
---|---|---|---|---|---|---|
sex | | | | | | 0.3010 |
 | male | 0.490 | 0 | 1 | | |
 | female | 0.510 | -0.0380 | 0.9627 | 0.0367 | |
socStatus | | | | | | 0.3311 |
 | low | 0.501 | 0 | 1 | | |
 | high | 0.499 | 0.0360 | 1.0367 | 0.0371 | |
temp | | -15.430 | -0.0061 | 0.9939 | 0.0035 | 0.0805 |
Max Log Likelihood | | | | | | -21892.7 |
This is quite bad (the computing time, almost two minutes), but we already have a substantial result: survival chances increase with temperature. But remember that Umeå is a northern town; there is not much warm weather here. Furthermore, there are no signs of interactions with sex or social status (not shown, but see the sketch below).
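The interaction check can be organized as a likelihood ratio test; a sketch, where the interaction formula is our assumption:

fit0 <- coxreg(Surv(enter, exit, event) ~ sex + socStatus + temp,
               data = nt)
fit1 <- coxreg(Surv(enter, exit, event) ~ (sex + socStatus) * temp,
               data = nt)
lr <- 2 * (max(fit1$loglik) - max(fit0$loglik))
pchisq(lr, df = 2, lower.tail = FALSE) # two extra parameters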
The analyses so far are very rudimentary, treating all ages as one group, and so on. Here we show how to extract small tables from the full data set under some simple and reasonable assumptions. The approach builds on categorical covariates and an assumption of piecewise constant hazards. We use the function toTpch in the eha package, see Table ??.
told <- toTpch(Surv(enter, exit, event) ~ sex + socStatus +
                   yearmonth, data = nt, cuts = seq(15, 100, by = 5))
indx <- match(told$yearmonth, atmp$yearmonth)
told$temp <- atmp$temp[indx]
saveRDS(told, file = "mytables/told73.rds")
##told <- readRDS("mytables/told73.rds")
htold <- head(told, 6)
htold$temp <- um(htold$temp)
htold$age <- en(htold$age)
tbl(htold, fs = 11,
    caption = "Tabular Umeå 1901--1950 data.")
The tabulation reduces the data set from 2.5 million records to 19 thousand, and the computing time accordingly (to a quarter of a second!), without jeopardizing the results, see Table 7.5.
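A sketch of the pch fit behind Table 7.5, reusing the tabulated data told:

fit.t <- tpchreg(oe(event, exposure) ~ sex + socStatus + temp,
                 time = age, last = 100, data = told)
summary(fit.t)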
Covariate | level | W_mean | Coef | HR | SE | LR_p |
---|---|---|---|---|---|---|
sex | | | | | | 0.3176 |
 | male | 0.490 | 0 | 1 | | |
 | female | 0.510 | -0.0367 | 0.9640 | 0.0367 | |
socStatus | | | | | | 0.3473 |
 | low | 0.501 | 0 | 1 | | |
 | high | 0.499 | 0.0348 | 1.0354 | 0.0370 | |
temp | | -15.430 | -0.0060 | 0.9940 | 0.0035 | 0.0857 |
Max Log Likelihood | | | | | | -13590.6 |
Restricted mean survival | | | | | | 51.922 |
Essentially the same conclusion as in the earlier attempts: Mortality decreases with increasing temperature.
7.3.2 Hot summers in Umeå, 1990–2014
As the next example of combining the use of the function eha::make.communal and tabulation, the effect of hot summers on mortality is examined. The data at hand relate to Umeå during the time period 1990–2014.
Temperature data for different places in Sweden can be downloaded from SMHI (the Swedish Meteorological and Hydrological Institute); the data for Umeå are read and presented in Table ??.
There are 24 measurements taken every day, with few exceptions, in total 184398 values.
Population data for Umeå are extracted from the Longitudinal integrated database for health insurance and labour market studies (LISA). LISA contains information about all Swedes aged 16 and above on December 31 each year, starting in 1990. See Table ??.
Then perform the aggregation, see Table ??.
The next step is to aggregate the population data, but we must first split it by month and year, using the function make.communal. We want to keep track of year and month, so the simplest way is to use the variable yearmonth in aggrtemp (the aggregated temperature data) as our communal covariate, and then read the temperature from aggrtemp:
compop <- eha::make.communal(pop, aggrtemp["yearmonth"],
                             start = 1990, period = 1 / 12)
indx <- match(compop$yearmonth, aggrtemp$yearmonth)
compop$temp <- aggrtemp$temp[indx]
compop$year <- compop$yearmonth %/% 100
compop$month <- compop$yearmonth %% 100
out <- toTpch(Surv(enter, exit, event) ~ yearmonth + sex,
              data = compop,
              cuts = c(seq(15, 95, by = 5), 110))
indx <- match(out$yearmonth, aggrtemp$yearmonth)
out$temp <- aggrtemp$temp[indx]
Then we analyze female mortality at ages 80 and above and its dependence on high temperature (above 29 degrees Celsius). The July maximum temperatures by year are shown in Figure 7.8.
FIGURE 7.8: Maximum temperatures in July by year 1990–2014.
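A sketch of the fit behind Table 7.6; the exact subsetting (women, July, ages 80 and above) is our assumption, as are the object names and the sex level label:

oldw <- compop[compop$sex == "female" & compop$month == 7 &
                   compop$enter >= 80, ] # assumed subset
tab80 <- toTpch(Surv(enter, exit, event) ~ yearmonth, data = oldw,
                cuts = c(seq(80, 95, by = 5), 110))
tab80$temp <- aggrtemp$temp[match(tab80$yearmonth, aggrtemp$yearmonth)]
fit6 <- tpchreg(oe(event, exposure) ~ I(temp > 29),
                time = age, last = 110, data = tab80)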
Covariate | level | W_mean | Coef | HR | SE | LR_p |
---|---|---|---|---|---|---|
I(temp > 29) | | | | | | 0.0141 |
 | FALSE | 0.876 | 0 | 1 | | |
 | TRUE | 0.124 | 0.3264 | 1.3859 | 0.1282 | |
Max Log Likelihood | | | | | | -1413.3 |
Restricted mean survival | | | | | | 9.472 |
The same analysis with individual-based data and Cox regression is shown in Table 7.7.
Covariate | level | W_mean | Coef | HR | SE | LR_p |
---|---|---|---|---|---|---|
I(temp > 29) | | | | | | 0.0165 |
 | FALSE | 0.876 | 0 | 1 | | |
 | TRUE | 0.124 | 0.3190 | 1.3757 | 0.1284 | |
Max Log Likelihood | | | | | | -2862.9 |
The results in the two tables are almost identical: women above 80 years of age are vulnerable to extreme temperatures. The risk increases by around 30 per cent (95 per cent confidence interval: 7 to 77 per cent; quite wide, but with a p-value of about 1 per cent).