Chapter 7 Register-Based Survival Data Models

During the last couple of decades, huge databases with population data have emerged around the world. They are great sources of information in demographic and statistical research, but their sizes create challenges for the statistical tools traditionally used in demographic analyses.

We distinguish between data sources with individual content and sources with tabular data. In both cases, however, we will end up analyzing tables, so we start by describing methods for analyzing tables.

Finally, in this chapter, we show how to combine the technique of communal covariates with tabulation. The problem we solve is that creating an individual-based data set with communal covariates usually leads to uncomfortably large data sets. Tabulation is often a good solution to this problem.

7.1 Tabular Data

National statistical agencies produce statistics describing the population year by year, in terms of population size by age and sex, number of births by mother’s age and sex of the newborn, number of deaths by age and sex, and so on. In the package eha, there are two data sets taken from Statistics Sweden, swepop and swedeaths.

The content of swepop is shown in Table ?? (first five rows out of 10504 rows in total).

It contains the average Swedish population size by year, age, and sex, where age is grouped into one-year classes, but the highest age class, labelled 100, is in fact “ages 100 and above.” The original table gives the population sizes at the end of each year, but here we have calculated an average size over the year in question by taking the mean of the actual value and the corresponding value from the previous year. The population size can in this way be used as a proxy for, or prediction of, “total time at risk” or exposure.

The content of swedeaths (first five rows) is shown in Table ??.

The (age, sex, year) columns in the two data frames are identical, so we can create a suitable data frame for analysis simply by putting swedeaths$deaths into the data frame swepop and renaming the result sw. The first rows of sw are shown in Table 7.1.
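
A minimal sketch of that construction, assuming (as stated above) that the rows of the two data frames line up:

sw <- swepop
sw$deaths <- swedeaths$deaths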

TABLE 7.1: First rows of combined data frame.
age   sex     year   pop       deaths
0     women   1969   52673.0   491
0     men     1969   55728.0   773
1     women   1969   56831.0   33
1     men     1969   59924.0   45
2     women   1969   58994.5   22

The response variable in our analysis is two-dimensional: (deaths, pop), which is in “occurrence/exposure” form, suitable for analysis with the function tpchreg. The result is a proportional hazards analysis under the assumption of a piecewise constant baseline hazard function:

sw$year.1995 <- sw$year - 1995
system.time(fit <- tpchreg(oe(deaths, pop) ~ sex + year.1995, 
               time = age, last = 101, data = sw))
   user  system elapsed 
  0.686   0.000   0.686 

Note several things here:

  1. The fitting procedure is wrapped in a call to the function system.time. This produces the report headed “user system elapsed,” giving the time in seconds for the whole operation (“elapsed”), split into “user” (time spent on our call) and “system” (time spent by the operating system, for instance reading, writing, and organizing the user operation). The interesting value for us is the total elapsed time, less than a second. Quite OK for a proportional hazards regression analysis involving 4745063 events.

  2. The response is oe(deaths, pop) where the function name oe is short for “occurrence/exposure.”

  3. All covariates have a reference value. For factors with the default coding, the reference category is given, but for continuous covariates we need to specify it ourselves if we are not satisfied with the default value zero. In the case of the covariate year, we are certainly not satisfied with zero as the reference point, so we choose a suitable value in the range of the observed values. In practice this is done by subtracting the reference value from the corresponding variable, forming year - 1995 in our case. In the output below, this only affects the value of the restricted mean survival, which is calculated for a “reference individual,” in this case a woman living her life during the year 1995. There are not many such women, of course; what we are doing here is comparing periods, not cohorts.

  4. Note the argument last = 101. Since we are fitting a survival distribution that is unbounded in time, we should specify the range of our data. The lower limit is given by min(age) (the argument time), but there is no natural upper limit. The choice of the value for last does not affect parameter estimates or general conclusions; it only (slightly) affects plots in the extreme right tail.

TABLE 7.2: Swedish mortality, 1968–2020.
Covariate    level    W_mean    Coef      HR       SE       LR_p
sex                                                         0.0000
             women     0.503    0         1
             men       0.497    0.4400    1.5527   0.0009
year.1995              0.520   -0.0150    0.9851   0.0000   0.0000
Max Log Likelihood   -19283066.9
Restricted mean       80.855

Note the “Restricted mean” row in Table 7.2: it gives the expected remaining life at birth for a woman in the year 1995 (the reference values).

The graph of the baseline hazard function (“age-specific mortality”) is shown in Figure 7.1.

FIGURE 7.1: Baseline hazard function, women 1995. Model based.

Apart from the first year of life, mortality seems to be practically zero during the first forty years of life, after which it increases exponentially with age.

This is simple enough, but some questions remain unanswered. First, the model assumes proportional hazards, implying for instance that male mortality is around 50 per cent higher than female mortality at all ages. It is fairly simple to organize a formal test of the proportionality assumption, but it is close to meaningless because of the huge amount of data: with almost five million deaths, “all” null hypotheses will be rejected. A better alternative is to check it graphically by stratifying on sex.
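
A sketch of the stratified fit behind Figures 7.2 and 7.3; that the plot method accepts fn = "haz" and passes log = "y" on to the underlying plot call are assumptions:

fit2 <- tpchreg(oe(deaths, pop) ~ strata(sex) + year.1995,
                time = age, last = 101, data = sw)
plot(fit2, fn = "haz")             # Figure 7.2
plot(fit2, fn = "haz", log = "y")  # Figure 7.3, log scale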

FIGURE 7.2: Baseline hazard function, women and men 1995.

Due to the extremely low mortality at early ages, Figure 7.2 is quite uninformative. We try the same plot on a log scale, see Figure 7.3.

FIGURE 7.3: Baseline hazard functions, women and men 1995. Log scale.

The female advantage is largest in early adulthood, after which it slowly decreases. An even clearer picture is given by Figure 7.4, where the ratio of male to female mortality is plotted.

FIGURE 7.4: Hazard ratio by age, men vs. women 1995.

This graph gives the clearest picture of the relative differences in mortality over the life span. We see, somewhat unexpectedly, that the drop in the ratio is interrupted around age 50 and resumes after age 70. This is not clearly visible in the earlier figures. A possible explanation is that the age interval 20–50 approximately coincides with women’s fertility period; that is, during this period the female mortality advantage decreases more than age alone would motivate.
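
A hypothetical sketch of how Figure 7.4 might be produced; that the fitted object stores the piecewise hazards as a matrix fit2$hazards, with one row per stratum (women first) and one column per one-year age class, is an assumption:

haz <- fit2$hazards  # assumed: 2 x 101 matrix, women in row 1
plot(0:100, haz[2, ] / haz[1, ], type = "s",
     xlab = "Age", ylab = "Hazard ratio, men vs. women")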

This pattern may change over the time period 1969–2020. To check this, we split the data by calendar year into four equal periods (“quarters”); a sketch follows below. The rise after age 50 seems to vanish with time, according to Figure 7.5.
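
A sketch of the period split; the cut points and the variable name period are ours:

sw$period <- cut(sw$year, breaks = c(1968, 1981, 1994, 2007, 2020),
                 labels = c("1969-1981", "1982-1994",
                            "1995-2007", "2008-2020"))
## The ratio plot of Figure 7.4 can then be repeated within each period.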

FIGURE 7.5: Baseline hazard ratios, men vs. women by time period.

Finally, we investigate the development of expected life (\(e_0\)) over time, see Figure 7.6.
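
A sketch of the calculations behind Figure 7.6, fitting one stratified model per year and reusing the rmean component of summary(), which appears in the code later in this chapter:

e0 <- sapply(1969:2020, function(y) {
    fy <- tpchreg(oe(deaths, pop) ~ strata(sex),
                  time = age, last = 101, data = sw[sw$year == y, ])
    summary(fy)$rmean  # one value per stratum
})
matplot(1969:2020, t(e0), type = "l", lty = 1,
        xlab = "Year", ylab = "Expected life at birth")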

FIGURE 7.6: Expected life at birth by sex and year (period data).

The difference in longevity between women and men is slowly decreasing over the time period. Also note the values for the last two years: mortality in Sweden was unexpectedly low in 2019, resulting in high values of expected life, but in 2020 the covid-19 pandemic struck, and expected life dropped by around five months for women and almost nine months for men compared to 2019.

7.2 Individual Data

The analysis of individual survival data is what the rest of this book is all about, so what is new here? The answer lies in the size of the data sets: the number of individuals is huge.

As an example, we have mortality data covering all Swedish citizens aged 30 and above for the years 2001–2013. The data may be described as yearly censuses, where all persons alive in Sweden on December 31 of the given year are registered together with vital information. We start with register data of the form shown in Table ??.

The variables are

  • id A unique id number, which allows us to link information from different sources (and years) to one individual.
  • year The present year, 2001, \(\ldots\), 2013.
  • sex Female or male.
  • birthdate A constructed variable. For privacy reasons, birth time information is only given as year and quarter of the year, so the birthdate is created from the year, with the first quarter represented by the decimal part “.125” (roughly equal to February 15), the second by “.375,” the third by “.625,” and the fourth by “.875.” The birthdate will be off by at most about one and a half months, which is deemed precise enough for our analysis of adult mortality.
  • deathdate Given by day in the sources and converted to decimal form. The missing value symbol (NA) is given to those who are not registered as dead before January 1, 2014.
  • civst, ses, region Information about civil status, socioeconomic status, and geographic area, respectively.

From this yearly information, survival data are created by following each individual one year ahead, until the next “census.” Age at start, enter, is calculated by subtracting birthdate from the date of presence, year + 1. For an individual whose deathdate falls before the next census date year + 2 (that is, whose age at death, deathdate - birthdate, is at most enter + 1), the variable exit is set to deathdate - birthdate and event to one; otherwise exit = enter + 1 and event = 0, a censored observation.
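
A hypothetical sketch of this construction, using the variable names above and the 2001 file as an example:

lisa2001$enter <- with(lisa2001, year + 1 - birthdate)
died <- with(lisa2001, !is.na(deathdate) & deathdate <= year + 2)
lisa2001$exit <- with(lisa2001,
                      ifelse(died, deathdate - birthdate, enter + 1))
lisa2001$event <- as.integer(died)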

The result of this is shown for the same individuals as above in Table ??, with three covariates omitted in order to save space.

This file from the year 2001 (covering the year 2002!) contains 5.8 million individuals and 92 thousand deaths. Performing a Cox regression on such massive data sets (and there are thirteen of them) is demanding in terms of time, software, and hardware, but there is a simple remedy: make tables and fit piecewise constant proportional hazards (pch) models.

This is done with the aid of the eha function toTpch. We choose a pch model with hazard constant over the five-year intervals 30–35, 35–40, \(\ldots\), 100–105, and we replace birthdate by birth year (more suitable as a covariate) and call it cohort. This is done with the aid of the R function floor:

lisa2001$cohort <- floor(lisa2001$birthdate)

The function floor simply removes the decimals from a positive number. Then we apply the function toTpch:

listab2001 <- toTpch(Surv(enter, exit, event) ~ sex + cohort + year + 
                                                civst + region + ses,
                     data = lisa2001, 
                     cuts = c(seq(30, 100, by = 5), 105))

The result is shown in Table ?? (first four rows); the full table has 35998 rows, one for each non-empty combination of the six covariates and each age interval. Note the three created variables: event, the total number of events for each combination of covariates; exposure, the total time at risk for the corresponding combination; and age, the age interval with constant hazard.

In the survival analysis, the pair (event, exposure) is the response, and behind the scenes, Poisson regression is performed with event as response and log(exposure) as offset.
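
In principle, the same estimates could be obtained directly with glm; a sketch with a selection of the covariates (the pch baseline is represented by the factor age):

fit.glm <- glm(event ~ offset(log(exposure)) + age + sex + civst + ses,
               family = poisson, data = listab2001)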

Finally, thirteen similar tables are created (one for each year 2001–2013) and merged using the R function rbind.
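
A sketch of the merging step, assuming each yearly table was saved to a file (the file names are hypothetical):

tabs <- lapply(2001:2013, function(y)
    readRDS(paste0("mytables/listab", y, ".rds")))
listab <- do.call(rbind, tabs)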

A summary of the result gives

  • Number of records: 489783.
  • Number of deaths: 1.16 million.
  • Total time at risk: 76.75 million years.

Now this data set is conveniently analyzed with the aid of the function tpchreg in eha. An example:

listab$year <- listab$year - 2000 # Center!
list2 <- aggregate(cbind(event, exposure) ~ sex + civst + ses + 
                       year + age, FUN = sum, data = listab)
fit <- tpchreg(oe(event, exposure) ~ strata(sex) + civst + ses + 
                   year, data = list2, time = age, last = 105)
xx <- summary(fit)
emen <- round(xx$rmean[1], 2)
ewomen <- round(xx$rmean[2], 2)
TABLE 7.3: Mortality in Sweden 2002–2014, ages 30 and above.
Covariate   level          W_mean    Coef      HR       SE       LR_p
civst                                                            0.0000
            married         0.512    0         1
            unmarried       0.264    0.4816    1.6187   0.0032
            prev.married    0.224    0.3265    1.3861   0.0023
ses                                                              0.0000
            worker          0.340    0         1
            elite           0.115   -0.3272    0.7209   0.0056
            middle          0.318   -0.1682    0.8452   0.0035
            lower           0.228    0.3629    1.4374   0.0030
year                        7.088   -0.0023    0.9977   0.0003   0.0000
Max Log Likelihood   -4538559.5
Restricted mean       52.568   57.023

The result is shown in Table 7.3.

We note a couple of things here:

  1. The data frame listab contains 489783 records, and a Poisson regression on it might take a long time. Since we did not use the covariates region (a factor with 25 levels) and cohort (92 levels), we could aggregate over these variables, which resulted in the data frame list2 with only 4493 records. This cuts computing time by approximately 99 per cent, from 17 seconds to 0.17 seconds, without affecting the results. Also note the similarity between the formula arguments of the two functions aggregate and tpchreg: the differences are that in aggregate, the function cbind is used instead of oe, and strata() is left out.

  2. All p-values are effectively zero, a consequence of the huge amount of data. It is also reasonable to claim that this constitutes a total investigation, and that p-values are therefore meaningless.

  3. The restricted mean survival at age 30 for a married worker in the year 2002 is 52.57 years for men and 57.02 years for women; that is, the expected age at death is 82.57 years for men and 87.02 years for women, given survival to age 30.

A graph of the conditional survival functions is shown in Figure 7.7.
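
A sketch of how Figure 7.7 might be produced; that the plot method accepts fn = "sur" for survival functions is an assumption:

plot(fit, fn = "sur", xlab = "Age", ylab = "Surviving fraction")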

FIGURE 7.7: Survival above 30, males and females, married worker, Sweden 2002.

7.3 Communal Covariates and Tabulation

The application of the function make.communal (see the previous chapter) often results in impractically large data sets, and it is recommended to combine it with the tabulation technique suggested in this chapter. We illustrate this with an investigation of the influence of weather conditions on mortality in two time periods, 1901–1950 and 1990–2014. In the first case the focus is on the effect of low temperature on mortality; the second case has the opposite focus: is an extremely hot summer bad for survival? The geographical area under study is the Swedish town Umeå, situated on the Gulf of Bothnia at latitude 64 degrees north.

7.3.1 Temperature and mortality, Umeå 1901–1950

Daily temperature data from SMHI for central Umeå, starting November 1, 1858 and ending September 30, 1979 (this particular weather station was replaced at the latter date), are used together with population data from Umeå. Since we will match these data to population data for the period January 1, 1901–December 31, 1950, we cut the temperature data to the same time period. See Table ??.

library(eha)
temp <- read.table("~/Forskning/Data/ume_temp.csv", 
                   header = TRUE,
                   skip = 10, sep = ",")[, 1:3]
## We need only the first three columns
names(temp) <- c("date", "time", "temp")

temp$date <- as.Date(temp$date)
temp$year <- as.numeric(format(temp$date, "%Y"))
temp$month <- as.numeric(format(temp$date, "%m"))
temp <- temp[order(temp$date, temp$time), ]
temp$quarter <- quarters(temp$date)
temp <- temp[temp$year %in% 1901:1950, ]
##

temp$time <- as.factor(temp$time)
temp$quarter <- as.factor(temp$quarter)

Apparently, there are three measurements each day, and the temperature range is (−34.8, 30) degrees Celsius. The variables year, month, and quarter were extracted from the date.

We extract population data for Umeå, 1 Jan 1901 to 31 Dec 1950, and include all ages between 15 and 100, see Table ?? for the first few rows.

There are 186033 records describing 73780 individuals in this selection. In order to simplify the presentation, only two covariates, sex and urban, are included.

The first question is what to do with the temperature measurements. They span 50 years, with mostly three measurements per day. In this first analysis we focus on the effect of low temperatures on mortality, so we start by taking the maximum of the measurements each day, and then the minimum of these daily maxima over each month. We thus end up with one measurement per month and year, in total 600 measurements.

Start by finding the maximum value each day; the result is shown in Table ??.

idx <- with(temp, tapply(date, date))
maxtemp <- with(temp, tapply(temp, date, max))
temp$maxtemp <- maxtemp[idx]
## Reduce to one measurement per day:
mtemp <- temp[!duplicated(temp$date), 
              c("date", "maxtemp", "month", "quarter", "year")]
rownames(mtemp) <- 1:NROW(mtemp)
lmtemp <- head(mtemp, 3)
lmtemp$maxtemp <- um(lmtemp$maxtemp)
tbl(lmtemp, fs = 11, 
    caption = "Maximum daily temperature, Umeå 1901--1950.")

The next step is to take minimum over year and month, see Table ??.

atmp <- aggregate(list(temp = mtemp$maxtemp), 
                  by = mtemp[, c("month", "year")], 
                  FUN = min)
hatmp <- head(atmp, 3)
hatmp$temp <- um(hatmp$temp)
tbl(hatmp, fs = 11, 
    caption = "Minimum daily max temperature by month, Umeå 1901--1950.")

Now we can apply make.communal to our data and split spells by year and month. The result is shown in Table ??.

atmp$yearmonth <- atmp$year * 100 + atmp$month 
## A trick to get format 190101
comtime <- system.time(
    nt <- make.communal(ume, atmp["yearmonth"], 
                        start = 1901, period = 1/12))

nt$enter <- round(nt$enter, 3)
nt$exit <- round(nt$exit, 3)
oj <- nt$enter >= nt$exit - 0.0001 # Too short interval!
nt$enter[oj] <- nt$exit[oj] - 0.001  # Break accidental ties
nt$year <- nt$yearmonth %/% 100 # Restore 'year'.
nt$month <- nt$yearmonth %% 100 # Restore 'month'.
saveRDS(nt, file = "mytables/nt.rds")
nt <- nt[nt$month %in% c(12, 1, 2), ]
tbl(head(nt[, -8], 3), fs = 10, 
    caption = "Population data by year and month, Umeå 1901--1950.")

Total time for making communal: 25 seconds.

The next step is to read temperature from atmp to nt, see Table ?? (selected variables).

indx <- match(nt$yearmonth, atmp$yearmonth)
nt$temp <- atmp$temp[indx]

This data frame contains 2708779 records, maybe too many for a comfortable Cox regression. Let’s see what happens (Table 7.4):

ptm <- proc.time()
fit <- coxreg(Surv(enter, exit, event) ~ 
                  sex + socStatus + temp, data = nt)
ctime <- proc.time() - ptm
fit.out(fit, label = "coxph7", 
        caption = "Mortality and temperature, Umeå 1901--1950.")
TABLE 7.4: Mortality and temperature, Umeå 1901–1950.
Covariate   level     W_mean    Coef      HR       SE       LR_p
sex                                                         0.3010
            male       0.490    0         1
            female     0.510   -0.0380    0.9627   0.0367
socStatus                                                   0.3311
            low        0.501    0         1
            high       0.499    0.0360    1.0367   0.0371
temp                 -15.430   -0.0061    0.9939   0.0035   0.0805
Max Log Likelihood   -21892.7

This is quite bad (the computing time: almost 2 minutes), but we already have a substantial result: survival chances increase with temperature. But remember that Umeå is a northern town; there is not much warm weather here. Further, there are no signs of interactions with sex or social status (not shown).

The analyses so far are very rudimentary, treating all ages as one group, and so on. Here we show how to reduce the full data set to a small table under some simple and reasonable assumptions. The approach builds on categorical covariates and an assumption of piecewise constant hazards. We use the function toTpch in the eha package, see Table ??.

told <- toTpch(Surv(enter, exit, event) ~ sex + socStatus + 
                   yearmonth, 
               data = nt, cuts = seq(15, 100, by = 5))
indx <- match(told$yearmonth, atmp$yearmonth)
told$temp <- atmp$temp[indx]
saveRDS(told, file = "mytables/told73.rds")
##told <- readRDS("mytables/told73.rds")
htold <- head(told, 6)
htold$temp <- um(htold$temp)
htold$age <- en(htold$age)
tbl(htold, fs = 11, 
    caption = "Tabular Umeå 1901--1950 data.")

The tabulation reduces the data set from 2.7 million records to 19 thousand, and computing time accordingly (to a quarter of a second!), without jeopardizing the results, see Table 7.5.

TABLE 7.5: Temperature and mortality, Umeå 1901–1950. PCH model.
Covariate   level     W_mean    Coef      HR       SE       LR_p
sex                                                         0.3176
            male       0.490    0         1
            female     0.510   -0.0367    0.9640   0.0367
socStatus                                                   0.3473
            low        0.501    0         1
            high       0.499    0.0348    1.0354   0.0370
temp                 -15.430   -0.0060    0.9940   0.0035   0.0857
Max Log Likelihood   -13590.6
Restricted mean       51.922

Essentially the same conclusion as in the earlier attempts: Mortality decreases with increasing temperature.

7.3.2 Hot summers in Umeå, 1990–2014

As the next example of combining the function eha::make.communal with tabulation, we examine the effect of hot summers on mortality. The data at hand relate to Umeå during the period 1990–2014.

Temperature data for different places in Sweden can be downloaded from SMHI (the Swedish Meteorological and Hydrological Institute); the data for Umeå are read and presented in Table ??.

There are 24 measurements taken every day, with few exceptions, in total 184398 values.

Population data for Umeå are extracted from the Longitudinal integrated database for health insurance and labour market studies (LISA). LISA contains information about all Swedes aged 16 and above on 31 December yearly, starting at 1990. See Table ??.

Then perform the aggregation, see Table ??.

The next step is to aggregate the population data, but we must first split it by month and year using the function make.communal. We want to keep track of year and month, so the simplest way is to use the variable yearmonth in aggrtemp as our communal covariate, and then read the temperature from aggrtemp:

compop <- eha::make.communal(pop, aggrtemp["yearmonth"], 
                             start = 1990, period = 1 / 12)
indx <- match(compop$yearmonth, aggrtemp$yearmonth)
compop$temp <- aggrtemp$temp[indx]
compop$year <- compop$yearmonth %/% 100
compop$month <- compop$yearmonth %% 100
out <- toTpch(Surv(enter, exit, event) ~ yearmonth + sex, 
              data = compop, 
              cuts = c(seq(15, 95, by = 5), 110))
indx <- match(out$yearmonth, aggrtemp$yearmonth)
out$temp <- aggrtemp$temp[indx]

Then follows the analysis of female mortality at ages 80 and above and its dependence on high temperature (above 29 degrees Celsius). The July maximum temperatures by year are shown in Figure 7.8.
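
A hypothetical sketch of the tabular model behind Table 7.6 below; the subset name women80 and the sex and age level labels are our assumptions, not taken from the source:

women80 <- out[out$sex == "women" &
               out$age %in% c("80-85", "85-90", "90-95", "95-110"), ]
fit80 <- tpchreg(oe(event, exposure) ~ I(temp > 29),
                 time = age, data = women80)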

FIGURE 7.8: Maximum temperatures in July by year 1990–2014.

TABLE 7.6: Tabular PH analysis, female (age 80+) mortality and high temperature, Umeå 1990–2014.
Covariate      level   W_mean   Coef     HR       SE       LR_p
I(temp > 29)                                               0.0141
               FALSE    0.876   0        1
               TRUE     0.124   0.3264   1.3859   0.1282
Max Log Likelihood   -1413.3
Restricted mean       9.472

With individual-based data and Cox regression we get the results in Table 7.7.
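
The corresponding Cox regression might look like the following sketch; ind80 is a hypothetical data frame holding the individual spells (from compop) for women aged 80 and above:

fit.cox <- coxreg(Surv(enter, exit, event) ~ I(temp > 29), data = ind80)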

TABLE 7.7: Cox regression, female (age 80+) mortality and high temperature, Umeå 1990–2014.
Covariate      level   W_mean   Coef     HR       SE       LR_p
I(temp > 29)                                               0.0165
               FALSE    0.876   0        1
               TRUE     0.124   0.3190   1.3757   0.1284
Max Log Likelihood   -2862.9

The results in the two tables are almost identical: women above 80 years of age are vulnerable to extreme temperatures. The risk increases by around 30 per cent (95 per cent confidence interval: from 7 to 77 per cent; quite wide, but with a p-value of around 1 per cent).