9 min read

Temperature

Reading temperature data

(Updated 2021-03-16)

We got temperature data from SMHI for central Umeå, daily measurements starting at 1 November, 1858 and ending at 30 September, 1979 (this particular weather station was replaced at this last date).

Since we attempt to match these data to population data for the time period 1 January 1901 – 31 December 1950, we cut the temperature data to the same time period.

library(eha)
temp <- read.table("~/Forskning/Data/ume_temp.csv", 
                   header = TRUE,
                   skip = 10, sep = ",")[, 1:3]
## We need only the first three columns
names(temp) <- c("date", "time", "temp")

temp$date <- as.Date(temp$date)
temp$year <- as.numeric(format(temp$date, "%Y"))
temp$month <- as.numeric(format(temp$date, "%m"))
temp <- temp[order(temp$date, temp$time), ]
temp$quarter <- quarters(temp$date)
temp <- temp[temp$year %in% 1901:1950, ]
##

temp$time <- as.factor(temp$time)
temp$quarter <- as.factor(temp$quarter)
library(kableExtra)
source("~/Documents/EHAR2/R/tbl.R")
tbl(head(temp, 10), fs = 12) # fs stands for 'font size'
Table 1:
date time temp year month quarter
1901-01-01 07:00:00 -17.5 1901 1 Q1
1901-01-01 13:00:00 -17.6 1901 1 Q1
1901-01-01 20:00:00 -22.1 1901 1 Q1
1901-01-02 07:00:00 -13.1 1901 1 Q1
1901-01-02 13:00:00 -12.2 1901 1 Q1
1901-01-02 20:00:00 -18.9 1901 1 Q1
1901-01-03 07:00:00 -18.4 1901 1 Q1
1901-01-03 13:00:00 -17.1 1901 1 Q1
1901-01-03 20:00:00 -18.3 1901 1 Q1
1901-01-04 07:00:00 -16.3 1901 1 Q1

Obviously, there are three measurements each day, and the temperature range is -34.8, 30 degrees Celsius. The variables year, month, and quarter were extracted from the date.

Reading individual data

We extract population data for Umeå, 1 Jan 1901 to 31 Dec 1950, and include all ages between 15 and 100:

library(skum)
ume <- obs[obs$region == "ume", ]
ume$event <- (ume$sluttyp == 2)
ume <- cal.window(ume, c(1901, 1951))
ume <- age.window(ume, c(15, 101))
ume <- ume[, c("id", "sex", "birthdate", "enter", "exit", 
               "event", "civst", "socStatus", "ortnmn")]
ume$urban <- ume$ortnmn == "UMEÅ"
ume$ortnmn <- NULL
noll <- ume$enter >= ume$exit
ume$enter[noll] <- ume$enter[noll] - 0.001 # Fix zero length spells
show <- head(ume, 10)
show$birthdate <- toDate(show$birthdate)
tbl(show, fs = 12)
Table 2:
id sex birthdate enter exit event civst socStatus urban
19 male 1922-09-28 26.489 27.16600 0 unmarried high FALSE
31 female 1888-10-08 18.448 18.45300 0 unmarried NA FALSE
31 female 1888-10-08 18.453 20.12900 0 unmarried NA FALSE
31 female 1888-10-08 20.128 20.12900 0 unmarried low FALSE
53 female 1922-02-16 18.714 19.58400 0 unmarried NA FALSE
64 male 1925-05-22 17.200 25.61146 0 unmarried NA TRUE
84 male 1916-03-29 15.000 20.00000 0 unmarried low TRUE
84 male 1916-03-29 20.000 34.75880 0 unmarried NA TRUE
128 male 1871-12-18 48.866 51.03200 0 married low FALSE
128 male 1871-12-18 50.035 62.85100 0 married high FALSE

There are 186033 records about 73780 individuals in this selection.

First analysis

The first question is what to do with the temperature measurements. They span 50 years, with mostly three measurements per day. In this first analysis we focus on the effect on mortlity of low temperatures, so we start by taking the maximum of daily temperatures, then minimum over month. We thus end up with one measurement per month and year, in total 600 measurements.

Start by finding the maximum value each day:

idx <- with(temp, tapply(date, date))
maxtemp <- with(temp, tapply(temp, date, max))
temp$maxtemp <- maxtemp[idx]
## Reduce to one measurement per day:
mtemp <- temp[!duplicated(temp$date), 
              c("date", "maxtemp", "month", "quarter", "year")]
rownames(mtemp) <- 1:NROW(mtemp)
tbl(head(mtemp, 12), fs = 12)
Table 3:
date maxtemp month quarter year
1901-01-01 -17.5 1 Q1 1901
1901-01-02 -12.2 1 Q1 1901
1901-01-03 -17.1 1 Q1 1901
1901-01-04 -2.9 1 Q1 1901
1901-01-05 3.9 1 Q1 1901
1901-01-06 1.8 1 Q1 1901
1901-01-07 -3.1 1 Q1 1901
1901-01-08 -12.3 1 Q1 1901
1901-01-09 -8.0 1 Q1 1901
1901-01-10 -10.3 1 Q1 1901
1901-01-11 0.2 1 Q1 1901
1901-01-12 0.1 1 Q1 1901

The next step is to take minimum over year and month:

atmp <- aggregate(list(temp = mtemp$maxtemp), 
                  by = mtemp[, c("month", "year")], 
                  FUN = min)
tbl(head(atmp, 12), fs = 12)
Table 4:
month year temp
1 1901 -17.5
2 1901 -15.1
3 1901 -12.2
4 1901 -4.1
5 1901 0.4
6 1901 8.7
7 1901 15.0
8 1901 11.9
9 1901 7.6
10 1901 4.5
11 1901 -8.2
12 1901 -15.8

Now we can try to apply make.communal to our data, to split spells by year and month:

atmp$yearmonth <- atmp$year * 100 + atmp$month 
## A trick to get format 190101
comtime <- system.time(
    nt <- make.communal(ume, atmp["yearmonth"], 
                        start = 1901, period = 1/12))

oj <- nt$enter >= nt$exit - 0.0001 # Too short interval!
nt$enter[oj] <- nt$exit[oj] - 0.001  # Break accidental ties
tbl(head(nt, 12), fs = 12)
Table 5:
enter exit event birthdate id sex civst socStatus urban yearmonth
26.48900 26.50901 0 1922.741 19 male unmarried high FALSE 194903
26.50901 26.59235 0 1922.741 19 male unmarried high FALSE 194904
26.59235 26.67568 0 1922.741 19 male unmarried high FALSE 194905
26.67568 26.75901 0 1922.741 19 male unmarried high FALSE 194906
26.75901 26.84235 0 1922.741 19 male unmarried high FALSE 194907
26.84235 26.92568 0 1922.741 19 male unmarried high FALSE 194908
26.92568 27.00901 0 1922.741 19 male unmarried high FALSE 194909
27.00901 27.09235 0 1922.741 19 male unmarried high FALSE 194910
27.09235 27.16600 0 1922.741 19 male unmarried high FALSE 194911
18.44800 18.45300 0 1888.772 31 female unmarried NA FALSE 190703
18.45300 18.47823 0 1888.772 31 female unmarried NA FALSE 190703
18.47823 18.56156 0 1888.772 31 female unmarried NA FALSE 190704

Total time for making communal: 25 seconds.

The next step is to read temperature from atmp to nt:

indx <- match(nt$yearmonth, atmp$yearmonth)
nt$temp <- atmp$temp[indx]
nt$year <- nt$yearmonth %/% 100 # Restore 'year'.
nt$month <- nt$yearmonth %% 100
tbl(head(nt, 10), fs = 12)
Table 6:
enter exit event birthdate id sex civst socStatus urban yearmonth temp year month
26.48900 26.50901 0 1922.741 19 male unmarried high FALSE 194903 -10.5 1949 3
26.50901 26.59235 0 1922.741 19 male unmarried high FALSE 194904 -1.4 1949 4
26.59235 26.67568 0 1922.741 19 male unmarried high FALSE 194905 7.8 1949 5
26.67568 26.75901 0 1922.741 19 male unmarried high FALSE 194906 10.4 1949 6
26.75901 26.84235 0 1922.741 19 male unmarried high FALSE 194907 12.8 1949 7
26.84235 26.92568 0 1922.741 19 male unmarried high FALSE 194908 13.0 1949 8
26.92568 27.00901 0 1922.741 19 male unmarried high FALSE 194909 10.4 1949 9
27.00901 27.09235 0 1922.741 19 male unmarried high FALSE 194910 -0.8 1949 10
27.09235 27.16600 0 1922.741 19 male unmarried high FALSE 194911 -8.0 1949 11
18.44800 18.45300 0 1888.772 31 female unmarried NA FALSE 190703 -7.8 1907 3

This data frame contains 10785662 records, maybe too much for a comfortable Cox regression. Let’s see what happens:

##library(survival)
ptm <- proc.time()
fit <- coxreg(Surv(enter, exit, event) ~ 
                  sex + temp, data = nt)
res <- summary(fit)
ctime <- proc.time() - ptm
tbl(regtable(res))
Table 7:
Covariate level W_mean Coef HR SE LR_p
sex 0.000
male 0.490 0 1
female 0.510 -0.095 0.910 0.019
temp -2.090 -0.008 0.993 0.001 0.000

This is quite bad (the computing time, almost 2 minutes), but we already have a substantial result: Survival chances increase with temperature. But remember that Umeå is a northern town, not much warm weather here. But it can also be the case that both high and low temperatures are critical, which may motivate the categorization of temperature. Let us look at its distribution over the year.

hist(atmp$temp, xlab = "Temperature (centigrade)", main = "")
Monthly minimum  temperatures, Umeå 1901-1950.

Figure 1: Monthly minimum temperatures, Umeå 1901-1950.

And the winter distribution:

hist(atmp[atmp$month %in% c(1:3, 10:12), ]$temp, 
     xlab = "Temperature (centigrade)", main = "")
Monthly minimum  temperatures, Umeå 1901-1950. October-March.

Figure 2: Monthly minimum temperatures, Umeå 1901-1950. October-March.

And in the summer:

hist(atmp[atmp$month %in% c(4:9), ]$temp, 
     xlab = "Temperature (centigrade)", main = "")
Monthly minimum  temperatures, Umeå 1901-1950. April-September.

Figure 3: Monthly minimum temperatures, Umeå 1901-1950. April-September.

Winter survival analysis results, with cut points at -20, -5 degrees Celsius:

nt$ctemp <- cut(nt$temp, breaks = c(-30, -20, -5, 5))
fit <- coxreg(Surv(enter, exit, event) ~ sex + ctemp, 
              data = nt[nt$month %in% c(1:3, 10:12), ])
tbl(regtable(summary(fit)))
Table 8:
Covariate level W_mean Coef HR SE LR_p
sex 0.006
male 0.490 0 1
female 0.510 -0.072 0.931 0.026
ctemp 0.004
(-30,-20] 0.110 0 1
(-20,-5] 0.658 -0.036 0.965 0.042
(-5,5] 0.233 -0.133 0.876 0.048

Obviously, mortality increases with decreasing temperature, but there is not much difference between the two coldest intervals: Clearly best is a temperature around the freezing point.

Summer results with cut points 0, 10 degrees Celsius:

nt$ctemp <- cut(nt$temp, breaks = c(-20, 0, 10, 17))
fit <- coxreg(Surv(enter, exit, event) ~ sex + ctemp, 
              data = nt[nt$month %in% c(4:9), ])
tbl(regtable(summary(fit)))
Table 9:
Covariate level W_mean Coef HR SE LR_p
sex 0.000
male 0.490 0 1
female 0.510 -0.121 0.886 0.028
ctemp 0.000
(-20,0] 0.147 0 1
(0,10] 0.508 -0.130 0.878 0.039
(10,17] 0.344 -0.226 0.798 0.042

Also here is it clear that the months with the highest min temperature are the healthiest.

Reducing the size of the data set

The analyses so far are very rudimentary, including all ages in one and so on. Here we show how extract small subsets of the full data set by some simple and reasonable assumptions. It builds on categorical covariates and an assumption of piecewise constant hazards. We use the function toTpch in the eha package:

told <- toTpch(Surv(enter, exit, event) ~ sex + urban + yearmonth, 
               data = nt, cuts = seq(15, 100, by = 5))
indx <- match(told$yearmonth, atmp$yearmonth)
told$temp <- atmp$temp[indx]
tbl(head(told, 6), fs = 12)
Table 10:
sex urban yearmonth age event exposure temp
male FALSE 190101 15-20 1 59.69168 -17.5
female FALSE 190101 15-20 1 58.20506 -17.5
male TRUE 190101 15-20 0 15.11353 -17.5
female TRUE 190101 15-20 1 16.23145 -17.5
male FALSE 190102 15-20 1 60.21995 -15.1
female FALSE 190102 15-20 0 58.64377 -15.1

The tabulation reduces the data set from 2.5 million records to 19 thousands. And computing time accordingly (reduced to a quarter of a second!), without jeopardizing the results. Let’s see:

fit <- tpchreg(oe(event, exposure) ~ sex + temp, 
               data = told, time = age)
tbl(regtable(summary(fit)))
Table 11:
Covariate level W_mean Coef HR SE LR_p
sex 0.000
male 0.490 0 1
female 0.510 -0.093 0.911 0.019
temp -2.090 -0.008 0.993 0.001 0.000

Essentially the same conclusion as in the earlier attempts: Mortality decreases with increasing temperature.

Conclusion

This was a very superficial analysis of the effect of temperature on mortality. The main purpose of the presentation was to show how to include communal covariates (here: temperature) in survival analysis, and how to reduce huge data sets by tabulation. Read more about it on EHAR2, Chapters 6 and 7.

Final note: The timings above are of course heavily dependent on both hardware and software. I used my laptop, a DELL M6800 with 32 Gb memory and running Ubuntu 20.10, R 4.0.4 in RStudio 1.4.1103.