Reading temperature data
(Updated 2021-03-16)
We obtained temperature data from SMHI for central Umeå: daily measurements from 1 November 1858 through 30 September 1979 (this particular weather station was replaced on the latter date).
Since we attempt to match these data to population data for the time period 1 January 1901 – 31 December 1950, we cut the temperature data to the same time period.
library(eha)
temp <- read.table("~/Forskning/Data/ume_temp.csv",
                   header = TRUE,
                   skip = 10, sep = ",")[, 1:3]
## We need only the first three columns
names(temp) <- c("date", "time", "temp")
temp$date <- as.Date(temp$date)
temp$year <- as.numeric(format(temp$date, "%Y"))
temp$month <- as.numeric(format(temp$date, "%m"))
temp <- temp[order(temp$date, temp$time), ]
temp$quarter <- quarters(temp$date)
temp <- temp[temp$year %in% 1901:1950, ]
##
temp$time <- as.factor(temp$time)
temp$quarter <- as.factor(temp$quarter)
library(kableExtra)
source("~/Documents/EHAR2/R/tbl.R")
tbl(head(temp, 10), fs = 12) # fs stands for 'font size'
date | time | temp | year | month | quarter |
---|---|---|---|---|---|
1901-01-01 | 07:00:00 | -17.5 | 1901 | 1 | Q1 |
1901-01-01 | 13:00:00 | -17.6 | 1901 | 1 | Q1 |
1901-01-01 | 20:00:00 | -22.1 | 1901 | 1 | Q1 |
1901-01-02 | 07:00:00 | -13.1 | 1901 | 1 | Q1 |
1901-01-02 | 13:00:00 | -12.2 | 1901 | 1 | Q1 |
1901-01-02 | 20:00:00 | -18.9 | 1901 | 1 | Q1 |
1901-01-03 | 07:00:00 | -18.4 | 1901 | 1 | Q1 |
1901-01-03 | 13:00:00 | -17.1 | 1901 | 1 | Q1 |
1901-01-03 | 20:00:00 | -18.3 | 1901 | 1 | Q1 |
1901-01-04 | 07:00:00 | -16.3 | 1901 | 1 | Q1 |
Obviously, there are three measurements each day, and the temperatures range from -34.8 to 30 degrees Celsius. The variables year, month, and quarter were extracted from the date.
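The date arithmetic above uses only base R, via format() and quarters(); a quick self-contained check on a single date:

```r
## Extract year, month, and quarter from a Date, as done above.
d <- as.Date("1901-01-01")
year  <- as.numeric(format(d, "%Y"))  # 1901
month <- as.numeric(format(d, "%m"))  # 1
qtr   <- quarters(d)                  # "Q1"
stopifnot(year == 1901, month == 1, qtr == "Q1")
```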
Reading individual data
We extract population data for Umeå, 1 Jan 1901 to 31 Dec 1950, and include all ages between 15 and 100:
library(skum)
ume <- obs[obs$region == "ume", ]
ume$event <- (ume$sluttyp == 2)
ume <- cal.window(ume, c(1901, 1951))
ume <- age.window(ume, c(15, 101))
ume <- ume[, c("id", "sex", "birthdate", "enter", "exit",
               "event", "civst", "socStatus", "ortnmn")]
ume$urban <- ume$ortnmn == "UMEÅ"
ume$ortnmn <- NULL
noll <- ume$enter >= ume$exit
ume$enter[noll] <- ume$enter[noll] - 0.001 # Fix zero length spells
show <- head(ume, 10)
show$birthdate <- toDate(show$birthdate)
tbl(show, fs = 12)
id | sex | birthdate | enter | exit | event | civst | socStatus | urban |
---|---|---|---|---|---|---|---|---|
19 | male | 1922-09-28 | 26.489 | 27.16600 | 0 | unmarried | high | FALSE |
31 | female | 1888-10-08 | 18.448 | 18.45300 | 0 | unmarried | NA | FALSE |
31 | female | 1888-10-08 | 18.453 | 20.12900 | 0 | unmarried | NA | FALSE |
31 | female | 1888-10-08 | 20.128 | 20.12900 | 0 | unmarried | low | FALSE |
53 | female | 1922-02-16 | 18.714 | 19.58400 | 0 | unmarried | NA | FALSE |
64 | male | 1925-05-22 | 17.200 | 25.61146 | 0 | unmarried | NA | TRUE |
84 | male | 1916-03-29 | 15.000 | 20.00000 | 0 | unmarried | low | TRUE |
84 | male | 1916-03-29 | 20.000 | 34.75880 | 0 | unmarried | NA | TRUE |
128 | male | 1871-12-18 | 48.866 | 51.03200 | 0 | married | low | FALSE |
128 | male | 1871-12-18 | 50.035 | 62.85100 | 0 | married | high | FALSE |
There are 186033 records on 73780 individuals in this selection.
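These counts are easy to reproduce: the number of records is the number of rows, and the number of individuals is the number of distinct id values. A sketch on toy data (in the real data the id column plays the same role):

```r
## Toy stand-in for ume: several records per individual.
toy <- data.frame(id    = c(19, 31, 31, 31, 53),
                  enter = c(26.489, 18.448, 18.453, 20.128, 18.714))
n_records     <- nrow(toy)               # number of records
n_individuals <- length(unique(toy$id))  # number of distinct individuals
stopifnot(n_records == 5, n_individuals == 3)
```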
First analysis
The first question is what to do with the temperature measurements. They span 50 years, with mostly three measurements per day. In this first analysis we focus on the effect of low temperatures on mortality, so we first take the daily maximum temperature and then the monthly minimum of those daily maxima. We thus end up with one measurement per month and year, 600 measurements in total.
Start by finding the maximum value each day:
idx <- with(temp, tapply(date, date))
maxtemp <- with(temp, tapply(temp, date, max))
temp$maxtemp <- maxtemp[idx]
## Reduce to one measurement per day:
mtemp <- temp[!duplicated(temp$date),
              c("date", "maxtemp", "month", "quarter", "year")]
rownames(mtemp) <- 1:NROW(mtemp)
tbl(head(mtemp, 12), fs = 12)
date | maxtemp | month | quarter | year |
---|---|---|---|---|
1901-01-01 | -17.5 | 1 | Q1 | 1901 |
1901-01-02 | -12.2 | 1 | Q1 | 1901 |
1901-01-03 | -17.1 | 1 | Q1 | 1901 |
1901-01-04 | -2.9 | 1 | Q1 | 1901 |
1901-01-05 | 3.9 | 1 | Q1 | 1901 |
1901-01-06 | 1.8 | 1 | Q1 | 1901 |
1901-01-07 | -3.1 | 1 | Q1 | 1901 |
1901-01-08 | -12.3 | 1 | Q1 | 1901 |
1901-01-09 | -8.0 | 1 | Q1 | 1901 |
1901-01-10 | -10.3 | 1 | Q1 | 1901 |
1901-01-11 | 0.2 | 1 | Q1 | 1901 |
1901-01-12 | 0.1 | 1 | Q1 | 1901 |
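The tapply(date, date) idiom above may look cryptic: called without a function, tapply returns the group index of each element, which lets us broadcast the per-day maximum back onto every row. A minimal illustration with the first three days of readings:

```r
## Three days, three readings each; compute each day's maximum
## and attach it to every reading of that day.
day     <- rep(c("d1", "d2", "d3"), each = 3)
reading <- c(-17.5, -17.6, -22.1, -13.1, -12.2, -18.9, -18.4, -17.1, -18.3)
idx <- tapply(day, day)          # group index of each row: 1 1 1 2 2 2 3 3 3
mx  <- tapply(reading, day, max) # per-day maximum
daily_max <- mx[idx]             # broadcast back to all rows
stopifnot(all(daily_max == rep(c(-17.5, -12.2, -17.1), each = 3)))
```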
The next step is to take minimum over year and month:
atmp <- aggregate(list(temp = mtemp$maxtemp),
                  by = mtemp[, c("month", "year")],
                  FUN = min)
tbl(head(atmp, 12), fs = 12)
month | year | temp |
---|---|---|
1 | 1901 | -17.5 |
2 | 1901 | -15.1 |
3 | 1901 | -12.2 |
4 | 1901 | -4.1 |
5 | 1901 | 0.4 |
6 | 1901 | 8.7 |
7 | 1901 | 15.0 |
8 | 1901 | 11.9 |
9 | 1901 | 7.6 |
10 | 1901 | 4.5 |
11 | 1901 | -8.2 |
12 | 1901 | -15.8 |
Now we can try to apply make.communal to our data, to split spells by year and month:
atmp$yearmonth <- atmp$year * 100 + atmp$month
## A trick to get format 190101
comtime <- system.time(
    nt <- make.communal(ume, atmp["yearmonth"],
                        start = 1901, period = 1/12))
oj <- nt$enter >= nt$exit - 0.0001 # Too short interval!
nt$enter[oj] <- nt$exit[oj] - 0.001 # Break accidental ties
tbl(head(nt, 12), fs = 12)
enter | exit | event | birthdate | id | sex | civst | socStatus | urban | yearmonth |
---|---|---|---|---|---|---|---|---|---|
26.48900 | 26.50901 | 0 | 1922.741 | 19 | male | unmarried | high | FALSE | 194903 |
26.50901 | 26.59235 | 0 | 1922.741 | 19 | male | unmarried | high | FALSE | 194904 |
26.59235 | 26.67568 | 0 | 1922.741 | 19 | male | unmarried | high | FALSE | 194905 |
26.67568 | 26.75901 | 0 | 1922.741 | 19 | male | unmarried | high | FALSE | 194906 |
26.75901 | 26.84235 | 0 | 1922.741 | 19 | male | unmarried | high | FALSE | 194907 |
26.84235 | 26.92568 | 0 | 1922.741 | 19 | male | unmarried | high | FALSE | 194908 |
26.92568 | 27.00901 | 0 | 1922.741 | 19 | male | unmarried | high | FALSE | 194909 |
27.00901 | 27.09235 | 0 | 1922.741 | 19 | male | unmarried | high | FALSE | 194910 |
27.09235 | 27.16600 | 0 | 1922.741 | 19 | male | unmarried | high | FALSE | 194911 |
18.44800 | 18.45300 | 0 | 1888.772 | 31 | female | unmarried | NA | FALSE | 190703 |
18.45300 | 18.47823 | 0 | 1888.772 | 31 | female | unmarried | NA | FALSE | 190703 |
18.47823 | 18.56156 | 0 | 1888.772 | 31 | female | unmarried | NA | FALSE | 190704 |
Total time for making communal: 25 seconds.
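Under the hood, make.communal splits each spell at the calendar boundaries of the chosen period (here months, period = 1/12). The eha implementation does much more bookkeeping, but the core splitting step can be sketched in base R as follows (a simplified illustration with a hypothetical function name, not the package code):

```r
## Split one spell, given in calendar time (decimal years), at
## multiples of `period` (1/12 = month boundaries).
split_spell <- function(enter, exit, period = 1/12) {
    k <- seq(floor(enter / period), ceiling(exit / period))
    cuts <- k * period
    cuts <- cuts[cuts > enter & cuts < exit]  # interior boundaries only
    data.frame(enter = c(enter, cuts), exit = c(cuts, exit))
}

sp <- split_spell(1949.20, 1949.45)  # crosses three month boundaries
stopifnot(nrow(sp) == 4,
          sp$enter[1] == 1949.20, sp$exit[4] == 1949.45,
          all(sp$exit[-4] == sp$enter[-1]))  # sub-spells are contiguous
```

A spell that stays within a single month comes back unsplit, as one (enter, exit) row.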
The next step is to read temperature from atmp into nt:
indx <- match(nt$yearmonth, atmp$yearmonth)
nt$temp <- atmp$temp[indx]
nt$year <- nt$yearmonth %/% 100 # Restore 'year'.
nt$month <- nt$yearmonth %% 100
tbl(head(nt, 10), fs = 12)
enter | exit | event | birthdate | id | sex | civst | socStatus | urban | yearmonth | temp | year | month |
---|---|---|---|---|---|---|---|---|---|---|---|---|
26.48900 | 26.50901 | 0 | 1922.741 | 19 | male | unmarried | high | FALSE | 194903 | -10.5 | 1949 | 3 |
26.50901 | 26.59235 | 0 | 1922.741 | 19 | male | unmarried | high | FALSE | 194904 | -1.4 | 1949 | 4 |
26.59235 | 26.67568 | 0 | 1922.741 | 19 | male | unmarried | high | FALSE | 194905 | 7.8 | 1949 | 5 |
26.67568 | 26.75901 | 0 | 1922.741 | 19 | male | unmarried | high | FALSE | 194906 | 10.4 | 1949 | 6 |
26.75901 | 26.84235 | 0 | 1922.741 | 19 | male | unmarried | high | FALSE | 194907 | 12.8 | 1949 | 7 |
26.84235 | 26.92568 | 0 | 1922.741 | 19 | male | unmarried | high | FALSE | 194908 | 13.0 | 1949 | 8 |
26.92568 | 27.00901 | 0 | 1922.741 | 19 | male | unmarried | high | FALSE | 194909 | 10.4 | 1949 | 9 |
27.00901 | 27.09235 | 0 | 1922.741 | 19 | male | unmarried | high | FALSE | 194910 | -0.8 | 1949 | 10 |
27.09235 | 27.16600 | 0 | 1922.741 | 19 | male | unmarried | high | FALSE | 194911 | -8.0 | 1949 | 11 |
18.44800 | 18.45300 | 0 | 1888.772 | 31 | female | unmarried | NA | FALSE | 190703 | -7.8 | 1907 | 3 |
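The yearmonth key (year * 100 + month) round-trips through integer division and modulo, which is how year and month are restored above; match() then finds, for each spell, the corresponding row of the lookup table. A small self-contained check:

```r
## Toy lookup table (stand-in for atmp) and a few spell keys:
lookup <- data.frame(yearmonth = c(194903, 194904, 194905),
                     temp      = c(-10.5, -1.4, 7.8))
keys  <- c(194904, 194903, 194903)
indx  <- match(keys, lookup$yearmonth)  # lookup row for each key
temps <- lookup$temp[indx]
stopifnot(all(temps == c(-1.4, -10.5, -10.5)))
## Decoding the key:
stopifnot(194903 %/% 100 == 1949,  # integer division restores the year
          194903 %%  100 == 3)     # modulo restores the month
```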
This data frame contains 10785662 records, maybe too many for a comfortable Cox regression. Let’s see what happens:
##library(survival)
ptm <- proc.time()
fit <- coxreg(Surv(enter, exit, event) ~ sex + temp,
              data = nt)
res <- summary(fit)
ctime <- proc.time() - ptm
tbl(regtable(res))
Covariate | level | W_mean | Coef | HR | SE | LR_p |
---|---|---|---|---|---|---|
sex |  |  |  |  |  | 0.000 |
 | male | 0.490 | 0 | 1 |  |  |
 | female | 0.510 | -0.095 | 0.910 | 0.019 |  |
temp |  | -2.090 | -0.008 | 0.993 | 0.001 | 0.000 |
This is quite bad (the computing time: almost two minutes), but we already have a substantial result: survival chances increase with temperature. Remember, though, that Umeå is a northern town without much warm weather. It may also be that both high and low temperatures are harmful, which motivates categorizing temperature. Let us look at its distribution over the year.
hist(atmp$temp, xlab = "Temperature (centigrade)", main = "")

Figure 1: Monthly minimum temperatures, Umeå 1901-1950.
And the winter distribution:
hist(atmp[atmp$month %in% c(1:3, 10:12), ]$temp,
     xlab = "Temperature (centigrade)", main = "")

Figure 2: Monthly minimum temperatures, Umeå 1901-1950. October-March.
And in the summer:
hist(atmp[atmp$month %in% c(4:9), ]$temp,
     xlab = "Temperature (centigrade)", main = "")

Figure 3: Monthly minimum temperatures, Umeå 1901-1950. April-September.
Winter survival analysis results, with cut points at -20 and -5 degrees Celsius:
nt$ctemp <- cut(nt$temp, breaks = c(-30, -20, -5, 5))
fit <- coxreg(Surv(enter, exit, event) ~ sex + ctemp,
              data = nt[nt$month %in% c(1:3, 10:12), ])
tbl(regtable(summary(fit)))
Covariate | level | W_mean | Coef | HR | SE | LR_p |
---|---|---|---|---|---|---|
sex |  |  |  |  |  | 0.006 |
 | male | 0.490 | 0 | 1 |  |  |
 | female | 0.510 | -0.072 | 0.931 | 0.026 |  |
ctemp |  |  |  |  |  | 0.004 |
 | (-30,-20] | 0.110 | 0 | 1 |  |  |
 | (-20,-5] | 0.658 | -0.036 | 0.965 | 0.042 |  |
 | (-5,5] | 0.233 | -0.133 | 0.876 | 0.048 |  |
Obviously, mortality increases with decreasing temperature, but there is not much difference between the two coldest intervals; a temperature around the freezing point is clearly best.
Summer results, with cut points at 0 and 10 degrees Celsius:
nt$ctemp <- cut(nt$temp, breaks = c(-20, 0, 10, 17))
fit <- coxreg(Surv(enter, exit, event) ~ sex + ctemp,
              data = nt[nt$month %in% c(4:9), ])
tbl(regtable(summary(fit)))
Covariate | level | W_mean | Coef | HR | SE | LR_p |
---|---|---|---|---|---|---|
sex |  |  |  |  |  | 0.000 |
 | male | 0.490 | 0 | 1 |  |  |
 | female | 0.510 | -0.121 | 0.886 | 0.028 |  |
ctemp |  |  |  |  |  | 0.000 |
 | (-20,0] | 0.147 | 0 | 1 |  |  |
 | (0,10] | 0.508 | -0.130 | 0.878 | 0.039 |  |
 | (10,17] | 0.344 | -0.226 | 0.798 | 0.042 |  |
Here, too, it is clear that the months with the highest minimum temperature are the healthiest.
Reducing the size of the data set
The analyses so far are very rudimentary, lumping all ages together and so on. Here we show how to shrink the full data set to a small fraction of its size under some simple and reasonable assumptions. The approach builds on categorical covariates and an assumption of piecewise constant hazards. We use the function toTpch in the eha package:
told <- toTpch(Surv(enter, exit, event) ~ sex + urban + yearmonth,
               data = nt, cuts = seq(15, 100, by = 5))
indx <- match(told$yearmonth, atmp$yearmonth)
told$temp <- atmp$temp[indx]
tbl(head(told, 6), fs = 12)
sex | urban | yearmonth | age | event | exposure | temp |
---|---|---|---|---|---|---|
male | FALSE | 190101 | 15-20 | 1 | 59.69168 | -17.5 |
female | FALSE | 190101 | 15-20 | 1 | 58.20506 | -17.5 |
male | TRUE | 190101 | 15-20 | 0 | 15.11353 | -17.5 |
female | TRUE | 190101 | 15-20 | 1 | 16.23145 | -17.5 |
male | FALSE | 190102 | 15-20 | 1 | 60.21995 | -15.1 |
female | FALSE | 190102 | 15-20 | 0 | 58.64377 | -15.1 |
The tabulation reduces the data set from millions of records to about 19 thousand, and computing time shrinks accordingly (to a quarter of a second!), without jeopardizing the results. Let’s see:
fit <- tpchreg(oe(event, exposure) ~ sex + temp,
               data = told, time = age)
tbl(regtable(summary(fit)))
Covariate | level | W_mean | Coef | HR | SE | LR_p |
---|---|---|---|---|---|---|
sex |  |  |  |  |  | 0.000 |
 | male | 0.490 | 0 | 1 |  |  |
 | female | 0.510 | -0.093 | 0.911 | 0.019 |  |
temp |  | -2.090 | -0.008 | 0.993 | 0.001 | 0.000 |
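The occurrence/exposure idea behind toTpch can be illustrated in base R: under piecewise constant hazards, each cell (covariate combination by age interval) only needs the number of events and the total time at risk, so individual records collapse without loss of information. A toy sketch of the tabulation step (not the eha implementation):

```r
## Individual spells: an event indicator and time at risk,
## with one grouping covariate.
spells <- data.frame(sex   = c("male", "male", "female", "female", "female"),
                     event = c(1, 0, 0, 1, 1),
                     risk  = c(2.0, 3.5, 1.0, 4.0, 0.5))
## Collapse to one row per group: occurrences and exposure.
tab <- aggregate(cbind(event = spells$event, exposure = spells$risk),
                 by = list(sex = spells$sex), FUN = sum)
## The occurrence/exposure rate per group estimates the hazard:
tab$rate <- tab$event / tab$exposure
stopifnot(nrow(tab) == 2,
          tab$event[tab$sex == "male"] == 1,
          tab$exposure[tab$sex == "female"] == 5.5)
```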
Essentially the same conclusion as in the earlier attempts: Mortality decreases with increasing temperature.
Conclusion
This was a very superficial analysis of the effect of temperature on mortality. The main purpose of the presentation was to show how to include communal covariates (here: temperature) in survival analysis, and how to reduce huge data sets by tabulation. Read more about it in EHAR2, Chapters 6 and 7.
Final note: The timings above are of course heavily dependent on both hardware and software. I used my laptop, a DELL M6800 with 32 GB of memory, running Ubuntu 20.10 and R 4.0.4 in RStudio 1.4.1103.