6.6 Discrete time models
There are two ways of looking at discrete duration data: either time is truly discrete, for instance the number of trials until an event occurs, or discreteness is an approximation caused by the rounding of continuous-time data. In a sense all data are discrete, because it is impossible to measure anything on a continuous scale with infinite precision, but from a practical point of view it is reasonable to say that data are discrete when tied events occur embarrassingly often.
When working with register data, time is often measured in years which
makes it necessary and convenient to work with discrete models. A typical
data format is the so-called wide format, where there is one record
(row) per individual, and measurements for many years. We have so far only
worked with the long format. The data sets created by survSplit
are in long format; there is one record per individual and age category.
The R work horse in switching back and forth between the long and wide
formats is the function reshape. It may look confusing at first, but
if data follow some simple rules, it is quite easy to use reshape.
The function reshape is typically used with longitudinal data,
where there are several measurements at different time points for each
individual. If the data for one individual is registered within one record
(row), we say that data are in wide format, and if there is one record (row)
per time (several records per individual), data are in long format. Using
wide format, the rule is that time-varying variable names must end in a
numeric value indicating at which time the measurement was taken. For
instance, if the variable civ (civil status) is noted at times 1, 2,
and 3, there must be variables named civ.1, civ.2, civ.3,
respectively. It is optional to use any separator between the base
name (civ) and the time, but it should be one character or empty. The
“.” is what reshape expects by default, so using that form
simplifies coding somewhat.
We start by creating an example data set as an illustration. This is
accomplished by starting off with the data set oldmort in eha
and “trimming” it.
data(oldmort)
om <- oldmort[oldmort$enter == 60, ]
om <- age.window(om, c(60, 70))
om$m.id <- om$f.id <- om$imr.birth <- om$birthplace <- NULL
om$birthdate <- om$ses.50 <- NULL
om1 <- survSplit(om, cut = 61:69, start = "enter", end = "exit",
event = "event", episode = "agegrp")
om1$agegrp <- factor(om1$agegrp, labels = 60:69)
om1 <- om1[order(om1$id, om1$enter), ]
head(om1)
## id sex civ region enter exit event
## 1 800000625 male widow rural 60 61.00 0
## 2 800000625 male widow rural 61 62.00 0
## 3 800000625 male widow rural 62 63.00 0
## 4 800000625 male widow rural 63 63.41 0
## 5 800000631 female widow industry 60 61.00 0
## 6 800000631 female widow industry 61 62.00 0
## agegrp
## 1 60
## 2 61
## 3 62
## 4 63
## 5 60
## 6 61
We may reset the row numbers and recode id so that the output is easier to read.
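One way to accomplish this (a sketch; the source does not show the call) is to map the original id values to consecutive integers and drop the old row names:

```r
# Recode id to consecutive integers, in order of first appearance,
# and reset the row names of the data frame.
om1$id <- as.integer(factor(om1$id, levels = unique(om1$id)))
rownames(om1) <- NULL
head(om1)
```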
## id sex civ region enter exit event agegrp
## 1 1 male widow rural 60 61.00 0 60
## 2 1 male widow rural 61 62.00 0 61
## 3 1 male widow rural 62 63.00 0 62
## 4 1 male widow rural 63 63.41 0 63
## 5 2 female widow industry 60 61.00 0 60
## 6 2 female widow industry 61 62.00 0 61
This is the long format: each individual has as many records as “presence
ages”. For instance, person No. 1 has four records, for the ages 60–63.
The maximum possible No. of records for one individual is 10. We can check
the distribution of No. of records per person by using the function
tapply:
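A sketch of such a call (the variable name recs is an assumption, chosen to match the output below):

```r
# Count the number of records per individual, then tabulate
# the distribution of those counts.
recs <- with(om1, tapply(id, id, length))
table(recs)
```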
## recs
## 1 2 3 4 5 6 7 8 9 10
## 400 397 351 315 307 250 192 208 146 657
It is easier to get to grips with the distribution with a graph, in this case a barplot, see Figure 6.17.
FIGURE 6.17: Barplot of the number of records per person.
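The plot can be produced with a call like the following (a sketch; the exact plotting code and axis labels are assumptions):

```r
# Barplot of the distribution of records per person (cf. Figure 6.17).
barplot(table(with(om1, tapply(id, id, length))),
        xlab = "Number of records", ylab = "Count")
```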
Now, let us turn om1 into a data frame in wide format. This is
done with the function reshape. First we remove the redundant
variables enter and exit.
om1$exit <- om1$enter <- NULL
om2 <- reshape(om1, v.names = c("event", "civ", "region"),
idvar = "id", direction = "wide",
timevar = "agegrp")
names(om2)
## [1] "id" "sex" "event.60" "civ.60"
## [5] "region.60" "event.61" "civ.61" "region.61"
## [9] "event.62" "civ.62" "region.62" "event.63"
## [13] "civ.63" "region.63" "event.64" "civ.64"
## [17] "region.64" "event.65" "civ.65" "region.65"
## [21] "event.66" "civ.66" "region.66" "event.67"
## [25] "civ.67" "region.67" "event.68" "civ.68"
## [29] "region.68" "event.69" "civ.69" "region.69"
Here there are two time-fixed variables, id and sex, and three
time-varying variables, event, civ, and region. The
time-varying variables have suffix of the type .xx, where xx varies
from 60 to 69.
This is how data in wide format usually show up; the suffix may start with
something else than ., but it must be a single character, or nothing.
The real problem is how to switch from wide format to long, because our
survival analysis tools want it that way. The solution is to use
reshape again, but with other specifications.
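A call that does this (a sketch; varying = 3:32 assumes the thirty time-varying columns occupy positions 3–32, as in the names listing above):

```r
# Back from wide to long: reshape splits the column names on "."
# into a base name (event, civ, region) and a time value (60-69).
om3 <- reshape(om2, direction = "long", idvar = "id",
               varying = 3:32, sep = ".")
head(om3)
```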
## id sex time event civ region
## 1.60 1 male 60 0 widow rural
## 2.60 2 female 60 0 widow industry
## 3.60 3 male 60 0 married town
## 4.60 4 male 60 0 married town
## 5.60 5 female 60 0 married town
## 6.60 6 female 60 0 married town
There is a new variable time created, which goes from 60 to 69, one
step for each of the ages. We would
like to have the file sorted primarily by id and secondary by time.
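The sorting can be done with order (a sketch, not shown in the source):

```r
# Sort primarily by id, secondarily by time.
om3 <- om3[order(om3$id, om3$time), ]
head(om3, 11)
```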
## id sex time event civ region
## 1.60 1 male 60 0 widow rural
## 1.61 1 male 61 0 widow rural
## 1.62 1 male 62 0 widow rural
## 1.63 1 male 63 0 widow rural
## 1.64 1 male 64 NA <NA> <NA>
## 1.65 1 male 65 NA <NA> <NA>
## 1.66 1 male 66 NA <NA> <NA>
## 1.67 1 male 67 NA <NA> <NA>
## 1.68 1 male 68 NA <NA> <NA>
## 1.69 1 male 69 NA <NA> <NA>
## 2.60 2 female 60 0 widow industry
Note that all individuals got 10 records here, even those who are observed for fewer years. Individual No. 1 is only observed for the ages 60–63, so the following six records are redundant; they will not be used in an analysis if kept, but from a practical point of view it is a good idea to remove them.
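One way to do the removal is to keep only records with a non-missing event (a sketch):

```r
# Record count before and after dropping the NA filler records.
nrow(om3)
om3 <- om3[!is.na(om3$event), ]
nrow(om3)
```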
## [1] 32230
## [1] 17434
The data frame shrank to almost half its original size. First, let us summarize the data.
## id sex time
## Min. : 1 male : 7045 Min. :60.0
## 1st Qu.: 655 female:10389 1st Qu.:61.0
## Median :1267 Median :63.0
## Mean :1328 Mean :63.1
## 3rd Qu.:1948 3rd Qu.:65.0
## Max. :3223 Max. :69.0
## event civ region
## Min. :0.0000 unmarried: 1565 town :2485
## 1st Qu.:0.0000 married :11380 industry:5344
## Median :0.0000 widow : 4489 rural :9605
## Mean :0.0259
## 3rd Qu.:0.0000
## Max. :1.0000
The key variables in the discrete time analysis are event and
time. For the baseline hazard we need one parameter per value of
time, so it is practical to transform the continuous variable time
to a factor.
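A sketch of the conversion:

```r
# One baseline parameter per age: treat time as a factor.
om3$time <- factor(om3$time)
summary(om3)
```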
## id sex time
## Min. : 1 male : 7045 60 :3223
## 1st Qu.: 655 female:10389 61 :2823
## Median :1267 62 :2426
## Mean :1328 63 :2075
## 3rd Qu.:1948 64 :1760
## Max. :3223 65 :1453
## (Other):3674
## event civ region
## Min. :0.0000 unmarried: 1565 town :2485
## 1st Qu.:0.0000 married :11380 industry:5344
## Median :0.0000 widow : 4489 rural :9605
## Mean :0.0259
## 3rd Qu.:0.0000
## Max. :1.0000
##
The summary now produces a frequency table for time.
For a given time point and a given individual, the response is whether an
event has occurred or not, that is, it is modeled as a
Bernoulli outcome, which is a special
case of the binomial distribution.
The discrete time analysis may now be performed in several ways. Most
straightforward is to run a binary regression with event as
response through the basic glm function with
family = binomial(link = cloglog). The so-called
cloglog (complementary log-log) link is
used in order to preserve the proportional hazards property.
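Why the cloglog link preserves proportionality can be sketched as follows (a standard argument, not spelled out in the source). If the continuous-time hazard is proportional,

$$
h(t \mid \mathbf{x}) = h_0(t)\, e^{\mathbf{x}\boldsymbol{\beta}},
$$

then the probability of an event in the interval \((t-1, t]\), given survival to \(t-1\), is

$$
p_t(\mathbf{x}) = 1 - \exp\!\Bigl(-e^{\mathbf{x}\boldsymbol{\beta}} \int_{t-1}^{t} h_0(u)\,du\Bigr),
$$

so that

$$
\log\bigl\{-\log\bigl(1 - p_t(\mathbf{x})\bigr)\bigr\}
 = \mathbf{x}\boldsymbol{\beta} + \alpha_t,
\qquad \alpha_t = \log \int_{t-1}^{t} h_0(u)\,du.
$$

That is, on the cloglog scale the discrete hazard is linear in the covariates with the same coefficients \(\boldsymbol{\beta}\) as in the underlying continuous proportional hazards model.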
fit.glm <- glm(event ~ sex + civ + region + time,
family = binomial(link = cloglog), data = om3)
summary(fit.glm)
##
## Call:
## glm(formula = event ~ sex + civ + region + time, family = binomial(link = cloglog),
## data = om3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.426 -0.243 -0.218 -0.194 2.950
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.4213 0.2117 -16.16 < 2e-16
## sexfemale -0.3633 0.0993 -3.66 0.00025
## civmarried -0.3368 0.1560 -2.16 0.03085
## civwidow -0.1901 0.1681 -1.13 0.25817
## regionindustry 0.1078 0.1444 0.75 0.45526
## regionrural -0.2242 0.1403 -1.60 0.11008
## time61 0.0490 0.1851 0.26 0.79137
## time62 0.4600 0.1740 2.64 0.00819
## time63 0.0559 0.2020 0.28 0.78213
## time64 0.4632 0.1888 2.45 0.01416
## time65 0.3175 0.2085 1.52 0.12782
## time66 0.4558 0.2123 2.15 0.03175
## time67 0.9151 0.1955 4.68 2.9e-06
## time68 0.6855 0.2258 3.04 0.00239
## time69 0.6054 0.2490 2.43 0.01506
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4186.8 on 17433 degrees of freedom
## Residual deviance: 4125.2 on 17419 degrees of freedom
## AIC: 4155
##
## Number of Fisher Scoring iterations: 7
This output is not so pleasant, but we can anyway see that females (as
usual) have lower mortality than males, that the married are better off than
the unmarried, and that the regional differences are perhaps not so large. To get a
better understanding of the statistical significance of the findings, we run
drop1 on the fit.
## Single term deletions
##
## Model:
## event ~ sex + civ + region + time
## Df Deviance AIC LRT Pr(>Chi)
## <none> 4125 4155
## sex 1 4139 4167 13.3 0.00027
## civ 2 4130 4156 5.0 0.08300
## region 2 4136 4162 10.5 0.00518
## time 9 4161 4173 36.0 4e-05
Mildly surprisingly, civil status is not all that statistically
significant, but region (and the other variables) are. The strong
significance of the time variable is of course expected; mortality is
expected to increase with increasing age.
An equivalent way, with a nicer printed output, is to use the function
glmmboot in the package glmmML.
fit.boot <- glmmML::glmmboot(event ~ sex + civ + region, cluster = time,
family = binomial(link = cloglog),
data = om3)
fit.boot
##
## Call: glmmML::glmmboot(formula = event ~ sex + civ + region, family = binomial(link = cloglog), data = om3, cluster = time)
##
##
## coef se(coef) z Pr(>|z|)
## sexfemale -0.363 0.0993 -3.659 0.00025
## civmarried -0.337 0.1560 -2.158 0.03100
## civwidow -0.190 0.1681 -1.131 0.26000
## regionindustry 0.108 0.1445 0.746 0.46000
## regionrural -0.224 0.1404 -1.597 0.11000
##
## Residual deviance: 4270 on 17419 degrees of freedom AIC: 4300
The parameter estimates corresponding to time are contained in the
component fit.boot$frail. They need to be transformed to get the “baseline
hazards”.
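Assuming fit.boot$frail holds the time-specific intercepts on the cloglog scale, the transformation is the inverse cloglog (a sketch):

```r
# Inverse cloglog: discrete baseline hazard per age interval.
hazards <- 1 - exp(-exp(fit.boot$frail))
round(hazards, 5)
```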
## [1] 0.03164 0.03317 0.04921 0.03339 0.04936 0.04295
## [7] 0.04901 0.07542 0.06089 0.05647
A plot of the hazard function is shown in Figure 6.18.
FIGURE 6.18: Baseline hazards, old age mortality.
By some data manipulation we can also use eha for the analysis. For
that to succeed we need intervals as responses, and the way of doing that
is to add two variables, exit and enter. The latter must be
slightly smaller than the former:
om3$exit <- as.numeric(as.character(om3$time))
om3$enter <- om3$exit - 0.5
fit.ML <- coxreg(Surv(enter, exit, event) ~ sex + civ + region,
method = "ml", data = om3)
fit.ML
## Call:
## coxreg(formula = Surv(enter, exit, event) ~ sex + civ + region,
## data = om3, method = "ml")
##
## Covariate Mean Coef Rel.Risk S.E. Wald p
## sex
## male 0.404 0 1 (reference)
## female 0.596 -0.363 0.695 0.099 0.000
## civ
## unmarried 0.090 0 1 (reference)
## married 0.653 -0.337 0.714 0.156 0.031
## widow 0.257 -0.190 0.827 0.168 0.258
## region
## town 0.143 0 1 (reference)
## industry 0.307 0.108 1.114 0.144 0.455
## rural 0.551 -0.224 0.799 0.140 0.110
##
## Events 451
## Total time at risk 8717
## Max. log. likelihood -2062.6
## LR test statistic 27.23
## Degrees of freedom 5
## Overall p-value 5.14285e-05
Plots of the cumulative hazards and the survival function are easily achieved, see Figures 6.19 and 6.20.
FIGURE 6.19: The cumulative hazards, from the coxreg fit.
FIGURE 6.20: The survival function, from the coxreg fit.
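Calls like the following can produce these plots (a sketch, assuming the plot method for coxreg objects in eha accepts the fn argument for choosing what to draw):

```r
# Side by side: cumulative hazards and survival function from the fit.
op <- par(mfrow = c(1, 2))
plot(fit.ML, fn = "cum")   # cumulative hazards (Figure 6.19)
plot(fit.ML, fn = "surv")  # survival function (Figure 6.20)
par(op)
```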
Finally, the proportional hazards assumption can be tested in the discrete
time framework by creating an interaction between time and the
covariates in question. It is only possible by using glm.
fit2.glm <- glm(event ~ (sex + civ + region) * time,
family = binomial(link = cloglog),
data = om3)
drop1(fit2.glm, test = "Chisq")
## Single term deletions
##
## Model:
## event ~ (sex + civ + region) * time
## Df Deviance AIC LRT Pr(>Chi)
## <none> 4077 4197
## sex:time 9 4088 4190 11.1 0.27
## civ:time 18 4100 4184 22.4 0.21
## region:time 18 4094 4178 16.6 0.55
There is no sign of non-proportionality.