9.5 Matching
Matching is a way of eliminating the effect of confounders and therefore important to discuss in connection with causality. One reason for matching is the wish to follow the causal inference paradigm with counterfactuals. Another reason is that the exposure we want to study is very rare in the population. It will then be inefficient to take a simple random sample from the population; in order to get enough cases in the sample, the sample size needs to be very large. On the other hand, today’s register based research tends to analyze whole populations, and the limitations in terms of computer memory and process power are more or less gone. Nevertheless, small properly matched data sets may contain more information about the specific question at hand than whole register!
When creating a matched data set, you have to decide how many controls you want per case. In the true counterfactual paradigm in “causal inference”, it is common practice to choose one control per case, to “fill in” the unobservable in the pair of counterfactuals. We first look at the case with matched pairs, then a case study with two controls per case.
9.5.1 Paired data
Certain kinds of observations come naturally in pairs. The most obvious situation is data from twin studies. Many countries keep registers of twins, and the obvious advantage with twin data is that it is possible to control for genetic variation; monozygotic twins have identical genes. Humans have pairs of many body parts; arms, legs, eyes, ears, etc. This can be utilized in medical studies concerning comparison of treatments, the two in a pair simply get one treatment each.
In a survival analysis context, pairs are followed over time and it is noted who first experience the event of interest. In each pair, one is treated and the other is a control, but otherwise they are considered more or less identical. Right censoring may result in that it is impossible to decide who in the pair experienced the event first. Such pairs are simply discarded in the analysis.
The model is \[\begin{equation} h_i(t; x) = h_{0i}(t) e^{\beta x}, \end{equation}\] where \(x\) is treatment (0-1) and \(i\) is pair No. Note that each pair has its own baseline hazard function, which means that the proportional hazards assumption is only required within pairs. This is a special form of stratified analysis in that each stratum only contains two individuals, a case and its control. If we denote by \(T_{i1}\) the life length of the case in pair No. i and \(T_{i0}\) the life length of the corresponding control, we get
\[\begin{equation} P(T_{1i} < T_{0i}) = \frac{e^\beta}{1 + e^\beta}, \quad i = 1, \ldots, n, \tag{9.1} \end{equation}\] and the result of the analysis is simply a study of binomial data; how many times did the case die before the control? We can estimate this probability, and also test the null hypothesis that it is equal to one half, which corresponds to \(\beta = 0\) in (9.1).
9.5.2 More than one control
The situation with more than one control per case is as simple as the paired data case to handle. Simply stratify, with one case and its controls per stratum. It is also possible to have a varying number of controls per case.
As an example where two controls per case were used, let us see how a study of maternal death and its effect on infant mortality was performed.
This is a study on historical data from northern Sweden, 1820–1895 (Broström 1987). The simple question asked was: How much does the death risk increase for an infant that loses her mother? More precisely, by a maternal death we mean that a mother dies within one year after the birth of her child. In this particular study, only first births were studied. The total number of such births was 5641 and of these 35 resulted in a maternal death (with the infant still alive). Instead of analyzing the full data set, it was decided to use matching. To each case of maternal death, two controls were selected in the following way. At each time an infant changed status from mother alive to mother dead, two controls are selected without replacement from the subset of the current risk set, where all infants have mother alive and not already used as controls and with correct matching characteristics. If a control changes status to case (its mother dies), it is immediately censored as a control and starts as a case with two controls linked to it. However, this situation never occurred in the study.
Let us load the data into R and look at it.
## stratum enter exit event mother age sex parish civst ses year
## 1 1 55 365 0 dead 26 boy Nedertornea married farmer 1877
## 2 1 55 365 0 alive 26 boy Nedertornea married farmer 1870
## 3 1 55 365 0 alive 26 boy Nedertornea married farmer 1882
## 4 2 13 76 1 dead 23 girl Nedertornea married other 1847
## 5 2 13 365 0 alive 23 girl Nedertornea married other 1847
## 6 2 13 365 0 alive 23 girl Nedertornea married other 1848
Here we see the two first triplets in the data set, which consists of 35
triplets, or 105 individuals. Infant No. 1 (the first row) is a case, his
mother died when he was 55 days old. At that moment, two controls were
selected, that is, two boys 55 days of age, and with the same characteristics
as the case. The matching was not completely successful; we had to look a
few years back and forth in calendar time (covariate year). Note also
that in this triplet all infants survived one year of age, so there is no
information of risk differences in that triplet. It will be automatically
removed in the analysis.
The second triplet, on the other hand, will be used, because the case dies at age 76 days, while both controls survive one year of age. This is information suggesting that cases have higher mortality than controls after the fatal event of a maternal death.
The matched data analysis is performed by stratifying on triplets (the
variable stratum).
## Covariate Mean Coef Rel.Risk S.E. LR p
## mother 0.000
## alive 0.763 0 1 (reference)
## dead 0.237 2.605 13.534 0.757
##
## Events 21
## Total time at risk 21616
## Max. log. likelihood -10.815
## LR test statistic 19.65
## Degrees of freedom 1
## Overall p-value 9.31744e-06
The result is that mother’s death increases the death risk of the infant wit a factor 13.5! It statistically very significant, but the number of infant deaths (21) is very small, so the \(p\)-value may be unreliable.
In a stratified analysis it is normally not possible to include covariates that are constant within strata. However, here it is possible to estimate the interaction between a stratum-constant covariate and exposure, mother’s death. However, it is important not to include the main effect corresponding to the stratum-constant covariate! This is a rare exception to the rule that when an interaction term is present, the corresponding main effects must be included in the model.
We investigate whether the effect of mother’s death is modified by her age
by first calculating an interaction term (mage):
Note the use of the function ifelse:
It takes three arguments, the first is a logical expression (resulting in
a logical vector, with values TRUE and FALSE. Note that
in this example, the length is the same as the length of mother (in
infants). For each component that is TRUE, the result is given by
the second argument, and for each component that is {FALSE} the value
is given by the third argument.
Including the created covariate in the analysis gives
fit <- coxreg(Surv(enter, exit, event) ~ mother + mage +
strata(stratum), data = infants)
summary(fit)## Covariate Mean Coef Rel.Risk S.E. LR p
## mother 0.562
## alive 0.763 0 1 (reference)
## dead 0.237 2.916 18.467 4.860
## mage 6.684 -0.012 0.988 0.188 0.949
##
## Events 21
## Total time at risk 21616
## Max. log. likelihood -10.813
## LR test statistic 19.65
## Degrees of freedom 2
## Overall p-value 5.40649e-05
What happened here? The effect of mother’s age is even larger than in the
case without mage, but the statistical significance is gone
altogether. There are two problems with one solution here. First, due to the
inclusion of the interaction, the (main) effect of mother’s death is now
measured at mother’s age equal to 0 (zero!), which of course is completely
nonsensical, and second, the construction makes the two covariates strongly
correlated: When mother is alive, mage is zero, and
when mother is dead,mage` takes a rather large
value. This is a case of collinearity, however
not very severe case, because it is very easy to get rid of.
The solution? Center mother’s age (age)! Recreate the
variables and run the analysis again:
infants$age <- infants$age - mean(infants$age)
infants$mage <- ifelse(infants$mother == "dead", infants$age, 0)
fit1 <- coxreg(Surv(enter, exit, event) ~ mother + mage +
strata(stratum), data = infants)
summary(fit1)## Covariate Mean Coef Rel.Risk S.E. LR p
## mother 0.000
## alive 0.763 0 1 (reference)
## dead 0.237 2.586 13.274 0.809
## mage 0.280 -0.012 0.988 0.188 0.949
##
## Events 21
## Total time at risk 21616
## Max. log. likelihood -10.813
## LR test statistic 19.65
## Degrees of freedom 2
## Overall p-value 5.40649e-05
The two fits are equivalent (look at the Max. log. likelihoods), but with different parametrizations. The second, with a centered mother’s age, is preferred, because the two covariates are close to uncorrelated there.
A subject-matter conclusion in any case is that mother’s age does not affect the effect of her death on the survival chances of her infant.
A general advice: Always center continuous covariates before the analysis! This is especially important in the presence of models with interaction terms, and the advice is valid for all kinds of regression analyzes, not only Cox regression. Common practice is to subtract the mean value, but it is usually a better idea to use a fixed value, a value that remains fixed as subsets of the original data set are analyzed.
References
Broström, G. 1987. “The Influence of Mother’s Mortality on Infant Mortality: A Case Study in Matched Data Survival Analysis.” Scandinavian Journal of Statistics 14: 113–23.