9.5 Matching

Matching is a way of eliminating the effect of confounders and therefore important to discuss in connection with causality. One reason for matching is the wish to follow the causal inference paradigm with counterfactuals. Another reason is that the exposure we want to study is very rare in the population. It will then be inefficient to take a simple random sample from the population; in order to get enough cases in the sample, the sample size needs to be very large. On the other hand, today’s register based research tends to analyze whole populations, and the limitations in terms of computer memory and process power are more or less gone. Nevertheless, small properly matched data sets may contain more information about the specific question at hand than whole register!

When creating a matched data set, you have to decide how many controls you want per case. In the true counterfactual paradigm in “causal inference”, it is common practice to choose one control per case, to “fill in” the unobservable in the pair of counterfactuals. We first look at the case with matched pairs, then a case study with two controls per case.

9.5.1 Paired data

Certain kinds of observations come naturally in pairs. The most obvious situation is data from twin studies. Many countries keep registers of twins, and the obvious advantage with twin data is that it is possible to control for genetic variation; monozygotic twins have identical genes. Humans have pairs of many body parts; arms, legs, eyes, ears, etc. This can be utilized in medical studies concerning comparison of treatments, the two in a pair simply get one treatment each.

In a survival analysis context, pairs are followed over time and it is noted who first experience the event of interest. In each pair, one is treated and the other is a control, but otherwise they are considered more or less identical. Right censoring may result in that it is impossible to decide who in the pair experienced the event first. Such pairs are simply discarded in the analysis.

The model is \[\begin{equation} h_i(t; x) = h_{0i}(t) e^{\beta x}, \end{equation}\] where \(x\) is treatment (0-1) and \(i\) is pair No. Note that each pair has its own baseline hazard function, which means that the proportional hazards assumption is only required within pairs. This is a special form of stratified analysis in that each stratum only contains two individuals, a case and its control. If we denote by \(T_{i1}\) the life length of the case in pair No. i and \(T_{i0}\) the life length of the corresponding control, we get

\[\begin{equation} P(T_{1i} < T_{0i}) = \frac{e^\beta}{1 + e^\beta}, \quad i = 1, \ldots, n, \tag{9.1} \end{equation}\] and the result of the analysis is simply a study of binomial data; how many times did the case die before the control? We can estimate this probability, and also test the null hypothesis that it is equal to one half, which corresponds to \(\beta = 0\) in (9.1).

9.5.2 More than one control

The situation with more than one control per case is as simple as the paired data case to handle. Simply stratify, with one case and its controls per stratum. It is also possible to have a varying number of controls per case.

As an example where two controls per case were used, let us see how a study of maternal death and its effect on infant mortality was performed.

Example 9.1 (Maternal death and infant mortality)

This is a study on historical data from northern Sweden, 1820–1895 (Broström 1987). The simple question asked was: How much does the death risk increase for an infant that loses her mother? More precisely, by a maternal death we mean that a mother dies within one year after the birth of her child. In this particular study, only first births were studied. The total number of such births was 5641 and of these 35 resulted in a maternal death (with the infant still alive). Instead of analyzing the full data set, it was decided to use matching. To each case of maternal death, two controls were selected in the following way. At each time an infant changed status from mother alive to mother dead, two controls are selected without replacement from the subset of the current risk set, where all infants have mother alive and not already used as controls and with correct matching characteristics. If a control changes status to case (its mother dies), it is immediately censored as a control and starts as a case with two controls linked to it. However, this situation never occurred in the study.

Let us load the data into R and look at it.

library(eha)
head(infants)
##   stratum enter exit event mother age  sex      parish   civst    ses year
## 1       1    55  365     0   dead  26  boy Nedertornea married farmer 1877
## 2       1    55  365     0  alive  26  boy Nedertornea married farmer 1870
## 3       1    55  365     0  alive  26  boy Nedertornea married farmer 1882
## 4       2    13   76     1   dead  23 girl Nedertornea married  other 1847
## 5       2    13  365     0  alive  23 girl Nedertornea married  other 1847
## 6       2    13  365     0  alive  23 girl Nedertornea married  other 1848

Here we see the two first triplets in the data set, which consists of 35 triplets, or 105 individuals. Infant No. 1 (the first row) is a case, his mother died when he was 55 days old. At that moment, two controls were selected, that is, two boys 55 days of age, and with the same characteristics as the case. The matching was not completely successful; we had to look a few years back and forth in calendar time (covariate year). Note also that in this triplet all infants survived one year of age, so there is no information of risk differences in that triplet. It will be automatically removed in the analysis.

The second triplet, on the other hand, will be used, because the case dies at age 76 days, while both controls survive one year of age. This is information suggesting that cases have higher mortality than controls after the fatal event of a maternal death.

The matched data analysis is performed by stratifying on triplets (the variable stratum).

fit <- coxreg(Surv(enter, exit, event) ~ mother + strata(stratum), 
data = infants)
summary(fit)
## Covariate           Mean       Coef     Rel.Risk   S.E.    LR p
## mother                                                     0.000 
##            alive    0.763     0         1 (reference)
##             dead    0.237     2.605    13.534     0.757
## 
## Events                    21 
## Total time at risk         21616 
## Max. log. likelihood      -10.815 
## LR test statistic         19.65 
## Degrees of freedom        1 
## Overall p-value           9.31744e-06

The result is that mother’s death increases the death risk of the infant wit a factor 13.5! It statistically very significant, but the number of infant deaths (21) is very small, so the \(p\)-value may be unreliable.

In a stratified analysis it is normally not possible to include covariates that are constant within strata. However, here it is possible to estimate the interaction between a stratum-constant covariate and exposure, mother’s death. However, it is important not to include the main effect corresponding to the stratum-constant covariate! This is a rare exception to the rule that when an interaction term is present, the corresponding main effects must be included in the model.

We investigate whether the effect of mother’s death is modified by her age by first calculating an interaction term (mage):

infants$mage <- ifelse(infants$mother == "dead", infants$age, 0)

Note the use of the function ifelse: It takes three arguments, the first is a logical expression (resulting in a logical vector, with values TRUE and FALSE. Note that in this example, the length is the same as the length of mother (in infants). For each component that is TRUE, the result is given by the second argument, and for each component that is {FALSE} the value is given by the third argument.

Including the created covariate in the analysis gives

fit <- coxreg(Surv(enter, exit, event) ~ mother + mage + 
              strata(stratum), data = infants)
summary(fit)
## Covariate           Mean       Coef     Rel.Risk   S.E.    LR p
## mother                                                     0.562 
##            alive    0.763     0         1 (reference)
##             dead    0.237     2.916    18.467     4.860
## mage                6.684    -0.012     0.988     0.188    0.949 
## 
## Events                    21 
## Total time at risk         21616 
## Max. log. likelihood      -10.813 
## LR test statistic         19.65 
## Degrees of freedom        2 
## Overall p-value           5.40649e-05

What happened here? The effect of mother’s age is even larger than in the case without mage, but the statistical significance is gone altogether. There are two problems with one solution here. First, due to the inclusion of the interaction, the (main) effect of mother’s death is now measured at mother’s age equal to 0 (zero!), which of course is completely nonsensical, and second, the construction makes the two covariates strongly correlated: When mother is alive, mage is zero, and when mother is dead,mage` takes a rather large value. This is a case of collinearity, however not very severe case, because it is very easy to get rid of.

The solution? Center mother’s age (age)! Recreate the variables and run the analysis again:

infants$age <- infants$age - mean(infants$age)
infants$mage <- ifelse(infants$mother == "dead", infants$age, 0)
fit1 <- coxreg(Surv(enter, exit, event) ~ mother + mage + 
              strata(stratum), data = infants)
summary(fit1)
## Covariate           Mean       Coef     Rel.Risk   S.E.    LR p
## mother                                                     0.000 
##            alive    0.763     0         1 (reference)
##             dead    0.237     2.586    13.274     0.809
## mage                0.280    -0.012     0.988     0.188    0.949 
## 
## Events                    21 
## Total time at risk         21616 
## Max. log. likelihood      -10.813 
## LR test statistic         19.65 
## Degrees of freedom        2 
## Overall p-value           5.40649e-05

The two fits are equivalent (look at the Max. log. likelihoods), but with different parametrizations. The second, with a centered mother’s age, is preferred, because the two covariates are close to uncorrelated there.

A subject-matter conclusion in any case is that mother’s age does not affect the effect of her death on the survival chances of her infant.

A general advice: Always center continuous covariates before the analysis! This is especially important in the presence of models with interaction terms, and the advice is valid for all kinds of regression analyzes, not only Cox regression. Common practice is to subtract the mean value, but it is usually a better idea to use a fixed value, a value that remains fixed as subsets of the original data set are analyzed.

References

Broström, G. 1987. “The Influence of Mother’s Mortality on Infant Mortality: A Case Study in Matched Data Survival Analysis.” Scandinavian Journal of Statistics 14: 113–23.