3.10 Doing it in R

We utilize the male mortality data, Skellefteå 1800–1820, to illustrate the aspects of an ordinary Cox regression. Males aged 20 (exact) are sampled between 1820 and 1840 and followed to death or reaching the age of 40, when they are right censored. So in effect male mortality in the ages 20–40 are studied. The survival time starts counting at zero for each individual at his 21st birthdate (that is, at exact age 20 years).

There are two R functions that are handy for a quick look at a data frame, str and head. The first give a summary description of the structure of an R object, often a data frame

library(eha) #Loads also the data frame 'mort'
str(mort)

## 'data.frame':	1208 obs. of  6 variables:
##  $ id       : int  1 2 3 3 4 5 5 6 7 7 ...
##  $ enter    : num  0 3.48 0 13.46 0 ...
##  $ exit     : num  20 17.6 13.5 20 20 ...
##  $ event    : int  0 1 0 0 0 0 0 0 0 1 ...
##  $ birthdate: num  1800 1800 1800 1800 1800 ...
##  $ ses      : Factor w/ 2 levels "lower","upper": 2 1 2 1 1 1 2 2 2 1 ...

First, there is information about the data frame: It is a data.frame, with six variables measured on 1208 objects. Then each variable is individually described: name, type, and a few of the first values. The values are usually rounded to a few digits. The Factor line is worth noticing: It is of course a factor covariate, it takes two levels, lower and upper. Internally, the levels are coded 1, 2, respectively. Then (by default), lower (with internal code equal to 1) is the reference category. If desired, this can be changed with the relevel function.

mort$ses2 <- relevel(mort$ses, ref = "upper")
str(mort)

## 'data.frame':	1208 obs. of  7 variables:
##  $ id       : int  1 2 3 3 4 5 5 6 7 7 ...
##  $ enter    : num  0 3.48 0 13.46 0 ...
##  $ exit     : num  20 17.6 13.5 20 20 ...
##  $ event    : int  0 1 0 0 0 0 0 0 0 1 ...
##  $ birthdate: num  1800 1800 1800 1800 1800 ...
##  $ ses      : Factor w/ 2 levels "lower","upper": 2 1 2 1 1 1 2 2 2 1 ...
##  $ ses2     : Factor w/ 2 levels "upper","lower": 1 2 1 2 2 2 1 1 1 2 ...

Comparing the information about the variables ses and ses2, you find that the codings, and also the reference categories, are reversed. Otherwise the variables are identical, and any analysis involving any of them as an explanatory variable would lead to identical conclusions, but not identical parameter estimates, see below.

The function head prints the first few lines (observations) of a data frame (there is also a corresponding tail function that prints a few of the last rows).

head(mort)

##   id  enter   exit event birthdate   ses  ses2
## 1  1  0.000 20.000     0  1800.010 upper upper
## 2  2  3.478 17.562     1  1800.015 lower lower
## 3  3  0.000 13.463     0  1800.031 upper upper
## 4  3 13.463 20.000     0  1800.031 lower lower
## 5  4  0.000 20.000     0  1800.064 lower lower
## 6  5  0.000  0.089     0  1800.084 lower lower

Apparently, the variables ses and ses2 are identical, but we know that behind the scenes they are formally different; different coding and (therefore) different reference category. This shows up in analyzes involving the two variables, one at a time.

First the analysis is run with ses.

res <- coxreg(Surv(enter, exit, event) ~ ses + birthdate, 
              data = mort)
res

## Call:
## coxreg(formula = Surv(enter, exit, event) ~ ses + birthdate, 
##     data = mort)
## 
## Covariate           Mean       Coef     Rel.Risk   S.E.    Wald p
## ses 
##            lower    0.416     0         1 (reference)
##            upper    0.584    -0.484     0.616     0.121     0.000 
## birthdate        1811.640     0.029     1.029     0.011     0.008 
## 
## Events                    276 
## Total time at risk         17038 
## Max. log. likelihood      -1841.7 
## LR test statistic         23.08 
## Degrees of freedom        2 
## Overall p-value           9.72167e-06

Then the variable ses2 is inserted instead.

res2 <- coxreg(Surv(enter, exit, event) ~ ses2 + birthdate, 
               data = mort)
res2

## Call:
## coxreg(formula = Surv(enter, exit, event) ~ ses2 + birthdate, 
##     data = mort)
## 
## Covariate           Mean       Coef     Rel.Risk   S.E.    Wald p
## ses2 
##            upper    0.584     0         1 (reference)
##            lower    0.416     0.484     1.623     0.121     0.000 
## birthdate        1811.640     0.029     1.029     0.011     0.008 
## 
## Events                    276 
## Total time at risk         17038 
## Max. log. likelihood      -1841.7 
## LR test statistic         23.08 
## Degrees of freedom        2 
## Overall p-value           9.72167e-06

Notice the difference; it is only formal. The substantial results are completely equivalent.

3.10.1 Likelihood Ratio Test

If we judge the result by the Wald \(p\)-values, it seems as if both variables are highly significant. To be sure it is advisable to apply the likelihood ratio test. In R, this is easily achieved by applying the drop1 function on the result, res, of the Cox regression.

drop1(res, test = "Chisq")

## Single term deletions
## 
## Model:
## Surv(enter, exit, event) ~ ses + birthdate
##           Df    AIC     LRT  Pr(>Chi)
## <none>       3687.3                  
## ses        1 3701.4 16.1095 5.978e-05
## birthdate  1 3692.6  7.2752  0.006991

In fact, we have performed two likelihood ratio tests here. First, the effect of removing ses from the full model. The \(p\)-value for this test is very small, \(p = 0.00005998\). The scientific notation, 5.998e-05 means that the decimal point should be moved five steps to the left (that’s the minus sign). The reason for this notation is essentially space-saving. Then, the second LR test concerns the absence or presence of birthdate in the full model. This also gives a small \(p\)-value, \(p = 0.006972\).

The earlier conclusion is not altered; both variables are highly significant.

3.10.2 The estimated baseline cumulative hazard function

The estimated baseline cumulative function (see Equation (3.7)) is best plotted by the plot command, see Figure 3.10.

FIGURE 3.10: Estimated cumulative baseline hazard function.

The figure shows the baseline cumulative hazard function for an average individual (i.e., with the value zero on all covariates when all covariates are centered around their weighted means.