A Basic Statistical Concepts
The statistical concepts that are important for understanding what is going on in this book are gathered here, but only briefly treated. The interested reader who wants a deeper understanding should have no problem finding suitable textbooks; there are, for instance, recent texts that teach statistics and how to use it in R (Dalgaard 2008).
A.1 Statistical Inference
Statistical inference is the science that helps us draw conclusions about real world phenomena by observing and analyzing samples from them. The theory rests on probability theory and the concept of random sampling. A statistical analysis never gives absolute truths, only statements coupled to certain measures of their validity. These measures are almost always probability statements.
The crucial concept is that of a model, despite the fact that the present trend in statistical inference is toward nonparametric statistics. It is often stated that with today’s huge data sets, statistical models are unnecessary, but nothing could be more wrong.
The important idea in a statistical model is the concept of a parameter. It is often confused with its estimator from data. For instance, when we talk about mortality in a population, it is a hypothetical concept that is different from the ratio between the observed number of deaths and the population size (or any other measure based on data). The latter is an estimate (at best) of the former. The whole idea of statistical inference is to extract information about a population parameter from observed data.
A.1.1 Point estimation
The task in point estimation is to find the best guess (in some sense) of a population parameter from data. That is, we try to find the single value that is closest to the true, but unknown, value of the population parameter.
Of course, a point estimator is useless if it is not connected to some measure of its uncertainty. That takes us to the concept of interval estimation.
A.1.2 Interval estimation
The philosophy behind interval estimation is that a guess of a single value of the unknown population parameter is useless without an accompanying measure of the uncertainty of that guess. A confidence interval is an interval which, we say, covers the true value of the population parameter with a certain probability (often 95 per cent).
A.1.3 Hypothesis testing
We are often interested in a specific value of a parameter, and in regression problems this value is almost always zero (0). The reason is that regression parameters measure effects, and to test for no effect is then equivalent to testing that the corresponding parameter has value zero.
There is a connection between interval estimation and hypothesis testing: To test the hypothesis that a parameter value is zero can be done through constructing a confidence interval for the parameter. The test rule is then: If the interval does not cover zero, reject the hypothesis, otherwise do not.
A.1.3.1 The log-rank test
The general hypothesis testing theory behind the log-rank test builds on the hypergeometric distribution. The calculations under the null hypothesis of no difference in survival chances between the two groups are performed conditional on both margins. In Table A.1, if the margins are fixed there is only one degree of freedom left; for a given value of (say) \(d_1\), the three values \(d_2\), \((n_1 - d_1)\), and \((n_2 - d_2)\) are determined.
TABLE A.1: Deaths and survivors in the two groups.

Group | Deaths | Survivors | Total |
---|---|---|---|
I | \(d_1\) | \(n_1 - d_1\) | \(n_1\) |
II | \(d_2\) | \(n_2 - d_2\) | \(n_2\) |
Total | \(d\) | \(n - d\) | \(n\) |
Utilizing the fact that, under the null hypothesis, \(d_1\) is hypergeometrically distributed results in the following algorithm for calculating a test statistic:
1. Observe \(O = d_1\).
2. Calculate the expected value \(E\) of \(O\) (under the null):
\[ E = d \frac{n_1}{n}. \]
3. Calculate the variance \(V\) of \(O\) (under the null):
\[V = \frac{(n - d) d n_1 n_2}{n^2 (n - 1)}.\]
4. Repeat steps 1 to 3 for all tables (one for each observed event time) and aggregate according to equation (A.1).
The log-rank test statistic \(T\) is
\[\begin{equation} T = \frac{\sum_{i=1}^k \left(O_i - E_i\right)}{\sqrt{\sum_{i=1}^k V_i}} \tag{A.1} \end{equation}\]
Note carefully that this procedure is not equivalent to aggregating all tables of raw data!
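For concreteness, here is a minimal sketch of the computation in R; the data frame `tabs` is a hypothetical input, not part of any package, holding one row per event time with the counts \(d_1\), \(d_2\), \(n_1\), \(n_2\) of Table A.1.

```r
## Minimal sketch of the log-rank computation from per-event-time tables.
## 'tabs' is a hypothetical data frame with columns d1, d2, n1, n2.
logrank <- function(tabs) {
    with(tabs, {
        d <- d1 + d2
        n <- n1 + n2
        O <- d1                                        # observed deaths, group I
        E <- d * n1 / n                                # expected under the null
        V <- (n - d) * d * n1 * n2 / (n^2 * (n - 1))   # hypergeometric variance
        stat <- sum(O - E) / sqrt(sum(V))              # equation (A.1)
        c(T = stat, T.squared = stat^2,
          p = pchisq(stat^2, df = 1, lower.tail = FALSE))
    })
}
```

In practice the test is of course computed by standard software, for instance with survdiff in the survival package.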
Properties of the log-rank test:
- The test statistic \(T^2\) is approximately distributed as \(\chi^2(1)\).
- It is available in most statistical software.
- It can be generalized to comparisons of more than two groups.
- For \(s\) groups, the test statistic is approximately \(\chi^2(s-1)\).
- The test has high power against alternatives with proportional hazards, but can be weak against non-proportional alternatives.
A.2 Presentation of Results
The presentation of the results of fitting a regression model to given data is usually done both graphically and in tabular form. Some general guidelines that must be followed are given here. In order to make the presentation easier to understand, a fictive research project, based on real data, is followed.
A.2.1 The project
In a mortality study of the small town of Umeå in northern Sweden during the 20th century, we are investigating the mortality among persons aged 50 and above. We are especially interested in whether there were any differences in mortality between social branches. For this purpose, the data set `ume`, with selected variables, is used. A summary of the data is shown in Table ??. As you can see, the data are tabulated, so a piecewise constant hazard model is called for.
A.2.2 Tabular presentation
We fit a simple model with only one covariate, but select only married men. This can be done with the help of Poisson regression and the `glm` function:
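The following is a sketch of the call that could produce the output below; the data frame name umetab and the variable names are taken directly from the printed call.

```r
## Poisson regression for the tabulated data; 'umetab' and the variable
## names are taken from the printed call below.
fit <- glm(event ~ offset(log(exposure)) + age + socBranch,
           family = "poisson", data = umetab,
           subset = sex == "male" & civst == "married")
summary(fit)
```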
Call:
glm(formula = event ~ offset(log(exposure)) + age + socBranch,
family = "poisson", data = umetab, subset = sex == "male" &
civst == "married")
Deviance Residuals:
Min 1Q Median 3Q Max
-2.7219 -0.9084 -0.1202 0.6842 2.5910
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.41828 0.09766 -45.240 < 2e-16
age53-56 0.33971 0.11042 3.077 0.00209
age56-59 0.51119 0.10978 4.657 3.21e-06
age59-62 0.83266 0.10635 7.830 4.89e-15
age62-65 1.10338 0.10483 10.526 < 2e-16
age65-68 1.38179 0.10395 13.292 < 2e-16
age68-71 1.80158 0.10196 17.669 < 2e-16
age71-74 1.96600 0.10539 18.655 < 2e-16
age74-77 2.16921 0.11027 19.672 < 2e-16
age77-80 2.58718 0.11432 22.632 < 2e-16
age80-83 2.66962 0.13539 19.718 < 2e-16
age83-86 3.01993 0.15727 19.202 < 2e-16
age86-89 3.38008 0.19776 17.092 < 2e-16
age89-92 3.26400 0.36312 8.989 < 2e-16
age92-95 3.11514 0.71204 4.375 1.21e-05
socBranchfarming -0.40721 0.06612 -6.158 7.35e-10
socBranchbusiness 0.03182 0.09226 0.345 0.73017
socBranchworker -0.32630 0.07125 -4.580 4.65e-06
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 1752.29 on 192 degrees of freedom
Residual deviance: 240.43 on 175 degrees of freedom
AIC: 969.83
Number of Fisher Scoring iterations: 5
Besides the slightly disturbing fact that all age estimates are printed, the presentation of the result for the covariate of interest, `socBranch`, is less satisfactory from a “survival analysis” point of view: we are generally not interested in looking at the estimated baseline in tabular form (we want a graph). There is a more general critique of the presentation: too many p-values are given, and not the relevant ones either.
In `eha`, the function `tpchreg` does the same thing, but with a more satisfactory presentation of the result; see Table A.2.
TABLE A.2: Social branch and old age mortality, married men.

| Covariate | level | W_mean | Coef | HR | SE | LR_p |
|---|---|---|---|---|---|---|
| socBranch |  |  |  |  |  | 0.0000 |
|  | official | 0.109 | 0 | 1 |  |  |
|  | farming | 0.520 | -0.4072 | 0.6655 | 0.0661 |  |
|  | business | 0.072 | 0.0318 | 1.0323 | 0.0923 |  |
|  | worker | 0.299 | -0.3263 | 0.7216 | 0.0712 |  |

Max Log Likelihood: -9685.8
Restricted mean: 20.543
First, we got rid of the printing of the numerous baseline hazard estimates; second, the factor `socBranch` is presented in full, with all four levels; and finally, only the relevant p-value is given. This matters because, first, Wald p-values are notoriously unreliable in nonlinear regression, and second, we want p-values for covariates and not for levels of covariates.
However, note that the essential findings with the two approaches are identical.
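For reference, here is a sketch of how a table like Table A.2 could be produced with tpchreg. The response is built with eha's oe (occurrence/exposure) function; that the age factor can be used directly as the time argument for these tabulated data is an assumption.

```r
library(eha)
## Restrict to married men as in the glm fit above (variable names are
## assumptions based on the printed call).
marmen <- umetab[umetab$sex == "male" & umetab$civst == "married", ]
fit.tp <- tpchreg(oe(event, exposure) ~ socBranch, time = age, data = marmen)
summary(fit.tp)
```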
A.2.3 Graphics
The baseline hazard function is shown in Figure A.1.
FIGURE A.1: The hazard function for old age mortality in Umeå, first half of 20th century.
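A sketch of how such a figure could be drawn, assuming (as for other eha model fits) that a plot method is available for the fitted object fit.tp from the sketch above.

```r
## Assumed plot method for the piecewise constant hazard fit above.
plot(fit.tp, xlab = "Age", ylab = "Baseline hazard")
```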
A.3 Asymptotic Theory
A.3.1 Partial likelihood
Here is a very brief summary of the asymptotics concerning the partial likelihood. Once defined, it turns out that you may treat it as an ordinary likelihood function. The setup is as follows.
Let \(t_{(1)}, t_{(2)}, \ldots, t_{(k)}\) be the ordered observed event times and let \(R_i = R(t_{(i)})\) be the risk set at \(t_{(i)}, \; i = 1, \ldots, k\). At \(t_{(i)}\), condition with respect to the composition of \(R_i\) and on the fact that one event occurred (for tied events, a correction is necessary).
Then the contribution to the partial likelihood from \(t_{(i)}\) is
\[\begin{multline*} L_i(\boldsymbol{\beta}) = P(\mbox{No. $m_i$ dies} \mid \mbox{one event occurs}, R_i) \\ = \frac{h_0(t_{(i)}) \exp(\boldsymbol{\beta} \mathbf{x}_{m_i})} {\sum_{\ell \in R_i}h_0(t_{(i)})\exp(\boldsymbol{\beta} \mathbf{x}_\ell)} = \frac{\exp(\boldsymbol{\beta} \mathbf{x}_{m_i})} {\sum_{\ell \in R_i}\exp(\boldsymbol{\beta} \mathbf{x}_\ell)} \end{multline*}\] and the full partial likelihood is
\[\begin{equation*} L(\boldsymbol{\beta}) = \prod_{i=1}^k L_i(\boldsymbol{\beta}) = \prod_{i=1}^k \frac{ \exp(\boldsymbol{\beta} \mathbf{x}_{m_i})} {\sum_{\ell \in R_i}\exp(\boldsymbol{\beta} \mathbf{x}_\ell)} \end{equation*}\] This is where the doubt about the partial likelihood comes in; the conditional probabilities multiplied together do not have a proper interpretation as a conditional probability. Nevertheless, it is prudent to proceed as if the expression really is a likelihood function. The log partial likelihood becomes
\[\begin{equation}\label{eq:logplA} \log\big(L(\boldsymbol{\beta})\big) = \sum_{i=1}^k \left\{\boldsymbol{\beta} \mathbf{x}_{m_i} - \log\left(\sum_{\ell \in R_i} \exp(\boldsymbol{\beta} \mathbf{x}_\ell)\right)\right\}, \end{equation}\] and the components of the score vector are
\[\begin{equation}\label{eq:scoreA} \frac{\partial}{\partial \beta_j} \log L(\boldsymbol{\beta}) = \sum_{i=1}^k x_{m_i j} - \sum_{i=1}^k \frac{\sum_{\ell \in R_i} x_{\ell j} \exp(\boldsymbol{\beta} \mathbf{x}_\ell)} {\sum_{\ell \in R_i} \exp(\boldsymbol{\beta} \mathbf{x}_\ell)}, \quad j = 1, \ldots, s. \end{equation}\] The maximum partial likelihood (MPL) estimator of \(\boldsymbol{\beta}\), \(\hat{\boldsymbol{\beta}}\), is found by setting the score vector equal to zero and solving for \(\boldsymbol{\beta}\).
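As an illustration, the log partial likelihood is straightforward to compute directly. The following minimal sketch assumes right-censored data without ties or left truncation, a single covariate, and the hypothetical vectors time (exit times), status (event indicator), and x.

```r
## Log partial likelihood for one covariate; no ties, no left truncation.
## 'time', 'status' (1 = event, 0 = censored) and 'x' are hypothetical vectors.
logplik <- function(beta, time, status, x) {
    lp <- beta * x
    sum(sapply(which(status == 1), function(i) {
        riskset <- time >= time[i]              # R(t_(i)): still at risk
        lp[i] - log(sum(exp(lp[riskset])))
    }))
}
## The MPL estimate can be found numerically, for instance by
## optimize(logplik, c(-5, 5), maximum = TRUE,
##          time = time, status = status, x = x)
```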
For inference, we need to calculate the inverse of minus the Hessian, evaluated at \(\hat{\boldsymbol{\beta}}\). This gives the estimated covariance matrix. The Hessian is the matrix of the second partial derivatives. The expectation of minus the Hessian is called the information matrix. The observed information matrix is
\[\begin{equation*} \hat{I}(\hat{\boldsymbol{\beta}})_{j,m} = -\frac{\partial^2 \log L(\boldsymbol{\beta})} {\partial \beta_j \partial \beta_m}\mid_{\boldsymbol{\beta} = \hat{\boldsymbol{\beta}}} \end{equation*}\] and asymptotic theory says that
\[\begin{equation*} \hat{\boldsymbol{\beta}} \sim N(\boldsymbol{\beta}, \hat{I}^{-1}(\hat{\boldsymbol{\beta}})) \end{equation*}\] This is to say that \(\hat{\boldsymbol{\beta}}\) is asymptotically unbiased and normally distributed with the given covariance matrix (or the limit of it). Further, \(\hat{\boldsymbol{\beta}}\) is a consistent estimator of \(\boldsymbol{\beta}\). These results are used for hypothesis testing, confidence intervals, and variable selection.
Note that these are only asymptotic results, i.e., useful in large to medium sized samples. In small samples, bootstrapping is a possibility. This option is available in the R package eha.
Here a warning is in order: tests based on standard errors (Wald tests) may be highly unreliable, as in all non-linear regression (Hauck and Donner 1977). A better alternative is the likelihood ratio test.
A.4 Model Selection
In regression models, there are often several competing models for describing data. In general, there are no strict rules for “correct selection”. However, for nested models, there are some formal guidelines.
A.4.1 Comparing nested models
The meaning of nesting of models is best described by an example.
- \({\cal M}_2:\; h(t; (x_1, x_2)) = h_0(t) \exp(\beta_1 x_1 + \beta_2 x_2)\)
- \({\cal M}_1:\; h(t; (x_1, x_2)) = h_0(t) \exp(\beta_1 x_1)\): \(x_2\) has no effect.
Thus, the model \({\cal M}_1\) is a special case of \({\cal M}_2\) (\(\beta_2 = 0\)). We say that \({\cal M}_1\) is nested in \({\cal M}_2\). Now, assume that \({\cal M}_2\) is true. Then, testing the hypothesis \(H_0: \; {\cal M}_1\) is true (as well) is the same as testing the hypothesis \(H_0: \; \beta_2 = 0\).
The formal procedure for performing the likelihood ratio test (LRT) can be summarized as follows:
1. Maximize \(\log L(\beta_1, \beta_2)\) under \({\cal M}_2\); this gives \(\log L(\hat{\beta}_1, \hat{\beta}_2)\).
2. Maximize \(\log L(\beta_1, \beta_2)\) under \({\cal M}_1\), that is, maximize \(\log L(\beta_1, 0)\); this gives \(\log L(\beta_1^*, 0)\).
3. Calculate the test statistic \[\begin{equation*} T = 2\big(\log L(\hat{\beta}_1, \hat{\beta}_2) - \log L(\beta_1^*, 0)\big) \end{equation*}\]
4. Under \(H_0\), \(T\) has a \(\chi^2\) (chi-square) distribution with \(d\) degrees of freedom: \(T \sim \chi^2(d)\), where \(d\) is the difference in the numbers of parameters in the two competing models, in this case \(2 - 1 = 1\).
5. Reject \(H_0\) if \(T\) is large enough. Exactly how large depends on the level of significance: if it is \(\alpha\), choose the limit \(t_d\) equal to the \(100 (1 - \alpha)\) percentile of the \(\chi^2(d)\) distribution.
This result is a large sample approximation.
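With the Poisson fit of Section A.2.2 playing the role of \({\cal M}_2\), the LRT for dropping socBranch can be sketched as follows (fit is the hypothetical object from the glm sketch above).

```r
## LRT for removing socBranch from the Poisson fit 'fit' above.
fit0 <- update(fit, . ~ . - socBranch)            # the smaller model, M1
lrt <- 2 * (as.numeric(logLik(fit)) - as.numeric(logLik(fit0)))   # the statistic T
pchisq(lrt, df = 3, lower.tail = FALSE)           # socBranch has 4 levels, so d = 3
## Equivalently: anova(fit0, fit, test = "Chisq")
```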
The Wald test is theoretically performed as follows:
1. Maximize \(\log L(\beta_1, \beta_2)\) under \({\cal M}_2\); this gives \(\log L(\hat{\beta}_1, \hat{\beta}_2)\), and also \(\hat{\beta}_2\) and se(\(\hat{\beta}_2\)).
2. Calculate the test statistic \[\begin{equation*} T_W = \frac{\hat{\beta}_2}{\mbox{se}(\hat{\beta}_2)} \end{equation*}\]
3. Under \(H_0\), \(T_W\) has a standard normal distribution: \(T_W \sim N(0, 1)\).
4. Reject \(H_0\) if the absolute value of \(T_W\) is larger than 1.96 on a significance level of 5%.
This is a large sample approximation, with the advantage that it is automatically available in all software. In comparison to the LRT, one model fewer has to be fitted. This saves time and effort, unfortunately at the expense of accuracy, because the Wald test may occasionally give nonsensical results. This phenomenon, as already mentioned, is known as the Hauck-Donner effect (Hauck and Donner 1977). \(\Box\)
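For a single coefficient, the Wald statistic can be read off directly from the summary of the glm fit above; note that this is a test for one level of socBranch, not for the covariate as a whole.

```r
## Estimate, standard error, Wald z and p-value for one level of socBranch,
## from the Poisson fit 'fit' above.
coef(summary(fit))["socBranchfarming", ]
```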
A.4.2 Comparing non-nested models
Non-nested models cannot be compared by a likelihood ratio test, but there are a couple of alternatives based on comparing maximized likelihood values penalized by the number of parameters that need to be estimated. One such alternative is the Akaike Information Criterion (AIC); see Leeuw (1992).
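In R, this penalized criterion is returned by AIC(); for the Poisson fit above:

```r
## AIC for the Poisson fit 'fit' above; when comparing models fitted to the
## same data, a smaller value is preferred.
AIC(fit)
```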