A Basic Statistical Concepts
The statistical concepts that are important for understanding what is going on in this book are gathered here, but only briefly treated. The interested reader who wants a deeper understanding should have no problem finding suitable textbooks; there are, for instance, recent texts that teach statistics and how to use it in R (Dalgaard 2008).
A.1 Statistical Inference
Statistical inference is the science that helps us draw conclusions about real world phenomena by observing and analyzing samples from them. The theory rests on probability theory and the concept of random sampling. A statistical analysis never gives absolute truths, only statements coupled to certain measures of their validity. These measures are almost always probability statements.
The crucial concept is that of a model, despite the fact that the present trend in statistical inference is toward nonparametric statistics. It is often stated that with today’s huge data sets, statistical models are unnecessary, but nothing could be more wrong.
The important idea in a statistical model is the concept of a parameter. It is often confused with its estimator from data. For instance, when we talk about mortality in a population, it is a hypothetical concept that is different from the ratio between the observed number of deaths and the population size (or any other measure based on data). The latter is an estimate (at best) of the former. The whole idea of statistical inference is to extract information about a population parameter from observed data.
A.1.1 Point estimation
The task in point estimation is to find the best guess (in some sense) of a population parameter from data. That is, we try to find the single value that is closest to the true, but unknown, value of the population parameter.
Of course, a point estimator is useless if it is not connected to some measure of its uncertainty. That takes us to the concept of interval estimation.
A.1.2 Interval estimation
The philosophy behind interval estimation is that a guess of a single value of the unknown population parameter is useless without an accompanying measure of the uncertainty of that guess. A confidence interval is an interval which, we say, covers the true value of the population parameter with a certain probability (often 95 per cent).
A.1.3 Hypothesis testing
We are often interested in a specific value of a parameter, and in regression problems this value is almost always zero (0). The reason is that regression parameters measure effects, and to test for no effect is then equivalent to testing that the corresponding parameter has value zero.
There is a connection between interval estimation and hypothesis testing: To test the hypothesis that a parameter value is zero can be done through constructing a confidence interval for the parameter. The test rule is then: If the interval does not cover zero, reject the hypothesis, otherwise do not.
A.1.3.1 The log-rank test
The general hypothesis testing theory behind the log-rank test builds on the hypergeometric distribution. The calculations under the null hypothesis of no difference in survival chances between the two groups are performed conditional on both margins. In Table A.1, if the margins are fixed there is only one degree of freedom left; for a given value of (say) \(d_1\), the three values \(d_2\), \((n_1 - d_1)\), and \((n_2 - d_2)\) are determined.
TABLE A.1: Deaths and survivors in the two groups.

Group | Deaths | Survivors | Total |
---|---|---|---|
I | \(d_1\) | \(n_1 - d_1\) | \(n_1\) |
II | \(d_2\) | \(n_2 - d_2\) | \(n_2\) |
Total | \(d\) | \(n - d\) | \(n\) |
Utilizing the fact that, under the null hypothesis, \(d_1\) is hypergeometrically distributed results in the following algorithm for calculating a test statistic:
1. Observe \(O = d_1\).
2. Calculate the expected value \(E\) of \(O\) (under the null):
\[ E = d \frac{n_1}{n}. \]
3. Calculate the variance \(V\) of \(O\) (under the null):
\[V = \frac{(n - d) d n_1 n_2}{n^2 (n - 1)}.\]
4. Repeat steps 1 to 3 for all tables (one for each observed event time) and aggregate according to equation (A.1).
The log-rank test statistic \(T\) is
\[\begin{equation} T = \frac{\sum_{i=1}^k \left(O_i - E_i\right)}{\sqrt{\sum_{i=1}^k V_i}} \tag{A.1} \end{equation}\]
Note carefully that this procedure is not equivalent to aggregating all tables of raw data!
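For concreteness, here is a minimal sketch of the computation in R; the data frame `tabs` is a hypothetical input, not part of any package, holding one row per event time with the counts \(d_1\), \(d_2\), \(n_1\), \(n_2\) of Table A.1.

```r
## Minimal sketch of the log-rank computation from per-event-time tables.
## 'tabs' is a hypothetical data frame with columns d1, d2, n1, n2.
logrank <- function(tabs) {
    with(tabs, {
        d <- d1 + d2
        n <- n1 + n2
        O <- d1                                        # observed deaths, group I
        E <- d * n1 / n                                # expected under the null
        V <- (n - d) * d * n1 * n2 / (n^2 * (n - 1))   # hypergeometric variance
        stat <- sum(O - E) / sqrt(sum(V))              # equation (A.1)
        c(T = stat, T.squared = stat^2,
          p = pchisq(stat^2, df = 1, lower.tail = FALSE))
    })
}
```

In practice the test is of course computed by standard software, for instance with survdiff in the survival package.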
Properties of the log-rank test:
- The test statistic \(T^2\) is approximately distributed as \(\chi^2(1)\).
- It is available in most statistical software.
- It can be generalized to comparisons of more than two groups.
- For \(s\) groups, the test statistic is approximately \(\chi^2(s-1)\).
- The test has high power against alternatives with proportional hazards, but can be weak against non-proportional alternatives.
A.2 Presentation of Results
The presentation of the results of fitting a regression model to given data is usually done both graphically and in tabular form. Some general guidelines that must be followed are given here. In order to make the presentation easier to understand, a fictive research project, based on real data, is followed.
A.2.1 The project
In a mortality study of the small town of Umeå in northern Sweden during the 20th century, we are investigating the mortality among persons aged 50 and above. We are especially interested in whether there were any differences in mortality between social branches. For this purpose, the data set `ume`, with selected variables, is used. A summary of the data is shown in Table ??. As you can see, the data are tabulated, so a piecewise constant hazard model is called for.
A.2.2 Tabular presentation
We fit a simple model with only one covariate, but select only married men. This can be done with the help of Poisson regression and the `glm` function:
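The following is a sketch of the call that could produce the output below; the data frame name umetab and the variable names are taken directly from the printed call.

```r
## Poisson regression for the tabulated data; 'umetab' and the variable
## names are taken from the printed call below.
fit <- glm(event ~ offset(log(exposure)) + age + socBranch,
           family = "poisson", data = umetab,
           subset = sex == "male" & civst == "married")
summary(fit)
```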
Call:
glm(formula = event ~ offset(log(exposure)) + age + socBranch,
family = "poisson", data = umetab, subset = sex == "male" &
civst == "married")
Deviance Residuals:
Min 1Q Median 3Q Max
-2.7219 -0.9084 -0.1202 0.6842 2.5910
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.41828 0.09766 -45.240 < 2e-16
age53-56 0.33971 0.11042 3.077 0.00209
age56-59 0.51119 0.10978 4.657 3.21e-06
age59-62 0.83266 0.10635 7.830 4.89e-15
age62-65 1.10338 0.10483 10.526 < 2e-16
age65-68 1.38179 0.10395 13.292 < 2e-16
age68-71 1.80158 0.10196 17.669 < 2e-16
age71-74 1.96600 0.10539 18.655 < 2e-16
age74-77 2.16921 0.11027 19.672 < 2e-16
age77-80 2.58718 0.11432 22.632 < 2e-16
age80-83 2.66962 0.13539 19.718 < 2e-16
age83-86 3.01993 0.15727 19.202 < 2e-16
age86-89 3.38008 0.19776 17.092 < 2e-16
age89-92 3.26400 0.36312 8.989 < 2e-16
age92-95 3.11514 0.71204 4.375 1.21e-05
socBranchfarming -0.40721 0.06612 -6.158 7.35e-10
socBranchbusiness 0.03182 0.09226 0.345 0.73017
socBranchworker -0.32630 0.07125 -4.580 4.65e-06
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 1752.29 on 192 degrees of freedom
Residual deviance: 240.43 on 175 degrees of freedom
AIC: 969.83
Number of Fisher Scoring iterations: 5
Besides the slightly disturbing fact that all age estimates are printed, the presentation of the result for the covariate of interest, `socBranch`, is less satisfactory from a “survival analysis” point of view: we are generally not interested in looking at the estimated baseline in tabular form (we want a graph). There is a more general critique of the presentation: too many p-values are given, and not the relevant ones either.
In `eha`, the function `tpchreg` does the same thing, but with a more satisfactory presentation of the result; see Table A.2.
TABLE A.2: Social branch and old age mortality, married men.

| Covariate | level | W_mean | Coef | HR | SE | LR_p |
|---|---|---|---|---|---|---|
| socBranch |  |  |  |  |  | 0.0000 |
|  | official | 0.109 | 0 | 1 |  |  |
|  | farming | 0.520 | -0.4072 | 0.6655 | 0.0661 |  |
|  | business | 0.072 | 0.0318 | 1.0323 | 0.0923 |  |
|  | worker | 0.299 | -0.3263 | 0.7216 | 0.0712 |  |

Max Log Likelihood: -9685.8
Restricted mean: 20.543
First, we got rid of the printing of the numerous baseline hazard estimates; second, the factor `socBranch` is presented in full, with all four levels; and finally, only the relevant p-value is given. This matters because, first, Wald p-values are notoriously unreliable in nonlinear regression, and second, we want p-values for covariates and not for levels of covariates.
However, note that the essential findings with the two approaches are identical.
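For reference, here is a sketch of how a table like Table A.2 could be produced with tpchreg. The response is built with eha's oe (occurrence/exposure) function; that the age factor can be used directly as the time argument for these tabulated data is an assumption.

```r
library(eha)
## Restrict to married men as in the glm fit above (variable names are
## assumptions based on the printed call).
marmen <- umetab[umetab$sex == "male" & umetab$civst == "married", ]
fit.tp <- tpchreg(oe(event, exposure) ~ socBranch, time = age, data = marmen)
summary(fit.tp)
```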
A.2.3 Graphics
The baseline hazard function is shown in Figure A.1.
FIGURE A.1: The hazard function for old age mortality in Umeå, first half of 20th century.
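A sketch of how such a figure could be drawn, assuming (as for other eha model fits) that a plot method is available for the fitted object fit.tp from the sketch above.

```r
## Assumed plot method for the piecewise constant hazard fit above.
plot(fit.tp, xlab = "Age", ylab = "Baseline hazard")
```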
A.3 Asymptotic Theory
A.3.1 Partial likelihood
Here is a very brief summary of the asymptotics concerning the partial likelihood. Once defined, it turns out that you may treat it as an ordinary likelihood function. The setup is as follows.
Let \(t_{(1)}, t_{(2)}, \ldots, t_{(k)}\) be the ordered observed event times and let \(R_i = R(t_{(i)})\) be the risk set at \(t_{(i)}, \; i = 1, \ldots, k\). At \(t_{(i)}\), condition with respect to the composition of \(R_i\) and on the fact that one event occurred (for tied events, a correction is necessary).
Then the contribution to the partial likelihood from \(t_{(i)}\) is
\[\begin{multline*} L_i(\boldsymbol{\beta}) = P(\mbox{No. $m_i$ dies} \mid \mbox{one event occurs}, R_i) \\ = \frac{h_0(t_{(i)}) \exp(\boldsymbol{\beta} \mathbf{x}_{m_i})} {\sum_{\ell \in R_i}h_0(t_{(i)})\exp(\boldsymbol{\beta} \mathbf{x}_\ell)} = \frac{\exp(\boldsymbol{\beta} \mathbf{x}_{m_i})} {\sum_{\ell \in R_i}\exp(\boldsymbol{\beta} \mathbf{x}_\ell)} \end{multline*}\] and the full partial likelihood is
\[\begin{equation*} L(\boldsymbol{\beta}) = \prod_{i=1}^k L_i(\boldsymbol{\beta}) = \prod_{i=1}^k \frac{ \exp(\boldsymbol{\beta} \mathbf{x}_{m_i})} {\sum_{\ell \in R_i}\exp(\boldsymbol{\beta} \mathbf{x}_\ell)} \end{equation*}\] This is where the doubt about the partial likelihood comes in; the conditional probabilities multiplied together do not have a proper interpretation as a conditional probability. Nevertheless, it is prudent to proceed as if the expression really is a likelihood function. The log partial likelihood becomes
\[\begin{equation}\label{eq:logplA} \log\big(L(\boldsymbol{\beta})\big) = \sum_{i=1}^k \left\{\boldsymbol{\beta} \mathbf{x}_{m_i} - \log\left(\sum_{\ell \in R_i} \exp(\boldsymbol{\beta} \mathbf{x}_\ell)\right)\right\}, \end{equation}\] and the components of the score vector are
\[\begin{equation}\label{eq:scoreA} \frac{\partial}{\partial \beta_j} \log L(\boldsymbol{\beta}) = \sum_{i=1}^k x_{m_i j} - \sum_{i=1}^k \frac{\sum_{\ell \in R_i} x_{\ell j} \exp(\boldsymbol{\beta} \mathbf{x}_\ell)} {\sum_{\ell \in R_i} \exp(\boldsymbol{\beta} \mathbf{x}_\ell)}, \quad j = 1, \ldots, s. \end{equation}\] The maximum partial likelihood (MPL) estimator of \(\boldsymbol{\beta}\), \(\hat{\boldsymbol{\beta}}\), is found by setting the score vector equal to zero and solving for \(\boldsymbol{\beta}\).
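As an illustration, the log partial likelihood is straightforward to compute directly. The following minimal sketch assumes right-censored data without ties or left truncation, a single covariate, and the hypothetical vectors time (exit times), status (event indicator), and x.

```r
## Log partial likelihood for one covariate; no ties, no left truncation.
## 'time', 'status' (1 = event, 0 = censored) and 'x' are hypothetical vectors.
logplik <- function(beta, time, status, x) {
    lp <- beta * x
    sum(sapply(which(status == 1), function(i) {
        riskset <- time >= time[i]              # R(t_(i)): still at risk
        lp[i] - log(sum(exp(lp[riskset])))
    }))
}
## The MPL estimate can be found numerically, for instance by
## optimize(logplik, c(-5, 5), maximum = TRUE,
##          time = time, status = status, x = x)
```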
For inference, we need to calculate the inverse of minus the Hessian, evaluated at \(\hat{\boldsymbol{\beta}}\). This gives the estimated covariance matrix. The Hessian is the matrix of the second partial derivatives. The expectation of minus the Hessian is called the information matrix. The observed information matrix is
\[\begin{equation*} \hat{I}(\hat{\boldsymbol{\beta}})_{j,m} = -\frac{\partial^2 \log L(\boldsymbol{\beta})} {\partial \beta_j \partial \beta_m}\mid_{\boldsymbol{\beta} = \hat{\boldsymbol{\beta}}} \end{equation*}\] and asymptotic theory says that
\[\begin{equation*} \hat{\boldsymbol{\beta}} \sim N(\boldsymbol{\beta}, \hat{I}^{-1}(\hat{\boldsymbol{\beta}})) \end{equation*}\] This is to say that \(\hat{\boldsymbol{\beta}}\) is asymptotically unbiased and normally distributed with the given covariance matrix (or the limit of it). Further, \(\hat{\boldsymbol{\beta}}\) is a consistent estimator of \(\boldsymbol{\beta}\). These results are used for hypothesis testing, confidence intervals, and variable selection.
Note that these are only asymptotic results, i.e., useful in large to medium sized samples. In small samples, bootstrapping is a possibility. This option is available in the R package eha.
Here a warning is in order: tests based on standard errors (Wald tests) may be highly unreliable, as in all non-linear regression (Hauck and Donner 1977). A better alternative is the likelihood ratio test.
A.4 Model Selection
In regression models, there are often several competing models for describing data. In general, there are no strict rules for “correct selection”. However, for nested models, there are some formal guidelines.
A.4.1 Comparing nested models
The meaning of nesting of models is best described by an example.
- \({\cal M}_2:\; h(t; (x_1, x_2)) = h_0(t) \exp(\beta_1 x_1 + \beta_2 x_2)\)
- \({\cal M}_1:\; h(t; (x_1, x_2)) = h_0(t) \exp(\beta_1 x_1)\): \(x_2\) has no effect.
Thus, the model \({\cal M}_1\) is a special case of \({\cal M}_2\) (\(\beta_2 = 0\)). We say that \({\cal M}_1\) is nested in \({\cal M}_2\). Now, assume that \({\cal M}_2\) is true. Then, testing the hypothesis \(H_0: \; {\cal M}_1\) is true (as well) is the same as testing the hypothesis \(H_0: \; \beta_2 = 0\).
The formal procedure for performing the likelihood ratio test (LRT) can be summarized as follows:
1. Maximize \(\log L(\beta_1, \beta_2)\) under \({\cal M}_2\); this gives \(\log L(\hat{\beta}_1, \hat{\beta}_2)\).
2. Maximize \(\log L(\beta_1, \beta_2)\) under \({\cal M}_1\), that is, maximize \(\log L(\beta_1, 0)\); this gives \(\log L(\beta_1^*, 0)\).
3. Calculate the test statistic \[\begin{equation*} T = 2\big(\log L(\hat{\beta}_1, \hat{\beta}_2) - \log L(\beta_1^*, 0)\big) \end{equation*}\]
4. Under \(H_0\), \(T\) has a \(\chi^2\) (chi-square) distribution with \(d\) degrees of freedom: \(T \sim \chi^2(d)\), where \(d\) is the difference in the numbers of parameters in the two competing models, in this case \(2 - 1 = 1\).
5. Reject \(H_0\) if \(T\) is large enough. Exactly how large depends on the level of significance: if it is \(\alpha\), choose the limit \(t_d\) equal to the \(100 (1 - \alpha)\) percentile of the \(\chi^2(d)\) distribution.
This result is a large sample approximation.
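With the Poisson fit of Section A.2.2 playing the role of \({\cal M}_2\), the LRT for dropping socBranch can be sketched as follows (fit is the hypothetical object from the glm sketch above).

```r
## LRT for removing socBranch from the Poisson fit 'fit' above.
fit0 <- update(fit, . ~ . - socBranch)            # the smaller model, M1
lrt <- 2 * (as.numeric(logLik(fit)) - as.numeric(logLik(fit0)))   # the statistic T
pchisq(lrt, df = 3, lower.tail = FALSE)           # socBranch has 4 levels, so d = 3
## Equivalently: anova(fit0, fit, test = "Chisq")
```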
The Wald test is theoretically performed as follows:
1. Maximize \(\log L(\beta_1, \beta_2)\) under \({\cal M}_2\); this gives \(\log L(\hat{\beta}_1, \hat{\beta}_2)\), and also \(\hat{\beta}_2\) and se(\(\hat{\beta}_2\)).
2. Calculate the test statistic \[\begin{equation*} T_W = \frac{\hat{\beta}_2}{\mbox{se}(\hat{\beta}_2)} \end{equation*}\]
3. Under \(H_0\), \(T_W\) has a standard normal distribution: \(T_W \sim N(0, 1)\).
4. Reject \(H_0\) if the absolute value of \(T_W\) is larger than 1.96 on a significance level of 5%.
This is a large sample approximation, with the advantage that it is automatically available in all software. In comparison to the LRT, one model fewer has to be fitted. This saves time and effort, unfortunately at the expense of accuracy, because the Wald test may occasionally give nonsensical results. This phenomenon, as already mentioned, is known as the Hauck-Donner effect (Hauck and Donner 1977). \(\Box\)
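For a single coefficient, the Wald statistic can be read off directly from the summary of the glm fit above; note that this is a test for one level of socBranch, not for the covariate as a whole.

```r
## Estimate, standard error, Wald z and p-value for one level of socBranch,
## from the Poisson fit 'fit' above.
coef(summary(fit))["socBranchfarming", ]
```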
A.4.2 Comparing non-nested models
Non-nested models cannot be compared by a likelihood ratio test, but there are a couple of alternatives based on comparing maximized likelihood values penalized by the number of parameters that need to be estimated. One such alternative is the Akaike Information Criterion (AIC); see Leeuw (1992).
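In R, this penalized criterion is returned by AIC(); for the Poisson fit above:

```r
## AIC for the Poisson fit 'fit' above; when comparing models fitted to the
## same data, a smaller value is preferred.
AIC(fit)
```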