## Introduction

If you suffer from long execution times on huge data sets, and/or from privacy (integrity) problems, this is for you. I show how the theory of *sufficient statistics* may solve your problem, provided you are willing to organize your data properly. Your integrity problem (if any) is solved on the fly.

I illustrate everything by example using **R** and *RStudio*. The principle is universal, however, and you could, as an exercise, think of how you would implement it in Stata, if that is your inclination. In any case, you may find pleasure in studying the **R** code included throughout.

## The problem

I am analyzing a large data set from *Lisa*, trying some selected statistical models for a project based on survival analysis. Since *Lisa* contains yearly data, I am contemplating a discrete-time proportional hazards model, which I plan to fit by logistic regression (with a *cloglog* link). Here are my data (first few rows):

`head(suff)`

```
##       id age hisclass    sex event
## 1 183091  66        1 female FALSE
## 2 183091  67        1 female FALSE
## 3 183091  67        1 female FALSE
## 4 183091  68        1 female FALSE
## 5 183091  68        1 female FALSE
## 6 183091  69        1 female FALSE
```

The variable *hisclass* is of primary interest. It is grouped into four ordered classes, where “1” is the “uppermost class” and “4” is the “lowermost.” All other “control variables” except *sex* are omitted, to make this example clear and easy. The variable *event* is the response, with *TRUE* signifying an observed death.

This data frame is quite large (but not huge): 1,715,655 records covering 97,776 individuals, with 6,115 deaths in all.
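The *Lisa* data are not publicly available, so if you want to follow along you can generate a toy data frame of the same shape. This is a purely hypothetical stand-in; its estimates will of course not match the real numbers:

```r
## Hypothetical stand-in for the Lisa extract (the real data are confidential):
set.seed(1)
n <- 100000
suff <- data.frame(
  id       = sample(1:20000, n, replace = TRUE),
  age      = sample(40:89, n, replace = TRUE),   # 50 age classes
  hisclass = sample(1:4, n, replace = TRUE),     # 4 social classes
  sex      = factor(sample(c("male", "female"), n, replace = TRUE)),
  event    = runif(n) < 0.004                    # roughly the observed death rate
)
```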

Let us run a logistic regression and at the same time measure execution time (on *singasten* running Ubuntu 12.04 (*very* old!), 64 GB RAM).

```
(tt <- system.time(fit <- glm(event ~ factor(age) + factor(hisclass) + sex,
                              data = suff, family = binomial(link = cloglog))))
```

```
##    user  system elapsed
##  84.263   3.660  87.930
```

```
res <- summary(fit)$coef[-(1:50), 1:2]  # drop the intercept and the 49 age coefficients
round(res, 4)
```

```
##                   Estimate Std. Error
## factor(hisclass)2   0.1760     0.0409
## factor(hisclass)3   0.3319     0.0505
## factor(hisclass)4   0.3718     0.0398
## sexfemale          -0.4726     0.0274
```

The problem with this is the *execution time*: It takes 88 seconds to perform this simple analysis!

## The solution: *Sufficient statistics*!

We realize that the explanatory variables can take at most *50 × 4 × 2 = 400* distinct combinations (50 age classes, four *hisclass* levels, two sexes), so the solution is to *aggregate*. For this to work we need a variable that counts the number of repetitions of each combination; let us call it *count*:

`suff$count <- 1`

It is equal to one on all our original records, and now we aggregate *event* and *count* over all combinations:

```
newsuff <- aggregate(suff[, c("event", "count")],
                     by = suff[, c("age", "hisclass", "sex")],
                     FUN = sum)
head(newsuff)
```

```
##   age hisclass  sex event count
## 1  40        1 male     6  6884
## 2  41        1 male     1  6932
## 3  42        1 male     1  6986
## 4  43        1 male     2  6982
## 5  44        1 male     1  7003
## 6  45        1 male     1  7030
```

These are the first six rows of a total of 398 in the new data frame. (Shouldn't it be 400? Yes, but combinations that do not occur in the data are dropped: two in our case.)
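A quick sanity check is cheap insurance: the aggregated table must reproduce the totals of the original data exactly. A minimal sketch, using a tiny made-up stand-in for *suff*:

```r
## Tiny made-up stand-in for 'suff' (the real data are confidential):
suff <- data.frame(age      = rep(40:41, each = 4),
                   hisclass = rep(1:2, 4),
                   sex      = "male",
                   event    = c(TRUE, rep(FALSE, 7)),
                   count    = 1)
newsuff <- aggregate(suff[, c("event", "count")],
                     by = suff[, c("age", "hisclass", "sex")],
                     FUN = sum)
## The aggregate must preserve the totals of the original data:
stopifnot(sum(newsuff$count) == nrow(suff))
stopifnot(sum(newsuff$event) == sum(suff$event))
```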

Now we can use our new data frame to run the same logistic regression as before, but we must formulate the response as a two-column matrix of *(successes, failures)*:

```
(tt2 <- system.time(fit2 <- glm(cbind(event, count - event) ~
                                  factor(age) + factor(hisclass) + sex,
                                data = newsuff, family = binomial(link = cloglog))))
```

```
##    user  system elapsed
##   0.009   0.000   0.009
```

```
res2 <- summary(fit2)$coef[-(1:50), 1:2]  # drop the intercept and the 49 age coefficients
round(res2, 4)
```

```
##                   Estimate Std. Error
## factor(hisclass)2   0.1760     0.0409
## factor(hisclass)3   0.3319     0.0505
## factor(hisclass)4   0.3718     0.0398
## sexfemale          -0.4726     0.0274
```

As you can see, the two results are *identical*, but the execution time in the aggregated case is only 0.009 seconds, compared to 88 seconds in the original case. That is about 9,770 times faster!

You may wonder whether I lost that time in the aggregation process. No: it took around five seconds (for the computer), and besides, it is a one-time job if done cleverly.
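If even those five seconds bother you, the grouping can be done faster with the *data.table* package. This is a sketch on toy data, assuming *data.table* is installed; the real *Lisa* extract is not used here:

```r
library(data.table)

## Toy stand-in for 'suff':
set.seed(1)
suff <- data.frame(age      = sample(40:89, 100000, replace = TRUE),
                   hisclass = sample(1:4, 100000, replace = TRUE),
                   sex      = sample(c("male", "female"), 100000, replace = TRUE),
                   event    = runif(100000) < 0.004)

## Group by the covariate combinations; .N counts the rows in each group:
dt <- as.data.table(suff)
newsuff <- dt[, .(event = sum(event), count = .N), by = .(age, hisclass, sex)]
```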

*Why does this work?* Can we be sure that this wasn't just a happy coincidence? Yes; the answer is given by *the statistical theory of sufficient statistics*: the table we created is a sufficient statistic for the model and sample at hand. This means that, given the sufficient statistic (our table) and the model, the conditional distribution of the original sample does not depend on the model parameters; it is pure random noise. Thus it carries no information about the parameters in our model.
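The factorization is easy to write down in our case. Let cell *j* denote one combination of *age*, *hisclass*, and *sex*, with death probability *p_j*, *n_j* observations (*count*) and *d_j* deaths (*event*), and let *y_i* be the event indicator of record *i*, which lies in cell *j(i)*. Then the likelihood of the whole sample factorizes as

```latex
L(p) = \prod_{i=1}^{N} p_{j(i)}^{\,y_i} \, (1 - p_{j(i)})^{1 - y_i}
     = \prod_{j} p_j^{\,d_j} \, (1 - p_j)^{\,n_j - d_j}
```

so the likelihood depends on the data only through the pairs *(d_j, n_j)*, which are exactly the columns of our aggregated table (the Fisher–Neyman factorization criterion).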

## Conclusion

The principle of tabulating a huge data set works well if the sufficient statistic significantly reduces the dimensionality of your data set. A necessary first step is to avoid continuous covariates (group them). The distribution of the response must also allow categorization in many cases. For instance, in a survival-analysis setting (continuous time) you may want to apply Cox regression. It admits no reduction by sufficiency, but if you are willing to assume a *piecewise constant hazards* model, you are fine. It works essentially as above, but with *Poisson regression* instead of logistic regression, and aggregation of *exposure times* instead of pure counts.
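A sketch of that Poisson variant, on made-up data; the data frame *surv* and its *exposure* column are hypothetical, just for illustration:

```r
## Hypothetical individual-level survival data:
set.seed(2)
n <- 10000
surv <- data.frame(
  age      = sample(40:49, n, replace = TRUE),
  hisclass = sample(1:4, n, replace = TRUE),
  sex      = factor(sample(c("male", "female"), n, replace = TRUE)),
  exposure = runif(n, 0.1, 1),      # time at risk (person-years)
  event    = rbinom(n, 1, 0.01)     # death indicator
)

## Aggregate both events and exposure over the covariate combinations:
newsurv <- aggregate(surv[, c("event", "exposure")],
                     by = surv[, c("age", "hisclass", "sex")],
                     FUN = sum)

## Piecewise constant hazards = Poisson regression with a log-exposure offset:
fit3 <- glm(event ~ offset(log(exposure)) + factor(age) + factor(hisclass) + sex,
            data = newsurv, family = poisson)
```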

In “real” problems the generated table will be much larger than in this example. The one I am working with now generates a table of around 40,000 rows, but that is still a huge gain compared to the original data frame with three million observations. So don't hesitate: *just do it*.

This is, by the way, a good example of the principle *“think first”*, sadly forgotten these days: computers do not think (although some argue and believe differently).

Oh, I almost forgot: where does *integrity* come into the picture? It is obvious: in the generated table, no individual-level information is available. As always, though, this is a truth with modification: if some combination of attributes is very rare, an informed observer may be able to identify an individual. In general, it should even be possible to publish tables generated in this way.