C A Brief Introduction to R

There are already many excellent sources of information available for those who want to learn more about R. A visit to the R home page is highly recommended; when there, look under Documentation. The Task Views on the CRAN home page include one on Survival, which is well worth examining.

When R is installed on a computer, it is easy to access the documentation that comes with the distribution; start R and read the help page ?help.start. A good start for a beginner is also the book Introductory Statistics with R by Dalgaard (2008). The book S Programming (Venables and Ripley 2000) is recommended for the more advanced study of the topic that its title implies.

That said, in this chapter we will only look at some aspects of working with R, aspects that are relevant to our topic.

C.1 R in General

In this section we investigate how to work with R in general, without special reference to event history analysis, which will be treated separately in a later section in this chapter.

C.1.1 R objects

R is an object-oriented programming language, which has certain implications for terminology. For instance, everything in R is an object. We will not draw too much on this fact, but it may be useful to know.

C.1.2 Expressions and assignments

Commands are either expressions or assignments. Commands are separated by newline (normally) or semi-colon. The hash mark (#) marks the rest of the line as a comment.

1 + exp(2.3)
[1] 10.97418
x <- 1 + exp(2.3)
x
[1] 10.97418

Note that the R prompt is >. If the previous expression is incomplete, it changes to +.

The first line above is an expression. It is evaluated when Return is pressed, and the result is normally printed. The second line is an assignment; the result is stored in x and not printed. By typing x, its content is printed.

The assignment symbol is <-, which consists of two consecutive keystrokes, “<” followed by “-”. It is strongly recommended to surround the assignment symbol (and any arithmetic operator) with spaces. However, never separate the two characters by a space; you would then get a comparison with a negative number:

x <- 3
x < - 3
[1] FALSE

It is also allowed, nowadays, to use = as an assignment symbol, but be aware of the fact that <- and = are not always interchangeable, for instance when giving values to arguments in function calls.

C.1.3 Objects

Everything in R is an object (object-oriented programming). All objects have a mode and a length. Data objects have modes numeric, complex, logical, or character, and language objects have modes function, call, expression, name, etc. All data objects are vectors; there are no scalars in R.

Vectors contain elements of the same kind; a list can be thought of as a vector where each element can be anything, even another list, and the elements may be of completely different kinds.

There are more properties of objects worth mentioning, but we save that for the situations where we need them.

C.1.4 Vectors and matrices

There are five basic types of vectors: logical, integer, double, complex, and character. The function c (for concatenate) creates vectors:

dat <- c(2.0, 3.4, 5.6)
cols <- c("red", "green", "blue")

The elements of a vector may be named:

names(dat) <- c("Sam", "John", "Kate")
names(dat)
[1] "Sam"  "John" "Kate"
dat
 Sam John Kate 
 2.0  3.4  5.6 
dat["John"]
John 
 3.4 

Note the first two lines: the function names may be used either to replace the names (first line) or to display them (second line).

The function matrix creates a matrix (surprise!):

Y <- matrix(c(1,2,3,4,5,6), nrow = 2, ncol = 3)
Y
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
dim(Y) <- c(3, 2)
Y
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
as.vector(Y)
[1] 1 2 3 4 5 6
mode(Y)
[1] "numeric"
attributes(Y)
$dim
[1] 3 2

It is also possible to create vectors by the functions numeric, character, and logical.

x <- logical(7)
x
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
x <- numeric(7)
x
[1] 0 0 0 0 0 0 0
x <- character(7)
x
[1] "" "" "" "" "" "" ""

These functions are useful for allocating storage (memory), which later can be filled. This is better practice than letting a vector grow sequentially.
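As a small illustration (a minimal sketch, not from the original examples), allocate storage first and fill selected positions later; indexing by position is used here and is treated in more detail in the Indexing section below.

x <- numeric(3)   # allocate storage for three values
x[1] <- 2.0       # fill positions as they become known
x[3] <- 5.6
x
[1] 2.0 0.0 5.6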

The functions nrow and ncol extract the number of rows and columns, respectively, of a matrix.

An array is an extension of a matrix to more than two dimensions.
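For instance, a three-dimensional array can be created by giving a dim argument of length three (a minimal sketch; the name A is just an illustration):

A <- array(1:24, dim = c(2, 3, 4))
dim(A)
[1] 2 3 4
A[1, 2, 3]
[1] 15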

C.1.5 Lists

Family records may be produced as a list:

fam <- list(FamilyID = 1, man = "John", wife = "Kate", 
            children = c("Sam", "Bart"))
fam
$FamilyID
[1] 1

$man
[1] "John"

$wife
[1] "Kate"

$children
[1] "Sam"  "Bart"
fam$children
[1] "Sam"  "Bart"

C.1.6 Data frames

A data frame is a special case of a list. It consists of variables of the same length, but not necessarily of the same type. Data frames are created by either data.frame or read.table, where the latter reads data from an ASCII file on disk.

dat <- data.frame(name = c("Homer", "Flanders", "Skinner"), 
                  income = c(1000, 23000, 85000))
dat
      name income
1    Homer   1000
2 Flanders  23000
3  Skinner  85000

A data frame has row names (here: 1, 2, 3) and variable (column) names (here: name, income). The data frame is the normal object for data storage and statistical analysis in R.

C.1.7 Factors

Categorical variables are conveniently stored as factors in R.

country <- factor(c("se", "no", "uk", "uk", "us"))
country
[1] se no uk uk us
Levels: no se uk us
print.default(country)
[1] 2 1 3 3 4

Factors are internally coded by integers (1, 2, …). The levels are by default sorted into alphabetical order. This can be changed:

country <- factor(c("se", "no", "uk", "uk", "us"), 
                  levels = c("uk", "us", "se", "no"))
country
[1] se no uk uk us
Levels: uk us se no

The first level often has a special meaning (“reference category”) in statistical analyses in R.
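If another level should serve as the reference after the factor has been created, the base function relevel moves it into first place; a minimal sketch continuing the example above:

country <- relevel(country, ref = "se")
levels(country)
[1] "se" "uk" "us" "no"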

C.1.8 Operators

Arithmetic operations are performed on vectors, element by element. The common operators are + - * / ^, where ^ is exponentiation. For instance

x <- c(1, 2, 3)
x^2
[1] 1 4 9
y <- 2 * x
y / x
[1] 2 2 2

C.1.9 Recycling

The last examples above were examples of recycling: if two vectors of different lengths occur in an expression, the shorter one is recycled (repeated) as many times as necessary. For instance,

op <- options(warn = -1) # Turn off warnings.
x <- c(1, 2, 3)
y <- c(2, 3)
x / y
[1] 0.5000000 0.6666667 1.5000000
options(op) # Set warnings to original state.

so the actual operation performed is

c(1, 2, 3) / c(2, 3, 2) # Truncated to equal length
[1] 0.5000000 0.6666667 1.5000000

Most commonly the shorter vector has length one, as in scalar multiplication. If the length of the longer vector is not a multiple of the length of the shorter one, a warning is issued: longer object length is not a multiple of shorter object length (suppressed above). This is often a sign of a mistake in the code, so do not turn off warnings!

C.1.10 Precedence

Multiplication and division are performed before addition and subtraction, as in ordinary mathematics. The (almost) full set of rules, from highest to lowest precedence, is:

  1. $, [[: Extraction of elements in a list.
  2. [: Vector extraction.
  3. ^: Exponentiation.
  4. -: Unary minus.
  5. :: Sequence generation.
  6. %%, %/%, %*%: Special operators.
  7. *, /: Multiplication and division.
  8. +, -: Addition and subtraction.
  9. <, >, <=, >=, ==, !=: Comparison.
  10. !: Logical negation.
  11. &, |, &&, ||: Logical operators.
  12. ~: Formula.
  13. <-: Assignment.

It is, however, highly recommended to use parentheses often rather than relying on (and remembering) those rules. Compare, for instance

1:3 + 1
[1] 2 3 4
1:(3 + 1)
[1] 1 2 3 4

For descriptions of specific operators, consult the help system.

C.2 Some Standard R Functions

  1. round, floor, ceiling: Round to the nearest integer, downwards, and upwards, respectively.
round(2.5) # Round to even
[1] 2
floor(2.5)
[1] 2
ceiling(2.5)
[1] 3
  2. %/% and %% for integer division and modulo reduction.
5 %/% 2
[1] 2
5 %% 2
[1] 1
  3. Mathematical functions: abs, sign, log, exp, etc.
  4. sum, prod, cumsum, cumprod for sums and products.
  5. min, max, cummin, cummax for extreme values.
  6. pmin, pmax for parallel (elementwise) min and max.
x <- c(1, 2, 3)
y <- c(2, 1, 4)
max(x, y)
[1] 4
pmax(x, y)
[1] 2 2 4
  7. sort, order for sorting and ordering. Especially useful is order for rearranging a data frame according to a specific ordering of one variable:
require(eha)
data(mort)
mt <- mort[order(mort$id, -mort$enter), ]
last <- mt[!duplicated(mt$id), ]

First, mort is sorted by id, and within id in decreasing order of enter (note the minus sign in the call to order!). Then all rows with a duplicated id are removed; only the first occurrence of each id is kept. In this way we get a data frame with exactly one row per individual, the row corresponding to the individual’s last record (in age or calendar time). See also the next item!

  8. duplicated and unique for marking duplicated elements in a vector and removing duplicates, respectively.
x <- c(1,2,1)
duplicated(x)
[1] FALSE FALSE  TRUE
unique(x)
[1] 1 2
x[!duplicated(x)]
[1] 1 2

C.2.1 Sequences

The operator : is used to generate sequences of numbers.

1:5
[1] 1 2 3 4 5
5:1
[1] 5 4 3 2 1
-1:5
[1] -1  0  1  2  3  4  5
-(1:5)
[1] -1 -2 -3 -4 -5
-1:-5
[1] -1 -2 -3 -4 -5

Don’t trust that you remember rules of precedence; use parentheses!

There are also the functions seq and rep; see the help pages!
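A couple of small examples of what they do:

seq(1, 10, by = 3)
[1]  1  4  7 10
rep(1:2, times = 3)
[1] 1 2 1 2 1 2
rep(1:2, each = 3)
[1] 1 1 1 2 2 2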

C.2.2 Logical expression

Logical expressions can take only two distinct values, TRUE and FALSE (note uppercase; R acknowledges the difference between lower case and upper case).

Logical vectors may be used in ordinary arithmetic, where they are coerced into numeric vectors with FALSE equal to zero and TRUE equal to one.
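For instance, summing a logical vector counts the number of TRUE values, and taking its mean gives the proportion:

x <- c(2, 5, 1, 7)
x > 3
[1] FALSE  TRUE FALSE  TRUE
sum(x > 3)    # number of elements larger than 3
[1] 2
mean(x > 3)   # proportion of elements larger than 3
[1] 0.5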

C.2.3 Indexing

Indexing is a powerful feature in R. It comes in several flavors.

  • A logical vector: must be of the same length as the vector it is applied to.
x <- c(-1:4)
x
[1] -1  0  1  2  3  4
x[x > 0]
[1] 1 2 3 4
  • Positive integers: select the corresponding values (see the examples below).
  • Negative integers: remove the corresponding values.
  • Character strings: for named vectors; select the elements with the given names.
  • Empty: selects all values. Often used to select all rows or columns from a matrix,
x <- matrix(c(1,2,3,4), nrow = 2)
x
     [,1] [,2]
[1,]    1    3
[2,]    2    4
x[, 1]
[1] 1 2
x[, 1, drop = FALSE]
     [,1]
[1,]    1
[2,]    2

Note that a dimension is lost when one row or column is selected. This can be overridden by the argument drop = FALSE (don’t drop dimension(s)).
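A few small examples of the integer and character flavors, using a named vector (a minimal sketch; the name z is just an illustration):

z <- c(Sam = 2.0, John = 3.4, Kate = 5.6)
z[c(1, 3)]            # positive integers: select
 Sam Kate 
 2.0  5.6 
z[-2]                 # negative integers: remove
 Sam Kate 
 2.0  5.6 
z[c("John", "Kate")]  # character strings: select by name
John Kate 
 3.4  5.6 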

C.2.4 Vectors and matrices

A matrix is a vector with a dim attribute.

x <- 1:4
x
[1] 1 2 3 4
dim(x) <- c(2, 2)
dim(x)
[1] 2 2
x
     [,1] [,2]
[1,]    1    3
[2,]    2    4
t(x)
     [,1] [,2]
[1,]    1    2
[2,]    3    4

Note that matrices are stored column-wise. The function t gives the transpose of a matrix.

An example of matrix multiplication.

x <- matrix(c(1,2,3,4, 5, 6), nrow = 2)
x
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
y <- matrix(c(1,2,3,4, 5, 6), ncol = 2)
y
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
y %*% x
     [,1] [,2] [,3]
[1,]    9   19   29
[2,]   12   26   40
[3,]   15   33   51
x %*% y
     [,1] [,2]
[1,]   22   49
[2,]   28   64

Dimensions must match according to ordinary mathematical rules.

C.2.5 Conditional execution

Conditional execution uses the if statement (but see also switch). Conditional constructs are typically used inside functions.

x <- 3
if (x > 3) {
    cat("x is larger than 3\n")
}else{ 
    cat("x is smaller than or equal to 3\n")
}
x is smaller than or equal to 3

See also the function cat (and print).
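A small example of switch, mentioned above, which selects one of several alternatives by name (a minimal sketch):

grp <- "b"
switch(grp, a = "first", b = "second", c = "third")
[1] "second"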

C.2.6 Loops

Loops are typically only used inside functions.

  • A for loop iterates an expression as a variable takes each value in a sequence.
x <- 1
for (i in 1:4){
    x <- x + 1
    cat("i = ", i, ", x = ", x, "\n")
}
i =  1 , x =  2 
i =  2 , x =  3 
i =  3 , x =  4 
i =  4 , x =  5 
x
[1] 5
  • The while loop iterates as long as a condition is TRUE.
done <- FALSE
i <- 0
while (!done & (i < 5)){
    i <- i + 1
    done <- (i > 3)
}
if (!done) cat("Did not converge\n")
  • The repeat loop iterates indefinitely or until a break statement is reached.
i <- 0
done <- FALSE
repeat{
    i <- i + 1
    done <- (i > 3)
    if (done) break
}
i
[1] 4

In R, there is a family of functions that operate directly on vectors and other structures; see the documentation of apply, tapply, lapply, and sapply.
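Two small illustrations (minimal sketches): sapply applies a function to each element of a vector or list, and apply works along the rows or columns of a matrix.

sapply(1:4, function(i) i^2)
[1]  1  4  9 16
Y <- matrix(1:6, nrow = 2)
apply(Y, 1, sum)   # row sums
[1]  9 12
apply(Y, 2, sum)   # column sums
[1]  3  7 11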

C.3 Writing Functions

Functions are created by assignment using the keyword function.

fun <- function(x) exp(x)
fun(1)
[1] 2.718282

This function, fun, is the exponential function, obviously. Usually, the function body consists of several statements.

fun2 <- function(x, maxit){
    if (any(x <= 0)) stop("Only valid for positive x")
    i <- 0
    sum <- 0
    for (i in 0:maxit){
        sum <- sum + x^i / gamma(i + 1)
    }
    sum
}
fun2(1, 5)
[1] 2.716667
fun2(1:5, 5)
[1]  2.716667  7.266667 18.400000 42.866667 91.416667

This function, fun2, uses the Taylor expansion of \(\exp(x)\) for the calculation. It is a very crude variant, and it is only valid for positive \(x\); it throws an error if fed a non-positive value. Note that it is vectorized, i.e., the first argument may be a real-valued vector. Also note the function any.

In serious function writing, it is very inconvenient to define the function at the command prompt. It is much better to write it in a good editor and then source it into the R session as needed. If it will be used extensively in the future, the way to go is to write a package. This is covered later in these notes. Our favorite editor, and the one used in writing the examples in these notes, is emacs, which is free and available on all platforms. Another option is to use RStudio (RStudio Team 2021).

Functions can be defined inside the body of another function. In this way the inner function is hidden from other functions, which is important when considering scoping rules. This is a feature that is not available in Fortran 77 or C. Suppose that the function inside is defined inside the function head.

  1. A call to inside will use the function defined in head, and not some function outside with the same name.
  2. The function inside is not visible outside head and cannot be used there.

The return value of a function is the value of the last expression in the function body.
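A minimal sketch of this construction, using the names head and inside from the example above (note that defining a function called head masks the utility function with the same name in the workspace; the result for exists assumes a clean workspace):

head <- function(x){
    inside <- function(y) y + 1   # visible only inside 'head'
    2 * inside(x)                 # the value of the last expression is returned
}
head(3)
[1] 8
exists("inside")   # 'inside' is not visible outside 'head'
[1] FALSE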

C.3.1 Calling conventions

The arguments to a function are either specified or unspecified (or both). Unspecified arguments are shown as “\(\ldots\)” in the function definition. See for instance the functions c and min. The “\(\ldots\)” may be replaced by any number of arguments in a call to the function.

The formal arguments are those specified in the function definition, and the actual arguments are those given in the actual call to the function. The rules by which formal and actual arguments are matched are the following (an example follows the list):

  1. Any actual argument given as name = value, where name exactly matches a formal argument, is matched.
  2. Arguments specified as name = value whose names do not exactly match any formal argument are considered next; if a unique partial match is found, it is accepted.
  3. The remaining unnamed arguments are then matched in sequence (positional matching).
  4. Remaining unmatched actual arguments become part of the … argument if there is one. Otherwise an error occurs.
  5. Formal arguments need not be matched.
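A small illustration of these rules with a made-up function f (the names alpha, beta, and gamma are just illustrations):

f <- function(alpha, beta = 2, gamma = 3){
    c(alpha = alpha, beta = beta, gamma = gamma)
}
f(1, gamma = 10)   # positional matching plus exact name matching
alpha  beta gamma 
    1     2    10 
f(be = 5, 1)       # partial matching: 'be' matches 'beta'
alpha  beta gamma 
    1     5     3 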

C.3.2 The argument ...

The purpose of the ... argument is to pass arguments to a function call inside the actual function. Read An Introduction to R (available in the R help system) for further detail.
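A minimal sketch: the made-up function myplot passes any extra arguments on to plot.

myplot <- function(x, y, ...){
    plot(x, y, ...)   # '...' is passed on to plot unchanged
}
myplot(1:10, (1:10)^2, type = "l", col = "blue")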

C.3.3 Writing functions

Below is a function that calculates the maximum likelihood estimate of the parameter in the exponential distribution, given a censored sample.

mlexp <- function(x, d){
    n <- length(x)
    ## Check indata:
    if (any(x <= 0)) stop("This function needs positive x")
    if (length(d) != n) stop("Lengths of x and d do not match")
    ## End check

    ## We adopt the rule: d == 0 for censored obs; 1 for 'events'.

    d <- as.numeric(d != 0)

    ## The density function is lambda^d * exp(-lambda * x), x > 0.
    ## But we prefer a parameterization where the 
    ## parameter may vary freely.
    ## Set lambda = exp(alpha):
    ## The density is exp(d * alpha) * exp(-x * exp(alpha)).
    
    sumx <- sum(x)
    sumd <- sum(d)
    loglik <- function(alpha){
        sumd * alpha - sumx * exp(alpha)
    }
    ## We want the first derivative for optimization:
    dloglik <- function(alpha){
        sumd - sumx * exp(alpha)
    }

    ## Start value:
    alpha <- 0
    res <- optim(alpha, fn = loglik, gr = dloglik,
                 method = "BFGS", control = list(fnscale = -1))
    res
}

First note that the arguments of the function, x and d, do not have any default values; both arguments must be matched in the actual call to the function. Second, note the sanity checks at the beginning of the function and the calls to stop in case of an error. For less severe mistakes it is possible to give a warning, which does not break the execution of the function.
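A small test of mlexp on simulated data may look like this (a sketch; with the true rate equal to 2 and no censoring, the estimated rate exp(fit$par) should come out close to 2):

set.seed(101)
x <- rexp(1000, rate = 2)   # simulated exponential survival times
d <- rep(1, 1000)           # all observations are events (no censoring)
fit <- mlexp(x, d)
exp(fit$par)                # estimated rate; should be close to 2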

C.3.4 Lazy evaluation

When a function is called, the arguments are parsed, but not evaluated until they are needed in the function. Specifically, if the function does not use an argument, it is never evaluated, and the corresponding variable need not even exist. This is in contrast to the rules in C and Fortran.

fu <- function(x, y) 2 * x
fu(3)
[1] 6
fu(3, y = zz)
[1] 6
exists("zz")
[1] FALSE

C.3.5 Recursion

Functions are allowed to be recursive, i.e., a function in R is allowed to call itself. See An Introduction to R for details, but be warned: This technique may be extremely memory-intensive and slow. Use with caution, there is often a more direct way of attack.
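A classical (if not very practical) example is the factorial function, a minimal sketch:

fact <- function(n){
    if (n <= 1) return(1)
    n * fact(n - 1)   # the function calls itself
}
fact(5)
[1] 120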

C.3.6 Vectorized functions

Most mathematical functions in R are vectorized, meaning that if the argument to the function is a vector, then the result is a vector of the same length. Some functions apply the recycling rule to vectors shorter than the longest:

sqrt(c(9, 4, 25))
[1] 3 2 5
pnorm(1, mean = 0, sd = c(1,2,3))
[1] 0.8413447 0.6914625 0.6305587

When writing your own functions, the question of making them vectorized should always be considered. Often a function is automatically vectorized, like

fun <- function(x) 3 * x - 2
fun(c(1,2,3))
[1] 1 4 7

but not always. Under some circumstances it may be necessary to perform the calculations explicitly for each element of a vector, as in the sketch below.
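For instance, a function whose body contains an if on its argument is not vectorized, since if only looks at a single value; sapply or Vectorize can then be used to apply it element by element. A minimal sketch (the names g and gv are just illustrations):

g <- function(x){
    if (x > 0) log(x) else 0   # 'if' looks at one value only, so g is not vectorized
}
sapply(c(-1, 1, exp(1)), g)    # apply g element by element
[1] 0 0 1
gv <- Vectorize(g)             # or create a vectorized version
gv(c(-1, 1, exp(1)))
[1] 0 0 1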

Within a function in R, objects are found in the following order:

  1. It is a variable defined inside the function (a local variable).
  2. It is in the argument list.
  3. It is defined in the environment where the function was defined.

Note that the search may eventually reach the variables in the global workspace. This is often not the intention!

Tip: Test functions in a clean workspace. For instance, start R from the command line with R --vanilla.

C.4 Standard Graphics

The workhorses in standard graphics handling in R are the functions plot and par. Read their respective documentations carefully; you can manipulate your plots almost without limitation!

See the code and Figure C.1 for the output.

source("plot2.R")
plot2
function () 
{
    set.seed(1211)
    x <- 3 + rnorm(100)
    y <- 2 + x + rnorm(100, sd = 0.5)
    oldpar <- par(mfrow = c(1, 2))
    on.exit(par(oldpar))
    par(las = 1)
    plot(x, y, main = "Scatterplot of x and y")
    hist(y, main = "Histogram of y", probability = TRUE)
}
plot2()

FIGURE C.1: Output from the function plot2.

C.5 Useful R Functions

There are some useful functions in R for matching and set-type operations. For exact definitions of these functions, see the documentation.

  • match match(x, y) returns a vector of the same length as x, giving the position in y of the first occurrence of each element of x. If there is no match for a particular element of x, NA is returned (this can be changed). Example:
x <- c(1,3,2,5)
y <- c(3, 2, 4, 1, 3)
match(x, y)
[1]  4  1  2 NA

The match function is useful for picking information from one data frame and putting it into another (or the same). As an example, look at the data frame oldmort.

require(eha)
data(oldmort)
names(oldmort)
 [1] "id"         "enter"      "exit"       "event"      "birthdate" 
 [6] "m.id"       "f.id"       "sex"        "civ"        "ses.50"    
[11] "birthplace" "imr.birth"  "region"    

Note the variables id (id of the current record) and m.id (id of the mother of the current record). Some of the mothers are also subjects in this file, and if we in an analysis want to use mother’s birthplace as a covariate, we can get it from her record (if present; there will be many missing mothers). This is accomplished with match; creating a new variable m.birthplace (mother’s birthplace) is as simple as

oldmort$m.birthplace <- with(oldmort, 
                             birthplace[match(m.id, id)])

The tricky part here is to get the order of the arguments to match right. Always check the result on a few records!

  • %in% A simplified version of match. Returns a logical vector of the same length as x.
x %in% y
[1]  TRUE  TRUE  TRUE FALSE

A simple and efficient way of selecting subsets of a certain kind. For instance, to select all cases with parity 1, 2, or 3, use

data(fert)
f13 <- fert[fert$parity %in% 1:3, ]

This is the same as

f13 <- fert[(fert$parity >= 1) & (fert$parity <= 3), ]

Note that the parentheses in the last line are strictly unnecessary, but their presence increases readability and avoids stupid mistakes about precedence between operators. Use parentheses often!

  • subset Selects subsets of vectors, arrays, and data frames. See the help page.

  • tapply Applies a function to all subsets of a data frame (it is more general than that; see the help page). Together with indexing, it is useful for creating new summary variables for clusters and attaching the values to each individual.

If we, for all rows in the fert data frame, want to add the age at first birth of the corresponding mother, we can do it as follows.

data(fert)
indx <- tapply(fert$id, fert$id)          # for each row, the group number of its id
min.age <- tapply(fert$age, fert$id, min) # minimum age per id (age at first birth)
fert$min.age <- min.age[indx]             # spread the per-id values back to all rows

Check this for yourself! And read the help pages for match and tapply carefully.

C.6 Help in R

Besides all documentation that can be found on CRAN, there is an extensive help system online in R. To start the general help system, type

> help.start()

at the prompt (or, find a corresponding menu item on a graphical user interface system). Then a help window opens up with access to several FAQs, and all the packages that are installed.

For immediate help on a specific function you know the name of, say coxreg, type

> ?coxreg

If you only need to see the syntax and input arguments, use the function args:

> args(coxreg)

C.7 Functions for Survival Analysis

Here some useful functions in eha and survival are listed. In most cases the description is very sparse, and the reader is recommended to consult the corresponding help page when more detail is wanted.

  • aftreg Fits accelerated failure time (AFT) models.
  • age.window Makes a “horizontal cut” in the Lexis diagram, that is, it selects a subset of a data set based on age. The data frame must contain three variables that can be input to the Surv function. The default names are enter, exit, and event. As an example,
library(eha)
mort6070 <- age.window(oldmort, c(60, 70))

limits the study of old age mortality to the (exact) ages between 60 and 70, i.e. from 60.000 up to and including 69.999 years of age.

For rectangular cuts in the Lexis diagram, use both age.window and cal.window in succession. The order is unimportant.

  • cal.window Makes a “vertical cut” in the Lexis diagram, that is, it selects a subset of a data set based on calendar time. As an example,
library(eha)
mort18601870 <- cal.window(oldmort, c(1860, 1870))

limits the study of old age mortality to the time period between 1860 and 1870, that is, from January 1, 1860 up to and including December 31, 1869.

For rectangular cuts in the Lexis diagram, use both age.window and cal.window in succession. The order is unimportant.

  • cox.zph Tests the proportionality assumption on fits from coxph. This function is part of the survival package.

  • coxreg Fits Cox proportional hazards models. It is a wrapper for coxph in the case of simpler Cox regression models (a small example follows this list).

  • phreg Fits parametric proportional hazards models.

  • pchreg Piecewise constant proportional hazards regression.

  • tpchreg Piecewise constant proportional hazards regression for tabular data.

  • toTpch Tabulating individual survival data.

  • risksets Finds the members of each riskset at all event times.
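As a small illustration (a sketch only, not an analysis to be taken seriously), a Cox regression on the oldmort data may look like this:

library(eha)
fit <- coxreg(Surv(enter, exit, event) ~ sex + civ, data = oldmort)
summary(fit)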

C.7.1 Checking the integrity of survival data

In data sources collected through reading and coding old printed and handwritten sources, many opportunities for making and propagating errors occur. Therefore, it is important to have tools for detecting logical errors, like people dying before they got married, or living at two places at the same time. In the sources used in this book, such errors occur in almost all data retrievals. It is important to say that this fact is not a sign of a bad job in the transfer of data from old sources to digital data files. The errors are almost always present in the original source, and in the registration process no corrections of “obvious” mistakes by the registrators are allowed; the digital files are supposed to be as close as possible to the original sources. These are the rules at the Demographic Database (DDB), Umeå University.

This means that the researcher has a responsibility to check her data for logical (and other) errors, and truthfully report how she handles these errors. In my own experience with data from the DDB, the relative frequency of errors varies between 1 and 5 per cent. In most cases it would not do much harm to simply delete erroneous records, but often the errors are “trivial,” and it seems easy to guess what the truth should be.

C.7.1.1 Old age mortality

As an example of the kind of errors discussed above, we take a look at the original version of oldmort in eha. For survival data in “interval” form (enter, exit], it is important that enter is smaller than exit. For individuals with more than one record, the intervals are not allowed to overlap, and at most one of them is allowed to end with an event. Note that we are talking about survival data, so there are no repeated events “by design.”

The function check.surv in eha tries to do these kinds of checks and report back all individuals (id) that do not follow the rules.

load("olm.rda")
require(eha)
errs <- with(olm, check.surv(enter, exit, event, id))
errs
 [1] "785000980" "787000493" "790000498" "791001135" "792001121"
 [6] "794001225" "794001364" "794001398" "796000646" "797001175"
[11] "797001217" "798001203" "800001130" "801000743" "801001113"
[16] "801001210" "801001212" "802001136" "802001155" "802001202"
[21] "803000747" "804000717" "804001272" "804001354" "804001355"
[26] "804001532" "805001442" "806001143" "807000736" "807001214"
[31] "808000663" "810000704" "811000705" "812001454" "812001863"
[36] "813000613" "814000799" "815000885" "816000894" "817000949"
[41] "819001046"
no.err <- length(errs)

So, 41 individuals have bad records. Let us look at some of them.

badrecs <- olm[olm$id %in% errs, ]
dim(badrecs)
[1] 105  13
badrecs[badrecs$id == errs[1], ]
           id  enter   exit event birthdate      m.id      f.id
283 785000980 74.427 75.917 FALSE  1785.573 743000229 747000387
284 785000980 75.525 87.917 FALSE  1785.573 743000229 747000387
285 785000980 87.917 88.306  TRUE  1785.573 743000229 747000387
       sex     civ ses.50 birthplace imr.birth region
283 female married farmer     parish   12.4183  rural
284 female   widow farmer     parish   12.4183  rural
285 female   widow farmer     parish   12.4183  rural

The error here is that the second interval, (75.525, 87.917], starts before the first interval has ended. We also see that this is a married woman who became a widow in the second record.

In this case I think that a simple correction saves the situation: Let the second interval begin where the first one ends. The motivation for this choice is that the first interval apparently ends with the death of the woman’s husband. Death dates are usually reliable, so I trust that date more than the others.

Furthermore, the exact choice here will have very little impact on the analyses we are interested in. The length of the overlap is less than half a year, and this person will eventually have the wrong civil status for at most that amount of time.

It turns out that all “bad” individuals are of this type. Therefore we “clean” the data set by running it through this function:

ger <- function(){
    require(eha)
    load("olm.rda")
    out <- check.surv(olm$enter, olm$exit, olm$event, olm$id)
    ko <- olm[olm$id %in% out, ]
    ## ko is all records belonging to "bad individuals"
    ## Sort ko, by id, then by interval start, then by event
    ko <- ko[order(ko$id, ko$enter, -ko$event), ]
    for (i in ko$id){ # Go thru all 'bad individuals'
        slup <- ko[ko$id == i, , drop = FALSE]
        ## Pick out their records
        n <- NROW(slup)
        if (n > 1){ # Change them
            for (j in (2:n)){
                slup$enter[j] <- slup$exit[j-1]
            }
            ## And put them back:
            ko[ko$id == i, ] <- slup
        }
    }
    ## Finally, put 'all bad' back into place:
    om <- olm
    om[om$id %in% out, ] <- ko
    ## Keep only records that start before they end:
    om <- om[om$enter < om$exit, ]
    om
}

Read the comments, lines (or part of lines) starting with #. The output of this function, om, is the present data file oldmort in eha, with the errors corrected as indicated above.

Running the function.

source("ger.R")
om <- ger()
left <- with(om, check.surv(enter, exit, event, id))
left
[1] "794001225"

Still one bad. Let us take a look at that individual.

om[om$id == left, ] 
            id  enter   exit event birthdate m.id f.id  sex   civ
1168 794001225 65.162 68.650 FALSE  1794.838   NA   NA male widow
1169 794001225 68.650 85.162 FALSE  1794.838   NA   NA male widow
1171 794001225 78.652 85.162 FALSE  1794.838   NA   NA male widow
     ses.50 birthplace imr.birth region
1168 farmer     parish  12.70903  rural
1169 farmer     region  12.70903  rural
1171 farmer     parish  12.70903  rural

It seems as if this error (the second and third records overlap) should have been fixed by our function ger. However, a record may have been removed, so we have to look at the original data file:

olm[olm$id == left, ]
            id  enter   exit event birthdate m.id f.id  sex   civ
1168 794001225 65.162 68.650 FALSE  1794.838   NA   NA male widow
1169 794001225 65.162 85.162 FALSE  1794.838   NA   NA male widow
1170 794001225 68.650 78.652 FALSE  1794.838   NA   NA male widow
1171 794001225 78.652 85.162 FALSE  1794.838   NA   NA male widow
     ses.50 birthplace imr.birth region
1168 farmer     parish  12.70903  rural
1169 farmer     region  12.70903  rural
1170 farmer     parish  12.70903  rural
1171 farmer     parish  12.70903  rural

Yes, after correction, the third record was removed. The matter can now be fixed by removing the last record in om[om$id == left, ]:

who <- which(om$id == left)
who
[1] 1166 1167 1168
om <- om[-who[length(who)], ]
om[om$id == left, ]
            id  enter   exit event birthdate m.id f.id  sex   civ
1168 794001225 65.162 68.650 FALSE  1794.838   NA   NA male widow
1169 794001225 68.650 85.162 FALSE  1794.838   NA   NA male widow
     ses.50 birthplace imr.birth region
1168 farmer     parish  12.70903  rural
1169 farmer     region  12.70903  rural

And we are done.

C.8 Reading Data into R

The first thing you have to do in a statistical analysis (in R) is to get data into the computing environment. Data may come from different kinds of sources, and R has the capacity to read from many of them, such as data files from other statistical programs or spreadsheet-type (think Excel) data. This is normally done with the aid of the recommended package foreign (R Core Team 2020), but the newer package haven (Wickham and Miller 2020) is also useful.

If everything else fails, it is almost always possible to get data written in ASCII format (e.g., .csv files). These can be read into R with the function read.table.

C.8.1 Reading data from ASCII files

Ordinary text (ASCII) files in tabular form can be read into R with the function read.table, and the result is a data frame. For instance, with a file looking like this:

 enter exit event
     1    5     1
     2    7     1
     1    5     1
     0    6     0

we read it into R with

mydata <- read.table("my.dat", header = TRUE)

Note the argument header. It can take two values, TRUE or FALSE, with FALSE being the default value. If header = TRUE, then the first row of the data file is interpreted as variable names.

There are a few optional arguments to read.table that are important to consider. The first is dec, which gives the character used as a decimal point. The default is “.”, but in some locales (e.g., Swedish) “,” is commonly used in output from software other than R. The second is sep, which gives the field separator character. The default value is “”, which means “white space” (tabs, spaces, newlines, etc.). See the help page for all the possibilities, including the functions read.csv, read.csv2, etc., which are variants of read.table with other default values.

The result should always be checked after reading. Do that by printing a few rows

head(mydata)
  enter exit event
1     1    5     1
2     2    7     1
3     1    5     1
4     0    6     0

and/or by summarizing the columns

summary(mydata)
     enter           exit          event     
 Min.   :0.00   Min.   :5.00   Min.   :0.00  
 1st Qu.:0.75   1st Qu.:5.00   1st Qu.:0.75  
 Median :1.00   Median :5.50   Median :1.00  
 Mean   :1.00   Mean   :5.75   Mean   :0.75  
 3rd Qu.:1.25   3rd Qu.:6.25   3rd Qu.:1.00  
 Max.   :2.00   Max.   :7.00   Max.   :1.00  

The str function is also helpful:

str(mydata)
'data.frame':   4 obs. of  3 variables:
 $ enter: int  1 2 1 0
 $ exit : int  5 7 5 6
 $ event: int  1 1 1 0

For each variable, its name and type is given, together with the first few values. In this example it is all the values, and we can also see that R interprets all the values as integer. This is fine, even though at least enter and exit probably are supposed to be real-valued. In fact, R is not so strict with numerical types (as opposed to C and Fortran). Frequent coercing takes place, and the mode numeric contains the types integer and double.

Now this tends to be confusing, but the bottom line is that you, as a user, need not worry at all, except in one place: when you are writing an R function that calls compiled C or Fortran code, which is far beyond the scope of this book. Interested readers are referred to Venables and Ripley (2000).

C.8.2 Reading foreign data files

Data files from Minitab, SAS, SPSS, Stata, etc., can be read with a suitable function from the foreign package. Consult its help pages for more information. Another option is to use one of the tidyverse (Wickham et al. 2019) packages haven, readxl, and DBI. For more information, see the relevant help pages.
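For instance, a Stata file could be read with haven along these lines (a sketch; the file name mydata.dta is just an illustration):

library(haven)
mydata <- read_dta("mydata.dta")   # returns a data frame (tibble)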

References

Dalgaard, P. 2008. Introductory Statistics with R. Second Edition. Berlin: Springer.
R Core Team. 2020. Foreign: Read Data Stored by ’Minitab’, ’S’, ’SAS’, ’SPSS’, ’Stata’, ’Systat’, ’Weka’, ’dBase’, ... https://CRAN.R-project.org/package=foreign.
RStudio Team. 2021. RStudio: Integrated Development Environment for r. Boston, MA: RStudio, PBC. http://www.rstudio.com/.
Venables, W. N., and B. D. Ripley. 2000. S Programming. New York: Springer-Verlag.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wickham, Hadley, and Evan Miller. 2020. Haven: Import and Export ’SPSS’, ’Stata’ and ’SAS’ Files. https://CRAN.R-project.org/package=haven.