C A Brief Introduction to R
There are already many excellent sources of information available for those who want to learn more about R. A visit to the R home page, is highly recommendable. When there, look under Documentation. The Task Views on the home page of CRAN contains a section Survival worth examining.
When R is installed on a computer, it is easy to
access the documentation that comes with the distribution; start
R and read the help page ?help.start
. A good start for a
beginner is also the book Introductory Statistics with R by
(Dalgaard 2008). The book on
S Programming (Venables 2000) is recommended for the more
advanced studies of the topic that its title implies.
That said, in this chapter we will only look at some aspects of working with R, aspects that are relevant to our topic.
C.1 R in General
In this section we investigate how to work with R in general, without special reference to event history analysis, which will be treated separately in a later section in this chapter.
C.1.1 R objects
R is an object-oriented programming language, which have certain implications for terminology. So is for instance everything i R objects. We will not draw too much on this fact, but it may be useful to know.
C.1.2 Expressions and assignments
Commands are either expressions or assignments. Commands are separated by newline (normally) or semi-colon. The hash mark (#) marks the rest of the line as a comment.
1 + exp(2.3)
[1] 10.97418
<- 1 + exp(2.3)
x x
[1] 10.97418
Note that the R prompt is >
. If the previous expression is
incomplete, it changes to +
.
The first line above is an expression. It is evaluated as return
is
pressed, and the result is normally printed. The second line is an
assignment, the result is stored in x
and not printed. By typing x
, its
content is normally printed.
The assignment symbol is <-
, which consists of two consecutive key
strokes, “<” followed by “-.” It is strongly recommended to enclose the
assignment symbol (and any arithmetic operator) between spaces. It is
forbidden to separate the two symbols by a space! You will then get
comparison with something negative:
<- 3
x < - 3 x
[1] FALSE
It ia also allowed, nowadays, to use =
as an assignment symbol, but be aware
of the fact that <-
and =
are not always exchangeable, for instance when
giving values to arguments in function calls.
C.1.3 Objects
Everything in R is an object (object-oriented programming). All objects have a mode and a length. Data objects have modes numeric, complex, logical, or character, and language objects have modes function, call, expression, name, etc. All data objects are vectors; there are no scalars in R.
Vectors contain elements of the same kind; a list can be thought of a vector where each element can be anything, even a list, and the elements can be completely different.
There are more properties of objects worth mentioning, but we save that for the situations where we need them.
C.1.4 Vectors and matrices
There are five basic types of vectors:
logical
, integer
, double
,
complex
, and character
.
The function c
(for concatenate) creates vectors:
<- c(2.0, 3.4, 5.6)
dat <- c("red", "green", "blue") cols
The elements of a vector may be named:
names(dat) <- c("Sam", "John", "Kate")
names(dat)
[1] "Sam" "John" "Kate"
dat
Sam John Kate
2.0 3.4 5.6
"John"] dat[
John
3.4
Note the first two lines; names
may either display (second row) or
replace (first row).
The function matrix creates a matrix (surprise!):
<- matrix(c(1,2,3,4,5,6), nrow = 2, ncol = 3)
Y Y
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
dim(Y) <- c(3, 2)
Y
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
as.vector(Y)
[1] 1 2 3 4 5 6
mode(Y)
[1] "numeric"
attributes(Y)
$dim
[1] 3 2
It is also possible to create vectors by the functions numeric
,
character
, and logical
.
<- logical(7)
x x
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
<- numeric(7)
x x
[1] 0 0 0 0 0 0 0
<- character(7)
x x
[1] "" "" "" "" "" "" ""
These functions are useful for allocating storage (memory), which later can be filled. This is better practice than letting a vector grow sequentially.
The functions nrow
and ncol
extracts the number of rows and
columns, respectively, of a matrix.
An array is an extension of a matrix to more than two dimensions.
C.1.5 Lists
Family records may be produced as a list:
<- list(FamilyID = 1, man = "John", wife = "Kate",
fam children = c("Sam", "Bart"))
fam
$FamilyID
[1] 1
$man
[1] "John"
$wife
[1] "Kate"
$children
[1] "Sam" "Bart"
$children fam
[1] "Sam" "Bart"
C.1.6 Data frames
A data frame is a special case of a list. It consists of
variables of the same length, but not necessarily of the same type. Data
frames are created by either data.frame
or read.table
, where
the latter reads data from an ASCII file on disk.
<- data.frame(name = c("Homer", "Flanders", "Skinner"),
dat income = c(1000, 23000, 85000))
dat
name income
1 Homer 1000
2 Flanders 23000
3 Skinner 85000
A data frame has row names (here: 1, 2, 3) and variable (column)
names (here: name
, income
). The data frame is the normal object
for data storage and statistical analysis in R.
C.1.7 Factors
Categorical variables are conveniently stored as factors in R.
<- factor(c("se", "no", "uk", "uk", "us"))
country country
[1] se no uk uk us
Levels: no se uk us
print.default(country)
[1] 2 1 3 3 4
Factors are internally coded by integers (1, 2, …). The levels are by default sorted into alphabetical order. Can be changed:
<- factor(c("se", "no", "uk", "uk", "us"),
country levels = c("uk", "us", "se", "no"))
country
[1] se no uk uk us
Levels: uk us se no
The first level often has a special meaning (“reference category”) in statistical analyses in R.
C.1.8 Operators
Arithmetic operations are performed on vectors, element by element. The
common operators are + - * / ^
, where ^
is
exponentiation. For instance
<- c(1, 2, 3)
x ^2 x
[1] 1 4 9
<- 2 * x
y / x y
[1] 2 2 2
C.1.9 Recycling
The last examples above were examples of recycling; If two vectors of different lengths occur in an expression, the shorter one is recycled (repeated) as many times as necessary. For instance
<- options(warn = -1) # Turn off warnings.
op <- c(1, 2, 3)
x <- c(2, 3)
y / y x
[1] 0.5000000 0.6666667 1.5000000
options(op) # Set warnings to original state.
so the actual operation performed is
c(1, 2, 3) / c(2, 3, 2) # Truncated to equal length
[1] 0.5000000 0.6666667 1.5000000
It is most common with the shorter vector being of length one, for instance, a
scalar multiplication. If the length of the longer vector is not a multiple
of the length of the shorter vector, a warning is issued: longer object length is not a multiple of shorter object length
(not shown
above). This is often a sign of a mistake in the code (so do not turn off warnings!).
C.1.10 Precedence
Multiplication and division is performed before addition and subtraction, as in pure mathematics. The (almost) full set of rules, from highest to lowest, are:
$, [[
: Extraction of elements in a list.[
: Vector extraction.^
: Exponentiation.-
: Unary minus.:
: Sequence generation.%%, %/%, %*%
: Special operators.*, /
: & Multiplication and division.+, -
: Addition and subtraction.<, >, <=, >=, ==, !=
: Comparison.!
: Logical negation.&, |, &&, ||
: Logical operators.$\sim$
: Formula.<-
: Assignment.
It is, however, highly recommended to use parentheses often rather than relying on (and remembering) those rules. Compare, for instance
1:3 + 1
[1] 2 3 4
1:(3 + 1)
[1] 1 2 3 4
For descriptions of specific operators, consult the help system.
C.2 Some Standard R Functions
round
,floor
,ceiling
: Rounds to nearest integer, downwards, and upwards, respectively.
round(2.5) # Round to even
[1] 2
floor(2.5)
[1] 2
ceiling(2.5)
[1] 3
%/%
and%%
for integer division and modulo reduction.
5 %/% 2
[1] 2
5 %% 2
[1] 1
- Mathematical functions:
abs
,sign
,log
,exp
, etc. sum
,prod
,cumsum
,cumprod
for sums and products.min
,max
,cummin
,cummax
for extreme values.pmin
,pmax
parallel min and max.
<- c(1, 2, 3)
x <- c(2, 1, 4)
y max(x, y)
[1] 4
pmax(x, y)
[1] 2 2 4
sort
,order
for sorting and ordering. Especially useful isorder
for rearranging a data frame according to a specific ordering of one variable:
require(eha)
data(mort)
<- mort[order(mort$id, -mort$enter), ]
mt <- mt[!duplicated(mt$id), ] last
First, mort
is sorted after id
, and within id
decreasingly
after enter
(notice the minus sign in the formula!). Then all rows
with duplicated id
are removed, only the first appearance is
kept. In this way we get a data frame with exactly one row per individual,
the row that corresponds to the individual’s last (in age or calendar time).
See also the next item!
duplicated
andunique
for marking duplicated elements in a vector and removing duplicates, respectively.
<- c(1,2,1)
x duplicated(x)
[1] FALSE FALSE TRUE
unique(x)
[1] 1 2
!duplicated(x)] x[
[1] 1 2
C.2.1 Sequences
The operator :
is used to generate sequences of numbers.
1:5
[1] 1 2 3 4 5
5:1
[1] 5 4 3 2 1
-1:5
[1] -1 0 1 2 3 4 5
-(1:5)
[1] -1 -2 -3 -4 -5
-1:-5
[1] -1 -2 -3 -4 -5
Don’t trust that you remember rules of precedence; use parentheses!
There is also the functions seq
and rep
, see the help pages!
C.2.2 Logical expression
Logical expressions can take only two distinct values, TRUE
and FALSE
(note uppercase; R acknowledges the difference between
lower case and upper case).
Logical vectors may be used in ordinary arithmetic, where they are
coerced into numeric vectors with FALSE
equal to zero and
TRUE
equal to one.
C.2.3 Indexing
Indexing is a powerful feature in R. It comes in several flavors.
- A logical vector Must be of the same length as the vector it is applied to.
<- c(-1:4)
x x
[1] -1 0 1 2 3 4
> 0] x[x
[1] 1 2 3 4
- Positive integers Selects the corresponding values.
- Negative integers Removes the corresponding values.
- Character strings For named vectors; select the elements with the given names.
- Empty Selects all values. Often used to select all rows or columns from a matrix,
<- matrix(c(1,2,3,4), nrow = 2)
x x
[,1] [,2]
[1,] 1 3
[2,] 2 4
1] x[,
[1] 1 2
1, drop = FALSE] x[,
[,1]
[1,] 1
[2,] 2
Note that a dimension is lost when one row or column is selected. This can be
overridden by the argument drop = FALSE
(don’t drop dimension(s)).
C.2.4 Vectors and matrices
A matrix is a vector with a dim
attribute.
<- 1:4
x x
[1] 1 2 3 4
dim(x) <- c(2, 2)
dim(x)
[1] 2 2
x
[,1] [,2]
[1,] 1 3
[2,] 2 4
t(x)
[,1] [,2]
[1,] 1 2
[2,] 3 4
Note that matrices are stored column-wise. The function t
gives the transpose of a matrix.
An example of matrix multiplication.
<- matrix(c(1,2,3,4, 5, 6), nrow = 2)
x x
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
<- matrix(c(1,2,3,4, 5, 6), ncol = 2)
y y
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
%*% x y
[,1] [,2] [,3]
[1,] 9 19 29
[2,] 12 26 40
[3,] 15 33 51
%*% y x
[,1] [,2]
[1,] 22 49
[2,] 28 64
Dimensions must match according to ordinary mathematical rules.
C.2.5 Conditional execution
Conditional execution uses the if
statement (but see also
switch
). Conditional constructs are typically used inside functions.
<- 3
x if (x > 3) {
cat("x is larger than 3\n")
else{
}cat("x is smaller than or equal to 3\n")
}
x is smaller than or equal to 3
See also the function cat
(and print
).
C.2.6 Loops
Loops are typically only used inside functions.
- A
for
loop makes an expression to to be iterated as a variable assumes all values in a sequence.
<- 1
x for (i in 1:4){
<- x + 1
x cat("i = ", i, ", x = ", x, "\n")
}
i = 1 , x = 2
i = 2 , x = 3
i = 3 , x = 4
i = 4 , x = 5
x
[1] 5
- The
while
loop iterates as long as a condition isTRUE
.
<- FALSE
done <- 0
i while (!done & (i < 5)){
<- i + 1
i <- (i > 3)
done
}if (!done) cat("Did not converge\n")
- The
repeat
loop iterates indefinitely or until abreak
statement is reached.
<- 0
i <- FALSE
done repeat{
<- i + 1
i <- (i > 3)
done if (done) break
} i
[1] 4
In R, there are a family of functions that operates directly on
vectors. See the documentation of apply
, tapply
, lapply
,
and sapply
.
C.3 Writing Functions
Functions are created by assignment using the keyword function
.
<- function(x) exp(x)
fun fun(1)
[1] 2.718282
This function, fun
, is the exponential function, obviously. Usually,
the function body consists of several statements.
<- function(x, maxit){
fun2 if (any(x <= 0)) error("Only valid for positive x")
<- 0
i <- 0
sum for (i in 0:maxit){
<- sum + x^i / gamma(i + 1)
sum
}
sum
}fun2(1, 5)
[1] 2.716667
fun2(1:5, 5)
[1] 2.716667 7.266667 18.400000 42.866667 91.416667
This function, fun2
, uses the Taylor expansion of \(\exp(x)\) for
calculation. This is a very crude variant, and it is only good for positive
\(x\). It throws an error if fed by a negative value. Note that it is
vectorized, i.e., the first argument may be a real-valued
vector. Also note the function any
.
In serious function writing, it is very inconvenient to define the function
at the command prompt. It is much better to write it in a good editor and
then source
it into the R session as needed. If it will be used
extensively in the future, the way to go is to write a package
. This
is covered later in these notes. Our favorite editor, and the one used in
writing the examples in these notes, is emacs
, which is free and available on all
platforms. Another option is to use RStudio (RStudio Team 2021).
Functions can be defined inside the body of another function. In this way
it is hidden for other functions, which is important when considering
scoping rules. This is a feature that is not available in Fortran
77 or C. Suppose that the function inside
is defined inside
the function head
.
- A call to
inside
will use the function defined inhead
, and not some function outside with the same name. - The function
inside
is not visible outsidehead
and cannot be used there.
The return value of a function is the value of the last expression in the function body.
C.3.1 Calling conventions
The arguments to a function are either specified or
unspecified (or both). Unspecified arguments are shown as “\(\ldots\)”
in the function definition. See for instance the functions c
and min
.
The “\(\ldots\)” may be replaced by any number of arguments in a call to the
function.
The formal arguments are those specified in the function definition and the actual arguments are those given in the actual call to the function. The rules by which formal and actual are matched are
- Any actual argument given as
name = value
, wherename
exactly matches a formal argument, is matched. - Arguments specified by
name = value
, but no exact match are then considered. If a perfect partial match is found, it is accepted. - The remaining unnamed arguments are then matched in sequence (positional matching).
- Remaining unmatced actual arguments are part of the … argument if there is one. Otherwise an error occurs.
- Formal arguments need not be matched.
C.3.2 The argument ...
The purpose of the ...
argument is to pass arguments to a
function call inside the actual function. Read An Introduction to R (available
in the R help system) for further detail.
C.3.3 Writing functions
Below is a function that calculates the maximum likelihood estimate of the parameter in the exponential distribution, given a censored sample.
mlexp <- function(x, d){
n <- length(x)
## Check indata:
if (any(x <= 0)) stop("This function needs positive x")
if (length(d) != n) stop("Lengths of x and d do not match")
## End check
## We adopt the rule: d == 0 for censored obs; 1 for 'events'.
d <- as.numeric(d != 0)
## The density function is lambda^d * exp(-lambda * x), x > 0.
## But we prefer a parameterization where the
## parameter may vary freely.
## Set lambda = exp(alpha):
## The density is exp(d * alpha) * exp(-x * exp(alpha)).
sumx <- sum(x)
sumd <- sum(d)
loglik <- function(alpha){
sumd * alpha - sumx * exp(alpha)
}
## We want the first derivative for optimization:
dloglik <- function(alpha){
sumd - sumx * exp(alpha)
}
## Start value:
alpha <- 0
res <- optim(alpha, fn = loglik, gr = dloglik,
method = "BFGS", control = list(fnscale = -1))
res
}
First note that the arguments of the function, x
and d
, do not
have any default values. Both arguments must be matched in the actual call
to the function. Second, note the check of sanity in the beginning of the
function and the calls to stop
in case of an error. For less severe
mistakes it is possible to give a warning
, which does not break the
execution of the function.
C.3.4 Lazy evaluation
When a function is called, the arguments are parsed, but no evaluated until it is needed in the function. Specifically, if the function does not use the argument, it is never evaluated, and the variable need not even exist. This is in contrast to rules in C and Fortran.
<- function(x, y) 2 * x
fu fu(3)
[1] 6
fu(3, y = zz)
[1] 6
exists("zz")
[1] TRUE
C.3.5 Recursion
Functions are allowed to be recursive, i.e., a function in R is allowed to call itself. See An Introduction to R for details, but be warned: This technique may be extremely memory-intensive and slow. Use with caution, there is often a more direct way of attack.
C.3.6 Vectorized functions
Most mathematical functions in R are vectorized, meaning that if the argument to the function is a vector, them the result is a vector of the same length. Some functions apply the recycling rule to vectors shorter than the longest:
sqrt(c(9, 4, 25))
[1] 3 2 5
pnorm(1, mean = 0, sd = c(1,2,3))
[1] 0.8413447 0.6914625 0.6305587
When writing own functions, the question of making it vectorized should always be considered. Often it is automatically vectorized, like
<- function(x) 3 * x - 2
fun fun(c(1,2,3))
[1] 1 4 7
but not always. Under some circumstances, it may be necessary to explicitly make the calculations separately for each element in a vector.
Within a function in R, objects are found in the following order:
- It is a defined variable.
- It is in the argument list.
- It is defined in the environment where the function is defined.
Note that a function will eventually look for the variable in the work space. This is often not the intention!
Tip: Test functions in a clean workspace. For instance, start R with
the flag R --vanilla
.
C.4 Standard Graphics
The workhorses in standard graphics handling in R are the functions
plot
and par
. Read their respective documentations carefully; you
can manipulate your plots almost without limitation!
See the code and Figure C.1 for the output.
source("plot2.R")
plot2
function ()
{
set.seed(1211)
x <- 3 + rnorm(100)
y <- 2 + x + rnorm(100, sd = 0.5)
oldpar <- par(mfrow = c(1, 2))
on.exit(par(oldpar))
par(las = 1)
plot(x, y, main = "Scatterplot of x and y")
hist(y, main = "Histogram of y", probability = TRUE)
}
plot2()
FIGURE C.1: Output from the function plot2.
C.5 Useful R Functions
There are some useful functions in R doing set operations. For exact definitions of these functions, see the documentation.
- match
match(x, y)
returns a vector of the same length asx
with the place of the first occurrence of each element inx
iny
. If there is mo match for a particular element inx
,NA
is returned (can be changed). Example
<- c(1,3,2,5)
x <- c(3, 2, 4, 1, 3)
y match(x, y)
[1] 4 1 2 NA
The match
function is useful picking information in one data frame
and put it in another (or the same). As an example, look at the data frame
oldmort
.
require(eha)
data(oldmort)
names(oldmort)
[1] "id" "enter" "exit" "event" "birthdate"
[6] "m.id" "f.id" "sex" "civ" "ses.50"
[11] "birthplace" "imr.birth" "region"
Note the variables id
(id of current record) and m.id
(id of
the mother to the current). Some of the mothers are also subjects in this
file, and if we in an analysis want to to use mother’s birthplace
as
a covariate, we can get it from her record (if present; there will be many
missing mothers). This is accomplished with match
; creating a new
variable m.birthplace
(mother’s birthplace) is as simple as
$m.birthplace <- with(oldmort,
oldmortmatch(m.id, id)]) birthplace[
The tricky part here is to get the order of the arguments to match
right. Always check the result on a few records!
- %in% Simple version of
match
. Returns a logical vector of the same length asx
.
%in% y x
[1] TRUE TRUE TRUE FALSE
A simple and efficient way of selecting subsets of a certain kind, For instance, to select all cases with parity 1, 2, or 3, use
data(fert)
<- fert[fert$parity %in% 1:3, ] f13
This is the same as
<- fert[(fert$parity >= 1) & (fert$parity <= 3), ] f13
Note that the parentheses in the last line are strictly unnecessary, but their presence increases readability and avoids stupid mistakes about precedence between operators. Use parentheses often!
subset Selects subsets of vectors, arrays, and data frames. See the help page.
tapply Applies a function to all subsets of a data frame (it is more general than that; see the help page). Together with it is useful for creating new summary variables for clusters and sticking the values to each individual.
If we for all rows in the fert
data frame want to add the age at
first birth for the corresponding mother, we can do this as follows.
data(fert)
<- tapply(fert$id, fert$id)
indx <- tapply(fert$age, fert$id, min)
min.age $min.age <- min.age[indx] fert
Check this for yourself! And read the help pages for match
and
tapply
carefully.
C.6 Help in R
Besides all documentation that can be found on CRAN, there is an extensive help system online in R. To start the general help system, type
> help.start()
at the prompt (or, find a corresponding menu item on a graphical user interface system). Then a help window opens up with access to several FAQs, and all the packages that are installed.
For immediate help on a specific function you know the name of, say coxreg
, type
> ?coxreg
If you only need to see the syntax and input arguments, use the function :
> args(coxreg)
C.7 Functions for Survival Analysis
Here some useful functions in eha
and survival
are listed. In most cases the
description is very sparse, and the reader is recommended to consult the
corresponding help page when more detail is wanted.
- aftreg Fits accelerated failure time (AFT) models.
- age.window Makes a “horizontal
cut” in the Lexis diagram, that is, it selects a
subset of a data set based on age. The data frame must contain three
variables that can be input to the
Surv
function. The default names areenter
,exit
, andevent
. As an example,
library(eha)
<- age.window(oldmort, c(60, 70)) mort6070
limits the study of old age mortality to the (exact) ages between 60 and 70, i.e. from 60.000 up to and including 69.999 years of age.
For rectangular cuts in the Lexiz diagram, use both age.window
and cal.window
in succession. The order is unimportant.
- cal.window Makes a “vertical cut” in the Lexis diagram, that is, it selects a subset of a data set based on calendar time. As an example,
library(eha)
<- cal.window(oldmort, c(1860, 1870)) mort18601870
limits the study of old age mortality to the time period between 1860 and 1870, that is, from January 1, 1860 up to and including December 31, 1869.
For rectangular cuts in the Lexiz diagram, use both age.window
and cal.window
in succession. The order is unimportant.
cox.zph Tests the proportionality assumption on fits from
coxph
. This function is part of the survival package.coxreg Fits Cox proportional hazards model. It is a wrapper for
coxph
in case of simpler Cox regression models.phreg Fits parametric proportional hazards models.
pchreg Piecewise constant proportional hazards regression
tpchreg Piecewise constant proportional hazards regression for tabular data.
toTpch Tabulating individual survival data.
risksets Finds the members of each riskset at all event times.
C.7.1 Checking the integrity of survival data
In data sources collected through reading and coding old printed handwritten sources, many opportunities for making and propagating errors occur. Therefore, it is important to have tools for detecting logical errors like people dying before they got married, of living at two places at the same time. In the sources used in this book, such errors occur in almost all data retrievals. It is important to say that this fact is not a sign of a bad job in the transfer of data from old sources to digital data files. The case is almost always that the errors are present in the original source, and in the registration process no corrections of “obvious” mistakes by the registrators are allowed; the digital files are supposed to be as close as possible to the original sources. These are the rules at the Demographic Database (DDB), Umeå University.
This means that the researcher has a responsibility to check her data for logical (and other) errors, and truthfully report how she handles these errors. In my own experience with data from the DDB, the relative frequency of errors varies between 1 and 5 per cent. In most cases it would not do much harm to simply delete erroneous records, but often the errors are “trivial,” and it seems easy to guess what the truth should be.
C.7.1.1 Old age mortality
As an example of the kind of errors discussed above, we take a look at the
original version of oldmort
in eha
. For survival data in
"“interval” form (enter, exit]
, it is important that enter
is smaller than exit. For individuals with more than one record, the
intervals are not allowed to overlap, and at most one of them is allowed to
end with an event
. Note that we are talking about survival data,
so there are no repeated events “by design.”
The function check.surv
in eha
tries to do these kinds of
checks and report back all individuals (id
) that do not follow the
rules.
load("olm.rda")
require(eha)
<- with(olm, check.surv(enter, exit, event, id))
errs errs
[1] "785000980" "787000493" "790000498" "791001135" "792001121"
[6] "794001225" "794001364" "794001398" "796000646" "797001175"
[11] "797001217" "798001203" "800001130" "801000743" "801001113"
[16] "801001210" "801001212" "802001136" "802001155" "802001202"
[21] "803000747" "804000717" "804001272" "804001354" "804001355"
[26] "804001532" "805001442" "806001143" "807000736" "807001214"
[31] "808000663" "810000704" "811000705" "812001454" "812001863"
[36] "813000613" "814000799" "815000885" "816000894" "817000949"
[41] "819001046"
<- length(errs) no.err
So, 41 individuals have bad records. Let us look at some of them.
<- olm[olm$id %in% errs, ]
badrecs dim(badrecs)
[1] 105 13
$id == errs[1], ] badrecs[badrecs
id enter exit event birthdate m.id f.id
283 785000980 74.427 75.917 FALSE 1785.573 743000229 747000387
284 785000980 75.525 87.917 FALSE 1785.573 743000229 747000387
285 785000980 87.917 88.306 TRUE 1785.573 743000229 747000387
sex civ ses.50 birthplace imr.birth region
283 female married farmer parish 12.4183 rural
284 female widow farmer parish 12.4183 rural
285 female widow farmer parish 12.4183 rural
The error here is that the second interval (75.526, 87.927]
starts before
the first interval has ended. We also see that this is a married woman who
became a widow in the second record.
In this case I think that a simple correction saves the situation: Let the second interval begin where the first ends. The motivation for this choice is that the first interval apparently ends with the death of the woman’s husband. Death dates are usually reliable, so I believe more in it than in other dates.
Furthermore, the exact choice here will have very little impact on the analyses we are interested in. The length of the overlap is less than half a year, and this person will eventually have the wrong civil status for at most that amount of time.
It turns out that all “bad” individuals are of this type. Therefore we “clean” the data set by running it through this function:
ger <- function(){
require(eha)
load("olm.rda")
out <- check.surv(olm$enter, olm$exit, olm$event, olm$id)
ko <- olm[olm$id %in% out, ]
## ko is all records belonging to "bad individuals"
## Sort ko, by id, then by interval start, then by event
ko <- ko[order(ko$id, ko$enter, -ko$event), ]
for (i in ko$id){ # Go thru all 'bad individuals'
slup <- ko[ko$id == i, , drop = FALSE]
## Pick out their records
n <- NROW(slup)
if (n > 1){ # Change them
for (j in (2:n)){
slup$enter[j] <- slup$exit[j-1]
}
## And put them back:
ko[ko$id == i, ] <- slup
}
}
## Finally, put 'all bad' back into place:
om <- olm
om[om$id %in% out, ] <- ko
## Keep only records that start before they end:
om <- om[om$enter < om$exit, ]
om
}
Read the comments, lines (or part of lines) starting with #. The output of
this function, om
, is the present data file oldmort
in
eha
, with the errors corrected as indicated above.
Running the function.
source("ger.R")
<- ger()
om <- with(om, check.surv(enter, exit, event, id))
left left
[1] "794001225"
Still one bad. Let us take a look at that individual.
$id == left, ] om[om
id enter exit event birthdate m.id f.id sex civ
1168 794001225 65.162 68.650 FALSE 1794.838 NA NA male widow
1169 794001225 68.650 85.162 FALSE 1794.838 NA NA male widow
1171 794001225 78.652 85.162 FALSE 1794.838 NA NA male widow
ses.50 birthplace imr.birth region
1168 farmer parish 12.70903 rural
1169 farmer region 12.70903 rural
1171 farmer parish 12.70903 rural
It seems as if this error (second an third records overlap) should have been
fixed by our function ger
. However, there may have been a record
removed, so we have to look at the original data file:
$id == left, ] olm[olm
id enter exit event birthdate m.id f.id sex civ
1168 794001225 65.162 68.650 FALSE 1794.838 NA NA male widow
1169 794001225 65.162 85.162 FALSE 1794.838 NA NA male widow
1170 794001225 68.650 78.652 FALSE 1794.838 NA NA male widow
1171 794001225 78.652 85.162 FALSE 1794.838 NA NA male widow
ses.50 birthplace imr.birth region
1168 farmer parish 12.70903 rural
1169 farmer region 12.70903 rural
1170 farmer parish 12.70903 rural
1171 farmer parish 12.70903 rural
Yes, after correction, the third record was removed. The matter can now be fixed
through removing the last record in om[om\$id == left, ]
:
<- which(om$id == left)
who who
[1] 1166 1167 1168
<- om[-who[length(who)], ]
om $id == left, ] om[om
id enter exit event birthdate m.id f.id sex civ
1168 794001225 65.162 68.650 FALSE 1794.838 NA NA male widow
1169 794001225 68.650 85.162 FALSE 1794.838 NA NA male widow
ses.50 birthplace imr.birth region
1168 farmer parish 12.70903 rural
1169 farmer region 12.70903 rural
And we are done.
C.8 Reading Data into R
The first thing you have to do in a statistical analysis (in R) is to get
data into the computing environment. Data may come from different kind of
sources, and R has the capacity to read from many kinds of sources, like
data files from other statistical programs, spreadsheet type (think
Excel) data. This is normally done with the aid of the recommended package
foreign
(R Core Team 2020), but the newer package haven
(Wickham and Miller 2020) is also useful.
If anything else fails, it is almost always possible to get data written in
ASCII format (e.g., .csv files). These can be read into R with the
function read.table
.
C.8.1 Reading data from ASCII files
Ordinary text (ASCII) files in tabular form can be read into R with the
function read.table
, and the result is a data frame. For instance, with a file looking this:
enter exit event
1 5 1
2 7 1
1 5 1
0 6 0
we read it into R with
<- read.table("my.dat", header = TRUE) mydata
Note the argument header
. It can take two values, TRUE
or
FALSE
, with FALSE
being the default value. If header = TRUE
, then the first row of the data file is interpreted as
variable names.
There are a few optional arguments to read.table
that are important
to consider. The first is dec
, which gives the character used as a
decimal point. The default is “.” but in some locales “,” (e.g.,
Swedish) is commonly used in output from other software than R. The
second is sep
, which gives the field separator
character. The default value is "“, which means”white space" (tabs,
spaces, newlines, etc.). See the help page for all the possibilities,
including the functions read.csv
, read.csv2
, etc., which are
variations of read.table
with other default values.
The result should always be checked after reading. Do that by printing a few rows
head(mydata)
enter exit event
1 1 5 1
2 2 7 1
3 1 5 1
4 0 6 0
and/or by summarizing the columns
summary(mydata)
enter exit event
Min. :0.00 Min. :5.00 Min. :0.00
1st Qu.:0.75 1st Qu.:5.00 1st Qu.:0.75
Median :1.00 Median :5.50 Median :1.00
Mean :1.00 Mean :5.75 Mean :0.75
3rd Qu.:1.25 3rd Qu.:6.25 3rd Qu.:1.00
Max. :2.00 Max. :7.00 Max. :1.00
The str
function is also helpful:
str(mydata)
'data.frame': 4 obs. of 3 variables:
$ enter: int 1 2 1 0
$ exit : int 5 7 5 6
$ event: int 1 1 1 0
For each variable, its name and type is given, together with the
first few values. In this example it is all the values, and we can also see
that R interprets all the values as integer
. This is fine,
even though at least enter
and exit
probably are supposed to be
real-valued. In fact, R is not so strict with numerical types (as
opposed to C and Fortran). Frequent
coercing takes place, and the
mode numeric
contains
the types integer
and double
.
Now this tends to be confusing, but the bottom line is that you, as a user, need not worry at all. Except in one place: When you are writing an R function that calls compiled C or Fortran code, but that is far beyond the scope of this book. Interested readers are recommended to read Venables (2000).
C.8.2 Reading foreign data files
Data file from Minitab, SAS, SPSS, Stata, etc.,
can be read with a suitable function from the foreign
package. Consult its help pages for more information. Another option is to
use one of the tidyverse
(Wickham et al. 2019) packages haven
, readxl
, and
DBI
. For more information, see relevant help pages.