3.5 Explanatory Variables

Explanatory variables, or covariates, are of essentially two different types: continuous and discrete. A discrete variable usually takes only a finite number of distinct values and is often called a factor, e.g., in R. A special case is a factor that takes only two distinct values, say 0 and 1. Such a factor is called an indicator, because we can let the value 1 indicate the presence of a certain property and 0 its absence. To summarize, there are

Covariate: taking values in an interval (e.g., age, blood pressure).
Factor: taking a finite number of values (e.g., civil status, occupation).
Indicator: a factor taking two values (e.g., gender).

3.5.1 Continuous Covariates

We use the qualifier continuous to stress that factors are excluded, because often the term covariate is used as a synonym for explanatory variable.

Values taken by a continuous covariate are ordered. By model definition, the effect on the response is ordered in the same or in the reverse order, that is, it is monotone in the covariate. Values taken by a factor, on the other hand, are unordered (although a factor may be declared as ordered in R).
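The remark about ordered factors can be illustrated with a minimal sketch. The variable educ and its three levels are hypothetical and serve only as an illustration.

```r
# Hypothetical data: three education levels for five individuals.
educ <- c("low", "high", "middle", "low", "high")

# A plain factor: the levels are stored in a fixed sequence but carry no order.
f <- factor(educ, levels = c("low", "middle", "high"))
is.ordered(f)    # FALSE

# An ordered factor: comparisons between levels become meaningful.
o <- factor(educ, levels = c("low", "middle", "high"), ordered = TRUE)
is.ordered(o)    # TRUE
o < "high"       # TRUE FALSE TRUE TRUE FALSE
```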

3.5.2 Factor Covariates

An explanatory variable that can take only a finite (usually small) number of distinct values is called a categorical variable. In the R language, it is called a factor. Examples of such variables are gender, socio-economic status, and birthplace. Students of statistics have long been taught to create dummy variables in such situations, in the following way:

Given a categorical variable \(F\) with \(k+1\) levels \((f_0, f_1, f_2, \ldots, f_k)\),
Create \(k\) indicator ("dummy") variables \((I_1, I_2, \ldots, I_k)\).

The level \(f_0\) is the reference category: all indicator variables are zero for an individual with this value. Generally, for the level \(f_i, \; i = 1,\ldots, k\), the indicator variable \(I_i\) is one and the rest are zero. In other words, for a single individual, at most one indicator is one, and the rest are zero.

In R, there is no need to create dummy variables explicitly. Once a variable has been declared a factor with the function factor or as.factor, the model-fitting functions create the indicators behind the scenes.
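To see which indicator variables R constructs for a factor, one can inspect the design matrix with model.matrix. The following is a minimal sketch; the variable civst and its values are hypothetical, and R's default treatment contrasts are assumed to be in effect.

```r
# Hypothetical factor with k + 1 = 3 levels; "single" is the reference level f_0.
civst <- factor(c("single", "married", "widowed", "married"),
                levels = c("single", "married", "widowed"))

# model.matrix() builds the design matrix that a model-fitting function would use.
# With treatment contrasts, each non-reference level gets one indicator column.
X <- model.matrix(~ civst)
colnames(X)    # "(Intercept)" "civstmarried" "civstwidowed"
X[, -1]        # rows: (0, 0), (1, 0), (0, 1), (1, 0)
```

The first individual belongs to the reference category, so both indicators are zero; for every other individual exactly one indicator is one.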

Note that a factor with two levels, i.e., an indicator variable, can always be treated as a continuous covariate, if coded numerically (e.g., 0 and 1).
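A small sketch of this equivalence follows. The data are made up, and an ordinary linear model (lm) is used purely for illustration; it is an assumption that the same equivalence is wanted for other model-fitting functions.

```r
# Hypothetical data: a response y and a 0/1 indicator x for six individuals.
y <- c(1.2, 0.8, 1.1, 2.3, 1.9, 2.2)
x <- c(0, 0, 0, 1, 1, 1)

fit.num <- lm(y ~ x)            # indicator treated as a numeric covariate
fit.fac <- lm(y ~ factor(x))    # indicator treated as a factor

# The slope for x and the coefficient for level "1" are the same number.
b.num <- unname(coef(fit.num)["x"])
b.fac <- unname(coef(fit.fac)["factor(x)1"])
all.equal(b.num, b.fac)         # TRUE
```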

Example 3.1 (Infant mortality and age of mother)

Consider a demographic example: the influence of mother’s age (a continuous covariate) on infant mortality. It is considered well known that a young mother means high risk for the infant, and also that an old mother means high risk, compared to “in-between-aged” mothers. So the risk order is not the same as (or the reverse of) the age order.

One solution (not necessarily the best) to this problem is to factorize: Let, for instance,

\[\begin{equation*} \mbox{mother's age} = \left\{\begin{array}{ll} \mbox{low}, & 15 < \mbox{age} \le 25 \\ \mbox{middle}, & 25 < \mbox{age} \le 35 \\ \mbox{high}, & 35 < \mbox{age} \end{array} \right. \end{equation*}\]

In this layout, there will be two parameters measuring the deviation from the reference category, which will be the first category by default.
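If another reference category is preferred, say the middle group, the levels can be reordered with relevel. A minimal sketch, using a small hypothetical factor with the three levels above:

```r
# Hypothetical factor with the three age groups; "low" comes first by construction.
age.group <- factor(c("low", "middle", "high", "middle", "low"),
                    levels = c("low", "middle", "high"))
levels(age.group)     # "low" "middle" "high"; "low" is the reference

# relevel() moves the chosen level first, making it the reference category.
age.group2 <- relevel(age.group, ref = "middle")
levels(age.group2)    # "middle" "low" "high"
```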

In R, this is easily achieved with the aid of the cut function. It works like this:

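age <- rnorm(100, 30, 6)                  # 100 simulated ages, mean 30, sd 6
age.group <- cut(age, c(15, 25, 35, 48))  # break points define three intervals
summary(age.group)                        # counts per interval (no seed set, so they vary between runs)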

## (15,25] (25,35] (35,48]    NA's 
##      17      63      18       2

Note that the created intervals are by default closed to the right and open to the left. This has consequences for how observations exactly on a boundary are treated: they belong to the lower-valued interval. The argument right in the call to cut can be used to switch this behaviour the other way around.
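The boundary behaviour can be checked with ages placed exactly on the break points. This is a minimal sketch with two hypothetical observations:

```r
# Ages chosen to sit exactly on interval boundaries.
age <- c(25, 35)

# Default (right = TRUE): intervals are (a, b], so 25 falls in (15,25].
g.right <- as.character(cut(age, c(15, 25, 35, 48)))
g.right    # "(15,25]" "(25,35]"

# right = FALSE: intervals are [a, b), so 25 falls in [25,35).
g.left <- as.character(cut(age, c(15, 25, 35, 48), right = FALSE))
g.left     # "[25,35)" "[35,48)"
```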

Note further that values at or below the smallest break point (15 in our example) or above the largest break point (48) are reported as missing values (NA in R terminology, Not Available).
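One possible way to avoid such missing values, sketched here with hypothetical ages, is to use infinite outer break points. Whether open-ended outer groups are appropriate depends on the application; note that this changes the definition of the lowest group, which then has no lower bound.

```r
# Hypothetical ages, two of which lie outside the range 15 to 48.
age <- c(14, 20, 30, 40, 52)

# With finite outer break points, 14 and 52 fall outside and become NA.
g1 <- cut(age, c(15, 25, 35, 48))
sum(is.na(g1))    # 2

# Infinite outer break points keep every observation in one of three groups.
g2 <- cut(age, c(-Inf, 25, 35, Inf))
sum(is.na(g2))    # 0
table(g2)         # counts 2, 1, 2
```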

For further information about the use of the cut function, see the help page. \(\Box\)