Conversion semantics

There are some differences between the way that R, SAS, SPSS, and Stata represent labelled data and missing values. While SAS, SPSS, and Stata share some obvious similarities, R is a little different. This vignette explores the differences, and shows you how haven bridges the gap.

Value labels

Base R has one data type that effectively maintains a mapping between integers and character labels: the factor. This, however, is not the primary use of factors: they are instead designed to automatically generate useful contrasts for linear models. Factors differ from the labelled values provided by the other tools in important ways:

Value labels in SAS are a little different again. In SAS, labels are just a special case of general formats. Formats include currencies and dates, but user-defined formats just assign labels to individual values (including special missing values). Formats have names and exist independently of the variables they are associated with. You create a named format with PROC FORMAT and then associate it with variables in a DATA step (the names of character formats always start with $).

labelled()

To allow you to import labelled vectors into R, haven provides the S3 labelled class, created with labelled(). This class allows you to associate arbitrary labels with numeric or character vectors:
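A minimal sketch of this, assuming haven is installed. The values and label names are made up for illustration.

```r
library(haven)

# A numeric vector where only some values carry labels.
x_num <- labelled(c(1, 5, 3, 2, 4), c(Good = 1, Bad = 5))

# A character vector with a label for each distinct value.
x_chr <- labelled(c("M", "F", "F", "F", "M"), c(Male = "M", Female = "F"))

print(x_num)
print(x_chr)
```

Note that the labels do not need to cover every value: 2, 3, and 4 are left unlabelled in x_num.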

The goal of haven is not to provide a labelled vector that you can use everywhere in your analysis. The goal is to provide an intermediate data structure that you can convert into a regular R data frame. You can do this by either converting to a factor or stripping the labels:
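Both routes can be sketched as follows, assuming haven is installed. The example vector is illustrative only.

```r
library(haven)

x <- labelled(c(1, 5, 3, 2, 4), c(Good = 1, Bad = 5))

# Route 1: convert to a factor, using labels as levels where they exist.
f <- as_factor(x)
print(f)

# Route 2: drop the labels and keep the underlying values.
z <- zap_labels(x)
print(z)
```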

See the documentation for as_factor() for more options to control exactly what the factor uses for levels.

Both as_factor() and zap_labels() have data frame methods if you want to apply the same strategy to every column in a data frame:
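A short sketch of the data frame methods, assuming haven is installed. The data frame df and its column names are hypothetical.

```r
library(haven)

df <- data.frame(
  rating = labelled(c(1, 5, 3), c(Good = 1, Bad = 5)),
  sex = labelled(c("M", "F", "F"), c(Male = "M", Female = "F")),
  z = 1:3
)

# Convert every labelled column to a factor; plain columns are left alone.
df_factor <- as_factor(df)

# Strip the labels from every labelled column.
df_plain <- zap_labels(df)

str(df_factor)
str(df_plain)
```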

Missing values

All three tools provide a global “system missing value”, which is displayed as a single period (.). This is roughly equivalent to R’s NA, although neither Stata nor SAS propagates missingness in numeric comparisons: SAS treats the missing value as the smallest possible number (i.e. -inf), and Stata treats it as the largest possible number (i.e. inf).
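For contrast, here is how R itself behaves. This sketch uses base R only and shows that a comparison involving NA yields NA rather than TRUE or FALSE.

```r
# In R, missingness propagates through numeric comparisons.
lt <- NA < 1
gt <- NA > 1

print(lt)
print(gt)

# Both results are NA, so R neither sorts the missing value below every
# number (as the text describes for SAS) nor above (as for Stata).
```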

Each tool also provides a mechanism for recording multiple types of missingness:

Stata and SAS only support tagged missing values for numeric columns. SPSS supports up to three distinct values for character columns. Generally, operations involving a user-missing type return a system missing value.

Haven models these missing values in two different ways:

Tagged missing values

To support Stata’s extended and SAS’s special missing values, haven implements a tagged NA. It does this by taking advantage of the internal structure of a floating point NA. That allows these values to behave identically to NA in regular R operations, while still preserving the value of the tag.

The R interface for creating tagged NAs is a little clunky because generally they’ll be created by haven for you. But you can create your own with tagged_na():
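A minimal sketch, assuming haven is installed. The tags "a" and "z" are arbitrary choices for illustration.

```r
library(haven)

# Five ordinary values, two tagged NAs, and one regular NA.
x <- c(1:5, tagged_na("a"), tagged_na("z"), NA)

print(x)
print(is.na(x))
```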

Note these tagged NAs behave identically to regular NAs, even when printing. To see their tags, use print_tagged_na():
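The difference between the two print routes can be sketched like this, assuming haven is installed. The vector is illustrative.

```r
library(haven)

x <- c(1:5, tagged_na("a"), tagged_na("z"), NA)

# Regular printing does not distinguish tagged NAs from plain NA.
print(x)

# print_tagged_na() shows the tag alongside each tagged NA.
print_tagged_na(x)

# Capture the tagged printout so it can be inspected programmatically.
tagged_output <- capture.output(print_tagged_na(x))
```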

To test if a value is a tagged NA, use is_tagged_na(), and to extract the value of the tag, use na_tag():
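A short sketch of both helpers, assuming haven is installed. The vector and tags are illustrative.

```r
library(haven)

x <- c(1:5, tagged_na("a"), tagged_na("z"), NA)

# TRUE only for tagged NAs; a regular NA is not tagged.
any_tag <- is_tagged_na(x)

# Restrict the test to one specific tag.
tag_a <- is_tagged_na(x, "a")

# Extract the tag; untagged elements give NA.
tags <- na_tag(x)

print(any_tag)
print(tag_a)
print(tags)
```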

My expectation is that tagged missings are most often used in conjunction with labels (described above), so labelled vectors print the tags for you, and as_factor() knows how to relabel:
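A sketch of that combination, assuming haven is installed. The label names "Refused" and "Not home" are made up for illustration.

```r
library(haven)

y <- labelled(
  c(1:5, tagged_na("a"), tagged_na("z"), NA),
  c("Refused" = tagged_na("a"), "Not home" = tagged_na("z"))
)

# Printing a labelled vector shows the tags.
print(y)

# as_factor() turns the labelled tagged NAs into factor levels.
f <- as_factor(y)
print(f)
```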

User defined missing values

SPSS’s user-defined values work differently to SAS and Stata. Each column can have either up to three distinct values that are considered as missing, or a range. Haven provides labelled_spss() as a subclass of labelled() to model these additional user-defined missings.
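Both flavours of user-defined missing can be sketched as follows, assuming haven is installed. The sentinel value 99 and the range are illustrative choices.

```r
library(haven)

# Discrete user-missing value: 99 is declared missing.
x1 <- labelled_spss(c(1:10, 99), c(Missing = 99), na_values = 99)

# Range of user-missing values: everything from 90 upwards is missing.
x2 <- labelled_spss(c(1:10, 99), c(Missing = 99), na_range = c(90, Inf))

print(x1)
print(x2)
```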

These objects are somewhat dangerous to work with in R because most R functions don’t know those values are missing:
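To illustrate the danger, this sketch (assuming haven is installed, with 99 as an illustrative sentinel) compares the mean of a labelled_spss() vector with the mean of its raw values.

```r
library(haven)

x1 <- labelled_spss(c(1:10, 99), c(Missing = 99), na_values = 99)

# mean() does not know that 99 is a user-defined missing value.
naive_mean <- mean(x1)

# The same calculation on the raw numbers, sentinel included.
raw_mean <- mean(c(1:10, 99))

print(naive_mean)
print(raw_mean)
```

The two results agree, which shows that the sentinel has been counted as real data.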

Because of that danger, the default behaviour of read_spss() is to return regular labelled objects where user-defined missing values have been converted to NAs. To get read_spss() to return labelled_spss() objects, you’ll need to set user_na = TRUE.

I’ve defined an is.na() method so you can find them yourself:
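A minimal sketch of that method in use, assuming haven is installed. The vectors are illustrative.

```r
library(haven)

x1 <- labelled_spss(c(1:10, 99), c(Missing = 99), na_values = 99)
x2 <- labelled_spss(c(1:10, 99), c(Missing = 99), na_range = c(90, Inf))

# is.na() flags user-defined missings, whether given as values or a range.
miss1 <- is.na(x1)
miss2 <- is.na(x2)

print(miss1)
print(miss2)
```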

And the presence of that method does mean many functions with an na.rm argument will work correctly:
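A sketch of such a call, assuming haven is installed. The result may depend on the installed haven version, so no output is shown here; it is worth checking on your own setup.

```r
library(haven)

x1 <- labelled_spss(c(1:10, 99), c(Missing = 99), na_values = 99)

# Ask mean() to drop missing values, relying on the is.na() method.
m <- mean(x1, na.rm = TRUE)
print(m)
```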

But generally you should either convert to a factor, convert to regular missing values, or strip all the labels:
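The three options can be sketched as follows, assuming haven is installed. The vector is illustrative.

```r
library(haven)

x1 <- labelled_spss(c(1:10, 99), c(Missing = 99), na_values = 99)

# Option 1: convert to a factor.
as_fct <- as_factor(x1)

# Option 2: turn user-defined missings into regular NAs, keeping labels.
as_na <- zap_missing(x1)

# Option 3: strip the labels, leaving a plain vector.
as_plain <- zap_labels(x1)

print(as_fct)
print(as_na)
print(as_plain)
```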