Replacing values with NA

Nicholas Tierney

2018-02-08

When you are dealing with missing values, you might want to replace values with missing values (NA). This is useful when you know the origin of the data and can be certain which values should be missing. For example, you might know that all values of “N/A”, “N A”, and “Not Available”, or -99, or -1 are supposed to be missing.

naniar provides the function replace_with_na to work on this type of problem. This function is the complement to tidyr::replace_na: tidyr::replace_na replaces an NA value with a specified value, whereas naniar::replace_with_na replaces a specified value with NA.
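As a minimal sketch of how the two functions mirror each other, the snippet below applies both to a small made-up tibble. The object `toy` and its column `score` are illustrative assumptions, not part of the vignette's data.

```r
# Hypothetical toy data: one true NA and one -99 missing-value code
toy <- tibble::tibble(score = c(10, NA, -99))

# tidyr::replace_na: a missing value becomes a specified value (NA -> 0)
filled <- tidyr::replace_na(toy, replace = list(score = 0))

# naniar::replace_with_na: a specified value becomes missing (-99 -> NA)
blanked <- naniar::replace_with_na(toy, replace = list(score = -99))

filled$score
blanked$score
```

Both functions take a named list through their replace argument, which keeps the two calls symmetrical.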

In this vignette, we describe some simple use cases for these functions and describe how they work.

Example data

First, we introduce a small fictional dataset, df, which contains some common features of a dataset with the sorts of missing values we might encounter. This includes multiple specifications of missing values, such as “N/A”, “N A”, and “Not Available”, as well as some common numeric codes, like -98, -99, and -1.


df <- tibble::tribble(
  ~name,           ~x,  ~y,              ~z,
  "N/A",           1,   "N/A",           -100,
  "N A",           3,   "NOt available", -99,
  "N / A",         NA,  "29",            -98,
  "Not Available", -99, "25",            -101,
  "John Smith",    -98, "28",            -1)

Using replace_with_na

What if we want to replace the value -99 in the x column with a missing value?

First, let’s load naniar.

library(naniar)

Now, we specify that we want to replace -99 with a missing value. To do so, we use the replace argument and pass a named list, where each name is a variable and each element holds the value (or values) in that variable to replace with NA.

df %>% replace_with_na(replace = list(x = -99))
#> # A tibble: 5 x 4
#>   name               x y                   z
#>   <chr>          <dbl> <chr>           <dbl>
#> 1 N/A             1.00 N/A           -100   
#> 2 N A             3.00 NOt available - 99.0 
#> 3 N / A          NA    29            - 98.0 
#> 4 Not Available  NA    25            -101   
#> 5 John Smith    -98.0  28            -  1.00

And say we want to replace -98 as well?

df %>%
  replace_with_na(replace = list(x = c(-99, -98)))
#> # A tibble: 5 x 4
#>   name              x y                   z
#>   <chr>         <dbl> <chr>           <dbl>
#> 1 N/A            1.00 N/A           -100   
#> 2 N A            3.00 NOt available - 99.0 
#> 3 N / A         NA    29            - 98.0 
#> 4 Not Available NA    25            -101   
#> 5 John Smith    NA    28            -  1.00

And then what if we want to replace -99 and -98 in all the numeric columns, x and z?

df %>%
  replace_with_na(replace = list(x = c(-99,-98),
                             z = c(-99, -98)))
#> # A tibble: 5 x 4
#>   name              x y                   z
#>   <chr>         <dbl> <chr>           <dbl>
#> 1 N/A            1.00 N/A           -100   
#> 2 N A            3.00 NOt available   NA   
#> 3 N / A         NA    29              NA   
#> 4 Not Available NA    25            -101   
#> 5 John Smith    NA    28            -  1.00
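To make the role of the named list concrete, here is a rough base-R sketch of the same idea: loop over the names in the list and blank out matching values in each named column. It is an illustration under stated assumptions, not naniar's actual implementation, and the objects `dat` and `rules` are hypothetical.

```r
# Hypothetical toy data with -99 and -98 used as missing-value codes
dat <- data.frame(x = c(1, -99, -98), z = c(-99, 5, -1))

# Same shape as the `replace` argument: variable name -> values to blank out
rules <- list(x = c(-99, -98), z = -99)

for (var in names(rules)) {
  # %in% returns FALSE (never NA) for existing NAs, so they are left as they are
  dat[[var]][dat[[var]] %in% rules[[var]]] <- NA
}

dat$x  # 1 NA NA
dat$z  # NA 5 -1
```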

Using replace_with_na works well when we know the exact value to be replaced and the variables to replace it in, provided there are not many variables. But what do you do when you have many variables you want to clean?

Extending replace_with_na

Sometimes you have many occurrences of the same value that you want to replace, for example -99 and -98 above, or the variants of “NA”, such as “N/A”, “N / A”, and “Not Available”. You might also want these rules to affect only certain variables, or you might have more complex rules, like “only apply this rule to variables that are numeric, or character”.

To account for these cases we have borrowed from dplyr’s scoped variants and created the functions replace_with_na_all, replace_with_na_at, and replace_with_na_if.

Below we consider some very simple examples of these functions, so that you can better understand how to use them.

Using replace_with_na_all

Use replace_with_na_all when you want to replace ALL values that meet a condition across an entire dataset. The syntax here is a little different, and follows the rules for rlang’s expression of simple functions: the function starts with ~, and you refer to the variable as .x.

For example, if we want to replace all cases of -99 in our dataset, we write:


df %>% replace_with_na_all(condition = ~.x == -99)
#> # A tibble: 5 x 4
#>   name               x y                   z
#>   <chr>          <dbl> <chr>           <dbl>
#> 1 N/A             1.00 N/A           -100   
#> 2 N A             3.00 NOt available   NA   
#> 3 N / A          NA    29            - 98.0 
#> 4 Not Available  NA    25            -101   
#> 5 John Smith    -98.0  28            -  1.00
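To see what the formula shorthand stands for, the sketch below converts a condition like the one above into an ordinary function with rlang::as_function and calls it on a plain vector. The name `is_minus_99` is an illustrative assumption, and whether naniar performs exactly this conversion internally is not confirmed here.

```r
# Turn the one-sided formula into a regular function of one argument (.x)
is_minus_99 <- rlang::as_function(~ .x == -99)

# The function is applied to a whole column at once and returns a logical vector
is_minus_99(c(1, -99, 3))  # FALSE TRUE FALSE
```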

Likewise, if you have a set of (annoying) repeating strings like various spellings of “NA”, then I suggest you first lay out all the offending cases:


# write out all the offending strings
na_strings <- c("NA", "N A", "N / A", "N/A", "N/ A", "Not Available", "NOt available")

Then you write ~.x %in% na_strings, which reads as “does this value occur in the vector of NA strings?”.


df %>%
  replace_with_na_all(condition = ~.x %in% na_strings)
#> # A tibble: 5 x 4
#>   name            x y           z
#>   <chr>       <dbl> <chr>   <dbl>
#> 1 <NA>         1.00 <NA>  -100   
#> 2 <NA>         3.00 <NA>  - 99.0 
#> 3 <NA>        NA    29    - 98.0 
#> 4 <NA>       -99.0  25    -101   
#> 5 John Smith -98.0  28    -  1.00
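It can help to evaluate the condition on a single column to see the logical vector it produces. The sketch below does this in base R. The vector `name_col` is a hypothetical stand-in for one column of the data. Note that %in% matches exactly and is case sensitive, which is why each spelling has to be listed.

```r
na_strings <- c("NA", "N A", "N / A", "N/A", "N/ A", "Not Available", "NOt available")

# Hypothetical column values to check against the NA strings
name_col <- c("N/A", "N A", "John Smith", "not available")

# TRUE marks the values that would be replaced with NA
matches <- name_col %in% na_strings
matches  # TRUE TRUE FALSE FALSE ("not available" is not among the listed spellings)
```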

Using replace_with_na_at

This is similar to replace_with_na_all, but here you can specify which variables the rule applies to. This is useful when you want a rule to affect only a selected set of variables.


df %>% 
  replace_with_na_at(.vars = c("x","z"),
                     condition = ~.x == -99)
#> # A tibble: 5 x 4
#>   name               x y                   z
#>   <chr>          <dbl> <chr>           <dbl>
#> 1 N/A             1.00 N/A           -100   
#> 2 N A             3.00 NOt available   NA   
#> 3 N / A          NA    29            - 98.0 
#> 4 Not Available  NA    25            -101   
#> 5 John Smith    -98.0  28            -  1.00

Although you can achieve this with regular replace_with_na, it is more concise to use replace_with_na_at. Additionally, you can specify the rule as a function. For example, make a value NA if the exponential of that number is less than 1:


df %>% 
  replace_with_na_at(.vars = c("x","z"),
                     condition = ~ exp(.x) < 1)
#> # A tibble: 5 x 4
#>   name              x y             z    
#>   <chr>         <dbl> <chr>         <lgl>
#> 1 N/A            1.00 N/A           NA   
#> 2 N A            3.00 NOt available NA   
#> 3 N / A         NA    29            NA   
#> 4 Not Available NA    25            NA   
#> 5 John Smith    NA    28            NA
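Since exp(.x) < 1 holds exactly when .x is negative, the same rule can be written more directly. The sketch below shows this on a small made-up tibble. The objects `toy` and `toy_clean` are illustrative assumptions, not part of the vignette's data.

```r
# Hypothetical toy data with negative codes in both columns
toy <- tibble::tibble(x = c(1, -99, 3), z = c(-1, 2, -98))

# Replace every negative value in x and z with NA
toy_clean <- naniar::replace_with_na_at(toy,
                                        .vars = c("x", "z"),
                                        condition = ~ .x < 0)
toy_clean
```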

Using replace_with_na_if

There may be cases where you can identify the variables by some test (is.character: are they character variables? is.numeric: are they numeric?) and want to replace a given value within variables of that type. For example,


df %>%
  replace_with_na_if(.predicate = is.character,
                     condition = ~.x %in% ("N/A"))
#> # A tibble: 5 x 4
#>   name               x y                   z
#>   <chr>          <dbl> <chr>           <dbl>
#> 1 <NA>            1.00 <NA>          -100   
#> 2 N A             3.00 NOt available - 99.0 
#> 3 N / A          NA    29            - 98.0 
#> 4 Not Available -99.0  25            -101   
#> 5 John Smith    -98.0  28            -  1.00

# or
df %>%
  replace_with_na_if(.predicate = is.character,
                     condition = ~.x %in% (na_strings))
#> # A tibble: 5 x 4
#>   name            x y           z
#>   <chr>       <dbl> <chr>   <dbl>
#> 1 <NA>         1.00 <NA>  -100   
#> 2 <NA>         3.00 <NA>  - 99.0 
#> 3 <NA>        NA    29    - 98.0 
#> 4 <NA>       -99.0  25    -101   
#> 5 John Smith -98.0  28    -  1.00

This means that you are able to apply a rule to many variables that meet a pre-specified condition. This can be of particular use if you have many variables and don’t want to list them all, and also if you know that there is a particular problem for variables of a particular class.
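The examples above use is.character. As a complementary sketch, the same pattern with is.numeric limits the rule to numeric columns, so a character column that happens to contain the text "-99" is left alone. The objects `toy` and `toy_clean` are hypothetical.

```r
# Hypothetical toy data: "-99" appears as text in name and as a number in x
toy <- tibble::tibble(name = c("a", "-99"), x = c(-99, 1))

# Only columns for which is.numeric() is TRUE are checked against the condition
toy_clean <- naniar::replace_with_na_if(toy,
                                        .predicate = is.numeric,
                                        condition = ~ .x == -99)
toy_clean
```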