Speed improvements. Thanks to the help, contributions, and discussion with Romain François and Jim Hester, naniar now has greatly improved speed for calculating the missingness in each row. These speedups should continue to improve in future releases.
Scoped variants of replace_with_na, with thanks to Colin Fay for his work on this:
replace_with_na_all replaces, with NA, all values across the dataframe that meet a specified condition (using the syntax ~.x == -99)
replace_with_na_at replaces, with NA, values in the specified variables that meet the condition
replace_with_na_if replaces, with NA, values in those variables that satisfy some predicate function (e.g., is.character)
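As a rough illustration of what the three scoped variants do, the following base R sketch mimics each behaviour on a small made-up data frame. It is not naniar's implementation: the helper `na_where`, the condition function `is_code`, and the data are hypothetical, and the condition is written as an ordinary function rather than the `~.x == -99` formula syntax.

```r
# Hypothetical example data: -99 is used as a missing-value code.
df <- data.frame(
  x = c(1, -99, 3),
  y = c(-99, 5, 6),
  z = c("a", "-99", "c"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: set values of a vector to NA where `cond` holds.
na_where <- function(v, cond) {
  hit <- cond(v)
  v[!is.na(hit) & hit] <- NA
  v
}
is_code <- function(v) v == -99

# In the spirit of replace_with_na_all: apply the condition to every column.
all_cols <- df
all_cols[] <- lapply(df, na_where, cond = is_code)

# In the spirit of replace_with_na_at: apply it only to the named columns.
at_cols <- df
at_cols["x"] <- lapply(df["x"], na_where, cond = is_code)

# In the spirit of replace_with_na_if: apply it only to columns that
# satisfy a predicate (here, is.character).
if_cols <- df
chr <- vapply(df, is.character, logical(1))
if_cols[chr] <- lapply(df[chr], na_where, cond = is_code)
```

The three results differ only in which columns are touched, which is the distinction the `_all`, `_at`, and `_if` suffixes are meant to convey.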
which_na - replacement for
miss_scan_count. This makes it easier for users to search for particular occurrences of these values across their variables. #119
n_miss_row calculates the number of missing values in each row, returning a vector. There are also 3 other functions which are similar in spirit:
n_complete_row, prop_miss_row, and prop_complete_row, which return a vector of the number of complete observations, the proportion of missings in a row, and the proportion of complete observations in a row, respectively
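The four row-wise summaries described above can be approximated in base R as follows. This is only a sketch of what the quantities mean, not how naniar computes them; the data and the variable names are made up for illustration.

```r
# Hypothetical example data with some missing values.
df <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6), c = c(7, 8, 9))

miss <- is.na(df)                        # logical matrix, TRUE where missing

n_miss_by_row        <- rowSums(miss)    # number of missings per row
n_complete_by_row    <- rowSums(!miss)   # number of complete values per row
prop_miss_by_row     <- rowMeans(miss)   # proportion of missings per row
prop_complete_by_row <- rowMeans(!miss)  # proportion complete per row
```

Each result is a vector with one element per row of the data frame.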
add_miss_cluster is a new function that calculates a cluster of missingness for each row, using hclust. This can be useful in exploratory modelling of missingness, similar to Tierney et al. 2015 and Barnett et al. 2017.
where_na - a function that returns the positions of NA values. For a dataframe it returns a matrix of row and col positions of NAs, and for a vector it returns a vector of positions of NAs. (#105)
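The two return shapes described for where_na (a vector of positions for a vector, a row/col matrix for a dataframe) correspond to what base R's `which()` gives with and without `arr.ind = TRUE`. The sketch below uses only base R on made-up data and does not claim to be how where_na is written.

```r
# Vector input: positions of the NA values.
v <- c(10, NA, 30, NA)
v_pos <- which(is.na(v))

# Data frame input: a matrix with one row per NA, giving its
# row and column position.
df <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))
df_pos <- which(is.na(df), arr.ind = TRUE)
```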
only_miss argument. When set to FALSE (the default) it will bind a dataframe with all of the variables duplicated with their shadow. Setting this to TRUE will bind only those variables that contain missing values.
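To make the difference between the two settings concrete, here is a base R sketch of binding a shadow to a small made-up data frame. The helper `make_shadow`, the "NA" / "!NA" labels, and the "_NA" column suffix are assumptions chosen for illustration; the sketch is not naniar's code.

```r
# Hypothetical example data: only `a` contains a missing value.
df <- data.frame(a = c(1, NA, 3), b = c(4, 5, 6))

# Hypothetical helper: one shadow column per variable, marking each value
# as missing ("NA") or not missing ("!NA"). Labels and suffix are assumed.
make_shadow <- function(data) {
  shadow <- lapply(data, function(v) {
    factor(ifelse(is.na(v), "NA", "!NA"), levels = c("!NA", "NA"))
  })
  shadow <- as.data.frame(shadow)
  names(shadow) <- paste0(names(data), "_NA")
  shadow
}

# Analogue of only_miss = FALSE: every variable gets a shadow column.
bound_all <- cbind(df, make_shadow(df))

# Analogue of only_miss = TRUE: shadow columns only for variables that
# contain at least one missing value.
has_na <- vapply(df, anyNA, logical(1))
bound_miss <- cbind(df, make_shadow(df[has_na]))
```

With this data, the first result has a shadow column for both variables, while the second adds one only for `a`.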
gg_miss_case to be clearer and less cluttered (#117), also added an order_cases option to order by cases.
gg_miss_span. This makes it easier for users to visualise these plots across the values of another variable. In the future I will consider adding facet to the other shorthand plotting functions, but at the moment these seemed to be the ones that would benefit the most from this feature.
oceanbuoys is now numeric type for year, latitude, and longitude; previously these were factors. See related issue
shadow_shift when there are Inf or -Inf values (see #117)
replace_to_na is deprecated in favour of replace_with_na, as it is a more natural phrase (“replace coffee to tea” vs “replace coffee with tea”). replace_to_na will be made defunct in the next version.
cast_shadow no longer works when called as cast_shadow(data). This action used to return all variables, and then shadow variables only for the variables that contained missing values. This was inconsistent with the use of cast_shadow(data, var1, var2). A new option has been added to bind_shadow that controls this - discussed below. See more details at issue 65.
Change behaviour of cast_shadow so that the default option is to return only the variables that contain missings. This is different to bind_shadow, which binds a complete shadow matrix to the dataframe. A way to think about this is that the shadow is only cast on variables that contain missing values, whereas a bind is binding a complete shadow to the data. This may change in the future to be the default option for
naniar onto CRAN, updates to naniar will happen reasonably regularly after this, approximately every 1-2 months
group_by is now respected by the following functions:
label_miss to be more consistent with the rest of naniar
miss_df_pct - this was literally the same as
show_pct argument to show the percentage of missing values (Thanks Jennifer for the helpful feedback! :))
miss_case_summary now have consistent output (one was ordered by n_missing, not the other).
x (as advised by Hadley)
replace_to_na is a complement to tidyr::replace_na and replaces a specified value from a variable to NA.
gg_miss_fct returns a heatmap of the number of missings per variable for each level of a factor. This feature was very kindly contributed by Colin Fay.
gg_miss_ functions now return a ggplot object, which behaves as such.
gg_miss_ basic themes can be overridden with ggplot functions. This fix was very kindly contributed by Colin Fay.
add_* functions handle bare unquoted names where appropriate as per #61
geom_miss_point(), to keep consistent with the rest of the functions in
tao as per #59
ts generic functions are now
gg_miss_span and work on
data.frame’s, as opposed to just
add_shadow_shift() adds a column of shadow_shifted values to the current dataframe, adding “_shift” as a suffix
cast_shadow() - acts like bind_shadow() but allows for specifying which columns to add
shadow_shift now has a method for factors - powered by
gg_missing_* is changed to gg_miss_* to fit with other syntax
shadow_cat, as they are no longer needed, and have been superseded by
pedestrian - contains hourly counts of pedestrians
miss_ts_run(): return the number of missings / complete in a single run
miss_ts_summary(): return the number of missings in a given time period
gg_miss_ts(): plot the number of missings in a given time period
narnia - I had to explain the spelling a few times when I was introducing the package and I realised that I should change the name. Fortunately it isn’t on CRAN yet.
prop_miss and the complement
n_miss returns the number of missing values, prop_miss returns the proportion of missing values. Likewise, prop_complete returns the proportion of complete values.
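The count and proportion summaries described above relate to each other in a simple way, which the following base R sketch spells out. The example vector and the variable names are made up, and the code illustrates the definitions rather than naniar's own implementation.

```r
x <- c(1, NA, 3, NA, 5)            # hypothetical example vector

n_missing     <- sum(is.na(x))     # number of missing values
n_complete    <- sum(!is.na(x))    # number of complete values
prop_missing  <- mean(is.na(x))    # proportion of missing values
prop_complete <- mean(!is.na(x))   # proportion of complete values
```

The two counts add up to the length of the vector, and the two proportions add up to 1.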
The left hand side functions have been made defunct in favour of the right hand side.
miss_* = I want to explore missing values
miss_case_* = I want to explore missing cases
miss_case_pct = I want to find the percentage of cases containing a missing value
miss_case_summary = I want to find the number / percentage of missings in each case
miss_case_table = I want a tabulation of the number / percentage of cases missing
This is more consistent and easier to reason with.
Thus, I have renamed the following functions:
These will be made defunct in the next release, 0.0.6.9000 (“The Wood Between Worlds”).
n_complete is a complement to n_miss, and counts the number of complete values in a vector, matrix, or dataframe.
shadow_shift now handles cases where there is only 1 complete value in a vector.
After a burst of effort on this package I have done some refactoring and thought hard about where this package is going to go. This meant that I had to make the decision to rename the package from ggmissing to naniar. The name may strike you as strange but it reflects the fact that there are many changes happening, and that we will be working on creating a nice utopia (like Narnia by CS Lewis) that helps us make it easier to work with missing data
add_n_miss and add_prop_miss are helpers that add columns to a dataframe containing the number and proportion of missing values. An example has been provided to use decision trees to explore missing data structure as in Tierney et al
geom_miss_point() now supports transparency, thanks to @seasmith (Luke Smith)
more shadows. These are mainly around
gather_shadow, which are helper functions to assist with creating
geom_missing_point() broke after the new release of ggplot2 2.2.0, but this is now fixed by ensuring that it inherits from GeomPoint, rather than just a new Geom. Thanks to Mitchell O’Hara-Wild for his help with this.
missing data summaries
table_missing_case also now return more sensible numbers and variable names. It is possible these function names will change in the future, as these are kind of verbose.
semantic versioning was incorrectly entered in the DESCRIPTION file as 0.2.9000, so I changed it to 0.0.2.9000, and then to 0.0.3.9000 to indicate the new changes. Hopefully this won’t come back to bite me later. I think I accidentally did this with visdat at some point as well. Live and learn.
gathered related functions into single R files rather than leaving them in their own.
correctly imported the %>% operator from magrittr, and removed a lot of chaff around @importFrom - really don’t need to use @importFrom that often.
geom_missing_point() now works in a way that we expect! Thanks to Miles McBain for working out how to get this to work.
percent_missing_df returns the percentage of missing data for a data.frame
percent_missing_var returns the percentage of variables that contain missing values
percent_missing_case returns the percentage of cases that contain missing values.
table_missing_var returns a table of missing information for variables
table_missing_case returns a table of missing information for cases
summary_missing_var returns a summary of missing information for variables (counts, percentages)
summary_missing_case returns a summary of missing information for cases (counts, percentages)