tidyvalidate makes it easier to validate data.frames before starting any downstream process.
Problem
Standard R validation tools work well with vectors, but they are less useful with data.frames, where validations often need to be applied at row level.
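To make the point concrete, here is a minimal base R sketch using the built-in mtcars data. It is only an illustration of the limitation described above: the column-level check is a one-liner, while the row-level check needs manual subsetting to see which rows failed.

```r
# Column-level check: a single TRUE/FALSE answers the question
stopifnot(is.numeric(mtcars$hp))

# Row-level check: all() only says that something failed, not where
all_above_15 <- all(mtcars$mpg > 15)
all_above_15

# Finding the offending rows requires subsetting by hand
failing <- mtcars[!(mtcars$mpg > 15), "mpg", drop = FALSE]
failing
```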
On the other hand, the validate package provides excellent tools to validate data.frames, but they are divided into several functions to ensure flexibility.
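As a rough illustration of that multi-function workflow, the sketch below checks one row-level rule with the validate package directly. It assumes validate is installed, and the rule name mpg_low is only an illustrative choice.

```r
library(validate)

# 1. Define the rules
rules <- validator(mpg_low = mpg > 15)

# 2. Confront the data with the rules
checked <- confront(mtcars, rules)

# 3. Summarise passes and fails per rule
summary(checked)

# 4. Extract the rows that break the rule
failing_rows <- violating(mtcars, rules)
failing_rows[, "mpg", drop = FALSE]
```

Each step is flexible on its own, but a typical QC check needs all four of them.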
Solution
tidyvalidate aims to simplify the tools provided by the validate package to make it easier to identify errors and share the results in any useful way, like a QC report.
Example - Finding errors
The validate_rules() function finds errors at column and row level and returns a list with two elements: summary and row_level_errors.
library(tidyvalidate)
simple_validation <-
  validate_rules(mtcars,
                 mpg_string = is.character(mpg),
                 hp_numeric = is.numeric(hp),
                 mpg_low = mpg > 15)
- summary is a data.table whose fails column shows the mistakes found: one at column level (mpg_string) and six at row level (mpg_low).
simple_validation$summary
#> name items passes fails nNA error warning
#> <char> <int> <int> <int> <int> <lgcl> <lgcl>
#> 1: mpg_string 1 0 1 0 FALSE FALSE
#> 2: hp_numeric 1 1 0 0 FALSE FALSE
#> 3: mpg_low 32 26 6 0 FALSE FALSE
- row_level_errors is a list of data.tables with one element for each broken row-level rule. In this example, it shows only mpg_low, as it is the only row-level rule with failed rows.
simple_validation$row_level_errors
#> $mpg_low
#> Broken Rule mpg
#> <char> <num>
#> 1: mpg_low 14.3
#> 2: mpg_low 10.4
#> 3: mpg_low 10.4
#> 4: mpg_low 14.7
#> 5: mpg_low 13.3
#> 6: mpg_low 15.0
Adding identifiers at row level
When validating elements at row level, it is useful to add columns that are not related to the test itself but are useful to identify the individual elements with errors.
# Creating a unique identifier for each row
mtcars_names <- mtcars
mtcars_names$`Car (Name)` <- rownames(mtcars_names)
rownames(mtcars_names) <- NULL
# Results of validating at row level
validate_rules(mtcars_names,
               mpg_string = is.character(mpg),
               hp_numeric = is.numeric(hp),
               mpg_low = mpg > 15,
               keep_rl_cols = "Car (Name)")$row_level_errors
#> $mpg_low
#> Broken Rule Car (Name) mpg
#> <char> <char> <num>
#> 1: mpg_low Duster 360 14.3
#> 2: mpg_low Cadillac Fleetwood 10.4
#> 3: mpg_low Lincoln Continental 10.4
#> 4: mpg_low Chrysler Imperial 14.7
#> 5: mpg_low Camaro Z28 13.3
#> 6: mpg_low Maserati Bora 15.0
Validating based on environment variables
Sometimes we need to create dynamic validations based on variables from the global environment, as shown below with min_mpg.
min_mpg <- 12
validate_rules(mtcars_names,
               mpg_string = is.character(mpg),
               hp_numeric = is.numeric(hp),
               mpg_low = mpg > min_mpg,
               keep_rl_cols = "Car (Name)")
#> $summary
#> name items passes fails nNA error warning
#> <char> <int> <int> <int> <int> <lgcl> <lgcl>
#> 1: mpg_string 1 0 1 0 FALSE FALSE
#> 2: hp_numeric 1 1 0 0 FALSE FALSE
#> 3: mpg_low 32 30 2 0 FALSE FALSE
#>
#> $row_level_errors
#> $row_level_errors$mpg_low
#> Broken Rule Car (Name) mpg
#> <char> <char> <num>
#> 1: mpg_low Cadillac Fleetwood 10.4
#> 2: mpg_low Lincoln Continental 10.4
In those cases, we can also pass a list of elements to the env argument.
validate_rules(mtcars_names,
               mpg_string = is.character(mpg),
               hp_numeric = is.numeric(hp),
               mpg_low = mpg > var_min_mpg,
               env = list(var_min_mpg = min_mpg),
               keep_rl_cols = "Car (Name)")
#> $summary
#> name items passes fails nNA error warning
#> <char> <int> <int> <int> <int> <lgcl> <lgcl>
#> 1: mpg_string 1 0 1 0 FALSE FALSE
#> 2: hp_numeric 1 1 0 0 FALSE FALSE
#> 3: mpg_low 32 30 2 0 FALSE FALSE
#>
#> $row_level_errors
#> $row_level_errors$mpg_low
#> Broken Rule Car (Name) mpg
#> <char> <char> <num>
#> 1: mpg_low Cadillac Fleetwood 10.4
#> 2: mpg_low Lincoln Continental 10.4
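The general idea of evaluating a rule against both the data columns and a user-supplied list can be sketched in base R. This is not tidyvalidate's implementation, only an illustration of the lookup order; the names rule and rule_vars are hypothetical.

```r
# Hypothetical illustration, not the package's internals
rule <- quote(mpg > var_min_mpg)
rule_vars <- list(var_min_mpg = 12)

# Names are looked up in the mtcars columns first;
# anything not found there is looked up in rule_vars (and then in base R)
passes <- eval(rule,
               envir = mtcars,
               enclos = list2env(rule_vars, parent = baseenv()))

# Number of rows that break the rule
n_fails <- sum(!passes)
n_fails
```

Passing the values in a list keeps the rule independent of whatever happens to exist in the global environment at the time it runs.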
Example - Alerting if error
A report that identifies mistakes is really useful, but we don’t want to review the same summaries when every check passes; we only want to be alerted when something fails.
That’s the main purpose of the action_if_problem() function. It has the following actions available:
- stop: The default option. It raises an error if any of the rules listed in the summary has a failed element.
try({
  validate_rules(mtcars,
                 mpg_string = is.character(mpg),
                 hp_numeric = is.numeric(hp),
                 mpg_low = mpg > 15) |>
    action_if_problem("We shouldn't have cars with low mpg",
                      problem_action = "stop")
})
#> [1] "We shouldn't have cars with low mpg"
#> name items passes fails nNA error warning
#> <char> <int> <int> <int> <int> <lgcl> <lgcl>
#> 1: mpg_string 1 0 1 0 FALSE FALSE
#> 2: mpg_low 32 26 6 0 FALSE FALSE
#> Error in action_if_problem(validate_rules(mtcars, mpg_string = is.character(mpg), :
#> We shouldn't have cars with low mpg
- warning: It lets you know that there is a problem without stopping the code, and it returns the same results provided by the validate_rules() function.
warning_results <-
  validate_rules(mtcars,
                 mpg_string = is.character(mpg),
                 hp_numeric = is.numeric(hp),
                 mpg_low = mpg > 15) |>
  action_if_problem("We shouldn't have cars with low mpg",
                    problem_action = "warning")
#> [1] "We shouldn't have cars with low mpg"
#> name items passes fails nNA error warning
#> <char> <int> <int> <int> <int> <lgcl> <lgcl>
#> 1: mpg_string 1 0 1 0 FALSE FALSE
#> 2: mpg_low 32 26 6 0 FALSE FALSE
#> Warning in action_if_problem(validate_rules(mtcars, mpg_string =
#> is.character(mpg), : We shouldn't have cars with low mpg
warning_results
#> $summary
#> name items passes fails nNA error warning
#> <char> <int> <int> <int> <int> <lgcl> <lgcl>
#> 1: mpg_string 1 0 1 0 FALSE FALSE
#> 2: hp_numeric 1 1 0 0 FALSE FALSE
#> 3: mpg_low 32 26 6 0 FALSE FALSE
#>
#> $row_level_errors
#> $row_level_errors$mpg_low
#> Broken Rule mpg
#> <char> <num>
#> 1: mpg_low 14.3
#> 2: mpg_low 10.4
#> 3: mpg_low 10.4
#> 4: mpg_low 14.7
#> 5: mpg_low 13.3
#> 6: mpg_low 15.0
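The difference between the two actions can be sketched with base R conditions. The helper alert_on_fails() and the trimmed summary_df below are hypothetical stand-ins written for this illustration; they are not part of tidyvalidate and do not reproduce what action_if_problem() prints.

```r
# Hypothetical stand-in for a validation summary (only two columns kept)
summary_df <- data.frame(name = c("hp_numeric", "mpg_low"),
                         fails = c(0L, 6L))

# Hypothetical helper: raise an error or a warning when any rule has fails
alert_on_fails <- function(summary_df, message,
                           problem_action = c("stop", "warning")) {
  problem_action <- match.arg(problem_action)
  if (any(summary_df$fails > 0)) {
    if (problem_action == "stop") stop(message, call. = FALSE)
    warning(message, call. = FALSE)
  }
  invisible(summary_df)
}

# "stop" interrupts the pipeline; tryCatch() captures the error message
caught <- tryCatch(
  alert_on_fails(summary_df, "mpg_low has failures", "stop"),
  error = function(e) conditionMessage(e)
)

# "warning" signals the problem but still returns its input
returned <- withCallingHandlers(
  alert_on_fails(summary_df, "mpg_low has failures", "warning"),
  warning = function(w) invokeRestart("muffleWarning")
)
```

In a scheduled script, the stop behaviour is usually the safer choice because later steps never run on data that failed validation, while the warning behaviour suits exploratory work where the results are still needed.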