
Simplify the process of validating data.frames before any further processing.

Problem

Standard R solutions for validations are good for working with vectors, but they aren’t so useful when working with data.frames, as we need to apply validations specifically at row level.
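
To make the limitation concrete, here is a minimal base R sketch. It uses only the built-in mtcars data set, and the variable names (mpg_ok, failing_rows) are illustrative. The vector-level check is a one-liner, while the row-level check leaves us to locate and extract the failing rows by hand.

```r
# Vector-level check: a single TRUE/FALSE for the whole column
stopifnot(is.numeric(mtcars$hp))

# Row-level check: one TRUE/FALSE per row, so the failing rows
# have to be located and extracted manually
mpg_ok <- mtcars$mpg > 15
failing_rows <- mtcars[!mpg_ok, "mpg", drop = FALSE]
nrow(failing_rows)
```

Every additional rule repeats this bookkeeping, which is the part a validation package is meant to take over.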

On the other hand, the validate package provides excellent tools to validate data.frames, but its workflow is split across several functions to ensure flexibility.
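
As a rough sketch of what that multi-step workflow can look like, the snippet below uses the validate package directly on mtcars. The rule names and intermediate variables are illustrative, and this is only one possible way to chain the functions, not necessarily how tidyvalidate does it internally.

```r
library(validate)

# Step 1: define the rules
rules <- validator(hp_numeric = is.numeric(hp),
                   mpg_low = mpg > 15)

# Step 2: confront the data with the rules
checked <- confront(mtcars, rules)

# Step 3: summarise passes and fails per rule
rule_summary <- summary(checked)

# Step 4: extract the rows that break the row-level rule
# (a separate validator holding only the record-wise rule)
row_rules <- validator(mpg_low = mpg > 15)
failing_rows <- violating(mtcars, row_rules)
```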

Solution

tidyvalidate aims to simplify the tools provided by the validate package to make it easier to identify errors and share the results in any useful way, like a QC report.

Example - Finding errors

The validate_rules() function finds errors at column and row level and returns a list with two elements: summary and row_level_errors.

library(tidyvalidate)

simple_validation <-
  validate_rules(mtcars,
                 mpg_string = is.character(mpg),
                 hp_numeric = is.numeric(hp),
                 mpg_low = mpg > 15)
  • summary is a data.table whose fails column shows the mistakes found: at column level for mpg_string and at row level for mpg_low.
simple_validation$summary
#>          name items passes fails   nNA  error warning
#>        <char> <int>  <int> <int> <int> <lgcl>  <lgcl>
#> 1: mpg_string     1      0     1     0  FALSE   FALSE
#> 2: hp_numeric     1      1     0     0  FALSE   FALSE
#> 3:    mpg_low    32     26     6     0  FALSE   FALSE
  • row_level_errors is a list of data.tables with one element for each broken row-level rule. In this example, it only contains mpg_low, as it is the only row-level rule with failed rows.
simple_validation$row_level_errors
#> $mpg_low
#>    Broken Rule   mpg
#>         <char> <num>
#> 1:     mpg_low  14.3
#> 2:     mpg_low  10.4
#> 3:     mpg_low  10.4
#> 4:     mpg_low  14.7
#> 5:     mpg_low  13.3
#> 6:     mpg_low  15.0
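
Since both elements are data.tables, they can be reshaped for a QC report with ordinary data.table tools. The following is a minimal sketch under that assumption; the variable names and the output file name are hypothetical, and rbindlist() and fwrite() come from the data.table package rather than from tidyvalidate.

```r
library(tidyvalidate)

results <- validate_rules(mtcars,
                          mpg_string = is.character(mpg),
                          hp_numeric = is.numeric(hp),
                          mpg_low = mpg > 15)

# Keep only the rules that reported at least one failure
failed_rules <- results$summary[results$summary$fails > 0, ]

# Stack the per-rule tables into a single one; fill = TRUE pads
# columns that only some rules use
all_row_errors <- data.table::rbindlist(results$row_level_errors,
                                        fill = TRUE)

# Hypothetical location for the report file
report_path <- file.path(tempdir(), "qc_row_errors.csv")
data.table::fwrite(all_row_errors, report_path)
```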

Adding identifiers at row level

When validating elements at row level, it is useful to add columns that are not related to the test itself but are useful to identify the individual elements with errors.

# Creating a unique identifier for each row
mtcars_names <- mtcars
mtcars_names$`Car (Name)` <- rownames(mtcars_names)
rownames(mtcars_names) <- NULL

# Results of validating at row level
validate_rules(mtcars_names,
               mpg_string = is.character(mpg),
               hp_numeric = is.numeric(hp),
               mpg_low = mpg > 15,
               keep_rl_cols = "Car (Name)")$row_level_errors
#> $mpg_low
#>    Broken Rule          Car (Name)   mpg
#>         <char>              <char> <num>
#> 1:     mpg_low          Duster 360  14.3
#> 2:     mpg_low  Cadillac Fleetwood  10.4
#> 3:     mpg_low Lincoln Continental  10.4
#> 4:     mpg_low   Chrysler Imperial  14.7
#> 5:     mpg_low          Camaro Z28  13.3
#> 6:     mpg_low       Maserati Bora  15.0

Validating based on environment variables

Sometimes we need to create dynamic validations based on variables from the global environment, as shown below.

min_mpg <- 12

validate_rules(mtcars_names,
               mpg_string = is.character(mpg),
               hp_numeric = is.numeric(hp),
               mpg_low = mpg > min_mpg,
               keep_rl_cols = "Car (Name)")
#> $summary
#>          name items passes fails   nNA  error warning
#>        <char> <int>  <int> <int> <int> <lgcl>  <lgcl>
#> 1: mpg_string     1      0     1     0  FALSE   FALSE
#> 2: hp_numeric     1      1     0     0  FALSE   FALSE
#> 3:    mpg_low    32     30     2     0  FALSE   FALSE
#> 
#> $row_level_errors
#> $row_level_errors$mpg_low
#>    Broken Rule          Car (Name)   mpg
#>         <char>              <char> <num>
#> 1:     mpg_low  Cadillac Fleetwood  10.4
#> 2:     mpg_low Lincoln Continental  10.4

In those cases, we can also pass a list of elements to the env argument.

validate_rules(mtcars_names,
               mpg_string = is.character(mpg),
               hp_numeric = is.numeric(hp),
               mpg_low = mpg > var_min_mpg,
               env = list(var_min_mpg = min_mpg),
               keep_rl_cols = "Car (Name)")
#> $summary
#>          name items passes fails   nNA  error warning
#>        <char> <int>  <int> <int> <int> <lgcl>  <lgcl>
#> 1: mpg_string     1      0     1     0  FALSE   FALSE
#> 2: hp_numeric     1      1     0     0  FALSE   FALSE
#> 3:    mpg_low    32     30     2     0  FALSE   FALSE
#> 
#> $row_level_errors
#> $row_level_errors$mpg_low
#>    Broken Rule          Car (Name)   mpg
#>         <char>              <char> <num>
#> 1:     mpg_low  Cadillac Fleetwood  10.4
#> 2:     mpg_low Lincoln Continental  10.4
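
One situation where the env argument may be handy is when the threshold is not a global variable at all, for example when it is a function argument. The sketch below assumes env behaves inside a function as it does in the call above; check_min_mpg() and its threshold parameter are hypothetical names introduced only for illustration.

```r
library(tidyvalidate)

# Hypothetical helper: `threshold` is local to the function, so it is
# handed to the rule explicitly through `env`
check_min_mpg <- function(data, threshold) {
  validate_rules(data,
                 mpg_low = mpg > var_min_mpg,
                 env = list(var_min_mpg = threshold))
}

strict <- check_min_mpg(mtcars, threshold = 15)
lenient <- check_min_mpg(mtcars, threshold = 12)
```

Passing the value explicitly keeps the rule independent of whatever happens to be defined in the calling environment.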

Example - Alerting if error

A report that identifies mistakes is really useful, but we don’t always want to review the summaries when all results are good; we only want to be alerted when something fails.

That’s the main purpose of the action_if_problem() function. It has the following actions available:

  • stop: The default option. It throws an error if any of the rules listed in the summary has failed elements.
try({
  validate_rules(mtcars,
                 mpg_string = is.character(mpg),
                 hp_numeric = is.numeric(hp),
                 mpg_low = mpg > 15) |>
    action_if_problem("We shouldn't have cars with low mpg",
                      problem_action = "stop")
})
#> [1] "We shouldn't have cars with low mpg"
#>          name items passes fails   nNA  error warning
#>        <char> <int>  <int> <int> <int> <lgcl>  <lgcl>
#> 1: mpg_string     1      0     1     0  FALSE   FALSE
#> 2:    mpg_low    32     26     6     0  FALSE   FALSE
#> Error in action_if_problem(validate_rules(mtcars, mpg_string = is.character(mpg),  : 
#>   We shouldn't have cars with low mpg
  • warning: It lets you know that there is a problem without stopping the code, and it returns the same results provided by the validate_rules() function.
warning_results <-
  validate_rules(mtcars,
                 mpg_string = is.character(mpg),
                 hp_numeric = is.numeric(hp),
                 mpg_low = mpg > 15) |>
  action_if_problem("We shouldn't have cars with low mpg",
                    problem_action = "warning")
#> [1] "We shouldn't have cars with low mpg"
#>          name items passes fails   nNA  error warning
#>        <char> <int>  <int> <int> <int> <lgcl>  <lgcl>
#> 1: mpg_string     1      0     1     0  FALSE   FALSE
#> 2:    mpg_low    32     26     6     0  FALSE   FALSE
#> Warning in action_if_problem(validate_rules(mtcars, mpg_string =
#> is.character(mpg), : We shouldn't have cars with low mpg

warning_results
#> $summary
#>          name items passes fails   nNA  error warning
#>        <char> <int>  <int> <int> <int> <lgcl>  <lgcl>
#> 1: mpg_string     1      0     1     0  FALSE   FALSE
#> 2: hp_numeric     1      1     0     0  FALSE   FALSE
#> 3:    mpg_low    32     26     6     0  FALSE   FALSE
#> 
#> $row_level_errors
#> $row_level_errors$mpg_low
#>    Broken Rule   mpg
#>         <char> <num>
#> 1:     mpg_low  14.3
#> 2:     mpg_low  10.4
#> 3:     mpg_low  10.4
#> 4:     mpg_low  14.7
#> 5:     mpg_low  13.3
#> 6:     mpg_low  15.0
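
In a script that should record the failure and continue, the error raised by the stop action can be caught with base R's tryCatch(). This is a minimal sketch, assuming the error carries the message passed to action_if_problem(), as the output above suggests; the message text and the outcome variable are illustrative.

```r
library(tidyvalidate)

outcome <- tryCatch(
  {
    validate_rules(mtcars, mpg_low = mpg > 15) |>
      action_if_problem("Cars with low mpg found",
                        problem_action = "stop")
    # Only reached when no rule fails
    "data ok"
  },
  # Turn the error into a plain message instead of halting the script
  error = function(e) conditionMessage(e)
)
```

The outcome value can then be logged or used to decide whether later steps should run.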