Skip to contents

Introduction to tidyvalidate

tidyvalidate is a package that simplifies data validation in R by providing an intuitive interface to the powerful validate package. It helps you ensure data quality before analysis by making it easy to check both column-level and row-level conditions in your data frames.

Why tidyvalidate?

Data validation in R traditionally faces two challenges:

  1. Base R validation functions work well with vectors but become cumbersome when working with data frames, especially for row-level validations
  2. The validate package offers comprehensive validation capabilities but requires multiple function calls for common use cases

tidyvalidate addresses these challenges by providing a streamlined interface that makes data validation both powerful and easy to use.

Basic Usage

Let’s start with a simple example using the built-in mtcars dataset:

library(tidyvalidate)

# Define and run validations
validation_results <- validate_rules(
  mtcars,
  # Column-level validation
  mpg_type = is.numeric(mpg),
  hp_type = is.numeric(hp),
  # Row-level validation
  mpg_threshold = mpg > 15
)

Understanding the Results

The validate_rules() function returns a list with two components:

1. Summary of All Validations
validation_results$summary
#>             name items passes fails   nNA  error warning
#>           <char> <int>  <int> <int> <int> <lgcl>  <lgcl>
#> 1:      mpg_type     1      1     0     0  FALSE   FALSE
#> 2:       hp_type     1      1     0     0  FALSE   FALSE
#> 3: mpg_threshold    32     26     6     0  FALSE   FALSE

The summary table shows: - Which rules passed or failed - How many items were checked for each rule - The number of failures and NA values

2. Detailed Row-Level Errors
validation_results$row_level_errors
#> $mpg_threshold
#>      Broken Rule   mpg
#>           <char> <num>
#> 1: mpg_threshold  14.3
#> 2: mpg_threshold  10.4
#> 3: mpg_threshold  10.4
#> 4: mpg_threshold  14.7
#> 5: mpg_threshold  13.3
#> 6: mpg_threshold  15.0

This list contains data tables showing exactly which rows failed each validation rule, but only for row-level rules (like our mpg_threshold).

Advanced Features

Adding Row Identifiers

When working with real datasets, you often need to identify which specific records failed validation. Here’s how to include identifiers in the error output:

# Add row names as a column
cars_with_names <- mtcars
cars_with_names$car_name <- rownames(mtcars)
rownames(cars_with_names) <- NULL

# Validate with identifiers
validation_with_ids <- validate_rules(
  cars_with_names,
  mpg_threshold = mpg > 15,
  keep_rl_cols = "car_name"  # Include car_name in error output
)

# View failures with identifiers
validation_with_ids$row_level_errors
#> $mpg_threshold
#>      Broken Rule            car_name   mpg
#>           <char>              <char> <num>
#> 1: mpg_threshold          Duster 360  14.3
#> 2: mpg_threshold  Cadillac Fleetwood  10.4
#> 3: mpg_threshold Lincoln Continental  10.4
#> 4: mpg_threshold   Chrysler Imperial  14.7
#> 5: mpg_threshold          Camaro Z28  13.3
#> 6: mpg_threshold       Maserati Bora  15.0

Using Dynamic Validation Rules

You can make your validation rules dynamic by using variables. There are two ways to do this:

1. Using Environment Variables
# Define threshold in environment
min_mpg <- 12

# Use environment variable in validation
validate_rules(
  mtcars,
  mpg_minimum = mpg > min_mpg
)
#> $summary
#>           name items passes fails   nNA  error warning
#>         <char> <int>  <int> <int> <int> <lgcl>  <lgcl>
#> 1: mpg_minimum    32     30     2     0  FALSE   FALSE
#> 
#> $row_level_errors
#> $row_level_errors$mpg_minimum
#>    Broken Rule   mpg
#>         <char> <num>
#> 1: mpg_minimum  10.4
#> 2: mpg_minimum  10.4
# Pass variables explicitly
validate_rules(
  mtcars,
  mpg_minimum = mpg > threshold,
  env = list(threshold = 12)  # More explicit and safer
)
#> $summary
#>           name items passes fails   nNA  error warning
#>         <char> <int>  <int> <int> <int> <lgcl>  <lgcl>
#> 1: mpg_minimum    32     30     2     0  FALSE   FALSE
#> 
#> $row_level_errors
#> $row_level_errors$mpg_minimum
#>    Broken Rule   mpg
#>         <char> <num>
#> 1: mpg_minimum  10.4
#> 2: mpg_minimum  10.4

Taking Action on Validation Failures

The action_if_problem() function helps you handle validation failures appropriately. It offers two modes:

1. Stop on Failure

Use this when you want to halt execution if validations fail:

try({
  validate_rules(mtcars, mpg_minimum = mpg > 15) |>
    action_if_problem(
      message_text = "Validation failed: Some cars have low MPG",
      problem_action = "stop"
    )
})
#> [1] "Validation failed: Some cars have low MPG"
#>           name items passes fails   nNA  error warning
#>         <char> <int>  <int> <int> <int> <lgcl>  <lgcl>
#> 1: mpg_minimum    32     26     6     0  FALSE   FALSE
#> Error in action_if_problem(validate_rules(mtcars, mpg_minimum = mpg >  : 
#>   Validation failed: Some cars have low MPG

2. Warning on Failure

Use this when you want to continue execution but be notified of failures:

results_with_warning <- validate_rules(mtcars, mpg_minimum = mpg > 15) |>
  action_if_problem(
    message_text = "Warning: Some cars have low MPG",
    problem_action = "warning"
  )
#> [1] "Warning: Some cars have low MPG"
#>           name items passes fails   nNA  error warning
#>         <char> <int>  <int> <int> <int> <lgcl>  <lgcl>
#> 1: mpg_minimum    32     26     6     0  FALSE   FALSE
#> Warning in action_if_problem(validate_rules(mtcars, mpg_minimum = mpg > :
#> Warning: Some cars have low MPG

# Processing continues and you can still access results
results_with_warning
#> $summary
#>           name items passes fails   nNA  error warning
#>         <char> <int>  <int> <int> <int> <lgcl>  <lgcl>
#> 1: mpg_minimum    32     26     6     0  FALSE   FALSE
#> 
#> $row_level_errors
#> $row_level_errors$mpg_minimum
#>    Broken Rule   mpg
#>         <char> <num>
#> 1: mpg_minimum  14.3
#> 2: mpg_minimum  10.4
#> 3: mpg_minimum  10.4
#> 4: mpg_minimum  14.7
#> 5: mpg_minimum  13.3
#> 6: mpg_minimum  15.0

Best Practices

  1. Always include meaningful rule names that describe the validation
  2. Use keep_rl_cols to include identifying columns in error reports
  3. Prefer the env parameter over environment variables for dynamic thresholds
  4. Choose appropriate actions based on how critical the validation is:
    • Use “stop” for critical data quality issues
    • Use “warning” for advisory checks

Next Steps

  • Explore the validate package documentation for more complex validation rules
  • Check out the other vignettes for advanced usage patterns
  • Consider contributing to the package on GitHub