Introduction to tidyvalidate
tidyvalidate is a package that simplifies data validation in R by
providing an intuitive interface to the powerful validate package. It
helps you ensure data quality before analysis by making it easy to
check both column-level and row-level conditions in your data frames.
Why tidyvalidate?
Data validation in R traditionally faces two challenges:
- Base R validation functions work well with vectors but become cumbersome when working with data frames, especially for row-level validations
- The validate package offers comprehensive validation capabilities but requires multiple function calls for common use cases
tidyvalidate addresses these challenges by providing a streamlined
interface that makes data validation both powerful and easy to use.
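For comparison, here is a rough sketch of a single row-level check written directly against the validate package. It assumes validate is installed and relies on its validator(), confront(), summary(), and violating() functions; the object names are illustrative only.

```r
library(validate)

# Step 1: declare the rule set
rules <- validator(mpg_threshold = mpg > 15)

# Step 2: confront the data with the rules
checked <- confront(mtcars, rules)

# Step 3: summarise passes and fails per rule
rule_summary <- summary(checked)

# Step 4: pull out the rows that broke a rule
failing_rows <- violating(mtcars, checked)
```

Each step is a separate call, which is the kind of overhead a single validate_rules() call is meant to remove.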
Basic Usage
Let’s start with a simple example using the built-in mtcars dataset:
library(tidyvalidate)
# Define and run validations
validation_results <- validate_rules(
  mtcars,
  # Column-level validation
  mpg_type = is.numeric(mpg),
  hp_type = is.numeric(hp),
  # Row-level validation
  mpg_threshold = mpg > 15
)
Understanding the Results
The validate_rules() function returns a list with two components:
1. Summary of All Validations
validation_results$summary
#>             name items passes fails   nNA  error warning
#>           <char> <int>  <int> <int> <int> <lgcl>  <lgcl>
#> 1:      mpg_type     1      1     0     0  FALSE   FALSE
#> 2:       hp_type     1      1     0     0  FALSE   FALSE
#> 3: mpg_threshold    32     26     6     0  FALSE   FALSE
The summary table shows:
- Which rules passed or failed
- How many items were checked for each rule
- The number of failures and NA values
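The items column appears to reflect how many values each rule produced. As a base R sketch (not part of tidyvalidate, and with illustrative variable names), a column-level check such as is.numeric() yields a single logical value, while a row-level comparison yields one value per row:

```r
# Column-level: one logical value for the whole column
col_check <- is.numeric(mtcars$mpg)
length(col_check)
#> [1] 1

# Row-level: one logical value per row
row_check <- mtcars$mpg > 15
length(row_check)
#> [1] 32

# Passes and fails are the counts of TRUE and FALSE
sum(row_check)
#> [1] 26
sum(!row_check)
#> [1] 6
```

These counts match the items, passes, and fails reported for mpg_type and mpg_threshold in the summary.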
2. Detailed Row-Level Errors
validation_results$row_level_errors
#> $mpg_threshold
#>      Broken Rule   mpg
#>           <char> <num>
#> 1: mpg_threshold  14.3
#> 2: mpg_threshold  10.4
#> 3: mpg_threshold  10.4
#> 4: mpg_threshold  14.7
#> 5: mpg_threshold  13.3
#> 6: mpg_threshold  15.0
This list contains data tables showing exactly which rows failed each
validation rule, but only for row-level rules (like our
mpg_threshold).
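Because row_level_errors is an ordinary named list, it can be inspected with base R tools. The following is a minimal sketch that assumes the structure shown in the output above (one table per failing row-level rule); the errors variable is just a local shorthand.

```r
library(tidyvalidate)

validation_results <- validate_rules(
  mtcars,
  mpg_type = is.numeric(mpg),
  mpg_threshold = mpg > 15
)

errors <- validation_results$row_level_errors

# Which rules have row-level failures?
names(errors)

# How many rows failed each rule?
vapply(errors, nrow, integer(1))

# The offending values for one rule
errors$mpg_threshold$mpg
```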
Advanced Features
Adding Row Identifiers
When working with real datasets, you often need to identify which specific records failed validation. Here’s how to include identifiers in the error output:
# Add row names as a column
cars_with_names <- mtcars
cars_with_names$car_name <- rownames(mtcars)
rownames(cars_with_names) <- NULL
# Validate with identifiers
validation_with_ids <- validate_rules(
  cars_with_names,
  mpg_threshold = mpg > 15,
  keep_rl_cols = "car_name" # Include car_name in error output
)
# View failures with identifiers
validation_with_ids$row_level_errors
#> $mpg_threshold
#>      Broken Rule            car_name   mpg
#>           <char>              <char> <num>
#> 1: mpg_threshold          Duster 360  14.3
#> 2: mpg_threshold  Cadillac Fleetwood  10.4
#> 3: mpg_threshold Lincoln Continental  10.4
#> 4: mpg_threshold   Chrysler Imperial  14.7
#> 5: mpg_threshold          Camaro Z28  13.3
#> 6: mpg_threshold       Maserati Bora  15.0
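Once identifiers are part of the error output, they can be used to separate passing records from failing ones. This is a sketch in base R under the assumption that car_name uniquely identifies each row; failed_cars, clean_cars, and flagged_cars are illustrative names.

```r
library(tidyvalidate)

cars_with_names <- mtcars
cars_with_names$car_name <- rownames(mtcars)
rownames(cars_with_names) <- NULL

validation_with_ids <- validate_rules(
  cars_with_names,
  mpg_threshold = mpg > 15,
  keep_rl_cols = "car_name"
)

# Identifiers of the records that broke the rule
failed_cars <- validation_with_ids$row_level_errors$mpg_threshold$car_name

# Split the data into passing and failing records
is_failed <- cars_with_names$car_name %in% failed_cars
clean_cars <- cars_with_names[!is_failed, ]
flagged_cars <- cars_with_names[is_failed, ]
```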
Using Dynamic Validation Rules
You can make your validation rules dynamic by using variables. There are two ways to do this:
1. Using Variables from the Calling Environment
# Define the threshold in the calling environment
min_mpg <- 12
# Reference that variable directly in the rule
validate_rules(
  mtcars,
  mpg_minimum = mpg > min_mpg
)
#> $summary
#>           name items passes fails   nNA  error warning
#>         <char> <int>  <int> <int> <int> <lgcl>  <lgcl>
#> 1: mpg_minimum    32     30     2     0  FALSE   FALSE
#>
#> $row_level_errors
#> $row_level_errors$mpg_minimum
#>    Broken Rule   mpg
#>         <char> <num>
#> 1: mpg_minimum  10.4
#> 2: mpg_minimum  10.4
2. Using the env Parameter (Recommended)
# Pass variables explicitly
validate_rules(
  mtcars,
  mpg_minimum = mpg > threshold,
  env = list(threshold = 12) # More explicit and safer
)
#> $summary
#>           name items passes fails   nNA  error warning
#>         <char> <int>  <int> <int> <int> <lgcl>  <lgcl>
#> 1: mpg_minimum    32     30     2     0  FALSE   FALSE
#>
#> $row_level_errors
#> $row_level_errors$mpg_minimum
#>    Broken Rule   mpg
#>         <char> <num>
#> 1: mpg_minimum  10.4
#> 2: mpg_minimum  10.4
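One place where passing values through env may be convenient is inside a function, where the threshold arrives as an argument rather than living in the global environment. The helper below is hypothetical (check_min_mpg is not part of tidyvalidate) and assumes env accepts a named list, as in the example above.

```r
library(tidyvalidate)

# Hypothetical helper: the threshold is a function argument,
# forwarded to the rule through env
check_min_mpg <- function(data, threshold) {
  validate_rules(
    data,
    mpg_minimum = mpg > threshold,
    env = list(threshold = threshold)
  )
}

strict <- check_min_mpg(mtcars, threshold = 15)
lenient <- check_min_mpg(mtcars, threshold = 12)

# Compare how many rows fail under each threshold
strict$summary$fails
lenient$summary$fails
```

Keeping the threshold in the function signature makes the rule's inputs visible at the call site instead of depending on whatever happens to be defined in the session.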
Taking Action on Validation Failures
The action_if_problem() function helps you handle validation failures
appropriately. It offers two modes:
1. Stop on Failure
Use this when you want to halt execution if validations fail:
try({
  validate_rules(mtcars, mpg_minimum = mpg > 15) |>
    action_if_problem(
      message_text = "Validation failed: Some cars have low MPG",
      problem_action = "stop"
    )
})
#> [1] "Validation failed: Some cars have low MPG"
#>           name items passes fails   nNA  error warning
#>         <char> <int>  <int> <int> <int> <lgcl>  <lgcl>
#> 1: mpg_minimum    32     26     6     0  FALSE   FALSE
#> Error in action_if_problem(validate_rules(mtcars, mpg_minimum = mpg > :
#> Validation failed: Some cars have low MPG
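The try() wrapper above only keeps the example from aborting. In a script you may want to react to the failure instead; a minimal sketch using base R's tryCatch() follows. The status variable and its messages are illustrative, and the sketch assumes that "stop" mode signals a regular R error, as the output above suggests.

```r
library(tidyvalidate)

status <- tryCatch(
  {
    validate_rules(mtcars, mpg_minimum = mpg > 15) |>
      action_if_problem(
        message_text = "Validation failed: Some cars have low MPG",
        problem_action = "stop"
      )
    "validation passed"
  },
  error = function(e) {
    # Runs only when action_if_problem() stops execution
    paste("validation stopped:", conditionMessage(e))
  }
)

status
```

The handler is a natural place to log the problem or skip downstream steps before the script exits.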
2. Warning on Failure
Use this when you want to continue execution but be notified of failures:
results_with_warning <- validate_rules(mtcars, mpg_minimum = mpg > 15) |>
  action_if_problem(
    message_text = "Warning: Some cars have low MPG",
    problem_action = "warning"
  )
#> [1] "Warning: Some cars have low MPG"
#>           name items passes fails   nNA  error warning
#>         <char> <int>  <int> <int> <int> <lgcl>  <lgcl>
#> 1: mpg_minimum    32     26     6     0  FALSE   FALSE
#> Warning in action_if_problem(validate_rules(mtcars, mpg_minimum = mpg > :
#> Warning: Some cars have low MPG
# Processing continues and you can still access results
results_with_warning
#> $summary
#>           name items passes fails   nNA  error warning
#>         <char> <int>  <int> <int> <int> <lgcl>  <lgcl>
#> 1: mpg_minimum    32     26     6     0  FALSE   FALSE
#>
#> $row_level_errors
#> $row_level_errors$mpg_minimum
#>    Broken Rule   mpg
#>         <char> <num>
#> 1: mpg_minimum  14.3
#> 2: mpg_minimum  10.4
#> 3: mpg_minimum  10.4
#> 4: mpg_minimum  14.7
#> 5: mpg_minimum  13.3
#> 6: mpg_minimum  15.0
Best Practices
- Always include meaningful rule names that describe the validation
- Use keep_rl_cols to include identifying columns in error reports
- Prefer the env parameter over variables from the calling environment for dynamic thresholds
- Choose appropriate actions based on how critical the validation is:
  - Use "stop" for critical data quality issues
  - Use "warning" for advisory checks