Introduction to tidyvalidate
tidyvalidate is a package that simplifies data
validation in R by providing an intuitive interface to the powerful
validate package. It helps you ensure data quality before
analysis by making it easy to check both column-level and row-level
conditions in your data frames.
Why tidyvalidate?
Data validation in R traditionally faces two challenges:
- Base R validation functions work well with vectors but become cumbersome when working with data frames, especially for row-level validations
- The validate package offers comprehensive validation capabilities but requires multiple function calls for common use cases
tidyvalidate addresses these challenges by providing a
streamlined interface that makes data validation both powerful and easy
to use.
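To make the first challenge concrete, here is a minimal base R sketch of the bookkeeping that row-level checks need when done by hand. It does not use tidyvalidate at all. The rule list and the hp_positive rule are illustrative assumptions added for the demonstration.

```r
# Hand-rolled row-level validation in base R (illustrative only).
# Each rule is a function that returns one logical value per row.
rules <- list(
  mpg_threshold = function(df) df$mpg > 15,
  hp_positive   = function(df) df$hp > 0  # hypothetical extra rule
)

# all() tells us whether a rule holds everywhere, but not which rows break it
rule_holds <- vapply(rules, function(rule) all(rule(mtcars)), logical(1))

# Finding the offending rows needs a separate subsetting step per rule
failing_rows <- lapply(rules, function(rule) {
  mtcars[!rule(mtcars), c("mpg", "hp"), drop = FALSE]
})

# Number of failing rows per rule
fail_counts <- vapply(failing_rows, nrow, integer(1))
```

Even with only two rules, the pass/fail flags, the failing rows, and the counts each need their own step, and the amount of glue code grows with every rule added.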
Basic Usage
Let’s start with a simple example using the built-in
mtcars dataset:
library(tidyvalidate)
# Define and run validations
validation_results <- validate_rules(
  mtcars,
  # Column-level validation
  mpg_type = is.numeric(mpg),
  hp_type = is.numeric(hp),
  # Row-level validation
  mpg_threshold = mpg > 15
)
Understanding the Results
The validate_rules() function returns a list with two
components:
1. Summary of All Validations
validation_results$summary
#> name items passes fails nNA error warning
#> <char> <int> <int> <int> <int> <lgcl> <lgcl>
#> 1: mpg_type 1 1 0 0 FALSE FALSE
#> 2: hp_type 1 1 0 0 FALSE FALSE
#> 3: mpg_threshold 32 26 6 0 FALSE FALSE
The summary table shows:
- Which rules passed or failed
- How many items were checked for each rule
- The number of failures and NA values
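The counts in the summary can be cross-checked in base R. The sketch below is not how tidyvalidate computes them. It simply evaluates the same expressions directly, which suggests why a column-level rule reports a single item while a row-level rule reports one item per row. The helper count_outcomes() is a hypothetical name introduced here.

```r
# Evaluate the two kinds of rule directly against mtcars
column_check <- with(mtcars, is.numeric(mpg))  # a single TRUE/FALSE
row_check    <- with(mtcars, mpg > 15)         # one TRUE/FALSE per row

# Hypothetical helper: tally a logical vector the way the summary columns read
count_outcomes <- function(x) {
  c(
    items  = length(x),
    passes = sum(x, na.rm = TRUE),
    fails  = sum(!x, na.rm = TRUE),
    nNA    = sum(is.na(x))
  )
}

mpg_type_counts      <- count_outcomes(column_check)
mpg_threshold_counts <- count_outcomes(row_check)
```

Here mpg_type_counts has one item and mpg_threshold_counts has 32, in line with the items column shown for mpg_type and mpg_threshold.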
2. Detailed Row-Level Errors
validation_results$row_level_errors
#> $mpg_threshold
#> Broken Rule mpg
#> <char> <num>
#> 1: mpg_threshold 14.3
#> 2: mpg_threshold 10.4
#> 3: mpg_threshold 10.4
#> 4: mpg_threshold 14.7
#> 5: mpg_threshold 13.3
#> 6: mpg_threshold 15.0
This list contains data tables showing exactly which rows failed each
validation rule, but only for row-level rules (like our
mpg_threshold).
Advanced Features
Adding Row Identifiers
When working with real datasets, you often need to identify which specific records failed validation. Here’s how to include identifiers in the error output:
# Add row names as a column
cars_with_names <- mtcars
cars_with_names$car_name <- rownames(mtcars)
rownames(cars_with_names) <- NULL
# Validate with identifiers
validation_with_ids <- validate_rules(
  cars_with_names,
  mpg_threshold = mpg > 15,
  keep_rl_cols = "car_name" # Include car_name in error output
)
# View failures with identifiers
validation_with_ids$row_level_errors
#> $mpg_threshold
#> Broken Rule car_name mpg
#> <char> <char> <num>
#> 1: mpg_threshold Duster 360 14.3
#> 2: mpg_threshold Cadillac Fleetwood 10.4
#> 3: mpg_threshold Lincoln Continental 10.4
#> 4: mpg_threshold Chrysler Imperial 14.7
#> 5: mpg_threshold Camaro Z28 13.3
#> 6: mpg_threshold Maserati Bora 15.0
Using Dynamic Validation Rules
You can make your validation rules dynamic by using variables. There are two ways to do this:
1. Using Variables from the Surrounding Environment
# Define the threshold as an ordinary R variable
min_mpg <- 12
# Refer to that variable inside the rule
validate_rules(
  mtcars,
  mpg_minimum = mpg > min_mpg
)
#> $summary
#> name items passes fails nNA error warning
#> <char> <int> <int> <int> <int> <lgcl> <lgcl>
#> 1: mpg_minimum 32 30 2 0 FALSE FALSE
#>
#> $row_level_errors
#> $row_level_errors$mpg_minimum
#> Broken Rule mpg
#> <char> <num>
#> 1: mpg_minimum 10.4
#> 2: mpg_minimum 10.4
2. Using the env Parameter (Recommended)
# Pass variables explicitly
validate_rules(
  mtcars,
  mpg_minimum = mpg > threshold,
  env = list(threshold = 12) # More explicit and safer
)
#> $summary
#> name items passes fails nNA error warning
#> <char> <int> <int> <int> <int> <lgcl> <lgcl>
#> 1: mpg_minimum 32 30 2 0 FALSE FALSE
#>
#> $row_level_errors
#> $row_level_errors$mpg_minimum
#> Broken Rule mpg
#> <char> <num>
#> 1: mpg_minimum 10.4
#> 2: mpg_minimum 10.4
Taking Action on Validation Failures
The action_if_problem() function helps you handle
validation failures appropriately. It offers two modes:
1. Stop on Failure
Use this when you want to halt execution if validations fail:
try({
  validate_rules(mtcars, mpg_minimum = mpg > 15) |>
    action_if_problem(
      message_text = "Validation failed: Some cars have low MPG",
      problem_action = "stop"
    )
})
#> [1] "Validation failed: Some cars have low MPG"
#> name items passes fails nNA error warning
#> <char> <int> <int> <int> <int> <lgcl> <lgcl>
#> 1: mpg_minimum 32 26 6 0 FALSE FALSE
#> Error in action_if_problem(validate_rules(mtcars, mpg_minimum = mpg > :
#> Validation failed: Some cars have low MPG
2. Warning on Failure
Use this when you want to continue execution but be notified of failures:
results_with_warning <- validate_rules(mtcars, mpg_minimum = mpg > 15) |>
  action_if_problem(
    message_text = "Warning: Some cars have low MPG",
    problem_action = "warning"
  )
#> [1] "Warning: Some cars have low MPG"
#> name items passes fails nNA error warning
#> <char> <int> <int> <int> <int> <lgcl> <lgcl>
#> 1: mpg_minimum 32 26 6 0 FALSE FALSE
#> Warning in action_if_problem(validate_rules(mtcars, mpg_minimum = mpg > :
#> Warning: Some cars have low MPG
# Processing continues and you can still access results
results_with_warning
#> $summary
#> name items passes fails nNA error warning
#> <char> <int> <int> <int> <int> <lgcl> <lgcl>
#> 1: mpg_minimum 32 26 6 0 FALSE FALSE
#>
#> $row_level_errors
#> $row_level_errors$mpg_minimum
#> Broken Rule mpg
#> <char> <num>
#> 1: mpg_minimum 14.3
#> 2: mpg_minimum 10.4
#> 3: mpg_minimum 10.4
#> 4: mpg_minimum 14.7
#> 5: mpg_minimum 13.3
#> 6: mpg_minimum 15.0
Best Practices
- Always include meaningful rule names that describe the validation
- Use keep_rl_cols to include identifying columns in error reports
- Prefer the env parameter over variables picked up from the surrounding environment for dynamic thresholds
- Choose appropriate actions based on how critical the validation is:
  - Use “stop” for critical data quality issues
  - Use “warning” for advisory checks
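The preference for env can be illustrated with plain base R evaluation. This sketch does not reproduce tidyvalidate's internals, which are not described here. It only shows the general risk: a rule that looks up a free variable in the surrounding environment can silently pick up an unrelated value, whereas an explicit list pins the value down. The stray threshold of 100 is made up for the demonstration.

```r
# The rule as an unevaluated expression with a free variable, threshold
rule <- quote(mpg > threshold)

# An unrelated variable that happens to share the name used in the rule
threshold <- 100
surrounding_env <- environment()

# Implicit lookup: the rule silently picks up the stray value
fails_implicit <- sum(!eval(rule, mtcars, surrounding_env))

# Explicit lookup: only the supplied value is visible to the rule
explicit_env <- list2env(list(threshold = 12), parent = baseenv())
fails_explicit <- sum(!eval(rule, mtcars, explicit_env))
```

With the stray value every row fails, while the explicit value of 12 flags only the two cars at 10.4 mpg, matching the mpg_minimum results shown earlier.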

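Choosing between the two actions comes down to R's condition system: an error aborts the surrounding computation unless it is caught, while a warning lets it finish. The sketch below uses a hypothetical helper, check_fail_count(), as a stand-in for that choice. It is not action_if_problem() and makes no claim about how that function is implemented.

```r
# Hypothetical stand-in for the stop/warning choice (not tidyvalidate code)
check_fail_count <- function(n_fails, problem_action = c("stop", "warning")) {
  problem_action <- match.arg(problem_action)
  if (n_fails > 0) {
    msg <- sprintf("%d rows failed validation", n_fails)
    if (problem_action == "stop") {
      stop(msg, call. = FALSE)
    } else {
      warning(msg, call. = FALSE)
    }
  }
  invisible(n_fails)
}

# Count the rows that break the rule used throughout this article
n_fails <- sum(!(mtcars$mpg > 15))

# "stop": the code after the check never runs, so the error handler decides
stop_outcome <- tryCatch(
  {
    check_fail_count(n_fails, "stop")
    "continued"
  },
  error = function(e) "halted"
)

# "warning": execution carries on; the warning is muffled here only to keep
# the demonstration quiet
warning_outcome <- withCallingHandlers(
  {
    check_fail_count(n_fails, "warning")
    "continued"
  },
  warning = function(w) invokeRestart("muffleWarning")
)
```

In a script, the "stop" style suits checks that later steps depend on, and the "warning" style suits checks whose failures should be logged without blocking the run.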