A high-level wrapper around the validate package that performs data validation checks against a set of user-defined business rules. The function accepts a data frame and a set of validation rules, then returns both a summary of validation results and detailed information about any violations at the row level.
Usage
validate_rules(
  df,
  ...,
  env = list(),
  keep_rl_cols = NULL,
  select_rl_rules = NULL
)
Arguments
- df
A data.frame or data.table containing the data to validate
- ...
Validation rules expressed as named R expressions (e.g., "age_check" = age >= 18)
- env
A list of variables to be used within the validation rules. These variables can be referenced directly in the rule expressions (e.g., list(min_age = 18))
- keep_rl_cols
Character vector specifying additional columns to include in the row-level error output, besides those used in the validation rules
- select_rl_rules
Character vector specifying which row-level rules to analyze. If NULL (default), analyzes all failing rules
Value
A list containing two elements:
- summary: A data.table containing validation results for all rules, including:
  - name: The name of the validation rule
  - items: Number of items checked
  - passes: Number of passing checks
  - fails: Number of failing checks
  - nNA: Number of NA values encountered
  - error: Whether evaluating the rule raised an error
  - warning: Whether evaluating the rule raised a warning
- row_level_errors: A list of data.tables, one for each failing rule, containing:
  - Broken Rule: The name of the failed validation rule
  - The columns used in the validation rule
  - Any additional columns specified in keep_rl_cols
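To make the shape of the two return elements concrete, the following base R sketch rebuilds something similar by hand for the mtcars rules used in the Examples. It is not the function's implementation (which wraps the validate package and returns data.tables); the names rules, results, summary_tbl and row_level are hypothetical and exist only for this illustration.

```r
# Rules as unevaluated expressions: a hypothetical stand-in for the `...` argument
rules <- list(
  mpg_minimum = quote(mpg > min_mpg),
  hp_maximum  = quote(hp < 200)
)
env <- list(min_mpg = 15)  # plays the role of the `env` argument

# Evaluate each rule against the columns of mtcars plus the extra variables
results <- lapply(rules, function(rule) eval(rule, c(as.list(mtcars), env)))

# One row per rule, mirroring the documented summary columns
summary_tbl <- data.frame(
  name   = names(results),
  items  = vapply(results, length, integer(1)),
  passes = vapply(results, function(x) sum(x, na.rm = TRUE), integer(1)),
  fails  = vapply(results, function(x) sum(!x, na.rm = TRUE), integer(1)),
  nNA    = vapply(results, function(x) sum(is.na(x)), integer(1)),
  row.names = NULL
)

# Names of the rules with at least one failing row
failing <- names(results)[
  vapply(results, function(x) any(!x, na.rm = TRUE), logical(1))
]

# One table per failing rule: the rule name plus the data columns the rule uses
row_level <- lapply(stats::setNames(failing, failing), function(nm) {
  failed <- which(!results[[nm]])
  cols <- intersect(all.vars(rules[[nm]]), names(mtcars))
  out <- mtcars[failed, cols, drop = FALSE]
  out[["Broken Rule"]] <- rep(nm, nrow(out))
  out[c("Broken Rule", cols)]
})

summary_tbl
row_level$hp_maximum
```

In this sketch the car names survive as row names; with validate_rules() the equivalent is to pass the identifying column through keep_rl_cols.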
Details
The function provides two key benefits:
- It simplifies the process of validating data against multiple business rules
- It makes it easy to identify specific rows that violate each rule
The function will stop with an error if any validation rule returns NA values, as these are considered invalid results rather than rule violations.
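Because a comparison involving a missing value evaluates to NA in R, it can help to look for missing values in the rule columns before validating. The base R sketch below shows one possible pre-check; it is not part of validate_rules(), and the toy data frame and the rule_cols vector are assumptions made for illustration.

```r
# Toy data with one missing age (hypothetical, for illustration only)
df <- data.frame(age = c(25, NA, 40), income = c(100, 200, 300))

# A comparison involving NA yields NA rather than FALSE
rule_result <- df$age >= 18
anyNA(rule_result)

# Columns the rules refer to (assumed to be known by the caller)
rule_cols <- c("age", "income")

# Count missing values per rule column and show only the affected columns
na_counts <- colSums(is.na(df[rule_cols]))
na_counts[na_counts > 0]

# One option: validate only the rows that are complete in the rule columns
complete <- df[stats::complete.cases(df[rule_cols]), ]
```

Whether dropping incomplete rows is appropriate depends on the data; an alternative is to add an explicit rule such as !is.na(age) so that missing values are reported as ordinary failures.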
Examples
# Validate car data against mpg and horsepower rules
validation_results <- data.table::as.data.table(mtcars, keep.rownames = "Car Name") |>
  validate_rules(
    "mpg_minimum" = mpg > min_mpg,
    "hp_maximum" = hp < 200,
    env = list(min_mpg = 15),
    keep_rl_cols = "Car Name"
  )
# View summary of all rules
validation_results$summary
#>           name items passes fails   nNA  error warning
#>         <char> <int>  <int> <int> <int> <lgcl>  <lgcl>
#> 1: mpg_minimum    32     26     6     0  FALSE   FALSE
#> 2:  hp_maximum    32     25     7     0  FALSE   FALSE
# View specific rows that violated each rule
validation_results$row_level_errors
#> $mpg_minimum
#>    Broken Rule            Car Name   mpg
#>         <char>              <char> <num>
#> 1: mpg_minimum          Duster 360  14.3
#> 2: mpg_minimum  Cadillac Fleetwood  10.4
#> 3: mpg_minimum Lincoln Continental  10.4
#> 4: mpg_minimum   Chrysler Imperial  14.7
#> 5: mpg_minimum          Camaro Z28  13.3
#> 6: mpg_minimum       Maserati Bora  15.0
#>
#> $hp_maximum
#>    Broken Rule            Car Name    hp
#>         <char>              <char> <num>
#> 1:  hp_maximum          Duster 360   245
#> 2:  hp_maximum  Cadillac Fleetwood   205
#> 3:  hp_maximum Lincoln Continental   215
#> 4:  hp_maximum   Chrysler Imperial   230
#> 5:  hp_maximum          Camaro Z28   245
#> 6:  hp_maximum      Ford Pantera L   264
#> 7:  hp_maximum       Maserati Bora   335
#>