A high-level wrapper around the validate package that performs data validation checks against a set of user-defined business rules. The function accepts a data frame and a set of validation rules, then returns both a summary of validation results and detailed information about any violations at the row level.

Usage

validate_rules(
  df,
  ...,
  env = list(),
  keep_rl_cols = NULL,
  select_rl_rules = NULL
)

Arguments

df

A data.frame or data.table containing the data to validate

...

Validation rules expressed as named R expressions (e.g., "age_check" = age >= 18)

env

A list of variables to be used within the validation rules. These variables can be referenced directly in the rule expressions (e.g., list(min_age = 18))

keep_rl_cols

Character vector specifying additional columns to include in the row-level error output, besides those used in the validation rules

select_rl_rules

Character vector specifying which row-level rules to analyze. If NULL (default), analyzes all failing rules

Value

A list containing two elements:

  • summary: A data.table containing validation results for all rules, including:

    • name: The name of the validation rule

    • items: Number of items checked

    • passes: Number of passing checks

    • fails: Number of failing checks

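    • nNA: Number of NA values encountered

    • error: Whether evaluating the rule raised an error

    • warning: Whether evaluating the rule raised a warning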

  • row_level_errors: A list of data.tables, one for each failing rule, containing:

    • Broken Rule: The name of the failed validation rule

    • Columns used in the validation rule

    • Additional columns specified in keep_rl_cols
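
Because the function wraps the validate package, the summary columns match those produced by validate's own confrontation summary. The sketch below calls validate directly to show where those columns come from. It is an illustration under that assumption, not the function's actual implementation, and it inlines the threshold 15 rather than passing it through env.

```r
library(validate)

# Rules equivalent to those in the Examples section, with the
# mpg threshold written inline instead of supplied via `env`
rules <- validator(
  mpg_minimum = mpg > 15,
  hp_maximum  = hp < 200
)

# Evaluate every rule against every row of the data
cf <- confront(mtcars, rules)

# One row per rule; columns include name, items, passes, fails, nNA
rule_summary <- summary(cf)
rule_summary[, c("name", "items", "passes", "fails", "nNA")]
```

The fails counts here should agree with the summary table shown in the Examples section, since the rules and data are the same.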

Details

The function provides two key benefits:

  1. It simplifies the process of validating data against multiple business rules

  2. It makes it easy to identify specific rows that violate each rule
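
To illustrate the second point, this is roughly what extracting the violating rows for a single rule looks like when done by hand with validate. The variable names (res, hp_violations) are made up for the example, and the function's internal approach may differ.

```r
library(validate)

rules <- validator(
  mpg_minimum = mpg > 15,
  hp_maximum  = hp < 200
)
cf <- confront(mtcars, rules)

# values() gives a logical matrix here:
# one row per record, one column per rule
res <- values(cf)

# Keep only the rows where the hp rule evaluated to FALSE,
# along with the column the rule refers to
hp_violations <- mtcars[!res[, "hp_maximum"], "hp", drop = FALSE]
hp_violations
```

Repeating this for every failing rule, and tracking which columns each rule uses, is the bookkeeping that row_level_errors takes care of.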

The function will stop with an error if any validation rule returns NA values, as these are considered invalid results rather than rule violations.
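
Since NA results cause an error, it can help to check for them before validating. The sketch below uses validate directly to show how a missing value surfaces in the nNA column. The guard at the end is a hypothetical pre-check written for this example, not the function's internal logic.

```r
library(validate)

# Copy of mtcars with one missing value introduced for illustration
dat <- mtcars
dat$mpg[1] <- NA

cf  <- confront(dat, validator(mpg_minimum = mpg > 15))
res <- summary(cf)

# Rules whose evaluation produced at least one NA
na_rules <- res$name[res$nNA > 0]

if (length(na_rules) > 0) {
  message("Rules with NA results: ", paste(na_rules, collapse = ", "))
}
```

One way to avoid the error is to handle missing values in the rule itself, for example with a rule such as !is.na(mpg) & mpg > 15, so that a missing value counts as a failure instead of an NA.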

Examples

# Validate car data against mpg and horsepower rules
validation_results <- data.table::as.data.table(mtcars, keep.rownames = "Car Name") |>
  validate_rules(
    "mpg_minimum" = mpg > min_mpg,
    "hp_maximum" = hp < 200,
    env = list(min_mpg = 15),
    keep_rl_cols = "Car Name"
  )

# View summary of all rules
validation_results$summary
#>           name items passes fails   nNA  error warning
#>         <char> <int>  <int> <int> <int> <lgcl>  <lgcl>
#> 1: mpg_minimum    32     26     6     0  FALSE   FALSE
#> 2:  hp_maximum    32     25     7     0  FALSE   FALSE

# View specific rows that violated each rule
validation_results$row_level_errors
#> $mpg_minimum
#>    Broken Rule            Car Name   mpg
#>         <char>              <char> <num>
#> 1: mpg_minimum          Duster 360  14.3
#> 2: mpg_minimum  Cadillac Fleetwood  10.4
#> 3: mpg_minimum Lincoln Continental  10.4
#> 4: mpg_minimum   Chrysler Imperial  14.7
#> 5: mpg_minimum          Camaro Z28  13.3
#> 6: mpg_minimum       Maserati Bora  15.0
#> 
#> $hp_maximum
#>    Broken Rule            Car Name    hp
#>         <char>              <char> <num>
#> 1:  hp_maximum          Duster 360   245
#> 2:  hp_maximum  Cadillac Fleetwood   205
#> 3:  hp_maximum Lincoln Continental   215
#> 4:  hp_maximum   Chrysler Imperial   230
#> 5:  hp_maximum          Camaro Z28   245
#> 6:  hp_maximum      Ford Pantera L   264
#> 7:  hp_maximum       Maserati Bora   335
#>