Maximizing Driver Earnings by Selecting the Best Configuration and Time to Start

A Simulation-Driven Optimization of Ride-Hail Policies Using Tree-Based Interactions

Once we have a policy to decide if a trip should be taken or rejected, it is important to validate whether any of the following factors could cause an increase or decrease in taxi results:

Defining Combinations to Simulate

As we have 196 zones to evaluate, 3 days to evaluate for each day of the week (7), 2 taxi companies, and we divide the day into 8 moments to measure (one every 3 hours), we end up with an experiment of 65,856 days to simulate.

Note

As DOLocationID is the variable used in the simulation to define the start zone of the first trip to simulate, we need to simulate all available zones for each combination.

Show the code
# Loading functions ----

## To manage relative paths
library(here)
library(pins)
library(qs2)

## Manipulate data
library(data.table)
library(sf)
library(DBI)
library(duckdb)
library(lubridate)
library(timeDate)
library(ggtext)
library(scales)
library(gt)

## To run predictions
library(tidymodels)
library(embed)
library(themis)

## Custom functions, board & params
devtools::load_all()
BoardLocal <- board_folder(here("../NycTaxiPins/Board"))
Params <- yaml::read_yaml(here("params.yml"))


# Defining combinations ----

## Getting shapes to use
ZonesShapes <-
  pin_read(BoardLocal, "ZonesShapes") |>
  (\(df) {
    df[
      df[["borough"]] %chin% names(Params$BoroughColors),
    ]
  })() |>
  st_cast(to = "MULTIPOLYGON")

setDT(ZonesShapes)
ZonesShapes[, LocationID := as.character(LocationID)]

UniqueZoneIds <- unique(ZonesShapes$LocationID) |> sort()


## Defining days to sample
YearDays <- data.table(
  days = seq.Date(make_date(2023), make_date(2023, 12, 31), by = "day")
)
YearDays[, wday := lubridate::wday(days)]
set.seed(4879)
SampledDays <- YearDays[, .SD[sample.int(.N, 3L)], by = "wday"]


## Pre-build the big design matrix once
DataToSimulate <- CJ(
  DOLocationID = UniqueZoneIds,
  hvfhs_license_num = c("HV0003", "HV0005"),
  calendar_day = as.character(SampledDays$days),
  time_hour = seq(0, 21, length.out = 8)
)
DataToSimulate[, `:=`(
  trip_id = as.numeric(1:.N),
  request_datetime = paste(calendar_day, time_hour) |> ymd_h(),
  trip_time = 0
)]


set.seed(4875)

DataToSimulate |>
  slice_sample(n = 8) |>
  gt() |>
  # Add a title and subtitle
  tab_header(
    title = md("**Situations to evaluate**"),
    subtitle = paste(
      "A sample of the",
      comma(nrow(DataToSimulate)),
      "days to simulate"
    )
  ) |>
  # Format the date-time column nicely
  fmt_datetime(
    columns = request_datetime,
    date_style = "iso",
    time_style = "hm"
  ) |>
  # Left-align text, center numbers for better readability
  cols_align(
    align = "center",
    columns = everything()
  ) |>
  # Give it a clean, modern theme
  opt_stylize(style = 1, color = "gray")
Situations to evaluate
A sample of the 65,856 days to simulate
DOLocationID hvfhs_license_num calendar_day time_hour trip_id request_datetime trip_time
37 HV0005 2023-02-24 15 48230 2023-02-24 3:00 PM 0
125 HV0003 2023-07-03 6 7123 2023-07-03 6:00 AM 0
79 HV0003 2023-08-08 12 59565 2023-08-08 12:00 PM 0
236 HV0003 2023-11-01 15 38438 2023-11-01 3:00 PM 0
133 HV0005 2023-05-15 15 9958 2023-05-15 3:00 PM 0
125 HV0005 2023-05-15 12 7269 2023-05-15 12:00 PM 0
55 HV0003 2023-07-26 12 53173 2023-07-26 12:00 PM 0
82 HV0005 2023-12-08 6 60795 2023-12-08 6:00 AM 0

Running the Simulation

As I knew that running the whole process would take me several days, I decided to store the results of the simulations on disk for each zone. In this way, I was able to keep the experiment running as expected and stop the process between days to allow the computer to rest and reduce the risk of losing all progress. I also avoided memory problems while using the computer for daily activities.

To avoid the overhead of using an IDE, the next code was sourced from the terminal after connecting with ssh NycTaxi to consume as few resources as possible.

## Loading env

# Getting prior policy
AcceptRejectPolicyFitted <-
  pin_read(BoardLocal, "AcceptRejectPolicyFitted")

# Competing renaming columns
DataToSimulate[, `:=`(
  calendar_day = NULL,
  time_hour = NULL,
  PULocationID = "1",
  wav_match_flag = "N",
  driver_pay = 0,
  tips = 0,
  dispatching_base_num = "",
  originating_base_num = "",
  on_scene_datetime = NA_POSIXct_,
  pickup_datetime = NA_POSIXct_,
  dropoff_datetime = NA_POSIXct_,
  trip_miles = 0,
  base_passenger_fare = 0,
  tolls = 0,
  bcf = 0,
  sales_tax = 0,
  congestion_surcharge = 0,
  airport_fee = 0,
  shared_request_flag = "",
  shared_match_flag = "",
  access_a_ride_flag = "",
  wav_request_flag = "",
  month = 1,
  year = 2023,
  performance_per_hour = 0
)]

# Check which zones are already done
AlreadyDone <- pin_list(BoardLocal) |>
  grep(pattern = "^InitialZoneSim_ZoneId_", value = TRUE) |>
  gsub(pattern = "InitialZoneSim_ZoneId_", replacement = "")

zones_to_run <- setdiff(UniqueZoneIds, AlreadyDone)

cat(
  "→",
  length(zones_to_run),
  "zones left to simulate (out of",
  length(UniqueZoneIds),
  "total).\n"
)


# === MAIN RESUMABLE LOOP ===
for (zone_i in zones_to_run) {
  cat("   Simulating zone", zone_i, "... ")

  # Filter only this zone
  ZoneData <- DataToSimulate[list(zone_i), on = "DOLocationID"]

  # Fresh connection for each zone
  con <- dbConnect(duckdb(), dbdir = here("../NycTaxiBigFiles/my-db.duckdb"))

  ZoneSim <- simulate_trips(
    con,
    ZoneData,
    seeds = 4878,
    fitted_wf = AcceptRejectPolicyFitted
  )

  dbDisconnect(con, shutdown = TRUE)

  # Save immediately with clear name
  pin_write(
    BoardLocal,
    ZoneSim,
    paste0("InitialZoneSim_ZoneId_", zone_i),
    type = "qs2"
  )

  cat("DONE (", nrow(ZoneSim), "rows)\n")
}

cat("✅ All zones simulated!\n")

Here is the list of zones simulated, and we can see that it took around 102 minutes to complete each zone, which means I invested around 14 days of computation to obtain the results below.

angelfeliz@debian:~/r-projects/NycTaxi$ ls -ld ../NycTaxiPins/Board/InitialZoneSim_ZoneId_* | wc -l
196

Now we just need to summarize the results to start the exploration.

consolidate_all_simulations <- function(zone_id_data, board, sim_start_days) {
  consolidated_data =
    sim_start_trip_summary(
      pin_read(board, zone_id_data),
      sim_start_days
    )[,
      sim_start_days[, list(
        simulation_id = trip_id,
        hvfhs_license_num,
        DOLocationID,
        request_datetime = paste(calendar_day, time_hour) |> ymd_h()
      )][.SD, on = "simulation_id"],
      .SDcols = !patterns("_sd")
    ]

  return(consolidated_data)
}

InitialZoneSimResults <-
  pin_list(BoardLocal) |>
  grep(pattern = "^InitialZoneSim_ZoneId_", value = TRUE) |>
  lapply(
    FUN = consolidate_all_simulations,
    board = BoardLocal,
    sim_start_days = DataToSimulate
  ) |>
  rbindlist()

# Adding region info
InitialZoneSimResults[
  ZonesShapes,
  on = c("DOLocationID" = "LocationID"),
  `:=`(borough = borough, start_zone_name = zone)
]

# Save immediately with clear name
pin_write(
  BoardLocal,
  InitialZoneSimResults,
  "InitialZoneSimResults",
  type = "qs2"
)

Exploring Results

Before anything, it is important to confirm if changing the initial conditions of trips had an effect on the target variable. Just imagine that we see the same value for all combinations—we could stop the question simply by making the first chart.

But you can see in the following chart that there is a lot of variation. The data appears to follow a log-normal distribution due to its long right tail. These are really promising results. The goal behind this experiment is to identify the starting configurations that can beat the current baseline of $55.09, and it is clear based on the chart that the top 10% present higher daily hourly wages than the baseline.

Show the code
InitialZoneSimResults <- pin_read(
  BoardLocal,
  "InitialZoneSimResults"
)

FinalPfaMeans <- pin_read(
  BoardLocal,
  "FinalPfaMeans"
)


TopDailyHourlyWage <- quantile(
  InitialZoneSimResults$daily_hourly_wage_mean,
  0.90
)

InitialZoneSimResults[,
  is_top_trip := fifelse(
    daily_hourly_wage_mean >= TopDailyHourlyWage,
    "yes",
    "no"
  ) |>
    factor(levels = c("no", "yes"))
]


BreaksToPlot <- c(seq(20, 200, 10), unname(FinalPfaMeans)) |> setdiff(60L)

highlight_text <- function(x) {
  paste0(
    " <span style='color:",
    Params$ColorHighlight,
    ";'>**",
    x,
    "**</span> "
  )
}

ggplot(
  InitialZoneSimResults,
  aes(daily_hourly_wage_mean, fill = is_top_trip)
) +
  geom_histogram(color = "black", bins = 30) +
  geom_vline(
    xintercept = FinalPfaMeans[2L],
    linetype = 2
  ) +
  scale_y_continuous(
    label = label_comma(accuracy = 1)
  ) +
  scale_x_continuous(
    label = label_currency(accuracy = 1),
    breaks = setdiff(BreaksToPlot, FinalPfaMeans[1L])
  ) +
  scale_fill_manual(
    values = c("yes" = Params$ColorHighlight, "no" = Params$ColorGray)
  ) +
  labs(
    title = paste0(
      "**Top**",
      highlight_text("10%"),
      "**of combinations make at least**",
      highlight_text(paste0("_", dollar(TopDailyHourlyWage), " per hour_"))
    ),
    y = "Count",
    x = "Daily Hourly Wage"
  ) +
  theme_minimal(base_family = Params$BaseFontFamily, base_size = 12) +
  theme(
    plot.title = element_markdown(size = 12, family = Params$BaseFontFamily),
    axis.title = element_text(face = "bold", family = Params$BaseFontFamily),
    panel.grid.minor.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    legend.position = "none"
  )

As we can see, the improvement behind the policy only benefits Uber drivers, as the median hourly wage for Lyft drivers matches the exact value of the baseline. As a result, our policy will always recommend Uber as the taxi company.

Show the code
TaxiCompanyFillColors <- c(
  "Uber" = Params$ColorHighlight,
  "Lyft" = Params$ColorGray
)

InitialZoneSimResults[,
  taxi_company := fifelse(hvfhs_license_num == 'HV0003', "Uber", "Lyft")
]

ggplot(
  InitialZoneSimResults,
  aes(
    factor(taxi_company, levels = names(TaxiCompanyFillColors)),
    daily_hourly_wage_mean,
    fill = taxi_company
  )
) +
  geom_boxplot(width = 0.5) +
  geom_hline(
    yintercept = FinalPfaMeans[2L],
    linetype = 2
  ) +
  geom_hline(
    yintercept = FinalPfaMeans[1L],
    linetype = 2
  ) +
  scale_fill_manual(
    values = TaxiCompanyFillColors
  ) +
  scale_y_continuous(
    breaks = BreaksToPlot,
    label = dollar_format(1)
  ) +
  labs(
    title = paste0(
      "<span style='color:",
      Params$ColorHighlight,
      ";'>**_Uber_**</span> ",
      "**drivers makes more money than** ",
      "<span style='color:",
      "#5b5b5b",
      ";'>**_Lyft_**</span> **drivers**"
    ),
    x = "Taxi Company",
    y = "Daily Hourly Wage"
  ) +
  theme_light(base_family = Params$BaseFontFamily) +
  theme(
    plot.title = element_markdown(size = 14, family = Params$BaseFontFamily),
    plot.subtitle = element_markdown(
      size = 12,
      family = Params$BaseFontFamily
    ),
    axis.title = element_text(face = "bold", family = Params$BaseFontFamily),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    legend.position = "none"
  )

After isolating the effect of the taxi company on the hourly wage, we can see by selecting the zones with the highest and lowest medians that the change in the target variable explained by starting zone is low. We are evaluating 196 zones, and the increase in the median for Uber drivers was only $2 over the baseline ($62.88) after implementing the trip selection model. Based on the boxplot, we can conclude that zones do not provide enough information to explain the variance present in the target variable.

Show the code
TopBottomZonesMedian <- InitialZoneSimResults[,
  list(daily_hourly_wage_mean = median(daily_hourly_wage_mean)),
  by = c("taxi_company", "DOLocationID")
][order(daily_hourly_wage_mean)][,
  .SD[c(1:10, .N - 0:9)],
  by = "taxi_company",
  .SDcols = "DOLocationID"
][, InitialZoneSimResults[.SD, on = c("taxi_company", "DOLocationID")]][,
  start_zone_name := reorder(
    paste0(start_zone_name, "-", taxi_company),
    daily_hourly_wage_mean,
    median
  )
]

ggplot(
  TopBottomZonesMedian,
  aes(daily_hourly_wage_mean, start_zone_name, fill = borough)
) +
  geom_boxplot() +
  geom_vline(xintercept = FinalPfaMeans[1L], linetype = 2) +
  geom_vline(xintercept = FinalPfaMeans[2L], linetype = 2) +
  scale_fill_manual(values = Params$BoroughColors) +
  scale_x_continuous(
    breaks = BreaksToPlot,
    label = dollar_format(1)
  ) +
  scale_y_discrete(label = \(x) sub("-(.|\\s)+$", "", x)) +
  facet_wrap(vars(taxi_company), ncol = 1, scales = "free_y") +
  labs(
    title = "Some difference can be explained by start zone",
    subtitle = "After splitting by taxi company",
    y = "Start Simulation Zone",
    x = "Daily Hourly Wage",
    fill = "Borough"
  ) +
  theme_minimal(base_family = Params$BaseFontFamily) +
  theme(
    plot.title.position = "plot",
    plot.title = element_text(
      size = 14,
      face = "bold",
      family = Params$BaseFontFamily
    ),
    plot.subtitle = element_text(
      size = 12,
      face = "italic",
      family = Params$BaseFontFamily
    ),
    axis.text.y = element_text(
      size = 10,
      face = "italic",
      family = Params$BaseFontFamily
    ),
    axis.title = element_text(face = "bold", family = Params$BaseFontFamily),
    panel.grid.major.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    legend.position = "right",
    legend.title = element_text(face = "bold", family = Params$BaseFontFamily),
    strip.text = element_text(
      size = 12,
      face = "bold",
      family = Params$BaseFontFamily
    )
  )

Based on the next chart, we can see a strong effect of start hour on the target variable and some moderate interactions with the day of the week. The hourly wage is higher at 0 hours on Sundays and Saturdays, but those increases are not present on weekdays.

Show the code
Weekdays <- c(
  "Sunday",
  "Monday",
  "Tuesday",
  "Wednesday",
  "Thursday",
  "Friday",
  "Saturday"
)

BetterThanModelFillColors <- c(
  "TRUE" = Params$ColorHighlight,
  "FALSE" = Params$ColorGray
)

ggplot(
  copy(InitialZoneSimResults)[, `:=`(
    taxi_company = factor(taxi_company, levels = c("Uber", "Lyft")),
    time_hour = reorder(
      hour(request_datetime),
      hour(request_datetime),
      FUN = sum
    ),
    week_day = factor(Weekdays[wday(request_datetime)], levels = Weekdays)
  )][,
    better_than_model := median(daily_hourly_wage_mean) > FinalPfaMeans[2L],
    by = c("time_hour", "taxi_company", "week_day")
  ],
  aes(
    time_hour,
    daily_hourly_wage_mean,
    fill = better_than_model
  )
) +
  geom_boxplot() +
  geom_hline(yintercept = FinalPfaMeans[1L], linetype = 2) +
  geom_hline(yintercept = FinalPfaMeans[2L], linetype = 2) +
  scale_fill_manual(
    values = BetterThanModelFillColors
  ) +
  scale_y_continuous(breaks = BreaksToPlot, label = dollar_format(1)) +
  facet_grid(vars(taxi_company), vars(week_day)) +
  labs(
    title = paste0(
      "<span style='color:",
      Params$ColorHighlight,
      ";'>**Start Time**</span> ",
      "**Can Increase Earning Reguardless Day of Week**"
    ),
    x = "Hour Of The Day",
    y = "Daily Hourly Wage"
  ) +
  theme_light(base_family = Params$BaseFontFamily) +
  theme(
    plot.title = element_markdown(size = 12, family = Params$BaseFontFamily),
    plot.subtitle = element_markdown(size = 10, family = Params$BaseFontFamily),
    axis.title = element_text(face = "bold"),
    axis.text.x = element_text(size = 6, family = Params$BaseFontFamily),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    legend.position = "none",
    strip.text = element_text(
      size = 10,
      face = "bold",
      family = Params$BaseFontFamily
    )
  )

Finding Hidden Patterns

Based on the prior exploration, it is clear that the most important rule found so far is to choose Uber as the company. However, to improve earnings further, we need to find better interactions between variables. For that reason, we are going to start with a tree model to see if the model can find the most important interactions to explain changes in the target variable.

To tune the model, we know that the most important hyperparameter is tree_depth to limit how many splits the tree can make, which controls vertical complexity. As a result, we want to use the minimum value possible.

Since we have a lot of data to explore due to the number of zones in the experiment, we know that min_n—which sets the minimum number of data points required in a node to allow a further split—will not be an important hyperparameter, as most splits will have many data points to avoid overfitting.

DecisionTreeSpec <- decision_tree(
  mode = "classification",
  tree_depth = tune("tree_depth"),
  min_n = 10
)

To help the tree find the important patterns, we apply the following transformations to the original data:

  • As start_zone presents high cardinality, we substitute it with hourly_wage_mean_by_zone after isolating the effect of taxi company.
  • To allow the model to see hour of the day and time of the week as the cycles they represent in real life—where the same time repeats every 24 hours and the same time in the week repeats every 168 hours—we create the variables day_cycle_sin_1 and day_cycle_cos_1 instead of just using time_hour, and week_cycle_sin_1 and week_cycle_cos_1 instead of using the original week_day.
  • We show taxi_company as a boolean variable to keep all variables numeric.
DecisionTreeRecipe <- recipe(
  is_top_trip ~ hvfhs_license_num +
    DOLocationID +
    request_datetime +
    simulation_id +
    daily_hourly_wage_mean,
  data = InitialZoneSimResults
) |>
  # Creating variables to use later
  step_mutate(
    taxi_company = ifelse(hvfhs_license_num == 'HV0003', "Uber", "Lyft") |>
      factor(levels = c("Lyft", "Uber")),
    day_cycle = hour(request_datetime),
    week_cycle = (wday(request_datetime) - 1) * 24 + day_cycle,
    hourly_wage_mean_by_zone = paste0(DOLocationID, "-", taxi_company)
  ) |>
  # Changing zones with their mean
  step_lencode(
    hourly_wage_mean_by_zone,
    outcome = vars(daily_hourly_wage_mean),
    smooth = FALSE
  ) |>
  # Transforming company to binary variable
  step_dummy(taxi_company) |>
  # Creating Cycles
  step_harmonic(
    day_cycle,
    cycle_size = 24,
    frequency = 1,
    keep_original_cols = FALSE
  ) |>
  step_harmonic(
    week_cycle,
    cycle_size = 168,
    frequency = 1,
    keep_original_cols = FALSE
  ) |>
  # Saving original vars
  update_role(
    simulation_id,
    DOLocationID,
    daily_hourly_wage_mean,
    hvfhs_license_num,
    request_datetime,
    new_role = "additional info"
  ) |>
  # Removing original variables
  step_upsample(is_top_trip, over_ratio = tune("over_ratio"))

Now, after all this configuration, we are ready to consolidate all that logic into a compact workflow that will help us fit and evaluate our model using all tools from tidymodels.

DecisionTreeWf <-
  workflow() |>
  add_recipe(DecisionTreeRecipe) |>
  add_model(DecisionTreeSpec)

As this is a simple model and we just need to tune 2 parameters, we can use the grid method to find the best trip possible with 10-fold cross-validation to estimate the model error without taking testing data. We just need to understand the results of the experiment rather than using this model to make predictions.

DecisionTreeWfRegularGrid <- grid_regular(
  tree_depth(range = c(2L, 16L)),
  over_ratio(range = c(0.1, 1)),
  levels = c(tree_depth = 5, over_ratio = 10)
)

set.seed(5878)
InitialZoneSimResultsResamples <- vfold_cv(
  InitialZoneSimResults,
  v = 10,
  strata = is_top_trip
)

DecisionTreeWfTuned <- tune_grid(
  object = DecisionTreeWf,
  resamples = InitialZoneSimResultsResamples,
  grid = DecisionTreeWfRegularGrid,
  metrics = metric_set(recall, precision, brier_class),
  control = control_grid(
    verbose = TRUE,
    parallel_over = NULL,
    save_pred = FALSE
  )
)

pin_write(
  BoardLocal,
  DecisionTreeWfTuned,
  name = "DecisionTreeWfTuned",
  type = "qs2"
)

After all this exploration, we not only need to take the configuration that better predicts the results, but instead the one that loses less prediction power without moving too high in tree_depth. For that reason, rather than using the default function select_best, we use a different version that is more flexible in the brier_class score to select the configuration with less model complexity, as we want a model that is easy to explain.

DecisionTreeWfTuned <- pin_read(
  BoardLocal,
  name = "DecisionTreeWfTuned"
)

DecisionTreeWfBestConfig <- select_by_pct_loss(
  DecisionTreeWfTuned,
  metric = "brier_class",
  limit = 1,
  tree_depth
)
Show the code
autoplot(DecisionTreeWfTuned) +
  geom_vline(xintercept = DecisionTreeWfBestConfig$over_ratio, linetype = 2) +
  scale_x_continuous(breaks = breaks_width(0.1)) +
  labs(
    title = paste0(
      "**With a tree depth of**",
      highlight_text(DecisionTreeWfBestConfig$tree_depth),
      "**and**",
      DecisionTreeWfBestConfig$over_ratio |>
        scales::percent() |>
        highlight_text(),
      "**of oversampling**<br>**We can get the best model**"
    )
  ) +
  theme_light(base_family = Params$BaseFontFamily) +
  theme(
    legend.position = "top",
    plot.title = element_markdown(family = Params$BaseFontFamily),
    plot.subtitle = element_text(
      family = Params$BaseFontFamily,
      face = "italic"
    ),
    legend.title = element_markdown(family = Params$BaseFontFamily),
    panel.grid.minor = element_blank()
  )

Now we are ready to train the model and see the probabilities predicted by the model. The results are really good, as we can see that the median of daily_hourly_wage_mean increases with the probability of the trip being of high value.

DecisionTreeWfFitted <-
  finalize_workflow(DecisionTreeWf, DecisionTreeWfBestConfig) |>
  fit(data = InitialZoneSimResults)

pin_write(
  BoardLocal,
  DecisionTreeWfFitted,
  name = "DecisionTreeWfFitted",
  type = "qs2"
)
Show the code
DecisionTreeWfFitted <- pin_read(
  BoardLocal,
  "DecisionTreeWfFitted"
)

DecisionTreeAugmented <-
  augment(DecisionTreeWfFitted, new_data = InitialZoneSimResults)

ggplot(
  DecisionTreeAugmented,
  aes(
    round(.pred_yes, 2) |> forcats::as_factor(),
    daily_hourly_wage_mean
  )
) +
  geom_jitter(aes(color = is_top_trip), alpha = 0.1) +
  geom_boxplot() +
  geom_hline(yintercept = TopDailyHourlyWage, linetype = 2) +
  scale_y_continuous(
    breaks = seq(0, 150, 10) |> setdiff(70) |> c(unname(TopDailyHourlyWage)),
    label = label_currency(accuracy = 1)
  ) +
  scale_color_manual(
    values = c("yes" = Params$ColorHighlight, "no" = Params$ColorGray)
  ) +
  labs(
    title = paste0(
      "**After**",
      highlight_text("50%"),
      "**we can confirm a change in distribution**"
    ),
    y = "Daily Hourly Wage",
    x = "Probabilty to be High Trip"
  ) +
  theme_light(base_family = Params$BaseFontFamily) +
  theme(
    plot.title = element_markdown(family = Params$BaseFontFamily),
    axis.title = element_text(size = 12, face = "bold"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor = element_blank(),
    plot.subtitle = element_text(
      family = Params$BaseFontFamily,
      face = "italic"
    ),
    legend.position = "none"
  )

I feel really good about the results provided in the confusion matrix, as the model was able to predict 41% of days as high value and only incorrectly predicted high-value trips for 5% of all possible trips—a solid pattern.

Show the code
plot_custom_conf_matrix <- function(dt) {
  procced_data =
    dt |>
    group_by(Truth = is_top_trip, Prediction = .pred_class) |>
    summarize(Freq = n(), mean_pred_yes = mean(.pred_yes)) |>
    mutate(truth_prop = Freq / sum(Freq)) |>
    ungroup()

  true_positive_prob =
    procced_data |>
    filter(Truth == "yes", Prediction == "yes") |>
    pull(truth_prop) |>
    percent()

  created_plot =
    procced_data |>
    ggplot(aes(Truth, Prediction)) +
    geom_tile(aes(fill = truth_prop)) +
    geom_label(
      aes(
        label = paste0(
          percent(truth_prop),
          "\n",
          comma(Freq),
          "\n Yes prob: ",
          percent(mean_pred_yes)
        )
      ),
      fill = "white"
    ) +
    scale_fill_continuous(palette = "Greys") +
    labs(
      title = paste0(
        true_positive_prob |> paste("of high value trips") |> highlight_text(),
        "**were predicted**"
      )
    ) +
    theme_minimal(base_family = Params$BaseFontFamily) +
    theme(
      plot.title.position = "plot",
      plot.title = element_markdown(family = Params$BaseFontFamily),
      axis.title = element_text(size = 12, face = "bold"),
      axis.text = element_text(size = 12),
      panel.grid = element_blank(),
      legend.position = 'none'
    )

  return(created_plot)
}

plot_custom_conf_matrix(DecisionTreeAugmented)

Confirming Pattern Effect

To facilitate the interpretation of the effect of using the tree model to change the behavior of the taxi driver, we will range all variables from 0 to 1 in order to have the model coefficients in dollars per hour.

TreeResultsToExplore <-
  DecisionTreeRecipe |>
  step_range(all_numeric_predictors()) |>
  finalize_recipe(DecisionTreeWfBestConfig) |>
  prep(training = InitialZoneSimResults) |>
  bake(new_data = InitialZoneSimResults) |>
  bind_cols(
    transmute(
      DecisionTreeAugmented,
      tree_model_high_value = (.pred_yes > 0.5) * 1
    )
  )

As we want to define the effect of using the tree model on the hourly mean wage of drivers, it is really important to confirm that the variables used to train the model do not present collinearity. Based on the table, it is clear that taxi_company_Uber and hourly_wage_mean_by_zone are contributing the same information. As a result, we know that hourly_wage_mean_by_zone is just dividing drivers by company, and the zone is not providing any additional information.

Show the code
linear_reg() |>
  set_engine("lm") |>
  fit(
    data = TreeResultsToExplore,
    formula = daily_hourly_wage_mean ~
      tree_model_high_value +
      hourly_wage_mean_by_zone +
      taxi_company_Uber +
      day_cycle_sin_1 +
      day_cycle_cos_1 +
      week_cycle_sin_1 +
      week_cycle_cos_1
  ) |>
  extract_fit_engine() |>
  car::vif() |>
  round(digits = 2) |>
  (\(x) tibble(variable = names(x), vif = x))() |>
  arrange(desc(vif)) |>
  gt() |>
  tab_header(
    title = "Multicollinearity Diagnostic: Variance Inflation Factor (VIF)",
    subtitle = "Checking stability of factors determining hourly income"
  ) |>
  cols_label(
    variable = "Variable",
    vif = "VIF Score"
  ) |>
  # Format VIF to 2 decimal places
  fmt_number(columns = vif, decimals = 2) |>
  # Style variable names to be italicized
  tab_style(
    style = cell_text(style = "italic"),
    locations = cells_body(columns = variable)
  ) |>
  # Add a visual warning for high multicollinearity (VIF > 5 or 10)
  tab_style(
    style = cell_text(color = "darkred", weight = "bold"),
    locations = cells_body(columns = everything(), rows = vif >= 5.0)
  ) |>
  tab_source_note(
    source_note = "Note: VIF > 5 indicates potential multicollinearity issues that may inflate standard errors."
  )
Multicollinearity Diagnostic: Variance Inflation Factor (VIF)
Checking stability of factors determining hourly income
Variable VIF Score
taxi_company_Uber 45.38
hourly_wage_mean_by_zone 45.21
tree_model_high_value 1.27
day_cycle_sin_1 1.08
day_cycle_cos_1 1.01
week_cycle_cos_1 1.01
week_cycle_sin_1 1.00
Note: VIF > 5 indicates potential multicollinearity issues that may inflate standard errors.


With 65,856 simulated days all coefficients are statistically significant, so the focus should be on the magnitude of each effect.

The regression reveals a clear hierarchy of factors:

  • A driver that use Uber can earn up to $7.71 more per hour compared to a driver that use Lyft.
  • Starting at a favorable time of day can add up to $5.52 per hour (captured by day_cycle_cos_1), while an unfavorable hour can subtract up to $2.33.
  • Day-of-week effects are small (under $1.58 in either direction), confirming that what time of day the driver starts matters far more than which day of the week they work.
  • A driver who follows the tree model’s accept/reject strategy earns an additional $5.81 per hour on top of all the above. This reflects the specific company-and-timing combinations the tree identified as high-value that no single variable captures independently.

In practice, a driver making all the right decisions — starting with Uber at a favorable time and following the tree model’s strategy — can move from the baseline of $54/hour to approximately $70/hour, representing a 34% improvement in hourly earnings.

Show the code
create_lm_gt_table = function(lm_model) {
  model_coef =
    tidy(lm_model, conf.int = TRUE) |>
    arrange(desc(abs(estimate))) |>
    select(term, conf.low, estimate, conf.high)

  gt_final =
    gt(model_coef) |>
    tab_header(
      title = "Impact Analysis: Decision Tree Policy vs. Baseline Components",
      subtitle = "Hierarchy of factors determining hourly income"
    ) |>
    cols_label(
      term = "Variable",
      conf.low = "Lower",
      estimate = "Estimate (Coeff.)",
      conf.high = "Upper"
    ) |>
    fmt_currency(columns = c(conf.low, estimate, conf.high), decimals = 2) |>
    tab_style(
      style = cell_text(style = "italic"),
      locations = cells_body(columns = term)
    ) |>
    tab_style(
      style = list(
        cell_fill(color = Params$ColorHighlightLow),
        cell_text(weight = "bold")
      ),
      locations = cells_body(rows = term == "tree_model_high_value")
    ) |>
    tab_source_note(
      source_note = "Note: Based on 65,856 simulated days. OLS regression with robust standard errors."
    )

  return(gt_final)
}

LinearModelTreeEffect <- linear_reg() |>
  set_engine("lm") |>
  fit(
    data = TreeResultsToExplore,
    formula = daily_hourly_wage_mean ~
      tree_model_high_value +
      taxi_company_Uber +
      day_cycle_sin_1 +
      day_cycle_cos_1 +
      week_cycle_sin_1 +
      week_cycle_cos_1
  )

create_lm_gt_table(LinearModelTreeEffect)
Impact Analysis: Decision Tree Policy vs. Baseline Components
Hierarchy of factors determining hourly income
Variable Lower Estimate (Coeff.) Upper
(Intercept) $52.90 $53.04 $53.18
taxi_company_Uber $7.61 $7.71 $7.80
tree_model_high_value $5.66 $5.81 $5.97
day_cycle_cos_1 $5.40 $5.52 $5.65
day_cycle_sin_1 −$2.46 −$2.33 −$2.20
week_cycle_cos_1 $2.17 $2.29 $2.42
week_cycle_sin_1 −$0.84 −$0.71 −$0.59
Note: Based on 65,856 simulated days. OLS regression with robust standard errors.

Defining Policy to Implement

To understand how the model can achieve such incredible results, let’s plot the model to see how it is using the features.

Show the code
TreeModelFittedEngine = extract_fit_engine(DecisionTreeWfFitted)

rpart.plot::prp(
  # 1. Model Object
  x = TreeModelFittedEngine,

  # 2. Labels & Formatting
  main = "Strategy to find best trips",
  prefix = "Best trip\n",
  extra = 106,
  digits = 2,

  # 3. Tree Structure & Geometry
  type = 3,
  branch = 0.3,
  branch.lwd = 2.5,
  roundint = FALSE,

  # 4. Node & Leaf Aesthetics
  box.palette = "GyPu",
  round = 2,
  leaf.round = 9,
  under = TRUE,
  under.cex = 1.0,

  # 5. Variable Name Handling
  varlen = 0,
  faclen = 0,
  clip.facs = FALSE,

  # 6. Typography & Fonts
  family = Params$BaseFontFamily,
  split.family = Params$BaseFontFamily,
  split.cex = 0.9
)

As the most important variable for the model was hourly_wage_mean_by_zone, let’s explore which zones were excluded by the model. As a first step, we can extract the split used for that variable.

HourlyWageMeanByZoneSplit <-
  TreeModelFittedEngine$splits["hourly_wage_mean_by_zone", "index"]

HourlyWageMeanByZoneSplit
#> [1] 59.65442

As the variables used by the model are not easy to interpret by humans, let’s integrate into a single dataframe the features used by the model, the features before using the recipe transformations, and the predictions of the model to create visuals that translate what was learned by the model.

TreeResultsToExplain <-
  DecisionTreeRecipe |>
  finalize_recipe(DecisionTreeWfBestConfig) |>
  prep(training = InitialZoneSimResults) |>
  bake(new_data = InitialZoneSimResults) |>
  (\(df) {
    select(df, -all_of(intersect(names(df), names(DecisionTreeAugmented))))
  })() |>
  bind_cols(
    DecisionTreeAugmented
  )
  1. The hourly_wage_mean_by_zone (as explained before) only worked to separate trips between Uber and Lyft to only select Uber at any zone.
TreeResultsToExplain |>
  mutate(is_top_zone = hourly_wage_mean_by_zone >= HourlyWageMeanByZoneSplit) |>
  distinct(DOLocationID, is_top_zone) |>
  count(DOLocationID) |>
  filter(n == 1L) |>
  nrow()
#> [1] 0
  1. It is better to start working at night in order to get better results.
Show the code
TreeResultsToExplain |>
  filter(taxi_company_Uber == 1L) |>
  distinct(
    .pred_class,
    day_cycle_sin_1,
    day_cycle_cos_1,
    time_hour = hour(request_datetime),
    week_day = factor(
      Weekdays[wday(request_datetime)],
      levels = c(Weekdays[-1L], Weekdays[1L])
    ),
    week_cycle_sin_1,
    week_cycle_cos_1
  ) |>
  ggplot(aes(
    day_cycle_sin_1,
    day_cycle_cos_1,
    color = .pred_class
  )) +
  geom_text(aes(label = time_hour), fontface = "bold", size = 3.5) +
  scale_color_manual(
    values = c("yes" = Params$ColorHighlight, "no" = Params$ColorGray)
  ) +
  facet_wrap(vars(week_day), ncol = 4) +
  coord_fixed() +
  labs(
    title = paste0(
      "**Must of the days taxi driver can start at**",
      highlight_text("21 hours")
    ),
    subtitle = "_On **Mondays** and **Fridays** taxi drivers could just rest_"
  ) +
  theme_minimal(base_family = Params$BaseFontFamily, base_size = 12) +
  theme(
    legend.position = "none",
    panel.grid = element_blank(),
    strip.background = element_rect(color = "black"),
    strip.text = element_text(face = "bold", size = 12),
    axis.text = element_blank(),
    axis.title = element_blank(),
    plot.title = element_markdown(),
    plot.subtitle = element_markdown()
  )

Now we just need to take that information in a format that can be applied to the trips sampled from the data.

ValidHoursToStartWorking <-
  TreeResultsToExplain |>
  filter(.pred_class == "yes") |>
  distinct(
    hour = hour(request_datetime),
    week_day = wday(request_datetime),
    week_cycle = (wday(request_datetime) - 1) * 24 + hour(request_datetime)
  ) |>
  arrange(week_cycle)

setDT(ValidHoursToStartWorking)

pin_write(
  BoardLocal,
  ValidHoursToStartWorking,
  name = "ValidHoursToStartWorking",
  type = "qs2"
)
Show the code
ValidHoursToStartWorking <- pin_read(BoardLocal, "ValidHoursToStartWorking")

ValidHoursToStartWorking |>
  gt() |>
  # Add a title and subtitle
  tab_header(
    title = md("**Valid Hour To Use**"),
    subtitle = "In case that of low value start times"
  ) |>
  # Left-align text, center numbers for better readability
  cols_align(
    align = "center",
    columns = everything()
  ) |>
  # Give it a clean, modern theme
  opt_stylize(style = 1, color = "gray")
Valid Hour To Use
In case that of low value start times
hour week_day week_cycle
0 1 0
3 1 3
6 1 6
15 3 63
18 3 66
21 3 69
15 4 87
18 4 90
21 4 93
15 5 111
18 5 114
21 5 117
21 7 165

All this logic was implemented in the optimize_trip_start_time function, and we can make sure that the trips fill these conditions after applying the function over the trips used to create the baseline.

# Sampled data (Chapter 4)
SimulationStartDay <- pin_read(
  BoardLocal,
  name = "SimulationStartDay"
)
original_seeds <- c(8199, 50, 7450, 1430, 9370)

SimulationStartDayUpdated <- optimize_trip_start_time(
  SimulationStartDay,
  start_day_fitted_model = DecisionTreeWfFitted,
  valid_start_times_dt = ValidHoursToStartWorking
)
Show the code
list(SimulationStartDay, SimulationStartDayUpdated) |>
  lapply(\(dt) {
    dt[,
      list(Uber_prop = mean(hvfhs_license_num == "HV0003"), N = .N),
      by = list(
        week_day = factor(
          Weekdays[wday(request_datetime)],
          levels = c(Weekdays[-1L], Weekdays[1L])
        )
      )
    ]
  }) |>
  c(list(
    by = "week_day",
    all = TRUE,
    suffixes = c("_original", "_updated")
  )) |>
  do.call(
    what = "merge"
  ) |>
  gt() |>
  # Header & Titles
  tab_header(
    title = "Simulation Start Days After Policy",
    subtitle = "Now all days are with Uber and any day starts on Monday or Friday"
  ) |>

  # Apply styling to Header (Dark Blue/Gray theme from your style)
  tab_style(
    style = cell_text(color = "white", weight = "bold"),
    locations = cells_title()
  ) |>
  tab_style(
    style = cell_fill(color = "#2c3e50"),
    locations = cells_title()
  ) |>

  # Format Proportions (Percentages look great here, or use decimals = 4 to match your preference)
  fmt_percent(
    columns = c(Uber_prop_original, Uber_prop_updated),
    decimals = 0
  ) |>

  # Format Counts (Integers)
  fmt_integer(
    columns = c(N_original, N_updated)
  ) |>

  # Handle Missing Values (Replaces NA with a clean em-dash)
  sub_missing(
    columns = everything(),
    missing_text = "—"
  ) |>

  # Group columns visually using spanners
  tab_spanner(
    label = "ORIGINAL CONDITIONS",
    columns = c(Uber_prop_original, N_original)
  ) |>
  tab_spanner(
    label = "UPDATED CONDITIONS",
    columns = c(Uber_prop_updated, N_updated)
  ) |>

  # Clean Column Labels
  cols_label(
    week_day = "DAY OF WEEK",
    Uber_prop_original = "UBER PROP.",
    N_original = "N",
    Uber_prop_updated = "UBER PROP.",
    N_updated = "N"
  ) |>

  # Alignment
  cols_align(
    align = "left",
    columns = week_day
  ) |>

  # General Table Options
  tab_options(
    table.font.names = "Arial",
    table.background.color = "white",
    column_labels.font.weight = "bold",
    column_labels.background.color = "#f8f9fa",
    table.width = pct(100),
    quarto.disable_processing = TRUE
  )
Simulation Start Days After Policy
Now all days are with Uber and any day starts on Monday or Friday
DAY OF WEEK
ORIGINAL CONDITIONS
UPDATED CONDITIONS
UBER PROP. N UBER PROP. N
Monday 69% 140
Tuesday 68% 120 100% 384
Wednesday 71% 144 100% 142
Thursday 73% 152 100% 154
Friday 72% 161
Saturday 75% 178 100% 345
Sunday 78% 166 100% 36

Evaluating the Policy

To evaluate this policy under the same conditions we evaluated the baseline and the accept/reject policy, we are using the same simulation with the difference that now we are passing the start_day_fitted_model and valid_start_times_dt arguments, which will be used by optimize_trip_start_time to change the initial conditions of each day.

# Production Policy (Chapter 9)
AcceptRejectPolicyFitted <- pin_read(
  BoardLocal,
  "AcceptRejectPolicyFitted"
)

# Running simulation
con <- dbConnect(duckdb(), dbdir = here("../NycTaxiBigFiles/my-db.duckdb"))

FinalStartTimeResults <- simulate_trips(
  con,
  SimulationStartDay,
  seeds = original_seeds,
  fitted_wf = AcceptRejectPolicyFitted,
  start_day_fitted_model = DecisionTreeWfFitted,
  valid_start_times_dt = ValidHoursToStartWorking
)

dbDisconnect(con, shutdown = TRUE)

pin_write(
  BoardLocal,
  FinalStartTimeResults,
  "FinalStartTimeResults",
  type = "qs2"
)
Note

This final run took approximately 2 days on the available hardware.

Now we can consolidate all results in order to compare the three strategies under the same conditions.

SimulationHourlyWage <- pin_read(BoardLocal, "SimulationHourlyWage")
FinalPfaResults <- pin_read(BoardLocal, "FinalPfaResults")
FinalStartTimeResults <- pin_read(BoardLocal, "FinalStartTimeResults")

PolicyLevels <- c(
  "Accept All Trips",
  "Accept Based On Policy",
  "Setting Start Time"
)

FinalPfaResultsSummary <- rbind(
  sim_start_trip_summary(SimulationHourlyWage, SimulationStartDay)[,
    strategy := PolicyLevels[1L]
  ],
  sim_start_trip_summary(FinalPfaResults, SimulationStartDay)[,
    strategy := PolicyLevels[2L]
  ],
  sim_start_trip_summary(FinalStartTimeResults, SimulationStartDayUpdated)[,
    strategy := PolicyLevels[3L]
  ],
  fill = FALSE
)

FinalPfaResultsSummary[, strategy := factor(strategy, PolicyLevels)]

FinalPfaMeansAfterStart <-
  FinalPfaResultsSummary[,
    .(daily_hourly_wage_mean = mean(daily_hourly_wage_mean)),
    keyby = "strategy"
  ][, setattr(daily_hourly_wage_mean, "names", as.character(strategy))]

The results are amazing: we move from having $55.09 per hour to being able to earn on average $69.07 if we implement both policies at the same time. We can also see that the chart presents outliers of days that ended earning more than $90, which was not possible even using the accept/reject policy alone.

Show the code
PolicyLevelsFillColors <- c(
  Params$ColorGray,
  Params$ColorGray,
  Params$ColorHighlight
)
setattr(PolicyLevelsFillColors, "names", PolicyLevels)

ggplot(FinalPfaResultsSummary, aes(strategy, daily_hourly_wage_mean)) +
  geom_boxplot(aes(fill = strategy), width = 0.5) +
  geom_hline(yintercept = FinalPfaMeansAfterStart[3L], linetype = 2) +
  scale_fill_manual(
    values = PolicyLevelsFillColors
  ) +
  scale_y_continuous(breaks = breaks_width(5), label = dollar_format(1)) +
  labs(
    title = paste0(
      "<span style='color:",
      Params$ColorHighlight,
      ";'>**",
      PolicyLevels[3L],
      "**</span> ",
      "**is the best policy**"
    ),
    subtitle = paste0(
      "The new mean kept around **",
      dollar(FinalPfaMeansAfterStart[3L], accuracy = 1),
      "** confirming stable results"
    ),
    x = "Strategy Implemented",
    y = "Mean Hourly Wage"
  ) +
  theme_light(base_family = Params$BaseFontFamily) +
  theme(
    plot.title = element_markdown(size = 14, family = Params$BaseFontFamily),
    plot.subtitle = element_markdown(size = 12, family = Params$BaseFontFamily),
    axis.title = element_text(face = "bold"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    legend.position = "none"
  )

If we compare the difference between the accept/reject policy and the policy created now to start the trips, we can confirm that the increase in the average is not due to an increase in variability but rather because in the majority of cases the taxi drivers are getting better results than in the original situation.

Show the code
InterceptToValidate <- 0

MeanPolicyDifference <- FinalPfaResultsSummary[
  strategy != PolicyLevels[1L],
  .(
    mean_difference = fifelse(
      strategy == PolicyLevels[2L],
      -daily_hourly_wage_mean,
      daily_hourly_wage_mean
    ) |>
      sum()
  ),
  by = "simulation_id"
]

PropSimoOverIntercept <- MeanPolicyDifference[, mean(
  mean_difference >= InterceptToValidate
)]

ggplot(MeanPolicyDifference) +
  geom_histogram(aes(mean_difference), bins = 35) +
  geom_vline(xintercept = InterceptToValidate, linetype = 2) +
  scale_x_continuous(n.breaks = 10, label = dollar_format(1)) +
  labs(
    title = paste0(
      "<span style='color:",
      Params$ColorHighlight,
      ";'>**",
      percent(PropSimoOverIntercept, accuracy = 1),
      "**</span> ",
      "**of simulated days ended earning more money**"
    ),
    x = "Mean Hourly Wage Increase"
  ) +
  theme_minimal(base_family = Params$BaseFontFamily) +
  theme(
    plot.title = element_markdown(face = "bold", size = 14),
    panel.grid.minor = element_blank(),
    axis.title = element_text(face = "bold")
  )

Key Takeaways

  • Platform Selection: Drivers should exclusively choose Uber. The selective trip policy provides no benefits for Lyft, whose median wage stagnates at the $55.09 baseline , whereas driving for Uber adds up to +$7.71/hour.

  • Geographic Zones: Initial start zones have a negligible impact (only a $2 variance over the baseline). Due to severe multicollinearity with the choice of company, geographic start zones provide no useful independent information.

  • Shift Timing: Favorable starting hours (especially night shifts) add up to +$5.52/hour , while unfavorable starting times subtract up to -$2.33/hour.

  • Day of the Week: Day-of-the-week variance is minor (under $1.58 in either direction) ; however, the optimized policy explicitly excludes Mondays and Fridays as sub-optimal working days.

  • Final Policy Lift: Pairing the trip-acceptance policy with the strategic “Setting Start Time” configuration boosts average hourly wages from $55.09 to $69.07 , while unlocking exceptional high-yield outlier days exceeding $90/hour.