Increasing NYC Taxi Drivers Earnings
Problem Description
Opportunity
Taxi drivers could increase their earnings by changing their strategy.
Questions to solve
- How much can a taxi driver increase their monthly earnings just by skipping trips under defined conditions?
- How much can a taxi driver increase their monthly earnings just by changing their initial zone and time?
Business success criteria
Develop a strategy to increase NYC taxi drivers' monthly earnings by 20%.
Project scope
This project is limited to Juno, Uber, Via and Lyft drivers working in New York City, on trips that take place between zones of Manhattan, Brooklyn or Queens (the most active boroughs).
Results Highlight
From Predictions to Profitable Policies
Instead of treating trip acceptance as a simple classification problem, we framed it as a sequential decision task under uncertainty. The final policy, obtained by integrating an XGBoost classifier into a full-day empirical Monte Carlo simulator, yields a mean hourly wage of $60.58, a 9.96% lift over the myopic "take-all" baseline ($55.09). The improvement is achieved by becoming more selective: the optimal operational threshold \(\tau^*\) lies at 0.6 (the highest value tested), meaning the driver accepts only trips with a predicted probability of being "high-value" ≥ 60%. The hourly wage rises monotonically with the threshold, contradicting the intuition that one should accept most trips and reject only the worst.

Simulated hourly wage (with 95% CI) as a function of the decision threshold \(\tau\). The baseline (\(\tau = 0\)) corresponds to the myopic "take-all" policy.
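The interval shown for each threshold can be illustrated with a minimal sketch. The source does not state how the 95% CI was computed, so a t-based interval over per-day simulated wages is assumed here, and the wage values are hypothetical.

```r
# Mean and t-based confidence interval for simulated hourly wages.
# Assumption: one wage value per simulated day; the project may use another method.
mean_ci <- function(wages, level = 0.95) {
  n <- length(wages)
  stopifnot(n > 1)
  center <- mean(wages)
  half_width <- qt(1 - (1 - level) / 2, df = n - 1) * sd(wages) / sqrt(n)
  c(mean = center, lower = center - half_width, upper = center + half_width)
}

# Hypothetical hourly wages from five simulated days at one threshold.
simulated_wages <- c(52, 58, 61, 55, 64)
wage_summary <- mean_ci(simulated_wages)
print(wage_summary)
```

Repeating this summary for every tested threshold would give the points and error bars of a curve like the one described in the caption.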
Modeling for Decision Support
From probabilistic predictions to profit-driven thresholds: We selected the Brier score as the primary evaluation metric because it measures both discrimination and calibration, which is essential when probabilities inform acceptance decisions. To translate predicted probabilities into actionable trip recommendations, we performed a threshold optimization that accounts for asymmetric misclassification costs: rejecting a lucrative trip (false negative) is far costlier than accepting a mediocre one (false positive). By simulating driver earnings across 5-fold cross-validation, we identified an optimal validation threshold of 0.22, well below the default 0.5. This lower threshold reflects the fact that drivers should be slightly more willing to accept trips, as the opportunity cost of waiting for a "perfect" fare outweighs the risk of a low-value ride.
Robustness and generalisation: The final XGBoost model was evaluated on a held-out test set, achieving a Brier score of 0.146 (matching cross-validation) and delivering a net benefit of $0.35 per trip compared to a "take-all" baseline. This represents a 51.5% improvement in expected cost, confirming that the cost-benefit trade-off remains stable on unseen data.
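As a rough illustration of the two ideas above, the sketch below computes a Brier score and picks the threshold that minimizes an asymmetric expected cost. The toy data, the grid, and the cost values (`cost_fn`, `cost_fp`) are hypothetical placeholders, not the figures used in the project.

```r
# Brier score: mean squared difference between predicted probability and outcome.
brier_score <- function(prob, outcome) {
  stopifnot(length(prob) == length(outcome))
  mean((prob - outcome)^2)
}

# Expected cost per trip at a threshold, with asymmetric error costs.
# cost_fn and cost_fp are assumed values chosen only for illustration.
expected_cost <- function(prob, outcome, threshold, cost_fn = 3, cost_fp = 1) {
  accept <- prob >= threshold
  false_neg <- !accept & outcome == 1 # rejected a trip worth taking
  false_pos <- accept & outcome == 0  # accepted a trip not worth taking
  mean(cost_fn * false_neg + cost_fp * false_pos)
}

# Toy predictions and labels (1 = trip was worth taking).
prob <- c(0.92, 0.71, 0.43, 0.31, 0.18, 0.12)
outcome <- c(1, 1, 1, 0, 1, 0)

brier <- brier_score(prob, outcome)

# Evaluate a grid of thresholds and keep the cheapest one.
thresholds <- seq(0.05, 0.95, by = 0.05)
costs <- vapply(
  thresholds,
  function(thr) expected_cost(prob, outcome, thr),
  numeric(1)
)
best_threshold <- thresholds[which.min(costs)]
```

Because a false negative is priced higher than a false positive in this sketch, the selected threshold tends to fall below 0.5, which is the same direction as the validation result reported above.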
Sequential Decision Framework & Business Logic
The core of this project is not a simple classification task; it is a Sequential Decision Analytics problem. We transformed raw observational taxi data into a decision-making tool by addressing two major challenges:
- The Target Variable Dilemma: The original dataset lacked a "ground truth" for whether a trip was optimal. We engineered the target variable `take_current_trip` by calculating the Opportunity Cost of each trip. This involved simulating the potential earnings of waiting for a high-value fare versus accepting the immediate request, creating a decision-centric label from scratch.
- The Baseline Paradox: We established a "Take-All Policy" (accepting every trip) as the baseline. While the ML model shows strong predictive performance (AUC and Brier Score), we have framed the project to acknowledge that predictive accuracy does not automatically equate to policy superiority. The model is designed to optimize a threshold that maximizes net hourly earnings, not just "hits and misses."
- Closing the ADP Loop: The final policy evaluation runs the exact same full-day Monte Carlo simulator used for the baseline, but replaces the random trip selection with a threshold rule based on the XGBoost predictions. This apples-to-apples comparison provides a credible estimate of the real-world lift.
Engineering for Big Data & Software Reliability (Out-of-Core Processing)
We architected a robust, production-grade pipeline designed to handle datasets exceeding available RAM while maintaining strict software engineering standards:
- Custom Tidymodels Extensions: To integrate geospatial features seamlessly into the machine learning pipeline, we developed a custom `recipes` step. This allows for the automated preprocessing of coordinates and spatial joins within a unified workflow, ensuring that feature engineering is consistent during both training and inference.
- Production-Grade R Development: To ensure reliability, the project is structured as a formal R package, moving beyond simple scripts to a maintainable codebase.
- Unit Testing: We implemented a comprehensive suite of tests using `testthat` to validate custom logic, specifically for the simulation functions and the custom `recipes` steps.
- Rigorous Documentation: All core functions are fully documented using `roxygen2`, providing clear API definitions, parameter requirements, and usage examples.
- Hybrid Analytical Engine: We utilized DuckDB as an out-of-core engine to perform heavy aggregations and joins directly on disk. Once filtered, we leveraged data.table in R for ultra-efficient in-memory manipulation, combining SQL's disk performance with R's functional programming power.
- Reproducible Environments: The entire stack is managed via Nix and Docker, ensuring the environment (including complex geospatial system dependencies) is 100% reproducible across any machine.
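The hybrid DuckDB and data.table pattern can be sketched as follows. This is a minimal illustration that assumes the `DBI`, `duckdb` and `data.table` packages are available; it uses an in-memory database and a toy table with hypothetical column names, whereas the project works against large files on disk.

```r
library(DBI)
library(data.table)

# In-memory DuckDB for the sketch; a file-backed database would be used for big data.
con <- dbConnect(duckdb::duckdb())

# Toy stand-in for the trip records (column names are hypothetical).
trips <- data.frame(
  pickup_zone = c("A", "A", "B", "B", "B"),
  driver_pay = c(10, 20, 5, 15, 40)
)
dbWriteTable(con, "trips", trips)

# The aggregation runs inside DuckDB; only the small summary reaches R.
zone_summary <- dbGetQuery(con, "
  SELECT pickup_zone, COUNT(*) AS n_trips, AVG(driver_pay) AS mean_pay
  FROM trips
  GROUP BY pickup_zone
  ORDER BY pickup_zone
")
dbDisconnect(con)

# Continue in memory with data.table on the reduced result.
setDT(zone_summary)
zone_summary[, pay_rank := frank(-mean_pay)]
print(zone_summary)
```

The design choice is to let the database do the work that scales with the raw data, and to keep R for the steps that operate on already reduced tables.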
High-Dimensional Feature Engineering & Spatial Intelligence
- Conceptual Clustering (NLP): To navigate the "Curse of Dimensionality" presented by 160,000+ US Census variables, we did not use arbitrary selection. Instead, we applied NLP (Jaccard Distance) and Edge-Betweenness clustering to group variables into conceptual themes (e.g., "Commuting Habits," "Wealth Distribution"), allowing for a data-driven prioritization of features.
- Geospatial Intersections: We integrated OpenStreetMap (OSM) data by performing complex spatial intersections. We mapped road lengths and amenity densities (restaurants, transit hubs) to specific taxi zones to capture the geographic "DNA" of NYC.
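To make the Jaccard step concrete, here is a minimal sketch that compares variable labels as sets of word tokens. The labels and the tokenization rule are assumptions made for illustration, and the sketch stops at the distance matrix; the edge-betweenness clustering applied afterwards is not shown.

```r
# Split a label into a set of lowercase word tokens.
tokenize <- function(label) {
  tokens <- strsplit(tolower(label), "[^a-z0-9]+")[[1]]
  unique(tokens[nzchar(tokens)])
}

# Jaccard distance between two token sets: 1 - |intersection| / |union|.
jaccard_distance <- function(a, b) {
  1 - length(intersect(a, b)) / length(union(a, b))
}

# Hypothetical census-style variable labels.
labels <- c(
  "Means of transportation to work: car",
  "Means of transportation to work: public transit",
  "Median household income"
)
tokens <- lapply(labels, tokenize)

# Pairwise distance matrix between all labels.
n <- length(tokens)
dist_matrix <- matrix(0, n, n, dimnames = list(labels, labels))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    dist_matrix[i, j] <- jaccard_distance(tokens[[i]], tokens[[j]])
  }
}
```

Labels that share most of their words end up close together, so the two commuting variables are much nearer to each other than either is to the income variable.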
Methodology
To find the optimal solution for those questions, we followed the methodology proposed by Warren B. Powell (2022) in Sequential Decision Analytics and Modeling: Modeling with Python and combined it with the Cross-Industry Standard Process for Data Mining (CRISP-DM) to define a machine learning model that powers the sequential decision.


Following the steps of both methodologies, we organized the articles created in this portfolio website:
| Sequential Decision Analytics | CRISP-DM | Article Name |
|---|---|---|
| Core Elements of the Problem | Business Understanding | 1. Business Understanding Overview |
| Core Elements of the Problem | Data Understanding | 2. Data Collection Process |
| Core Elements of the Problem | Business Understanding | 3. Initial Exploration |
| From Defining Mathematical Model to Evaluating Baseline Policy | Data Understanding | 4. Simulation-Based Estimation of the Baseline Hourly Wage for NYC Taxi Drivers |
| From Defining Mathematical Model to Evaluating Baseline Policy | Data Understanding | 5. Lookahead-Based Labeling for Learning an Improved ADP Policy |
| From Defining Mathematical Model to Evaluating Baseline Policy | Data Preparation | 6. Expanding Geospatial Information |
| From Defining Mathematical Model to Evaluating Baseline Policy | Data Preparation | 7. Expanding Transportation and Socioeconomic Patterns |
| From Defining Mathematical Model to Evaluating Baseline Policy | Modeling and Evaluation | 8. Policy Function Approximation: Training the Classifier and Validating Threshold on Held-Out Data |
| Defining and Evaluating New Policies | Modeling and Evaluation | 9. From Predictions to Policies: Integrating ML into Stochastic Optimization |
| Defining and Evaluating New Policies | Modeling and Evaluation | 10. Strategic \(S_0\): Designing a Policy Function Approximation for Optimal Starting States (Planned) |
| Defining and Evaluating New Policies | Deployment | 11. Wrap Decision Model into REST API (Planned) |
| Defining and Evaluating New Policies | Deployment | 12. Serving Model by a Shiny Web App (Planned) |
Data to Use
In this project, we used a subset of the data available in the TLC Trip Record Data from 2022 to 2023 for High Volume For-Hire Vehicles (which covers the Juno, Uber, Via and Lyft trips within our project scope), with the columns described in its data dictionary.
Disclaimer
This project was completed under strong assumptions given that the data used in the analysis does not provide any unique identifier for taxi drivers, which limits the realism of some results.
Additionally, this project aims to increase taxi driver earnings at the individual level. However, if applied extensively, it could also produce the following unintended consequences:
- Reduced service quality: Drivers focusing solely on maximizing earnings may avoid less profitable areas or times, potentially leaving some passengers underserved.
- Increased congestion: Drivers congregating in high-profit areas could worsen traffic in already busy parts of the city.
This project is intended as a demonstration of data science methodology rather than a prescriptive business recommendation, and these considerations should be carefully weighed before any real-world implementation.
Roadmap & Future Developments
Completed: Full ADP Loop
- Policy Function Approximation (PFA): Trained an XGBoost classifier on lookahead-based labels and calibrated its probability output.
- Tunable Threshold Policy: Defined the decision rule \(x_t = \mathbb{I}\{\hat{p}_t(S_t) \geq \tau\}\).
- Full-Day Monte Carlo Evaluation: Ran the policy inside the exact same simulator used for the baseline, obtaining a 9.96% lift in hourly wage. The optimal operational threshold is \(\tau^* = 0.6\) (the highest value tested), with a monotonic increase in performance up to that point. Further grid search up to \(\tau = 1.0\) will confirm whether the optimum lies even higher.
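The threshold rule and the lift metric from the list above can be written down in a few lines. This is a minimal sketch: the function names and the probabilities are hypothetical, and the wages passed to `relative_lift()` would come from the simulator rather than from this snippet.

```r
# Decision rule x_t = I{p_hat >= tau}: 1 = accept the trip, 0 = skip it.
threshold_policy <- function(p_hat, tau) {
  as.integer(p_hat >= tau)
}

# Relative lift (in percent) of a policy's hourly wage over a baseline wage.
relative_lift <- function(policy_wage, baseline_wage) {
  100 * (policy_wage - baseline_wage) / baseline_wage
}

# Hypothetical calibrated probabilities for four incoming trip requests.
p_hat <- c(0.15, 0.45, 0.62, 0.80)

selective <- threshold_policy(p_hat, tau = 0.6) # accepts only the last two
take_all <- threshold_policy(p_hat, tau = 0)    # accepts every request
```

Setting `tau = 0` reproduces the take-all baseline, which is why the baseline and the learned policy can share one simulator and differ only in this parameter.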
Planned Extensions
- Supply-Side Modeling: The current simulation assumes an abundance of trips. Collecting data on driver availability and realized waiting times would allow a more realistic supply-side extension.
- REST API: An R/Plumber API to serve real-time trip recommendations based on the trained XGBoost model.
- Shiny Dashboard: An interactive web application built in Shiny to visualize the driver's predicted earnings, optimal decision thresholds, and spatial demand heatmaps in real time.

Try the app in your browser: NYC Taxi Zone Selector on Hugging Face Spaces
Source code: https://github.com/AngelFelizR/nyc-taxi-zone-selector
Project Structure and Tooling
Reproducibility and long-term maintainability were core priorities from the start, which shaped every tooling decision in this project. The following tools were used to achieve this:
- We use `git` to manage changes in the code and provide an interface to share the project on GitHub.
- `Docker` and `Nix` are used to build a reproducible dev-container based on `default.nix`. The container can be accessed over SSH using a public and private key pair as defined in `setup.sh`, and the `.envrc` sets the Nix environment to use in the Positron console.
- For modeling, we used the `tidymodels` framework to ensure we are following good modeling practices.
- Since the project follows the basic structure of an R package, we were able to document and create unit tests for custom functions using `testthat`, `roxygen2` and `devtools`. This was especially important to ensure that the simulation function and the custom step function (which extends the `recipes` package) work correctly.
- The project also follows the structure of a Quarto project and renders all articles into the `docs` folder, giving us full control over the format used to present each article. Results are hosted on GitHub Pages, so they can be shared at no cost.
- The `.Rprofile` overrides `install.packages`, `update.packages` and `remove.packages` to make clear that R packages must be defined in `default.nix` to ensure reproducibility.
- To manage data larger than RAM, we use `duckdb` and keep large files in a separate folder named `NycTaxiBigFiles` under the same parent directory as this repo.
- To cache results generated during the investigation process, we use `.qs2` files and track them with `pins`, stored under the folder `NycTaxiPins` in the same parent directory as this repo.
- We use the air extension to ensure consistent code formatting across the project.
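The `.Rprofile` override mentioned in the list can be pictured with a short sketch. This is a hypothetical fragment written for illustration (the helper `block_package_change()` and the error message are assumptions), not a copy of the file in the repository.

```r
# Hypothetical .Rprofile fragment: block ad hoc package changes so that
# default.nix stays the single place where R packages are declared.
block_package_change <- function(action) {
  force(action)
  function(...) {
    stop(
      "Do not ", action, " packages from R. ",
      "Edit default.nix and rebuild the environment instead.",
      call. = FALSE
    )
  }
}

# These definitions live in the global environment and mask the utils versions.
install.packages <- block_package_change("install")
update.packages <- block_package_change("update")
remove.packages <- block_package_change("remove")
```

With this in place, a call such as `install.packages("dplyr")` fails immediately with a message pointing to `default.nix`, instead of silently changing the environment.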
The result is a hybrid structure that combines an R package (with documented functions and unit tests) and a Quarto website (with rendered articles and hosted results), which was one of the most challenging aspects of the project to set up correctly:
tree -L 3
.
├── air.toml
├── default.nix
├── DESCRIPTION
├── docker-compose.yml
├── Dockerfile
├── docs
│   ├── figures
│   │   ├── CRISP-DM_Process_Diagram.png
│   │   ├── htop_parallel_process.png
│   │   ├── logo-generated.jpeg
│   │   ├── model_benefit_curve.png
│   │   ├── model-benefit.jpg
│   │   ├── nyc-taxi-navbar-logo.png
│   │   ├── nyc-taxi-navbar-logo.xcf
│   │   ├── screenshot-ui.png
│   │   └── Sequential-Decision-Modeling-Framework.png
│   ├── index.html
│   ├── investigation-phases
│   │   ├── 01-business-understanding.html
│   │   ├── 02-data-collection-process.html
│   │   ├── 03-initial-exploration_files
│   │   ├── 03-initial-exploration.html
│   │   ├── 04-base-line_files
│   │   ├── 04-base-line.html
│   │   ├── 05-lookahead-labeling_files
│   │   ├── 05-lookahead-labeling.html
│   │   ├── 06-expanding-geospatial-data_files
│   │   ├── 06-expanding-geospatial-data.html
│   │   ├── 07-expanding-transportation-socioeconomic_files
│   │   ├── 07-expanding-transportation-socioeconomic.html
│   │   ├── 08-policy-function-approximation_files
│   │   ├── 08-policy-function-approximation.html
│   │   ├── 09-from-predictions-to-policies_files
│   │   └── 09-from-predictions-to-policies.html
│   ├── man
│   │   └── figures
│   ├── search.json
│   └── site_libs
│       ├── bootstrap
│       ├── clipboard
│       ├── DiagrammeR-styles-0.2
│       ├── ggiraphjs-0.9.2
│       ├── girafe-binding-0.9.2
│       ├── grViz-binding-1.0.11
│       ├── htmltools-fill-0.5.8.1
│       ├── htmlwidgets-1.6.4
│       ├── jquery-3.6.0
│       ├── leaflet-1.3.1
│       ├── leaflet-binding-2.2.3
│       ├── leafletfix-1.0.0
│       ├── Leaflet.glify-3.2.0
│       ├── leaflet-providers-2.0.0
│       ├── leaflet-providers-plugin-2.2.3
│       ├── proj4-2.6.2
│       ├── Proj4Leaflet-1.0.1
│       ├── quarto-html
│       ├── quarto-nav
│       ├── quarto-search
│       ├── rstudio_leaflet-1.3.1
│       └── viz-1.8.2
├── figures
│   ├── CRISP-DM_Process_Diagram.png
│   ├── htop_parallel_process.png
│   ├── logo-generated.jpeg
│   ├── model_benefit_curve.png
│   ├── model-benefit.jpg
│   ├── nyc-taxi-navbar-logo.png
│   ├── nyc-taxi-navbar-logo.xcf
│   ├── screenshot-ui.png
│   ├── Sequential-Decision-Modeling-Framework.png
│   └── simulated_wage_vs_threshold.png
├── generate_env.R
├── index.qmd
├── investigation-phases
│   ├── 01-business-understanding.qmd
│   ├── 02-data-collection-process.qmd
│   ├── 03-initial-exploration.qmd
│   ├── 04-base-line.qmd
│   ├── 05-lookahead-labeling.qmd
│   ├── 06-expanding-geospatial-data.qmd
│   ├── 07-expanding-transportation-socioeconomic.qmd
│   ├── 08-policy-function-approximation.qmd
│   ├── 09-from-predictions-to-policies.qmd
│   ├── defining-start-zone-strategy.txt
│   ├── temp2.qmd
│   └── temp.txt
├── man
│   ├── add_performance_variables.Rd
│   ├── add_pred_class.Rd
│   ├── add_take_current_trip.Rd
│   ├── calculate_costs.Rd
│   ├── collect_predictions_best_config.Rd
│   ├── compare_model_predictions.Rd
│   ├── compute_power.Rd
│   ├── figures
│   │   ├── logo.hex
│   │   ├── logo-image.png
│   │   ├── logo.png
│   │   └── Logo-source.txt
│   ├── NycTaxi-package.Rd
│   ├── plot_bar.Rd
│   ├── plot_box.Rd
│   ├── plot_heap_map.Rd
│   ├── plot_num_distribution.Rd
│   ├── required_pkgs.step_join_geospatial_features.Rd
│   ├── simulate_trips.Rd
│   └── step_join_geospatial_features.Rd
├── multicore-scripts
│   ├── 01-fine-tune-future-process.R
│   ├── 02-add-target.R
│   ├── 02-run_add_target.sh
│   ├── 03a-tuning-simple-models.R
│   ├── 03b-tuning-dimreduction-models.R
│   ├── 03c-tuning-tree-models.R
│   └── Grid-search code.R
├── NAMESPACE
├── params.yml
├── _quarto.yml
├── R
│   ├── add_take_current_trip.R
│   ├── calculate_costs.R
│   ├── compare_model_predictions.R
│   ├── compute_power.R
│   ├── NycTaxi-package.R
│   ├── plot_bar.R
│   ├── plot_box.R
│   ├── plot_heap_map.R
│   ├── plot_num_distribution.R
│   ├── simulate_trips.R
│   ├── step_join_geospatial_features.R
│   └── utils.R
├── README.md
├── result -> /nix/store/9pl55mdv7yh07sf5i8z3b39q0dcy23q9-nix-shell
├── setup.sh
└── tests
    ├── testthat
    │   ├── fixtures
    │   ├── test-add_take_current_trip.R
    │   ├── test-calculate_costs.R
    │   ├── test-plot_box.R
    │   ├── test-simulate_trips.R
    │   └── test-step_join_geospatial_features.R
    └── testthat.R
45 directories, 99 files

Defining Development Environment
To reproduce the results of this project, follow these steps to set up the same environment using Docker and Nix.
1. Install Docker and Docker Compose
You need Docker and Docker Compose. Choose the appropriate installation method for your operating system:
- Windows or macOS: Install Docker Desktop (includes Docker Compose).
- Linux: Install the Docker Engine and then Docker Compose.
For Debian 13 (as an example), run the following as root:
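apt update
apt install -y ca-certificates curl
# Store Docker's signing key in a dedicated keyring (apt-key is not available on Debian 13)
install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/debian/gpg -o /etc/apt/keyrings/docker.asc
chmod a+r /etc/apt/keyrings/docker.asc
# Register the Docker repository for the trixie release
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/debian trixie stable" > /etc/apt/sources.list.d/docker.list
apt update
apt install -y docker-ce docker-compose-plugin
systemctl enable docker && systemctl start docker
# Let your user run docker without root, then open a login shell as that user
usermod -aG docker <YOUR-USER>
su - <YOUR-USER>

Note: Replace <YOUR-USER> with your actual username.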
2. Clone the Repository and Prepare Directories
Navigate to the parent directory where you want to store the project and the data folders. Then run:
cd <parent-dir-path>
mkdir NycTaxiBigFiles
mkdir NycTaxiPins
git clone https://github.com/AngelFelizR/NycTaxi

Your directory structure should look like:
<parent-dir-path>/
├── NycTaxi/           # cloned repository
├── NycTaxiBigFiles/   # large data files (mounted into container)
└── NycTaxiPins/       # pin board storage (mounted into container)

3. Run the Setup Script
The repository includes a `setup.sh` script that automates all remaining steps: pulling the image, starting the container, and configuring SSH key-based authentication using your existing `~/.ssh/id_rsa.pub`.
From inside the NycTaxi folder, run:
cd NycTaxi
chmod +x setup.sh
./setup.sh

The script will:
- Pull the pre-built image `angelfelizr/nyc-taxi:4.5.2` from Docker Hub.
- Start the container in detached mode, mapping port `2222` for SSH and mounting the three directories under `/root/`.
- Register your public key (`~/.ssh/id_rsa.pub`) inside the container so you can connect without a password.
#!/bin/bash
docker compose pull
docker compose up -d
docker compose cp ~/.ssh/id_rsa.pub nyc-taxi:/root/.ssh/authorized_keys
docker compose exec nyc-taxi chown root:root /root/.ssh/authorized_keys
docker compose exec nyc-taxi chmod 600 /root/.ssh/authorized_keys
echo "Ready! Connect with: ssh NycTaxi"

You can verify the container is running with `docker compose ps`.
4. Configure SSH
Add the following to your ~/.ssh/config so you can connect with a simple alias:
Host NycTaxi
    HostName 127.0.0.1
    User root
    Port 2222
    IdentityFile ~/.ssh/id_rsa

Then connect with:
ssh NycTaxi

5. Using Positron (or VS Code) with direnv
Since direnv is configured via the .envrc file in the repository, you can use Positron with the SSH remote development feature to work directly inside the container.
- In Positron, select "Connect to Host..." (or use the Remote Explorer).
- Enter `root@localhost:2222` and authenticate using your SSH key (configured in Step 3).
- Once connected, open the folder `/root/NycTaxi`.
- Install the direnv extension by mkhl from the Open VSX Registry. This extension automatically activates direnv when you open a folder containing an `.envrc` file.
After the extension loads, you should see a notification confirming that direnv is active. At that point, any terminal you open inside Positron will have the Nix environment loaded automatically.
To make the R interactive console use the Nix environment instead of the system default, open the Positron command palette and switch the active R interpreter to the one provided by the Nix shell. Once selected, the console will have access to all the R packages defined in default.nix.
6. Remote Pin Board (Optional)
If you need to use the shared pin board, create a cache directory on your host (outside the container) and then, inside R, set up the board as follows:
# On your host (in <parent-dir-path>)
mkdir NycTaxiBoardCache

In your R session (inside the Nix shell), use:

# board_url() comes from the pins package
BoardRemote <- pins::board_url(
  "https://raw.githubusercontent.com/AngelFelizR/NycTaxiPins/refs/heads/main/Board/",
  cache = here::here("../NycTaxiBoardCache")
)

The cache directory is mounted into the container at /root/NycTaxiBoardCache, so pins will be stored on your host and persist between container restarts.