Data Collection Process
For most projects, data collection can be done manually by downloading files and placing them in a folder. However, this approach isn’t feasible when working with big data. To handle the large volume of the NYC TLC Trip Record dataset, we’ve developed the following script to automate the entire data collection process. This ensures the project can be easily and reliably reproduced by simply running the code.
Web Scraping Data Links
First, we need to obtain the download links for the required data files. To ensure we always have an up-to-date list, we programmatically scrape the TLC Trip Record Data website. This process extracts the direct links for all the High Volume For-Hire Vehicle files from 2022 and 2023 using the rvest library in R, a powerful tool for web scraping.
Downloading the Source Page
The first step is to download the HTML content of the target webpage. This gives us the raw source code that we can parse to find the data links.
SourcePage <-
  rvest::read_html(
    "https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page"
  )
Extracting Data Links
With the webpage downloaded, we can now parse its content to isolate the specific links we need.
High-Volume Trip Data
The following code block extracts the download links for the monthly High-Volume For-Hire Vehicle trip data. It works by:
- Finding all hyperlink (<a>) elements on the page.
- Extracting the href attribute (the actual URL) from each element.
- Using a regular expression (grep) to filter this list, keeping only the links for .parquet files from the years 2022 and 2023.
- Cleaning up and sorting the resulting list of URLs.
TripLinks <-
SourcePage |>
rvest::html_elements(xpath = '//div[@class="faq-answers"]//li/a') |>
rvest::html_attr("href") |>
grep(pattern = "fhvhv_[a-z]+_202[23]-\\d{2}\\.parquet", value = TRUE) |>
trimws() |>
sort()
FileNames <- basename(TripLinks)
FileNames
#> [1] "fhvhv_tripdata_2022-01.parquet" "fhvhv_tripdata_2022-02.parquet"
#> [3] "fhvhv_tripdata_2022-03.parquet" "fhvhv_tripdata_2022-04.parquet"
#> [5] "fhvhv_tripdata_2022-05.parquet" "fhvhv_tripdata_2022-06.parquet"
#> [7] "fhvhv_tripdata_2022-07.parquet" "fhvhv_tripdata_2022-08.parquet"
#> [9] "fhvhv_tripdata_2022-09.parquet" "fhvhv_tripdata_2022-10.parquet"
#> [11] "fhvhv_tripdata_2022-11.parquet" "fhvhv_tripdata_2022-12.parquet"
#> [13] "fhvhv_tripdata_2023-01.parquet" "fhvhv_tripdata_2023-02.parquet"
#> [15] "fhvhv_tripdata_2023-03.parquet" "fhvhv_tripdata_2023-04.parquet"
#> [17] "fhvhv_tripdata_2023-05.parquet" "fhvhv_tripdata_2023-06.parquet"
#> [19] "fhvhv_tripdata_2023-07.parquet" "fhvhv_tripdata_2023-08.parquet"
#> [21] "fhvhv_tripdata_2023-09.parquet" "fhvhv_tripdata_2023-10.parquet"
#> [23] "fhvhv_tripdata_2023-11.parquet" "fhvhv_tripdata_2023-12.parquet"
Location and Shapefile Data
Next, we extract the links for two supplementary files: the Taxi Zone Lookup Table (a CSV file mapping location IDs to borough and zone names) and the Taxi Zone Shapefile (a ZIP file containing geographic data for mapping). The code below targets the links specifically by their text content.
TaxiZoneLink <-
SourcePage |>
rvest::html_elements(xpath = '//ul/li/a[text()="Taxi Zone Lookup Table"]') |>
rvest::html_attr("href") |>
trimws()
TaxiZoneLink
#> [1] "https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv"TaxiZoneShapesLink <-
SourcePage |>
rvest::html_elements(xpath = '//ul/li/a[text()="Taxi Zone Shapefile"]') |>
rvest::html_attr("href") |>
trimws()
TaxiZoneShapesLink
#> [1] "https://d37ci6vzurychx.cloudfront.net/misc/taxi_zones.zip"Organizing and Saving Data Files
To effectively query this large dataset, we need a smart storage strategy. We will save the Parquet files in a partitioned folder structure, organized by year and month (e.g., year=2022/month=01/). This format is highly optimized for query engines like DuckDB, as it allows the database to quickly filter and read only the necessary data partitions, significantly speeding up queries.
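To illustrate the payoff, here is a minimal sketch (the abbreviated path is illustrative; the real paths are defined below): with hive_partitioning enabled, DuckDB's read_parquet() treats the year=/month= folder names as columns, so a filtered query only touches the matching files.

# Sketch only: DuckDB derives year/month columns from the folder names
# and prunes every partition that does not match the WHERE clause.
con_tmp <- DBI::dbConnect(duckdb::duckdb())
DBI::dbGetQuery(con_tmp, "
  SELECT COUNT(*) AS trips
  FROM read_parquet('trip-data/*/*/part-0.parquet', hive_partitioning = true)
  WHERE year = 2022 AND month = 1
")
DBI::dbDisconnect(con_tmp, shutdown = TRUE)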
Defining the Folder Structure
The code below programmatically defines the required directory paths. It starts with a base ../NycTaxiBigFiles folder and then generates a list of nested year and month subfolders based on the filenames we extracted earlier.
BigFilesPath <- here::here("../NycTaxiBigFiles")
ParquetFolderPath <- file.path(BigFilesPath, "trip-data")
YearFoldersPath <-
gsub(
x = FileNames,
pattern = "^fhvhv_tripdata_|-\\d{2}\\.parquet$",
replacement = ""
) |>
paste0("year=", a = _) |>
unique() |>
file.path(ParquetFolderPath, a = _)
MonthFoldersPath <-
paste(
gsub(
x = FileNames,
pattern = "^fhvhv_tripdata_|-\\d{2}\\.parquet$",
replacement = ""
) |>
paste0("year=", a = _),
gsub(
x = FileNames,
pattern = "^fhvhv_tripdata_\\d{4}-|\\.parquet$",
replacement = ""
) |>
paste0("month=", a = _),
sep = "/"
) |>
file.path(ParquetFolderPath, a = _) |>
sort()
Creating the Folders
This loop iterates through the defined paths and creates the necessary directories if they don’t already exist.
FoldersToCreate <- c(
BigFilesPath,
ParquetFolderPath,
YearFoldersPath,
MonthFoldersPath
)
for (paths in FoldersToCreate) {
if (!file.exists(paths)) dir.create(paths)
}
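As a design note, listing every parent folder explicitly could be avoided; a compact alternative (a sketch, not the code used in this project) relies on dir.create() with recursive = TRUE, which creates any missing parent directories in one call.

# Sketch only: creating the deepest month-level paths recursively
# also creates the trip-data and year-level parents.
for (p in MonthFoldersPath) {
  dir.create(p, recursive = TRUE, showWarnings = FALSE)
}

Either way, we can inspect the full set of folders: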
FoldersToCreate
#> [1] "/home/nyc/NycTaxi/../NycTaxiBigFiles"
#> [2] "/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data"
#> [3] "/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2022"
#> [4] "/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2023"
#> [5] "/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2022/month=01"
#> [6] "/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2022/month=02"
#> [7] "/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2022/month=03"
#> [8] "/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2022/month=04"
#> [9] "/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2022/month=05"
#> [10] "/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2022/month=06"
#> [11] "/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2022/month=07"
#> [12] "/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2022/month=08"
#> [13] "/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2022/month=09"
#> [14] "/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2022/month=10"
#> [15] "/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2022/month=11"
#> [16] "/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2022/month=12"
#> [17] "/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2023/month=01"
#> [18] "/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2023/month=02"
#> [19] "/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2023/month=03"
#> [20] "/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2023/month=04"
#> [21] "/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2023/month=05"
#> [22] "/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2023/month=06"
#> [23] "/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2023/month=07"
#> [24] "/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2023/month=08"
#> [25] "/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2023/month=09"
#> [26] "/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2023/month=10"
#> [27] "/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2023/month=11"
#> [28] "/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2023/month=12"
Downloading and Storing the Files
Now, we download each file and save it to the correct location.
- The Parquet trip files are downloaded into their respective year=/month= subfolders and named part-0.parquet.
- The taxi zone CSV and shapefile ZIP are saved directly into the ../NycTaxiBigFiles folder.
Since downloading large Parquet files can take time, the timeout limit is increased to 30 minutes (1800 seconds).
# Parquet files might take a long time to download
options(timeout = 1800)
# Parquet trip data
for (link_i in seq_along(TripLinks)) {
download.file(
TripLinks[link_i],
destfile = file.path(MonthFoldersPath[link_i], "part-0.parquet"),
mode = "wb"
)
}
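Since a full run downloads two years of monthly files, a hedged variation of this loop (an illustration, not the original script) can make it resumable by skipping months that already exist on disk:

# Sketch only: skip files that are already present so an interrupted
# run can resume without re-downloading completed months.
for (link_i in seq_along(TripLinks)) {
  dest <- file.path(MonthFoldersPath[link_i], "part-0.parquet")
  if (!file.exists(dest)) {
    download.file(TripLinks[link_i], destfile = dest, mode = "wb")
  }
}

The supplementary files are small, so we download them directly: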
# Taxi Zone CSV
download.file(
TaxiZoneLink,
destfile = file.path(BigFilesPath, "taxi_zone_lookup.csv"),
mode = "wb"
)
# Taxi Zone Shapes
download.file(
TaxiZoneShapesLink,
destfile = file.path(BigFilesPath, "taxi_zones.zip"),
mode = "wb"
)
Saving Zone Shapes
Now we can unzip the taxi_zones.zip file into a new taxi_zones directory and remove the original ZIP file.
unzip(
zipfile = file.path(BigFilesPath, "taxi_zones.zip"),
exdir = file.path(BigFilesPath, "taxi_zones")
)
file.remove(file.path(BigFilesPath, "taxi_zones.zip"))
Adding Zone Information to the Board
To have this information available later, we can save the zone information as pins.
BoardLocal <- pins::board_folder(here::here("../NycTaxiPins/Board"))
ZoneCodesRef <-
data.table::fread(
file.path(BigFilesPath, "taxi_zone_lookup.csv"),
colClasses = "character"
)
ZonesShapes <-
sf::read_sf(file.path(BigFilesPath, "taxi_zones/taxi_zones.shp"))
BoardLocal |> pins::pin_write(ZoneCodesRef, "ZoneCodesRef", type = "qs2")
BoardLocal |> pins::pin_write(ZonesShapes, "ZonesShapes", type = "qs2")
Consolidating Data into a Database
While Parquet files are excellent for storing and sharing large datasets, they can be inefficient for operations like random sampling. To enable fast and memory-efficient analysis on data that is larger than RAM, we will consolidate all our Parquet files into a DuckDB database. DuckDB is a high-performance analytical database that runs directly within R.
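As a preview of that payoff, once the NycTrips table is created (step 3 below), a random sample can be drawn directly in SQL without scanning every Parquet file; a minimal sketch:

# Sketch only: DuckDB's SAMPLE clause draws a random subset efficiently.
con_tmp <- DBI::dbConnect(
  duckdb::duckdb(),
  dbdir = file.path(BigFilesPath, "my-db.duckdb")
)
TripsSample <- DBI::dbGetQuery(
  con_tmp,
  "SELECT * FROM NycTrips USING SAMPLE 10000 ROWS"
)
DBI::dbDisconnect(con_tmp, shutdown = TRUE)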
The process involves five simple steps:
1. Listing All Parquet Files for Import
First, we create a formatted string that lists all the individual part-0.parquet file paths. This string will be used directly in a SQL query to tell DuckDB to read all the files as a single, unified dataset.
ParquetSource <-
file.path(MonthFoldersPath, "part-0.parquet") |>
paste0("'", a = _, "'") |>
paste0(collapse = ", ") |>
paste0("read_parquet([", a = _, "])")
ParquetSource
#> [1] "read_parquet(['/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2022/month=01/part-0.parquet', '/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2022/month=02/part-0.parquet', '/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2022/month=03/part-0.parquet', '/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2022/month=04/part-0.parquet', '/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2022/month=05/part-0.parquet', '/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2022/month=06/part-0.parquet', '/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2022/month=07/part-0.parquet', '/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2022/month=08/part-0.parquet', '/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2022/month=09/part-0.parquet', '/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2022/month=10/part-0.parquet', '/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2022/month=11/part-0.parquet', '/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2022/month=12/part-0.parquet', '/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2023/month=01/part-0.parquet', '/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2023/month=02/part-0.parquet', '/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2023/month=03/part-0.parquet', '/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2023/month=04/part-0.parquet', '/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2023/month=05/part-0.parquet', '/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2023/month=06/part-0.parquet', '/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2023/month=07/part-0.parquet', '/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2023/month=08/part-0.parquet', '/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2023/month=09/part-0.parquet', '/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2023/month=10/part-0.parquet', '/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2023/month=11/part-0.parquet', '/home/nyc/NycTaxi/../NycTaxiBigFiles/trip-data/year=2023/month=12/part-0.parquet'])"2. Creating a Database Connection
Next, we establish a connection to our DuckDB database file, which will be stored at ../NycTaxiBigFiles/my-db.duckdb. If the file does not exist, DuckDB will create it automatically.
con <- DBI::dbConnect(duckdb::duckdb(),
  dbdir = file.path(BigFilesPath, "my-db.duckdb"))
3. Creating and Populating the Trips Table
This is the core step where the data is imported and transformed. We execute a single SQL command that:
- Creates a new table named NycTrips.
- Reads all the data from the Parquet files specified in ParquetSource.
- Adds a unique trip_id to each record using the ROW_NUMBER() function.
- Calculates a new field, performance_per_hour, to measure driver earnings per hour.
NycTripsCreateTable <- glue::glue_safe("
CREATE TABLE NycTrips AS
SELECT
ROW_NUMBER() OVER () AS trip_id,
*,
(driver_pay + tips) / (trip_time / 3600) AS performance_per_hour
FROM {ParquetSource}
")
DBI::dbExecute(con, NycTripsCreateTable)
4. Disconnecting from the Database
Once the table is created, it’s good practice to close the database connection to free up resources.
DBI::dbDisconnect(con, shutdown = TRUE)
rm(con)
5. Confirming the Final Database Size
As a final check, we can inspect the size of the resulting database file on disk. This confirms that all the data has been successfully imported and saved.
file.path(BigFilesPath, "my-db.duckdb") |>
file.size() |>
structure(class = "object_size") |>
format(units = "auto")
#> [1] "55.8 Gb"Retaining the Full Dataset for R Package Development
This project is not only a data analysis workflow but is also structured as an R package, with unit tests in the tests/ directory that verify the code works correctly and reliably. These tests often need to interact with the full dataset to exercise functionality such as data processing, simulations, or geospatial operations. To enable running devtools::check() (a command that builds the package, runs all tests, and checks for issues), we keep the large data files in the ../NycTaxiBigFiles/ directory. This way, the tests can access the complete data without redownloading or recreating it each time, making development and validation straightforward and efficient. While this results in a large storage footprint (around 67 GB), it ensures the package's integrity and reproducibility.
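For instance, a test can skip gracefully when the big files are absent instead of failing. The sketch below is hypothetical (the test name and expectations are illustrative, not taken from the package) and only demonstrates the pattern:

# Hypothetical test sketch: skip when the full dataset is not on disk.
testthat::test_that("NycTrips table is reachable", {
  db_path <- here::here("../NycTaxiBigFiles/my-db.duckdb")
  testthat::skip_if_not(file.exists(db_path), "Full dataset not available")
  con_test <- DBI::dbConnect(duckdb::duckdb(), dbdir = db_path, read_only = TRUE)
  on.exit(DBI::dbDisconnect(con_test, shutdown = TRUE))
  n <- DBI::dbGetQuery(con_test, "SELECT COUNT(*) AS n FROM NycTrips")$n
  testthat::expect_gt(n, 0)
})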