Expanding Transportation and Socioeconomic Patterns

Expanding on the previous post’s geospatial feature extraction, this article will delve into enriching our understanding of NYC taxi zones with socioeconomic and transportation-related data from the US Census Bureau. This new layer of information will provide crucial context for analyzing take_current_trip by revealing underlying community characteristics and commuting behaviors within each zone.

Here are the steps to follow:

Importing training data with zone shapes.
Import Census data:
- Identify relevant variables.
- Download data for Manhattan, Queens and Brooklyn with geometry.
Spatially join ACS data to taxi zones:
- Perform spatial joins to aggregate census tract data to the taxi zone level.
- Address overlapping and partially contained tracts.
Extract new features:
- Calculate proportions, averages, and densities of the ACS variables for each taxi zone.
Validation:
- Integrate new features with existing data and take_current_trip.
- Define the correlation between new predictors and take_current_trip.

Setting up the environment

Loading packages

## To manage relative paths
library(here)

## To transform data that fits in RAM
library(data.table)

## Publish data sets, models, and other R objects
library(pins)
library(qs2)

## To work with spatial data
library(sf)

# To work and plot graphs
library(tidytext)
library(stringdist)
library(hashr)
library(igraph)

# Access to US Census Bureau datasets
library(tidycensus)

## To create general plots
library(ggplot2)
library(scales)

## To create table
library(gt)

## Custom functions
devtools::load_all()

# Defining the pin boards to use
BoardLocal <- board_folder(here("../NycTaxiPins/Board"))

# Loading params
Params <- yaml::read_yaml(here("params.yml"))
Params$BoroughColors <- unlist(Params$BoroughColors)

Computational Performance

The geospatial processing operations in this document are computationally intensive. Although they could benefit from parallelization with packages like future and future.apply, we’ve opted for a sequential approach with caching via qs2 to maximize reproducibility and maintain code simplicity. Intermediate results are saved after each costly operation to avoid unnecessary recalculations during iterative development.

Importing training data with zone shapes

Details on how this data was obtained can be found in the Data Collection Process.

ZonesShapes <-
  pin_read(BoardLocal, "ZonesShapes") |>
  (\(df) {
    df[df[["borough"]] %chin% names(Params$BoroughColors), c("OBJECTID", "LocationID")]
  })() |>
  st_cast(to = "MULTIPOLYGON")

SampledData <-
  pin_list(BoardLocal) |>
  grep(pattern = "^OneMonthData", value = TRUE) |>
  sort() |>
  head(12L) |>
  lapply(FUN = pin_read, board = BoardLocal) |>
  rbindlist() |>
  subset(
    select = c("trip_id", "PULocationID", "DOLocationID", "take_current_trip")
  )

Import Census data

Identify relevant variables

Based on the tidycensus documentation we have the below tables available for exploration.

ValidTables = c(
  "sf1",
  "sf2",
  "sf3",
  "sf4",
  "pl",
  "dhc",
  "dp",
  "ddhca",
  "ddhcb",
  "sdhc",
  "as",
  "gu",
  "mp",
  "vi",
  "acsse",
  "dpas",
  "dpgu",
  "dpmp",
  "dpvi",
  "dhcvi",
  "dhcgu",
  "dhcvi",
  "dhcas",
  "acs1",
  "acs3",
  "acs5",
  "acs1/profile",
  "acs3/profile",
  "acs5/profile",
  "acs1/subject",
  "acs3/subject",
  "acs5/subject",
  "acs1/cprofile",
  "acs5/cprofile",
  "sf2profile",
  "sf3profile",
  "sf4profile",
  "aian",
  "aianprofile",
  "cd110h",
  "cd110s",
  "cd110hprofile",
  "cd110sprofile",
  "sldh",
  "slds",
  "sldhprofile",
  "sldsprofile",
  "cqr",
  "cd113",
  "cd113profile",
  "cd115",
  "cd115profile",
  "cd116",
  "plnat",
  "cd118"
)

length(ValidTables)

[1] 55

That means that we have a lot a data to use, but to be able to get the list of variables available for each table we need to know the different year when each table was updated. In our particular case, we only know the last year when each table was updated from 2020 to 2020.

To solve that question, we only need to scrap the table available in https://api.census.gov/data.html and apply our particular selection to that huge table.

CensusTables <-
  rvest::read_html("https://api.census.gov/data.html") |>
  rvest::html_table() |>
  (\(x) x[[1L]])() |>
  as.data.table() |>
  subset(
    !is.na(`API Base URL`) & Vintage != "N/A",
    select = c(
      "Title",
      "Description",
      "Vintage",
      "Dataset Type",
      "Dataset Name",
      "API Base URL"
    )
  )

CensusTables[, `:=`(
  Vintage = as.integer(Vintage),
  `Dataset Name` = gsub("› ", "/", `Dataset Name`)
)]

CensusTables[, `:=`(
  survey = fcase(
    `Dataset Name` %like% "^dec",
    "dec",
    `Dataset Name` %like% "^acs",
    "acs"
  ),
  `Dataset Name` = gsub("(dec|acs)/", "", `Dataset Name`)
)]

setorder(CensusTables, -Vintage)

ValidTablesYears <-
  CensusTables[
    Vintage %between% c(2020, 2022)
  ][
    ValidTables,
    on = "Dataset Name",
    nomatch = NULL,
    unique(.SD, by = "Dataset Name"),
    .SDcols = c("Vintage", "Dataset Name", "survey")
  ]

BoardLocal |> pin_write(ValidTablesYears, "ValidTablesYears", type = "qs2")

ValidTablesYears[, head(.SD, 5), by = "survey"]

    survey Vintage Dataset Name
    <char>   <int>       <char>
 1:    dec    2020           pl
 2:    dec    2020          dhc
 3:    dec    2020           dp
 4:    dec    2020        ddhca
 5:    dec    2020        ddhcb
 6:    acs    2022        acsse
 7:    acs    2022         acs1
 8:    acs    2022         acs5
 9:    acs    2022 acs1/profile
10:    acs    2022 acs5/profile

Now that we have the years related to each table, let’s apply tidycensus::load_variables for each table and year to see the variables to explore.

get_all_data = function(x, ref_df) {
  imported_data =
    load_variables(ref_df$Vintage[x], ref_df$`Dataset Name`[x])

  setDT(imported_data)

  imported_data[, `:=`(
    year = ref_df$Vintage[x],
    dataset = ref_df$`Dataset Name`[x],
    survey = ref_df$survey[x]
  )]

  return(imported_data)
}

ValidTablesLastYear <-
  lapply(
    seq_along(ValidTablesYears$`Dataset Name`),
    get_all_data,
    ref_df = ValidTablesYears
  ) |>
  rbindlist(use.names = TRUE, fill = TRUE)

BoardLocal |> pin_write(ValidTablesLastYear, "ValidTablesLastYear", type = "qs2")

nrow(ValidTablesLastYear) |> comma()

[1] "159,721"

After making that exploration we can see that we have almost 160K variables to explore.

To understand what have we have let’s plot the number of variables by dataset and survey for each year.

Show the code

VariablesPerDataset <-
  ValidTablesLastYear[, .N, .(year = factor(year), dataset, survey)][,
    dataset := reorder(dataset, N, FUN = sum)
  ]

ggplot(VariablesPerDataset) +
  geom_col(aes(N, dataset, fill = year), color = "black", linewidth = 0.2) +
  scale_fill_manual(values = c("2022" = Params$ColorHighlight, "2020" = Params$ColorGray)) +
  scale_x_continuous(
    labels = scales::label_comma(scale = 1 / 1000, suffix = "k")
  ) +
  expand_limits(x = 1e4) +
  labs(x = "Number of variables", fill = "Last update year") +
  facet_wrap(vars(survey), scale = "free") +
  theme_minimal(base_family = Params$BaseFontFamily) +
  theme(
    panel.grid.minor = element_blank(),
    panel.grid.major.y = element_blank()
  )

It’s clear that there is one way we can read the variables one by one and hypothesize the variables to select. Instead, we are going to flow some steps to reduce the number of variable and the classify them in to groups to select.

Just by looking to the raw data, I found that many variables measure the same thing for particular groups based on:

Race
Place of Birth
Sex

VariablesToIgnore <- c(
  "Hispanic",
  "Latino",
  "American Indian",
  "Alaska Native",
  "White",
  "Black",
  "African American",
  "Asian",
  "Native Hawaiian",
  "Pacific Islander",
  "Races?",

  "Ancestry",
  "Nativity",
  "Citizenship Status",
  "Place of Birth",
  "Naturalization",
  "Current Residence",

  "Male",
  "Female",
  "Women",
  "Men"
)

As we will join the data based on geography it doesn’t make sense to have predictors based on the situations of those particular groups as we won’t have does details of passengers of the trips.

ConceptsToExplore <-
  ValidTablesLastYear[
    !concept %ilike% paste(VariablesToIgnore, collapse = "|") &
      !label %ilike% "Male|Female",
    .(concept = unique(concept))
  ]

nrow(ConceptsToExplore)

[1] 1270

Another aspect we found out by exploring the data it’s that some variables names are really long and the long the concept definition the more specific the variable and less useful for our model. For example, below we can see the variable with the longer number of characters:

GRANDPARENTS LIVING WITH OWN GRANDCHILDREN UNDER 18 YEARS BY RESPONSIBILITY FOR OWN GRANDCHILDREN BY LENGTH OF TIME RESPONSIBLE FOR GRANDCHILDREN FOR THE POPULATION 30 YEARS AND OVER IN HOUSEHOLDS (EXCLUDING PEOPLE IN MILITARY HOUSING UNITS)

Population Measured:
- Grandparents living with own grandchildren under 18 years
- 30 years and over
- In households (excluding military housing units)
Data Categories:
- By responsibility for own grandchildren
- By length of time responsible for grandchildren

By exploring the cumulative distribution we can see that near 50% of the variable present less 70 characters per variable which seems the better group to focus the exploration

Show the code

ConceptsToExplore[, concept_length := nchar(concept)]

ggplot(ConceptsToExplore) +
  stat_ecdf(aes(concept_length)) +
  scale_x_continuous(breaks = breaks_width(10)) +
  scale_y_continuous(labels = label_percent(accuracy = 1)) +
  labs(
    title = "Empirical Cumulative Distribution",
    subtitle = "Number of Character per Variable",
    x = "Number of Characters",
    y = "Cumulative Proportion"
  ) +
  theme_minimal(base_family = Params$BaseFontFamily) +
  theme(panel.grid.minor = element_blank())

After applying that change we only have 675 pending variables to explore the classify.

ConceptsToExploreMaxConceptLength <- 70

ConceptsToExploreValidConceptLength <-
  ConceptsToExplore[concept_length <= ConceptsToExploreMaxConceptLength]

ConceptsToExploreValidConceptLength[, .N]

[1] 675

To confirm if that action make sense we can print the longer variables and confirm that aren’t too specific.

GROUP QUARTERS POPULATION IN NONINSTITUTIONAL FACILITIES BY SEX BY AGE
Class of Worker for the Civilian Employed Population 16 Years and Over
AGGREGATE INCOME DEFICIT IN 2019 (DOLLARS) FOR FAMILIES BY FAMILY TYPE
AGGREGATE NUMBER OF ROOMS BY TENURE (EXCLUDING MILITARY HOUSING UNITS)
Sex by Industry for the Civilian Employed Population 16 Years and Over
Poverty Status in the Past 12 Months of Individuals by Work Experience

Now we can calculate the pairwise Jaccard string distances of each of the variables to identify conceptually similar terms based on similar words to then identify variable clusters to select.

Create word-level hash tokens (not character-level) after case normalization.

ConceptsToExploreHash <-
  tolower(ConceptsToExploreValidConceptLength$concept) |>
  strsplit("\\s+") |>
  hash()

Compute pairwise Jaccard distances between hashed word representations.

ConceptsToExploreWordDistance <-
  seq_distmatrix(
    ConceptsToExploreHash,
    ConceptsToExploreHash,
    method = "jaccard",
    q = 1
  )

Assign concept names to matrix dimensions for interpretability.

rownames(
  ConceptsToExploreWordDistance
) <- ConceptsToExploreValidConceptLength$concept
colnames(
  ConceptsToExploreWordDistance
) <- ConceptsToExploreValidConceptLength$concept

Transform symmetric matrix to long format, eliminating redundant pairwise comparisons.

ConceptsToExploreWordDistance[upper.tri(
  ConceptsToExploreWordDistance,
  diag = TRUE
)] <- NA

ConceptsToExploreWordDistanceDt <-
  as.data.table(
    ConceptsToExploreWordDistance,
    keep.rownames = "Variable Name 1"
  ) |>
  melt(
    id.vars = "Variable Name 1",
    variable.name = "Variable Name 2",
    variable.factor = FALSE,
    value.name = "Jaccard Distance",
    na.rm = TRUE
  )

With this inter-concept distance we can use the edge_betweenness argoritm to define the cluster to use. But as the result for this method will change depending the min Jaccard Distance selected to do the analysis, we need to explore several options before taking a final decision.

define_clusters_by_distance <- function(min_jaccard_distance, dt) {
  dt_filtered = dt[`Jaccard Distance` <= min_jaccard_distance]

  temp_graph =
    graph_from_data_frame(dt_filtered, directed = FALSE)

  temp_cluster = cluster_edge_betweenness(temp_graph)

  node_results = data.table(
    node = V(temp_graph)$name,
    cluster = temp_cluster$membership,
    `Jaccard Distance Threshold` = min_jaccard_distance,
    modularity = max(temp_cluster$modularity),
    n_clusters = max(temp_cluster$membership)
  )

  return(node_results)
}


ConceptsToExploreByDistance <-
  lapply(
    seq(0.1, 0.8, by = 0.10),
    FUN = define_clusters_by_distance,
    dt = ConceptsToExploreWordDistanceDt
  ) |>
  rbindlist()

BoardLocal |> pin_write(ConceptsToExploreByDistance, "ConceptsToExploreByDistance", type = "qs2")

After exploring several configurations we can see that that the as we expand the limit of Jaccard Distance the modularity decrease, but the number of clusters initially increases (peaking around 0.3) and then steadily decreases until 0.7, after which it increases sharply again.

Show the code

ConceptsToExploreByDistance[,
  unique(.SD),
  .SDcols = c("Jaccard Distance Threshold", "modularity", "n_clusters")
] |>
  melt(
    id.vars = "Jaccard Distance Threshold",
    variable.name = "cluster_metric"
  ) |>
  ggplot(aes(`Jaccard Distance Threshold`, value)) +
  geom_line(linewidth = 1, color = "gray40") +
  geom_point(size = 3, color = "gray40") +
  geom_vline(
    xintercept = 0.5,
    linetype = "dashed",
    color = Params$ColorHighlight,
    linewidth = 0.9
  ) +
  # Add the annotation here
  annotate(
    "text",
    x = 0.5,
    y = Inf,
    label = "0.5 Threshold",
    color = Params$ColorHighlight,
    vjust = 1.2,
    hjust = -0.05,
    size = 4
  ) +
  scale_y_continuous(
    labels = \(x) {
      fifelse(x <= 1 & x > 0, label_percent()(x), label_comma(accuracy = 1)(x))
    },
    breaks = breaks_pretty(7)
  ) +
  scale_x_continuous(breaks = breaks_width(0.10)) +
  expand_limits(y = 0) +
  facet_wrap(
    vars(cluster_metric),
    scale = "free_y",
    ncol = 1L,
    strip.position = "top"
  ) +
  labs(
    title = "Modularity and Number of Clusters vs. Jaccard Distance Threshold",
    subtitle = "Assessing the Trade-off: High Modularity and Low Number of Clusters",
    x = "Jaccard Distance Threshold",
    y = "Value"
  ) +
  theme_minimal(base_family = Params$BaseFontFamily) +
  theme(
    panel.grid.minor = element_blank(),
    legend.position = "none",
    plot.title = element_text(face = "bold"),
    plot.subtitle = element_text(color = "gray25"),
    strip.text = element_text(size = 12, face = "bold"),
    axis.text = element_text(size = 10)
  )

Based on this information the best threshold is 0.5 Jaccard Distance as the modularity is higher than 0.8 and the algorithm found less than 60 clusters. After than point the slope to reduce the number of clusters gets lower as concurrence, we don’t see worthy to keep increasing the threshold.

ConceptsToExploreClusters <-
  ConceptsToExploreByDistance[.(0.5), on = "Jaccard Distance Threshold"]

To understand what kind of variables we have in each cluster lets use the statistic tf-idf to list the top 5 words (after removing stop words) of each cluster and use that reference to select the must meaningful ones for this context.

CustomStopWords <- c(
  stop_words[stop_words$lexicon == "snowball", c("word")][[1]],
  "estimate",
  "estimates",
  "percent",
  "months",
  "past",
  "detailed",
  "type",
  "types",
  "current",
  "adjusted",
  "united",
  "round",
  "selected",
  "median",
  "level",
  NA_character_
)

TopWordsByCluster <-
  ConceptsToExploreClusters[, .(
    cluster,
    concept_lower = tolower(node)
  )][, unnest_tokens(.SD, word, concept_lower)][
    !word %chin% CustomStopWords & !word %like% "^\\d+$",
    .N,
    by = c("cluster", "word")
  ][, dataset_total := sum(N), by = "cluster"][, bind_tf_idf(
    .SD,
    word,
    cluster,
    N
  )][order(-tf_idf)][,
    .(
      `Top Words` = head(word, 5L) |>
        stringr::str_to_sentence() |>
        paste0(collapse = ", ")
    ),
    keyby = "cluster"
  ][
    ConceptsToExploreClusters,
    on = "cluster",
    j = .(`Node Count` = .N),
    by = c("Top Words", "cluster")
  ][
    order(-`Node Count`, cluster),
    .(`Cluster ID` = cluster, `Top Words`, `Node Count`)
  ]

Show the code

gt(TopWordsByCluster) |>
  # Apply visual styling
  tab_header(
    title = "Key Variable Clusters from Census Data Analysis",
    subtitle = "Top representative words and cluster sizes for socioeconomic & transportation patterns"
  ) |>
  tab_style(
    style = cell_text(color = "white", weight = "bold"),
    locations = cells_title()
  ) |>
  tab_style(
    style = cell_fill(color = "#2c3e50"),
    locations = cells_title()
  ) |>
  data_color(
    columns = `Node Count`,
    fn = scales::col_numeric(
      palette = c("white", Params$ColorHighlightLow, Params$ColorHighlight),
      domain = c(0, max(TopWordsByCluster$`Node Count`))
    )
  ) |>
  fmt_number(
    columns = `Node Count`,
    decimals = 0
  ) |>
  cols_align(
    align = "left",
    columns = `Top Words`
  ) |>
  cols_label(
    `Cluster ID` = "CLUSTER",
    `Top Words` = "TOP REPRESENTATIVE TERMS",
    `Node Count` = "# VARIABLES"
  ) |>
  tab_options(
    table.font.names = "Arial",
    table.background.color = "white",
    column_labels.font.weight = "bold",
    column_labels.background.color = "#f8f9fa",
    row_group.background.color = "#f1f3f5",
    #row.striping.include_table_body = TRUE,
    quarto.disable_processing = TRUE
  ) |>
  tab_source_note(
    source_note = "Clusters identified through NLP analysis of US Census variable concepts"
  )

CLUSTER	TOP REPRESENTATIVE TERMS	# VARIABLES
Key Variable Clusters from Census Data Analysis
Top representative words and cluster sizes for socioeconomic & transportation patterns
23	Excluding, Military, Units, Housing, Water	81
4	Structure, Householder, Built, Tenure, Year	55
12	Health, Insurance, Age, Sex, Coverage	52
21	Rent, Quartile, Dollars, Contract, Gross	39
2	Quarters, Group, Population, Age, Sex	36
10	Status, Mortgage, Vacancy, Owner, Monthly	32
13	Population, Years, Civilian, Employed, Age	32
18	Work, Transportation, Means, Time, Workplace	32
20	Poverty, Status, Ratio, Families, Household	23
3	Household, Size, Including, Categories, Alone	20
28	Number, Rooms, Tenure, Available, Vehicles	19
36	Households, Income, Earnings, Dollars, Security	19
6	Income, Dollars, Household, Nonfamily, Family	18
7	Children, Presence, Years, Family, Age	18
25	Characteristics, Social, Comparative, Puerto, Rico	17
26	Allocation, Rooms, Relationship, Bedrooms, Value	16
5	Moved, Unit, Householder, Year, Tenure	13
31	Available, Vehicles, Allocation, Telephone, Service	6
32	Internet, Household, Subscription, Access, Computer	6
44	Occupied, Housing, Facilities, Units, Plumbing	6
24	Subscriptions, Internet, Presence, Household, Computers	5
1	Occupancy, Status, Bedrooms, Structure	4
9	Tenure, Universe, Housing	4
15	Coupled, Households	4
46	Monthly, Costs, Electricity, Gas, Housing	4
49	Years, Population, States, Puerto, Rico	4
11	Total, Occupied, Population, Units, Housing	3
14	Multigenerational, Presence, Households, Nonrelatives	3
22	Value, Housing	3
30	Percentage, Gross, Rent, Income, Household	3
37	Subfamilies, Subfamily, Population, Relationship	3
43	House, Heating, Fuel, Tenure, Allocation	3
45	Meals, Included, Rent, Householder, Kitchen	3
8	Units, Housing	2
16	Average, Occupied, Size, Units, Housing	2
17	Geographical, Mobility, Year, States, Puerto	2
19	Language, Household	2
27	Rooms	2
29	Bedrooms	2
33	Reason, Moving, People, Primary, Households	2
34	Stamp, Benefits, Food, Snap, Households	2
35	Years, Workers, Vehicle, Households, Private	2
38	Limits, Quintile, Upper, Income, Household	2
39	Quintiles, Mean, Income, Household	2
40	Shares, Quintile, Income, Aggregate, Household	2
41	Gini, Index, Inequality, Income	2
42	Place, Work, Workers, State, County	2
47	Computers, Household	2
48	Overall, Characteristic, Rate, Person, Unit	2
50	Spoken, Home, Language, Characteristics, People	2
51	Financial, Mortgage, Characteristics, Without, Units	2
Clusters identified through NLP analysis of US Census variable concepts

Based on the analysis, the following clusters are selected for inclusion due to their direct relevance to transportation behavior, income patterns, and population density - key factors influencing taxi demand (take_current_trip):

Cluster ID	Top Words	Node Count	Reason for Inclusion
18	Work, Transportation, Means, Time, Workplace	32	Direct commuting behavior - primary driver of taxi demand patterns
13	Population, Years, Civilian, Employed, Age	32	Employment density - correlates with business/event-related trips
36	Households, Income, Earnings, Dollars, Security	19	Income capacity - strongest predictor of taxi affordability and usage frequency

ConceptsToExploreCleanConcept <-
  ConceptsToExploreClusters[
    .(c(18, 13, 36)),
    on = "cluster",
    .(concept = stringr::str_to_sentence(node), cluster)
  ]

ConceptByDataset <-
  ValidTablesLastYear[, .(
    concept = stringr::str_to_sentence(concept),
    year,
    dataset,
    survey
  )][, unique(.SD)]


ConceptByDatasetToExplore <-
  ConceptByDataset[ConceptsToExploreCleanConcept, on = "concept"]

Checking the data we can see that some concept are duplicated in some in some datasets, as we can see in the next table.

Show the code

ConceptByDatasetToExplore[,
  .(N = .N, dataset = sort(dataset) |> paste(collapse = ", ")),
  by = "concept"
][N > 1, .(variables = .N), keyby = "dataset"] |>

  gt() |>

  # Header
  tab_header(
    title = md("**Dataset Variable Analysis**"),
    subtitle = "Distribution of shared variables across survey datasets"
  ) |>

  # Col Names
  cols_label(
    dataset = "Dataset Collection",
    variables = "Number of Variables"
  ) |>

  # Number Format
  fmt_number(columns = variables, decimals = 0) |>

  # Colors
  data_color(
    columns = variables,
    colors = scales::col_numeric(
      palette = c("white", Params$ColorHighlightLow, Params$ColorHighlight),
      domain = NULL
    )
  ) |>
  tab_style(
    style = cell_text(color = "white", weight = "bold"),
    locations = cells_title()
  ) |>
  tab_style(
    style = cell_fill(color = "#2c3e50"),
    locations = cells_title()
  ) |>

  # Options
  tab_options(
    table.font.names = "Arial",
    table.background.color = "white",
    column_labels.font.weight = "bold",
    column_labels.background.color = "#f8f9fa",
    row_group.background.color = "#f1f3f5",
    #row.striping.include_table_body = TRUE,
    quarto.disable_processing = TRUE
  )

Dataset Collection	Number of Variables
Dataset Variable Analysis
Distribution of shared variables across survey datasets
acs1, acs5	48
acs1, acs5, acsse	3
acs1/subject, acs5/subject	5
cd118, dhc	2
cd118, dhc, dhcas, dhcgu, dhcvi	3
dhcas, dhcgu, dhcvi	2
dhcas, dhcgu, dhcvi, sdhc	1
dhcas, dhcvi	11

We can see that we have 2 main groups of cases:

Variables shared between acs1 (subject) and acs5 (subject).
Variables shared between datasets of the decennial US Census.

By reading about the differences between each datasets, we are going to focus in the variables available for the next datasets:

acs5 and acs5/subject
- We are going to train the model based on one year data but we want to be able to find patterns that remain for future years as our testing data have trips for 2025 we expect that the model keep working after that year.
dhc
- We are exploring only NYC so won’t need information from indepent island and as result the information provided by those datasets won’t be relevant for our training goal.

ConceptByValidDatasetToExplore <-
  ConceptByDatasetToExplore[
    c("acs5", "acs5/subject", "dhc"),
    on = "dataset",
    .(dataset, concept, cluster)
  ]

Now we are ready to identify the particular codes (names) related to each variable concept. In this step are going to exclude variables that:

Only provide information based on sex with the exception of B08013_001 which provide information about “travel time to work (in minutes) of workers”.
Don’t have information at tract level of geography.

ValidTablesLastYear[, concept := stringr::str_to_sentence(concept)]

VariablesToUse <-
  ValidTablesLastYear[
    ConceptByValidDatasetToExplore,
    on = c("dataset", "concept")

    # Only provide information based on sex
  ][!label %ilike% "Male|Female"][, concept_count := .N, by = "concept"][
    name == "B08013_001" | concept_count > 2L

    # Don't have information at tract
  ][fcoalesce(geography, "") != "county"]

VariablesToUse[, .(N = comma(.N)), by = c("survey", "year")]

   survey  year      N
   <char> <int> <char>
1:    acs  2022  1,382
2:    dec  2020     51

Download data for Manhattan, Queens and Brooklyn with geometry

After defining the variables to use we only need to define the queries to get the data from the API with {tidycensus} for the American Community Survey (ACS) and Decennial US Census (DEC).

Note

As we are getting also the geometry information is much efficient to change the output argument default from “tidy” to “wide” to avoid having a list column repeating the same geometries over and over after applying the function tidyr::gather.

This will make the spatial join independent to the number of variables to query.

AcsVariableTracts <-
  get_acs(
    geography = "tract",
    variables = VariablesToUse["acs", on = "survey", name],
    year = 2022,
    state = "NY",
    # Manhattan, Brooklyn, Queens
    county = c("New York", "Kings", "Queens"),
    geometry = TRUE,
    survey = "acs5",
    output = "wide"
  ) |>
  st_transform(crs = st_crs(ZonesShapes)$epsg)

DecVariableTracts <-
  get_decennial(
    geography = "tract",
    variables = VariablesToUse["dec", on = "survey", name],
    sumfile = "dhc",
    year = 2020,
    state = "NY",
    # Manhattan, Brooklyn, Queens
    county = c("New York", "Kings", "Queens"),
    geometry = TRUE,
    output = "wide"
  ) |>
  st_transform(crs = st_crs(ZonesShapes)$epsg)

BoardLocal |> pin_write(AcsVariableTracts, "AcsVariableTracts", type = "qs2")
BoardLocal |> pin_write(DecVariableTracts, "DecVariableTracts", type = "qs2")

To it’s the perfect time to join all variables into one table.

ConsolidatedVariableTracts <-
  dplyr::full_join(
    AcsVariableTracts,
    st_drop_geometry(DecVariableTracts),
    by = c("GEOID", "NAME")
  )

dim(ConsolidatedVariableTracts)

[1] 1840 2777

dim(AcsVariableTracts)

[1] 1840 2726

dim(DecVariableTracts)

[1] 1840   54

Spatial joins

We need to join to the original zones of our training set in to 2 steps:

Join the tracts full contained within the original zone shapes.
Join the tracts that overlaps in more than one zone shape.

ConsolidatedVariableTractsJoinedList <-
  lapply(
    list(within = st_within, overlaps = st_overlaps),
    FUN = st_join,
    x = ConsolidatedVariableTracts,
    y = ZonesShapes,
    left = FALSE
  )

Now we need to go over each of the original zone shapes and cut the tracts parts that are not contained on each zone and consolidate the within and overlaps elements of TravelTimeToWorkTractJoinedList into one data frame.

apply_intersection_by_id <- function(features, shapes, id_col) {
  intersection_by_id = function(id, features, shapes, id_col) {
    features_sub = features[features[[id_col]] == id, ]
    shapes_sub_geo = shapes[shapes[[id_col]] == id, ] |> st_geometry()

    new_df =
      st_intersection(features_sub, shapes_sub_geo) |>
      st_cast(to = "MULTIPOLYGON")

    return(new_df)
  }

  new_df =
    lapply(
      unique(features[[id_col]]),
      FUN = intersection_by_id,
      features = features,
      shapes = shapes,
      id_col = id_col
    ) |>
    do.call(what = "rbind")

  return(new_df)
}

ConsolidatedVariableTractsJoinedList$overlaps <-
  apply_intersection_by_id(
    ConsolidatedVariableTractsJoinedList$overlaps,
    ZonesShapes,
    id_col = "OBJECTID"
  )

ConsolidatedVariableTractsJoined <-
  do.call(what = "rbind", ConsolidatedVariableTractsJoinedList)

BoardLocal |> pin_write(
  ConsolidatedVariableTractsJoined,
  name = "ConsolidatedVariableTractsJoined",
  type = "qs2",
  title = "Consolidated Variable Tracts Joined"
)

Perfect, we have joined the to shapes, but we still need to have only one estimate value per original zone shape. To solve this challenge we only need to use the weighted arithmetic mean based on the proportion of the area add each tract to the original zone id.

ConsolidatedVariableTractsJoined$tract_area <-
  st_area(ConsolidatedVariableTractsJoined$geometry)

ConsolidatedVariableTractsJoinedDt <-
  st_drop_geometry(ConsolidatedVariableTractsJoined) |>
  setDT()

IdColumns <-
  c("GEOID", "NAME", "tract_area", "LocationID", "OBJECTID")

ConsolidatedVariableTidy <-
  melt(
    ConsolidatedVariableTractsJoinedDt,
    id.var = IdColumns,
    variable.factor = FALSE
  )[, `:=`(
    variable_type = fifelse(variable %like% "\\dM$", "moe", "estimate"),
    variable = sub("(E|M)$", "", variable)
  )][, dcast(
    .SD,
    formula = as.formula(paste0(
      paste0(c(IdColumns, "variable"), collapse = " + "),
      "~ variable_type"
    )),
    value.var = "value"
  )][, `:=`(
    NAME = NULL,
    OBJECTID = NULL,
    moe = NULL,
    estimate_low = estimate - fcoalesce(moe, 0),
    estimate_high = estimate + fcoalesce(moe, 0)
  )][estimate_low < 0, estimate_low := 0][,
    area_prop := as.double(tract_area / sum(tract_area, na.rm = TRUE)),
    by = c("LocationID", "variable")
  ]


ConsolidatedVariableTractsByZoneId <-
  ConsolidatedVariableTidy[
    j = .(
      estimate = sum(estimate * area_prop, na.rm = TRUE),
      estimate_low = sum(estimate_low * area_prop, na.rm = TRUE),
      estimate_high = sum(estimate_high * area_prop, na.rm = TRUE)
    ),
    by = c("LocationID", "variable")
  ][, melt(
    .SD,
    measure.vars = c("estimate", "estimate_low", "estimate_high"),
    variable.factor = FALSE,
    variable.name = "variable_type"
  )][, dcast(.SD, LocationID ~ variable + variable_type, value.var = "value")]


BoardLocal |> pin_write(
  ConsolidatedVariableTractsByZoneId,
  name = "ConsolidatedVariableTractsByZoneId",
  type = "qs2",
  title = "Consolidated Variable Tracts By Zone Id"
)

dim(ConsolidatedVariableTractsByZoneId)

[1]  196 4300

Validation

PuVariableTractsMeanDiff <-
  ConsolidatedVariableTractsByZoneId[
    SampledData[, .(LocationID = PULocationID, take_current_trip)],
    on = "LocationID"
  ][,
    lapply(.SD, \(x) {
      if (uniqueN(x) == 1L) {
        x[1L]
      } else {
        x[!is.na(x)] |> scale(center = FALSE) |> median()
      }
    }),
    .SDcols = !c("LocationID"),
    by = "take_current_trip"
  ][, melt(.SD, id.vars = "take_current_trip", variable.factor = FALSE)][,
    .(not_take_vs_take = diff(value)),
    by = "variable"
  ][order(-abs(not_take_vs_take))]


DoVariableTractsMeanDiff <-
  ConsolidatedVariableTractsByZoneId[
    SampledData[, .(LocationID = DOLocationID, take_current_trip)],
    on = "LocationID"
  ][,
    lapply(.SD, \(x) {
      if (uniqueN(x) == 1L) {
        x[1L]
      } else {
        x[!is.na(x)] |> scale(center = FALSE) |> median()
      }
    }),
    .SDcols = !c("LocationID"),
    by = "take_current_trip"
  ][, melt(.SD, id.vars = "take_current_trip", variable.factor = FALSE)][,
    .(not_take_vs_take = diff(value)),
    by = "variable"
  ][order(-abs(not_take_vs_take))]

BoardLocal |> pin_write(
  PuVariableTractsMeanDiff,
  name = "PuVariableTractsMeanDiff",
  type = "qs2",
  title = "PU Variable Tracts Mean Diff"
)

BoardLocal |> pin_write(
  DoVariableTractsMeanDiff,
  name = "DoVariableTractsMeanDiff",
  type = "qs2",
  title = "DO Variable Tracts Mean Diff"
)

VariableTractsMeanDiff <-
  rbindlist(list(PuVariableTractsMeanDiff, DoVariableTractsMeanDiff))[
    !is.na(not_take_vs_take)
  ]

Show the code

ggplot(VariableTractsMeanDiff) +
  geom_histogram(aes(not_take_vs_take)) +
  scale_y_continuous(
    transform = scales::new_transform(
      "signed_log10",
      transform = function(x) sign(x) * log10(1 + abs(x)),
      inverse = function(x) sign(x) * (10^(abs(x)) - 1),
      breaks = scales::pretty_breaks()
    ),
    breaks = c(0, 10^(1:4))
  ) +
  scale_x_continuous(breaks = breaks_width(0.1)) +
  labs(y = "count on log10 scale") +
  theme_minimal(base_family = Params$BaseFontFamily)

TopVariableTractsMeanDiff <- VariableTractsMeanDiff[
  abs(not_take_vs_take) > 0.30,
  unique(variable)
]

Show the code

# Enhanced contest-winning gt table with improved styling
VariablesToUse[
  .(sub("_estimate(_(low|high))*", "", TopVariableTractsMeanDiff) |> unique()),
  on = "name",
  c("dataset", "concept", "label", "name")
][
  order(concept, name)
  # Add improved row groups with better categorization
][,
  category := fcase(
    grepl("Earnings|Income|Poverty", concept, ignore.case = TRUE),
    "💰 Economic Indicators",
    grepl("Industry|Occupation|Employment", concept, ignore.case = TRUE),
    "🏭 Employment & Industry",
    grepl("Transportation|Commuting|Vehicle", concept, ignore.case = TRUE),
    "🚗 Transportation & Mobility",
    default = "📊 Other Variables"
  )
] |>

  gt(groupname_col = "category") |>

  # Enhanced header with better typography
  tab_header(
    title = md("**🏆 Key Census Variables Analysis**"),
    subtitle = md(
      "*Critical demographic and socioeconomic indicators across geographic clusters*"
    )
  ) |>

  # Improved column headers with icons and better descriptions
  cols_label(
    dataset = md("**📋 Dataset**"),
    concept = md("**🎯 Concept Category**"),
    label = md("**📝 Variable Description**"),
    name = md("**🔖 Variable ID**")
  ) |>

  # Hide the grouping column
  cols_hide(category) |>

  # Enhanced header styling with gradient effect
  tab_style(
    style = list(
      cell_fill(color = Params$ColorHighlight),
      cell_text(color = "white", weight = "bold", size = px(13)),
      cell_borders(sides = "bottom", color = "white", weight = px(2))
    ),
    locations = cells_column_labels()
  ) |>

  # Stunning group headers with enhanced styling
  tab_style(
    style = list(
      cell_fill(color = Params$ColorHighlightLow),
      cell_text(
        weight = "bold",
        size = px(15),
        color = "#2c3e50"
      ),
      cell_borders(
        sides = c("top", "bottom"),
        color = Params$ColorHighlight,
        weight = px(2)
      )
    ),
    locations = cells_row_groups()
  ) |>

  # Enhanced title styling
  tab_style(
    style = list(
      cell_text(
        color = "white",
        weight = "bold",
        size = px(22)
      ),
      cell_fill(color = "#2c3e50")
    ),
    locations = cells_title(groups = "title")
  ) |>

  tab_style(
    style = list(
      cell_text(
        color = "#ecf0f1",
        style = "italic",
        size = px(16)
      ),
      cell_fill(color = "#34495e")
    ),
    locations = cells_title(groups = "subtitle")
  ) |>

  # Variable ID styling with better monospace presentation
  tab_style(
    style = list(
      cell_text(
        font = "Consolas, Monaco, monospace",
        size = px(11)
      )
    ),
    locations = cells_body(columns = name)
  ) |>

  # Dataset column styling
  tab_style(
    style = cell_text(
      weight = "bold",
      size = px(11)
    ),
    locations = cells_body(columns = dataset)
  ) |>

  # Concept column styling
  tab_style(
    style = cell_text(
      size = px(11),
      weight = "bold"
    ),
    locations = cells_body(columns = concept)
  ) |>

  # Label column with improved readability
  tab_style(
    style = cell_text(
      size = px(11),
      color = "#2c3e50"
    ),
    locations = cells_body(columns = label)
  ) |>

  # Optimized column widths for better content display
  cols_width(
    dataset ~ px(100),
    concept ~ px(180),
    label ~ px(350),
    name ~ px(140)
  ) |>

  # Comprehensive table options for professional appearance
  tab_options(
    # Typography
    table.font.names = c("Segoe UI", "Arial", "sans-serif"),
    table.font.size = px(12),
    heading.title.font.size = px(22),
    heading.subtitle.font.size = px(16),
    row_group.font.size = px(15),

    # Background and colors
    table.background.color = "#fdfdfd",

    # Borders and lines
    table.border.top.style = "solid",
    table.border.top.width = px(4),
    table.border.top.color = Params$ColorHighlight,
    table.border.bottom.style = "solid",
    table.border.bottom.width = px(4),
    table.border.bottom.color = Params$ColorHighlight,

    # Column labels
    column_labels.border.bottom.style = "solid",
    column_labels.border.bottom.width = px(3),
    column_labels.border.bottom.color = Params$ColorHighlight,
    column_labels.background.color = Params$ColorHighlight,

    # Row groups
    row_group.border.top.style = "solid",
    row_group.border.top.width = px(2),
    row_group.border.top.color = Params$ColorHighlight,
    row_group.border.bottom.style = "solid",
    row_group.border.bottom.width = px(1),
    row_group.border.bottom.color = Params$ColorHighlightLow,

    # Striping and spacing
    #row.striping.include_table_body = TRUE,
    #row.striping.background_color = "#f8f9fa",
    table_body.hlines.style = "solid",
    table_body.hlines.width = px(1),
    table_body.hlines.color = "#dee2e6",

    # Layout
    # table.width = pct(100),
    # container.overflow.x = TRUE,
    # container.overflow.y = TRUE,
    # container.height = px(600),

    quarto.disable_processing = TRUE
  ) |>

  # Enhanced source note with better formatting
  tab_source_note(
    source_note = md(
      "**📊 Data Source:** U.S. Census Bureau, American Community Survey (ACS) 5-Year Estimates | **📅 Analysis Period:** 2018-2022"
    )
  ) |>

  # Style the source note
  tab_style(
    style = list(
      cell_text(
        size = px(12),
        color = "#7f8c8d",
        style = "italic"
      )
    ),
    locations = cells_source_notes()
  )

📋 Dataset	🎯 Concept Category	📝 Variable Description	🔖 Variable ID
🏆 Key Census Variables Analysis
Critical demographic and socioeconomic indicators across geographic clusters
💰 Economic Indicators
acs5/subject	Earnings in the past 12 months (in 2022 inflation-adjusted dollars)	Estimate!!Percent!!Population 16 years and over with earnings!!FULL-TIME, YEAR-ROUND WORKERS WITH EARNINGS!!$25,000 to $34,999	S2001_C02_007
acs5/subject	Earnings in the past 12 months (in 2022 inflation-adjusted dollars)	Estimate!!Percent!!Population 16 years and over with earnings!!FULL-TIME, YEAR-ROUND WORKERS WITH EARNINGS!!$100,000 or more	S2001_C02_012
acs5/subject	Income in the past 12 months (in 2022 inflation-adjusted dollars)	Estimate!!Families!!Total!!$35,000 to $49,999	S1901_C02_006
acs5/subject	Income in the past 12 months (in 2022 inflation-adjusted dollars)	Estimate!!Families!!Total!!$50,000 to $74,999	S1901_C02_007
acs5/subject	Income in the past 12 months (in 2022 inflation-adjusted dollars)	Estimate!!Families!!Total!!$200,000 or more	S1901_C02_011
acs5/subject	Income in the past 12 months (in 2022 inflation-adjusted dollars)	Estimate!!Married-couple families!!Total!!$200,000 or more	S1901_C03_011
acs5/subject	Income in the past 12 months (in 2022 inflation-adjusted dollars)	Estimate!!Nonfamily households!!Total!!$200,000 or more	S1901_C04_011
🏭 Employment & Industry
acs5/subject	Industry by sex for the civilian employed population 16 years and over	Estimate!!Total!!Civilian employed population 16 years and over!!Transportation and warehousing, and utilities:!!Transportation and warehousing	S2403_C01_010
🚗 Transportation & Mobility
acs5	Means of transportation to work by age	Estimate!!Total:!!Car, truck, or van - carpooled:!!55 to 59 years	B08101_022
acs5/subject	Means of transportation to work by selected characteristics	Estimate!!Total!!Workers 16 years and over!!INDUSTRY!!Information and finance and insurance, and real estate and rental and leasing	S0802_C01_055
acs5/subject	Means of transportation to work by selected characteristics	Estimate!!Total!!Workers 16 years and over!!PLACE OF WORK!!Worked in state of residence!!Worked outside county of residence	S0802_C01_068
acs5/subject	Means of transportation to work by selected characteristics	Estimate!!Car, truck, or van -- drove alone!!EARNINGS IN THE PAST 12 MONTHS (IN 2022 INFLATION-ADJUSTED DOLLARS) FOR WORKERS!!Workers 16 years and over with earnings!!$35,000 to $49,999	S0802_C02_033
acs5/subject	Means of transportation to work by selected characteristics	Estimate!!Car, truck, or van -- drove alone!!Workers 16 years and over!!OCCUPATION!!Natural resources, construction, and maintenance occupations	S0802_C02_046
acs5/subject	Means of transportation to work by selected characteristics	Estimate!!Car, truck, or van -- drove alone!!Workers 16 years and over who did not work from home!!TIME OF DEPARTURE TO GO TO WORK!!5:30 a.m. to 5:59 a.m.	S0802_C02_073
acs5/subject	Means of transportation to work by selected characteristics	Estimate!!Car, truck, or van -- carpooled!!Workers 16 years and over!!AGE!!45 to 54 years	S0802_C03_005
acs5/subject	Means of transportation to work by selected characteristics	Estimate!!Car, truck, or van -- carpooled!!Workers 16 years and over!!AGE!!55 to 59 years	S0802_C03_006
acs5/subject	Means of transportation to work by selected characteristics	Estimate!!Car, truck, or van -- carpooled!!EARNINGS IN THE PAST 12 MONTHS (IN 2022 INFLATION-ADJUSTED DOLLARS) FOR WORKERS!!Workers 16 years and over with earnings!!$15,000 to $24,999	S0802_C03_031
acs5/subject	Means of transportation to work by selected characteristics	Estimate!!Car, truck, or van -- carpooled!!EARNINGS IN THE PAST 12 MONTHS (IN 2022 INFLATION-ADJUSTED DOLLARS) FOR WORKERS!!Workers 16 years and over with earnings!!$25,000 to $34,999	S0802_C03_032
acs5/subject	Means of transportation to work by selected characteristics	Estimate!!Car, truck, or van -- carpooled!!EARNINGS IN THE PAST 12 MONTHS (IN 2022 INFLATION-ADJUSTED DOLLARS) FOR WORKERS!!Workers 16 years and over with earnings!!Median earnings (dollars)	S0802_C03_037
acs5/subject	Means of transportation to work by selected characteristics	Estimate!!Car, truck, or van -- carpooled!!Workers 16 years and over!!OCCUPATION!!Production, transportation, and material moving occupations	S0802_C03_047
acs5/subject	Means of transportation to work by selected characteristics	Estimate!!Car, truck, or van -- carpooled!!Workers 16 years and over!!INDUSTRY!!Arts, entertainment, and recreation, and accommodation and food services	S0802_C03_058
acs5/subject	Means of transportation to work by selected characteristics	Estimate!!Car, truck, or van -- carpooled!!Workers 16 years and over!!CLASS OF WORKER!!Government workers	S0802_C03_063
acs5/subject	Means of transportation to work by selected characteristics	Estimate!!Car, truck, or van -- carpooled!!Workers 16 years and over who did not work from home!!TIME OF DEPARTURE TO GO TO WORK!!6:30 a.m. to 6:59 a.m.	S0802_C03_075
acs5/subject	Means of transportation to work by selected characteristics	Estimate!!Car, truck, or van -- carpooled!!Workers 16 years and over who did not work from home!!TRAVEL TIME TO WORK!!60 or more minutes	S0802_C03_089
acs5/subject	Means of transportation to work by selected characteristics	Estimate!!Car, truck, or van -- carpooled!!Workers 16 years and over in households!!VEHICLES AVAILABLE!!2 vehicles available	S0802_C03_096
acs5/subject	Means of transportation to work by selected characteristics	Estimate!!Public transportation (excluding taxicab)!!Workers 16 years and over!!OCCUPATION!!Service occupations	S0802_C04_044
acs5/subject	Means of transportation to work by selected characteristics	Estimate!!Public transportation (excluding taxicab)!!Workers 16 years and over!!PLACE OF WORK!!Worked in state of residence!!Worked in county of residence	S0802_C04_067
acs5/subject	Means of transportation to work by selected characteristics	Estimate!!Public transportation (excluding taxicab)!!Workers 16 years and over!!PLACE OF WORK!!Worked in state of residence!!Worked outside county of residence	S0802_C04_068
📊 Data Source: U.S. Census Bureau, American Community Survey (ACS) 5-Year Estimates \| 📅 Analysis Period: 2018-2022

After confirming these promising results we can save the most import features for future use.

BoardLocal |> pin_write(
  ConsolidatedVariableTractsByZoneId[, .SD, 
                                      .SDcols = c("LocationID",
                                                  TopVariableTractsMeanDiff)],
  "AcsVariablesByZoneId",
  type = "qs2",
  title = "Acs Variables By Zone Id"
)

Conclusion

This phase of the project has successfully transformed our dataset, moving beyond simple geography to capture the underlying socioeconomic drivers of urban mobility. By navigating the vast landscape of nearly 160,000 US Census variables, we employed a systematic, data-driven approach using Natural Language Processing and graph-based clustering to unearth the most potent predictors of taxi demand.

Our analysis validated the initial hypothesis: key indicators of economic activity (like household income and earnings), employment density, and commuting behavior (such as means of transportation and travel time to work) are strongly correlated with whether a taxi trip is taken. The process of performing a spatial join, using an area-weighted mean to accurately aggregate census tract data into taxi zones, has enriched our features with crucial context about the communities within them.

With these high-impact socioeconomic and transportation features now validated and extracted, we are poised for the next critical phase: preparing the complete dataset for model training. The insights gained here will be instrumental in building a more accurate and nuanced model to predict taxi trip demand across New York City.