
After completing the business understanding phase, we are ready to move on to the data understanding phase by carrying out an exploratory data analysis (EDA) with the following steps:

  1. Exploring the individual distribution of variables.
  2. Exploring correlations between the predictors and the target variable.
  3. Exploring correlations between predictors.

This will help us to:

  - Ensure data quality
  - Identify key predictors
  - Detect multicollinearity
  - Guide model choice and feature engineering

Setting up the environment

Loading packages to use

## Custom functions
library(project.nyc.taxi)

## To manage relative paths
library(here)

## To transform data larger than RAM
library(DBI)
library(duckdb)

## To transform data that fits in RAM
library(data.table)
library(lubridate)

## To create plots
library(ggplot2)
library(scales)

## Defining the print params to use in the report
options(datatable.print.nrows = 15, digits = 4)
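These options keep the rendered report compact: `datatable.print.nrows` caps how many rows a `data.table` prints before truncating, and `digits` controls the significant digits R uses when printing numbers. A quick illustration of the `digits` effect:

```r
options(datatable.print.nrows = 15, digits = 4)

## With digits = 4, printed numbers are rounded to 4 significant digits
print(pi)
#> [1] 3.142
```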

Creating DB connections

con <- dbConnect(duckdb(), dbdir = here("my-db.duckdb"))

dbListTables(conn = con)
#> [1] "NycTrips"          "PointMeanDistance" "ZoneCodesFilter"  
#> [4] "ZoneCodesRef"      "ZoneCodesRefClean"
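Beyond listing tables, DBI can also show each table's schema through DuckDB's `DESCRIBE` statement. Here is a minimal, self-contained sketch using an in-memory connection and a toy table (the real `my-db.duckdb` file and its tables are not recreated here):

```r
library(DBI)
library(duckdb)

## In-memory demo connection with a toy table
con_demo <- dbConnect(duckdb())
dbWriteTable(con_demo, "TripsDemo",
             data.frame(PULocationID = 1:3, fare = c(7.5, 12, 9)))

dbListTables(conn = con_demo)
#> [1] "TripsDemo"

## Column names and types for the table
dbGetQuery(con_demo, "DESCRIBE TripsDemo")

dbDisconnect(con_demo, shutdown = TRUE)
```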

Defining the training sample

TrainingSampleQuery <- "
SELECT t1.*
FROM NycTrips t1
INNER JOIN ZoneCodesFilter t2
  ON t1.PULocationID = t2.PULocationID AND
     t1.DOLocationID = t2.DOLocationID
WHERE t1.year = 2022
USING SAMPLE 1% (system, 547548);
"
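In this query, `USING SAMPLE 1% (system, 547548)` asks DuckDB for a system (block-level) sample with a fixed seed, so the same 1% sample is drawn on every run. A self-contained sketch on a toy table (the table name and size are illustrative, not from the project database):

```r
library(DBI)
library(duckdb)

con_demo <- dbConnect(duckdb())
dbExecute(con_demo, "CREATE TABLE demo AS SELECT range AS id FROM range(10000000)")

## Same seed, same sample: the seeded draw is reproducible
s1 <- dbGetQuery(con_demo, "SELECT * FROM demo USING SAMPLE 1% (system, 547548)")
s2 <- dbGetQuery(con_demo, "SELECT * FROM demo USING SAMPLE 1% (system, 547548)")
identical(s1, s2)

dbDisconnect(con_demo, shutdown = TRUE)
```

Note that system sampling selects whole storage blocks, so it is fast on large tables but the returned row count is only approximately 1%.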

TrainingSample <- dbGetQuery(con, TrainingSampleQuery)

setDT(TrainingSample)
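`dbGetQuery()` returns a plain `data.frame`; `setDT()` then converts it to a `data.table` in place, by reference, avoiding a copy of a potentially large sample. For example:

```r
library(data.table)

df <- data.frame(x = 1:3)
setDT(df)   # converts by reference, no copy is made
class(df)
#> [1] "data.table" "data.frame"
```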

Before saving the sample to the database, we need to create the following function.

is_best_trip_20_min <- function() 5