After completing the business understanding phase, we are ready to move on to the data understanding phase by performing an EDA with the following steps:
- Exploring the individual distribution of variables.
- Exploring correlations between predictors and target variable.
- Exploring correlations between predictors.
This will help us to:
- Ensure data quality
- Identify key predictors
- Detect multicollinearity
- Guide model choice and feature engineering
Setting up the environment
Loading packages to use
```r
## Custom functions
library('project.nyc.taxi')

## To manage relative paths
library(here)

## To transform data larger than RAM
library(DBI)
library(duckdb)

## To transform data that fits in RAM
library(data.table)
library(lubridate)

## To create plots
library(ggplot2)
library(scales)
```

Defining the print parameters to use in the report:

```r
options(datatable.print.nrows = 15, digits = 4)
```
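As a quick illustration of what these options do (a minimal sketch; the value of `pi` is just an example number), `digits = 4` rounds printed numerics to 4 significant digits, and `datatable.print.nrows = 15` caps how many rows a data.table prints before switching to a truncated top/bottom view:

```r
options(datatable.print.nrows = 15, digits = 4)

# Numeric printing now uses 4 significant digits
format(pi)   # "3.142"
```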
Creating DB connections
```r
con <- dbConnect(duckdb(), dbdir = here("my-db.duckdb"))

dbListTables(conn = con)
#> [1] "NycTrips"          "PointMeanDistance" "ZoneCodesFilter"
#> [4] "ZoneCodesRef"      "ZoneCodesRefClean"
```
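The same connection pattern also works with an in-memory database, which is handy for quick experiments before touching the on-disk file. A minimal sketch (the table name `mtcars_tbl` is just an example):

```r
library(DBI)
library(duckdb)

# Omitting dbdir creates an in-memory DuckDB database
con_mem <- dbConnect(duckdb())
dbWriteTable(con_mem, "mtcars_tbl", mtcars)

dbListTables(con_mem)                                    # "mtcars_tbl"
dbGetQuery(con_mem, "SELECT COUNT(*) AS n FROM mtcars_tbl")

# Shut the database down when done
dbDisconnect(con_mem, shutdown = TRUE)
```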
Defining training data
```r
TrainingSampleQuery <- "
SELECT t1.*
FROM NycTrips t1
INNER JOIN ZoneCodesFilter t2
  ON t1.PULocationID = t2.PULocationID AND
     t1.DOLocationID = t2.DOLocationID
WHERE t1.year = 2022
USING SAMPLE 1% (system, 547548);
"

TrainingSample <- dbGetQuery(con, TrainingSampleQuery)
setDT(TrainingSample)
```
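`setDT()` converts the data.frame returned by `dbGetQuery()` into a data.table by reference, so no copy of the (potentially large) sample is made. A minimal sketch with toy data (the `fare` column is just an example):

```r
library(data.table)

df <- data.frame(id = 1:3, fare = c(10.5, 7.0, 22.3))
setDT(df)   # converts in place; no copy is made
class(df)   # "data.table" "data.frame"

# data.table syntax is now available
df[fare > 8, .(n = .N, avg_fare = mean(fare))]
```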
Before saving the sample to the DB, we need to create this function.
```r
is_best_trip_20_min <- function() 5
```