5  Generalized linear models (GLM)

5.1 Linear regression

5.1.1 Getting the Least Squares Line of a sample

As the population regression line is unobserved, the least squares line of a sample is a good estimate of it. To obtain it we follow these steps:

  1. Define the function to fit.

\[ \hat{y} = \hat{\beta}_{0} + \hat{\beta}_{1} x \]

  2. Define how to calculate residuals.

\[ e_{i} = y_{i} - \hat{y}_{i} \]

  3. Define the residual sum of squares (RSS).

\[ RSS = e_{1}^2 + e_{2}^2 + \dots + e_{n}^2 \]

  4. Use calculus or numerical optimization with a computer to find the coefficients that minimize the RSS.

\[ \hat{\beta}_{1} = \frac{\Sigma_{i=1}^{n}(x_{i}-\overline{x})(y_{i}-\overline{y})} {\Sigma_{i=1}^{n}(x_{i}-\overline{x})^2} , \quad \hat{\beta}_{0} = \overline{y} - \hat{\beta}_{1}\overline{x} \]
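As a quick check of these formulas, here is a minimal sketch in base R (using the built-in mtcars data purely as an illustrative example) that computes the coefficients by hand and compares them with lm():

x <- mtcars$wt
y <- mtcars$mpg

# Least squares coefficients computed from the formulas above
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

c(b0 = b0, b1 = b1)
coef(lm(mpg ~ wt, data = mtcars))  # should match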

5.1.2 Getting confidence intervals of coefficients

To estimate the population regression line we can calculate confidence intervals for the sample coefficients, defining a range in which the population values lie with a given confidence level.

If we use a 95% confidence level, this means that if we took many samples, about 95% of the intervals built at that level would contain the true value (the parameter).

To generate confidence intervals we would need the variance of the random error.

\[ \sigma^2 = Var(\epsilon) \]

But as we cannot compute that variance directly, an alternative is to estimate it from the residuals, provided they meet the following conditions:

  1. Each error term has common variance \(\sigma^2\), so the variance of the error terms shouldn’t have any relation with the value of the response.
  2. Residuals are uncorrelated. For example, if \(\epsilon_{i}\) is positive, that provides little or no information about the sign of \(\epsilon_{i+1}\).

If these conditions do not hold, we end up underestimating the true standard errors, which reduces the probability that an interval at a given confidence level contains the true value of the parameter and understates the p-values associated with the model.

\[ \sigma \approx RSE = \sqrt{\frac{RSS}{(n-p-1)}} \]

Now we can calculate the standard error of each coefficient and build the confidence intervals.

\[ SE(\hat{\beta_{0}})^2 = \sigma^2 \left[\frac{1}{n}+ \frac{\overline{x}^2} {\Sigma_{i=1}^{n} (x_{i}-\overline{x})^2} \right] \]

\[ SE(\hat{\beta_{1}})^2 = \frac{\sigma^2} {\Sigma_{i=1}^{n} (x_{i} - \overline{x})^2} \]

\[ \hat{\beta_{1}} \pm 2 \cdot SE(\hat{\beta_{1}}), \quad \hat{\beta_{0}} \pm 2 \cdot SE(\hat{\beta_{0}}) \]
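In R these intervals can be obtained directly from a fitted model; a minimal sketch using the built-in mtcars data as an illustrative example:

fit <- lm(mpg ~ wt, data = mtcars)
confint(fit, level = 0.95)  # confidence intervals for beta_0 and beta_1
summary(fit)$sigma          # the residual standard error (RSE)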

5.1.3 Insights to extract

Confirm the relationship between the Response and Predictors

Use the regression’s overall p-value (based on the F-statistic) to confirm that at least one predictor is related to the response, avoiding interpretative problems associated with the number of observations (n) or predictors (p).

\[ H_{0}: \beta_{1} = \beta_{2} = \dots = \beta_{p} = 0 \]

\[ H_{a}: \text{at least one } \beta_{j} \text{ is non-zero} \]

Accuracy of the model (relationship strength)

If we want to know how well the model fits the data, we have two options:

  • Residual standard error (RSE): Even if the model were correct, the actual values of \(y\) would differ from the true regression line by approximately this many units, on average. To get the percentage error we can calculate \(RSE/\overline{y}\).

  • The \(R^2\) statistic: The proportion of variance explained by taking as a reference the total sum of squares (TSS).

\[ TSS = \Sigma(y_{i} - \overline{y})^2 \]

\[ R^2 = \frac{TSS - RSS}{TSS} \]

\[ R^2 = \begin{cases} Cor(X, Y)^2 & \text{Simple Linear Regression} \\ Cor(Y,\hat{Y})^2 & \text{Multiple Linear Regression} \end{cases} \]
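A minimal sketch of extracting these quantities with broom (mtcars used as an illustrative example): sigma is the RSE, and statistic/p.value correspond to the overall F-test discussed above.

library(broom)

fit <- lm(mpg ~ wt + hp, data = mtcars)
glance(fit)[, c("r.squared", "sigma", "statistic", "p.value")]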

Confirm the relationship between the Response and each predictor

To answer this we can test whether a particular subset of q of the coefficients is zero.

\[ H_{0}: \beta_{p-q+1} = \beta_{p-q+2} = \dots = \beta_{p} = 0 \]

In this case, the F-statistic reports the partial effect of adding an extra variable to the model (the order matters), which can be used to apply a variable selection technique. The classical approach is to:

  1. Fit a model for each of the \(2^p\) variable combinations.
  2. Select the best model based on Mallow’s Cp, Akaike information criterion (AIC), Bayesian information criterion (BIC), and adjusted \(R^2\), or plot various model outputs, such as the residuals, in order to search for patterns.

But note that if we have \(p = 30\) we would need to fit \(2^{30} = 1{,}073{,}741{,}824\) models, which is far too many. Some alternative approaches for this task:

  • Forward selection
  • Backward selection (cannot be used if \(p > n\))
  • Mixed selection

Size of association between each predictor and the response

To check this we look at the \(\hat{\beta}_{j}\) confidence intervals, since the true \(\beta_{j}\) should lie within that range.

Predicting future values

If we want to predict the average response \(f(X)\) we can use confidence intervals, but if we want to predict an individual response \(Y = f(X) + \epsilon\) we need to use prediction intervals, as they account for the uncertainty associated with \(\epsilon\), the irreducible error.

5.1.4 Standard linear regression model assumptions

  • The additivity assumption means that the association between a predictor \(X_{j}\) and the response \(Y\) does not depend on the values of the other predictors, which is violated when there is an interaction (synergy) effect.

  • The linearity assumption states that the change in the response Y associated with a one-unit change in \(X_{j}\) is constant, regardless of the value of \(X_{j}\).

Including an interaction term

This approach relaxes the additivity assumption that models usually make.

  • 2 quantitative variables

It consists of adding an extra term whose coefficient multiplies two or more variables.

\[ \begin{split} Y & = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \beta_{3} X_{1} X_{2} + \epsilon \\ & = \beta_{0} + (\beta_{1} + \beta_{3} X_{2}) X_{1} + \beta_{2} X_{2} + \epsilon \\ & = \beta_{0} + \tilde{\beta}_{1} X_{1} + \beta_{2} X_{2} + \epsilon \end{split} \]

After adding the interaction term we can interpret the change as making one of the original coefficients a function of the other variable. Now we can say that \(\beta_{3}\) represents the change in the effectiveness of \(X_{1}\) associated with a one-unit increase in \(X_{2}\).

It is very important to respect the hierarchical principle, which states that if we include an interaction in a model we should also include the main effects, even if the p-values associated with their coefficients are not significant; leaving them out would alter the meaning of the interaction.

  • 1 quantitative and 1 qualitative variable

If \(X_{1}\) is quantitative and \(X_{2}\) is qualitative:

\[ \hat{Y} = \begin{cases} (\beta_{0} + \beta_{2}) + (\beta_{1} + \beta_{3})X_{1} & \text{if }X_{2} \text{ is TRUE}\\ \beta_{0} + \beta_{1}X_{1} & \text{if }X_{2} \text{ is FALSE} \end{cases} \]

Adding the \(\beta_{3}\) interaction allows the slope of the line to change based on \(X_{2}\), rather than only the intercept.
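In R’s formula syntax the * operator adds both main effects and the interaction, which keeps the hierarchical principle automatically. A minimal sketch with the built-in mtcars data as an illustrative example:

lm(mpg ~ wt * hp, data = mtcars)          # two quantitative predictors
lm(mpg ~ wt * factor(am), data = mtcars)  # quantitative x qualitative (am is coded 0/1)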

Polynomial regression

This approach relaxes the linearity assumption that models usually make. It consists of including transformed versions of the predictors.

\[ \begin{split} Y & = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \epsilon \\ & = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{1}^2 + \epsilon \end{split} \]

5.1.5 Possible problems

The target doesn’t follow a normal distribution

We have two ways to detect this problem:

  1. Plot the target variable with a histogram

  2. Plot the residuals and check whether the plot shows a funnel shape, indicating non-constant variance (heteroscedasticity) of the error terms.

To solve this problem we can apply transformations to the target variable.

  1. Apply a concave function such as \(\log{Y}\) or \(\sqrt{Y}\). If the target contains zeros we can set the offset argument to 1.
ames_recipe <- 
  recipe(Sale_Price ~ ., data = ames_train) %>%
  step_log(all_outcomes(), offset = 0)
  2. Apply a Box-Cox transformation, which is more flexible than the log transformation and will find an appropriate transformation from a family of power transforms that brings the variable as close as possible to a normal distribution. Its lambda (\(\lambda\)) value varies from -5 to 5 and is selected based on the training data (see the sketch after this list for the corresponding recipes step).

\[ y(\lambda) = \begin{cases} \frac{Y^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \log{(Y)} & \text{if } \lambda = 0 \end{cases} \]

  3. If your data contains non-positive values (\(y \leq 0\)), use the Yeo-Johnson transformation, which is very similar to the Box-Cox one but does not require strictly positive values.
ames_recipe <- 
  recipe(Sale_Price ~ ., data = ames_train) %>%
  step_YeoJohnson(all_outcomes())
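For the Box-Cox transformation mentioned in step 2, a minimal sketch of the corresponding recipes step, assuming the same ames_train data used above:

ames_recipe <- 
  recipe(Sale_Price ~ ., data = ames_train) %>%
  step_BoxCox(all_outcomes())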

To interpret our predictions correctly it’s important to remember to invert this transformation after obtaining the predictions.

These transformations will reduce the relative error of each prediction.

Collinearity

Collinearity refers to the situation in which two or more predictor variables are closely related (highly correlated) to one another. It reduces the accuracy of the estimates of the regression coefficients and causes the standard error of \(\hat{\beta}_{j}\) to grow. That reduces the power of the hypothesis test, that is, the probability of correctly detecting a non-zero coefficient.

Looking at the correlation matrix of the predictors can be useful, but it is possible for collinearity to exist between three or more variables even if no pair of variables has a particularly high correlation (multicollinearity).

Detection method:

  • Compute the variance inflation factor (VIF): the ratio of the variance of \(\hat{\beta}_{j}\) when fitting the full model to the variance of \(\hat{\beta}_{j}\) if fit on its own. Its lowest possible value is 1, and values above 5 or 10 indicate a problematic amount of collinearity.

Solutions:

  1. Drop one of the problematic variables from the regression.
  2. Combine the collinear variables into a single predictor.

\[ \text{VIF}(\hat{\beta}_{j}) = \frac{1} {1 - R_{X_{j}|X_{-j}}^2} \]

Where \(R_{X_{j}|X_{-j}}^2\) is the \(R^2\) from a regression of \(X_{j}\) onto all of the other predictors.
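A minimal sketch of computing VIFs, assuming the car package is installed (mtcars used as an illustrative example):

library(car)

fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
vif(fit)  # values above 5 or 10 suggest problematic collinearity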

Non-linearity of the response-predictor relationships

Detection method:

  • Plot the residuals versus the predicted values \(\hat{y}_{i}\). Ideally, the residual plot will show no discernible pattern; the presence of a pattern may indicate a problem with some aspect of the linear model.

Solutions:

  • A simple approach is to use non-linear transformations of the predictors, such as \(\log{X}\), \(\sqrt{X}\), and \(X^2\), in the regression model.
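A minimal sketch of this diagnostic plot in base R (mtcars as an illustrative example):

fit <- lm(mpg ~ hp, data = mtcars)
plot(fit, which = 1)  # residuals vs fitted values; look for patterns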

Correlation of error terms

If there is correlation among the error terms, the estimated standard errors of the coefficients will tend to underestimate the true standard errors, leading to confidence and prediction intervals that are narrower than they should be.

Detection methods:

  1. Plot the residuals from our model as a function of time or execution order. If the errors are uncorrelated, there should be no discernible pattern.
  2. Check whether some observations have been exposed to the same environmental factors.

Solutions:

  • Good experimental design is crucial in order to mitigate these problems.

Outliers

An outlier is a point for which \(y_{i}\) is far from the value predicted by the model. Outliers sometimes have little effect on the least squares line, but they can inflate the RSE, which increases the p-values of the model, and deflate the \(R^2\).

Detection method:

  • Plot the studentized residuals, computed by dividing each residual \(e_{i}\) by its estimated standard error, and look for points whose absolute value is greater than 3.

Solutions:

  • They can be removed if they occurred due to an error in data collection. Otherwise, they may indicate a deficiency of the model, such as a missing predictor.
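A minimal sketch of this detection method in base R (mtcars as an illustrative example):

fit <- lm(mpg ~ hp, data = mtcars)
which(abs(rstudent(fit)) > 3)  # observations flagged as potential outliers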

High-leverage points

Observations with high leverage have an unusual value for \(x_{i}\). High leverage observations tend to have a sizable impact on the estimated regression line and any problems with these points may invalidate the entire fit.

Detection method:

  • Compute the leverage statistic \(h_{i}\) and look for observations whose value greatly exceeds the average leverage, \((p + 1)/n\). Leverage values are always between \(1/n\) and 1.

Solutions:

  • Make sure that the value is correct and not a data collection problem.

\[ h_{i} = \frac{1}{n} + \frac{(x_{i} - \overline{x})^2} {\Sigma_{i'=1}^n(x_{i'} - \overline{x})^2} \]

In a multiple linear regression, it is possible to have an observation that is well within the range of each individual predictor’s values, but that is unusual in terms of the full set of predictors.
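A minimal sketch of computing the leverage statistic in base R (mtcars as an illustrative example; the cutoff of three times the average leverage is a common rule of thumb, not a fixed rule):

fit <- lm(mpg ~ wt + hp, data = mtcars)
lev <- hatvalues(fit)
mean(lev)                   # average leverage, (p + 1)/n
which(lev > 3 * mean(lev))  # unusually high-leverage observations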

5.1.6 Avoid using for classification problems

There are better models for that kind of problem. For example, for a binary response the linear discriminant analysis (LDA) procedure produces the same classifications as a linear regression. Other reasons are:

  • A regression method cannot accommodate a qualitative response with more than two classes.
  • A regression method will not provide meaningful estimates of \(Pr(Y|X)\) as some of our estimates might be outside the [0, 1] interval.

5.1.7 Benefits of Replacing plain least squares fitting

  • Prediction accuracy: If n is not much larger than p, there can be a lot of variability in the least squares fit, resulting in overfitting and consequently poor predictions on future observations not used in model training. And if \(p > n\), there is no longer a unique least squares coefficient estimate: the variance is infinite, so the method cannot be used at all. As an alternative, we can reduce the variance by accepting a small increase in bias (constraining and shrinking the coefficient estimates).

  • Model interpretability: Some methods can exclude irrelevant variables from a multiple regression model (feature selection or variable selection).

5.1.8 Coding example

To perform linear regression we just need to create the model specification using the lm engine.

library(ISLR2)
library(tidymodels)

set.seed(123)
BikeshareSpit <- initial_split(Bikeshare)

lm_spec <- linear_reg() %>% 
  set_mode("regression") %>% 
  set_engine("lm")

lm_rec_spec <- 
  recipe(bikers ~ mnth + hr + workingday + temp + weathersit,
         data = Bikeshare ) %>% 
  step_dummy(all_nominal_predictors())

workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(lm_rec_spec) %>%
  last_fit(split = BikeshareSpit) %>%
  collect_metrics()
# A tibble: 2 × 4
  .metric .estimator .estimate .config             
  <chr>   <chr>          <dbl> <chr>               
1 rmse    standard      76.1   Preprocessor1_Model1
2 rsq     standard       0.671 Preprocessor1_Model1

5.2 Extending Linear Regression (Data Transformations)

5.2.1 Polynomial regression

It extends the linear model by adding extra predictors, obtained by raising each of the original predictors to a power.

As a result, if the response is a numeric variable we can fit our model with the following form:

\[ y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \dots + \beta_d x_i^d + \epsilon_i \]

On the other hand, we can use logistic regression and apply the same structure to predict the probability of a particular class:

\[ \Pr(y_i > 250|x_i) = \frac{\exp(\beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \dots + \beta_d x_i^d)} {1 + \exp(\beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \dots + \beta_d x_i^d)} \]
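A minimal sketch of polynomial regression with tidymodels, reusing the lm_spec defined in the earlier coding example and assuming the Wage data from ISLR2 (the degree is illustrative):

poly_rec <- 
  recipe(wage ~ age, data = Wage) %>%
  step_poly(age, degree = 4)  # orthogonal polynomial basis for age

poly_fit <- 
  workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(poly_rec) %>%
  fit(data = Wage)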

5.2.2 Piecewise constant regression

They cut the range of a variable into K distinct regions (known as bins) in order to produce a qualitative variable. This has the effect of fitting a piecewise constant function.

If we define the cutpoints \(c_1, c_2, \dots, c_K\) in the range of X, we can create dummy variables to represent each range. For example, if \(c_1 \leq x_i < c_2\) is TRUE then \(C_1(x_i) = 1\), and we repeat that process for each value of \(X\) and each range. As a result we can fit a linear regression on the new variables.

\[ y_i = \beta_0 + \beta_1 C_1(x_i) + \beta_2 C_2(x_i) + \dots + \beta_K C_K(x_i) + \epsilon_i \]

On the other hand, we can use logistic regression and apply the same structure to predict the probability of a particular class:

\[ \Pr(y_i > 250|x_i) = \frac{\exp(\beta_0 + \beta_1 C_1(x_i) + \beta_2 C_2(x_i) + \dots + \beta_K C_K(x_i))} {1 + \exp(\beta_0 + \beta_1 C_1(x_i) + \beta_2 C_2(x_i) + \dots + \beta_K C_K(x_i))} \]
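A minimal sketch of piecewise constant regression with tidymodels, again reusing lm_spec and the Wage data from ISLR2 (the cutpoints are illustrative):

cut_rec <- 
  recipe(wage ~ age, data = Wage) %>%
  step_cut(age, breaks = c(30, 50, 70))  # bins age into distinct regions

cut_fit <- 
  workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(cut_rec) %>%
  fit(data = Wage)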

5.2.3 Piecewise polynomial regression (Natural Spline)

It consists of fitting separate low-degree polynomials over different regions of X. For example, a piecewise cubic polynomial with a single knot at a point c takes the form:

\[ y_i = \begin{cases} \beta_{01} + \beta_{11} x_i + \beta_{21} x_i^2 + \beta_{31} x_i^3 + \epsilon_i & \text{if } x_i<c\\ \beta_{02} + \beta_{12} x_i + \beta_{22} x_i^2 + \beta_{32} x_i^3 + \epsilon_i & \text{if } x_i \geq c \end{cases} \]

As each polynomial has four parameters, we are using a total of 8 degrees of freedom to fit this model. Fitting this model to the Wage data shows a problem: the fit is too flexible and discontinuous at the knot, and to solve it we need to constrain the function to be continuous at age = 50.

Even after applying the continuity constraint, the fit still presents an unnatural V-shape. This can be solved by also constraining the first and second derivatives of the function to be continuous, ending with a model that uses 5 degrees of freedom.

In this context, a natural spline refers to a regression spline with the additional constraints of maintaining linearity at the boundaries.
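A minimal sketch of a natural spline fit with tidymodels, reusing lm_spec and the Wage data from ISLR2 (the degrees of freedom are illustrative):

ns_rec <- 
  recipe(wage ~ age, data = Wage) %>%
  step_ns(age, deg_free = 4)  # natural spline basis for age

ns_fit <- 
  workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(ns_rec) %>%
  fit(data = Wage)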

5.3 Subset Selection

This approach involves identifying a subset of the p predictors that we believe to be related to the response. We then fit a model using least squares on the reduced set of variables.

5.3.1 Best Subset Selection

It fits all p models that contain exactly one predictor, all \(\left( \begin{array}{c} p \\ 2 \end{array} \right) = p (p-1)/2\) models that contain exactly two predictors, and so forth. Then, within each model size, it selects the best model based on the smallest RSS or the largest \(R^2\).

  • Algorithm 6.1
    1. Let \(\mathcal{M}_0\) denote the null model, which contains no predictors and simply predicts the sample mean for each observation.
    2. For \(k = 1, 2, \dots, p\):
    • Fit all \(\left( \begin{array}{c} p \\ k \end{array} \right)\) models that contain exactly k predictors.
    • Pick the best among these \(\left( \begin{array}{c} p \\ k \end{array} \right)\) models using the smallest RSS (or, for classification, the smallest deviance, i.e. negative two times the maximized log-likelihood), and call it \(\mathcal{M}_k\).
    3. Select a single best model from among \(\mathcal{M}_0, \dots, \mathcal{M}_p\) using cross-validated prediction error, \(C_p\) (AIC), BIC or adjusted \(R^2\).

This method is really computationally expensive, as it needs to fit \(2^p\) models. If your data has 20 predictors, there are over one million possibilities. Such an enormous search space can also lead to overfitting and high variance of the coefficient estimates.
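A minimal sketch of best subset selection, assuming the leaps package is installed and using the Hitters data from ISLR2 with missing salaries removed:

library(leaps)

Hitters_complete <- na.omit(ISLR2::Hitters)  # drop rows with missing Salary
best_sub <- regsubsets(Salary ~ ., data = Hitters_complete, nvmax = 19)
best_sum <- summary(best_sub)
which.min(best_sum$bic)    # model size preferred by BIC
which.max(best_sum$adjr2)  # model size preferred by adjusted R^2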

5.3.2 Stepwise Selection: Forward Stepwise Selection

It begins with a model containing no predictors and then adds the predictor that gives the greatest additional improvement to the fit, one at a time, until all of the predictors are in the model. As a result we only need to fit \(1+p(p+1)/2\) models.

  • Algorithm 6.2
    1. Let \(\mathcal{M}_0\) denote the null model, which contains no predictors and simply predicts the sample mean for each observation.
    2. For \(k = 0, \dots, p-1\):
    • Consider all \(p-k\) models that augment the predictors in \(\mathcal{M}_k\) with one additional predictor.
    • Choose the best among these \(p-k\) models, and call it \(\mathcal{M}_{k+1}\). Here best is defined as having smallest RSS.
    3. Select a single best model from among \(\mathcal{M}_0, \dots, \mathcal{M}_p\) using cross-validated prediction error, \(C_p\) (AIC), BIC or adjusted \(R^2\).

Though it tends to do well in practice, it is not guaranteed to find the best possible model out of all \(2^p\) models containing subsets of the p predictors, but it can be used even when \(n < p\).

5.3.3 Stepwise Selection: Backward Stepwise Selection

It begins with the full least squares model containing all p predictors and then iteratively removes the least useful predictor, one at a time. As a result we will need to fit \(1+p(p+1)/2\) models.

  • Algorithm 6.3
    1. Let \(\mathcal{M}_p\) denote the full model, which contains all p predictors.
    2. For \(k = p, p -1, \dots , 1\):
    • Consider all k models that contain all but one of the predictors in \(\mathcal{M}_k\), for a total of \(k-1\) predictors.
    • Choose the best among these k models, and call it \(\mathcal{M}_{k-1}\). Here best is defined as having smallest RSS.
    3. Select a single best model from among \(\mathcal{M}_0, \dots, \mathcal{M}_p\) using cross-validated prediction error, \(C_p\) (AIC), BIC or adjusted \(R^2\).

This method doesn’t guarantee to yield the best model containing a subset of the p predictors, and we need to confirm that \(n \geq p\) so that the full model can be fit.
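A minimal sketch of both stepwise variants with leaps, assuming the same Hitters_complete data as in the best subset sketch above:

library(leaps)

fwd <- regsubsets(Salary ~ ., data = Hitters_complete, nvmax = 19, method = "forward")
bwd <- regsubsets(Salary ~ ., data = Hitters_complete, nvmax = 19, method = "backward")
which.min(summary(fwd)$cp)  # model size preferred by Cp under forward selection
which.min(summary(bwd)$cp)  # model size preferred by Cp under backward selection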

5.3.4 Stepwise Selection: Hybrid Approaches

This method adds variables sequentially, but after adding each new variable it may also remove any variables that no longer provide an improvement to the model fit. Such an approach attempts to more closely mimic best subset selection while retaining the computational advantages of forward and backward stepwise selection.

5.3.5 Choosing the Optimal Model

As the model which contains all of the predictors will always have the smallest RSS, we need to estimate the test error rate by:

  • Making an adjustment to the training error to account for the bias due to overfitting.
  • Using either a validation set approach or a cross-validation approach.

Let’s see the methods that rely on correcting the training error:

  • \(C_p\): \(C_p = \frac{1}{n} (\text{RSS} + 2d\hat{\sigma}^2)\). The term \(2d\hat{\sigma}^2\) represents the penalty of adding new predictors. This method is a good estimate of the test MSE if \(\hat{\sigma}^2\) is an unbiased estimate of \(\sigma^2\).

  • Akaike information criterion: \(\text{AIC} = \frac{1}{n} (\text{RSS} + 2d\hat{\sigma}^2)\). The book omits irrelevant constants to show that \(C_p\) and \(\text{AIC}\) are proportional to each other.

  • Bayesian information criterion: \(\text{BIC} = \frac{1}{n}(\text{RSS} +\log(n) d \hat{\sigma}^2)\). After omitting irrelevant constants, we can see that \(\log n > 2\) for any \(n > 7\); as a consequence, this metric tends to place a heavier penalty on models with many variables than \(C_p\) and tends to select models with fewer predictors.

  • Adjusted \(R^2\): \(\text{Adjusted} \; R^2 = 1 - \frac{\text{RSS}/(n - d -1)}{\text{TSS}/(n-1)}\). A large value of adjusted \(R^2\) indicates a model with a small test error, even though the metric doesn’t rely on rigorous theoretical justifications.

Where:

  • \(n\): Number of observations
  • \(d\): Number of predictors
  • \(\hat{\sigma}^2\): Estimate of the variance of the error \(\epsilon\) associated with each response
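A minimal sketch of comparing two candidate models with these kinds of criteria via broom (mtcars used as an illustrative example):

library(broom)

m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)
rbind(glance(m1), glance(m2))[, c("adj.r.squared", "AIC", "BIC")]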

5.4 Shrinkage

This approach involves fitting a model with all p predictors while shrinking the estimated coefficients towards zero. Depending on the type of shrinkage performed, some of the coefficients may be estimated to be exactly zero, so it can also perform variable selection.

5.4.1 Ridge Regression

Rather than choosing the regression coefficients \(\beta_0, \beta_1, \dots, \beta_p\) to minimize the RSS alone, this method minimizes the RSS plus a shrinkage penalty that pulls the estimate of each \(\beta_j\) towards zero:

\[ RSS + \lambda \sum_{j=1}^p \beta_j^2 \]

Where:

  • \(\lambda\): A tuning parameter \(\geq 0\) that can be selected using cross-validation.
  • \(\text{RSS}\): The residual sum of squares.

As a result, ridge regression produces a different set of coefficient estimates, \(\hat{\beta}_\lambda^R\), for each value of \(\lambda\). By plotting the new coefficients against the penalty used we can see how the coefficients move towards zero without ever reaching exactly 0. It may also be useful to plot the ratio of the \(\ell_2\) norm (\(\| \beta \|_2 = \sqrt{\sum_{j=1}^p \beta_j^2}\)) of the ridge coefficients to that of the least squares coefficients, to compute the proportion by which the ridge regression coefficient estimates have been shrunken towards zero.

As the selected \(\lambda\) is sensitive to the scaling of the predictors, it’s good practice to scale the predictors using the following formula, so that they all have a standard deviation of 1:

\[ \tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^n(x_{ij}-\bar{x}_j)^2}} = \frac{x_{ij}}{\sigma_j} \]

When the number of variables p is almost as large as the number of observations n, the least squares estimates will be extremely variable. Ridge regression reduces the flexibility through \(\lambda\), accepting a small increase in bias in exchange for a large reduction in variance, in order to reach a lower error rate without the heavy computation required by best subset selection.

5.4.2 Lasso Regression

Ridge regression doesn’t set coefficients exactly to zero (unless \(\lambda = \infty\)). That doesn’t affect model accuracy, but it doesn’t help when we have to interpret a model with many predictors. To overcome that problem, lasso regression performs variable selection by minimizing the following function:

\[ RSS + \lambda \sum_{j=1}^p |\beta_j| \]

Where:

  • \(\lambda\): A tuning parameter \(\geq 0\) that can be selected using cross-validation.
  • \(\text{RSS}\): The residual sum of squares.

As a consequence, the method uses the \(\ell_1\) penalty (\(\| \beta \|_1 = \sum_{j=1}^p |\beta_j|\)) instead of the \(\ell_2\) penalty.

5.4.3 Coding example

To perform a ridge or lasso regression we need to use the function linear_reg and set the mixture argument depending on the regression we want to perform.

  • Ridge regression: mixture = 0
  • Lasso regression: mixture = 1

As both regressions depend on the penalty parameter, we need to use cross-validation to estimate the best one. But if you want to explore the parameter by yourself, you have the following options.

  • Fit a model and explore different penalties using tidy, augment, predict or autoplot functions.
Hitters <- 
  as_tibble(Hitters) %>%
  filter(!is.na(Salary))

ridge_fit <- 
  linear_reg(mixture = 0, penalty = 0) %>%
  set_mode("regression") %>%
  set_engine("glmnet") %>%
  fit(Salary ~ ., data = Hitters)

tidy(ridge_fit, penalty = 11498)
# A tibble: 20 × 3
   term         estimate penalty
   <chr>           <dbl>   <dbl>
 1 (Intercept) 407.        11498
 2 AtBat         0.0370    11498
 3 Hits          0.138     11498
 4 HmRun         0.525     11498
 5 Runs          0.231     11498
 6 RBI           0.240     11498
 7 Walks         0.290     11498
 8 Years         1.11      11498
 9 CAtBat        0.00314   11498
10 CHits         0.0117    11498
11 CHmRun        0.0876    11498
12 CRuns         0.0234    11498
13 CRBI          0.0242    11498
14 CWalks        0.0250    11498
15 LeagueN       0.0866    11498
16 DivisionW    -6.23      11498
17 PutOuts       0.0165    11498
18 Assists       0.00262   11498
19 Errors       -0.0206    11498
20 NewLeagueN    0.303     11498
predict(ridge_fit, new_data = Hitters, penalty = 11498)
# A tibble: 263 × 1
   .pred
   <dbl>
 1  533.
 2  553.
 3  620.
 4  487.
 5  559.
 6  440.
 7  446.
 8  453.
 9  620.
10  615.
# ℹ 253 more rows
autoplot(ridge_fit)+
  scale_x_log10(labels = scales::comma_format(accuracy = 1))+
  theme_light()

Let’s see an example using the tidymodels approach, following these steps:

  1. Split the data into training and testing sets.
set.seed(18)
Hitters_split <- initial_split(Hitters, strata = "Salary")

Hitters_train <- training(Hitters_split)
Hitters_test <- testing(Hitters_split)
  2. Define the workflow to use.
ridge_recipe <- 
  recipe(formula = Salary ~ ., data = Hitters_train) %>% 
  step_novel(all_nominal_predictors()) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_predictors())

ridge_spec <- 
  linear_reg(penalty = tune(), mixture = 0) %>% 
  set_mode("regression") %>% 
  set_engine("glmnet")

ridge_workflow <- 
  workflow() %>% 
  add_recipe(ridge_recipe) %>% 
  add_model(ridge_spec)
  3. Use cross-validation to estimate the testing error and tune the penalty.
set.seed(40)
Hitters_fold <- vfold_cv(Hitters_train, v = 10)
  4. Define the penalties to check, where range gives the limits of \(x\) in \(10^x\) and levels the number of steps used to cover that range.
penalty_grid <- grid_regular(penalty(range = c(-5, 5)), levels = 50)
  5. Fit the models.
tune_res <- 
  tune_grid(ridge_workflow,
            resamples = Hitters_fold, 
            grid = penalty_grid)
  6. Check how the test error changes in a plot, or just export a table.
autoplot(tune_res)+
  scale_x_log10(labels = scales::comma_format(accuracy = 1))+
  theme_light()

collect_metrics(tune_res) %>%
  filter(.metric == "rsq") %>%
  arrange(desc(mean))
# A tibble: 50 × 7
   penalty .metric .estimator  mean     n std_err .config              
     <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
 1    910. rsq     standard   0.454    10  0.0647 Preprocessor1_Model40
 2    569. rsq     standard   0.454    10  0.0646 Preprocessor1_Model39
 3   1456. rsq     standard   0.453    10  0.0646 Preprocessor1_Model41
 4    356. rsq     standard   0.453    10  0.0644 Preprocessor1_Model38
 5   2330. rsq     standard   0.452    10  0.0645 Preprocessor1_Model42
 6    222. rsq     standard   0.450    10  0.0640 Preprocessor1_Model37
 7   3728. rsq     standard   0.450    10  0.0643 Preprocessor1_Model43
 8   5964. rsq     standard   0.449    10  0.0642 Preprocessor1_Model44
 9   9541. rsq     standard   0.448    10  0.0641 Preprocessor1_Model45
10    139. rsq     standard   0.448    10  0.0634 Preprocessor1_Model36
# ℹ 40 more rows
  7. Select the best model configuration and update the workflow to use that parameter.
best_penalty <- select_best(tune_res, metric = "rsq")
best_penalty
# A tibble: 1 × 2
  penalty .config              
    <dbl> <chr>                
1    910. Preprocessor1_Model40
ridge_final <- finalize_workflow(ridge_workflow, best_penalty)
ridge_final
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps

• step_novel()
• step_dummy()
• step_zv()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = 910.298177991523
  mixture = 0

Computational engine: glmnet 
  8. Fit the final model and validate the performance.
ridge_final_fit <- fit(ridge_final, data = Hitters_train)

augment(ridge_final_fit, new_data = Hitters_test) %>%
  rsq(truth = Salary, estimate = .pred)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rsq     standard       0.514

To perform a Lasso regression we don’t need to repeat all the code.

lasso_spec <- 
  linear_reg(penalty = tune(), mixture = 1) %>% 
  set_mode("regression") %>% 
  set_engine("glmnet")

lasso_workflow <- 
  workflow() %>% 
  add_recipe(ridge_recipe) %>% 
  add_model(lasso_spec)

lasso_penalty_grid <- grid_regular(penalty(range = c(-3, 2)), levels = 50)

lasso_tune_res <- 
  tune_grid(lasso_workflow,
            resamples = Hitters_fold, 
            grid = lasso_penalty_grid)

autoplot(lasso_tune_res)+
  scale_x_log10(labels = scales::comma_format(accuracy = 1))+
  theme_light()

5.5 Dimension Reduction

This method projects the \(p\) predictors into an \(M\)-dimensional subspace. If \(Z_1, Z_2, \dots, Z_M\) represent \(M < p\) linear combinations \(Z_m = \sum_{j=1}^p \phi_{jm}X_j\) of ALL our original predictors, based on some constants \(\phi_{1m}, \phi_{2m}, \dots, \phi_{pm}\), then we can use the new variables to fit a linear regression model by least squares.

\[ y_i = \theta_0 + \sum_{m=1}^M \theta_m z_{im} + \epsilon_i, \qquad i=1, \dots, n. \]

This is not a feature selection method, as each of the M principal components used in the regression is a linear combination of all p of the original features. In this sense, PCR is more closely related to ridge regression than to the lasso.

To select the \(\phi_{jm}\)’s we will discuss two different ways:

5.5.1 Principal Components Regression (PCR)

PCR assumes that the directions in which \(X_1, \dots, X_p\) show the most variation are the directions that are associated with \(Y\). If that is true, then fitting the model to \(Z_1, \dots, Z_M\) will lead to better results than using the original variables \(X_1, \dots, X_p\).

To perform a principal components analysis (PCA):

  1. It’s recommended to standardize each predictor to have the same scale.

\[ \tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^n(x_{ij}-\bar{x}_j)^2}} = \frac{x_{ij}}{\sigma_j} \]

  2. Take as the first principal component the direction along which the data vary the most.

  3. Project the observations onto that direction to obtain the scores with the largest possible variance.

  4. Equivalently, with two predictors, maximize \(\text{Var}(\phi_{11} \times (x_1-\overline{x_1}) + \phi_{21} \times (x_2-\overline{x_2}))\) subject to \(\phi_{11}^2 + \phi_{21}^2 = 1\) to get the principal component loadings. As a result \(Z_1\) is a weighted average of the two variables.

  5. Repeat the process until having p distinct principal components, each perpendicular to the previous ones.

In general, this method performs better when only a few principal components are needed.

The number of principal components, M, is typically chosen by cross-validation.

5.5.2 Partial Least Squares (PLS)

The PLS approach attempts to find directions that help explain both the response and the predictors by placing the highest weight on the variables that are most strongly related to the response.

  1. Standardize the predictors and response.

\[ \tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^n(x_{ij}-\bar{x}_j)^2}} = \frac{x_{ij}}{\sigma_j} \]

  2. Compute the first direction \(Z_1\) by setting each \(\phi_{j1}\) equal to the coefficient from the simple linear regression of \(Y \sim X_j\), which is also proportional to the correlation between \(Y\) and \(X_j\).

  3. Adjust each of the variables for \(Z_1\) by regressing each variable on \(Z_1\) and taking the residuals.

  4. Compute \(Z_2\) using this orthogonalized data (the residuals) in exactly the same fashion as \(Z_1\) was computed from the original data, and repeat the process \(M\) times.

  5. Use least squares to fit a linear model to predict \(Y\) using \(Z_1, \dots, Z_M\).

To select the number \(M\) we can use cross-validation.

5.5.3 PCR vs PLS

The next figure illustrates how each method works.

The next figure illustrates the first two components obtained with each method:

  • PCR: they have very little relationship to the response variable.
  • PLS: they have a much stronger association with the response variable.

Performance Note

In practice, PLS often performs no better than ridge regression or PCR. While the supervised dimension reduction of PLS can reduce bias, it also has the potential to increase variance, so the overall benefit of PLS relative to PCR is a wash.

5.5.4 Coding example

To run a PCR (or a PLS) we just need to define a linear model specification.

lm_spec <- 
  linear_reg() %>% 
  set_mode("regression") %>% 
  set_engine("lm")

And tune our recipe to determine the threshold and the number of components to use (num_comp).

pca_recipe <- 
  recipe(formula = Salary ~ ., data = Hitters_train) %>% 
  step_novel(all_nominal_predictors()) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors(), 
           threshold = tune(),
           num_comp = tune())

pca_workflow <- 
  workflow() %>% 
  add_recipe(pca_recipe) %>% 
  add_model(lm_spec)

threshold_grid <- 
  grid_regular(threshold(), 
               num_comp(c(1, 20)),
               levels = 5)

pca_tune_res <- 
  tune_grid(pca_workflow,
            resamples = Hitters_fold, 
            grid = threshold_grid)

As we can see below, the number of components doesn’t have any effect on the test error.

autoplot(pca_tune_res)+
  scale_x_log10(labels = scales::comma_format(accuracy = 1))+
  theme_light()

best_threshold <- select_best(pca_tune_res, metric = "rmse")
best_threshold
# A tibble: 1 × 3
  num_comp threshold .config              
     <int>     <dbl> <chr>                
1        1      0.75 Preprocessor04_Model1
final_pcr_fit <-
  finalize_workflow(pca_workflow, best_threshold) |>
  fit(data = Hitters_train)

final_pcr_fit |>
  augment(new_data = Hitters_test) |>
  rmse(truth = Salary, estimate = .pred)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard        402.

But thanks to the DALEXtra package we can interpret these complex models.

library(DALEXtra)

set.seed(1807)
explain_tidymodels(
  final_pcr_fit, 
  data = Hitters_train %>% select(-Salary), 
  y = Hitters_train$Salary,
  label = "RDA",
  verbose = FALSE
) %>% 
  model_parts() |>
  plot()
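The recipe above performs PCR; to use PLS instead, one option is to swap the PCA step for step_pls. A minimal sketch, assuming the mixOmics package required by step_pls is installed and fixing num_comp at an illustrative value instead of tuning it:

pls_recipe <- 
  recipe(formula = Salary ~ ., data = Hitters_train) %>% 
  step_novel(all_nominal_predictors()) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_predictors()) %>%
  step_pls(all_predictors(), outcome = "Salary", num_comp = 2)

pls_fit <- 
  workflow() %>% 
  add_recipe(pls_recipe) %>% 
  add_model(lm_spec) %>%
  fit(data = Hitters_train)

augment(pls_fit, new_data = Hitters_test) |>
  rmse(truth = Salary, estimate = .pred)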

5.6 Poisson Regression

If the following conditions are met, we can fit a model based on the Poisson distribution:

  • The outcome variable is the result of counting, so \(Y \in \{ 0, 1, 2, 3, \dots \}\).
  • \(\text{Ave}(Y) \approx \text{Var}(Y)\); if not, you can use the quasi-Poisson distribution.

The Poisson distribution follows this probability mass function:

\[ Pr(Y = k) = \frac{e^{-\lambda} \lambda^{k}} {k!} \]

Where:

  • \(\lambda\) must be greater than 0. It represents the expected number of events \(E(Y)\) and the related variance \(Var(Y)\).
  • \(k\) represents the number of events that we want to evaluate based on \(\lambda\). It must be an integer greater than or equal to 0.

So it makes sense that the value we want to predict with our regression is \(\lambda\), using the following structure:

\[ \log{ \left( \lambda(X_1, X_2, \dots , X_p) \right)} = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p \]

If we select this model we need to be aware of how to interpret the coefficients. For example, if \(\beta_1 = -0.08\) for a categorical variable, then by calculating \(e^{-0.08} \approx 0.9231\) we can conclude that 92.31% of the events of the baseline level (related to \(\beta_0\)) would happen.
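A minimal sketch of this interpretation with base R and broom, using the Bikeshare data from ISLR2 (the choice of predictors is illustrative):

library(broom)

pois_fit <- glm(bikers ~ workingday + temp, data = Bikeshare, family = poisson)
tidy(pois_fit, exponentiate = TRUE)  # multiplicative effects relative to the baseline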

5.6.1 Coding example

To perform Poisson regression we just need to create the model specification by loading the poissonreg package and using the glm engine.

library(poissonreg)

pois_spec <- poisson_reg() %>% 
  set_mode("regression") %>% 
  set_engine("glm")

pois_rec_spec <- lm_rec_spec

workflow() %>%
  add_model(pois_spec) %>%
  add_recipe(pois_rec_spec) %>%
  last_fit(split = BikeshareSpit) %>%
  collect_metrics()
# A tibble: 2 × 4
  .metric .estimator .estimate .config             
  <chr>   <chr>          <dbl> <chr>               
1 rmse    standard      69.7   Preprocessor1_Model1
2 rsq     standard       0.725 Preprocessor1_Model1

5.7 Logistic Regression

It models the probability (\(p(X) = Pr(Y=1|X)\)) that Y belongs to a particular category given some predictors, by assuming that \(Y\) follows a Bernoulli distribution. The model calculates the probability using the logistic function, which produces an S-shaped curve between 0 and 1:

\[ p(X) = \frac{e^{\beta_{0}+\beta_{1}X}} {1+e^{\beta_{0}+\beta_{1}X}} \]

As the function returns probabilities, it is the analyst’s responsibility to define a threshold to make classifications.

5.7.1 Estimating coefficients

To estimate \(\hat{\beta}_{0}\) and \(\hat{\beta}_{1}\) the method used is called maximum likelihood, which consists of maximizing the likelihood function. It is important to clarify that the least squares approach is in fact a special case of maximum likelihood.

\[ \ell(\beta_{0}, \beta_{1}) = \prod_{i:y_{i} = 1}p(x_{i})\prod_{i':y_{i'} = 0}(1-p(x_{i'})) \]

5.7.2 Multiple regression

We can also generalize the logistic function, as you can see below.

\[ p(X) = \frac{e^{\beta_{0}+\beta_{1}X_{1}+\dots+\beta_{p}X_{p}}} {1+e^{\beta_{0}+\beta_{1}X_{1}+\dots+\beta_{p}X_{p}}} \]

5.7.3 Interpreting the model

To understand how each variable influences the probability \(p(X)\), we need to manipulate the logistic function until we have a linear combination on the right-hand side.

\[ \underbrace{ \log{ \left( \overbrace{\frac{p(X)}{1 - p(X)}}^\text{odds} \right)} }_\text{log odds or logit} = \beta_{0}+\beta_{1}X \]

As we can see, the result of the linear combination is the \(\log\) of the odds, known as the log odds or logit.

The odds of an event present the likelihood that the event will occur as a proportion of the likelihood that the event won’t occur. They can take any value between \(0\) and \(\infty\): low probabilities give odds close to \(0\), high probabilities give odds close to \(\infty\), and a probability of 0.5 gives odds equal to 1. For example, if the \(\text{odds} = 2\), it is 2 times more likely that the event happens than that it does not.

Applying \(\log{(\text{odds})}\) makes it easier to compare the effects of variables, as values below 1 become negative numbers on the scale and a value of 1 (no effect) becomes 0. For example, odds of 2 have the same size of effect as odds of 0.5, which is hard to see at first glance, but if we apply the \(\log\) to each value we can see that \(\log{(2)} = 0.69\) and \(\log{(0.5)} = -0.69\).

In the end, \(p(X)\) will increase as \(X\) increases if \(\beta_{1}\) is positive, even though the relationship between them is not a linear one.

Understanding a confounding paradox

  • Simple regression: the positive coefficient for student indicates that, averaged over all values of balance and income, a student is more likely to default than a non-student.
  • Multiple regression: the negative coefficient for student indicates that, for fixed values of balance and income, a student is less likely to default than a non-student.

The apparent paradox comes from the fact that student and balance are correlated. In consequence, a student is riskier than a non-student if no information about the student’s credit card balance is available; however, that student is less risky than a non-student with the same credit card balance.

Multinomial Logistic Regression

We can also generalize the logistic function to support more than 2 categories (\(K > 2\)) by choosing, by convention, the last category \(K\) as the baseline.

For \(k = 1, \dotsc, K-1\) we use the function:

\[ Pr(Y = k|X= x) = \frac{e^{\beta_{k0}+\beta_{k1}x_{1}+\dots+\beta_{kp}x_{p}}} {1+\sum_{l=1}^{K-1}e^{\beta_{l0}+\beta_{l1}x_{1}+\dots+\beta_{lp}x_{p}}} \]

For \(k=K\) we use the function:

\[ Pr(Y = K|X= x) = \frac{1} {1+\sum_{l=1}^{K-1}e^{\beta_{l0}+\beta_{l1}x_{1}+\dots+\beta_{lp}x_{p}}} \]

After some manipulations we can show that the \(\log\) of the probability of class \(k\) divided by the probability of the baseline is equal to a linear combination of the parameters.

\[ \log{ \left( \frac{Pr(Y = k|X= x)}{Pr(Y = K|X= x)} \right)} = \beta_{k0}+\beta_{k1}x_{1}+\dots+\beta_{kp}x_{p} \]

Consequently, each coefficient measures how much the log odds relative to the baseline category change with the corresponding predictor.
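A minimal sketch of a multinomial logistic regression with tidymodels, using the built-in iris data (three classes) purely as an illustrative example:

multi_spec <- 
  multinom_reg() %>%
  set_mode("classification") %>%
  set_engine("nnet")

multi_fit <- fit(multi_spec, Species ~ ., data = iris)
predict(multi_fit, new_data = iris, type = "prob")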

5.7.4 Model limitations

There are models that could make better classifications when:

  • There is a substantial separation between the \(Y\) classes.
  • The predictors \(X\) are approximately normal in each class and the sample size is small.
  • The decision boundary is not linear.

5.7.5 Coding example

To perform logistic regression we just need to create the model specification with logistic_reg() and the glm engine.

Smarket_train <- 
  Smarket %>%
  filter(Year != 2005)

Smarket_test <- 
  Smarket %>%
  filter(Year == 2005)

lr_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

SmarketLrPredictions <-
  lr_spec %>%
  fit(Direction ~ Lag1 + Lag2, data = Smarket_train) |>
  augment(new_data = Smarket_test) 


conf_mat(SmarketLrPredictions, truth = Direction, estimate = .pred_class) 
          Truth
Prediction Down  Up
      Down   35  35
      Up     76 106
accuracy(SmarketLrPredictions, truth = Direction, estimate = .pred_class) 
# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.560

5.8 Generalized Additive Models (GAM)

They provide a general framework for extending a standard linear model by allowing non-linear functions of each of the variables, while maintaining additivity.

To transform each predictor we have the following options:

  • A constant for each categorical level (step function)
  • Polynomial regression
  • Natural splines (fit with least squares)
  • Smoothing splines (optimized with backfitting)
  • Local regression

Backfitting fits a model involving multiple predictors by repeatedly updating the fit for each predictor in turn, holding the others fixed.

Pros:

  1. It finds relationships that a linear model would miss without applying many transformations, as it can fit a non-linear \(f_j\) to each \(X_j\).

  2. We can examine the effect of each predictor on \(Y\) individually, as the model is additive.

  3. The smoothness of each function \(f_j\) can be summarized via degrees of freedom.

Cons:

  1. Important interactions can be missed, but they can be added manually.

5.8.1 GAM Regression

To predict a numeric variable this method fits a function with the following form:

\[ y_i = \beta_0 + f_1(x_{i1}) + f_2(x_{i2}) + \dots + f_p(x_{ip}) + \epsilon_i \]

By holding the other predictors constant we can plot the effect of each function for each predictor in the Wage example:

5.8.2 GAM Classification

To predict a categorical variable this method fits a function with the following form:

\[ \log \left( \frac{p(X)}{1-p(X)} \right)= \beta_0 + f_1(x_{i1}) + f_2(x_{i2}) + \dots + f_p(x_{ip}) \]

By holding the other predictors constant we can plot the effect of each function for each predictor in the Wage example:
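A minimal sketch of a classification GAM fit directly with mgcv, assuming the Wage data from ISLR2 and an illustrative binary indicator for wage > 250:

library(mgcv)

# Illustrative 0/1 outcome: indicator of a wage above 250
Wage_high <- ISLR2::Wage
Wage_high$high <- as.numeric(Wage_high$wage > 250)

# Smooth term for age, parametric term for the education factor
gam_cls <- gam(high ~ s(age) + education, family = binomial, data = Wage_high)
summary(gam_cls)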

5.8.3 Coding example

The mgcv package supports the following families, chosen based on the outcome variable type.

  • Gaussian (Default): Regular regression
  • Binomial: Probabilities
  • Poisson/Quasipoisson: counts
library(tidymodels)
library(mgcv)

load("data/Soybean.RData")

gam_fitted <- gen_additive_mod(mode = "regression") |>
  set_engine(engine = "mgcv",
             family = gaussian) |>
  fit(weight ~ s(Time), data = soybean_train) 

augment(gam_fitted, 
        new_data = soybean_train) |>
  ggplot(aes(Time, weight))+
  geom_point()+
  geom_line(aes(y = .pred),
            color = "blue",
            linewidth = 1)+
  theme_light()

extract_fit_engine(gam_fitted) |>
  summary()

Family: gaussian 
Link function: identity 

Formula:
weight ~ s(Time)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   6.1645     0.1143   53.93   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
          edf Ref.df     F p-value    
s(Time) 8.495   8.93 338.2  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.902   Deviance explained = 90.4%
GCV = 4.4395  Scale est. = 4.3117    n = 330