Reducible and irreducible error
When analyzing data, the goal is to find a function that, based on some Predictors plus some random noise, explains the Response variable.
\[
Y = f(X) + \epsilon
\]
\(\epsilon\) represents the random error and corresponds to the irreducible error, as it cannot be predicted from the Predictors in regression models. It has a mean of 0 unless we are missing some relevant Predictors.
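A minimal simulation sketch of this setup in Python (assuming NumPy; the true \(f\) and the noise scale below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical true regression function f (chosen only for illustration).
def f(x):
    return np.sin(2 * x) + 0.5 * x

n = 200
X = rng.uniform(-3, 3, size=n)
eps = rng.normal(loc=0.0, scale=0.5, size=n)  # irreducible noise with mean 0
Y = f(X) + eps

# Even a perfect estimate of f cannot predict eps:
# the best achievable test MSE here is Var(eps) = 0.25.
perfect_mse = np.mean((Y - f(X)) ** 2)
print(f"MSE of the true f: {perfect_mse:.3f} (approx Var(eps) = 0.25)")
```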
In classification models, the irreducible error is represented by the Bayes Error Rate.
\[
1 - E\left(
\max_{j} Pr(Y = j|X)
\right)
\]
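A Monte Carlo sketch of this quantity, under an assumed two-class setup with equal priors and Gaussian class-conditionals (all of the distributional choices are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Hypothetical setup: Pr(Y=0) = Pr(Y=1) = 0.5,
# X|Y=0 ~ N(-1, 1) and X|Y=1 ~ N(1, 1).
n = 100_000
y = rng.integers(0, 2, size=n)
x = rng.normal(loc=np.where(y == 1, 1.0, -1.0), scale=1.0)

# Posterior Pr(Y=1|X=x) via Bayes' rule with equal priors.
p1 = norm.pdf(x, 1.0, 1.0)
p0 = norm.pdf(x, -1.0, 1.0)
post1 = p1 / (p0 + p1)

# Bayes error rate: 1 - E[max_j Pr(Y=j|X)].
bayes_error = 1 - np.mean(np.maximum(post1, 1 - post1))
print(f"Estimated Bayes error rate: {bayes_error:.3f}")  # approx Phi(-1) = 0.159
```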
An error is reducible if we can improve the accuracy of \(\hat{f}\) by using a more appropriate statistical learning technique to estimate \(f\).
The challenge in achieving that goal is that, at the outset, we don’t know how much of the error corresponds to each type.
\[
\begin{split}
E(Y-\hat{Y})^2 & = E[f(X) + \epsilon - \hat{f}(X)]^2 \\
& = \underbrace{[f(X)- \hat{f}(X)]^2}_\text{Reducible} +
\underbrace{Var(\epsilon)}_\text{Irreducible}
\end{split}
\]
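A small simulation makes the decomposition concrete: holding \(X\) and \(\hat{f}\) fixed and averaging over the noise, the MSE splits into the two terms above. Both the true \(f\) and the deliberately biased estimate \(\hat{f}\) below are hypothetical choices:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical true f and a fixed, deliberately biased estimate f_hat.
f = lambda x: np.sin(2 * x)
f_hat = lambda x: 0.8 * np.sin(2 * x)

x0 = 1.0
sigma = 0.5                              # sd of the irreducible noise
y = f(x0) + rng.normal(0, sigma, size=1_000_000)

mse = np.mean((y - f_hat(x0)) ** 2)
reducible = (f(x0) - f_hat(x0)) ** 2     # [f(X) - f_hat(X)]^2
irreducible = sigma ** 2                 # Var(eps)

print(f"E(Y - Y_hat)^2          ~ {mse:.4f}")
print(f"reducible + irreducible = {reducible + irreducible:.4f}")
```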
The reducible error can also be split into two parts:
- Variance refers to the amount by which \(\hat{f}\) would change if we estimated it using a different training data set. If a method has high variance, then small changes in the training data can result in large changes in \(\hat{f}\).
- Squared bias refers to the error introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model, such as a linear model. Bias is the difference between the expected (or average) prediction of our model and the correct value we are trying to predict.

For a given point \(x_{0}\), these pieces combine into the expected test MSE:
\[
E(y_{0} - \hat{f}(x_{0}))^2 =
Var(\hat{f}(x_{0})) +
[Bias(\hat{f}(x_{0}))]^2 +
Var(\epsilon)
\]
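The variance and squared-bias terms can be estimated by refitting \(\hat{f}\) on many simulated training sets and examining its predictions at a single point \(x_{0}\). The true function, noise level, and polynomial fit below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

f = lambda x: np.sin(2 * x)   # hypothetical true function
sigma, n, x0 = 0.5, 50, 1.0
reps, degree = 2000, 3        # training sets to draw; polynomial degree

preds = np.empty(reps)
for r in range(reps):
    X = rng.uniform(-3, 3, size=n)
    Y = f(X) + rng.normal(0, sigma, size=n)
    coefs = np.polyfit(X, Y, deg=degree)   # refit f_hat on each training set
    preds[r] = np.polyval(coefs, x0)

variance = preds.var()                     # Var(f_hat(x0)) across training sets
bias_sq = (preds.mean() - f(x0)) ** 2      # [Bias(f_hat(x0))]^2
print(f"Var(f_hat(x0))    = {variance:.4f}")
print(f"Bias(f_hat(x0))^2 = {bias_sq:.4f}")
print(f"sum + Var(eps)    = {variance + bias_sq + sigma**2:.4f}")  # expected test MSE at x0
```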
Our challenge lies in finding a method for which both the variance and the squared bias are low.
Types of models
- Parametric methods
  - Make an assumption about the functional form, for example assuming linearity.
  - Estimate a small number of parameters based on training data.
  - Are easy to interpret.
  - Tend to outperform non-parametric approaches when there is a small number of observations per predictor.
- Non-parametric methods
  - Don’t make an assumption about the functional form, so they can accurately fit a wider range of possible shapes for \(f\).
  - Need a large number of observations in order to obtain an accurate estimate of \(f\).
  - The data analyst must select a level of smoothness (degrees of freedom).
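A quick sketch comparing the two approaches on the same nonlinear data, assuming scikit-learn is available (the data-generating function and the `n_neighbors` value are arbitrary choices):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(4)

# Nonlinear truth (hypothetical), so the linearity assumption is wrong.
f = lambda x: np.sin(2 * x)
X_train = rng.uniform(-3, 3, size=(200, 1))
y_train = f(X_train.ravel()) + rng.normal(0, 0.3, size=200)
X_test = np.linspace(-3, 3, 500).reshape(-1, 1)

# Parametric: assumes Y is linear in X and estimates two parameters.
lin = LinearRegression().fit(X_train, y_train)

# Non-parametric: no assumed form; smoothness is set by n_neighbors.
knn = KNeighborsRegressor(n_neighbors=10).fit(X_train, y_train)

for name, model in [("linear", lin), ("knn", knn)]:
    # Compare predictions against the true f (reducible error only).
    mse = np.mean((model.predict(X_test) - f(X_test.ravel())) ** 2)
    print(f"{name}: test MSE vs true f = {mse:.3f}")
```

Here the linear model's bias dominates because the truth is nonlinear, while the KNN fit tracks \(f\) more closely at the cost of higher variance.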