Feature Selection and Model Evaluation

In modern data science, choosing the right features and evaluating models accurately are as important as building the model itself. This lesson walks through the most common feature selection techniques, the role of regularization in handling multicollinearity, and the pitfalls of evaluating models on imbalanced data. By the end you will be able to apply Elastic Net, Lasso, and Ridge correctly, understand when a wrapper method may overfit, and carry out a sound principal component analysis (PCA).

Understanding Multicollinearity and Regularization

Multicollinearity occurs when two or more predictor variables are highly correlated. This redundancy inflates the variance of coefficient estimates, making the model unstable. Regularization adds a penalty to the loss function, shrinking coefficients and, in some cases, forcing them to zero.

Which regularization method can both reduce coefficient variance and set some coefficients to zero?

The answer is Lasso regression. Lasso (Least Absolute Shrinkage and Selection Operator) uses an ℓ1 penalty, which encourages sparsity by driving less important coefficients exactly to zero. This dual effect simultaneously combats multicollinearity and performs automatic feature selection.

Ridge regression applies an ℓ2 penalty – it shrinks coefficients but never eliminates them.
Elastic Net combines ℓ1 and ℓ2 penalties, offering a balance between Ridge’s stability and Lasso’s sparsity.
Standard linear regression provides no penalty, leaving multicollinearity unchecked.
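The contrast between the two penalties can be seen on a small synthetic example. The sketch below assumes scikit-learn and NumPy are available; the data, the penalty strengths, and the variable names are illustrative choices, not values from this lesson.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Hypothetical data: x1 is a near-duplicate of x0, and the last two columns are pure noise.
rng = np.random.default_rng(0)
n = 300
x0 = rng.standard_normal(n)
x1 = x0 + 0.05 * rng.standard_normal(n)
X = np.column_stack([x0, x1, rng.standard_normal((n, 3))])
y = 3.0 * x0 + 2.0 * X[:, 2] + rng.standard_normal(n)

# Penalty strengths chosen for illustration only; in practice they are tuned.
lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("Exact zeros (Lasso):", int((lasso.coef_ == 0).sum()))
print("Exact zeros (Ridge):", int((ridge.coef_ == 0).sum()))
```

Lasso typically zeroes out the noise columns and may keep only one of the two correlated columns, while Ridge spreads weight across both and leaves every coefficient nonzero.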

High‑Dimensional Data and the Elastic Net

When the number of features far exceeds the number of observations (e.g., 500 features and only 100 samples), traditional methods struggle. Ordinary least squares no longer has a unique solution, so the coefficients are indeterminate, and the model can fit the training data perfectly while generalizing poorly.

Which technique is most beneficial for a dataset with 500 features and 100 observations?

The best choice among the regularization methods discussed here is Elastic Net. Elastic Net retains the variable-selection capability of Lasso while preserving the grouping effect of Ridge, which matters when predictors are highly correlated, a common situation in high-dimensional spaces.

Key advantages:

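Handles p > n scenarios gracefully; unlike Lasso, it can select more than n features.
Reduces variance without discarding all correlated variables.
Uses a mixing parameter (α) to balance the ℓ1 and ℓ2 penalties, alongside an overall penalty strength (λ); both are usually tuned by cross-validation.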
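A minimal sketch of the 500-feature, 100-observation case, assuming scikit-learn is available. Note the naming clash: the mixing weight that this lesson calls α is `l1_ratio` in scikit-learn, while scikit-learn's `alpha` is the overall penalty strength. The synthetic data and the parameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Hypothetical p > n data: only the first 10 of 500 features carry signal.
rng = np.random.default_rng(0)
n_samples, n_features = 100, 500
X = rng.standard_normal((n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:10] = 3.0
y = X @ true_coef + 0.5 * rng.standard_normal(n_samples)

# alpha = overall penalty strength, l1_ratio = share of the l1 penalty (illustrative values).
model = ElasticNet(alpha=0.5, l1_ratio=0.5, max_iter=10_000)
model.fit(X, y)

n_selected = int(np.count_nonzero(model.coef_))
print(f"{n_selected} of {n_features} coefficients are nonzero")
```

In practice both parameters would be chosen by cross-validation, for example with `ElasticNetCV`, rather than fixed by hand.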

Wrapper Methods and the Risk of Overfitting

Wrapper methods evaluate subsets of features by training a model on each candidate set. Because they use the same training data repeatedly, they can inadvertently tailor the feature set to the idiosyncrasies of that data.

Why might a wrapper‑selected feature subset overfit the training data?

The primary reason is that many models are trained on the same data. Each evaluation uses the same observations, so the search process can exploit noise patterns that do not generalize. Even though cross‑validation is often employed, the repeated exposure to the same folds can still lead to optimistic performance estimates.

Mitigation strategies:

Use nested cross‑validation to separate feature‑selection and model‑assessment steps.
Limit the size of the search space, for example by capping the number of features or by using a greedy or heuristic search (e.g., genetic algorithms) instead of an exhaustive one.
Combine wrapper results with filter scores to add a model-independent perspective.
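Nested cross-validation can be sketched as follows, assuming scikit-learn. Recursive feature elimination (`RFE`) stands in here as the wrapper method; the dataset, the candidate subset sizes, and the fold counts are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical data: 20 features, 4 of them informative.
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=4, random_state=0
)

# The wrapper step lives inside the pipeline, so it is refit on every training fold.
pipe = Pipeline([
    ("select", RFE(LogisticRegression(max_iter=1000))),
    ("model", LogisticRegression(max_iter=1000)),
])

# Inner loop: choose how many features to keep.
inner = GridSearchCV(pipe, {"select__n_features_to_select": [2, 4, 8]}, cv=3)

# Outer loop: estimate performance on folds never seen by the selection step.
outer_scores = cross_val_score(inner, X, y, cv=5)
print("Outer-fold accuracy:", outer_scores.round(3), "mean:", outer_scores.mean().round(3))
```

Because feature selection is repeated inside each outer training fold, the outer scores are not contaminated by the search that picked the features.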

Evaluating Models on Imbalanced Data

When the positive class constitutes only a small fraction of the dataset (e.g., 5%), some evaluation metrics become misleading.

Which metric becomes unreliable in this scenario?

Accuracy is the metric that loses reliability. A naïve model that always predicts the majority (negative) class would achieve 95% accuracy, yet it would completely fail to detect the minority class.

Better alternatives include:

Recall – measures the ability to capture the positive class.
Precision – evaluates how many predicted positives are correct.
F1‑score – the harmonic mean of precision and recall, balancing both concerns.
Area under the ROC curve (AUC-ROC) or the precision-recall curve (AUC-PR); the latter is usually more informative when positives are rare.

When reporting results, always accompany accuracy with at least one metric that reflects performance on the minority class.
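The 5% example can be worked through directly. The sketch below uses only the standard library; the labels are hypothetical, and precision, recall, and F1 are set to 0.0 when their denominator is zero, which is a convention assumed here.

```python
# Hypothetical labels: 5 positives among 100 samples.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100  # naive model: always predict the majority (negative) class

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)
# Convention assumed here: a metric with a zero denominator is reported as 0.0.
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Accuracy comes out at 0.95 while recall and F1 are both zero, which is exactly the failure that accuracy alone hides.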

Principal Component Analysis (PCA): From Standardization to Eigenvalues

PCA is a dimensionality‑reduction technique that transforms correlated variables into a set of orthogonal components. The typical workflow consists of three main steps.

Step 1 – Standardize the variables

Because PCA is sensitive to scale, each feature is centered (mean = 0) and scaled (standard deviation = 1). This ensures that variables with larger units do not dominate the analysis.

Step 2 – Compute the covariance matrix

After standardization, the next step is to compute the covariance matrix. The covariance matrix captures how each pair of variables varies together. For standardized data, the covariance matrix is equivalent to the correlation matrix.

Step 3 – Eigen‑decomposition

Decompose the covariance matrix into eigenvalues and eigenvectors. The eigenvectors define the direction of each principal component, while the eigenvalues indicate the amount of variance explained.

True statement about eigenvalues in PCA

The correct statement is: The sum of eigenvalues equals the total variance of the standardized data. Because each standardized variable has variance 1, the total variance equals the number of variables, and the eigenvalues partition that total variance among the components.
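The three steps and the eigenvalue property can be checked numerically. This is a minimal NumPy sketch on hypothetical data; it standardizes with `ddof=1` so that the result matches the default normalization of `np.cov`.

```python
import numpy as np

# Hypothetical data: 200 samples, 4 features on very different scales.
rng = np.random.default_rng(0)
raw = rng.standard_normal((200, 4)) * np.array([1.0, 10.0, 100.0, 0.1])
raw[:, 1] += 5.0 * raw[:, 0]  # make two features correlated

# Step 1: standardize each feature (mean 0, standard deviation 1).
Z = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

# Step 2: covariance matrix of the standardized data (equals the correlation matrix).
C = np.cov(Z, rowvar=False)

# Step 3: eigen-decomposition; eigh handles symmetric matrices and returns ascending order.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

components = Z @ eigvecs  # data projected onto the principal components
explained_ratio = eigvals / eigvals.sum()

print("Eigenvalues:", eigvals.round(3))
print("Sum of eigenvalues:", eigvals.sum().round(6), "| number of variables:", Z.shape[1])
print("Explained variance ratio:", explained_ratio.round(3))
```

The eigenvalues differ from one another, but their sum matches the number of standardized variables.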

Common misconceptions:

"Larger eigenvalues correspond to components that explain less variance" – false; larger eigenvalues mean more explained variance.
"Eigenvalues determine the direction of the original variables" – false; eigenvectors determine direction.
"Eigenvalues are always equal to one after standardization" – false; only the trace (sum) equals the number of variables.

Forward Selection and Its Stopping Criterion

Forward selection is a greedy, step-wise wrapper method that starts with an empty model and adds the most promising feature at each iteration.

When does the algorithm typically stop?

The process halts when adding any remaining feature no longer improves the chosen criterion: for example, AIC or BIC stops decreasing, adjusted R² or cross-validated performance stops improving, or the candidate feature fails a significance test (such as a partial F-test) at the chosen level. This ensures that only features that contribute meaningful predictive power are retained.

Alternative stopping rules include:

Reaching a pre‑specified maximum number of features.
Exhausting a computational budget.
Observing a plateau in validation performance.
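The greedy loop and its stopping rule can be sketched in a few lines of NumPy. This version uses AIC of an ordinary least-squares fit as the criterion; the helper names `ols_aic` and `forward_select` and the synthetic data are hypothetical, introduced only for illustration.

```python
import numpy as np

def ols_aic(X, y, cols):
    """AIC (up to an additive constant) of an OLS fit with intercept on the given columns."""
    n = len(y)
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(np.sum((y - A @ beta) ** 2))
    return n * np.log(rss / n) + 2 * A.shape[1]

def forward_select(X, y):
    """Greedy forward selection: add the feature that lowers AIC most; stop when none does."""
    selected, remaining = [], list(range(X.shape[1]))
    best = ols_aic(X, y, selected)  # intercept-only model
    while remaining:
        scores = {j: ols_aic(X, y, selected + [j]) for j in remaining}
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best:  # stopping rule: no candidate improves the criterion
            break
        selected.append(j_best)
        remaining.remove(j_best)
        best = scores[j_best]
    return selected

# Hypothetical data: only features 0 and 3 influence the response.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 6))
y = 2.0 * X[:, 0] - 3.0 * X[:, 3] + rng.standard_normal(200)

selected = forward_select(X, y)
print("Selected feature indices:", selected)
```

Swapping `ols_aic` for a cross-validated error or a significance test changes the stopping rule without changing the loop itself.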

Why Ridge Regression Does Not Perform Feature Selection

Ridge regression adds an ℓ2 penalty to the loss function, shrinking coefficients toward zero but never exactly to zero. This property preserves all predictors in the model, which is why Ridge is considered a regularization technique for handling multicollinearity rather than a feature-selection method.

The key reason is that the penalty is the sum of squared coefficients, whose gradient vanishes as a coefficient approaches zero, so nothing pushes a small coefficient all the way to zero. Consequently, every variable remains in the final model, albeit with reduced influence.

When pure feature selection is required, Lasso or Elastic Net should be preferred.
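A worked special case makes the contrast concrete. Assume an orthonormal design, X^T X = I (an assumption made here purely for illustration), and write \hat\beta_j^{OLS} for the least-squares estimate of coefficient j. With the penalties scaled as shown, both problems have closed-form solutions:

```latex
% Ridge: quadratic penalty, proportional shrinkage
\min_{\beta}\; \tfrac{1}{2}\lVert y - X\beta\rVert_2^2 + \tfrac{\lambda}{2}\lVert \beta\rVert_2^2
\quad\Longrightarrow\quad
\hat\beta_j^{\text{ridge}} = \frac{\hat\beta_j^{\text{OLS}}}{1 + \lambda}

% Lasso: absolute-value penalty, soft thresholding
\min_{\beta}\; \tfrac{1}{2}\lVert y - X\beta\rVert_2^2 + \lambda\lVert \beta\rVert_1
\quad\Longrightarrow\quad
\hat\beta_j^{\text{lasso}} = \operatorname{sign}\!\bigl(\hat\beta_j^{\text{OLS}}\bigr)\,
\max\!\bigl(\lvert\hat\beta_j^{\text{OLS}}\rvert - \lambda,\; 0\bigr)
```

The Ridge estimate is the least-squares estimate divided by 1 + λ, so it is zero only if the least-squares estimate already was. The Lasso estimate is exactly zero whenever the least-squares estimate is no larger than λ in absolute value.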

Summary & Key Takeaways

Lasso provides both variance reduction and sparsity; it is the go‑to method when you need automatic feature elimination.
Elastic Net shines in high‑dimensional, highly correlated settings, blending the strengths of Lasso and Ridge.
Wrapper methods can overfit because they repeatedly train on the same data; use nested cross‑validation to obtain unbiased estimates.
Accuracy is misleading on imbalanced data; prioritize recall, precision, F1‑score, or AUC‑PR.
In PCA, after standardization the next step is to compute the covariance matrix, then perform eigen‑decomposition; the sum of eigenvalues equals total variance.
Forward selection stops when no remaining feature yields a statistically significant improvement.
Ridge regression shrinks coefficients but never eliminates them, so it does not perform feature selection.

Mastering these concepts equips you to build robust, interpretable models that scale from small tabular data to high‑dimensional genomic or text datasets.

Frequently Asked Questions (FAQ)

Can I use Lasso when my predictors are highly correlated?

Lasso may arbitrarily select one variable from a group of correlated predictors, potentially discarding useful information. In such cases, Elastic Net is preferable because its ℓ2 component keeps correlated variables together.

How many principal components should I retain?

Retain enough components to explain a desired proportion of variance (commonly 80-90%). Examine the cumulative explained-variance plot, or a scree plot of the eigenvalues, and look for an “elbow” beyond which additional components contribute only marginal gains.
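The variance-threshold rule reduces to a cumulative sum. The sketch below uses only the standard library; the eigenvalues and the 85% threshold are hypothetical values chosen for illustration.

```python
from itertools import accumulate

# Hypothetical eigenvalues from a 4-variable standardized PCA, sorted in descending order.
eigenvalues = [2.5, 1.0, 0.3, 0.2]
threshold = 0.85  # target share of total variance (illustrative)

total = sum(eigenvalues)
cumulative = list(accumulate(ev / total for ev in eigenvalues))

# Smallest number of components whose cumulative share reaches the threshold.
k = next(i + 1 for i, c in enumerate(cumulative) if c >= threshold)

print("Cumulative explained variance:", [round(c, 3) for c in cumulative])
print("Components to retain:", k)
```

Here the first two components already explain 87.5% of the variance, so two components satisfy an 85% target.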

Is cross‑validation enough to prevent overfitting in wrapper methods?

Cross‑validation reduces optimism but does not fully eliminate it when the same folds are reused for many feature‑subset evaluations. Nested cross‑validation, where an outer loop assesses performance and an inner loop conducts feature selection, provides a more reliable estimate.

When should I prefer filter methods over wrappers?

Filter methods are computationally cheap and independent of any learning algorithm, making them ideal for very large feature spaces or when you need a quick baseline. However, they may miss interactions that wrappers can capture.
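A simple filter can be sketched as a ranking by absolute Pearson correlation with the target, with no model fitted at all. The NumPy example below uses hypothetical data; correlation is only one of several possible filter scores.

```python
import numpy as np

# Hypothetical data: only feature 1 drives the target.
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 5))
y = 2.0 * X[:, 1] + rng.standard_normal(300)

# Filter score: absolute Pearson correlation between each feature and the target.
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
ranking = np.argsort(scores)[::-1]  # best-scoring feature first

print("Scores:", scores.round(3))
print("Ranking (best first):", ranking)
```

Each feature is scored on its own, which is what makes the approach fast and also why it cannot detect features that matter only in combination.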

Review Questions

When multicollinearity is present, which regularization method can both reduce coefficient variance and set some coefficients to zero?

A dataset with 500 features and only 100 observations is likely to benefit most from which technique?

In a wrapper method, why might the selected feature subset overfit the training data?

Which metric becomes unreliable when the positive class represents only 5% of the data?

During PCA, after standardizing variables, what is the next step to obtain the principal components?

Which of the following statements about the eigenvalues in PCA is true?

In a forward selection process, what criterion typically stops the algorithm?

Why does Ridge regression not perform feature selection?

When comparing two models, why is Adjusted R² preferred over plain R² for variable selection?

In Linear Discriminant Analysis, which matrix captures the separation between class means?