Regularization Techniques

L1 and L2 regularization for preventing overfitting

Imagine training for a marathon! If you train TOO hard on the same route every day, you'll be amazing on that specific route but struggle on different terrain (overfitting). Regularization is like adding variety to your training - running on different surfaces, different distances - so you become a versatile runner who performs well anywhere. In machine learning, regularization adds a 'penalty' for overly complex models, forcing them to stay simple and generalize better to new data!

What is Regularization?

Regularization is a technique that prevents overfitting by adding a penalty term to the loss function. This penalty discourages the model from becoming too complex by penalizing large weights. Think of it as a 'simplicity constraint' - the model must balance between fitting the data well and keeping its parameters small and simple.

Modified Loss Function with Regularization

Loss = Original Loss + λ × Regularization Term
(Balance data fit + simplicity)

Original Loss: How well model fits training data (e.g., MSE)

Regularization Term: Penalty for model complexity (large weights)

λ (lambda): Hyperparameter controlling tradeoff strength
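
To make the formula concrete, here is a minimal numeric sketch (toy values, made up purely for illustration) showing how the penalty term is added to the data-fit loss, before the fuller example below:

python
# Minimal sketch: computing a regularized loss by hand (toy, illustrative values)
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])        # toy targets
y_pred = np.array([1.1, 1.9, 3.2])        # toy predictions
weights = np.array([0.5, -2.0, 0.1])      # toy model weights
lam = 0.1                                  # λ, the regularization strength

mse = np.mean((y_true - y_pred) ** 2)      # original loss (data fit)
l2_penalty = np.sum(weights ** 2)          # L2 term: Σ w²
l1_penalty = np.sum(np.abs(weights))       # L1 term: Σ |w|

print(f"MSE only       : {mse:.4f}")
print(f"Ridge loss (L2): {mse + lam * l2_penalty:.4f}")   # MSE + λ Σ w²
print(f"Lasso loss (L1): {mse + lam * l1_penalty:.4f}")   # MSE + λ Σ |w|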

python
# Regularization Concept: Without vs With
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
# Generate data
np.random.seed(42)  # fix the random seed so the noise (and printed weights) are reproducible
X = np.linspace(0, 1, 20).reshape(-1, 1)
y = (X**2).ravel() + np.random.normal(0, 0.1, X.shape[0])  # 1-D target keeps coef_ 1-D
# Create complex polynomial features (15th degree - prone to overfitting!)
poly = PolynomialFeatures(degree=15)
X_poly = poly.fit_transform(X)
# WITHOUT REGULARIZATION
print("WITHOUT REGULARIZATION (Standard Linear Regression):")
model_no_reg = LinearRegression()
model_no_reg.fit(X_poly, y)
# Check magnitude of weights
weights_no_reg = model_no_reg.coef_
print(f" Max weight magnitude: {np.max(np.abs(weights_no_reg)):.2f}")
print(f" Sum of squared weights: {np.sum(weights_no_reg**2):.2f}")
print(" HUGE weights! Model memorizes noise\n")
# WITH L2 REGULARIZATION (Ridge)
print("WITH L2 REGULARIZATION (Ridge Regression):")
model_ridge = Ridge(alpha=1.0) # alpha = λ (regularization strength)
model_ridge.fit(X_poly, y)
weights_ridge = model_ridge.coef_
print(f" Max weight magnitude: {np.max(np.abs(weights_ridge)):.2f}")
print(f" Sum of squared weights: {np.sum(weights_ridge**2):.2f}")
print(" Smaller weights! Model is simpler and generalizes better\n")
# The key insight:
print("HOW IT WORKS:")
print("Original Loss = MSE(predictions, actual)")
print("Regularized Loss = MSE(predictions, actual) + λ × Σ(weights²)")
print("\nModel must balance:")
print(" 1. Fitting data well (low MSE)")
print(" 2. Keeping weights small (low penalty)")
print("\nResult: Prevents overfitting by constraining model complexity!")
# Example output (exact values depend on the random noise):
# WITHOUT REGULARIZATION:
# Max weight magnitude: 156.34
# Sum of squared weights: 18432.12
# HUGE weights!
#
# WITH L2 REGULARIZATION:
# Max weight magnitude: 2.15
# Sum of squared weights: 12.43
# Much smaller weights!

Types of Regularization

Two main techniques with different behaviors:

L2 Regularization (Ridge)

Loss = MSE + λ Σ w²

Adds the sum of SQUARED weights as a penalty. Shrinks all weights proportionally toward zero but never exactly to zero; a small gradient sketch after the characteristics list below shows why.

Best for:

  • All features are relevant
  • Want to keep all features
  • Smooth optimization (differentiable)
  • Multicollinearity (correlated features)

💡 Characteristics:

  • Shrinks large weights more than small ones
  • No feature selection (all weights non-zero)
  • Computationally efficient
  • Handles correlated features well
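
Why L2 shrinks weights without zeroing them: the penalty's gradient is 2λw, so each gradient step multiplies the weight by a factor just below 1, and the pull toward zero fades as the weight gets smaller. A minimal sketch with toy numbers (ignoring the data-fit gradient):

python
# Sketch: effect of the L2 penalty gradient (2·λ·w) alone on a single weight
lam, lr = 0.1, 0.5        # toy regularization strength and learning rate
w = 1.0
for step in range(5):
    w = w - lr * (2 * lam * w)   # multiplicative shrinkage: w ← w · (1 − 2·lr·λ)
    print(f"step {step + 1}: w = {w:.4f}")
# w decays geometrically toward 0 (0.9, 0.81, 0.729, ...) but never reaches exactly 0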

L1 Regularization (Lasso)

Loss = MSE + λ Σ |w|

Adds the sum of ABSOLUTE weights as a penalty. Drives some weights to EXACTLY ZERO, performing automatic feature selection; the soft-thresholding sketch after the comparison code below shows how.

Best for:

  • Feature selection / sparse models
  • Many irrelevant features
  • Want interpretability
  • High-dimensional data

💡 Characteristics:

  • Sets unimportant weights to exactly 0
  • Built-in feature selection
  • Creates sparse models (interpretable)
  • Not differentiable at zero
python
# L1 (Lasso) vs L2 (Ridge) Comparison
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
# Create data with high-degree polynomial features
np.random.seed(42)  # reproducible noise
X = np.linspace(0, 1, 30).reshape(-1, 1)
y = 2 * X.ravel() + np.random.normal(0, 0.1, X.shape[0])  # 1-D target so coef_ arrays are 1-D
# Add many polynomial features (most are irrelevant for linear relationship!)
poly = PolynomialFeatures(degree=10)
X_poly = poly.fit_transform(X)
print(f"Number of features: {X_poly.shape[1]}\n")
# L2 REGULARIZATION (Ridge)
print("L2 REGULARIZATION (Ridge):")
ridge = Ridge(alpha=1.0)
ridge.fit(X_poly, y)
ridge_weights = ridge.coef_
print(f" Non-zero weights: {np.sum(ridge_weights != 0)}/{len(ridge_weights)}")
print(f" Max weight: {np.max(np.abs(ridge_weights)):.4f}")
print(f" All weights: {ridge_weights[:5]}...") # Show first 5
print(" Keeps ALL features (no feature selection)\n")
# L1 REGULARIZATION (Lasso)
print("L1 REGULARIZATION (Lasso):")
lasso = Lasso(alpha=0.01)
lasso.fit(X_poly, y)
lasso_weights = lasso.coef_
print(f" Non-zero weights: {np.sum(lasso_weights != 0)}/{len(lasso_weights)}")
print(f" Max weight: {np.max(np.abs(lasso_weights)):.4f}")
print(f" Weights: {lasso_weights[:5]}...") # Many are EXACTLY 0!
print(" Automatic feature selection! Eliminated irrelevant features\n")
# ELASTIC NET (Combination of L1 + L2)
from sklearn.linear_model import ElasticNet
print("ELASTIC NET (L1 + L2):")
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5) # 50% L1, 50% L2
elastic.fit(X_poly, y)
elastic_weights = elastic.coef_
print(f" Non-zero weights: {np.sum(elastic_weights != 0)}/{len(elastic_weights)}")
print(" Best of both: feature selection + stability\n")
# Visual comparison
print("WEIGHT COMPARISON:")
print(f"Ridge: shrinks all ({np.sum(np.abs(ridge_weights) < 0.01)} near zero)")
print(f"Lasso: zeros out ({np.sum(lasso_weights == 0)} exactly zero)")
print(f"Elastic: balanced ({np.sum(elastic_weights == 0)} exactly zero)")
# Example output (exact counts vary with noise and alpha):
# L2 REGULARIZATION (Ridge):
# Non-zero weights: 11/11
# Keeps ALL features
#
# L1 REGULARIZATION (Lasso):
# Non-zero weights: 3/11
# Eliminated 8 irrelevant features!
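
For intuition on why Lasso produces exact zeros while Ridge only shrinks, here is a minimal sketch of soft-thresholding, the per-coordinate update that L1 solvers are built around (simplified; real solvers also account for the data-fit term):

python
# Sketch: soft-thresholding snaps small weights to exactly zero
import numpy as np

def soft_threshold(w, threshold):
    # Shrink |w| by `threshold`; anything with |w| <= threshold becomes exactly 0
    return np.sign(w) * np.maximum(np.abs(w) - threshold, 0.0)

w = np.array([2.5, 0.8, -0.05, 0.02, -1.3])
print(soft_threshold(w, threshold=0.1))
# -> approximately [ 2.4   0.7   0.    0.   -1.2 ]  (the two small weights become exactly zero)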

L1 vs L2 Comparison

Understanding when to use which:

Aspect              | L2 (Ridge)                         | L1 (Lasso)
Penalty Term        | λ Σ w² (squared)                   | λ Σ |w| (absolute)
Weight Shrinkage    | All weights shrunk proportionally  | Some weights become exactly 0
Feature Selection   | No (keeps all features)            | Yes (automatic)
Solution Sparsity   | Dense (all non-zero)               | Sparse (many zeros)
Computation         | Closed-form solution               | Iterative optimization
Correlated Features | Handles them well                  | ⚠️ Picks one arbitrarily
Use When            | All features relevant              | Many irrelevant features

💡 Choosing Between L1 and L2

  • Use L2 (Ridge): When you believe most features are relevant and you want to keep them all with controlled magnitude
  • Use L1 (Lasso): When you have many features and want automatic feature selection or need an interpretable sparse model
  • Use Elastic Net: When you're unsure or want both feature selection and the ability to keep correlated features together
  • Default recommendation: Try L2 first (most common), then L1 if you need sparsity

Implementation in Practice

How to apply regularization in real models:

python
# Complete Example: Regularization in Practice
from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
import numpy as np
# Load data (many features, some correlated)
from sklearn.datasets import fetch_california_housing
X, y = fetch_california_housing(return_X_y=True)
print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features\n")
# STEP 1: Always standardize features for regularization!
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print("✅ Step 1: Features standardized\n")
# STEP 2: Try different regularization approaches
models = {
    'No Regularization': LinearRegression(),
    'Ridge (L2)': Ridge(alpha=1.0),
    'Lasso (L1)': Lasso(alpha=0.1),
    'Elastic Net': ElasticNet(alpha=0.1, l1_ratio=0.5)
}
print("MODEL COMPARISON (5-Fold Cross-Validation):")
print("-" * 60)
for name, model in models.items():
    # Cross-validation gives an unbiased estimate of generalization error
    scores = cross_val_score(model, X_scaled, y, cv=5,
                             scoring='neg_mean_squared_error')
    rmse = np.sqrt(-scores.mean())
    print(f"{name:20s} | RMSE: {rmse:.4f}")
print("\n")
# STEP 3: Tune lambda (alpha) using GridSearchCV
print("STEP 3: Tuning lambda (alpha) for Ridge:")
print("-" * 60)
ridge = Ridge()
param_grid = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
grid_search = GridSearchCV(ridge, param_grid, cv=5,
                           scoring='neg_mean_squared_error')
grid_search.fit(X_scaled, y)
print(f"Best alpha: {grid_search.best_params_['alpha']}")
print(f"Best RMSE: {np.sqrt(-grid_search.best_score_):.4f}\n")
# Visualize effect of alpha
for alpha, score in zip(param_grid['alpha'], -grid_search.cv_results_['mean_test_score']):
    rmse = np.sqrt(score)
    print(f"  α = {alpha:7.3f}   RMSE = {rmse:.4f}")
print("\n💡 Notice: RMSE is typically U-shaped!")
print("   - Too small α: risk of overfitting (not enough regularization)")
print("   - Too large α: risk of underfitting (too much regularization)")
print("   - Optimal α: best tradeoff\n")
# STEP 4: Compare feature usage
print("STEP 4: Feature Selection Comparison:")
print("-" * 60)
ridge_best = Ridge(alpha=grid_search.best_params_['alpha'])
ridge_best.fit(X_scaled, y)
lasso = Lasso(alpha=0.1)
lasso.fit(X_scaled, y)
print(f"Ridge: {np.sum(ridge_best.coef_ != 0)}/{len(ridge_best.coef_)} features used (all)")
print(f"Lasso: {np.sum(lasso.coef_ != 0)}/{len(lasso.coef_)} features used (sparse!)")
print(f"\nLasso eliminated {np.sum(lasso.coef_ == 0)} features automatically!")
# Example output (exact values vary with sklearn version and CV splits):
# MODEL COMPARISON:
# No Regularization | RMSE: 0.7234
# Ridge (L2) | RMSE: 0.7195 Better!
# Lasso (L1) | RMSE: 0.7258
# Elastic Net | RMSE: 0.7203
#
# Best alpha: 10
# Best RMSE: 0.7180
#
# Feature Selection:
# Ridge: 8/8 features used
# Lasso: 5/8 features used (eliminated 3!)
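
One refinement worth noting: the example above fits the scaler on the full dataset before cross-validation, which leaks a little information across folds. A common pattern is to bundle the scaler and the regularized model so scaling is re-fit inside each fold; a minimal sketch using scikit-learn's Pipeline (same data as above):

python
# Sketch: keep standardization inside each CV fold with a Pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import fetch_california_housing

X, y = fetch_california_housing(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),   # fit on the training folds only, inside CV
    ("ridge", Ridge()),
])
param_grid = {"ridge__alpha": [0.01, 0.1, 1, 10, 100]}  # step-name prefix for pipeline params
search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print("Best alpha:", search.best_params_["ridge__alpha"])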

Key Concepts

L2 Regularization (Ridge)

Adds sum of squared weights to loss: λΣw². Shrinks all weights proportionally but keeps all features. Preferred when all features are relevant.

L1 Regularization (Lasso)

Adds sum of absolute weights to loss: λΣ|w|. Drives some weights to exactly zero, performing automatic feature selection. Preferred for sparse models.

Lambda (λ) Hyperparameter

Controls regularization strength. λ=0: no regularization (overfitting risk). Large λ: strong regularization (underfitting risk). Must tune via cross-validation.

Elastic Net

Combines L1 and L2: λ₁Σ|w| + λ₂Σw². Gets benefits of both - feature selection from L1 and stability from L2. Best of both worlds for many problems.

Interview Tips

  • 💡Regularization prevents overfitting by adding penalty term to loss function, discouraging large weights
  • 💡L2 (Ridge): penalty = λΣw², shrinks all weights, keeps all features. L1 (Lasso): penalty = λΣ|w|, sets some weights to 0, feature selection
  • 💡Lambda (λ) controls tradeoff: higher λ = simpler model (more regularization), lower λ = more complex (less regularization)
  • 💡Modified loss function: Loss = Original Loss + λ × Regularization Term. Model must balance both objectives
  • 💡L2 is differentiable everywhere (smooth optimization), L1 is not differentiable at zero (can cause convergence issues)
  • 💡Use L2 when all features are relevant, L1 when you want feature selection/sparsity, Elastic Net for combination
  • 💡Regularization has a Bayesian interpretation: L2 corresponds to a Gaussian prior on the weights, L1 to a Laplace prior (see the short derivation sketch after this list)
  • 💡Always standardize features before applying regularization (different scales would be penalized differently)
  • 💡Tune λ using cross-validation, not test set. Try logarithmic range: 0.001, 0.01, 0.1, 1, 10, 100
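
For the Bayesian connection mentioned above, a short derivation sketch (a standard result, written in LaTeX; λ absorbs the prior's scale constants):

latex
% MAP estimation: a prior on w turns into a regularization penalty.
% Gaussian prior -> L2 (Ridge); Laplace prior -> L1 (Lasso).
\begin{aligned}
\hat{w}_{\mathrm{MAP}}
  &= \arg\max_{w}\, p(w \mid \mathcal{D})
   = \arg\max_{w}\, p(\mathcal{D} \mid w)\, p(w) \\
  &= \arg\min_{w}\, \big[ -\log p(\mathcal{D} \mid w) \;-\; \log p(w) \big] \\[4pt]
\text{Gaussian prior } p(w) \propto e^{-\|w\|_2^2 / 2\sigma^2}
  &\;\Rightarrow\; -\log p(w) = \lambda \textstyle\sum_j w_j^2 + \text{const} \quad \text{(L2 / Ridge)} \\
\text{Laplace prior } p(w) \propto e^{-\|w\|_1 / b}
  &\;\Rightarrow\; -\log p(w) = \lambda \textstyle\sum_j |w_j| + \text{const} \quad \text{(L1 / Lasso)}
\end{aligned}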