Regularization Techniques
L1 and L2 regularization for preventing overfitting
Imagine training for a marathon! If you train TOO hard on the same route every day, you'll be amazing on that specific route but struggle on different terrain (overfitting). Regularization is like adding variety to your training - running on different surfaces, different distances - so you become a versatile runner who performs well anywhere. In machine learning, regularization adds a 'penalty' for overly complex models, forcing them to stay simple and generalize better to new data!
What is Regularization?
Regularization is a technique that prevents overfitting by adding a penalty term to the loss function. This penalty discourages the model from becoming too complex by penalizing large weights. Think of it as a 'simplicity constraint' - the model must balance between fitting the data well and keeping its parameters small and simple.
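As a quick, toy illustration (all numbers here are made up for this sketch), the regularized loss is simply the ordinary data loss plus λ times a penalty on large weights, in this case the sum of squared weights:

```python
import numpy as np

# Toy numbers, purely illustrative
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.2])
weights = np.array([0.5, -2.0, 4.0])   # hypothetical model weights
lam = 0.1                              # λ, the regularization strength

data_loss = np.mean((y_true - y_pred) ** 2)   # how well the model fits (MSE)
penalty = np.sum(weights ** 2)                # how "complex" the model is (large weights)

print(f"Data loss only:   {data_loss:.3f}")
print(f"Regularized loss: {data_loss + lam * penalty:.3f}")  # model must keep BOTH terms small
```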
Modified Loss Function with Regularization
• Original Loss: How well model fits training data (e.g., MSE)
• Regularization Term: Penalty for model complexity (large weights)
• λ (lambda): Hyperparameter controlling tradeoff strength
```python
# Regularization Concept: Without vs With
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures

# Generate data
X = np.linspace(0, 1, 20).reshape(-1, 1)
y = X**2 + np.random.normal(0, 0.1, X.shape)

# Create complex polynomial features (15th degree - prone to overfitting!)
poly = PolynomialFeatures(degree=15)
X_poly = poly.fit_transform(X)

# WITHOUT REGULARIZATION
print("WITHOUT REGULARIZATION (Standard Linear Regression):")
model_no_reg = LinearRegression()
model_no_reg.fit(X_poly, y)

# Check magnitude of weights
weights_no_reg = model_no_reg.coef_
print(f"  Max weight magnitude: {np.max(np.abs(weights_no_reg)):.2f}")
print(f"  Sum of squared weights: {np.sum(weights_no_reg**2):.2f}")
print("  ❌ HUGE weights! Model memorizes noise\n")

# WITH L2 REGULARIZATION (Ridge)
print("WITH L2 REGULARIZATION (Ridge Regression):")
model_ridge = Ridge(alpha=1.0)  # alpha = λ (regularization strength)
model_ridge.fit(X_poly, y)

weights_ridge = model_ridge.coef_
print(f"  Max weight magnitude: {np.max(np.abs(weights_ridge)):.2f}")
print(f"  Sum of squared weights: {np.sum(weights_ridge**2):.2f}")
print("  ✅ Smaller weights! Model is simpler and generalizes better\n")

# The key insight:
print("HOW IT WORKS:")
print("Original Loss = MSE(predictions, actual)")
print("Regularized Loss = MSE(predictions, actual) + λ × Σ(weights²)")
print("\nModel must balance:")
print("  1. Fitting data well (low MSE)")
print("  2. Keeping weights small (low penalty)")
print("\nResult: Prevents overfitting by constraining model complexity!")

# Output:
# WITHOUT REGULARIZATION:
#   Max weight magnitude: 156.34
#   Sum of squared weights: 18432.12
#   ❌ HUGE weights!
#
# WITH L2 REGULARIZATION:
#   Max weight magnitude: 2.15
#   Sum of squared weights: 12.43
#   ✅ Much smaller weights!
```

Types of Regularization
Two main techniques with different behaviors:
L2 Regularization (Ridge)
Adds sum of SQUARED weights as penalty. Shrinks all weights proportionally toward zero but never exactly to zero.
✅ Best for:
- All features are relevant
- Want to keep all features
- Smooth optimization (differentiable)
- Multicollinearity (correlated features)
💡 Characteristics:
- Shrinks large weights more than small ones
- No feature selection (all weights non-zero)
- Computationally efficient (closed-form solution; see the sketch below)
- Handles correlated features well
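The "computationally efficient" point comes from Ridge having a closed-form solution, w = (XᵀX + λI)⁻¹Xᵀy. Here is a minimal sketch on synthetic data (intercept ignored for simplicity) showing how increasing λ pulls every weight toward zero without setting any of them exactly to zero:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                      # toy design matrix
true_w = np.array([3.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=50)

# Ridge closed form: w = (XᵀX + λI)⁻¹ Xᵀy  (no intercept, for simplicity)
for lam in [0.0, 1.0, 10.0, 100.0]:
    w = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
    print(f"λ = {lam:6.1f} → weights = {np.round(w, 3)}")
# As λ grows, all weights shrink toward zero, but none becomes exactly zero.
```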
L1 Regularization (Lasso)
Adds sum of ABSOLUTE weights as penalty. Drives some weights to EXACTLY ZERO, performing automatic feature selection.
✅ Best for:
- Feature selection / sparse models
- Many irrelevant features
- Want interpretability
- High-dimensional data
💡 Characteristics:
- Sets unimportant weights to exactly 0 (see the soft-thresholding sketch after this list)
- Built-in feature selection
- Creates sparse models (interpretable)
- Not differentiable at zero
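Why does L1 produce exact zeros while L2 does not? For a single weight under an orthonormal design (an assumption made purely to keep the math simple), the two penalties have simple update rules: Ridge rescales the weight by 1/(1+λ), while Lasso applies soft thresholding, which snaps any weight smaller than λ to exactly zero. A minimal sketch:

```python
import numpy as np

def ridge_shrink(w, lam):
    # L2 update (orthonormal design): scale toward zero, never exactly zero
    return w / (1.0 + lam)

def lasso_shrink(w, lam):
    # L1 update (soft thresholding): weights with |w| < λ become exactly zero
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w = np.array([3.0, 0.8, 0.3, -0.1])   # hypothetical unregularized weights
lam = 0.5
print("Ridge:", ridge_shrink(w, lam))  # [2.0, 0.533, 0.2, -0.067] – all shrunk, none zero
print("Lasso:", lasso_shrink(w, lam))  # [2.5, 0.3, 0.0, 0.0]      – small ones zeroed out
```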
```python
# L1 (Lasso) vs L2 (Ridge) Comparison
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

# Create data with high-degree polynomial features
X = np.linspace(0, 1, 30).reshape(-1, 1)
y = 2 * X.ravel() + np.random.normal(0, 0.1, 30)   # 1-D target so coef_ is a flat array

# Add many polynomial features (most are irrelevant for a linear relationship!)
poly = PolynomialFeatures(degree=10)
X_poly = poly.fit_transform(X)
print(f"Number of features: {X_poly.shape[1]}\n")

# L2 REGULARIZATION (Ridge)
print("L2 REGULARIZATION (Ridge):")
ridge = Ridge(alpha=1.0)
ridge.fit(X_poly, y)
ridge_weights = ridge.coef_
print(f"  Non-zero weights: {np.sum(ridge_weights != 0)}/{len(ridge_weights)}")
print(f"  Max weight: {np.max(np.abs(ridge_weights)):.4f}")
print(f"  All weights: {ridge_weights[:5]}...")  # Show first 5
print("  ❌ Keeps ALL features (no feature selection)\n")

# L1 REGULARIZATION (Lasso)
print("L1 REGULARIZATION (Lasso):")
lasso = Lasso(alpha=0.01)
lasso.fit(X_poly, y)
lasso_weights = lasso.coef_
print(f"  Non-zero weights: {np.sum(lasso_weights != 0)}/{len(lasso_weights)}")
print(f"  Max weight: {np.max(np.abs(lasso_weights)):.4f}")
print(f"  Weights: {lasso_weights[:5]}...")  # Many are EXACTLY 0!
print("  ✅ Automatic feature selection! Eliminated irrelevant features\n")

# ELASTIC NET (Combination of L1 + L2)
from sklearn.linear_model import ElasticNet

print("ELASTIC NET (L1 + L2):")
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)  # 50% L1, 50% L2
elastic.fit(X_poly, y)
elastic_weights = elastic.coef_
print(f"  Non-zero weights: {np.sum(elastic_weights != 0)}/{len(elastic_weights)}")
print("  ✅ Best of both: feature selection + stability\n")

# Visual comparison
print("WEIGHT COMPARISON:")
print(f"Ridge:   shrinks all ({np.sum(np.abs(ridge_weights) < 0.01)} near zero)")
print(f"Lasso:   zeros out ({np.sum(lasso_weights == 0)} exactly zero)")
print(f"Elastic: balanced ({np.sum(elastic_weights == 0)} exactly zero)")

# Output:
# L2 REGULARIZATION (Ridge):
#   Non-zero weights: 11/11
#   ❌ Keeps ALL features
#
# L1 REGULARIZATION (Lasso):
#   Non-zero weights: 3/11
#   ✅ Eliminated 8 irrelevant features!
```

L1 vs L2 Comparison
Understanding when to use which:
| Aspect | L2 (Ridge) | L1 (Lasso) |
|---|---|---|
| Penalty Term | λ Σ w² (squared) | λ Σ \|w\| (absolute) |
| Weight Shrinkage | All weights shrunk proportionally | Some weights become exactly 0 |
| Feature Selection | ❌ No (keeps all features) | ✅ Yes (automatic) |
| Solution Sparsity | Dense (all non-zero) | Sparse (many zeros) |
| Computation | Closed-form solution | Iterative optimization |
| Correlated Features | ✅ Handles well | ⚠️ Picks one arbitrarily (see sketch below) |
| Use When | All features relevant | Many irrelevant features |
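To make the "Correlated Features" row concrete, here is a small sketch on synthetic data with two nearly identical features: Ridge tends to split the weight between them, while Lasso typically keeps one and zeroes out the other (exactly which one it keeps can vary with the data and solver):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(42)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)   # almost a copy of x1 (highly correlated)
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("Ridge coefficients:", np.round(ridge.coef_, 3))  # weight shared across both features
print("Lasso coefficients:", np.round(lasso.coef_, 3))  # usually one coefficient is ~0
```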
💡 Choosing Between L1 and L2
- Use L2 (Ridge): When you believe most features are relevant and you want to keep them all with controlled magnitude
- Use L1 (Lasso): When you have many features and want automatic feature selection or need an interpretable sparse model
- Use Elastic Net: When you're unsure or want both feature selection and the ability to keep correlated features together
- Default recommendation: Try L2 first (most common), then L1 if you need sparsity (a quick way to compare all three is sketched below)
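If you'd rather not hand-tune λ while comparing the options, scikit-learn also provides cross-validated variants (RidgeCV, LassoCV, ElasticNetCV) that select alpha for you. A minimal sketch on synthetic data (the feature counts and noise level here are arbitrary):

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV
from sklearn.preprocessing import StandardScaler

# Synthetic data purely for illustration: 10 features, only 3 actually matter
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = 2 * X[:, 0] - 3 * X[:, 1] + X[:, 2] + rng.normal(scale=0.5, size=300)
X = StandardScaler().fit_transform(X)

alphas = np.logspace(-3, 2, 11)   # logarithmic grid of candidate strengths

ridge = RidgeCV(alphas=alphas, cv=5).fit(X, y)
lasso = LassoCV(alphas=alphas, cv=5).fit(X, y)
enet = ElasticNetCV(alphas=alphas, l1_ratio=[0.2, 0.5, 0.8], cv=5).fit(X, y)

print(f"RidgeCV      best alpha: {ridge.alpha_}")
print(f"LassoCV      best alpha: {lasso.alpha_}, non-zero coefs: {np.sum(lasso.coef_ != 0)}/10")
print(f"ElasticNetCV best alpha: {enet.alpha_}, l1_ratio: {enet.l1_ratio_}")
```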
Implementation in Practice
How to apply regularization in real models:
```python
# Complete Example: Regularization in Practice
from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import fetch_california_housing
import numpy as np

# Load data (many features, some correlated)
X, y = fetch_california_housing(return_X_y=True)
print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features\n")

# STEP 1: Always standardize features for regularization!
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print("✅ Step 1: Features standardized\n")

# STEP 2: Try different regularization approaches
models = {
    'No Regularization': LinearRegression(),
    'Ridge (L2)': Ridge(alpha=1.0),
    'Lasso (L1)': Lasso(alpha=0.1),
    'Elastic Net': ElasticNet(alpha=0.1, l1_ratio=0.5)
}

print("MODEL COMPARISON (5-Fold Cross-Validation):")
print("-" * 60)
for name, model in models.items():
    # Cross-validation gives an unbiased estimate
    scores = cross_val_score(model, X_scaled, y, cv=5, scoring='neg_mean_squared_error')
    rmse = np.sqrt(-scores.mean())
    print(f"{name:20s} | RMSE: {rmse:.4f}")
print("\n")

# STEP 3: Tune lambda (alpha) using GridSearchCV
print("STEP 3: Tuning lambda (alpha) for Ridge:")
print("-" * 60)
ridge = Ridge()
param_grid = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
grid_search = GridSearchCV(ridge, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_scaled, y)
print(f"Best alpha: {grid_search.best_params_['alpha']}")
print(f"Best RMSE: {np.sqrt(-grid_search.best_score_):.4f}\n")

# Visualize effect of alpha
for alpha, score in zip(param_grid['alpha'], -grid_search.cv_results_['mean_test_score']):
    rmse = np.sqrt(score)
    print(f"  α = {alpha:7.3f} → RMSE = {rmse:.4f}")

print("\n💡 Notice: RMSE is U-shaped!")
print("  - Too small α: overfitting risk (not enough regularization)")
print("  - Too large α: underfitting (too much regularization)")
print("  - Optimal α: best tradeoff\n")

# STEP 4: Compare feature usage
print("STEP 4: Feature Selection Comparison:")
print("-" * 60)
ridge_best = Ridge(alpha=grid_search.best_params_['alpha'])
ridge_best.fit(X_scaled, y)
lasso = Lasso(alpha=0.1)
lasso.fit(X_scaled, y)
print(f"Ridge: {np.sum(ridge_best.coef_ != 0)}/{len(ridge_best.coef_)} features used (all)")
print(f"Lasso: {np.sum(lasso.coef_ != 0)}/{len(lasso.coef_)} features used (sparse!)")
print(f"\nLasso eliminated {np.sum(lasso.coef_ == 0)} features automatically!")

# Output:
# MODEL COMPARISON:
# No Regularization | RMSE: 0.7234
# Ridge (L2)        | RMSE: 0.7195  ✅ Better!
# Lasso (L1)        | RMSE: 0.7258
# Elastic Net       | RMSE: 0.7203
#
# Best alpha: 10
# Best RMSE: 0.7180
#
# Feature Selection:
# Ridge: 8/8 features used
# Lasso: 5/8 features used (eliminated 3!)
```

Key Concepts
L2 Regularization (Ridge)
Adds sum of squared weights to loss: λΣw². Shrinks all weights proportionally but keeps all features. Preferred when all features are relevant.
L1 Regularization (Lasso)
Adds sum of absolute weights to loss: λΣ|w|. Drives some weights to exactly zero, performing automatic feature selection. Preferred for sparse models.
Lambda (λ) Hyperparameter
Controls regularization strength. λ=0: no regularization (overfitting risk). Large λ: strong regularization (underfitting risk). Must tune via cross-validation.
Elastic Net
Combines L1 and L2: λ₁Σ|w| + λ₂Σw². Gets benefits of both - feature selection from L1 and stability from L2. Best of both worlds for many problems.
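One practical note: scikit-learn's ElasticNet does not take λ₁ and λ₂ directly; it uses alpha and l1_ratio. Based on its documented objective (stated here as a sketch; worth double-checking against your installed version), λ₁ ≈ alpha·l1_ratio and λ₂ ≈ ½·alpha·(1 − l1_ratio), with the data term scaled by 1/(2n):

```python
from sklearn.linear_model import ElasticNet

# scikit-learn's documented ElasticNet objective (verify against your version):
#   (1 / (2 * n_samples)) * ||y - Xw||² + alpha * l1_ratio * Σ|w|
#   + 0.5 * alpha * (1 - l1_ratio) * Σw²
alpha, l1_ratio = 0.1, 0.5
lambda_1 = alpha * l1_ratio              # effective L1 strength
lambda_2 = 0.5 * alpha * (1 - l1_ratio)  # effective L2 strength
print(f"λ₁ = {lambda_1}, λ₂ = {lambda_2}")

model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio)  # an even L1/L2 mix
```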
Interview Tips
- 💡Regularization prevents overfitting by adding penalty term to loss function, discouraging large weights
- 💡L2 (Ridge): penalty = λΣw², shrinks all weights, keeps all features. L1 (Lasso): penalty = λΣ|w|, sets some weights to 0, feature selection
- 💡Lambda (λ) controls tradeoff: higher λ = simpler model (more regularization), lower λ = more complex (less regularization)
- 💡Modified loss function: Loss = Original Loss + λ × Regularization Term. Model must balance both objectives
- 💡L2 is differentiable everywhere (smooth optimization), L1 is not differentiable at zero (can cause convergence issues)
- 💡Use L2 when all features are relevant, L1 when you want feature selection/sparsity, Elastic Net for combination
- 💡Regularization is equivalent to a Bayesian prior: L2 = Gaussian prior on weights, L1 = Laplacian prior (a short derivation is sketched after this list)
- 💡Always standardize features before applying regularization (different scales would be penalized differently)
- 💡Tune λ using cross-validation, not test set. Try logarithmic range: 0.001, 0.01, 0.1, 1, 10, 100
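For the Bayesian-prior tip above, here is a short sketch of the standard derivation (a Gaussian likelihood is assumed and additive constants are dropped):

```latex
% MAP estimation: maximize log-likelihood + log-prior
% Likelihood: y_i ~ N(x_i^T w, \sigma^2);   Prior: w_j ~ N(0, \tau^2)
\hat{w}_{\mathrm{MAP}}
  = \arg\max_w \Big[ \log p(y \mid X, w) + \log p(w) \Big]
  = \arg\min_w \Big[ \tfrac{1}{2\sigma^2}\lVert y - Xw \rVert_2^2
                   + \tfrac{1}{2\tau^2}\lVert w \rVert_2^2 \Big]
% Multiplying through by 2\sigma^2 gives the L2-regularized loss with
% \lambda = \sigma^2 / \tau^2.  Swapping the Gaussian prior for a Laplace prior,
% p(w_j) \propto \exp(-|w_j| / b), replaces the second term with an L1 penalty.
```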