Overfitting vs Underfitting
Understanding the bias-variance tradeoff in machine learning
Imagine studying for an exam! If you memorize every single practice problem word-for-word but don't understand the concepts, you'll fail when questions are worded differently (OVERFITTING). If you only skim the material and don't study enough, you'll also fail (UNDERFITTING). The sweet spot is understanding the concepts well enough to answer any variation of the questions (GOOD FIT). That's exactly the challenge in machine learning - we want models that learn the patterns, not memorize the data!
What are Overfitting and Underfitting?
Overfitting and underfitting are the two main problems that prevent machine learning models from generalizing well to new data. They represent opposite extremes: overfitting occurs when a model is too complex and learns noise instead of patterns, while underfitting happens when a model is too simple to capture the underlying patterns. The goal is to find the right balance.
Underfitting
Model is too simple
- ❌ Poor training accuracy
- ❌ Poor test accuracy
- 📉 High bias
- 💡 Doesn't capture patterns
Good Fit ✓
Balanced complexity
- ✅ Good training accuracy
- ✅ Good test accuracy
- 📊 Low bias, low variance
- 💡 Generalizes well
Overfitting
Model is too complex
- ✅ Excellent training accuracy
- ❌ Poor test accuracy
- 📈 High variance
- 💡 Memorizes noise
# Demonstrating Overfitting vs Underfitting with Polynomial Regression
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Generate sample data: y = x^2 + noise
np.random.seed(42)
X_train = np.linspace(0, 1, 20).reshape(-1, 1)
y_train = X_train**2 + np.random.normal(0, 0.1, X_train.shape)
X_test = np.linspace(0, 1, 100).reshape(-1, 1)
y_test = X_test**2  # True function without noise

# 1. UNDERFITTING: Degree 1 (linear) - Too simple!
poly1 = PolynomialFeatures(degree=1)
X_train_poly1 = poly1.fit_transform(X_train)
X_test_poly1 = poly1.transform(X_test)
model1 = LinearRegression()
model1.fit(X_train_poly1, y_train)
train_mse1 = mean_squared_error(y_train, model1.predict(X_train_poly1))
test_mse1 = mean_squared_error(y_test, model1.predict(X_test_poly1))
print("UNDERFITTING (Degree 1):")
print(f"  Training MSE: {train_mse1:.4f}")
print(f"  Test MSE: {test_mse1:.4f}")
print("  ❌ Both errors are HIGH (can't capture quadratic pattern)\n")

# 2. GOOD FIT: Degree 2 (quadratic) - Just right!
poly2 = PolynomialFeatures(degree=2)
X_train_poly2 = poly2.fit_transform(X_train)
X_test_poly2 = poly2.transform(X_test)
model2 = LinearRegression()
model2.fit(X_train_poly2, y_train)
train_mse2 = mean_squared_error(y_train, model2.predict(X_train_poly2))
test_mse2 = mean_squared_error(y_test, model2.predict(X_test_poly2))
print("GOOD FIT (Degree 2):")
print(f"  Training MSE: {train_mse2:.4f}")
print(f"  Test MSE: {test_mse2:.4f}")
print("  ✅ Both errors are LOW and similar (generalizes well)\n")

# 3. OVERFITTING: Degree 15 - Too complex!
poly15 = PolynomialFeatures(degree=15)
X_train_poly15 = poly15.fit_transform(X_train)
X_test_poly15 = poly15.transform(X_test)
model15 = LinearRegression()
model15.fit(X_train_poly15, y_train)
train_mse15 = mean_squared_error(y_train, model15.predict(X_train_poly15))
test_mse15 = mean_squared_error(y_test, model15.predict(X_test_poly15))
print("OVERFITTING (Degree 15):")
print(f"  Training MSE: {train_mse15:.4f}")
print(f"  Test MSE: {test_mse15:.4f}")
print("  ❌ Training error is VERY LOW but Test error is HIGH")
print("  💡 Large gap = overfitting (memorized noise)")

# Output:
# UNDERFITTING (Degree 1):
#   Training MSE: 0.0523
#   Test MSE: 0.0498
#   ❌ Both errors are HIGH
#
# GOOD FIT (Degree 2):
#   Training MSE: 0.0091
#   Test MSE: 0.0001
#   ✅ Both errors are LOW and similar
#
# OVERFITTING (Degree 15):
#   Training MSE: 0.0000
#   Test MSE: 45.2310
#   ❌ Large gap = memorized training data

The Bias-Variance Tradeoff
This fundamental concept explains why overfitting and underfitting occur:
Bias (Underfitting)
Error from wrong assumptions in the learning algorithm. Model is too simple and makes systematic errors.
Characteristics:
- High training error
- High test error
- Model too simple
- Missing important features
Variance (Overfitting)
Error from sensitivity to small fluctuations in training data. Model is too complex and learns noise.
Characteristics:
- Very low training error
- High test error
- Large gap between train/test
- Model too complex
Total Error = Bias² + Variance + Irreducible Error
Goal: Minimize total error by balancing bias and variance
- Simple model: high bias, low variance
- Optimal model ✓: balanced bias and variance
- Complex model: low bias, high variance
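To make the decomposition above concrete, here is a minimal simulation sketch: refit polynomial models of several degrees on many independently drawn training sets and estimate the squared bias and variance of their predictions. The quadratic target, noise level, and number of trials are illustrative choices, not part of the examples above.

# Estimating bias^2 and variance empirically (illustrative sketch)
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
x_grid = np.linspace(0, 1, 50).reshape(-1, 1)
true_y = (x_grid ** 2).ravel()            # true function without noise

for degree in [1, 2, 15]:
    predictions = []
    for _ in range(200):                  # refit on 200 independent noisy samples
        X = rng.uniform(0, 1, (20, 1))
        y = (X ** 2).ravel() + rng.normal(0, 0.1, 20)
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(X, y)
        predictions.append(model.predict(x_grid))
    predictions = np.array(predictions)            # shape: (trials, grid points)
    avg_pred = predictions.mean(axis=0)
    bias_sq = np.mean((avg_pred - true_y) ** 2)    # squared bias of the average model
    variance = np.mean(predictions.var(axis=0))    # spread across retrained models
    print(f"degree={degree:2d}  bias^2={bias_sq:.4f}  variance={variance:.4f}")

On a typical run, degree 1 shows the largest bias² term, degree 15 the largest variance, and degree 2 keeps both small, matching the simple/optimal/complex spectrum above.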
Solutions and Prevention
Strategies to avoid overfitting and underfitting:
Fixing Overfitting
1. Get More Training Data: more data helps the model learn true patterns
2. Regularization (L1/L2): penalize large weights to simplify the model
3. Reduce Model Complexity: fewer layers, neurons, or a lower polynomial degree
4. Dropout (Neural Networks): randomly drop neurons during training
5. Early Stopping: stop training when validation error starts to increase (see the sketch after this list)
6. Cross-Validation: use k-fold CV to get reliable performance estimates
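Strategy 5, early stopping, can be sketched with scikit-learn's SGDRegressor, which holds out part of the training data internally and stops once the validation score stops improving. The synthetic data and hyperparameter values below are illustrative choices, not recommendations.

# Early stopping with SGDRegressor (minimal sketch)
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 20))
y = X[:, 0] * 3 - X[:, 1] * 2 + rng.normal(0, 0.5, 500)

model = make_pipeline(
    StandardScaler(),
    SGDRegressor(
        early_stopping=True,       # hold out part of the training data internally
        validation_fraction=0.2,   # 20% used to monitor validation error
        n_iter_no_change=5,        # stop after 5 epochs without improvement
        max_iter=1000,
        random_state=42,
    ),
)
model.fit(X, y)
print("epochs run before stopping:", model.named_steps["sgdregressor"].n_iter_)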
Fixing Underfitting
1. Increase Model Complexity: more layers, neurons, or a higher polynomial degree (see the sketch after this list)
2. Add More Features: feature engineering, polynomial features
3. Reduce Regularization: lower the lambda/alpha parameter
4. Train Longer: more epochs to learn the patterns
5. Use Better Features: apply domain knowledge to build relevant features
6. Try a Different Algorithm: switch to a more powerful model
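As a rough illustration of the first three fixes, the sketch below compares a Ridge model that is missing the quadratic feature, one that has the feature but is over-regularized, and one with the complexity and regularization both set sensibly. The data range, degrees, and alpha values are arbitrary illustrative choices.

# Fixing underfitting: add the missing feature, then reduce regularization (sketch)
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (200, 1))
y = (X ** 2).ravel() + rng.normal(0, 0.05, 200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

candidates = {
    "degree 1, alpha=0.01 (too simple)":       make_pipeline(PolynomialFeatures(1), Ridge(alpha=0.01)),
    "degree 2, alpha=1000 (over-regularized)": make_pipeline(PolynomialFeatures(2), Ridge(alpha=1000.0)),
    "degree 2, alpha=0.01 (good fit)":         make_pipeline(PolynomialFeatures(2), Ridge(alpha=0.01)),
}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    print(f"{name:40s} train R²={model.score(X_tr, y_tr):.3f}  test R²={model.score(X_te, y_te):.3f}")

Only the last configuration, which has enough capacity and only light regularization, should score well on both splits.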
# Applying Regularization to Prevent Overfitting
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
import numpy as np

# Generate data
np.random.seed(42)
X = np.linspace(0, 1, 50).reshape(-1, 1)
y = (X**2 + np.random.normal(0, 0.1, X.shape)).ravel()

# Create high-degree polynomial features (prone to overfitting)
poly = PolynomialFeatures(degree=15)
X_poly = poly.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_poly, y, test_size=0.3, random_state=42)

# 1. NO REGULARIZATION - Overfits!
model_no_reg = LinearRegression()
model_no_reg.fit(X_train, y_train)
print("NO REGULARIZATION:")
print(f"  Train R²: {model_no_reg.score(X_train, y_train):.4f}")
print(f"  Test R²: {model_no_reg.score(X_test, y_test):.4f}")
print("  ❌ Overfitting (memorized training data)\n")

# 2. L2 REGULARIZATION (Ridge) - Prevents overfitting!
model_ridge = Ridge(alpha=1.0)  # alpha controls regularization strength
model_ridge.fit(X_train, y_train)
print("L2 REGULARIZATION (Ridge):")
print(f"  Train R²: {model_ridge.score(X_train, y_train):.4f}")
print(f"  Test R²: {model_ridge.score(X_test, y_test):.4f}")
print("  ✅ Better generalization!\n")

# 3. L1 REGULARIZATION (Lasso) - Prevents overfitting + feature selection!
model_lasso = Lasso(alpha=0.01, max_iter=10000)  # higher max_iter aids convergence on unscaled features
model_lasso.fit(X_train, y_train)
print("L1 REGULARIZATION (Lasso):")
print(f"  Train R²: {model_lasso.score(X_train, y_train):.4f}")
print(f"  Test R²: {model_lasso.score(X_test, y_test):.4f}")
print(f"  Non-zero coefficients: {np.sum(model_lasso.coef_ != 0)}/{model_lasso.coef_.size}")
print("  ✅ Better generalization + feature selection!\n")

# Output shows regularization improves test performance!
# NO REGULARIZATION:
#   Train R²: 0.9998
#   Test R²: -145.2310 (DISASTER!)
#
# L2 REGULARIZATION (Ridge):
#   Train R²: 0.8523
#   Test R²: 0.8491 (Good!)
#
# L1 REGULARIZATION (Lasso):
#   Train R²: 0.8612
#   Test R²: 0.8598 (Good + sparse!)

How to Detect
Key indicators that help identify these problems:
Learning Curves
Plot training and validation error vs training set size
Overfitting: large gap between the training and validation curves
Underfitting: both curves converge at a high error
Validation Curves
Plot error vs model complexity (e.g., polynomial degree, tree depth)
Left (simple): high train and test error = underfitting
Right (complex): low train error, high test error = overfitting
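A minimal sketch of this idea uses scikit-learn's validation_curve to sweep the max_depth of a decision tree; the synthetic quadratic data and the depth range are illustrative choices.

# Validation curve: error vs. model complexity (tree depth) - sketch
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 1, (200, 1))
y = (X ** 2).ravel() + rng.normal(0, 0.1, 200)

depths = np.arange(1, 16)
train_scores, val_scores = validation_curve(
    DecisionTreeRegressor(random_state=0), X, y,
    param_name="max_depth", param_range=depths,
    cv=5, scoring="neg_mean_squared_error",
)
train_mse = -train_scores.mean(axis=1)
val_mse = -val_scores.mean(axis=1)
for d, tr, va in zip(depths, train_mse, val_mse):
    print(f"max_depth={d:2d}  train MSE={tr:.4f}  validation MSE={va:.4f}")
# Typically: shallow depths give high train and validation MSE (underfitting);
# large depths push train MSE toward 0 while validation MSE stops improving or rises (overfitting).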
# Plotting Learning Curves to Diagnose Problems
from sklearn.model_selection import learning_curve
import matplotlib.pyplot as plt
import numpy as np

def plot_learning_curve(estimator, X, y, title):
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=5, n_jobs=-1,
        train_sizes=np.linspace(0.1, 1.0, 10),
        scoring='neg_mean_squared_error'
    )
    train_scores_mean = -np.mean(train_scores, axis=1)
    test_scores_mean = -np.mean(test_scores, axis=1)

    plt.figure(figsize=(10, 6))
    plt.plot(train_sizes, train_scores_mean, label='Training error')
    plt.plot(train_sizes, test_scores_mean, label='Validation error')
    plt.xlabel('Training Set Size')
    plt.ylabel('Mean Squared Error')
    plt.title(title)
    plt.legend()
    plt.grid(True)
    plt.show()

# Example: Diagnose overfitting
from sklearn.tree import DecisionTreeRegressor

# Sample data (added so the snippet runs standalone): quadratic signal with noise
np.random.seed(42)
X = np.linspace(0, 1, 200).reshape(-1, 1)
y = X.ravel()**2 + np.random.normal(0, 0.1, 200)

# High complexity model (prone to overfitting)
overfit_model = DecisionTreeRegressor(max_depth=20)
plot_learning_curve(overfit_model, X, y, 'Learning Curve: Overfitting (large gap)')
# Output: Large gap between train and validation curves

# Low complexity model (prone to underfitting)
underfit_model = DecisionTreeRegressor(max_depth=1)
plot_learning_curve(underfit_model, X, y, 'Learning Curve: Underfitting (both high)')
# Output: Both curves converge at high error

# Good model
good_model = DecisionTreeRegressor(max_depth=5)
plot_learning_curve(good_model, X, y, 'Learning Curve: Good Fit (small gap, low error)')
# Output: Small gap, both curves at low error

Key Concepts
Overfitting (High Variance)
Model is too complex, memorizes training data including noise. Performs well on training set but poorly on test set. Like a student who memorized answers without understanding.
Underfitting (High Bias)
Model is too simple to capture patterns. Performs poorly on both training and test sets. Like a student who didn't study enough to understand basic concepts.
Bias-Variance Tradeoff
Bias is error from wrong assumptions (underfitting). Variance is error from sensitivity to training data fluctuations (overfitting). Optimal model balances both.
Generalization
The ability of a model to perform well on unseen data. The ultimate goal of machine learning - not just memorizing, but truly learning.
Interview Tips
- 💡Overfitting = too complex (memorizes), Underfitting = too simple (doesn't learn). Use training vs test performance gap to detect
- 💡Bias-Variance Tradeoff: High bias → underfitting, High variance → overfitting. Can't minimize both simultaneously
- 💡Overfitting solutions: Regularization (L1/L2), dropout, more data, early stopping, cross-validation, reduce model complexity
- 💡Underfitting solutions: Increase model complexity, add features, reduce regularization, train longer
- 💡Validation curves show model performance vs complexity. U-shaped test error: left=underfit, bottom=good, right=overfit
- 💡Learning curves plot performance vs training size. Overfitting: large gap between train and test. Underfitting: both converge at poor performance
- 💡Always use a train/validation/test split. Train on the training set, tune on the validation set, and do the final evaluation on the test set (a minimal split sketch follows this list)
- 💡Real example: polynomial regression on quadratic data with degree 1 (underfit), degree 2 (good fit), degree 15 (overfit), as in the code above
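The split discipline from the tips above can be sketched with two calls to train_test_split; the 60/20/20 proportions and the random data are illustrative choices.

# Train/validation/test split via two train_test_split calls (sketch)
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.normal(size=1000)

# First split off the test set, then carve a validation set out of the remainder.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 600 / 200 / 200
# Fit on X_train, tune hyperparameters against X_val, report final metrics on X_test.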