Model Evaluation
Understanding metrics and techniques to assess machine learning model performance
Imagine you hired three people to sort fruit. How do you know who's doing the best job? You could measure how many pieces they sorted correctly (accuracy), how many of the fruits they labeled as apples really were apples (precision), how many of the real apples they managed to find (recall), and whether they handle fruit they've never seen before (generalization). Model evaluation works the same way - we use different metrics to understand how well our ML model performs and whether it will work on new, unseen data!
What is Model Evaluation?
Model evaluation is the process of measuring how well a machine learning model performs on held-out data. It uses various metrics and techniques to assess accuracy, reliability, and generalization. Proper evaluation helps ensure the model will perform well in production and surfaces issues like overfitting or class imbalance before deployment.
```python
# Model Evaluation Overview
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_auc_score
)
import numpy as np

# Generate sample binary classification data
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2,
                           weights=[0.7, 0.3], random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Train a classifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]  # Probabilities for ROC-AUC

# EVALUATE THE MODEL WITH MULTIPLE METRICS
print("=" * 60)
print("MODEL EVALUATION RESULTS")
print("=" * 60)

# 1. CONFUSION MATRIX
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(cm)
print(f"True Negatives (TN): {cm[0, 0]}")
print(f"False Positives (FP): {cm[0, 1]} - Type I Error")
print(f"False Negatives (FN): {cm[1, 0]} - Type II Error")
print(f"True Positives (TP): {cm[1, 1]}")

# 2. BASIC METRICS
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"\nAccuracy: {accuracy:.3f} - Overall correctness")
print(f"Precision: {precision:.3f} - Of predicted positive, % correct")
print(f"Recall: {recall:.3f} - Of actual positive, % found")
print(f"F1-Score: {f1:.3f} - Harmonic mean of precision & recall")

# 3. ROC-AUC SCORE
auc = roc_auc_score(y_test, y_pred_proba)
print(f"\nROC-AUC: {auc:.3f} - Overall classification performance")

# 4. DETAILED CLASSIFICATION REPORT
print("\nDetailed Classification Report:")
print(classification_report(y_test, y_pred))

# WHY ACCURACY ISN'T ALWAYS ENOUGH
print("\n" + "=" * 60)
print("WHY MULTIPLE METRICS MATTER:")
print("=" * 60)
print("If dataset has 95% class 0 and 5% class 1,")
print("a model predicting everything as class 0 gets 95% accuracy!")
print("But precision, recall, and F1 for class 1 would be 0.")
print("Always use multiple metrics and check confusion matrix!")
```
Classification Metrics
Metrics for evaluating models that predict categories or classes:
Accuracy
(TP + TN) / (TP + TN + FP + FN)
Percentage of correct predictions (all classes)
⚠️ Misleading with imbalanced datasets
Precision
TP / (TP + FP)
Of predicted positives, how many are correct?
✓ Use when false positives are costly
Recall (Sensitivity)
TP / (TP + FN)
Of actual positives, how many did we find?
✓ Use when false negatives are costly
F1-Score
2 × (Precision × Recall) / (Precision + Recall)
Harmonic mean of precision and recall
✓ Good for imbalanced datasets
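To make the formulas above concrete, here is a minimal sketch that computes each metric directly from hypothetical confusion-matrix counts (the tp/fp/fn/tn values are made up for illustration); the scikit-learn functions in the next example do the same arithmetic for you:

```python
# Manual metric calculation from (hypothetical) confusion-matrix counts
tp, fp, fn, tn = 40, 10, 20, 130  # illustrative values, not from a real model

accuracy = (tp + tn) / (tp + tn + fp + fn)          # overall correctness
precision = tp / (tp + fp)                          # of predicted positives, % correct
recall = tp / (tp + fn)                             # of actual positives, % found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Accuracy:  {accuracy:.3f}")   # (40 + 130) / 200 = 0.850
print(f"Precision: {precision:.3f}")  # 40 / 50 = 0.800
print(f"Recall:    {recall:.3f}")     # 40 / 60 ≈ 0.667
print(f"F1-Score:  {f1:.3f}")         # ≈ 0.727
```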
```python
# Classification Metrics Examples
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, roc_curve, roc_auc_score
)
import matplotlib.pyplot as plt
import numpy as np

# Example: Medical diagnosis (disease detection)
# Class 0 = Healthy, Class 1 = Disease
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0])
y_pred = np.array([0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1])

# Confusion Matrix
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(cm)
print(f"\nTrue Negatives (Healthy correctly identified): {cm[0, 0]}")
print(f"False Positives (Healthy wrongly diagnosed): {cm[0, 1]}")
print(f"False Negatives (Disease missed): {cm[1, 0]} ⚠️ DANGEROUS!")
print(f"True Positives (Disease correctly detected): {cm[1, 1]}")

# Calculate metrics
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(f"\nAccuracy: {accuracy:.2%}")
print(f"Precision: {precision:.2%} - Of diagnosed cases, {precision:.0%} actually have disease")
print(f"Recall: {recall:.2%} - We detected {recall:.0%} of all disease cases")
print(f"F1-Score: {f1:.2%}")

# ROC-AUC Example
# Need prediction probabilities
y_true_binary = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 1])
y_scores = np.array([0.1, 0.3, 0.8, 0.6, 0.2, 0.9, 0.7, 0.15, 0.85, 0.95])

# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_true_binary, y_scores)
auc = roc_auc_score(y_true_binary, y_scores)
print(f"\nROC-AUC Score: {auc:.3f}")
print("AUC = 1.0: Perfect classifier")
print("AUC = 0.5: Random guessing")
print("AUC < 0.5: Worse than random (inverted predictions)")

# Example: When to prioritize Precision vs Recall
print("\n" + "=" * 60)
print("PRECISION vs RECALL TRADE-OFF")
print("=" * 60)
print("\nSPAM DETECTION (Prioritize Precision):")
print("  - False Positive: Important email goes to spam (BAD!)")
print("  - False Negative: Spam in inbox (annoying but ok)")
print("  → High precision to avoid false positives")
print("\nCANCER DETECTION (Prioritize Recall):")
print("  - False Positive: Healthy person gets more tests (ok)")
print("  - False Negative: Cancer patient undiagnosed (TERRIBLE!)")
print("  → High recall to catch all cases")
print("\nFRAUD DETECTION (Balance both):")
print("  - Need good precision (don't block legitimate transactions)")
print("  - Need good recall (catch fraudsters)")
print("  → Use F1-Score to balance both")
```
Regression Metrics
Metrics for evaluating models that predict continuous values:
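Before the full scikit-learn example below, here is a minimal sketch of the main regression formulas computed by hand with NumPy on a tiny made-up array; the values should agree with the sklearn functions used next:

```python
import numpy as np

# Tiny illustrative arrays (not the house prices used in the next example)
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.5, 2.0, 8.0])

mae = np.mean(np.abs(y_true - y_pred))            # Mean Absolute Error
mse = np.mean((y_true - y_pred) ** 2)             # Mean Squared Error
rmse = np.sqrt(mse)                               # Root Mean Squared Error
ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot                          # R² = 1 - SS_res / SS_tot

print(f"MAE={mae:.3f}, MSE={mse:.3f}, RMSE={rmse:.3f}, R²={r2:.3f}")
```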
```python
# Regression Metrics
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error, r2_score,
    mean_absolute_percentage_error
)
import numpy as np

# Example: House price predictions
y_true = np.array([250000, 300000, 180000, 420000, 350000])  # Actual prices
y_pred = np.array([255000, 290000, 190000, 400000, 360000])  # Predicted prices

# 1. MEAN ABSOLUTE ERROR (MAE)
# Average absolute difference between predicted and actual
mae = mean_absolute_error(y_true, y_pred)
print(f"MAE: ${mae:,.0f}")
print("  → Average prediction error in dollars")
print("  → Easy to interpret, same units as target")
print("  → Robust to outliers (doesn't square errors)")

# 2. MEAN SQUARED ERROR (MSE)
# Average squared difference (penalizes large errors heavily)
mse = mean_squared_error(y_true, y_pred)
print(f"\nMSE: {mse:,.0f}")
print("  → Penalizes large errors more than small ones")
print("  → Not in original units (squared)")

# 3. ROOT MEAN SQUARED ERROR (RMSE)
# Square root of MSE (returns to original units)
rmse = np.sqrt(mse)
print(f"\nRMSE: ${rmse:,.0f}")
print("  → In original units (dollars)")
print("  → More sensitive to outliers than MAE")
print("  → Most commonly used regression metric")

# 4. R² (R-squared) - Coefficient of Determination
# Proportion of variance in target explained by model
r2 = r2_score(y_true, y_pred)
print(f"\nR² Score: {r2:.3f}")
print("  → R² = 1.0: Perfect predictions")
print("  → R² = 0.0: Model as good as predicting mean")
print("  → R² < 0.0: Model worse than predicting mean")
print(f"  → Model explains {r2*100:.1f}% of variance")

# 5. MEAN ABSOLUTE PERCENTAGE ERROR (MAPE)
# Average percentage error
mape = mean_absolute_percentage_error(y_true, y_pred)
print(f"\nMAPE: {mape:.2%}")
print("  → Scale-independent (good for comparing models)")
print("  → Easy to interpret as percentage")
print("  → Problem: undefined when true value is 0")

# Detailed comparison of errors
print("\n" + "=" * 60)
print("DETAILED PREDICTION ANALYSIS")
print("=" * 60)
for i in range(len(y_true)):
    error = y_pred[i] - y_true[i]
    pct_error = (error / y_true[i]) * 100
    print(f"House {i+1}: True=${y_true[i]:,}, Pred=${y_pred[i]:,}, "
          f"Error=${error:,} ({pct_error:+.1f}%)")

# Which metric to use?
print("\n" + "=" * 60)
print("CHOOSING THE RIGHT METRIC")
print("=" * 60)
print("MAE:  When all errors are equally important")
print("RMSE: When large errors are particularly undesirable")
print("R²:   When you want to know % of variance explained")
print("MAPE: When you need scale-independent comparison")

# Example with outliers
print("\n" + "=" * 60)
print("EFFECT OF OUTLIERS")
print("=" * 60)
y_true_outlier = np.array([100, 110, 105, 108, 500])  # 500 is an outlier
y_pred_outlier = np.array([95, 115, 100, 110, 120])
mae_out = mean_absolute_error(y_true_outlier, y_pred_outlier)
rmse_out = np.sqrt(mean_squared_error(y_true_outlier, y_pred_outlier))
print(f"MAE: {mae_out:.1f}")
print(f"RMSE: {rmse_out:.1f} (much larger due to outlier)")
print("→ RMSE is more affected by outliers (squared error)")
```
Cross-Validation
Techniques to assess model performance and reduce overfitting:
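Cross-validation complements, rather than replaces, a held-out test set. As a minimal sketch (assuming scikit-learn's train_test_split and an arbitrary 60/20/20 ratio), the three-way train/validation/test split referenced at the end of the example below can be built from two calls:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# First split off the test set (20%), then carve a validation set out of the rest
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp)  # 0.25 × 0.8 = 0.2

print(len(X_train), len(X_val), len(X_test))  # roughly 600 / 200 / 200
```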
```python
# Cross-Validation Techniques
from sklearn.model_selection import (
    cross_val_score, cross_validate, KFold, StratifiedKFold, LeaveOneOut
)
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Generate sample data
X, y = make_classification(n_samples=500, n_features=20, n_classes=2,
                           weights=[0.7, 0.3], random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)

# METHOD 1: Simple K-Fold Cross-Validation (k=5)
print("=" * 60)
print("K-FOLD CROSS-VALIDATION (k=5)")
print("=" * 60)
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"Accuracy scores for each fold: {scores}")
print(f"Mean Accuracy: {scores.mean():.3f} (+/- {scores.std()*2:.3f})")
print("\nHow it works:")
print("  1. Split data into 5 equal parts (folds)")
print("  2. Train on 4 folds, test on 1 fold")
print("  3. Repeat 5 times, each fold used as test once")
print("  4. Average results for robust estimate")

# METHOD 2: Stratified K-Fold (maintains class distribution)
print("\n" + "=" * 60)
print("STRATIFIED K-FOLD (Maintains Class Balance)")
print("=" * 60)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_stratified = cross_val_score(model, X, y, cv=skf, scoring='accuracy')
print(f"Stratified Accuracy: {scores_stratified.mean():.3f}")
print("\nAdvantage: Each fold has same class distribution as full dataset")
print("Use when: Imbalanced datasets")

# METHOD 3: Cross-Validate with Multiple Metrics
print("\n" + "=" * 60)
print("MULTIPLE METRICS EVALUATION")
print("=" * 60)
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
results = cross_validate(model, X, y, cv=5, scoring=scoring)
for metric in scoring:
    scores = results[f'test_{metric}']
    print(f"{metric.upper()}: {scores.mean():.3f} (+/- {scores.std()*2:.3f})")

# METHOD 4: Leave-One-Out Cross-Validation (LOOCV)
print("\n" + "=" * 60)
print("LEAVE-ONE-OUT CROSS-VALIDATION")
print("=" * 60)
# Note: Using smaller dataset for LOOCV (it's computationally expensive)
X_small, y_small = X[:50], y[:50]
loo = LeaveOneOut()
scores_loo = cross_val_score(model, X_small, y_small, cv=loo)
print(f"LOOCV Accuracy: {scores_loo.mean():.3f}")
print(f"Number of iterations: {len(scores_loo)} (one per sample)")
print("\nAdvantage: Uses almost all data for training")
print("Disadvantage: Computationally expensive for large datasets")

# METHOD 5: Custom K-Fold with Detailed Analysis
print("\n" + "=" * 60)
print("DETAILED FOLD-BY-FOLD ANALYSIS")
print("=" * 60)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_num = 1
for train_idx, test_idx in kf.split(X):
    X_train_fold, X_test_fold = X[train_idx], X[test_idx]
    y_train_fold, y_test_fold = y[train_idx], y[test_idx]
    model.fit(X_train_fold, y_train_fold)
    score = model.score(X_test_fold, y_test_fold)
    print(f"Fold {fold_num}: Train size={len(train_idx)}, "
          f"Test size={len(test_idx)}, Accuracy={score:.3f}")
    fold_num += 1

# Train/Validation/Test Split Example
print("\n" + "=" * 60)
print("TRAIN / VALIDATION / TEST SPLIT")
print("=" * 60)
print("60% Training: Learn model parameters")
print("20% Validation: Tune hyperparameters, select model")
print("20% Test: Final evaluation (touch only once!)")
print("\nWhy 3 splits?")
print("  - Training: Fit the model")
print("  - Validation: Prevent overfitting during hyperparameter tuning")
print("  - Test: Unbiased final performance estimate")

# Why Cross-Validation?
print("\n" + "=" * 60)
print("WHY USE CROSS-VALIDATION?")
print("=" * 60)
print("✓ More reliable performance estimate than single train/test split")
print("✓ Uses all data for both training and testing")
print("✓ Detects overfitting and high variance")
print("✓ Better for small datasets")
print("✗ Computationally expensive (trains k models)")
print("✗ Not suitable for time-series (use TimeSeriesSplit instead)")
```
Key Concepts
Confusion Matrix
A table showing true positives, true negatives, false positives, and false negatives. Essential for understanding classification model behavior beyond simple accuracy.
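If you prefer to see the matrix rather than read raw counts, scikit-learn ships a plotting helper. A minimal sketch, assuming matplotlib is installed and using made-up labels:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])  # illustrative labels
y_pred = np.array([0, 1, 0, 0, 1, 1, 1, 1])

# Build the matrix and render it as a labeled heatmap
cm = confusion_matrix(y_true, y_pred)
ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["neg", "pos"]).plot()
plt.show()
```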
Precision vs Recall
Precision: Of predicted positives, how many are correct? Recall: Of actual positives, how many did we find? Trade-off between false positives and false negatives.
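The trade-off is easiest to see by sweeping the decision threshold. A minimal sketch using scikit-learn's precision_recall_curve on made-up scores:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 1])  # illustrative labels
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3, 0.75, 0.55])

# Precision and recall at every candidate threshold
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
print("thresholds:", np.round(thresholds, 2))
print("precision: ", np.round(precision, 2))
print("recall:    ", np.round(recall, 2))
# As the threshold rises, precision tends to increase while recall decreases
```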
ROC-AUC
Receiver Operating Characteristic curve plots true positive rate vs false positive rate. AUC (Area Under Curve) measures overall classification performance across all thresholds.
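To visualize the curve itself rather than just the AUC number, here is a minimal matplotlib sketch using the fpr/tpr values from scikit-learn's roc_curve on the same made-up scores as the classification example above:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 1])
y_scores = np.array([0.1, 0.3, 0.8, 0.6, 0.2, 0.9, 0.7, 0.15, 0.85, 0.95])

fpr, tpr, _ = roc_curve(y_true, y_scores)
auc = roc_auc_score(y_true, y_scores)

# ROC curve: true positive rate vs false positive rate across thresholds
plt.plot(fpr, tpr, label=f"model (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guessing (AUC = 0.5)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```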
Cross-Validation
Splitting data into multiple folds and training on different combinations to get reliable performance estimates and detect overfitting.
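For time-ordered data, the cross-validation code above notes that standard k-fold is inappropriate and points to TimeSeriesSplit. A minimal sketch showing how it always trains on the past and tests on the future:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered samples

tscv = TimeSeriesSplit(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
# Each test fold comes strictly after its training data - no leakage from the future
```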
Interview Tips
- 💡Explain the confusion matrix components: TP (correctly predicted positive), TN (correctly predicted negative), FP (wrongly predicted positive - Type I error), FN (wrongly predicted negative - Type II error)
- 💡Know when to prioritize precision vs recall: use precision when false positives are costly (spam detection), recall when false negatives are costly (cancer detection)
- 💡Understand F1-score as the harmonic mean of precision and recall: 2 * (precision * recall) / (precision + recall). Useful for imbalanced datasets
- 💡Explain ROC-AUC: AUC=1.0 is perfect, AUC=0.5 is random guessing. ROC curve shows performance across all classification thresholds
- 💡For regression, know MSE (penalizes large errors heavily), RMSE (in original units), MAE (robust to outliers), and R² (proportion of variance explained)
- 💡Discuss cross-validation types: k-fold (divide into k parts), stratified k-fold (maintains class distribution), leave-one-out (k=n, expensive)
- 💡Explain train/validation/test split: training (60-70%) to learn, validation (15-20%) to tune hyperparameters, test (15-20%) for final evaluation
- 💡Be ready to discuss class imbalance solutions: oversampling (SMOTE), undersampling, class weights, or using precision-recall instead of ROC-AUC (see the sketch after this list)
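A minimal sketch of two of these options using plain scikit-learn: class weighting at training time and precision-recall-based scoring at evaluation time. SMOTE itself lives in the separate imbalanced-learn package and is omitted here; the dataset and parameters below are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, f1_score
from sklearn.model_selection import train_test_split

# Heavily imbalanced toy data: ~95% class 0, ~5% class 1
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.95, 0.05],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# class_weight='balanced' re-weights samples inversely to class frequency
model = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                               random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

# Average precision (area under the precision-recall curve) is usually more
# informative than ROC-AUC when the positive class is rare
print(f"F1 (minority class): {f1_score(y_test, y_pred):.3f}")
print(f"Average precision:   {average_precision_score(y_test, y_proba):.3f}")
```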