Model Evaluation
Understanding metrics and techniques to assess machine learning model performance
Imagine you hired three people to sort fruit. How do you know who's doing the best job? You could measure how many pieces they sorted correctly (accuracy), how many of the fruits they labeled as apples really were apples (precision), how many of the real apples they managed to find (recall), and whether they handle fruit they've never seen before (generalization). Model evaluation works the same way - we use different metrics to understand how well our ML model performs and whether it will work on new, unseen data!
What is Model Evaluation?
Model evaluation is the process of measuring how well a machine learning model performs on held-out data. It uses various metrics and techniques to assess accuracy, reliability, and generalization. Proper evaluation helps ensure the model will perform well in production and surfaces issues like overfitting or class imbalance before deployment.
```python
# Model Evaluation Overview
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_auc_score
)
import numpy as np

# Generate sample binary classification data
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2,
                           weights=[0.7, 0.3], random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Train a classifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]  # Probabilities for ROC-AUC

# EVALUATE THE MODEL WITH MULTIPLE METRICS
print("=" * 60)
print("MODEL EVALUATION RESULTS")
print("=" * 60)

# 1. CONFUSION MATRIX
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(cm)
print(f"True Negatives (TN): {cm[0, 0]}")
print(f"False Positives (FP): {cm[0, 1]} - Type I Error")
print(f"False Negatives (FN): {cm[1, 0]} - Type II Error")
print(f"True Positives (TP): {cm[1, 1]}")

# 2. BASIC METRICS
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"\nAccuracy: {accuracy:.3f} - Overall correctness")
print(f"Precision: {precision:.3f} - Of predicted positive, % correct")
print(f"Recall: {recall:.3f} - Of actual positive, % found")
print(f"F1-Score: {f1:.3f} - Harmonic mean of precision & recall")

# 3. ROC-AUC SCORE
auc = roc_auc_score(y_test, y_pred_proba)
print(f"\nROC-AUC: {auc:.3f} - Overall classification performance")

# 4. DETAILED CLASSIFICATION REPORT
print("\nDetailed Classification Report:")
print(classification_report(y_test, y_pred))

# WHY ACCURACY ISN'T ALWAYS ENOUGH
print("\n" + "=" * 60)
print("WHY MULTIPLE METRICS MATTER:")
print("=" * 60)
print("If dataset has 95% class 0 and 5% class 1,")
print("a model predicting everything as class 0 gets 95% accuracy!")
print("But precision, recall, and F1 for class 1 would be 0.")
print("Always use multiple metrics and check confusion matrix!")
```
Classification Metrics
Metrics for evaluating models that predict categories or classes:
Accuracy
(TP + TN) / (TP + TN + FP + FN)
Percentage of correct predictions (all classes)
⚠️ Misleading with imbalanced datasets
Precision
TP / (TP + FP)
Of predicted positives, how many are correct?
✓ Use when false positives are costly
Recall (Sensitivity)
TP / (TP + FN)
Of actual positives, how many did we find?
✓ Use when false negatives are costly
F1-Score
2 × (Precision × Recall) / (Precision + Recall)
Harmonic mean of precision and recall
✓ Good for imbalanced datasets
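To make the formulas above concrete, here is a minimal sketch that computes each metric directly from hypothetical confusion-matrix counts (the tp/fp/fn/tn values are made up for illustration); the scikit-learn functions in the next example do the same arithmetic for you:

```python
# Manual metric calculation from (hypothetical) confusion-matrix counts
tp, fp, fn, tn = 40, 10, 20, 130  # illustrative values, not from a real model

accuracy = (tp + tn) / (tp + tn + fp + fn)          # overall correctness
precision = tp / (tp + fp)                          # of predicted positives, % correct
recall = tp / (tp + fn)                             # of actual positives, % found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Accuracy:  {accuracy:.3f}")   # (40 + 130) / 200 = 0.850
print(f"Precision: {precision:.3f}")  # 40 / 50 = 0.800
print(f"Recall:    {recall:.3f}")     # 40 / 60 ≈ 0.667
print(f"F1-Score:  {f1:.3f}")         # ≈ 0.727
```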
```python
# Classification Metrics Examples
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, roc_curve, roc_auc_score
)
import matplotlib.pyplot as plt
import numpy as np

# Example: Medical diagnosis (disease detection)
# Class 0 = Healthy, Class 1 = Disease
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0])
y_pred = np.array([0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1])

# Confusion Matrix
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(cm)
print(f"\nTrue Negatives (Healthy correctly identified): {cm[0, 0]}")
print(f"False Positives (Healthy wrongly diagnosed): {cm[0, 1]}")
print(f"False Negatives (Disease missed): {cm[1, 0]} ⚠️ DANGEROUS!")
print(f"True Positives (Disease correctly detected): {cm[1, 1]}")

# Calculate metrics
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(f"\nAccuracy: {accuracy:.2%}")
print(f"Precision: {precision:.2%} - Of diagnosed cases, {precision:.0%} actually have disease")
print(f"Recall: {recall:.2%} - We detected {recall:.0%} of all disease cases")
print(f"F1-Score: {f1:.2%}")

# ROC-AUC Example
# Need prediction probabilities
y_true_binary = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 1])
y_scores = np.array([0.1, 0.3, 0.8, 0.6, 0.2, 0.9, 0.7, 0.15, 0.85, 0.95])

# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_true_binary, y_scores)
auc = roc_auc_score(y_true_binary, y_scores)
print(f"\nROC-AUC Score: {auc:.3f}")
print("AUC = 1.0: Perfect classifier")
print("AUC = 0.5: Random guessing")
print("AUC < 0.5: Worse than random (inverted predictions)")

# Example: When to prioritize Precision vs Recall
print("\n" + "=" * 60)
print("PRECISION vs RECALL TRADE-OFF")
print("=" * 60)
print("\nSPAM DETECTION (Prioritize Precision):")
print("  - False Positive: Important email goes to spam (BAD!)")
print("  - False Negative: Spam in inbox (annoying but ok)")
print("  → High precision to avoid false positives")
print("\nCANCER DETECTION (Prioritize Recall):")
print("  - False Positive: Healthy person gets more tests (ok)")
print("  - False Negative: Cancer patient undiagnosed (TERRIBLE!)")
print("  → High recall to catch all cases")
print("\nFRAUD DETECTION (Balance both):")
print("  - Need good precision (don't block legitimate transactions)")
print("  - Need good recall (catch fraudsters)")
print("  → Use F1-Score to balance both")
```
Regression Metrics
Metrics for evaluating models that predict continuous values:
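Before the full scikit-learn example below, here is a minimal sketch of the main regression formulas computed by hand with NumPy on a tiny made-up array; the values should agree with the sklearn functions used next:

```python
import numpy as np

# Tiny illustrative arrays (not the house prices used in the next example)
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.5, 2.0, 8.0])

mae = np.mean(np.abs(y_true - y_pred))            # Mean Absolute Error
mse = np.mean((y_true - y_pred) ** 2)             # Mean Squared Error
rmse = np.sqrt(mse)                               # Root Mean Squared Error
ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot                          # R² = 1 - SS_res / SS_tot

print(f"MAE={mae:.3f}, MSE={mse:.3f}, RMSE={rmse:.3f}, R²={r2:.3f}")
```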
```python
# Regression Metrics
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error, r2_score,
    mean_absolute_percentage_error
)
import numpy as np

# Example: House price predictions
y_true = np.array([250000, 300000, 180000, 420000, 350000])  # Actual prices
y_pred = np.array([255000, 290000, 190000, 400000, 360000])  # Predicted prices

# 1. MEAN ABSOLUTE ERROR (MAE)
# Average absolute difference between predicted and actual
mae = mean_absolute_error(y_true, y_pred)
print(f"MAE: ${mae:,.0f}")
print("  → Average prediction error in dollars")
print("  → Easy to interpret, same units as target")
print("  → Robust to outliers (doesn't square errors)")

# 2. MEAN SQUARED ERROR (MSE)
# Average squared difference (penalizes large errors heavily)
mse = mean_squared_error(y_true, y_pred)
print(f"\nMSE: {mse:,.0f}")
print("  → Penalizes large errors more than small ones")
print("  → Not in original units (squared)")

# 3. ROOT MEAN SQUARED ERROR (RMSE)
# Square root of MSE (returns to original units)
rmse = np.sqrt(mse)
print(f"\nRMSE: ${rmse:,.0f}")
print("  → In original units (dollars)")
print("  → More sensitive to outliers than MAE")
print("  → Most commonly used regression metric")

# 4. R² (R-squared) - Coefficient of Determination
# Proportion of variance in target explained by model
r2 = r2_score(y_true, y_pred)
print(f"\nR² Score: {r2:.3f}")
print("  → R² = 1.0: Perfect predictions")
print("  → R² = 0.0: Model as good as predicting mean")
print("  → R² < 0.0: Model worse than predicting mean")
print(f"  → Model explains {r2*100:.1f}% of variance")

# 5. MEAN ABSOLUTE PERCENTAGE ERROR (MAPE)
# Average percentage error
mape = mean_absolute_percentage_error(y_true, y_pred)
print(f"\nMAPE: {mape:.2%}")
print("  → Scale-independent (good for comparing models)")
print("  → Easy to interpret as percentage")
print("  → Problem: undefined when true value is 0")

# Detailed comparison of errors
print("\n" + "=" * 60)
print("DETAILED PREDICTION ANALYSIS")
print("=" * 60)
for i in range(len(y_true)):
    error = y_pred[i] - y_true[i]
    pct_error = (error / y_true[i]) * 100
    print(f"House {i+1}: True=${y_true[i]:,}, Pred=${y_pred[i]:,}, "
          f"Error=${error:,} ({pct_error:+.1f}%)")

# Which metric to use?
print("\n" + "=" * 60)
print("CHOOSING THE RIGHT METRIC")
print("=" * 60)
print("MAE:  When all errors are equally important")
print("RMSE: When large errors are particularly undesirable")
print("R²:   When you want to know % of variance explained")
print("MAPE: When you need scale-independent comparison")

# Example with outliers
print("\n" + "=" * 60)
print("EFFECT OF OUTLIERS")
print("=" * 60)
y_true_outlier = np.array([100, 110, 105, 108, 500])  # 500 is an outlier
y_pred_outlier = np.array([95, 115, 100, 110, 120])
mae_out = mean_absolute_error(y_true_outlier, y_pred_outlier)
rmse_out = np.sqrt(mean_squared_error(y_true_outlier, y_pred_outlier))
print(f"MAE: {mae_out:.1f}")
print(f"RMSE: {rmse_out:.1f} (much larger due to outlier)")
print("→ RMSE is more affected by outliers (squared error)")
```
Cross-Validation
Techniques to assess model performance and reduce overfitting:
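Cross-validation complements, rather than replaces, a held-out test set. As a minimal sketch (assuming scikit-learn's train_test_split and an arbitrary 60/20/20 ratio), the three-way train/validation/test split referenced at the end of the example below can be built from two calls:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# First split off the test set (20%), then carve a validation set out of the rest
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp)  # 0.25 × 0.8 = 0.2

print(len(X_train), len(X_val), len(X_test))  # roughly 600 / 200 / 200
```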
```python
# Cross-Validation Techniques
from sklearn.model_selection import (
    cross_val_score, cross_validate, KFold, StratifiedKFold, LeaveOneOut
)
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Generate sample data
X, y = make_classification(n_samples=500, n_features=20, n_classes=2,
                           weights=[0.7, 0.3], random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)

# METHOD 1: Simple K-Fold Cross-Validation (k=5)
print("=" * 60)
print("K-FOLD CROSS-VALIDATION (k=5)")
print("=" * 60)
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"Accuracy scores for each fold: {scores}")
print(f"Mean Accuracy: {scores.mean():.3f} (+/- {scores.std()*2:.3f})")
print("\nHow it works:")
print("  1. Split data into 5 equal parts (folds)")
print("  2. Train on 4 folds, test on 1 fold")
print("  3. Repeat 5 times, each fold used as test once")
print("  4. Average results for robust estimate")

# METHOD 2: Stratified K-Fold (maintains class distribution)
print("\n" + "=" * 60)
print("STRATIFIED K-FOLD (Maintains Class Balance)")
print("=" * 60)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_stratified = cross_val_score(model, X, y, cv=skf, scoring='accuracy')
print(f"Stratified Accuracy: {scores_stratified.mean():.3f}")
print("\nAdvantage: Each fold has same class distribution as full dataset")
print("Use when: Imbalanced datasets")

# METHOD 3: Cross-Validate with Multiple Metrics
print("\n" + "=" * 60)
print("MULTIPLE METRICS EVALUATION")
print("=" * 60)
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
results = cross_validate(model, X, y, cv=5, scoring=scoring)
for metric in scoring:
    scores = results[f'test_{metric}']
    print(f"{metric.upper()}: {scores.mean():.3f} (+/- {scores.std()*2:.3f})")

# METHOD 4: Leave-One-Out Cross-Validation (LOOCV)
print("\n" + "=" * 60)
print("LEAVE-ONE-OUT CROSS-VALIDATION")
print("=" * 60)
# Note: Using smaller dataset for LOOCV (it's computationally expensive)
X_small, y_small = X[:50], y[:50]
loo = LeaveOneOut()
scores_loo = cross_val_score(model, X_small, y_small, cv=loo)
print(f"LOOCV Accuracy: {scores_loo.mean():.3f}")
print(f"Number of iterations: {len(scores_loo)} (one per sample)")
print("\nAdvantage: Uses almost all data for training")
print("Disadvantage: Computationally expensive for large datasets")

# METHOD 5: Custom K-Fold with Detailed Analysis
print("\n" + "=" * 60)
print("DETAILED FOLD-BY-FOLD ANALYSIS")
print("=" * 60)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_num = 1
for train_idx, test_idx in kf.split(X):
    X_train_fold, X_test_fold = X[train_idx], X[test_idx]
    y_train_fold, y_test_fold = y[train_idx], y[test_idx]
    model.fit(X_train_fold, y_train_fold)
    score = model.score(X_test_fold, y_test_fold)
    print(f"Fold {fold_num}: Train size={len(train_idx)}, "
          f"Test size={len(test_idx)}, Accuracy={score:.3f}")
    fold_num += 1

# Train/Validation/Test Split Example
print("\n" + "=" * 60)
print("TRAIN / VALIDATION / TEST SPLIT")
print("=" * 60)
print("60% Training: Learn model parameters")
print("20% Validation: Tune hyperparameters, select model")
print("20% Test: Final evaluation (touch only once!)")
print("\nWhy 3 splits?")
print("  - Training: Fit the model")
print("  - Validation: Prevent overfitting during hyperparameter tuning")
print("  - Test: Unbiased final performance estimate")

# Why Cross-Validation?
print("\n" + "=" * 60)
print("WHY USE CROSS-VALIDATION?")
print("=" * 60)
print("✓ More reliable performance estimate than single train/test split")
print("✓ Uses all data for both training and testing")
print("✓ Detects overfitting and high variance")
print("✓ Better for small datasets")
print("✗ Computationally expensive (trains k models)")
print("✗ Not suitable for time-series (use TimeSeriesSplit instead)")
```
Key Concepts
Confusion Matrix
A table showing true positives, true negatives, false positives, and false negatives. Essential for understanding classification model behavior beyond simple accuracy.
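If you prefer to see the matrix rather than read raw counts, scikit-learn ships a plotting helper. A minimal sketch, assuming matplotlib is installed and using made-up labels:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])  # illustrative labels
y_pred = np.array([0, 1, 0, 0, 1, 1, 1, 1])

# Build the matrix and render it as a labeled heatmap
cm = confusion_matrix(y_true, y_pred)
ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["neg", "pos"]).plot()
plt.show()
```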
Precision vs Recall
Precision: Of predicted positives, how many are correct? Recall: Of actual positives, how many did we find? Trade-off between false positives and false negatives.
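The trade-off is easiest to see by sweeping the decision threshold. A minimal sketch using scikit-learn's precision_recall_curve on made-up scores:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 1])  # illustrative labels
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3, 0.75, 0.55])

# Precision and recall at every candidate threshold
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
print("thresholds:", np.round(thresholds, 2))
print("precision: ", np.round(precision, 2))
print("recall:    ", np.round(recall, 2))
# As the threshold rises, precision tends to increase while recall decreases
```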
ROC-AUC
Receiver Operating Characteristic curve plots true positive rate vs false positive rate. AUC (Area Under Curve) measures overall classification performance across all thresholds.
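To visualize the curve itself rather than just the AUC number, here is a minimal matplotlib sketch using the fpr/tpr values from scikit-learn's roc_curve on the same made-up scores as the classification example above:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 1])
y_scores = np.array([0.1, 0.3, 0.8, 0.6, 0.2, 0.9, 0.7, 0.15, 0.85, 0.95])

fpr, tpr, _ = roc_curve(y_true, y_scores)
auc = roc_auc_score(y_true, y_scores)

# ROC curve: true positive rate vs false positive rate across thresholds
plt.plot(fpr, tpr, label=f"model (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guessing (AUC = 0.5)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```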
Cross-Validation
Splitting data into multiple folds and training on different combinations to get reliable performance estimates and detect overfitting.
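For time-ordered data, the cross-validation code above notes that standard k-fold is inappropriate and points to TimeSeriesSplit. A minimal sketch showing how it always trains on the past and tests on the future:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered samples

tscv = TimeSeriesSplit(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
# Each test fold comes strictly after its training data - no leakage from the future
```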
Interview Tips
- 💡Explain the confusion matrix components: TP (correctly predicted positive), TN (correctly predicted negative), FP (wrongly predicted positive - Type I error), FN (wrongly predicted negative - Type II error)
- 💡Know when to prioritize precision vs recall: use precision when false positives are costly (spam detection), recall when false negatives are costly (cancer detection)
- 💡Understand F1-score as the harmonic mean of precision and recall: 2 * (precision * recall) / (precision + recall). Useful for imbalanced datasets
- 💡Explain ROC-AUC: AUC=1.0 is perfect, AUC=0.5 is random guessing. ROC curve shows performance across all classification thresholds
- 💡For regression, know MSE (penalizes large errors heavily), RMSE (in original units), MAE (robust to outliers), and R² (proportion of variance explained)
- 💡Discuss cross-validation types: k-fold (divide into k parts), stratified k-fold (maintains class distribution), leave-one-out (k=n, expensive)
- 💡Explain train/validation/test split: training (60-70%) to learn, validation (15-20%) to tune hyperparameters, test (15-20%) for final evaluation
- 💡Be ready to discuss class imbalance solutions: oversampling (SMOTE), undersampling, class weights, or using precision-recall instead of ROC-AUC (see the sketch after this list)
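A minimal sketch of two of these options using plain scikit-learn: class weighting at training time and precision-recall-based scoring at evaluation time. SMOTE itself lives in the separate imbalanced-learn package and is omitted here; the dataset and parameters below are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, f1_score
from sklearn.model_selection import train_test_split

# Heavily imbalanced toy data: ~95% class 0, ~5% class 1
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.95, 0.05],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# class_weight='balanced' re-weights samples inversely to class frequency
model = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                               random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

# Average precision (area under the precision-recall curve) is usually more
# informative than ROC-AUC when the positive class is rare
print(f"F1 (minority class): {f1_score(y_test, y_pred):.3f}")
print(f"Average precision:   {average_precision_score(y_test, y_proba):.3f}")
```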