Supervised Learning

Understanding machine learning with labeled training data

Imagine teaching a child to identify fruits by showing them examples! You show them an apple and say 'this is an apple', then a banana and say 'this is a banana'. After seeing many examples with labels, they can identify new fruits on their own. Supervised learning works the same way - we teach computers by showing them examples with correct answers (labels), and they learn to make predictions on new, unseen data!

What is Supervised Learning?

Supervised learning is a type of machine learning where the algorithm learns from labeled training data. Each training example consists of input features and a corresponding output label. The algorithm learns to map inputs to outputs by finding patterns in the training data, then uses this learned mapping to make predictions on new, unseen data.

python
# Supervised Learning Example: Email Spam Detection
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

# LABELED TRAINING DATA (input + correct output)
emails = [
    "Win free money now! Click here!",        # spam
    "Meeting scheduled for tomorrow at 3pm",  # not spam
    "Congratulations! You won a prize!",      # spam
    "Can we reschedule our call?",            # not spam
    "Limited time offer! Act now!",           # spam
    "Please review the attached document",    # not spam
]

# Labels: 1 = spam, 0 = not spam (the "supervision")
labels = [1, 0, 1, 0, 1, 0]

# Convert text to numerical features (word counts)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Split into training and testing sets (roughly 80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42
)

# TRAIN the model on the labeled data
model = MultinomialNB()
model.fit(X_train, y_train)

# TEST on new, unseen emails
new_emails = [
    "Free gift waiting for you!",
    "Project deadline is next week"
]
X_new = vectorizer.transform(new_emails)
predictions = model.predict(X_new)

for email, pred in zip(new_emails, predictions):
    print(f"Email: '{email}'")
    print(f"Prediction: {'SPAM' if pred == 1 else 'NOT SPAM'}\n")

# Expected output (with a training set this small, results can vary):
# Email: 'Free gift waiting for you!'
# Prediction: SPAM
#
# Email: 'Project deadline is next week'
# Prediction: NOT SPAM

Types of Supervised Learning

Supervised learning problems are divided into two main categories:

1. Classification

Predicting discrete categories or classes. The output is a label from a finite set.

Examples:

  • Email: Spam or Not Spam
  • Image: Cat, Dog, or Bird
  • Medical: Disease Present or Absent
  • Sentiment: Positive, Negative, Neutral

2. Regression

Predicting continuous numerical values. The output is a real number.

Examples:

  • House Price Prediction ($)
  • Stock Price Forecasting
  • Temperature Prediction (°C)
  • Sales Revenue Estimation
python
# CLASSIFICATION Example: Iris Flower Species
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Load the iris dataset (150 flowers with 4 features each)
iris = load_iris()
X = iris.data    # Features: sepal length, sepal width, petal length, petal width
y = iris.target  # Labels: 0=setosa, 1=versicolor, 2=virginica

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train classifier
classifier = DecisionTreeClassifier(random_state=42)
classifier.fit(X_train, y_train)

# Predict (output is a discrete class)
print(classifier.predict([[5.1, 3.5, 1.4, 0.2]]))  # [0] -> setosa

# -----------------------------------------------------------
# REGRESSION Example: House Price Prediction
from sklearn.linear_model import LinearRegression

# Features: [square_feet, bedrooms, age]
X = [[1500, 3, 10], [2000, 4, 5], [1200, 2, 15], [1800, 3, 8]]
# Labels: house prices in thousands
y = [300, 400, 250, 350]

# Train regressor
regressor = LinearRegression()
regressor.fit(X, y)

# Predict (output is a continuous value)
new_house = [[1600, 3, 7]]
predicted_price = regressor.predict(new_house)
print(f"Predicted price: ${predicted_price[0]:.2f}k")  # $287.50k on this toy data

Supervised Learning Workflow

The typical process involves several key steps:

  1. Collect & Label Data
     Gather data and assign correct labels (often expensive and time-consuming).

  2. Prepare & Split Data
     Clean the data, handle missing values, and split into training (70-80%) and testing (20-30%) sets.

  3. Choose Algorithm
     Select an appropriate algorithm based on the problem type and data characteristics.

  4. Train Model
     Feed the training data to the algorithm, which learns patterns and adjusts its parameters.

  5. Evaluate Performance
     Test on unseen data, measure accuracy or error, and check for overfitting or underfitting.

  6. Tune & Deploy
     Optimize hyperparameters, retrain if needed, and deploy to production.

python
# Complete Supervised Learning Workflow
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# 1. COLLECT & LABEL DATA
# Example: customer churn prediction (will they cancel their subscription?)
data = pd.DataFrame({
    'age': [25, 45, 35, 50, 23, 40, 33, 29],
    'monthly_usage_minutes': [120, 50, 200, 30, 300, 100, 250, 180],
    'support_calls': [5, 1, 0, 3, 0, 2, 1, 0],
    'churned': [1, 1, 0, 1, 0, 0, 0, 0]  # 1 = left, 0 = stayed (LABELS)
})

# 2. PREPARE & SPLIT DATA
X = data[['age', 'monthly_usage_minutes', 'support_calls']]
y = data['churned']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Feature scaling (fit on training data only, to avoid leakage)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 3. CHOOSE ALGORITHM (Random Forest for classification)
model = RandomForestClassifier(n_estimators=100, random_state=42)

# 4. TRAIN MODEL
model.fit(X_train, y_train)

# 5. EVALUATE PERFORMANCE
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
print("\nDetailed Report:")
print(classification_report(y_test, y_pred, zero_division=0))

# 6. USE MODEL (make predictions on new customers)
# 35 years old, 150 minutes of monthly usage, 2 support calls
new_customer = pd.DataFrame([[35, 150, 2]], columns=X.columns)
new_customer_scaled = scaler.transform(new_customer)
prediction = model.predict(new_customer_scaled)
print(f"\nWill the customer churn? {'YES' if prediction[0] == 1 else 'NO'}")

Common Algorithms

Popular supervised learning algorithms include:

  • Linear Regression (Regression): simple continuous predictions
  • Logistic Regression (Classification): binary classification
  • Decision Trees (Both): interpretable, handles non-linear relationships
  • Random Forest (Both): ensemble method, reduces overfitting
  • K-Nearest Neighbors (KNN) (Both): instance-based learning
  • Support Vector Machines (Both): effective with high-dimensional data
  • Naive Bayes (Classification): fast, well suited to text classification
  • Neural Networks (Both): complex patterns, large datasets
  • Gradient Boosting (Both): XGBoost, LightGBM, CatBoost
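
All of these are available in scikit-learn behind the same fit/predict interface, so trying several algorithms on the same data takes only a few lines. Here is a minimal sketch comparing some of them on the iris dataset (the exact accuracy figures will depend on the random split):

python
# Comparing several supervised algorithms via scikit-learn's shared fit/predict API
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
}

for name, model in models.items():
    model.fit(X_train, y_train)        # same training call for every algorithm
    acc = model.score(X_test, y_test)  # mean accuracy on the held-out set
    print(f"{name:20s} accuracy: {acc:.2f}")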

Key Concepts

Labeled Data

Training data where each input has a known correct output. The labels are the 'answers' the model learns from.

Training vs Testing

Data is split into training set (to learn patterns) and testing set (to evaluate performance on unseen data).

Overfitting

When a model learns training data too well, including noise, and performs poorly on new data.

Underfitting

When a model is too simple to capture the underlying patterns in the data.
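
To make overfitting and underfitting concrete, here is a minimal sketch that varies the depth of a decision tree on a synthetic dataset (generated with make_classification purely for illustration) and compares training accuracy against test accuracy:

python
# Overfitting vs. underfitting: vary decision-tree depth and compare
# training accuracy against test accuracy
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic labeled dataset (1000 samples, 20 features, 10% label noise)
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, flip_y=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

for depth in [1, 3, 10, None]:  # None = grow the tree until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)
    test_acc = tree.score(X_test, y_test)
    print(f"max_depth={depth}: train={train_acc:.2f}, test={test_acc:.2f}")

# Typical pattern: depth 1 underfits (both scores low), while an unlimited
# tree overfits (train accuracy near 1.0, test accuracy noticeably lower)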

Interview Tips

  • 💡 Clearly distinguish between classification (discrete labels) and regression (continuous values)
  • 💡 Explain the importance of the train/test split for detecting overfitting and estimating real-world performance
  • 💡 Know common algorithms: Linear/Logistic Regression, Decision Trees, Random Forests, SVM, Neural Networks
  • 💡 Understand the bias-variance tradeoff: simpler models tend toward high bias, complex models toward high variance
  • 💡 Be ready to discuss evaluation metrics: accuracy, precision, recall, F1-score for classification; MSE, RMSE, R² for regression (see the sketch after this list)
  • 💡 Explain why labeled data is both the strength (it enables learning) and the limitation (it is expensive to obtain) of supervised learning
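
As a quick refresher on those metrics, here is a minimal sketch that computes them with scikit-learn on made-up toy labels:

python
# Computing common evaluation metrics with scikit-learn
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, mean_squared_error, r2_score)

# Classification metrics on toy true/predicted labels
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1-score: ", f1_score(y_true, y_pred))

# Regression metrics on toy true/predicted values
y_true_r = [300, 400, 250, 350]
y_pred_r = [310, 390, 270, 340]
mse = mean_squared_error(y_true_r, y_pred_r)
print("MSE: ", mse)
print("RMSE:", np.sqrt(mse))
print("R²:  ", r2_score(y_true_r, y_pred_r))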