Supervised Learning
Understanding machine learning with labeled training data
Imagine teaching a child to identify fruits by showing them examples. You show them an apple and say 'this is an apple', then a banana and say 'this is a banana'. After seeing many examples with labels, they can identify new fruits on their own. Supervised learning works the same way: we teach computers by showing them examples with correct answers (labels), and they learn to make predictions on new, unseen data.
What is Supervised Learning?
Supervised learning is a type of machine learning where the algorithm learns from labeled training data. Each training example consists of input features and a corresponding output label. The algorithm learns to map inputs to outputs by finding patterns in the training data, then uses this learned mapping to make predictions on new, unseen data.
# Supervised Learning Example: Email Spam Detection
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

# LABELED TRAINING DATA (Input + Correct Output)
emails = [
    "Win free money now! Click here!",        # spam
    "Meeting scheduled for tomorrow at 3pm",  # not spam
    "Congratulations! You won a prize!",      # spam
    "Can we reschedule our call?",            # not spam
    "Limited time offer! Act now!",           # spam
    "Please review the attached document",    # not spam
]

# Labels: 1 = spam, 0 = not spam (the "supervision")
labels = [1, 0, 1, 0, 1, 0]

# Convert text to numerical features
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Split into training and testing sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42)

# TRAIN the model using labeled data
model = MultinomialNB()
model.fit(X_train, y_train)

# TEST on new, unseen emails
new_emails = [
    "Free gift waiting for you!",
    "Project deadline is next week"
]
X_new = vectorizer.transform(new_emails)
predictions = model.predict(X_new)

for email, pred in zip(new_emails, predictions):
    print(f"Email: '{email}'")
    print(f"Prediction: {'SPAM' if pred == 1 else 'NOT SPAM'}\n")

# Output:
# Email: 'Free gift waiting for you!'
# Prediction: SPAM
#
# Email: 'Project deadline is next week'
# Prediction: NOT SPAM

Types of Supervised Learning
Supervised learning problems are divided into two main categories:
1. Classification
Predicting discrete categories or classes. The output is a label from a finite set.
Examples:
- Email: Spam or Not Spam
- Image: Cat, Dog, or Bird
- Medical: Disease Present or Absent
- Sentiment: Positive, Negative, Neutral
2. Regression
Predicting continuous numerical values. The output is a real number.
Examples:
- House Price Prediction ($)
- Stock Price Forecasting
- Temperature Prediction (°C)
- Sales Revenue Estimation
# CLASSIFICATION Example: Iris Flower Species
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Load iris dataset (150 flowers with 4 features each)
iris = load_iris()
X = iris.data    # Features: sepal length, sepal width, petal length, petal width
y = iris.target  # Labels: 0=setosa, 1=versicolor, 2=virginica

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Train classifier
classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train)

# Predict (output is a discrete class)
print(classifier.predict([[5.1, 3.5, 1.4, 0.2]]))  # [0] - setosa

# -----------------------------------------------------------
# REGRESSION Example: House Price Prediction
from sklearn.linear_model import LinearRegression

# Features: [square_feet, bedrooms, age]
X = [[1500, 3, 10], [2000, 4, 5], [1200, 2, 15], [1800, 3, 8]]
# Labels: house prices in thousands
y = [300, 400, 250, 350]

# Train regressor
regressor = LinearRegression()
regressor.fit(X, y)

# Predict (output is a continuous value)
new_house = [[1600, 3, 7]]
predicted_price = regressor.predict(new_house)
print(f"Predicted price: ${predicted_price[0]:.2f}k")  # ~$325.50k

Supervised Learning Workflow
The typical process involves several key steps:
1. Collect & Label Data: Gather data and assign correct labels (often the most expensive and time-consuming step).
2. Prepare & Split Data: Clean the data, handle missing values, and split it into training (70-80%) and testing (20-30%) sets; a missing-value sketch follows this list.
3. Choose Algorithm: Select an appropriate algorithm based on the problem type and data characteristics.
4. Train Model: Feed the training data to the algorithm, which learns patterns and adjusts its parameters.
5. Evaluate Performance: Test on unseen data, measure accuracy/error, and check for overfitting or underfitting.
6. Tune & Deploy: Optimize hyperparameters, retrain if needed, and deploy to production.
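Step 2 mentions handling missing values; the complete workflow example below skips that step because its toy data is complete. Here is a minimal sketch of what it looks like, using scikit-learn's SimpleImputer (the feature values are illustrative):

# Handling missing values (workflow step 2) with SimpleImputer: a minimal sketch
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with missing entries (np.nan); values are illustrative
X = np.array([[25.0, 120.0],
              [45.0, np.nan],
              [np.nan, 200.0],
              [50.0, 30.0]])

# Replace each missing value with the mean of its column
imputer = SimpleImputer(strategy="mean")
X_clean = imputer.fit_transform(X)
print(X_clean)
# Column means use the observed values only, so the np.nan in column 0
# becomes (25 + 45 + 50) / 3 = 40.0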
# Complete Supervised Learning Workflow
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# 1. COLLECT & LABEL DATA
# Example: Customer churn prediction (will they cancel their subscription?)
data = pd.DataFrame({
    'age': [25, 45, 35, 50, 23, 40, 33, 29],
    'monthly_usage_minutes': [120, 50, 200, 30, 300, 100, 250, 180],
    'support_calls': [5, 1, 0, 3, 0, 2, 1, 0],
    'churned': [1, 1, 0, 1, 0, 0, 0, 0]  # 1=left, 0=stayed (LABELS)
})

# 2. PREPARE & SPLIT DATA
X = data[['age', 'monthly_usage_minutes', 'support_calls']]
y = data['churned']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Feature scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 3. CHOOSE ALGORITHM (Random Forest for classification)
model = RandomForestClassifier(n_estimators=100, random_state=42)

# 4. TRAIN MODEL
model.fit(X_train, y_train)

# 5. EVALUATE PERFORMANCE
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
print("\nDetailed Report:")
print(classification_report(y_test, y_pred))

# 6. USE MODEL (Make predictions on new customers)
new_customer = [[35, 150, 2]]  # 35 years old, 150 min usage, 2 support calls
new_customer_scaled = scaler.transform(new_customer)
prediction = model.predict(new_customer_scaled)
print(f"\nWill customer churn? {'YES' if prediction[0] == 1 else 'NO'}")
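The example above only shows the "use model" half of step 6, but the tuning half matters just as much. Here is a minimal, self-contained sketch of hyperparameter tuning with scikit-learn's GridSearchCV, run on the iris data so it has enough samples for cross-validation; the grid values are illustrative assumptions, not recommended defaults:

# Hyperparameter tuning (workflow step 6) with GridSearchCV: a minimal sketch
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Candidate hyperparameter values (illustrative, not recommended defaults)
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 3, 5],
}

# 5-fold cross-validation over every combination in the grid
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)

print(f"Best parameters: {search.best_params_}")
print(f"Best CV accuracy: {search.best_score_:.2f}")
print(f"Test accuracy: {search.score(X_test, y_test):.2f}")
# GridSearchCV refits the best model on all of X_train,
# so `search` can be used directly for predictions before deployment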
Common Algorithms
Popular supervised learning algorithms include:
- Linear Regression (Regression): simple continuous predictions
- Logistic Regression (Classification): binary classification
- Decision Trees (Both): interpretable, handles non-linear relationships
- Random Forest (Both): ensemble method, reduces overfitting
- K-Nearest Neighbors (KNN) (Both): instance-based learning
- Support Vector Machines (Both): effective for high-dimensional data
- Naive Bayes (Classification): fast, well suited to text classification
- Neural Networks (Both): complex patterns, large datasets
- Gradient Boosting (Both): XGBoost, LightGBM, CatBoost
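The list above only names each algorithm, so here is one of them in action. A minimal sketch of binary classification with Logistic Regression (despite the name, it is a classifier); the hours-studied data is a toy illustration:

# Logistic Regression for binary classification: a minimal sketch
from sklearn.linear_model import LogisticRegression

# Toy data: hours studied -> passed exam? (1 = pass, 0 = fail); illustrative only
X = [[1], [2], [3], [4], [5], [6], [7], [8]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

clf = LogisticRegression()
clf.fit(X, y)

# Discrete class predictions for new students
print(clf.predict([[2.5], [6.5]]))  # [0 1]
# Class probabilities for a borderline case
print(clf.predict_proba([[4.5]]))   # roughly [[0.5, 0.5]] by symmetry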
Key Concepts
Labeled Data
Training data where each input has a known correct output. The labels are the 'answers' the model learns from.
Training vs Testing
Data is split into training set (to learn patterns) and testing set (to evaluate performance on unseen data).
Overfitting
When a model learns training data too well, including noise, and performs poorly on new data.
Underfitting
When a model is too simple to capture the underlying patterns in the data.
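Overfitting and underfitting are easiest to see in code. A minimal sketch on synthetic data, fitting polynomials of increasing degree (the noise level and degree choices are illustrative):

# Overfitting vs. underfitting: a minimal sketch on synthetic data
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Noisy sine wave: 30 points on [0, 1]
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 30)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

for degree in [1, 4, 15]:  # too simple, reasonable, too flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")

# Typically: degree 1 has high error on BOTH sets (underfitting), while
# degree 15 has near-zero train error but high test error (overfitting).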
Interview Tips
- 💡 Clearly distinguish between classification (discrete labels) and regression (continuous values)
- 💡 Explain the importance of the train/test split for detecting overfitting
- 💡 Know the common algorithms: Linear/Logistic Regression, Decision Trees, Random Forests, SVM, Neural Networks
- 💡 Understand the bias-variance tradeoff: simpler models have high bias, complex models have high variance
- 💡 Be ready to discuss evaluation metrics: accuracy, precision, recall, and F1-score for classification; MSE, RMSE, and R² for regression (a short sketch computing these follows this list)
- 💡 Explain why labeled data is both the strength (it enables learning) and the limitation (it is expensive to obtain) of supervised learning
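For the evaluation-metrics tip, here is a minimal sketch computing each named metric with sklearn.metrics; the label vectors are toy values chosen only to make the numbers non-trivial:

# Evaluation metrics from sklearn.metrics: a minimal sketch (toy values)
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, mean_squared_error, r2_score)

# Classification: accuracy, precision, recall, F1
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(f"Accuracy:  {accuracy_score(y_true, y_pred):.2f}")   # 0.75
print(f"Precision: {precision_score(y_true, y_pred):.2f}")  # 0.75
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")     # 0.75
print(f"F1-score:  {f1_score(y_true, y_pred):.2f}")         # 0.75

# Regression: MSE, RMSE, R²
y_true_reg = [300, 400, 250, 350]
y_pred_reg = [310, 390, 270, 340]
mse = mean_squared_error(y_true_reg, y_pred_reg)
print(f"MSE:  {mse:.1f}")                                   # 175.0
print(f"RMSE: {np.sqrt(mse):.1f}")                          # 13.2
print(f"R²:   {r2_score(y_true_reg, y_pred_reg):.3f}")      # 0.944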