Feature Engineering
Transforming raw data into meaningful features for machine learning
Imagine you're teaching someone to identify good restaurants. Instead of just giving them addresses, you'd tell them about important features like 'distance from home', 'average rating', 'price range', and 'cuisine type'. Feature engineering is exactly this - taking raw data and transforming it into useful, meaningful pieces of information that help machine learning models learn better and make accurate predictions!
What is Feature Engineering?
Feature engineering is the process of using domain knowledge to extract, create, and transform features (input variables) from raw data to improve machine learning model performance. It's one of the most important steps in the ML pipeline and often makes the difference between a mediocre model and an excellent one. Good features can make simple algorithms work remarkably well, while poor features can make even sophisticated algorithms struggle.
# Feature Engineering Example: Restaurant Recommendation
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
from datetime import datetime

# RAW DATA (not very useful as-is)
raw_data = pd.DataFrame({
    'restaurant_id': [101, 102, 103, 104],
    'address': ['123 Main St', '456 Oak Ave', '789 Pine Rd', '321 Elm St'],
    'opened_date': ['2020-03-15', '2018-07-22', '2019-11-03', '2021-01-10'],
    'avg_rating': [4.5, 3.8, 4.9, 4.2],
    'review_count': [250, 89, 456, 123],
    'price_range': ['$$', '$', '$$$', '$$'],
    'cuisine': ['Italian', 'Mexican', 'Italian', 'Chinese']
})

# FEATURE ENGINEERING: Transform raw data into useful features

# 1. FEATURE EXTRACTION - Create new features from existing data
# Extract time-based features
raw_data['opened_date'] = pd.to_datetime(raw_data['opened_date'])
raw_data['years_open'] = (datetime.now() - raw_data['opened_date']).dt.days / 365.25
raw_data['is_new'] = (raw_data['years_open'] < 2).astype(int)

# Create interaction features
raw_data['rating_review_score'] = raw_data['avg_rating'] * np.log1p(raw_data['review_count'])

# 2. FEATURE ENCODING - Convert categorical to numerical
# One-hot encoding for nominal categories (no order)
cuisine_encoded = pd.get_dummies(raw_data['cuisine'], prefix='cuisine')

# Ordinal encoding for ordered categories
price_mapping = {'$': 1, '$$': 2, '$$$': 3, '$$$$': 4}
raw_data['price_numeric'] = raw_data['price_range'].map(price_mapping)

# 3. FEATURE SCALING - Normalize numerical features
scaler = StandardScaler()
raw_data['rating_scaled'] = scaler.fit_transform(raw_data[['avg_rating']])
raw_data['review_count_scaled'] = scaler.fit_transform(raw_data[['review_count']])

# Final feature set ready for ML
features = pd.concat([
    raw_data[['years_open', 'is_new', 'rating_review_score', 'price_numeric']],
    cuisine_encoded
], axis=1)

print("Original Raw Data:")
print(raw_data[['address', 'avg_rating', 'price_range']].head())
print("\nEngineered Features:")
print(features.head())
# These engineered features are much more useful for ML models!

Feature Extraction
Feature extraction involves deriving new features from existing data:
Temporal Features
Extract meaningful time components from timestamps
- • Hour, day, month, year
- • Day of week, is_weekend
- • Season, quarter
- • Time since event
Text Features
Extract features from text data
- • Length, word count
- • TF-IDF, word embeddings (TF-IDF sketched below)
- • Sentiment scores
- • Entity extraction
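Of the text features above, TF-IDF is worth a concrete look, since the later code examples only cover simple counts and lengths. Below is a minimal sketch using scikit-learn's TfidfVectorizer; the three-review corpus and column layout are invented for illustration:

# Minimal TF-IDF sketch (invented three-review corpus)
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = pd.Series(['Great product!', 'Terrible service, very disappointed', 'OK product'])

vectorizer = TfidfVectorizer(lowercase=True, stop_words='english')
tfidf_matrix = vectorizer.fit_transform(reviews)  # sparse matrix: rows = documents, columns = vocabulary terms

# Wrap in a DataFrame so each vocabulary term becomes a numeric feature column
tfidf_features = pd.DataFrame(tfidf_matrix.toarray(),
                              columns=vectorizer.get_feature_names_out())
print(tfidf_features.round(2))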
Interaction Features
Combine existing features to capture relationships
- • Feature multiplication
- • Ratio features
- • Polynomial features
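Polynomial and interaction terms can also be generated automatically rather than by hand. Here is a minimal sketch with scikit-learn's PolynomialFeatures; the two housing columns are invented for illustration:

# Polynomial / interaction feature sketch (invented example columns)
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

X = pd.DataFrame({'square_feet': [1500, 2000, 1200], 'age': [10, 5, 20]})

# degree=2 adds squared terms plus the pairwise product (interaction term)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = pd.DataFrame(poly.fit_transform(X),
                      columns=poly.get_feature_names_out(X.columns))
print(X_poly.columns.tolist())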
Aggregation Features
Statistical summaries of grouped data
- • Mean, median, std by group
- • Count, sum per category
- • Rolling statistics
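Rolling statistics are the one aggregation type not shown in the larger example that follows, so here is a brief sketch on a made-up daily revenue series:

# Rolling-window feature sketch (made-up daily revenue series)
import pandas as pd

sales = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=10, freq='D'),
    'revenue': [120, 90, 150, 200, 170, 80, 220, 160, 140, 210]
})

# 3-day rolling mean and std capture recent trend and volatility
sales['revenue_roll_mean_3d'] = sales['revenue'].rolling(window=3).mean()
sales['revenue_roll_std_3d'] = sales['revenue'].rolling(window=3).std()

# Lag feature: yesterday's revenue (shift avoids leaking the current value)
sales['revenue_lag_1d'] = sales['revenue'].shift(1)
print(sales.head())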
# Feature Extraction Examples
import pandas as pd
import numpy as np
from datetime import datetime

# 1. TEMPORAL FEATURES
df = pd.DataFrame({
    'timestamp': pd.date_range('2024-01-01', periods=100, freq='H')
})
df['hour'] = df['timestamp'].dt.hour
df['day_of_week'] = df['timestamp'].dt.dayofweek
df['is_weekend'] = (df['day_of_week'] >= 5).astype(int)
df['month'] = df['timestamp'].dt.month
df['is_holiday_season'] = df['month'].isin([11, 12]).astype(int)

# 2. TEXT FEATURES
df_text = pd.DataFrame({
    'review': ['Great product!', 'Terrible service, very disappointed', 'OK']
})
df_text['length'] = df_text['review'].str.len()
df_text['word_count'] = df_text['review'].str.split().str.len()
df_text['exclamation_count'] = df_text['review'].str.count('!')
df_text['avg_word_length'] = df_text['review'].apply(
    lambda x: np.mean([len(w) for w in x.split()]))

# 3. INTERACTION FEATURES
df_house = pd.DataFrame({
    'square_feet': [1500, 2000, 1200],
    'bedrooms': [3, 4, 2],
    'age': [10, 5, 20]
})
df_house['sqft_per_bedroom'] = df_house['square_feet'] / df_house['bedrooms']
df_house['sqft_age_interaction'] = df_house['square_feet'] * df_house['age']
df_house['is_new_and_large'] = (
    (df_house['age'] < 10) & (df_house['square_feet'] > 1800)).astype(int)

# 4. AGGREGATION FEATURES
df_sales = pd.DataFrame({
    'customer_id': [1, 1, 1, 2, 2, 3],
    'purchase_amount': [100, 150, 200, 50, 75, 300]
})
# Group-level statistics
customer_stats = df_sales.groupby('customer_id')['purchase_amount'].agg([
    'mean', 'sum', 'count', 'std', 'min', 'max']).reset_index()
customer_stats.columns = ['customer_id', 'avg_purchase', 'total_spent',
                          'purchase_count', 'purchase_std', 'min_purchase', 'max_purchase']
print(customer_stats)

Feature Selection
Choosing the most relevant features to reduce dimensionality and improve model performance:
# Feature Selection Methods
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import (
    SelectKBest, f_classif, mutual_info_classif, RFE, SelectFromModel)
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Generate sample data with 20 features (only 5 are informative)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=10, random_state=42)
feature_names = [f'feature_{i}' for i in range(20)]

# METHOD 1: Filter Methods - Statistical tests
# Select top 5 features using ANOVA F-test
selector_f = SelectKBest(f_classif, k=5)
X_selected_f = selector_f.fit_transform(X, y)
selected_features_f = [feature_names[i] for i in selector_f.get_support(indices=True)]
print("F-test selected features:", selected_features_f)

# METHOD 2: Mutual Information - Captures non-linear relationships
selector_mi = SelectKBest(mutual_info_classif, k=5)
X_selected_mi = selector_mi.fit_transform(X, y)
selected_features_mi = [feature_names[i] for i in selector_mi.get_support(indices=True)]
print("Mutual Info selected features:", selected_features_mi)

# METHOD 3: Wrapper Method - Recursive Feature Elimination (RFE)
model = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=model, n_features_to_select=5)
X_selected_rfe = rfe.fit_transform(X, y)
selected_features_rfe = [feature_names[i] for i in rfe.get_support(indices=True)]
print("RFE selected features:", selected_features_rfe)

# METHOD 4: Embedded Method - Feature Importance from Tree-based Model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Get feature importances
feature_importance = pd.DataFrame({
    'feature': feature_names,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)
print("\nTop 5 features by Random Forest importance:")
print(feature_importance.head())

# Select features with importance > threshold
selector_tree = SelectFromModel(rf, prefit=True, threshold='median')
X_selected_tree = selector_tree.transform(X)
selected_features_tree = [feature_names[i] for i in selector_tree.get_support(indices=True)]
print("\nTree-based selected features:", selected_features_tree)

# METHOD 5: Correlation-based Selection
# Remove highly correlated features (multicollinearity)
df = pd.DataFrame(X, columns=feature_names)
correlation_matrix = df.corr().abs()

# Find pairs with correlation > 0.9
high_corr_pairs = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i):
        if correlation_matrix.iloc[i, j] > 0.9:
            colname = correlation_matrix.columns[i]
            high_corr_pairs.append(colname)
print(f"\nFeatures with high correlation to remove: {set(high_corr_pairs)}")

Feature Scaling & Normalization
Transforming features to a similar scale to prevent features with larger ranges from dominating:
# Feature Scaling Methods
import numpy as np
import pandas as pd
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler, LabelEncoder, OneHotEncoder)

# Sample data with different scales
data = pd.DataFrame({
    'age': [25, 45, 35, 50, 28],
    'salary': [50000, 120000, 75000, 150000, 60000],
    'years_experience': [2, 15, 8, 20, 4],
    'category': ['A', 'B', 'A', 'C', 'B']
})
print("Original Data:")
print(data)

# METHOD 1: STANDARDIZATION (Z-score normalization)
# Transforms to mean=0, std=1
# Use when: Features follow normal distribution, for algorithms sensitive to scale
scaler_standard = StandardScaler()
data_standardized = data[['age', 'salary', 'years_experience']].copy()
data_standardized[:] = scaler_standard.fit_transform(data_standardized)
print("\nStandardized (mean=0, std=1):")
print(data_standardized)

# METHOD 2: MIN-MAX NORMALIZATION
# Scales to range [0, 1]
# Use when: Need bounded range, distribution is not Gaussian
scaler_minmax = MinMaxScaler()
data_normalized = data[['age', 'salary', 'years_experience']].copy()
data_normalized[:] = scaler_minmax.fit_transform(data_normalized)
print("\nMin-Max Normalized (0 to 1):")
print(data_normalized)

# METHOD 3: ROBUST SCALING
# Uses median and IQR, robust to outliers
# Use when: Data has many outliers
scaler_robust = RobustScaler()
data_robust = data[['age', 'salary', 'years_experience']].copy()
data_robust[:] = scaler_robust.fit_transform(data_robust)
print("\nRobust Scaled (robust to outliers):")
print(data_robust)

# METHOD 4: LABEL ENCODING (for ordinal categorical)
# Converts categories to integers: A=0, B=1, C=2
label_encoder = LabelEncoder()
data['category_encoded'] = label_encoder.fit_transform(data['category'])
print("\nLabel Encoded:")
print(data[['category', 'category_encoded']])

# METHOD 5: ONE-HOT ENCODING (for nominal categorical)
# Creates binary columns for each category
onehot = pd.get_dummies(data['category'], prefix='cat')
print("\nOne-Hot Encoded:")
print(onehot)

# WHEN TO USE EACH SCALING METHOD:
print("\n" + "="*60)
print("SCALING METHOD GUIDE:")
print("="*60)
print("StandardScaler: Neural Networks, SVM, KNN, Logistic Regression")
print("MinMaxScaler: Neural Networks (when bounded range needed)")
print("RobustScaler: When data has outliers")
print("NO SCALING: Tree-based models (Decision Trees, Random Forest, XGBoost)")
print("LabelEncoder: Ordinal categories (Small < Medium < Large)")
print("OneHotEncoder: Nominal categories (Red, Blue, Green)")

Key Concepts
Feature Extraction
Creating new features from raw data using domain knowledge or automatic methods (e.g., extracting 'day of week' from timestamps, text embeddings from documents).
Feature Selection
Identifying and keeping only the most relevant features to reduce dimensionality, prevent overfitting, and improve model performance.
Feature Encoding
Converting categorical variables into numerical format (one-hot encoding, label encoding, target encoding) so algorithms can process them.
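Target encoding is named here but not demonstrated in the earlier code, so here is a minimal sketch on a made-up city/churn dataset, with smoothing toward the global mean so rare categories are not over-fit:

# Target (mean) encoding sketch with smoothing (made-up 'city'/'churned' data)
import pandas as pd

df = pd.DataFrame({
    'city': ['NY', 'NY', 'SF', 'SF', 'SF', 'LA'],
    'churned': [1, 0, 0, 0, 1, 1]
})

global_mean = df['churned'].mean()
stats = df.groupby('city')['churned'].agg(['mean', 'count'])

# Smoothed encoding: categories with few rows are pulled toward the global mean
smoothing = 5
encoding = (stats['mean'] * stats['count'] + global_mean * smoothing) / (stats['count'] + smoothing)
df['city_target_enc'] = df['city'].map(encoding)
print(df)
# In practice, compute the encoding on training folds only to avoid target leakage.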
Feature Scaling
Normalizing or standardizing features to similar ranges (0-1 or mean=0, std=1) to ensure fair comparison and faster convergence.
Interview Tips
- 💡 Explain the difference between feature extraction (creating new features) and feature selection (choosing among existing features)
- 💡 Know common encoding methods: one-hot encoding for nominal categories, ordinal encoding for ordered categories, target encoding for high-cardinality categories
- 💡 Understand when to use normalization (Min-Max scaling to [0, 1]) vs. standardization (Z-score scaling to mean=0, std=1)
- 💡 Be ready to discuss feature importance methods: correlation analysis, mutual information, feature importance from tree-based models, SHAP values
- 💡 Explain why feature scaling is crucial for distance-based algorithms (KNN, SVM, Neural Networks) but not for tree-based models
- 💡 Know how to handle missing values: imputation (mean/median/mode), forward/backward fill, or creating indicator variables (see the sketch after this list)
- 💡 Discuss domain-specific feature engineering examples: extracting hour/day/month from dates, creating interaction features, polynomial features
- 💡 Understand the curse of dimensionality: too many features can lead to overfitting and computational issues
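For the missing-values tip above, here is a minimal sketch combining median imputation with an explicit missingness indicator, using scikit-learn's SimpleImputer; the income column and its values are made up for illustration:

# Missing-value handling sketch (made-up 'income' column)
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'income': [52000, np.nan, 61000, np.nan, 48000]})

# Indicator variable: lets the model learn whether missingness itself is informative
df['income_missing'] = df['income'].isna().astype(int)

# Median imputation is robust to skew and outliers; mean or mode are alternatives
imputer = SimpleImputer(strategy='median')
df['income'] = imputer.fit_transform(df[['income']]).ravel()
print(df)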