Feature Engineering
Transforming raw data into meaningful features for machine learning
Imagine you're teaching someone to identify good restaurants. Instead of just giving them addresses, you'd tell them about important features like 'distance from home', 'average rating', 'price range', and 'cuisine type'. Feature engineering is exactly this - taking raw data and transforming it into useful, meaningful pieces of information that help machine learning models learn better and make accurate predictions!
What is Feature Engineering?
Feature engineering is the process of using domain knowledge to extract, create, and transform features (input variables) from raw data to improve machine learning model performance. It's one of the most important steps in the ML pipeline and often makes the difference between a mediocre model and an excellent one. Good features can make simple algorithms work remarkably well, while poor features can make even sophisticated algorithms struggle.
# Feature Engineering Example: Restaurant Recommendation
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
from datetime import datetime

# RAW DATA (not very useful as-is)
raw_data = pd.DataFrame({
    'restaurant_id': [101, 102, 103, 104],
    'address': ['123 Main St', '456 Oak Ave', '789 Pine Rd', '321 Elm St'],
    'opened_date': ['2020-03-15', '2018-07-22', '2019-11-03', '2021-01-10'],
    'avg_rating': [4.5, 3.8, 4.9, 4.2],
    'review_count': [250, 89, 456, 123],
    'price_range': ['$$', '$', '$$$', '$$'],
    'cuisine': ['Italian', 'Mexican', 'Italian', 'Chinese']
})

# FEATURE ENGINEERING: Transform raw data into useful features

# 1. FEATURE EXTRACTION - Create new features from existing data
# Extract time-based features
raw_data['opened_date'] = pd.to_datetime(raw_data['opened_date'])
raw_data['years_open'] = (datetime.now() - raw_data['opened_date']).dt.days / 365.25
raw_data['is_new'] = (raw_data['years_open'] < 2).astype(int)

# Create interaction features
raw_data['rating_review_score'] = raw_data['avg_rating'] * np.log1p(raw_data['review_count'])

# 2. FEATURE ENCODING - Convert categorical to numerical
# One-hot encoding for nominal categories (no order)
cuisine_encoded = pd.get_dummies(raw_data['cuisine'], prefix='cuisine')

# Ordinal encoding for ordered categories
price_mapping = {'$': 1, '$$': 2, '$$$': 3, '$$$$': 4}
raw_data['price_numeric'] = raw_data['price_range'].map(price_mapping)

# 3. FEATURE SCALING - Normalize numerical features
scaler = StandardScaler()
raw_data['rating_scaled'] = scaler.fit_transform(raw_data[['avg_rating']])
raw_data['review_count_scaled'] = scaler.fit_transform(raw_data[['review_count']])

# Final feature set ready for ML
features = pd.concat([
    raw_data[['years_open', 'is_new', 'rating_review_score', 'price_numeric']],
    cuisine_encoded
], axis=1)

print("Original Raw Data:")
print(raw_data[['address', 'avg_rating', 'price_range']].head())
print("\nEngineered Features:")
print(features.head())
# These engineered features are much more useful for ML models!

Feature Extraction
Feature extraction involves deriving new features from existing data:
Temporal Features
Extract meaningful time components from timestamps
- • Hour, day, month, year
- • Day of week, is_weekend
- • Season, quarter
- • Time since event
Text Features
Extract features from text data
- • Length, word count
- • TF-IDF, word embeddings (TF-IDF sketched below)
- • Sentiment scores
- • Entity extraction
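Of the text features above, TF-IDF is worth a concrete look, since the later code examples only cover simple counts and lengths. Below is a minimal sketch using scikit-learn's TfidfVectorizer; the three-review corpus and column layout are invented for illustration:

# Minimal TF-IDF sketch (invented three-review corpus)
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = pd.Series(['Great product!', 'Terrible service, very disappointed', 'OK product'])

vectorizer = TfidfVectorizer(lowercase=True, stop_words='english')
tfidf_matrix = vectorizer.fit_transform(reviews)  # sparse matrix: rows = documents, columns = vocabulary terms

# Wrap in a DataFrame so each vocabulary term becomes a numeric feature column
tfidf_features = pd.DataFrame(tfidf_matrix.toarray(),
                              columns=vectorizer.get_feature_names_out())
print(tfidf_features.round(2))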
Interaction Features
Combine existing features to capture relationships
- • Feature multiplication
- • Ratio features
- • Polynomial features
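Polynomial and interaction terms can also be generated automatically rather than by hand. Here is a minimal sketch with scikit-learn's PolynomialFeatures; the two housing columns are invented for illustration:

# Polynomial / interaction feature sketch (invented example columns)
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

X = pd.DataFrame({'square_feet': [1500, 2000, 1200], 'age': [10, 5, 20]})

# degree=2 adds squared terms plus the pairwise product (interaction term)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = pd.DataFrame(poly.fit_transform(X),
                      columns=poly.get_feature_names_out(X.columns))
print(X_poly.columns.tolist())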
Aggregation Features
Statistical summaries of grouped data
- • Mean, median, std by group
- • Count, sum per category
- • Rolling statistics
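Rolling statistics are the one aggregation type not shown in the larger example that follows, so here is a brief sketch on a made-up daily revenue series:

# Rolling-window feature sketch (made-up daily revenue series)
import pandas as pd

sales = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=10, freq='D'),
    'revenue': [120, 90, 150, 200, 170, 80, 220, 160, 140, 210]
})

# 3-day rolling mean and std capture recent trend and volatility
sales['revenue_roll_mean_3d'] = sales['revenue'].rolling(window=3).mean()
sales['revenue_roll_std_3d'] = sales['revenue'].rolling(window=3).std()

# Lag feature: yesterday's revenue (shift avoids leaking the current value)
sales['revenue_lag_1d'] = sales['revenue'].shift(1)
print(sales.head())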
# Feature Extraction Examples
import pandas as pd
import numpy as np
from datetime import datetime

# 1. TEMPORAL FEATURES
df = pd.DataFrame({
    'timestamp': pd.date_range('2024-01-01', periods=100, freq='H')
})
df['hour'] = df['timestamp'].dt.hour
df['day_of_week'] = df['timestamp'].dt.dayofweek
df['is_weekend'] = (df['day_of_week'] >= 5).astype(int)
df['month'] = df['timestamp'].dt.month
df['is_holiday_season'] = df['month'].isin([11, 12]).astype(int)

# 2. TEXT FEATURES
df_text = pd.DataFrame({
    'review': ['Great product!', 'Terrible service, very disappointed', 'OK']
})
df_text['length'] = df_text['review'].str.len()
df_text['word_count'] = df_text['review'].str.split().str.len()
df_text['exclamation_count'] = df_text['review'].str.count('!')
df_text['avg_word_length'] = df_text['review'].apply(
    lambda x: np.mean([len(w) for w in x.split()]))

# 3. INTERACTION FEATURES
df_house = pd.DataFrame({
    'square_feet': [1500, 2000, 1200],
    'bedrooms': [3, 4, 2],
    'age': [10, 5, 20]
})
df_house['sqft_per_bedroom'] = df_house['square_feet'] / df_house['bedrooms']
df_house['sqft_age_interaction'] = df_house['square_feet'] * df_house['age']
df_house['is_new_and_large'] = (
    (df_house['age'] < 10) & (df_house['square_feet'] > 1800)).astype(int)

# 4. AGGREGATION FEATURES
df_sales = pd.DataFrame({
    'customer_id': [1, 1, 1, 2, 2, 3],
    'purchase_amount': [100, 150, 200, 50, 75, 300]
})
# Group-level statistics
customer_stats = df_sales.groupby('customer_id')['purchase_amount'].agg([
    'mean', 'sum', 'count', 'std', 'min', 'max']).reset_index()
customer_stats.columns = ['customer_id', 'avg_purchase', 'total_spent',
                          'purchase_count', 'purchase_std', 'min_purchase', 'max_purchase']
print(customer_stats)

Feature Selection
Choosing the most relevant features to reduce dimensionality and improve model performance:
# Feature Selection Methods
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import (
    SelectKBest, f_classif, mutual_info_classif, RFE, SelectFromModel)
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Generate sample data with 20 features (only 5 are informative)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=10, random_state=42)
feature_names = [f'feature_{i}' for i in range(20)]

# METHOD 1: Filter Methods - Statistical tests
# Select top 5 features using ANOVA F-test
selector_f = SelectKBest(f_classif, k=5)
X_selected_f = selector_f.fit_transform(X, y)
selected_features_f = [feature_names[i] for i in selector_f.get_support(indices=True)]
print("F-test selected features:", selected_features_f)

# METHOD 2: Mutual Information - Captures non-linear relationships
selector_mi = SelectKBest(mutual_info_classif, k=5)
X_selected_mi = selector_mi.fit_transform(X, y)
selected_features_mi = [feature_names[i] for i in selector_mi.get_support(indices=True)]
print("Mutual Info selected features:", selected_features_mi)

# METHOD 3: Wrapper Method - Recursive Feature Elimination (RFE)
model = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=model, n_features_to_select=5)
X_selected_rfe = rfe.fit_transform(X, y)
selected_features_rfe = [feature_names[i] for i in rfe.get_support(indices=True)]
print("RFE selected features:", selected_features_rfe)

# METHOD 4: Embedded Method - Feature Importance from Tree-based Model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Get feature importances
feature_importance = pd.DataFrame({
    'feature': feature_names,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)
print("\nTop 5 features by Random Forest importance:")
print(feature_importance.head())

# Select features with importance > threshold
selector_tree = SelectFromModel(rf, prefit=True, threshold='median')
X_selected_tree = selector_tree.transform(X)
selected_features_tree = [feature_names[i] for i in selector_tree.get_support(indices=True)]
print("\nTree-based selected features:", selected_features_tree)

# METHOD 5: Correlation-based Selection
# Remove highly correlated features (multicollinearity)
df = pd.DataFrame(X, columns=feature_names)
correlation_matrix = df.corr().abs()

# Find pairs with correlation > 0.9
high_corr_pairs = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i):
        if correlation_matrix.iloc[i, j] > 0.9:
            colname = correlation_matrix.columns[i]
            high_corr_pairs.append(colname)
print(f"\nFeatures with high correlation to remove: {set(high_corr_pairs)}")

Feature Scaling & Normalization
Transforming features to a similar scale to prevent features with larger ranges from dominating:
# Feature Scaling Methods
import numpy as np
import pandas as pd
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler, LabelEncoder, OneHotEncoder)

# Sample data with different scales
data = pd.DataFrame({
    'age': [25, 45, 35, 50, 28],
    'salary': [50000, 120000, 75000, 150000, 60000],
    'years_experience': [2, 15, 8, 20, 4],
    'category': ['A', 'B', 'A', 'C', 'B']
})
print("Original Data:")
print(data)

# METHOD 1: STANDARDIZATION (Z-score normalization)
# Transforms to mean=0, std=1
# Use when: Features follow normal distribution, for algorithms sensitive to scale
scaler_standard = StandardScaler()
data_standardized = data[['age', 'salary', 'years_experience']].copy()
data_standardized[:] = scaler_standard.fit_transform(data_standardized)
print("\nStandardized (mean=0, std=1):")
print(data_standardized)

# METHOD 2: MIN-MAX NORMALIZATION
# Scales to range [0, 1]
# Use when: Need bounded range, distribution is not Gaussian
scaler_minmax = MinMaxScaler()
data_normalized = data[['age', 'salary', 'years_experience']].copy()
data_normalized[:] = scaler_minmax.fit_transform(data_normalized)
print("\nMin-Max Normalized (0 to 1):")
print(data_normalized)

# METHOD 3: ROBUST SCALING
# Uses median and IQR, robust to outliers
# Use when: Data has many outliers
scaler_robust = RobustScaler()
data_robust = data[['age', 'salary', 'years_experience']].copy()
data_robust[:] = scaler_robust.fit_transform(data_robust)
print("\nRobust Scaled (robust to outliers):")
print(data_robust)

# METHOD 4: LABEL ENCODING (for ordinal categorical)
# Converts categories to integers: A=0, B=1, C=2
label_encoder = LabelEncoder()
data['category_encoded'] = label_encoder.fit_transform(data['category'])
print("\nLabel Encoded:")
print(data[['category', 'category_encoded']])

# METHOD 5: ONE-HOT ENCODING (for nominal categorical)
# Creates binary columns for each category
onehot = pd.get_dummies(data['category'], prefix='cat')
print("\nOne-Hot Encoded:")
print(onehot)

# WHEN TO USE EACH SCALING METHOD:
print("\n" + "="*60)
print("SCALING METHOD GUIDE:")
print("="*60)
print("StandardScaler: Neural Networks, SVM, KNN, Logistic Regression")
print("MinMaxScaler: Neural Networks (when bounded range needed)")
print("RobustScaler: When data has outliers")
print("NO SCALING: Tree-based models (Decision Trees, Random Forest, XGBoost)")
print("LabelEncoder: Ordinal categories (Small < Medium < Large)")
print("OneHotEncoder: Nominal categories (Red, Blue, Green)")

Key Concepts
Feature Extraction
Creating new features from raw data using domain knowledge or automatic methods (e.g., extracting 'day of week' from timestamps, text embeddings from documents).
Feature Selection
Identifying and keeping only the most relevant features to reduce dimensionality, prevent overfitting, and improve model performance.
Feature Encoding
Converting categorical variables into numerical format (one-hot encoding, label encoding, target encoding) so algorithms can process them.
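Target encoding is named here but not demonstrated in the earlier code, so here is a minimal sketch on a made-up city/churn dataset, with smoothing toward the global mean so rare categories are not over-fit:

# Target (mean) encoding sketch with smoothing (made-up 'city'/'churned' data)
import pandas as pd

df = pd.DataFrame({
    'city': ['NY', 'NY', 'SF', 'SF', 'SF', 'LA'],
    'churned': [1, 0, 0, 0, 1, 1]
})

global_mean = df['churned'].mean()
stats = df.groupby('city')['churned'].agg(['mean', 'count'])

# Smoothed encoding: categories with few rows are pulled toward the global mean
smoothing = 5
encoding = (stats['mean'] * stats['count'] + global_mean * smoothing) / (stats['count'] + smoothing)
df['city_target_enc'] = df['city'].map(encoding)
print(df)
# In practice, compute the encoding on training folds only to avoid target leakage.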
Feature Scaling
Normalizing or standardizing features to similar ranges (0-1 or mean=0, std=1) to ensure fair comparison and faster convergence.
Interview Tips
- 💡 Explain the difference between feature extraction (creating new features) and feature selection (choosing among existing features)
- 💡 Know common encoding methods: one-hot encoding for nominal categories, ordinal encoding for ordered categories, target encoding for high-cardinality categories
- 💡 Understand when to use normalization (Min-Max scaling to [0, 1]) vs. standardization (Z-score scaling to mean=0, std=1)
- 💡 Be ready to discuss feature importance methods: correlation analysis, mutual information, feature importance from tree-based models, SHAP values
- 💡 Explain why feature scaling is crucial for distance-based algorithms (KNN, SVM, Neural Networks) but not for tree-based models
- 💡 Know how to handle missing values: imputation (mean/median/mode), forward/backward fill, or creating indicator variables (see the sketch after this list)
- 💡 Discuss domain-specific feature engineering examples: extracting hour/day/month from dates, creating interaction features, polynomial features
- 💡 Understand the curse of dimensionality: too many features can lead to overfitting and computational issues
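For the missing-values tip above, here is a minimal sketch combining median imputation with an explicit missingness indicator, using scikit-learn's SimpleImputer; the income column and its values are made up for illustration:

# Missing-value handling sketch (made-up 'income' column)
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'income': [52000, np.nan, 61000, np.nan, 48000]})

# Indicator variable: lets the model learn whether missingness itself is informative
df['income_missing'] = df['income'].isna().astype(int)

# Median imputation is robust to skew and outliers; mean or mode are alternatives
imputer = SimpleImputer(strategy='median')
df['income'] = imputer.fit_transform(df[['income']]).ravel()
print(df)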