Convolutional Neural Networks (CNNs)

Understanding CNNs for computer vision and image recognition

Imagine you're trying to recognize a cat in a photo! You don't need to see every single pixel - you look for patterns: pointy ears, whiskers, furry texture. CNNs work the same way! They're specialized neural networks that automatically learn visual patterns. The 'convolution' part means they scan images with small filters (like 3x3 grids) to detect features: edges in early layers, shapes in middle layers, and complete objects like 'cat face' in final layers. Think of it as looking through a magnifying glass that slides across the image, learning 'if I see THIS pattern here, it might be a cat!' This layer-by-layer learning of increasingly complex features is why CNNs revolutionized computer vision!

What are Convolutional Neural Networks?

CNNs are specialized neural networks designed for processing grid-like data, especially images. Pioneered by Yann LeCun (LeNet, 1998) and popularized by AlexNet (2012), they use convolution operations to automatically learn spatial hierarchies of features. Unlike fully-connected networks, which ignore the spatial arrangement of pixels, CNNs exploit the 2D structure of images through local connectivity, weight sharing, and pooling. This makes them highly efficient and effective for tasks like image classification, object detection, and segmentation.

Fully-Connected Network

  • Every pixel connected to every neuron
  • Millions of parameters for images
  • Ignores spatial structure
  • Prone to overfitting
  • Not translation-invariant

Convolutional Network (CNN)

  • Local connectivity (small filters)
  • Weight sharing → far fewer parameters (quantified just below)
  • Exploits 2D spatial structure
  • Better generalization
  • Translation-invariant features
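
The parameter savings above can be made concrete with a quick count. Below is a minimal sketch (the 3x32x32 input size and layer widths are assumed for illustration, e.g. CIFAR-10-sized images) comparing one fully-connected layer with one convolutional layer that produces a comparable 32-channel output:

python
# Parameter count: fully-connected layer vs. convolutional layer
# (assumes a 3x32x32 input, e.g. CIFAR-10; layer sizes are illustrative)
import torch.nn as nn

# Fully-connected: every input value connects to every output unit
fc = nn.Linear(3 * 32 * 32, 32 * 32 * 32)  # 3,072 inputs -> 32,768 outputs

# Convolutional: 32 filters of size 3x3x3, reused at every spatial position
conv = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, padding=1)

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"Fully-connected parameters: {count(fc):,}")   # ~100 million
print(f"Convolutional parameters:   {count(conv):,}") # 896 (32*3*3*3 weights + 32 biases)

Weight sharing is the whole difference: the convolutional layer needs over 100,000x fewer parameters because each 3x3 filter is reused at every position in the image.
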
python
# Simple CNN with PyTorch
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        # Convolutional layers
        self.conv1 = nn.Conv2d(
            in_channels=3,    # RGB input (3 color channels)
            out_channels=32,  # 32 different filters
            kernel_size=3,    # 3x3 filter
            stride=1,
            padding=1
        )
        self.conv2 = nn.Conv2d(32, 64, 3, 1, 1)
        # Pooling layer
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)  # 2x2 max pooling
        # Fully connected layers
        self.fc1 = nn.Linear(64 * 8 * 8, 128)  # After 2 poolings: 32x32 → 16x16 → 8x8
        self.fc2 = nn.Linear(128, 10)          # 10 classes (e.g., CIFAR-10)
        self.relu = nn.ReLU()

    def forward(self, x):
        # Input: [batch, 3, 32, 32] (3-channel 32x32 images)
        # Conv block 1
        x = self.relu(self.conv1(x))  # [batch, 32, 32, 32]
        x = self.pool(x)              # [batch, 32, 16, 16]
        # Conv block 2
        x = self.relu(self.conv2(x))  # [batch, 64, 16, 16]
        x = self.pool(x)              # [batch, 64, 8, 8]
        # Flatten for fully-connected layers
        x = x.view(x.size(0), -1)     # [batch, 64*8*8 = 4096]
        # Fully connected layers
        x = self.relu(self.fc1(x))    # [batch, 128]
        x = self.fc2(x)               # [batch, 10] (logits)
        return x

# Create model
model = SimpleCNN()
print(model)

# Example input
input_image = torch.randn(1, 3, 32, 32)  # Batch of 1, RGB, 32x32
output = model(input_image)
print(f"\nInput shape: {input_image.shape}")
print(f"Output shape: {output.shape}")
print(f"\n✅ CNN successfully processed image!")

# Output:
# SimpleCNN(
#   (conv1): Conv2d(3, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
#   (conv2): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
#   (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
#   (fc1): Linear(in_features=4096, out_features=128, bias=True)
#   (fc2): Linear(in_features=128, out_features=10, bias=True)
#   (relu): ReLU()
# )
#
# Input shape: torch.Size([1, 3, 32, 32])
# Output shape: torch.Size([1, 10])

Convolution Operation

The core building block of CNNs:

How Convolution Works

1. Slide Filter Over Image

Image (5x5):          Filter (3x3):
1  2  3  4  5         1  0 -1
6  7  8  9  10   ×    2  0 -2
11 12 13 14 15        1  0 -1
16 17 18 19 20
21 22 23 24 25

Position (top-left):
1  2  3              1  0 -1
6  7  8         ×    2  0 -2  = ?
11 12 13             1  0 -1

2. Element-wise Multiplication

1×1 + 2×0 + 3×(-1) +
6×2 + 7×0 + 8×(-2) +
11×1 + 12×0 + 13×(-1)

= 1 + 0 - 3 + 12 + 0 - 16 + 11 + 0 - 13
= -8

3. Repeat for All Positions

Slide filter across image (left-to-right, top-to-bottom). Each position creates one value in output feature map.
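
The same arithmetic can be checked directly in NumPy. This is a minimal sketch using the 3x3 patch and filter from the walkthrough above:

python
import numpy as np

# Top-left 3x3 patch of the example image and the example 3x3 filter
patch = np.array([[ 1,  2,  3],
                  [ 6,  7,  8],
                  [11, 12, 13]])
kernel = np.array([[1, 0, -1],
                   [2, 0, -2],
                   [1, 0, -1]])

# Element-wise multiply, then sum: one value of the output feature map
print(np.sum(patch * kernel))  # -8, matching the hand calculation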

Key Parameters

  • Kernel Size: Filter dimensions (3x3, 5x5, 7x7). Larger = more context, more computation
  • Stride: Step size (1=every pixel, 2=skip pixels). Larger = smaller output
  • Padding: 'valid' (no pad), 'same' (pad to keep size). Prevents shrinking (see the output-size sketch after this list)
  • Filters: Number of different filters (32, 64, 128). Each learns different pattern
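
Kernel size, stride, and padding together determine the spatial size of each feature map: output = (input + 2*padding - kernel) / stride + 1. A minimal sketch (the input size of 32 is assumed for illustration):

python
# Output size of a convolution: out = (in + 2*padding - kernel) // stride + 1
def conv_output_size(in_size, kernel_size, stride=1, padding=0):
    return (in_size + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(32, kernel_size=3, stride=1, padding=1))  # 32 ('same'-style padding keeps size)
print(conv_output_size(32, kernel_size=3, stride=1, padding=0))  # 30 ('valid', no padding, output shrinks)
print(conv_output_size(32, kernel_size=3, stride=2, padding=1))  # 16 (stride 2 halves the output)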

What Filters Detect

  • Layer 1 (Early): Simple features - edges, corners, colors
  • Layer 2-3 (Middle): Shapes, textures, simple patterns
  • Layer 4-5 (Deep): Object parts - eyes, wheels, faces
  • Final Layers: Complete objects - cats, dogs, cars
python
# Visualizing Convolution Operation
import numpy as np

# Example: Edge detection filter
def convolve2d(image, kernel, stride=1, padding=0):
    """
    Simple 2D convolution implementation
    Args:
        image: 2D array (H x W)
        kernel: 2D array (K x K)
        stride: step size
        padding: zero padding
    Returns:
        Feature map
    """
    # Add padding
    if padding > 0:
        image = np.pad(image, padding, mode='constant')
    H, W = image.shape
    K = kernel.shape[0]
    # Calculate output dimensions
    out_h = (H - K) // stride + 1
    out_w = (W - K) // stride + 1
    # Initialize output
    output = np.zeros((out_h, out_w))
    # Slide filter over image
    for i in range(0, H - K + 1, stride):
        for j in range(0, W - K + 1, stride):
            # Extract region
            region = image[i:i+K, j:j+K]
            # Element-wise multiplication and sum (dot product)
            output[i//stride, j//stride] = np.sum(region * kernel)
    return output

# Example image (simple 5x5)
image = np.array([
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0]
])

# Vertical edge detector (Sobel filter)
kernel_vertical = np.array([
    [1, 0, -1],
    [2, 0, -2],
    [1, 0, -1]
])

# Horizontal edge detector
kernel_horizontal = np.array([
    [ 1,  2,  1],
    [ 0,  0,  0],
    [-1, -2, -1]
])

print("Original Image:")
print(image)
print()
print("Vertical Edges (detects left/right transitions):")
vertical_edges = convolve2d(image, kernel_vertical, padding=1)
print(vertical_edges)
print()
print("Horizontal Edges (detects top/bottom transitions):")
horizontal_edges = convolve2d(image, kernel_horizontal, padding=1)
print(horizontal_edges)

# Output shows where edges were detected!
# High values = strong edge detected
# Low values = no edge

# This is what CNNs learn automatically:
# - Early layers learn edge detectors
# - Middle layers combine edges into shapes
# - Deep layers recognize objects

# With PyTorch
import torch
import torch.nn as nn

# Define convolution layer
conv = nn.Conv2d(
    in_channels=1,   # Grayscale input
    out_channels=3,  # 3 different filters
    kernel_size=3,   # 3x3 filter
    stride=1,
    padding=1
)

# Random image
x = torch.randn(1, 1, 28, 28)  # [batch, channels, height, width]

# Apply convolution
output = conv(x)
print(f"\nPyTorch Convolution:")
print(f"Input shape: {x.shape}")        # [1, 1, 28, 28]
print(f"Output shape: {output.shape}")  # [1, 3, 28, 28] (3 feature maps!)
# After training, each of the 3 filters would learn a different pattern

CNN Architecture Layers

Main components that make up a CNN:

Convolutional Layer

Applies filters to detect features. Each filter produces one feature map showing where its pattern was found.

  • Kernel size (3x3, 5x5)
  • Number of filters (32, 64)
  • Stride, padding

Activation Layer (ReLU)

Adds non-linearity. ReLU is the most common: f(x) = max(0, x). Without it, stacked layers would collapse into a single linear transformation (a quick example follows the list below).

  • No parameters
  • Applied element-wise
  • Very fast computation
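
A minimal sketch of what ReLU does to a feature map: negative responses are zeroed out, positive ones pass through unchanged (the values below are made up for illustration):

python
import torch
import torch.nn as nn

relu = nn.ReLU()
feature_map = torch.tensor([[-2.0,  0.5],
                            [ 3.0, -1.0]])
print(relu(feature_map))
# tensor([[0.0000, 0.5000],
#         [3.0000, 0.0000]])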

Pooling Layer

Downsamples feature maps, reduces spatial size. Max pooling takes maximum value in each region.

  • Pool size (2x2)
  • Max or Average
  • Reduces spatial size by 75% (see the sketch below)
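
A minimal sketch (with made-up values) of 2x2 max pooling with stride 2: it keeps only the largest value in each window, shrinking a 4x4 map to 2x2:

python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)
x = torch.tensor([[[[1., 2., 0., 1.],
                    [3., 4., 1., 0.],
                    [5., 0., 2., 2.],
                    [1., 2., 0., 8.]]]])  # shape [1, 1, 4, 4]
print(pool(x))
# tensor([[[[4., 1.],
#           [5., 8.]]]])  -> 16 values reduced to 4 (75% fewer)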

Fully-Connected Layer

Traditional neural network layer at the end. Operates on the flattened feature maps and produces the final classification (the sketch after the list below assembles all four layer types).

  • After flattening
  • High-level reasoning
  • Final classification
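
Putting the four layer types together, here is a minimal sketch (layer sizes are illustrative, mirroring the SimpleCNN above) that traces the tensor shape through Conv → ReLU → Pool → Flatten → FC:

python
import torch
import torch.nn as nn

# Conv -> ReLU -> Pool -> Flatten -> Fully-connected, with shape tracing
layers = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),  # [1, 3, 32, 32] -> [1, 32, 32, 32]
    nn.ReLU(),                                   # shape unchanged
    nn.MaxPool2d(2, 2),                          # [1, 32, 32, 32] -> [1, 32, 16, 16]
    nn.Flatten(),                                # [1, 32, 16, 16] -> [1, 8192]
    nn.Linear(32 * 16 * 16, 10),                 # [1, 8192] -> [1, 10] class scores
)

x = torch.randn(1, 3, 32, 32)
for layer in layers:
    x = layer(x)
    print(f"{layer.__class__.__name__:<10} -> {tuple(x.shape)}")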

Famous CNN Architectures

Landmark models that advanced the field:

LeNet-5 (1998)

Yann LeCun

First successful CNN, digit recognition

Layers: 7 | Params: ~60K

AlexNet (2012)

Krizhevsky et al.

Won ImageNet, started deep learning revolution

Layers: 8 | Params: ~60M

VGG-16 (2014)

Simonyan & Zisserman

Very deep with small 3x3 filters

Layers: 16 | Params: ~138M

ResNet-50 (2015)

He et al.

Residual (skip) connections; the deepest variant (ResNet-152) trained 152 layers successfully

Layers: 50 | Params: ~25M

Inception (2014)

Szegedy et al.

Multi-scale feature extraction, efficient

Layers: 22 | Params: ~7M

EfficientNet (2019)

Tan & Le

Compound scaling of depth, width, and resolution; state-of-the-art efficiency

Layers: Variable | Params: 4M-66M

Key Concepts

Convolution Operation

Slide filter (kernel) over image, computing dot product at each position. Creates feature map showing where pattern was detected. Shares weights across spatial locations, making CNNs translation-invariant.

Pooling (Subsampling)

Reduces spatial dimensions while retaining important information. Max pooling takes maximum value in each region. Provides translation invariance and reduces computation. Typical: 2x2 with stride 2.

Feature Hierarchy

Early layers detect simple features (edges, colors). Middle layers detect shapes and textures. Deep layers detect complex objects (faces, cars). Hierarchical learning is CNN's key strength.

Parameter Sharing

Same filter weights used across entire image. Dramatically reduces parameters vs fully-connected layers. Makes CNNs efficient and able to detect features regardless of position in image.

Interview Tips

  • 💡CNNs use convolution layers to automatically learn spatial hierarchies of features from images, exploiting 2D structure
  • 💡Convolution: slide filter (3x3, 5x5) over image, compute dot product, create feature map. Detects local patterns
  • 💡Key properties: local connectivity (neurons connect to small region), parameter sharing (same filter everywhere), translation invariance
  • 💡Typical architecture: Input → [Conv → ReLU → Pool] × N → Flatten → FC layers → Output. Alternating conv-pool reduces spatial size
  • 💡Pooling (max/average): downsamples feature maps, provides translation invariance, reduces computation. Common: 2x2 max pooling with stride 2
  • 💡Padding: 'valid' (no padding, output smaller), 'same' (pad to keep size). Stride: step size of filter (1=every pixel, 2=skip pixels)
  • 💡Receptive field: region of input that affects a neuron. Grows with depth. Deep CNNs see large context despite small filters (a quick calculation follows this list)
  • 💡Famous architectures: LeNet (1998), AlexNet (2012, ImageNet winner), VGG (2014, very deep), ResNet (2015, skip connections), EfficientNet
  • 💡Why CNNs for images: fewer parameters than FC (weight sharing), translation invariance, captures spatial relationships, hierarchical features
  • 💡Applications: image classification, object detection (YOLO, Faster R-CNN), segmentation, face recognition, medical imaging, self-driving cars
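
To make the receptive-field tip concrete, here is a minimal sketch of the standard recurrence (the receptive field grows by (kernel - 1) times the product of all earlier strides) applied to small stacks of layers:

python
# Receptive field of stacked layers; each entry is (kernel_size, stride)
def receptive_field(layers):
    rf, jump = 1, 1  # jump = product of strides seen so far
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

print(receptive_field([(3, 1)]))                  # 3: one 3x3 conv sees a 3x3 region
print(receptive_field([(3, 1), (3, 1)]))          # 5: two 3x3 convs see 5x5
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7: three 3x3 convs match one 7x7 filter
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8: conv, 2x2 pool (stride 2), conv

This is why stacks of small filters (as in VGG) can cover the same context as one large filter while using fewer parameters and more non-linearities.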