Convolutional Neural Networks (CNNs)
Understanding CNNs for computer vision and image recognition
Imagine you're trying to recognize a cat in a photo! You don't need to see every single pixel - you look for patterns: pointy ears, whiskers, furry texture. CNNs work the same way! They're specialized neural networks that automatically learn visual patterns. The 'convolution' part means they scan images with small filters (like 3x3 grids) to detect features: edges in early layers, shapes in middle layers, and complete objects like 'cat face' in final layers. Think of it as looking through a magnifying glass that slides across the image, learning 'if I see THIS pattern here, it might be a cat!' This layer-by-layer learning of increasingly complex features is why CNNs revolutionized computer vision!
What are Convolutional Neural Networks?
CNNs are specialized neural networks designed for processing grid-like data, especially images. Introduced by Yann LeCun (LeNet, 1998) and popularized by AlexNet (2012), they use convolution operations to automatically learn spatial hierarchies of features. Unlike fully-connected networks that treat all pixels equally, CNNs exploit the 2D structure of images through local connectivity, weight sharing, and pooling. This makes them highly efficient and effective for tasks like image classification, object detection, and segmentation.
❌ Fully-Connected Network
- Every pixel connected to every neuron
- Millions of parameters for images
- Ignores spatial structure
- Prone to overfitting
- Not translation-invariant
✅ Convolutional Network (CNN)
- Local connectivity (small filters)
- Weight sharing → far fewer parameters
- Exploits 2D spatial structure
- Better generalization
- Translation-invariant features
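To make the parameter gap concrete, here is a minimal sketch in PyTorch; the 32x32 RGB input, the 128-unit layer, and the 32-filter convolution are illustrative choices, not taken from any particular model:

```python
import torch.nn as nn

def n_params(module):
    # Total number of trainable parameters (weights + biases)
    return sum(p.numel() for p in module.parameters())

# Fully-connected: all 32*32*3 = 3,072 inputs connect to each of 128 units
fc = nn.Linear(32 * 32 * 3, 128)

# Convolutional: 32 filters of shape 3x3x3, shared across every position
conv = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3)

print(f"Fully-connected parameters: {n_params(fc):,}")   # 393,344
print(f"Convolutional parameters:   {n_params(conv):,}") # 896
```

The full model below stacks such convolutions with pooling and a small fully-connected head.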
```python
# Simple CNN with PyTorch
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        # Convolutional layers
        self.conv1 = nn.Conv2d(
            in_channels=3,    # RGB input (3 color channels)
            out_channels=32,  # 32 different filters
            kernel_size=3,    # 3x3 filter
            stride=1,
            padding=1
        )
        self.conv2 = nn.Conv2d(32, 64, 3, 1, 1)

        # Pooling layer
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)  # 2x2 max pooling

        # Fully connected layers
        self.fc1 = nn.Linear(64 * 8 * 8, 128)  # After 2 poolings: 32x32 → 16x16 → 8x8
        self.fc2 = nn.Linear(128, 10)          # 10 classes (e.g., CIFAR-10)

        self.relu = nn.ReLU()

    def forward(self, x):
        # Input: [batch, 3, 32, 32] (3-channel 32x32 images)

        # Conv block 1
        x = self.relu(self.conv1(x))  # → [batch, 32, 32, 32]
        x = self.pool(x)              # → [batch, 32, 16, 16]

        # Conv block 2
        x = self.relu(self.conv2(x))  # → [batch, 64, 16, 16]
        x = self.pool(x)              # → [batch, 64, 8, 8]

        # Flatten for fully-connected layers
        x = x.view(x.size(0), -1)     # → [batch, 64*8*8 = 4096]

        # Fully connected layers
        x = self.relu(self.fc1(x))    # → [batch, 128]
        x = self.fc2(x)               # → [batch, 10] (logits)

        return x

# Create model
model = SimpleCNN()
print(model)

# Example input
input_image = torch.randn(1, 3, 32, 32)  # Batch of 1, RGB, 32x32
output = model(input_image)
print(f"\nInput shape: {input_image.shape}")
print(f"Output shape: {output.shape}")
print(f"\n✅ CNN successfully processed image!")

# Output:
# SimpleCNN(
#   (conv1): Conv2d(3, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
#   (conv2): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
#   (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
#   (fc1): Linear(in_features=4096, out_features=128, bias=True)
#   (fc2): Linear(in_features=128, out_features=10, bias=True)
#   (relu): ReLU()
# )
#
# Input shape: torch.Size([1, 3, 32, 32])
# Output shape: torch.Size([1, 10])
```

Convolution Operation
The core building block of CNNs:
How Convolution Works
1. Slide Filter Over Image
```
Image (5x5):            Filter (3x3):
 1  2  3  4  5           1  0 -1
 6  7  8  9 10      ×    2  0 -2
11 12 13 14 15           1  0 -1
16 17 18 19 20
21 22 23 24 25

Position (top-left):
 1  2  3        1  0 -1
 6  7  8   ×    2  0 -2   =   ?
11 12 13        1  0 -1
```
2. Element-wise Multiplication
```
  1×1  + 2×0  + 3×(-1)
+ 6×2  + 7×0  + 8×(-2)
+ 11×1 + 12×0 + 13×(-1)
= 1 + 0 - 3 + 12 + 0 - 16 + 11 + 0 - 13
= -8
```
3. Repeat for All Positions
Slide filter across image (left-to-right, top-to-bottom). Each position creates one value in output feature map.
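Step 2 can be checked in a few lines of NumPy, using the same numbers as the worked example above:

```python
import numpy as np

# Top-left 3x3 region of the 5x5 image, and the filter from above
region = np.array([[ 1,  2,  3],
                   [ 6,  7,  8],
                   [11, 12, 13]])
kernel = np.array([[1, 0, -1],
                   [2, 0, -2],
                   [1, 0, -1]])

# Element-wise multiply, then sum: one value of the output feature map
print(np.sum(region * kernel))  # -8
```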
Key Parameters
- Kernel Size: Filter dimensions (3x3, 5x5, 7x7). Larger = more context, more computation
- Stride: Step size (1 = every pixel, 2 = skip every other pixel). Larger = smaller output (output-size formula sketched after this list)
- Padding: 'valid' (no pad), 'same' (pad to keep size). Prevents shrinking
- Filters: Number of different filters (32, 64, 128). Each learns different pattern
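Together these parameters fix the output size: out = (in + 2*padding - kernel) / stride + 1, rounded down. A quick sketch checking the formula against PyTorch; the kernel/stride/padding combinations are arbitrary examples:

```python
import torch
import torch.nn as nn

def conv_out_size(in_size, kernel, stride, padding):
    # out = floor((in + 2*padding - kernel) / stride) + 1
    return (in_size + 2 * padding - kernel) // stride + 1

x = torch.randn(1, 3, 32, 32)  # one 32x32 RGB image

for k, s, p in [(3, 1, 1), (3, 1, 0), (5, 2, 2)]:
    out = nn.Conv2d(3, 16, kernel_size=k, stride=s, padding=p)(x)
    print(f"kernel={k}, stride={s}, padding={p}: "
          f"formula={conv_out_size(32, k, s, p)}, actual={out.shape[-1]}")

# kernel=3, stride=1, padding=1: formula=32, actual=32  ('same')
# kernel=3, stride=1, padding=0: formula=30, actual=30  ('valid')
# kernel=5, stride=2, padding=2: formula=16, actual=16  (downsampled)
```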
What Filters Detect
- Layer 1 (Early): Simple features - edges, corners, colors
- Layer 2-3 (Middle): Shapes, textures, simple patterns
- Layer 4-5 (Deep): Object parts - eyes, wheels, faces
- Final Layers: Complete objects - cats, dogs, cars
```python
# Visualizing Convolution Operation
import numpy as np

# Example: Edge detection filter
def convolve2d(image, kernel, stride=1, padding=0):
    """
    Simple 2D convolution implementation

    Args:
        image: 2D array (H x W)
        kernel: 2D array (K x K)
        stride: step size
        padding: zero padding

    Returns:
        Feature map
    """
    # Add padding
    if padding > 0:
        image = np.pad(image, padding, mode='constant')

    H, W = image.shape
    K = kernel.shape[0]

    # Calculate output dimensions
    out_h = (H - K) // stride + 1
    out_w = (W - K) // stride + 1

    # Initialize output
    output = np.zeros((out_h, out_w))

    # Slide filter over image
    for i in range(0, H - K + 1, stride):
        for j in range(0, W - K + 1, stride):
            # Extract region
            region = image[i:i+K, j:j+K]
            # Element-wise multiplication and sum (dot product)
            output[i//stride, j//stride] = np.sum(region * kernel)

    return output

# Example image (simple 5x5)
image = np.array([
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0]
])

# Vertical edge detector (Sobel filter)
kernel_vertical = np.array([
    [1, 0, -1],
    [2, 0, -2],
    [1, 0, -1]
])

# Horizontal edge detector
kernel_horizontal = np.array([
    [ 1,  2,  1],
    [ 0,  0,  0],
    [-1, -2, -1]
])

print("Original Image:")
print(image)
print()

print("Vertical Edges (detects left/right transitions):")
vertical_edges = convolve2d(image, kernel_vertical, padding=1)
print(vertical_edges)
print()

print("Horizontal Edges (detects top/bottom transitions):")
horizontal_edges = convolve2d(image, kernel_horizontal, padding=1)
print(horizontal_edges)

# Output shows where edges were detected!
# High values = strong edge detected
# Low values = no edge

# This is what CNNs learn automatically:
# - Early layers learn edge detectors
# - Middle layers combine edges into shapes
# - Deep layers recognize objects

# With PyTorch
import torch
import torch.nn as nn

# Define convolution layer
conv = nn.Conv2d(
    in_channels=1,    # Grayscale input
    out_channels=3,   # 3 different filters
    kernel_size=3,    # 3x3 filter
    stride=1,
    padding=1
)

# Random image
x = torch.randn(1, 1, 28, 28)  # [batch, channels, height, width]

# Apply convolution
output = conv(x)

print(f"\nPyTorch Convolution:")
print(f"Input shape: {x.shape}")        # [1, 1, 28, 28]
print(f"Output shape: {output.shape}")  # [1, 3, 28, 28] (3 feature maps!)

# After training, each of the 3 filters would detect a different pattern!
```

CNN Architecture Layers
Main components that make up a CNN:
Convolutional Layer
Applies filters to detect features. Each filter creates one feature map showing where its pattern was found.
- Kernel size (3x3, 5x5)
- Number of filters (32, 64)
- Stride, padding
Activation Layer (ReLU)
Adds non-linearity. ReLU is the most common: f(x) = max(0, x). Without it, the CNN would just be a linear transformation (quick demo after this list).
- No parameters
- Applied element-wise
- Very fast computation
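A quick illustration of the element-wise behavior (the input values are arbitrary):

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])
print(torch.relu(x))  # negatives clipped to 0: tensor([0.0000, 0.0000, 0.0000, 1.5000, 3.0000])
```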
Pooling Layer
Downsamples feature maps, reducing spatial size. Max pooling takes the maximum value in each region (see the sketch after this list).
- Pool size (2x2)
- Max or Average
- Reduces spatial elements by 75% (2x2, stride 2 halves each dimension)
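Here is a minimal sketch of 2x2 max pooling with stride 2; the 4x4 feature map values are made up:

```python
import torch
import torch.nn as nn

# One 4x4 feature map: [batch=1, channels=1, height=4, width=4]
fmap = torch.tensor([[[[1., 3., 2., 4.],
                       [5., 6., 1., 2.],
                       [7., 2., 9., 1.],
                       [3., 4., 5., 6.]]]])

pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(fmap))
# tensor([[[[6., 4.],
#           [7., 9.]]]])
# Each output value is the max of one 2x2 region:
# 16 values become 4, a 75% reduction.
```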
Fully-Connected Layer
Traditional neural network layer at the end. Takes the flattened feature maps and classifies based on the learned features.
- After flattening
- High-level reasoning
- Final classification
Famous CNN Architectures
Landmark models that advanced the field:
LeNet-5 (1998)
Yann LeCun
First successful CNN, digit recognition
AlexNet (2012)
Krizhevsky et al.
Won ImageNet, started deep learning revolution
VGG-16 (2014)
Simonyan & Zisserman
Very deep with small 3x3 filters
ResNet-50 (2015)
He et al.
Skip (residual) connections made very deep networks trainable, up to 152 layers in ResNet-152
Inception (2014)
Szegedy et al.
Multi-scale feature extraction, efficient
EfficientNet (2019)
Tan & Le
Optimal scaling, state-of-the-art efficiency
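Most of these architectures ship pretrained in torchvision. As a sketch, assuming a recent torchvision install (the weights-enum API below exists from torchvision 0.13 onward), ResNet-50 loads in two lines:

```python
import torch
from torchvision import models

# Load ResNet-50 with ImageNet-pretrained weights
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.eval()

# Classify a dummy 224x224 RGB image
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    logits = model(x)
print(logits.shape)  # torch.Size([1, 1000]) -> 1000 ImageNet classes
```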
Key Concepts
Convolution Operation
Slide a filter (kernel) over the image, computing a dot product at each position. This creates a feature map showing where the pattern was detected. Weight sharing across spatial locations makes feature detection translation-equivariant: the same pattern is found wherever it appears, and pooling adds a degree of invariance.
Pooling (Subsampling)
Reduces spatial dimensions while retaining important information. Max pooling takes maximum value in each region. Provides translation invariance and reduces computation. Typical: 2x2 with stride 2.
Feature Hierarchy
Early layers detect simple features (edges, colors). Middle layers detect shapes and textures. Deep layers detect complex objects (faces, cars). Hierarchical learning is CNN's key strength.
Parameter Sharing
Same filter weights used across entire image. Dramatically reduces parameters vs fully-connected layers. Makes CNNs efficient and able to detect features regardless of position in image.
Interview Tips
- 💡CNNs use convolution layers to automatically learn spatial hierarchies of features from images, exploiting 2D structure
- 💡Convolution: slide filter (3x3, 5x5) over image, compute dot product, create feature map. Detects local patterns
- 💡Key properties: local connectivity (neurons connect to small region), parameter sharing (same filter everywhere), translation invariance
- 💡Typical architecture: Input → [Conv → ReLU → Pool] × N → Flatten → FC layers → Output. Alternating conv-pool reduces spatial size
- 💡Pooling (max/average): downsamples feature maps, provides translation invariance, reduces computation. Common: 2x2 max pooling with stride 2
- 💡Padding: 'valid' (no padding, output smaller), 'same' (pad to keep size). Stride: step size of filter (1=every pixel, 2=skip pixels)
- 💡Receptive field: region of input that affects a neuron. Grows with depth. Deep CNNs see large context despite small filters (see the sketch after this list)
- 💡Famous architectures: LeNet (1998), AlexNet (2012, ImageNet winner), VGG (2014, very deep), ResNet (2015, skip connections), EfficientNet
- 💡Why CNNs for images: fewer parameters than FC (weight sharing), translation invariance, captures spatial relationships, hierarchical features
- 💡Applications: image classification, object detection (YOLO, Faster R-CNN), segmentation, face recognition, medical imaging, self-driving cars
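To make the receptive-field tip above concrete, here is a small sketch using the standard recurrence r = r + (k - 1) * j, where the jump j is the product of all earlier strides; the five-layer stack is only an illustrative example:

```python
# Receptive field of a stack of conv/pool layers:
# r grows by (kernel - 1) * jump at each layer; jump multiplies by stride.
layers = [
    ("conv 3x3 /1", 3, 1),
    ("pool 2x2 /2", 2, 2),
    ("conv 3x3 /1", 3, 1),
    ("pool 2x2 /2", 2, 2),
    ("conv 3x3 /1", 3, 1),
]

r, jump = 1, 1  # a single input pixel sees only itself
for name, kernel, stride in layers:
    r = r + (kernel - 1) * jump
    jump *= stride
    print(f"{name}: receptive field = {r}x{r}")

# conv 3x3 /1: receptive field = 3x3
# pool 2x2 /2: receptive field = 4x4
# conv 3x3 /1: receptive field = 8x8
# pool 2x2 /2: receptive field = 10x10
# conv 3x3 /1: receptive field = 18x18
```

Even though every filter here is at most 3x3, the final neurons see an 18x18 patch of the input, which is why depth buys context.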