
Design Web Crawler

Explain Like I'm 5

Imagine you're in a huge library with millions of books connected by strings. Each book has words pointing to other books: "Check out Book 123 for more!" A web crawler is like a robot librarian that:

  1. Starts with one book (a website)
  2. Reads it and writes down important information
  3. Follows all the strings (links) to other books
  4. Remembers which books it already read so it doesn't read them twice
  5. Keeps doing this for MILLIONS of books!

This is how Google knows what's on every website - it sends millions of these robot librarians to read the entire internet!

Key Features

  • Download web pages and extract content
  • Follow links to discover new pages (BFS/DFS traversal)
  • Handle billions of URLs efficiently
  • Be polite - respect robots.txt and rate limits
  • Handle failures and duplicate URLs
  • Prioritize important pages (PageRank, freshness)
  • Distributed crawling across multiple machines
  • Store and index extracted content

Requirements

Functional Requirements:

  • Crawl given seed URLs and discover new pages
  • Extract content, metadata, and outgoing links
  • Store crawled data for indexing/search
  • Respect robots.txt and crawl-delay
  • Detect and avoid duplicate URLs
  • Support prioritization (important pages first)

Non-Functional Requirements:

  • Scalability: Crawl billions of pages
  • Performance: 1000s of pages per second
  • Politeness: Don't overwhelm servers
  • Reliability: Handle failures gracefully
  • Freshness: Re-crawl pages to detect updates

Back-of-Envelope Estimation

  • Pages to crawl: 15 billion pages
  • Crawl frequency: Once per month (30 days)
  • Pages per second: 15B / (30 * 24 * 3600) ≈ 5,800 pages/sec
  • Average page size: 500 KB
  • Storage per month: 15B * 500 KB = 7.5 PB
  • Network bandwidth: 5,800 * 500 KB = 2.9 GB/s
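
A quick sanity check of these numbers as a runnable sketch (values taken directly from the estimates above):

public class CrawlEstimates {
    public static void main(String[] args) {
        double pages = 15e9;               // 15 billion pages
        double windowSec = 30 * 24 * 3600; // one month, in seconds
        double pageSizeKB = 500;           // average page size

        double pagesPerSec = pages / windowSec;                // ≈ 5,800
        double bandwidthGBps = pagesPerSec * pageSizeKB / 1e6; // KB/s -> GB/s ≈ 2.9
        double storagePB = pages * pageSizeKB / 1e12;          // KB -> PB ≈ 7.5

        System.out.printf("pages/sec: %.0f, bandwidth: %.1f GB/s, storage: %.1f PB%n",
            pagesPerSec, bandwidthGBps, storagePB);
    }
}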

System Architecture


┌─────────────────────┐
│   Seed URLs         │
│   (Start points)    │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐         ┌──────────────────┐
│   URL Frontier      │◄────────│  URL Filter      │
│  (Priority Queue)   │         │ (Dedup, robots)  │
└──────────┬──────────┘         └──────────────────┘
           │
           │  Get URLs to crawl
           │
┌──────────▼──────────┐
│  Downloader Pool    │
│  (Fetch HTML)       │
└──────────┬──────────┘
           │
           │  HTML Content
           │
┌──────────▼──────────┐
│   Content Parser    │───────► Parse HTML, Extract:
│   (Extract Data)    │         • Text Content
└──────────┬──────────┘         • Metadata
           │                    • Outgoing Links
           │
           ├─────────────────┐
           │                 │
           ▼                 ▼
┌──────────────────┐  ┌────────────────┐
│  Content Store   │  │  Link Extractor│
│  (Blob Storage)  │  └────────┬───────┘
└──────────────────┘           │
                              │ New URLs
                              │
                              ▼
                    ┌──────────────────┐
                    │   URL Filter     │
                    │  (Back to queue) │
                    └──────────────────┘

Key Components

1. URL Frontier (Priority Queue)

Manages URLs to crawl. Priorities: PageRank, freshness, depth.

2. Downloader Pool

Worker threads that fetch HTML from web servers.

3. DNS Resolver

Resolves domain names to IP addresses (cached).
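
A minimal sketch of such a cache built on InetAddress, assuming a fixed 24-hour TTL (class and field names are illustrative):

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class CachedDNSResolver {
    private static class Entry {
        final String ip;
        final long expiresAt;
        Entry(String ip, long expiresAt) { this.ip = ip; this.expiresAt = expiresAt; }
    }

    private final ConcurrentMap<String, Entry> cache = new ConcurrentHashMap<>();
    private final long ttlMs = 24 * 3600 * 1000L; // cache entries for 24 hours

    public String resolve(String host) throws UnknownHostException {
        Entry cached = cache.get(host);
        long now = System.currentTimeMillis();
        if (cached != null && cached.expiresAt > now) {
            return cached.ip; // cache hit: no network round trip
        }
        // Cache miss or expired entry: do the real lookup and remember it
        String ip = InetAddress.getByName(host).getHostAddress();
        cache.put(host, new Entry(ip, now + ttlMs));
        return ip;
    }
}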

4. Content Parser

Parses HTML, extracts text, metadata, and links.

5. Duplicate Detector

Uses URL fingerprinting (hash) to avoid re-crawling.
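
The URL Frontier below keeps full normalized URLs in a set; a fingerprint version stores only 8 bytes per URL. A minimal sketch (class name is illustrative):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class URLFingerprinter {
    private final Set<Long> seen = ConcurrentHashMap.newKeySet();

    /** Returns true the first time a URL is seen, false for duplicates. */
    public boolean markIfNew(String normalizedURL) {
        return seen.add(fingerprint(normalizedURL));
    }

    /** Packs the first 8 bytes of SHA-256 into a long: 8 bytes per URL instead of the full string. */
    static long fingerprint(String url) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(url.getBytes(StandardCharsets.UTF_8));
            long fp = 0;
            for (int i = 0; i < 8; i++) {
                fp = (fp << 8) | (digest[i] & 0xFF);
            }
            return fp;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}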

6. Content Store

Stores raw HTML and extracted content (S3, HDFS).

7. URL Filter

Checks robots.txt, blacklists, and deduplication.

Deep Dive: Implementation

1. URL Frontier - Prioritized Queue

Manages which URLs to crawl next. Uses priority queue with multiple factors.

URLFrontier.java
import java.net.URL;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.PriorityBlockingQueue;

public class URLFrontier {
    // Prioritized queue of URLs to crawl
    private final PriorityBlockingQueue<URLTask> urlQueue;
    private final Set<String> seenURLs;            // Deduplication
    private final Map<String, Long> hostLastCrawl; // Politeness tracking
    private final int minCrawlDelayMs = 1000;      // 1 second between requests to a host

    public URLFrontier() {
        // Higher priority value = crawl first
        this.urlQueue = new PriorityBlockingQueue<>(
            10000,
            (a, b) -> Double.compare(b.priority, a.priority)
        );
        this.seenURLs = ConcurrentHashMap.newKeySet();
        this.hostLastCrawl = new ConcurrentHashMap<>();
    }

    /**
     * Adds a URL to the frontier with a calculated priority.
     */
    public void addURL(String url, int depth, double pageRank) {
        // 1. Normalize URL (lowercase, remove fragment)
        String normalized = normalizeURL(url);

        // 2. Skip if already seen (check-then-add is not atomic here;
        //    an occasional duplicate enqueue is acceptable)
        if (seenURLs.contains(normalized)) {
            return;
        }

        // 3. Calculate priority (higher = crawl sooner)
        double priority = calculatePriority(depth, pageRank);

        // 4. Add to queue and mark as seen
        URLTask task = new URLTask(normalized, depth, priority);
        urlQueue.offer(task);
        seenURLs.add(normalized);
    }

    /**
     * Gets the next URL to crawl, respecting per-host politeness.
     */
    public URLTask getNextURL() throws InterruptedException {
        while (true) {
            URLTask task = urlQueue.take(); // Blocks until a task is available
            String host = extractHost(task.url);

            // Politeness: wait if this host was crawled too recently
            Long lastCrawl = hostLastCrawl.get(host);
            if (lastCrawl != null) {
                long elapsed = System.currentTimeMillis() - lastCrawl;
                if (elapsed < minCrawlDelayMs) {
                    // Put the task back and wait out the remaining delay
                    urlQueue.offer(task);
                    Thread.sleep(minCrawlDelayMs - elapsed);
                    continue;
                }
            }

            // Record the crawl time for this host
            hostLastCrawl.put(host, System.currentTimeMillis());
            return task;
        }
    }

    /**
     * Calculates priority from multiple factors. Higher priority = crawl sooner.
     */
    private double calculatePriority(int depth, double pageRank) {
        // Factors:
        // 1. Depth: prefer shallow pages (closer to the seed)
        // 2. PageRank: prefer important pages
        // 3. Freshness: a timestamp component could be added here
        double depthScore = 100.0 / (depth + 1); // depth 0 = 100, depth 9 = 10
        double pageRankScore = pageRank * 50;    // PageRank 0-1 -> 0-50
        return depthScore + pageRankScore;
    }

    /**
     * Normalizes a URL for consistent deduplication.
     */
    private String normalizeURL(String url) {
        try {
            URL parsed = new URL(url.toLowerCase());
            // Drop the fragment (#section)
            return parsed.getProtocol() + "://" +
                   parsed.getHost() +
                   parsed.getPath() +
                   (parsed.getQuery() != null ? "?" + parsed.getQuery() : "");
        } catch (Exception e) {
            return url.toLowerCase();
        }
    }

    private String extractHost(String url) {
        try {
            return new URL(url).getHost();
        } catch (Exception e) {
            return "";
        }
    }

    /**
     * A URL waiting to be crawled.
     */
    static class URLTask {
        String url;
        int depth;
        double priority;

        URLTask(String url, int depth, double priority) {
            this.url = url;
            this.depth = depth;
            this.priority = priority;
        }
    }
}

2. Web Crawler - Main Engine

Coordinates downloading, parsing, and link extraction.

WebCrawler.java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WebCrawler {
    private final URLFrontier frontier;
    private final ContentStore contentStore;
    private final RobotsCache robotsCache;
    private final ExecutorService downloaderPool;
    private final int maxDepth = 5;
    private final int numThreads = 100;

    public WebCrawler(URLFrontier frontier, ContentStore store) {
        this.frontier = frontier;
        this.contentStore = store;
        this.robotsCache = new RobotsCache();
        this.downloaderPool = Executors.newFixedThreadPool(numThreads);
    }

    /**
     * Starts crawling from the seed URLs.
     */
    public void crawl(List<String> seedURLs) {
        // 1. Add seed URLs to the frontier
        for (String url : seedURLs) {
            frontier.addURL(url, 0, 1.0); // depth=0, highest PageRank
        }
        // 2. Start crawler threads
        for (int i = 0; i < numThreads; i++) {
            downloaderPool.submit(this::crawlerWorker);
        }
    }

    /**
     * Worker loop that continuously crawls URLs from the frontier.
     */
    private void crawlerWorker() {
        while (true) {
            try {
                // 1. Get next URL from the frontier
                URLFrontier.URLTask task = frontier.getNextURL();

                // 2. Enforce depth limit
                if (task.depth > maxDepth) {
                    continue;
                }

                // 3. Check robots.txt
                if (!robotsCache.isAllowed(task.url)) {
                    System.out.println("Blocked by robots.txt: " + task.url);
                    continue;
                }

                // 4. Download the page
                PageContent content = downloadPage(task.url);
                if (content == null) {
                    continue; // Download failed
                }

                // 5. Store the content
                contentStore.save(task.url, content);

                // 6. Extract outgoing links and queue them
                List<String> links = extractLinks(content.html, task.url);
                for (String link : links) {
                    // Increase depth and pass on a damped PageRank estimate
                    frontier.addURL(link, task.depth + 1, content.pageRank * 0.85);
                }

                System.out.println("Crawled: " + task.url +
                    " (depth=" + task.depth +
                    ", links=" + links.size() + ")");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            } catch (Exception e) {
                System.err.println("Error in crawler: " + e.getMessage());
            }
        }
    }

    /**
     * Downloads HTML content from a URL.
     */
    private PageContent downloadPage(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            conn.setRequestProperty("User-Agent", "MyCrawler/1.0");

            int responseCode = conn.getResponseCode();
            if (responseCode != 200) {
                return null; // Non-OK response
            }

            // Read the response body
            StringBuilder html = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    html.append(line).append("\n");
                }
            }
            return new PageContent(url, html.toString());
        } catch (Exception e) {
            System.err.println("Failed to download " + url + ": " + e.getMessage());
            return null;
        }
    }

    /**
     * Extracts all links from HTML content.
     */
    private List<String> extractLinks(String html, String baseURL) {
        List<String> links = new ArrayList<>();
        // Simple regex to extract href attributes.
        // Production: use a proper HTML parser (e.g. Jsoup).
        Pattern pattern = Pattern.compile("href=[\"']([^\"']+)[\"']");
        Matcher matcher = pattern.matcher(html);
        while (matcher.find()) {
            String link = matcher.group(1);
            // Convert relative URLs to absolute
            String absoluteURL = resolveURL(baseURL, link);
            if (absoluteURL != null && isValidURL(absoluteURL)) {
                links.add(absoluteURL);
            }
        }
        return links;
    }

    /**
     * Resolves a relative URL against its base URL.
     */
    private String resolveURL(String baseURL, String relativeURL) {
        try {
            URL base = new URL(baseURL);
            URL resolved = new URL(base, relativeURL);
            return resolved.toString();
        } catch (Exception e) {
            return null;
        }
    }

    /**
     * Checks that the URL uses HTTP or HTTPS.
     */
    private boolean isValidURL(String url) {
        return url.startsWith("http://") || url.startsWith("https://");
    }

    /**
     * Shuts down the crawler gracefully.
     */
    public void shutdown() {
        downloaderPool.shutdown();
        try {
            if (!downloaderPool.awaitTermination(60, TimeUnit.SECONDS)) {
                downloaderPool.shutdownNow();
            }
        } catch (InterruptedException e) {
            downloaderPool.shutdownNow();
        }
    }

    static class PageContent {
        String url;
        String html;
        double pageRank = 0.5; // Default estimate

        PageContent(String url, String html) {
            this.url = url;
            this.html = html;
        }
    }
}
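
A short usage sketch wiring these pieces together (the seed URLs and storage path are placeholders):

import java.util.List;

public class CrawlerMain {
    public static void main(String[] args) throws Exception {
        URLFrontier frontier = new URLFrontier();
        ContentStore store = new ContentStore("/data/crawl"); // placeholder path
        WebCrawler crawler = new WebCrawler(frontier, store);

        crawler.crawl(List.of(
            "https://example.com",   // placeholder seeds
            "https://example.org"
        ));

        // Let the workers run for a while, then shut down gracefully
        Thread.sleep(60_000);
        crawler.shutdown();
    }
}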

3. Robots.txt Handler

Respects website crawling rules to be a polite crawler.

RobotsCache.java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RobotsCache {
    // Cached robots.txt rules per domain
    private final Map<String, RobotRules> cache;
    private final int cacheTTL = 86400; // 24 hours (expiry not enforced in this simplified version)

    public RobotsCache() {
        this.cache = new ConcurrentHashMap<>();
    }

    /**
     * Checks whether a URL is allowed by the domain's robots.txt.
     */
    public boolean isAllowed(String url) {
        try {
            URL parsed = new URL(url);
            String domain = parsed.getHost();

            // Get cached rules, fetching robots.txt on first use
            RobotRules rules = cache.computeIfAbsent(domain, this::fetchRobotsTxt);

            // Check whether the URL path is allowed
            String path = parsed.getPath();
            return rules.isAllowed(path);
        } catch (Exception e) {
            // If we can't parse or fetch, allow by default
            return true;
        }
    }

    /**
     * Fetches and parses robots.txt for a domain.
     */
    private RobotRules fetchRobotsTxt(String domain) {
        try {
            String robotsURL = "https://" + domain + "/robots.txt";
            HttpURLConnection conn = (HttpURLConnection) new URL(robotsURL).openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);

            if (conn.getResponseCode() != 200) {
                return new RobotRules(); // No robots.txt = allow all
            }

            // Read robots.txt
            StringBuilder content = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    content.append(line).append("\n");
                }
            }
            return parseRobotsTxt(content.toString());
        } catch (Exception e) {
            return new RobotRules(); // Allow all on error
        }
    }

    /**
     * Parses robots.txt content for our user agent (or "*").
     */
    private RobotRules parseRobotsTxt(String content) {
        RobotRules rules = new RobotRules();
        boolean inOurSection = false;

        for (String line : content.split("\n")) {
            line = line.trim().toLowerCase();
            if (line.startsWith("user-agent:")) {
                String agent = line.substring(11).trim();
                inOurSection = agent.equals("*") || agent.equals("mycrawler");
            } else if (inOurSection) {
                if (line.startsWith("disallow:")) {
                    String path = line.substring(9).trim();
                    if (!path.isEmpty()) {
                        rules.disallowedPaths.add(path);
                    }
                } else if (line.startsWith("allow:")) {
                    String path = line.substring(6).trim();
                    if (!path.isEmpty()) {
                        rules.allowedPaths.add(path);
                    }
                } else if (line.startsWith("crawl-delay:")) {
                    try {
                        rules.crawlDelay = Integer.parseInt(line.substring(12).trim());
                    } catch (NumberFormatException ignored) {
                        // Malformed value: keep the default
                    }
                }
            }
        }
        return rules;
    }

    /**
     * Rules extracted from robots.txt.
     */
    static class RobotRules {
        List<String> disallowedPaths = new ArrayList<>();
        List<String> allowedPaths = new ArrayList<>();
        int crawlDelay = 0; // seconds

        /**
         * Checks whether a path is allowed.
         */
        boolean isAllowed(String path) {
            // Explicit allow rules win
            for (String allowed : allowedPaths) {
                if (path.startsWith(allowed)) {
                    return true;
                }
            }
            // Then disallow rules
            for (String disallowed : disallowedPaths) {
                if (path.startsWith(disallowed)) {
                    return false;
                }
            }
            return true; // Allowed by default
        }
    }
}

4. Content Store

Stores crawled pages for later indexing and search.

ContentStore.java
import java.io.FileWriter;
import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContentStore {
    private final String storageBasePath;
    private final MessageDigest md5;

    public ContentStore(String basePath) {
        this.storageBasePath = basePath;
        try {
            this.md5 = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }

    /**
     * Saves page content to storage.
     */
    public void save(String url, WebCrawler.PageContent content) {
        try {
            // 1. Generate a unique filename from the URL hash
            String hash = hashURL(url);
            String filename = hash + ".html";
            String filepath = storageBasePath + "/" + filename;

            // 2. Extract metadata
            PageMetadata metadata = extractMetadata(content);

            // 3. Write to disk (or S3, HDFS, etc.)
            writeToFile(filepath, content.html);
            writeToFile(filepath + ".meta", serializeMetadata(metadata));

            System.out.println("Stored: " + url + " -> " + filename);
        } catch (Exception e) {
            System.err.println("Failed to store " + url + ": " + e.getMessage());
        }
    }

    /**
     * Generates an MD5 hash of the URL for a unique filename.
     * Synchronized because MessageDigest is not thread-safe and many
     * crawler threads call save() concurrently.
     */
    private synchronized String hashURL(String url) {
        byte[] hash = md5.digest(url.getBytes());
        StringBuilder hex = new StringBuilder();
        for (byte b : hash) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /**
     * Extracts metadata from page content.
     */
    private PageMetadata extractMetadata(WebCrawler.PageContent content) {
        PageMetadata meta = new PageMetadata();
        meta.url = content.url;
        meta.crawlTime = System.currentTimeMillis();
        meta.contentLength = content.html.length();

        // Extract title
        Pattern titlePattern = Pattern.compile("<title>(.+?)</title>", Pattern.CASE_INSENSITIVE);
        Matcher titleMatcher = titlePattern.matcher(content.html);
        if (titleMatcher.find()) {
            meta.title = titleMatcher.group(1);
        }

        // Extract meta description
        Pattern descPattern = Pattern.compile(
            "<meta\\s+name=\"description\"\\s+content=\"(.+?)\"",
            Pattern.CASE_INSENSITIVE
        );
        Matcher descMatcher = descPattern.matcher(content.html);
        if (descMatcher.find()) {
            meta.description = descMatcher.group(1);
        }
        return meta;
    }

    /**
     * Writes content to a file.
     */
    private void writeToFile(String filepath, String content) throws IOException {
        try (FileWriter writer = new FileWriter(filepath)) {
            writer.write(content);
        }
    }

    /**
     * Serializes metadata to JSON (production: use a JSON library to handle escaping).
     */
    private String serializeMetadata(PageMetadata meta) {
        return String.format(
            "{\"url\":\"%s\",\"title\":\"%s\",\"crawlTime\":%d,\"contentLength\":%d}",
            meta.url, meta.title, meta.crawlTime, meta.contentLength
        );
    }

    static class PageMetadata {
        String url;
        String title;
        String description;
        long crawlTime;
        int contentLength;
    }
}

Trade-offs and Optimizations

BFS vs DFS Crawling:

BFS finds important pages sooner (shallow pages first); DFS keeps a smaller frontier but can get stuck burrowing down deep paths of a single site.
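
In the URLFrontier above, this choice comes down to the queue's comparator; a sketch of both orderings, assuming the URLTask class from that code and the same package:

import java.util.Comparator;

public class TraversalOrders {
    // BFS-like: shallow pages first (ascending depth) -> important, high-level pages early
    static final Comparator<URLFrontier.URLTask> BFS_ORDER =
        Comparator.comparingInt(task -> task.depth);

    // DFS-like: deepest pages first -> smaller frontier, but risks going deep into one site
    static final Comparator<URLFrontier.URLTask> DFS_ORDER =
        Comparator.comparingInt((URLFrontier.URLTask task) -> task.depth).reversed();
}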

Centralized vs Distributed:

A centralized crawler is simpler to build but limited in scale; a distributed crawler can handle billions of pages but needs coordination infrastructure (e.g., Kafka, Redis).

Politeness vs Speed:

Aggressive crawling is fast but risks getting your IPs banned; polite crawling (with per-host delays) is slower but keeps target servers happy.

Storage Format:

Storing raw HTML is flexible but large; storing only extracted text is smaller but loses structure. Compression helps in either case.
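
As a rough illustration, gzipping raw HTML before it goes to blob storage typically shrinks it several-fold; a minimal sketch using java.util.zip (class name is illustrative):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class HtmlCompressor {
    /** Gzip-compresses an HTML string before it is written to blob storage. */
    static byte[] compress(String html) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(html.getBytes(StandardCharsets.UTF_8));
        }
        return out.toByteArray(); // store these bytes instead of the raw HTML
    }
}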

Optimizations

  • DNS Caching: Cache DNS lookups (e.g., for 24 hours) to reduce latency
  • Connection Pooling: Reuse HTTP connections to the same hosts
  • URL Fingerprinting: Use MD5/SHA hashes for O(1) duplicate detection
  • Bloom Filters: Probabilistic deduplication for memory efficiency (see the sketch after this list)
  • Sharding by Domain: Each crawler handles specific domains to avoid collisions
  • Incremental Crawling: Only re-crawl pages that changed (ETags, Last-Modified)
  • Focused Crawling: Prioritize relevant topics using ML classifiers
  • Distributed Queue: Use Kafka for the URL frontier in a distributed setup
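
The Bloom filter is worth a concrete sketch: storing 15 billion URLs even as 8-byte fingerprints needs roughly 120 GB, while a Bloom filter uses a fraction of that at the cost of a small false-positive rate (a falsely "seen" URL is simply never crawled). A minimal hand-rolled version (sizing and class name are illustrative; libraries such as Guava also provide one):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.BitSet;

public class URLBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    public URLBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** May return true for a URL that was never added (false positive), never false for one that was. */
    public synchronized boolean mightContain(String url) {
        for (int position : positions(url)) {
            if (!bits.get(position)) {
                return false;
            }
        }
        return true;
    }

    public synchronized void add(String url) {
        for (int position : positions(url)) {
            bits.set(position);
        }
    }

    /** Derives k bit positions from two halves of a SHA-256 digest (double hashing). */
    private int[] positions(String url) {
        byte[] digest = sha256(url.getBytes(StandardCharsets.UTF_8));
        long h1 = toLong(digest, 0);
        long h2 = toLong(digest, 8);
        int[] result = new int[numHashes];
        for (int i = 0; i < numHashes; i++) {
            long combined = h1 + (long) i * h2;
            result[i] = (int) Long.remainderUnsigned(combined, numBits);
        }
        return result;
    }

    private static long toLong(byte[] bytes, int offset) {
        long value = 0;
        for (int i = offset; i < offset + 8; i++) {
            value = (value << 8) | (bytes[i] & 0xFF);
        }
        return value;
    }

    private static byte[] sha256(byte[] input) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(input);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}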

Challenges & Solutions

Challenge: Duplicate Content

Solution: Use content fingerprinting (SimHash) to detect near-duplicates, not just URL dedup.
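
A minimal SimHash sketch, assuming simple whitespace tokenization (production systems use shingles and weighted features): two pages whose 64-bit fingerprints differ in only a few bits are likely near-duplicates.

public class SimHash {
    /** 64-bit SimHash of a document's text. */
    static long fingerprint(String text) {
        int[] vector = new int[64];
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) continue;
            long hash = fnv1a64(token);
            for (int bit = 0; bit < 64; bit++) {
                // Each token votes +1 or -1 on every bit position
                if (((hash >>> bit) & 1) == 1) {
                    vector[bit]++;
                } else {
                    vector[bit]--;
                }
            }
        }
        long fingerprint = 0;
        for (int bit = 0; bit < 64; bit++) {
            if (vector[bit] > 0) {
                fingerprint |= (1L << bit);
            }
        }
        return fingerprint;
    }

    /** Near-duplicate if the fingerprints differ in at most `threshold` bits (3 is a common choice). */
    static boolean nearDuplicate(long a, long b, int threshold) {
        return Long.bitCount(a ^ b) <= threshold;
    }

    /** Simple 64-bit FNV-1a hash for tokens. */
    private static long fnv1a64(String token) {
        long hash = 0xcbf29ce484222325L;
        for (int i = 0; i < token.length(); i++) {
            hash ^= token.charAt(i);
            hash *= 0x100000001b3L;
        }
        return hash;
    }
}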

Challenge: Dynamic Content (JavaScript)

Solution: Use headless browsers (Selenium, Puppeteer) for JavaScript-heavy sites. Expensive but necessary.

Challenge: Crawler Traps

Solution: Set max depth limit, detect URL patterns (calendar pages), use bloom filters for seen URLs.
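
A minimal sketch of such URL-pattern heuristics (the exact limits are illustrative):

import java.net.URL;
import java.util.HashMap;
import java.util.Map;

public class TrapDetector {
    private static final int MAX_URL_LENGTH = 512;
    private static final int MAX_PATH_SEGMENTS = 10;
    private static final int MAX_REPEATED_SEGMENT = 2;

    /** Returns true if the URL looks like a crawler trap (very long, very deep, or looping paths). */
    static boolean looksLikeTrap(String url) {
        if (url.length() > MAX_URL_LENGTH) {
            return true;
        }
        try {
            String[] segments = new URL(url).getPath().split("/");
            if (segments.length > MAX_PATH_SEGMENTS) {
                return true;
            }
            // A segment repeating many times (e.g. /a/b/a/b/a/b/...) suggests a loop
            Map<String, Integer> counts = new HashMap<>();
            for (String segment : segments) {
                if (segment.isEmpty()) continue;
                int count = counts.merge(segment, 1, Integer::sum);
                if (count > MAX_REPEATED_SEGMENT) {
                    return true;
                }
            }
            return false;
        } catch (Exception e) {
            return true; // malformed URL: treat as a trap
        }
    }
}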

Challenge: Distributed Coordination

Solution: Use consistent hashing to assign URL ranges to crawlers. ZooKeeper for coordination.
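
A minimal consistent-hashing sketch that maps each host to a crawler node (node names and the virtual-node count are illustrative):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class CrawlerRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private static final int VIRTUAL_NODES = 100; // smooths out load per physical node

    public CrawlerRing(List<String> crawlerNodes) {
        for (String node : crawlerNodes) {
            for (int i = 0; i < VIRTUAL_NODES; i++) {
                ring.put(hash(node + "#" + i), node);
            }
        }
    }

    /** All URLs of the same host map to the same crawler, keeping politeness state local. */
    public String crawlerFor(String host) {
        SortedMap<Long, String> tail = ring.tailMap(hash(host));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String key) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                .digest(key.getBytes(StandardCharsets.UTF_8));
            long value = 0;
            for (int i = 0; i < 8; i++) {
                value = (value << 8) | (digest[i] & 0xFF);
            }
            return value;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}

Adding or removing a node then only remaps the hosts that hashed to that node, instead of reshuffling every assignment.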

Challenge: Fresh Content

Solution: Prioritize news sites, use sitemaps, check Last-Modified headers, adaptive re-crawl frequency.
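
A minimal sketch of such a conditional check, using the ETag and Last-Modified values saved from the previous crawl (parameter names are illustrative):

import java.net.HttpURLConnection;
import java.net.URL;

public class ConditionalFetcher {
    /** Returns true if the page changed since the last crawl and should be re-downloaded. */
    static boolean hasChanged(String url, String savedETag, String savedLastModified) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("HEAD"); // headers only; a GET with the same headers also works
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        conn.setRequestProperty("User-Agent", "MyCrawler/1.0");
        if (savedETag != null) {
            conn.setRequestProperty("If-None-Match", savedETag);
        }
        if (savedLastModified != null) {
            conn.setRequestProperty("If-Modified-Since", savedLastModified);
        }
        // 304 Not Modified means the stored copy is still fresh
        return conn.getResponseCode() != HttpURLConnection.HTTP_NOT_MODIFIED;
    }
}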

Real-World Examples

  • Googlebot: 15+ billion pages, distributed across thousands of machines, respects robots.txt
  • Common Crawl: Public dataset with 250+ billion pages, runs monthly crawls using Hadoop
  • Scrapy: Python framework with built-in politeness, dedup, and middleware support
  • Nutch: Apache distributed crawler used with Hadoop, supports plugins for custom parsing