Design Web Crawler
Explain Like I'm 5
Imagine you're in a huge library with millions of books connected by strings. Each book has words pointing to other books: "Check out Book 123 for more!" A web crawler is like a robot librarian that:
1. Starts with one book (a website)
2. Reads it and writes down important information
3. Follows all the strings (links) to other books
4. Remembers which books it already read so it doesn't read them twice
5. Keeps doing this for MILLIONS of books!
This is how Google knows what's on every website: it sends millions of these robot librarians to read the entire internet!
Key Features
- Download web pages and extract content
- Follow links to discover new pages (BFS/DFS traversal)
- Handle billions of URLs efficiently
- Be polite: respect robots.txt and rate limits
- Handle failures and duplicate URLs
- Prioritize important pages (PageRank, freshness)
- Distributed crawling across multiple machines
- Store and index extracted content
Requirements
Functional Requirements:
- Crawl given seed URLs and discover new pages
- Extract content, metadata, and outgoing links
- Store crawled data for indexing/search
- Respect robots.txt and crawl-delay
- Detect and avoid duplicate URLs
- Support prioritization (important pages first)
Non-Functional Requirements:
- Scalability: Crawl billions of pages
- Performance: Thousands of pages per second
- Politeness: Don't overwhelm servers
- Reliability: Handle failures gracefully
- Freshness: Re-crawl pages to detect updates
Back-of-Envelope Estimation
- Pages to crawl: 15 billion pages
- Crawl frequency: Once per month (30 days)
- Pages per second: 15B / (30 * 24 * 3600) ≈ 5,800 pages/sec
- Average page size: 500 KB
- Storage per month: 15B * 500 KB = 7.5 PB
- Network bandwidth: 5,800 * 500 KB = 2.9 GB/s
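As a sanity check, the arithmetic above can be reproduced directly. A throwaway sketch (the inputs are exactly the assumptions listed above):
import java.util.Locale;

public class CrawlEstimates {
    public static void main(String[] args) {
        double pages = 15e9;                 // pages per crawl cycle
        double seconds = 30 * 24 * 3600.0;   // one month in seconds
        double pageSizeKB = 500;             // average page size

        double pagesPerSec = pages / seconds;                  // ~5,800 pages/sec
        double storagePB = pages * pageSizeKB / 1e12;          // KB -> PB: ~7.5 PB per crawl
        double bandwidthGBps = pagesPerSec * pageSizeKB / 1e6; // KB/s -> GB/s: ~2.9 GB/s

        System.out.printf(Locale.US, "Pages/sec: %.0f%n", pagesPerSec);
        System.out.printf(Locale.US, "Storage per crawl: %.1f PB%n", storagePB);
        System.out.printf(Locale.US, "Bandwidth: %.1f GB/s%n", bandwidthGBps);
    }
}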
System Architecture
┌─────────────────────┐
│ Seed URLs │
│ (Start points) │
└──────────┬──────────┘
│
▼
┌─────────────────────┐ ┌──────────────────┐
│ URL Frontier │◄────────│ URL Filter │
│ (Priority Queue) │ │ (Dedup, robots) │
└──────────┬──────────┘ └──────────────────┘
│
│ Get URLs to crawl
│
┌──────────▼──────────┐
│ Downloader Pool │
│ (Fetch HTML) │
└──────────┬──────────┘
│
│ HTML Content
│
┌──────────▼──────────┐
│ Content Parser │───────► Parse HTML, Extract:
│ (Extract Data) │ • Text Content
└──────────┬──────────┘ • Metadata
│ • Outgoing Links
│
├─────────────────┐
│ │
▼ ▼
┌──────────────────┐ ┌────────────────┐
│ Content Store │ │ Link Extractor│
│ (Blob Storage) │ └────────┬───────┘
└──────────────────┘ │
│ New URLs
│
▼
┌──────────────────┐
│ URL Filter │
│ (Back to queue) │
└──────────────────┘
Key Components
1. URL Frontier (Priority Queue)
Manages URLs to crawl. Priorities: PageRank, freshness, depth.
2. Downloader Pool
Worker threads that fetch HTML from web servers.
3. DNS Resolver
Resolves domain names to IP addresses and caches the results (a minimal caching sketch follows this list).
4. Content Parser
Parses HTML, extracts text, metadata, and links.
5. Duplicate Detector
Uses URL fingerprinting (hash) to avoid re-crawling.
6. Content Store
Stores raw HTML and extracted content (S3, HDFS).
7. URL Filter
Checks robots.txt, blacklists, and deduplication.
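DNS resolution is easy to overlook: at thousands of fetches per second, a fresh lookup per URL adds latency and load on resolvers, so crawlers cache host-to-IP mappings. A minimal in-process sketch using the JDK's InetAddress (the class name is illustrative, and a production resolver would also expire entries to honor DNS TTLs; the downloader code below relies on HttpURLConnection's own resolution instead):
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical helper: memoizes hostname -> IP lookups so repeated URLs
// on the same host skip the DNS round trip.
public class CachedDNSResolver {
    private final ConcurrentMap<String, InetAddress> cache = new ConcurrentHashMap<>();

    public InetAddress resolve(String host) throws UnknownHostException {
        InetAddress cached = cache.get(host);
        if (cached != null) {
            return cached;
        }
        InetAddress address = InetAddress.getByName(host); // real lookup
        cache.put(host, address);
        return address;
    }
}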
Deep Dive: Implementation
1. URL Frontier - Prioritized Queue
Manages which URLs to crawl next, using a priority queue weighted by depth, PageRank, and (optionally) freshness.
import java.net.URL;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.PriorityBlockingQueue;

public class URLFrontier {
    // Prioritized queue of URLs to crawl
    private final PriorityBlockingQueue<URLTask> urlQueue;
    private final Set<String> seenURLs;             // Deduplication
    private final Map<String, Long> hostLastCrawl;  // Politeness
    private final int minCrawlDelayMs = 1000;       // 1 second between requests to the same host

    public URLFrontier() {
        // Priority: higher priority value = crawl first
        this.urlQueue = new PriorityBlockingQueue<>(
            10000, (a, b) -> Double.compare(b.priority, a.priority)
        );
        this.seenURLs = ConcurrentHashMap.newKeySet();
        this.hostLastCrawl = new ConcurrentHashMap<>();
    }

    /**
     * Adds a URL to the frontier with priority calculation.
     */
    public void addURL(String url, int depth, double pageRank) {
        // 1. Normalize URL (canonical scheme/host, remove fragments)
        String normalized = normalizeURL(url);

        // 2. Check if already seen
        if (seenURLs.contains(normalized)) {
            return; // Skip duplicate
        }

        // 3. Calculate priority (0-100 range)
        double priority = calculatePriority(depth, pageRank);

        // 4. Add to queue
        URLTask task = new URLTask(normalized, depth, priority);
        urlQueue.offer(task);
        seenURLs.add(normalized);
    }

    /**
     * Gets the next URL to crawl (respecting politeness).
     */
    public URLTask getNextURL() throws InterruptedException {
        while (true) {
            URLTask task = urlQueue.take(); // Blocking call
            String host = extractHost(task.url);

            // Check politeness: wait if we crawled this host recently
            Long lastCrawl = hostLastCrawl.get(host);
            if (lastCrawl != null) {
                long elapsed = System.currentTimeMillis() - lastCrawl;
                if (elapsed < minCrawlDelayMs) {
                    // Put back in the queue and wait
                    urlQueue.offer(task);
                    Thread.sleep(minCrawlDelayMs - elapsed);
                    continue;
                }
            }

            // Update last crawl time
            hostLastCrawl.put(host, System.currentTimeMillis());
            return task;
        }
    }

    /**
     * Calculates priority based on multiple factors.
     * Higher priority = crawl sooner.
     */
    private double calculatePriority(int depth, double pageRank) {
        // Factors:
        // 1. Depth: prefer shallow pages (closer to the root)
        // 2. PageRank: prefer important pages
        // 3. Freshness: a timestamp term could be added here
        double depthScore = 100.0 / (depth + 1); // Depth 0 = 100, depth 9 = 10
        double pageRankScore = pageRank * 50;    // PageRank 0-1 -> 0-50
        return depthScore + pageRankScore;
    }

    /**
     * Normalizes a URL for consistency.
     */
    private String normalizeURL(String url) {
        try {
            URL parsed = new URL(url);
            // Lowercase only scheme and host (paths may be case-sensitive); drop the fragment
            return parsed.getProtocol().toLowerCase() + "://" + parsed.getHost().toLowerCase()
                + parsed.getPath()
                + (parsed.getQuery() != null ? "?" + parsed.getQuery() : "");
        } catch (Exception e) {
            return url;
        }
    }

    private String extractHost(String url) {
        try {
            return new URL(url).getHost();
        } catch (Exception e) {
            return "";
        }
    }

    /**
     * Task representing a URL to crawl.
     */
    static class URLTask {
        String url;
        int depth;
        double priority;

        URLTask(String url, int depth, double priority) {
            this.url = url;
            this.depth = depth;
            this.priority = priority;
        }
    }
}
2. Web Crawler - Main Engine
Coordinates downloading, parsing, and link extraction.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WebCrawler {
    private final URLFrontier frontier;
    private final ContentStore contentStore;
    private final RobotsCache robotsCache;
    private final ExecutorService downloaderPool;
    private final int maxDepth = 5;
    private final int numThreads = 100;

    public WebCrawler(URLFrontier frontier, ContentStore store) {
        this.frontier = frontier;
        this.contentStore = store;
        this.robotsCache = new RobotsCache();
        this.downloaderPool = Executors.newFixedThreadPool(numThreads);
    }

    /**
     * Starts crawling from seed URLs.
     */
    public void crawl(List<String> seedURLs) {
        // 1. Add seed URLs to the frontier
        for (String url : seedURLs) {
            frontier.addURL(url, 0, 1.0); // depth=0, high priority
        }

        // 2. Start crawler threads
        for (int i = 0; i < numThreads; i++) {
            downloaderPool.submit(this::crawlerWorker);
        }
    }

    /**
     * Worker thread that continuously crawls URLs.
     */
    private void crawlerWorker() {
        while (true) {
            try {
                // 1. Get next URL from the frontier
                URLFrontier.URLTask task = frontier.getNextURL();

                // 2. Check depth limit
                if (task.depth > maxDepth) {
                    continue;
                }

                // 3. Check robots.txt
                if (!robotsCache.isAllowed(task.url)) {
                    System.out.println("Blocked by robots.txt: " + task.url);
                    continue;
                }

                // 4. Download page
                PageContent content = downloadPage(task.url);
                if (content == null) {
                    continue; // Failed to download
                }

                // 5. Store content
                contentStore.save(task.url, content);

                // 6. Extract and queue outgoing links
                List<String> links = extractLinks(content.html, task.url);
                for (String link : links) {
                    // Add with increased depth and inherited PageRank
                    frontier.addURL(link, task.depth + 1, content.pageRank * 0.85);
                }

                System.out.println("Crawled: " + task.url
                    + " (depth=" + task.depth + ", links=" + links.size() + ")");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            } catch (Exception e) {
                System.err.println("Error in crawler: " + e.getMessage());
            }
        }
    }

    /**
     * Downloads HTML content from a URL.
     */
    private PageContent downloadPage(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            conn.setRequestProperty("User-Agent", "MyCrawler/1.0");

            int responseCode = conn.getResponseCode();
            if (responseCode != 200) {
                return null; // Failed
            }

            // Read response
            StringBuilder html = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    html.append(line).append("\n");
                }
            }
            return new PageContent(url, html.toString());
        } catch (Exception e) {
            System.err.println("Failed to download " + url + ": " + e.getMessage());
            return null;
        }
    }

    /**
     * Extracts all links from HTML content.
     */
    private List<String> extractLinks(String html, String baseURL) {
        List<String> links = new ArrayList<>();

        // Simple regex to extract href attributes
        // Production: use a proper HTML parser (Jsoup)
        Pattern pattern = Pattern.compile("href=[\"']([^\"']+)[\"']");
        Matcher matcher = pattern.matcher(html);

        while (matcher.find()) {
            String link = matcher.group(1);
            // Convert relative URLs to absolute
            String absoluteURL = resolveURL(baseURL, link);
            if (absoluteURL != null && isValidURL(absoluteURL)) {
                links.add(absoluteURL);
            }
        }
        return links;
    }

    /**
     * Resolves a relative URL to an absolute one.
     */
    private String resolveURL(String baseURL, String relativeURL) {
        try {
            URL base = new URL(baseURL);
            URL resolved = new URL(base, relativeURL);
            return resolved.toString();
        } catch (Exception e) {
            return null;
        }
    }

    /**
     * Checks if a URL is valid HTTP/HTTPS.
     */
    private boolean isValidURL(String url) {
        return url.startsWith("http://") || url.startsWith("https://");
    }

    /**
     * Shuts down the crawler gracefully.
     */
    public void shutdown() {
        downloaderPool.shutdown();
        try {
            if (!downloaderPool.awaitTermination(60, TimeUnit.SECONDS)) {
                downloaderPool.shutdownNow();
            }
        } catch (InterruptedException e) {
            downloaderPool.shutdownNow();
        }
    }

    static class PageContent {
        String url;
        String html;
        double pageRank = 0.5; // Default

        PageContent(String url, String html) {
            this.url = url;
            this.html = html;
        }
    }
}
3. Robots.txt Handler
Respects website crawling rules to be a polite crawler.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RobotsCache {
    // Cache robots.txt rules per domain
    private final Map<String, RobotRules> cache;
    private final int cacheTTLSeconds = 86400; // 24 hours (expiry not enforced in this sketch)

    public RobotsCache() {
        this.cache = new ConcurrentHashMap<>();
    }

    /**
     * Checks if a URL is allowed by robots.txt.
     */
    public boolean isAllowed(String url) {
        try {
            URL parsed = new URL(url);
            String domain = parsed.getHost();

            // Get or fetch robots.txt rules
            RobotRules rules = cache.computeIfAbsent(domain, this::fetchRobotsTxt);

            // Check if the URL path is allowed
            String path = parsed.getPath();
            return rules.isAllowed(path);
        } catch (Exception e) {
            // If we can't parse or fetch, allow by default
            return true;
        }
    }

    /**
     * Fetches and parses robots.txt for a domain.
     */
    private RobotRules fetchRobotsTxt(String domain) {
        try {
            String robotsURL = "https://" + domain + "/robots.txt";
            HttpURLConnection conn = (HttpURLConnection) new URL(robotsURL).openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);

            if (conn.getResponseCode() != 200) {
                return new RobotRules(); // No robots.txt = allow all
            }

            // Read robots.txt
            StringBuilder content = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    content.append(line).append("\n");
                }
            }
            return parseRobotsTxt(content.toString());
        } catch (Exception e) {
            return new RobotRules(); // Allow all on error
        }
    }

    /**
     * Parses robots.txt content.
     */
    private RobotRules parseRobotsTxt(String content) {
        RobotRules rules = new RobotRules();
        boolean inOurSection = false;

        for (String line : content.split("\n")) {
            line = line.trim().toLowerCase();

            if (line.startsWith("user-agent:")) {
                String agent = line.substring(11).trim();
                inOurSection = agent.equals("*") || agent.equals("mycrawler");
            } else if (inOurSection) {
                if (line.startsWith("disallow:")) {
                    String path = line.substring(9).trim();
                    if (!path.isEmpty()) {
                        rules.disallowedPaths.add(path);
                    }
                } else if (line.startsWith("allow:")) {
                    String path = line.substring(6).trim();
                    if (!path.isEmpty()) {
                        rules.allowedPaths.add(path);
                    }
                } else if (line.startsWith("crawl-delay:")) {
                    try {
                        rules.crawlDelay = Integer.parseInt(line.substring(12).trim());
                    } catch (NumberFormatException ignored) {
                        // Malformed crawl-delay: keep the default
                    }
                }
            }
        }
        return rules;
    }

    /**
     * Rules extracted from robots.txt.
     */
    static class RobotRules {
        List<String> disallowedPaths = new ArrayList<>();
        List<String> allowedPaths = new ArrayList<>();
        int crawlDelay = 0; // seconds

        /**
         * Checks if a path is allowed.
         */
        boolean isAllowed(String path) {
            // Check explicit allow first
            for (String allowed : allowedPaths) {
                if (path.startsWith(allowed)) {
                    return true;
                }
            }
            // Check disallow
            for (String disallowed : disallowedPaths) {
                if (path.startsWith(disallowed)) {
                    return false;
                }
            }
            return true; // Default allow
        }
    }
}
4. Content Store
Stores crawled pages for later indexing and search.
import java.io.FileWriter;
import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContentStore {
    private final String storageBasePath;
    private final MessageDigest md5; // MessageDigest is not thread-safe; hashURL synchronizes on it

    public ContentStore(String basePath) {
        this.storageBasePath = basePath;
        try {
            this.md5 = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }

    /**
     * Saves page content to storage.
     */
    public void save(String url, WebCrawler.PageContent content) {
        try {
            // 1. Generate a unique filename from the URL hash
            String hash = hashURL(url);
            String filename = hash + ".html";
            String filepath = storageBasePath + "/" + filename;

            // 2. Create metadata
            PageMetadata metadata = extractMetadata(content);

            // 3. Write to disk (or S3, HDFS, etc.)
            writeToFile(filepath, content.html);
            writeToFile(filepath + ".meta", serializeMetadata(metadata));

            System.out.println("Stored: " + url + " -> " + filename);
        } catch (Exception e) {
            System.err.println("Failed to store " + url + ": " + e.getMessage());
        }
    }

    /**
     * Generates an MD5 hash of the URL for a unique filename.
     */
    private synchronized String hashURL(String url) {
        byte[] hash = md5.digest(url.getBytes());
        StringBuilder hex = new StringBuilder();
        for (byte b : hash) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /**
     * Extracts metadata from page content.
     */
    private PageMetadata extractMetadata(WebCrawler.PageContent content) {
        PageMetadata meta = new PageMetadata();
        meta.url = content.url;
        meta.crawlTime = System.currentTimeMillis();
        meta.contentLength = content.html.length();

        // Extract title
        Pattern titlePattern = Pattern.compile("<title>(.+?)</title>", Pattern.CASE_INSENSITIVE);
        Matcher titleMatcher = titlePattern.matcher(content.html);
        if (titleMatcher.find()) {
            meta.title = titleMatcher.group(1);
        }

        // Extract meta description
        Pattern descPattern = Pattern.compile(
            "<meta\\s+name=\"description\"\\s+content=\"(.+?)\"",
            Pattern.CASE_INSENSITIVE
        );
        Matcher descMatcher = descPattern.matcher(content.html);
        if (descMatcher.find()) {
            meta.description = descMatcher.group(1);
        }
        return meta;
    }

    /**
     * Writes content to a file.
     */
    private void writeToFile(String filepath, String content) throws IOException {
        try (FileWriter writer = new FileWriter(filepath)) {
            writer.write(content);
        }
    }

    /**
     * Serializes metadata to JSON.
     */
    private String serializeMetadata(PageMetadata meta) {
        return String.format(
            "{\"url\":\"%s\",\"title\":\"%s\",\"crawlTime\":%d,\"contentLength\":%d}",
            meta.url, meta.title, meta.crawlTime, meta.contentLength
        );
    }

    static class PageMetadata {
        String url;
        String title;
        String description;
        long crawlTime;
        int contentLength;
    }
}
Trade-offs and Optimizations
BFS vs DFS Crawling:
BFS finds important pages faster because it crawls shallow pages first; DFS keeps a smaller frontier in memory but can get stuck in deep paths.
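As a toy illustration (separate from the priority-based frontier above), the difference comes down to which end of the work queue new URLs are taken from:
import java.util.ArrayDeque;
import java.util.Deque;

// Toy sketch: the same Deque yields BFS or DFS depending on where URLs are
// taken from. The URL Frontier above replaces this with a priority queue so
// ordering is driven by score rather than pure BFS/DFS.
public class TraversalOrder {
    public static String nextURL(Deque<String> frontier, boolean bfs) {
        // BFS: take from the head (oldest, shallowest first).
        // DFS: take from the tail (most recently discovered first).
        return bfs ? frontier.pollFirst() : frontier.pollLast();
    }
}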
Centralized vs Distributed:
A centralized crawler is simpler but limited in scale; a distributed crawler can reach billions of pages but needs coordination infrastructure (Kafka, Redis).
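One way the coordination piece might look, sketched under the assumption that Kafka serves as the shared frontier (the topic name "crawl-urls" and the broker address are illustrative, not part of the design above):
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch: each worker publishes newly discovered URLs to a shared Kafka topic
// instead of a local in-memory queue.
public class DistributedFrontierPublisher {
    private final KafkaProducer<String, String> producer;

    public DistributedFrontierPublisher(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
    }

    public void publish(String url) {
        String host = java.net.URI.create(url).getHost();
        // Keying by host keeps a host's URLs in one partition, so a single
        // consumer can enforce per-host politeness.
        producer.send(new ProducerRecord<>("crawl-urls", host, url));
    }
}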
Politeness vs Speed:
Aggressive crawling (high speed) may get your IPs banned; polite crawling (per-host delays) is slower but keeps servers cooperative.
Storage Format:
Storing raw HTML is flexible but large; storing only extracted text is smaller but loses structure. Either way, consider compression (see the sketch below).
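A minimal compression sketch using the JDK's GZIPOutputStream; exact ratios vary by page, but HTML typically shrinks several-fold, which matters at the multi-petabyte scale estimated earlier:
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Sketch: compress page HTML before writing it to blob storage.
public class HtmlCompressor {
    public static byte[] gzip(String html) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(html.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }
}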
Challenges & Solutions
Challenge: Duplicate Content
Solution: Use content fingerprinting (SimHash) to detect near-duplicates, not just URL dedup.
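A minimal SimHash sketch (whitespace tokenization and an FNV-style hash for brevity; production systems weight tokens and use stronger hashing):
// Token hashes vote on each of 64 bit positions; near-duplicate pages end up
// with fingerprints that differ in only a few bits (small Hamming distance).
public class SimHash {
    public static long fingerprint(String text) {
        int[] votes = new int[64];
        for (String token : text.toLowerCase().split("\\s+")) {
            long h = hash64(token);
            for (int i = 0; i < 64; i++) {
                votes[i] += ((h >>> i) & 1L) == 1L ? 1 : -1;
            }
        }
        long fp = 0L;
        for (int i = 0; i < 64; i++) {
            if (votes[i] > 0) fp |= (1L << i);
        }
        return fp;
    }

    public static boolean nearDuplicate(long a, long b, int maxDifferingBits) {
        return Long.bitCount(a ^ b) <= maxDifferingBits; // e.g. 3
    }

    // FNV-1a style 64-bit hash; any well-mixed hash works here.
    private static long hash64(String s) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }
}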
Challenge: Dynamic Content (JavaScript)
Solution: Use headless browsers (Selenium, Puppeteer) for JavaScript-heavy sites. Expensive but necessary.
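A sketch of what a rendering fetch might look like, assuming Selenium and ChromeDriver are available; treat it as a slow path reserved for JavaScript-heavy sites:
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

// Sketch: render the page in headless Chrome so JavaScript-generated content
// ends up in the HTML we parse. Much heavier than a plain HTTP fetch.
public class RenderingFetcher {
    public static String fetchRendered(String url) {
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless=new"); // Chrome's headless mode
        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get(url);
            return driver.getPageSource(); // DOM after scripts have run
        } finally {
            driver.quit();
        }
    }
}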
Challenge: Crawler Traps
Solution: Set max depth limit, detect URL patterns (calendar pages), use bloom filters for seen URLs.
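A sketch of the seen-URL check with a Bloom filter, assuming Guava is on the classpath (sizing numbers are illustrative): memory stays bounded, and a false positive only means an occasional page is skipped, which is usually acceptable for a crawler.
import java.nio.charset.StandardCharsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

// Sketch: ~1 billion URLs at a 1% false-positive rate needs on the order of a
// gigabyte, far less than storing every URL string.
public class SeenURLFilter {
    private final BloomFilter<CharSequence> seen =
        BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000_000L, 0.01);

    /** Returns true if the URL was definitely not seen before, and records it. */
    public synchronized boolean markIfNew(String normalizedURL) {
        if (seen.mightContain(normalizedURL)) {
            return false; // probably seen already (small chance of false positive)
        }
        seen.put(normalizedURL);
        return true;
    }
}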
Challenge: Distributed Coordination
Solution: Use consistent hashing to assign URL ranges to crawlers. ZooKeeper for coordination.
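A minimal consistent-hashing sketch (assignment by host, with virtual nodes; the class name and constants are illustrative): each crawler node owns all URLs for its hosts, which keeps politeness local, and adding or removing a node only remaps a small fraction of hosts.
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch: hash ring mapping hosts to crawler nodes.
public class CrawlerRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private static final int VIRTUAL_NODES = 100; // smooths the distribution

    public void addNode(String nodeId) {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.put(hash(nodeId + "#" + i), nodeId);
        }
    }

    /** Which crawler node is responsible for this host? */
    public String nodeForHost(String host) {
        if (ring.isEmpty()) return null;
        SortedMap<Long, String> tail = ring.tailMap(hash(host));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String key) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(key.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xff);
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}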
Challenge: Fresh Content
Solution: Prioritize news sites, use sitemaps, check Last-Modified headers, adaptive re-crawl frequency.
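A sketch of adaptive re-crawl scheduling, under the simple assumption that the interval halves when a page changed and doubles when it didn't (the bounds are illustrative):
// Pages that change are revisited sooner; pages that don't are backed off,
// bounded on both ends.
public class RecrawlScheduler {
    private static final long MIN_INTERVAL_MS = 60L * 60 * 1000;           // 1 hour
    private static final long MAX_INTERVAL_MS = 30L * 24 * 60 * 60 * 1000; // 30 days

    /** Returns the next re-crawl interval given whether the page changed. */
    public static long nextInterval(long currentIntervalMs, boolean contentChanged) {
        long next = contentChanged
            ? currentIntervalMs / 2   // changed: check back sooner
            : currentIntervalMs * 2;  // unchanged: back off
        return Math.max(MIN_INTERVAL_MS, Math.min(MAX_INTERVAL_MS, next));
    }
}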
Real-World Examples
- Googlebot: 15+ billion pages, distributed across thousands of machines, respects robots.txt
- Common Crawl: Public dataset with 250+ billion pages, runs monthly crawls using Hadoop
- Scrapy: Python framework with built-in politeness, dedup, and middleware support
- Nutch: Apache distributed crawler used with Hadoop, supports plugins for custom parsing