Design Web Crawler
Explain Like I'm 5
Imagine you're in a huge library with millions of books connected by strings. Each book has words pointing to other books: "Check out Book 123 for more!" A web crawler is like a robot librarian that:
1. Starts with one book (a website)
2. Reads it and writes down important information
3. Follows all the strings (links) to other books
4. Remembers which books it already read so it doesn't read them twice
5. Keeps doing this for MILLIONS of books!
This is how Google knows what's on every website: it sends millions of these robot librarians to read the entire internet!
Key Features
- Download web pages and extract content
- Follow links to discover new pages (BFS/DFS traversal)
- Handle billions of URLs efficiently
- Be polite: respect robots.txt and rate limits
- Handle failures and duplicate URLs
- Prioritize important pages (PageRank, freshness)
- Distributed crawling across multiple machines
- Store and index extracted content
Requirements
Functional Requirements:
- Crawl given seed URLs and discover new pages
- Extract content, metadata, and outgoing links
- Store crawled data for indexing/search
- Respect robots.txt and crawl-delay
- Detect and avoid duplicate URLs
- Support prioritization (important pages first)
Non-Functional Requirements:
- Scalability: Crawl billions of pages
- Performance: Thousands of pages per second
- Politeness: Don't overwhelm servers
- Reliability: Handle failures gracefully
- Freshness: Re-crawl pages to detect updates
Back-of-Envelope Estimation
- Pages to crawl: 15 billion pages
- Crawl frequency: Once per month (30 days)
- Pages per second: 15B / (30 * 24 * 3600) ≈ 5,800 pages/sec
- Average page size: 500 KB
- Storage per month: 15B * 500 KB = 7.5 PB
- Network bandwidth: 5,800 * 500 KB = 2.9 GB/s
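As a sanity check, the arithmetic above can be reproduced directly. A throwaway sketch (the inputs are exactly the assumptions listed above):
import java.util.Locale;

public class CrawlEstimates {
    public static void main(String[] args) {
        double pages = 15e9;                 // pages per crawl cycle
        double seconds = 30 * 24 * 3600.0;   // one month in seconds
        double pageSizeKB = 500;             // average page size

        double pagesPerSec = pages / seconds;                  // ~5,800 pages/sec
        double storagePB = pages * pageSizeKB / 1e12;          // KB -> PB: ~7.5 PB per crawl
        double bandwidthGBps = pagesPerSec * pageSizeKB / 1e6; // KB/s -> GB/s: ~2.9 GB/s

        System.out.printf(Locale.US, "Pages/sec: %.0f%n", pagesPerSec);
        System.out.printf(Locale.US, "Storage per crawl: %.1f PB%n", storagePB);
        System.out.printf(Locale.US, "Bandwidth: %.1f GB/s%n", bandwidthGBps);
    }
}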
System Architecture
┌─────────────────────┐
│ Seed URLs │
│ (Start points) │
└──────────┬──────────┘
│
▼
┌─────────────────────┐ ┌──────────────────┐
│ URL Frontier │◄────────│ URL Filter │
│ (Priority Queue) │ │ (Dedup, robots) │
└──────────┬──────────┘ └──────────────────┘
│
│ Get URLs to crawl
│
┌──────────▼──────────┐
│ Downloader Pool │
│ (Fetch HTML) │
└──────────┬──────────┘
│
│ HTML Content
│
┌──────────▼──────────┐
│ Content Parser │───────► Parse HTML, Extract:
│ (Extract Data) │ • Text Content
└──────────┬──────────┘ • Metadata
│ • Outgoing Links
│
├─────────────────┐
│ │
▼ ▼
┌──────────────────┐ ┌────────────────┐
│ Content Store │ │ Link Extractor│
│ (Blob Storage) │ └────────┬───────┘
└──────────────────┘ │
│ New URLs
│
▼
┌──────────────────┐
│ URL Filter │
│ (Back to queue) │
└──────────────────┘
Key Components
1. URL Frontier (Priority Queue)
Manages URLs to crawl. Priorities: PageRank, freshness, depth.
2. Downloader Pool
Worker threads that fetch HTML from web servers.
3. DNS Resolver
Resolves domain names to IP addresses and caches the results (a minimal caching sketch follows this list).
4. Content Parser
Parses HTML, extracts text, metadata, and links.
5. Duplicate Detector
Uses URL fingerprinting (hash) to avoid re-crawling.
6. Content Store
Stores raw HTML and extracted content (S3, HDFS).
7. URL Filter
Checks robots.txt, blacklists, and deduplication.
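DNS resolution is easy to overlook: at thousands of fetches per second, a fresh lookup per URL adds latency and load on resolvers, so crawlers cache host-to-IP mappings. A minimal in-process sketch using the JDK's InetAddress (the class name is illustrative, and a production resolver would also expire entries to honor DNS TTLs; the downloader code below relies on HttpURLConnection's own resolution instead):
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical helper: memoizes hostname -> IP lookups so repeated URLs
// on the same host skip the DNS round trip.
public class CachedDNSResolver {
    private final ConcurrentMap<String, InetAddress> cache = new ConcurrentHashMap<>();

    public InetAddress resolve(String host) throws UnknownHostException {
        InetAddress cached = cache.get(host);
        if (cached != null) {
            return cached;
        }
        InetAddress address = InetAddress.getByName(host); // real lookup
        cache.put(host, address);
        return address;
    }
}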
Deep Dive: Implementation
1. URL Frontier - Prioritized Queue
Manages which URLs to crawl next, using a priority queue weighted by depth, PageRank, and (optionally) freshness.
import java.net.URL;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.PriorityBlockingQueue;

public class URLFrontier {
    // Prioritized queue of URLs to crawl
    private final PriorityBlockingQueue<URLTask> urlQueue;
    private final Set<String> seenURLs;             // Deduplication
    private final Map<String, Long> hostLastCrawl;  // Politeness
    private final int minCrawlDelayMs = 1000;       // 1 second between requests to the same host

    public URLFrontier() {
        // Priority: higher priority value = crawl first
        this.urlQueue = new PriorityBlockingQueue<>(
            10000, (a, b) -> Double.compare(b.priority, a.priority)
        );
        this.seenURLs = ConcurrentHashMap.newKeySet();
        this.hostLastCrawl = new ConcurrentHashMap<>();
    }

    /**
     * Adds a URL to the frontier with priority calculation.
     */
    public void addURL(String url, int depth, double pageRank) {
        // 1. Normalize URL (canonical scheme/host, remove fragments)
        String normalized = normalizeURL(url);

        // 2. Check if already seen
        if (seenURLs.contains(normalized)) {
            return; // Skip duplicate
        }

        // 3. Calculate priority (0-100 range)
        double priority = calculatePriority(depth, pageRank);

        // 4. Add to queue
        URLTask task = new URLTask(normalized, depth, priority);
        urlQueue.offer(task);
        seenURLs.add(normalized);
    }

    /**
     * Gets the next URL to crawl (respecting politeness).
     */
    public URLTask getNextURL() throws InterruptedException {
        while (true) {
            URLTask task = urlQueue.take(); // Blocking call
            String host = extractHost(task.url);

            // Check politeness: wait if we crawled this host recently
            Long lastCrawl = hostLastCrawl.get(host);
            if (lastCrawl != null) {
                long elapsed = System.currentTimeMillis() - lastCrawl;
                if (elapsed < minCrawlDelayMs) {
                    // Put back in the queue and wait
                    urlQueue.offer(task);
                    Thread.sleep(minCrawlDelayMs - elapsed);
                    continue;
                }
            }

            // Update last crawl time
            hostLastCrawl.put(host, System.currentTimeMillis());
            return task;
        }
    }

    /**
     * Calculates priority based on multiple factors.
     * Higher priority = crawl sooner.
     */
    private double calculatePriority(int depth, double pageRank) {
        // Factors:
        // 1. Depth: prefer shallow pages (closer to the root)
        // 2. PageRank: prefer important pages
        // 3. Freshness: a timestamp term could be added here
        double depthScore = 100.0 / (depth + 1); // Depth 0 = 100, depth 9 = 10
        double pageRankScore = pageRank * 50;    // PageRank 0-1 -> 0-50
        return depthScore + pageRankScore;
    }

    /**
     * Normalizes a URL for consistency.
     */
    private String normalizeURL(String url) {
        try {
            URL parsed = new URL(url);
            // Lowercase only scheme and host (paths may be case-sensitive); drop the fragment
            return parsed.getProtocol().toLowerCase() + "://" + parsed.getHost().toLowerCase()
                + parsed.getPath()
                + (parsed.getQuery() != null ? "?" + parsed.getQuery() : "");
        } catch (Exception e) {
            return url;
        }
    }

    private String extractHost(String url) {
        try {
            return new URL(url).getHost();
        } catch (Exception e) {
            return "";
        }
    }

    /**
     * Task representing a URL to crawl.
     */
    static class URLTask {
        String url;
        int depth;
        double priority;

        URLTask(String url, int depth, double priority) {
            this.url = url;
            this.depth = depth;
            this.priority = priority;
        }
    }
}
2. Web Crawler - Main Engine
Coordinates downloading, parsing, and link extraction.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WebCrawler {
    private final URLFrontier frontier;
    private final ContentStore contentStore;
    private final RobotsCache robotsCache;
    private final ExecutorService downloaderPool;
    private final int maxDepth = 5;
    private final int numThreads = 100;

    public WebCrawler(URLFrontier frontier, ContentStore store) {
        this.frontier = frontier;
        this.contentStore = store;
        this.robotsCache = new RobotsCache();
        this.downloaderPool = Executors.newFixedThreadPool(numThreads);
    }

    /**
     * Starts crawling from seed URLs.
     */
    public void crawl(List<String> seedURLs) {
        // 1. Add seed URLs to the frontier
        for (String url : seedURLs) {
            frontier.addURL(url, 0, 1.0); // depth=0, high priority
        }

        // 2. Start crawler threads
        for (int i = 0; i < numThreads; i++) {
            downloaderPool.submit(this::crawlerWorker);
        }
    }

    /**
     * Worker thread that continuously crawls URLs.
     */
    private void crawlerWorker() {
        while (true) {
            try {
                // 1. Get next URL from the frontier
                URLFrontier.URLTask task = frontier.getNextURL();

                // 2. Check depth limit
                if (task.depth > maxDepth) {
                    continue;
                }

                // 3. Check robots.txt
                if (!robotsCache.isAllowed(task.url)) {
                    System.out.println("Blocked by robots.txt: " + task.url);
                    continue;
                }

                // 4. Download page
                PageContent content = downloadPage(task.url);
                if (content == null) {
                    continue; // Failed to download
                }

                // 5. Store content
                contentStore.save(task.url, content);

                // 6. Extract and queue outgoing links
                List<String> links = extractLinks(content.html, task.url);
                for (String link : links) {
                    // Add with increased depth and inherited PageRank
                    frontier.addURL(link, task.depth + 1, content.pageRank * 0.85);
                }

                System.out.println("Crawled: " + task.url
                    + " (depth=" + task.depth + ", links=" + links.size() + ")");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            } catch (Exception e) {
                System.err.println("Error in crawler: " + e.getMessage());
            }
        }
    }

    /**
     * Downloads HTML content from a URL.
     */
    private PageContent downloadPage(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            conn.setRequestProperty("User-Agent", "MyCrawler/1.0");

            int responseCode = conn.getResponseCode();
            if (responseCode != 200) {
                return null; // Failed
            }

            // Read response
            StringBuilder html = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    html.append(line).append("\n");
                }
            }
            return new PageContent(url, html.toString());
        } catch (Exception e) {
            System.err.println("Failed to download " + url + ": " + e.getMessage());
            return null;
        }
    }

    /**
     * Extracts all links from HTML content.
     */
    private List<String> extractLinks(String html, String baseURL) {
        List<String> links = new ArrayList<>();

        // Simple regex to extract href attributes
        // Production: use a proper HTML parser (Jsoup)
        Pattern pattern = Pattern.compile("href=[\"']([^\"']+)[\"']");
        Matcher matcher = pattern.matcher(html);

        while (matcher.find()) {
            String link = matcher.group(1);
            // Convert relative URLs to absolute
            String absoluteURL = resolveURL(baseURL, link);
            if (absoluteURL != null && isValidURL(absoluteURL)) {
                links.add(absoluteURL);
            }
        }
        return links;
    }

    /**
     * Resolves a relative URL to an absolute one.
     */
    private String resolveURL(String baseURL, String relativeURL) {
        try {
            URL base = new URL(baseURL);
            URL resolved = new URL(base, relativeURL);
            return resolved.toString();
        } catch (Exception e) {
            return null;
        }
    }

    /**
     * Checks if a URL is valid HTTP/HTTPS.
     */
    private boolean isValidURL(String url) {
        return url.startsWith("http://") || url.startsWith("https://");
    }

    /**
     * Shuts down the crawler gracefully.
     */
    public void shutdown() {
        downloaderPool.shutdown();
        try {
            if (!downloaderPool.awaitTermination(60, TimeUnit.SECONDS)) {
                downloaderPool.shutdownNow();
            }
        } catch (InterruptedException e) {
            downloaderPool.shutdownNow();
        }
    }

    static class PageContent {
        String url;
        String html;
        double pageRank = 0.5; // Default

        PageContent(String url, String html) {
            this.url = url;
            this.html = html;
        }
    }
}
3. Robots.txt Handler
Respects website crawling rules to be a polite crawler.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RobotsCache {
    // Cache robots.txt rules per domain
    private final Map<String, RobotRules> cache;
    private final int cacheTTLSeconds = 86400; // 24 hours (expiry not enforced in this sketch)

    public RobotsCache() {
        this.cache = new ConcurrentHashMap<>();
    }

    /**
     * Checks if a URL is allowed by robots.txt.
     */
    public boolean isAllowed(String url) {
        try {
            URL parsed = new URL(url);
            String domain = parsed.getHost();

            // Get or fetch robots.txt rules
            RobotRules rules = cache.computeIfAbsent(domain, this::fetchRobotsTxt);

            // Check if the URL path is allowed
            String path = parsed.getPath();
            return rules.isAllowed(path);
        } catch (Exception e) {
            // If we can't parse or fetch, allow by default
            return true;
        }
    }

    /**
     * Fetches and parses robots.txt for a domain.
     */
    private RobotRules fetchRobotsTxt(String domain) {
        try {
            String robotsURL = "https://" + domain + "/robots.txt";
            HttpURLConnection conn = (HttpURLConnection) new URL(robotsURL).openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);

            if (conn.getResponseCode() != 200) {
                return new RobotRules(); // No robots.txt = allow all
            }

            // Read robots.txt
            StringBuilder content = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    content.append(line).append("\n");
                }
            }
            return parseRobotsTxt(content.toString());
        } catch (Exception e) {
            return new RobotRules(); // Allow all on error
        }
    }

    /**
     * Parses robots.txt content.
     */
    private RobotRules parseRobotsTxt(String content) {
        RobotRules rules = new RobotRules();
        boolean inOurSection = false;

        for (String line : content.split("\n")) {
            line = line.trim().toLowerCase();

            if (line.startsWith("user-agent:")) {
                String agent = line.substring(11).trim();
                inOurSection = agent.equals("*") || agent.equals("mycrawler");
            } else if (inOurSection) {
                if (line.startsWith("disallow:")) {
                    String path = line.substring(9).trim();
                    if (!path.isEmpty()) {
                        rules.disallowedPaths.add(path);
                    }
                } else if (line.startsWith("allow:")) {
                    String path = line.substring(6).trim();
                    if (!path.isEmpty()) {
                        rules.allowedPaths.add(path);
                    }
                } else if (line.startsWith("crawl-delay:")) {
                    try {
                        rules.crawlDelay = Integer.parseInt(line.substring(12).trim());
                    } catch (NumberFormatException ignored) {
                        // Malformed crawl-delay: keep the default
                    }
                }
            }
        }
        return rules;
    }

    /**
     * Rules extracted from robots.txt.
     */
    static class RobotRules {
        List<String> disallowedPaths = new ArrayList<>();
        List<String> allowedPaths = new ArrayList<>();
        int crawlDelay = 0; // seconds

        /**
         * Checks if a path is allowed.
         */
        boolean isAllowed(String path) {
            // Check explicit allow first
            for (String allowed : allowedPaths) {
                if (path.startsWith(allowed)) {
                    return true;
                }
            }
            // Check disallow
            for (String disallowed : disallowedPaths) {
                if (path.startsWith(disallowed)) {
                    return false;
                }
            }
            return true; // Default allow
        }
    }
}
4. Content Store
Stores crawled pages for later indexing and search.
import java.io.FileWriter;
import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContentStore {
    private final String storageBasePath;
    private final MessageDigest md5; // MessageDigest is not thread-safe; hashURL synchronizes on it

    public ContentStore(String basePath) {
        this.storageBasePath = basePath;
        try {
            this.md5 = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }

    /**
     * Saves page content to storage.
     */
    public void save(String url, WebCrawler.PageContent content) {
        try {
            // 1. Generate a unique filename from the URL hash
            String hash = hashURL(url);
            String filename = hash + ".html";
            String filepath = storageBasePath + "/" + filename;

            // 2. Create metadata
            PageMetadata metadata = extractMetadata(content);

            // 3. Write to disk (or S3, HDFS, etc.)
            writeToFile(filepath, content.html);
            writeToFile(filepath + ".meta", serializeMetadata(metadata));

            System.out.println("Stored: " + url + " -> " + filename);
        } catch (Exception e) {
            System.err.println("Failed to store " + url + ": " + e.getMessage());
        }
    }

    /**
     * Generates an MD5 hash of the URL for a unique filename.
     */
    private synchronized String hashURL(String url) {
        byte[] hash = md5.digest(url.getBytes());
        StringBuilder hex = new StringBuilder();
        for (byte b : hash) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /**
     * Extracts metadata from page content.
     */
    private PageMetadata extractMetadata(WebCrawler.PageContent content) {
        PageMetadata meta = new PageMetadata();
        meta.url = content.url;
        meta.crawlTime = System.currentTimeMillis();
        meta.contentLength = content.html.length();

        // Extract title
        Pattern titlePattern = Pattern.compile("<title>(.+?)</title>", Pattern.CASE_INSENSITIVE);
        Matcher titleMatcher = titlePattern.matcher(content.html);
        if (titleMatcher.find()) {
            meta.title = titleMatcher.group(1);
        }

        // Extract meta description
        Pattern descPattern = Pattern.compile(
            "<meta\\s+name=\"description\"\\s+content=\"(.+?)\"",
            Pattern.CASE_INSENSITIVE
        );
        Matcher descMatcher = descPattern.matcher(content.html);
        if (descMatcher.find()) {
            meta.description = descMatcher.group(1);
        }
        return meta;
    }

    /**
     * Writes content to a file.
     */
    private void writeToFile(String filepath, String content) throws IOException {
        try (FileWriter writer = new FileWriter(filepath)) {
            writer.write(content);
        }
    }

    /**
     * Serializes metadata to JSON.
     */
    private String serializeMetadata(PageMetadata meta) {
        return String.format(
            "{\"url\":\"%s\",\"title\":\"%s\",\"crawlTime\":%d,\"contentLength\":%d}",
            meta.url, meta.title, meta.crawlTime, meta.contentLength
        );
    }

    static class PageMetadata {
        String url;
        String title;
        String description;
        long crawlTime;
        int contentLength;
    }
}
Trade-offs and Optimizations
BFS vs DFS Crawling:
BFS finds important pages faster because it crawls shallow pages first; DFS keeps a smaller frontier in memory but can get stuck in deep paths.
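As a toy illustration (separate from the priority-based frontier above), the difference comes down to which end of the work queue new URLs are taken from:
import java.util.ArrayDeque;
import java.util.Deque;

// Toy sketch: the same Deque yields BFS or DFS depending on where URLs are
// taken from. The URL Frontier above replaces this with a priority queue so
// ordering is driven by score rather than pure BFS/DFS.
public class TraversalOrder {
    public static String nextURL(Deque<String> frontier, boolean bfs) {
        // BFS: take from the head (oldest, shallowest first).
        // DFS: take from the tail (most recently discovered first).
        return bfs ? frontier.pollFirst() : frontier.pollLast();
    }
}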
Centralized vs Distributed:
A centralized crawler is simpler but limited in scale; a distributed crawler can reach billions of pages but needs coordination infrastructure (Kafka, Redis).
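One way the coordination piece might look, sketched under the assumption that Kafka serves as the shared frontier (the topic name "crawl-urls" and the broker address are illustrative, not part of the design above):
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch: each worker publishes newly discovered URLs to a shared Kafka topic
// instead of a local in-memory queue.
public class DistributedFrontierPublisher {
    private final KafkaProducer<String, String> producer;

    public DistributedFrontierPublisher(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
    }

    public void publish(String url) {
        String host = java.net.URI.create(url).getHost();
        // Keying by host keeps a host's URLs in one partition, so a single
        // consumer can enforce per-host politeness.
        producer.send(new ProducerRecord<>("crawl-urls", host, url));
    }
}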
Politeness vs Speed:
Aggressive crawling (high speed) may get your IPs banned; polite crawling (per-host delays) is slower but keeps servers cooperative.
Storage Format:
Storing raw HTML is flexible but large; storing only extracted text is smaller but loses structure. Either way, consider compression (see the sketch below).
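A minimal compression sketch using the JDK's GZIPOutputStream; exact ratios vary by page, but HTML typically shrinks several-fold, which matters at the multi-petabyte scale estimated earlier:
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Sketch: compress page HTML before writing it to blob storage.
public class HtmlCompressor {
    public static byte[] gzip(String html) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(html.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }
}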
Challenges & Solutions
Challenge: Duplicate Content
Solution: Use content fingerprinting (SimHash) to detect near-duplicates, not just URL dedup.
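A minimal SimHash sketch (whitespace tokenization and an FNV-style hash for brevity; production systems weight tokens and use stronger hashing):
// Token hashes vote on each of 64 bit positions; near-duplicate pages end up
// with fingerprints that differ in only a few bits (small Hamming distance).
public class SimHash {
    public static long fingerprint(String text) {
        int[] votes = new int[64];
        for (String token : text.toLowerCase().split("\\s+")) {
            long h = hash64(token);
            for (int i = 0; i < 64; i++) {
                votes[i] += ((h >>> i) & 1L) == 1L ? 1 : -1;
            }
        }
        long fp = 0L;
        for (int i = 0; i < 64; i++) {
            if (votes[i] > 0) fp |= (1L << i);
        }
        return fp;
    }

    public static boolean nearDuplicate(long a, long b, int maxDifferingBits) {
        return Long.bitCount(a ^ b) <= maxDifferingBits; // e.g. 3
    }

    // FNV-1a style 64-bit hash; any well-mixed hash works here.
    private static long hash64(String s) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }
}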
Challenge: Dynamic Content (JavaScript)
Solution: Use headless browsers (Selenium, Puppeteer) for JavaScript-heavy sites. Expensive but necessary.
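A sketch of what a rendering fetch might look like, assuming Selenium and ChromeDriver are available; treat it as a slow path reserved for JavaScript-heavy sites:
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

// Sketch: render the page in headless Chrome so JavaScript-generated content
// ends up in the HTML we parse. Much heavier than a plain HTTP fetch.
public class RenderingFetcher {
    public static String fetchRendered(String url) {
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless=new"); // Chrome's headless mode
        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get(url);
            return driver.getPageSource(); // DOM after scripts have run
        } finally {
            driver.quit();
        }
    }
}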
Challenge: Crawler Traps
Solution: Set max depth limit, detect URL patterns (calendar pages), use bloom filters for seen URLs.
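A sketch of the seen-URL check with a Bloom filter, assuming Guava is on the classpath (sizing numbers are illustrative): memory stays bounded, and a false positive only means an occasional page is skipped, which is usually acceptable for a crawler.
import java.nio.charset.StandardCharsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

// Sketch: ~1 billion URLs at a 1% false-positive rate needs on the order of a
// gigabyte, far less than storing every URL string.
public class SeenURLFilter {
    private final BloomFilter<CharSequence> seen =
        BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000_000L, 0.01);

    /** Returns true if the URL was definitely not seen before, and records it. */
    public synchronized boolean markIfNew(String normalizedURL) {
        if (seen.mightContain(normalizedURL)) {
            return false; // probably seen already (small chance of false positive)
        }
        seen.put(normalizedURL);
        return true;
    }
}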
Challenge: Distributed Coordination
Solution: Use consistent hashing to assign URL ranges to crawlers. ZooKeeper for coordination.
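A minimal consistent-hashing sketch (assignment by host, with virtual nodes; the class name and constants are illustrative): each crawler node owns all URLs for its hosts, which keeps politeness local, and adding or removing a node only remaps a small fraction of hosts.
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch: hash ring mapping hosts to crawler nodes.
public class CrawlerRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private static final int VIRTUAL_NODES = 100; // smooths the distribution

    public void addNode(String nodeId) {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.put(hash(nodeId + "#" + i), nodeId);
        }
    }

    /** Which crawler node is responsible for this host? */
    public String nodeForHost(String host) {
        if (ring.isEmpty()) return null;
        SortedMap<Long, String> tail = ring.tailMap(hash(host));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String key) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(key.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xff);
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}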
Challenge: Fresh Content
Solution: Prioritize news sites, use sitemaps, check Last-Modified headers, adaptive re-crawl frequency.
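A sketch of adaptive re-crawl scheduling, under the simple assumption that the interval halves when a page changed and doubles when it didn't (the bounds are illustrative):
// Pages that change are revisited sooner; pages that don't are backed off,
// bounded on both ends.
public class RecrawlScheduler {
    private static final long MIN_INTERVAL_MS = 60L * 60 * 1000;           // 1 hour
    private static final long MAX_INTERVAL_MS = 30L * 24 * 60 * 60 * 1000; // 30 days

    /** Returns the next re-crawl interval given whether the page changed. */
    public static long nextInterval(long currentIntervalMs, boolean contentChanged) {
        long next = contentChanged
            ? currentIntervalMs / 2   // changed: check back sooner
            : currentIntervalMs * 2;  // unchanged: back off
        return Math.max(MIN_INTERVAL_MS, Math.min(MAX_INTERVAL_MS, next));
    }
}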
Real-World Examples
- Googlebot: 15+ billion pages, distributed across thousands of machines, respects robots.txt
- Common Crawl: Public dataset with 250+ billion pages, runs monthly crawls using Hadoop
- Scrapy: Python framework with built-in politeness, dedup, and middleware support
- Nutch: Apache distributed crawler used with Hadoop, supports plugins for custom parsing