Recently, Anthropic’s five-stage interview process was leaked. As with many FAANG companies, the system design interview is just one of those stages. The role was for an Infra Software Engineer, and in this case the interview was all about designing a web crawler. In this post, we’ll walk through the architecture of a web crawler that can crawl 10 billion web pages, extract clean text, and do it in 5 days. That’s the scale you need to build a training dataset for a large language model.
The Goal
We need to fetch raw HTML from 10 billion pages across the public web, extract the text content, and store it for downstream training pipelines. The system needs to:
- Scale to 10B pages in 5 days — this isn’t a weekend project, it requires parallelism at every layer
- Be fault-tolerant — individual page fetches will fail; the system can’t lose track of work
- Respect politeness rules — sites publish `robots.txt` and expect crawlers to honour crawl delays; ignoring this gets you blocked
- De-duplicate aggressively — the web has enormous amounts of duplicate content; crawling the same page twice is wasted compute
The output is a corpus of extracted text files in S3, ready for preprocessing and tokenisation.
Back-of-the-Envelope Math
Before designing anything, it’s worth checking whether the goal is physically achievable — and roughly how many machines it requires.
A typical web page is around 2 MB of raw HTML. Assume available network bandwidth of ~400 Gbps:
400 Gbps ÷ 8 bits/byte = 50 GB/s raw throughput
50 GB/s ÷ 2 MB/page = 25,000 pages/s theoretical maximum
Real-world crawlers don’t hit theoretical limits. DNS resolution, TCP handshakes, server response times, and politeness delays eat into throughput. A realistic efficiency factor is around 30–40%, giving us roughly 10,000 effective pages/second.
Now check whether 10k pages/s is enough:
5 days × 86,400 s/day = 432,000 seconds
10B pages ÷ 432,000 s ≈ 23,000 pages/s needed
That means we actually need closer to 23k pages/s — roughly double our 10k estimate. Double the crawler fleet (and its aggregate bandwidth) or push efficiency toward 50%, and the numbers work. The math tells us we need a modest cluster — call it 4–8 crawler machines — not hundreds.
The point of this exercise isn’t precision. It’s to anchor the design in reality before over- or under-engineering it.
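The arithmetic above is simple enough to script as a sanity check (the bandwidth, page-size, and efficiency numbers are the assumptions stated in the text):

```python
# Back-of-the-envelope feasibility check for the 10B-page / 5-day crawl.
GBPS = 400                  # assumed aggregate network bandwidth
PAGE_MB = 2                 # typical raw HTML page size
EFFICIENCY = 0.4            # realistic fraction of theoretical throughput
PAGES = 10_000_000_000
DAYS = 5

raw_bytes_per_s = GBPS / 8 * 1e9                      # 50 GB/s
theoretical_pps = raw_bytes_per_s / (PAGE_MB * 1e6)   # 25,000 pages/s
effective_pps = theoretical_pps * EFFICIENCY          # ~10,000 pages/s
needed_pps = PAGES / (DAYS * 86_400)                  # ~23,000 pages/s

print(f"theoretical: {theoretical_pps:,.0f} pages/s")
print(f"effective:   {effective_pps:,.0f} pages/s")
print(f"needed:      {needed_pps:,.0f} pages/s")
print(f"fleet multiplier: {needed_pps / effective_pps:.1f}x")
```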
Architecture Overview
The system is a two-stage pipeline: crawling (fetching and storing raw HTML) and parsing (extracting text and discovering new URLs). Keeping these stages separate lets each scale independently — crawlers are I/O-bound (network), parsers are CPU-bound (HTML parsing, link extraction).
```mermaid
graph TD
    %% External Components
    subgraph Internet
        WebPages[Web Servers / HTML]
        DNS[DNS Providers]
    end

    %% Entry Point
    Seed[Seed URLs] --> FrontierQueue

    %% Stage 1: Fetching
    subgraph Scaling_Crawler_Cluster
        FrontierQueue[Frontier SQS Queue] -- "Pull URL" --> Crawler[Crawler Worker]
        Crawler -- "1. Resolve IP" --> DNS_Cache{Redis DNS Cache}
        DNS_Cache -- "Miss" --> DNS
        Crawler -- "2. Check Politeness" --> RateLimiter{Redis Rate Limiter}
        Crawler -- "3. Fetch HTML" --> WebPages
    end

    %% Storage & Metadata
    subgraph Storage_Layer
        S3_HTML[(S3: Raw HTML)]
        S3_Text[(S3: Extracted Text)]
        MetaDB[(DynamoDB: URL Metadata)]
    end
    Crawler -- "4. Store Raw" --> S3_HTML
    Crawler -- "5. Update Status" --> MetaDB

    %% Stage 2: Parsing
    subgraph Scaling_Parser_Cluster
        Crawler -- "6. Trigger" --> ParsingQueue[Parsing SQS Queue]
        ParsingQueue -- "Pull Pointer" --> Parser[Parsing Worker]
        Parser -- "7. Get HTML" --> S3_HTML
        Parser -- "8. De-duplicate Content" --> HashCheck{Content Hash GSI}
        subgraph Logic
            Parser -- "9. Extract Text" --> S3_Text
            Parser -- "10. Extract Links" --> Filter{Link Filter/Depth Check}
        end
    end

    %% Feedback Loop
    Filter -- "New Unique URLs" --> FrontierQueue

    %% Styling
    style FrontierQueue fill:#f96,stroke:#333
    style ParsingQueue fill:#f96,stroke:#333
    style Crawler fill:#bbf,stroke:#333
    style Parser fill:#bbf,stroke:#333
    style S3_HTML fill:#dfd,stroke:#333
    style S3_Text fill:#dfd,stroke:#333
    style MetaDB fill:#dfd,stroke:#333
```
Stage 1: Crawling
Frontier Queue (SQS)
The frontier is the queue of URLs waiting to be crawled. It starts with a seed list — known high-quality domains — and grows as parsers discover new links.
SQS is a natural fit here:
- Message visibility timeout handles in-flight failures — if a crawler dies mid-fetch, the message becomes visible again and another worker picks it up
- Exponential backoff retries are built in — transient errors (timeouts, 503s) get retried without manual intervention
- Dead Letter Queue (DLQ) catches URLs that fail repeatedly — after 5 attempts, the message moves to the DLQ for inspection rather than poisoning the queue
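The DLQ wiring reduces to a redrive policy on the frontier queue. A sketch of the queue attributes (the queue names, region, and account ID are placeholders, not real values):

```json
{
  "QueueName": "crawler-frontier",
  "Attributes": {
    "VisibilityTimeout": "300",
    "RedrivePolicy": "{\"deadLetterTargetArn\":\"arn:aws:sqs:us-east-1:123456789012:crawler-frontier-dlq\",\"maxReceiveCount\":\"5\"}"
  }
}
```

`maxReceiveCount: 5` is what moves a URL to the DLQ after five failed attempts; the visibility timeout should comfortably exceed a worst-case fetch so a slow crawl isn’t mistaken for a dead worker.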
Crawler Workers
Each worker follows a fixed sequence:
- Pull a URL from the Frontier Queue
- Resolve DNS — check Redis cache first (more on this below); fall back to a DNS provider on miss
- Check politeness — consult the domain’s `robots.txt` rules and the Redis rate limiter; back off if the domain’s crawl delay hasn’t elapsed
- Fetch HTML — issue an HTTP GET; handle redirects; respect `noindex`/`nofollow` meta tags
- Hash the content — compute a SHA-256 of the HTML; used later for de-duplication
- Store raw HTML to S3 (`s3://raw-html/{domain}/{url-hash}.html`)
- Update URL Metadata in DynamoDB — mark the URL as crawled, store the S3 pointer, hash, and timestamp
- Push an `{url, s3Link}` message to the Parsing Queue
Workers are stateless. They can be horizontally scaled behind an autoscaling group triggered by Frontier Queue depth.
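The sequence above can be sketched as a single loop iteration with every dependency injected — the queue, DNS cache, politeness check, and stores here are illustrative callables, not a real AWS client API:

```python
import hashlib
import time

def crawl_one(frontier, dns, politeness, fetch, s3_put, meta_put, parse_queue):
    """One iteration of the crawler loop. Dependencies are injected so
    the sequence stays testable without AWS or the network."""
    url = frontier.pop()
    if url is None:
        return None
    domain = url.split("/")[2]           # crude domain extraction for the sketch
    ip = dns.resolve(domain)             # 2. Redis-backed DNS cache
    if not politeness.allowed(domain):   # 3. robots.txt rules + rate limiter
        frontier.push(url)               # put it back and try again later
        return None
    html = fetch(ip, url)                # 4. HTTP GET
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()  # 5. content hash
    key = f"raw-html/{domain}/{digest}.html"
    s3_put(key, html)                    # 6. store raw HTML
    meta_put(url, s3_link=key, content_hash=digest, ts=int(time.time()))  # 7
    parse_queue.push({"url": url, "s3Link": key})  # 8. hand off to parsers
    return url
```

Keeping the worker to this shape is what makes it stateless: everything durable lives in S3, DynamoDB, or the queues.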
URL Metadata Table (DynamoDB)
Schema:
| Field | Type | Notes |
|---|---|---|
| `url` | String (PK) | Full URL |
| `s3Link` | String | S3 path to raw HTML |
| `lastCrawledTime` | Number | Unix timestamp |
| `hash` | String | SHA-256 of HTML content |
| `depth` | Number | Link depth from seed |
A GSI on `hash` enables fast content de-duplication at the parsing stage.
Stage 2: Parsing
Parsing Queue (SQS)
Crawlers write {url, s3Link} messages to a second SQS queue. This decouples the crawl rate from the parse rate — if parsing falls behind (e.g. during a CPU-intensive spike), the queue absorbs the backlog without slowing down crawlers.
Parser autoscaling is driven by queue depth: more messages → more parser instances.
Parsing Workers
Each parser:
- Pulls a message from the Parsing Queue
- Fetches the raw HTML from S3
- Checks for content duplicates — computes the hash and checks the DynamoDB GSI (or a Redis Set); if already seen, discard and move on
- Extracts clean text — strips HTML tags, normalises whitespace, removes boilerplate (nav, footer, ads); saves output to S3 (`s3://extracted-text/{domain}/{url-hash}.txt`)
- Extracts new URLs — parses `<a href>` tags; applies the depth check and de-dup; pushes new candidates to the Frontier Queue
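A stdlib-only sketch of the text-and-link extraction step, using Python’s `html.parser`. A production pipeline would use a tuned extraction library rather than this; the point is the shape of the step — collect visible text, skip boilerplate tags, gather hrefs:

```python
from html.parser import HTMLParser

class TextAndLinkExtractor(HTMLParser):
    """Collects visible text and outbound hrefs, skipping boilerplate tags."""
    SKIP = {"script", "style", "nav", "footer"}

    def __init__(self):
        super().__init__()
        self.text, self.links = [], []
        self._skip_depth = 0  # >0 while inside a SKIP tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text.append(data.strip())

def extract(html: str):
    """Return (clean_text, outbound_links) for one HTML document."""
    p = TextAndLinkExtractor()
    p.feed(html)
    return " ".join(p.text), p.links
```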
Why Two Stages?
Fetching is network-bound: most of a crawler’s time is spent waiting on TCP connections and server responses. Parsing is CPU-bound: HTML parsing, text extraction, and link filtering are compute-intensive.
Bundling both into a single worker would mean your CPU sits idle during fetches, and your network connection sits idle during parsing. Separating them lets each stage run at its natural throughput, scaled by the right resource type.
Key Deep Dives
Politeness and Rate Limiting
Crawling a site aggressively can degrade its performance for real users. Beyond being bad manners, it’s a fast way to get your IP range banned.
robots.txt caching — Each domain publishes rules at https://domain.com/robots.txt. Fetching it on every request is wasteful; cache it in a DynamoDB Domains table with a TTL (e.g. 24 hours):
| Field | Notes |
|---|---|
| `domain` (PK) | e.g. `example.com` |
| `userAgent` | Which bots the rules apply to |
| `disallow` | List of disallowed path prefixes |
| `crawlDelay` | Requested delay between requests |
| `lastFetched` | Timestamp for cache invalidation |
Redis rate limiter — Even when robots.txt doesn’t specify a crawl delay, limit to a maximum of 1 request/second per domain by default. A Redis key per domain with a 1-second TTL acts as a sliding window limiter. Cheap, fast, and effective.
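In Redis this is a single atomic `SET key 1 NX EX 1` per domain. A pure-Python stand-in with an injectable clock shows the same semantics (the in-memory dict is the assumption here, replacing the Redis key):

```python
import time

class DomainRateLimiter:
    """In-memory stand-in for a Redis `SET <domain> 1 NX EX 1` check.
    allow() returns True at most once per `delay` seconds per domain."""

    def __init__(self, delay: float = 1.0, clock=time.monotonic):
        self.delay = delay
        self.clock = clock
        self._last = {}  # domain -> timestamp of last allowed request

    def allow(self, domain: str) -> bool:
        now = self.clock()
        last = self._last.get(domain)
        if last is not None and now - last < self.delay:
            return False  # "key" still alive: back off
        self._last[domain] = now
        return True
```

A crawler that gets `False` here should requeue the URL (or a per-domain sub-queue) rather than spin-wait.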
Jitter on retries — When a batch of URLs for the same domain all fail and retry simultaneously, they create a thundering herd. Adding random jitter (e.g. ±30% of the backoff interval) spreads the load.
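Jittered backoff is one line of arithmetic; the base delay and cap below are illustrative defaults, not values from the original design:

```python
import random

def backoff_with_jitter(attempt: int, base: float = 1.0, cap: float = 300.0,
                        jitter: float = 0.3) -> float:
    """Exponential backoff (base * 2^attempt, capped) with +/-30% jitter
    so a batch of simultaneous retries against one domain spreads out."""
    delay = min(cap, base * (2 ** attempt))
    return delay * random.uniform(1.0 - jitter, 1.0 + jitter)
```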
DNS Caching
Third-party DNS resolvers impose rate limits. At 10,000 pages/second, DNS lookups become a serious bottleneck if you’re hitting a resolver cold on every request.
Cache resolved IPs in Redis with a TTL matching the record’s actual TTL (typically 5 minutes to 24 hours). On a cache miss, round-robin across multiple DNS providers (e.g. Cloudflare 1.1.1.1, Google 8.8.8.8, your own Route53 resolver) to distribute load and avoid hitting any single provider’s rate limit.
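A sketch of the cache-plus-rotation logic, with upstream resolvers injected as callables returning `(ip, ttl)` — in production the cache is Redis and the callables are real DNS clients:

```python
import itertools
import time

class DnsCache:
    """TTL cache in front of a round-robin rotation of upstream resolvers.
    Each resolver is a callable: domain -> (ip, ttl_seconds)."""

    def __init__(self, resolvers, clock=time.monotonic):
        self._upstream = itertools.cycle(resolvers)  # round-robin providers
        self._cache = {}   # domain -> (ip, expires_at)
        self._clock = clock

    def resolve(self, domain: str) -> str:
        now = self._clock()
        hit = self._cache.get(domain)
        if hit and hit[1] > now:
            return hit[0]                       # fresh cache hit
        ip, ttl = next(self._upstream)(domain)  # miss: ask the next provider
        self._cache[domain] = (ip, now + ttl)
        return ip
```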
De-duplication
The web has a staggering amount of duplicate content — mirrors, syndicated articles, pagination variants (?page=1 vs ?page=2 with identical content), and near-identical boilerplate pages.
URL de-duplication is the first line of defence. Before enqueuing a URL, check whether it already exists in the Metadata table. If lastCrawledTime is set and recent, skip it. This is a DynamoDB point read — fast and cheap at scale.
Content de-duplication catches pages with different URLs but identical content. After fetching HTML, hash it (SHA-256) and check whether that hash has been seen before. Two options:
- DynamoDB GSI on `hash` — durable, slightly slower, naturally consistent with the rest of your metadata
- Redis Set — faster lookups, but requires managing memory and persistence

Unless you’re extremely memory-constrained (tens of billions of hashes), a DynamoDB GSI is easier to operate than a Bloom filter and has zero false positives. Bloom filters are a good option if you need sub-millisecond lookups and can tolerate a small false-positive rate, but the operational complexity rarely pays off at this scale.
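Whichever backing store you pick, the check itself is the same: hash the body, test membership, record first sightings. A sketch with an in-memory set standing in for the GSI lookup:

```python
import hashlib

class ContentDeduper:
    """Stand-in for the `hash` GSI lookup: first_sight() returns True
    exactly once per distinct HTML body, False for every duplicate."""

    def __init__(self):
        self._seen = set()

    def first_sight(self, html: str) -> bool:
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if digest in self._seen:
            return False  # same content under a different URL: skip parsing
        self._seen.add(digest)
        return True
```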
Crawler Traps
Some sites generate URLs dynamically and infinitely — calendar pages that link to the next day, search result pages with arbitrary query parameters, session-encoded URLs. A crawler that follows every link will loop forever.
The fix is simple: store a depth field in the Metadata table, incremented each time a URL is discovered from a parent page. Reject any URL with depth > MAX_DEPTH (a reasonable default is 20). Most legitimate content on the web is reachable within 10 hops from a seed URL; anything deeper is usually noise.
Closing
The design reduces to a few core decisions:
- Two-stage pipeline (crawl → parse) decouples I/O-bound and CPU-bound work, letting each scale independently
- SQS queues provide fault tolerance and natural backpressure without custom retry logic
- Redis handles the hot path — DNS cache and rate limiter — where DynamoDB latency would be too slow
- DynamoDB stores durable metadata and supports both URL and content de-duplication via GSIs
- S3 absorbs both the raw HTML and extracted text at effectively unlimited scale
The math shows this is achievable with a small-ish fleet. The architecture makes it reliable at that scale.
This post was inspired by Hello Interview’s system design walkthrough on the same problem. Worth watching for a different angle on the same architecture.