Recently, Anthropic’s five-stage interview process was leaked. As with many FAANG companies, the system design interview is just one of those stages. The role was for an Infra Software Engineer, and in this case the interview was all about designing a web crawler. In this post, we’ll walk through the architecture of a web crawler that can crawl 10 billion web pages, extract clean text, and do it in 5 days. That’s the scale you need to build a training dataset for a large language model.
The Goal
We need to fetch raw HTML from 10 billion pages across the public web, extract the text content, and store it for downstream training pipelines. The system needs to:
- Scale to 10B pages in 5 days — this isn’t a weekend project, it requires parallelism at every layer
- Be fault-tolerant — individual page fetches will fail; the system can’t lose track of work
- Respect politeness rules — sites publish `robots.txt` and expect crawlers to honour crawl delays; ignoring this gets you blocked
- De-duplicate aggressively — the web has enormous amounts of duplicate content; crawling the same page twice is wasted compute
The output is a corpus of extracted text files in S3, ready for preprocessing and tokenisation.
Back-of-the-Envelope Math
Before designing anything, it’s worth checking whether the goal is physically achievable — and roughly how many machines it requires.
A typical web page is around 2 MB of raw HTML. Assume available network bandwidth of ~400 Gbps:
400 Gbps ÷ 8 bits/byte = 50 GB/s raw throughput
50 GB/s ÷ 2 MB/page = 25,000 pages/s theoretical maximum
Real-world crawlers don’t hit theoretical limits. DNS resolution, TCP handshakes, server response times, and politeness delays eat into throughput. A realistic efficiency factor is around 30–40%, giving us roughly 10,000 effective pages/second.
Now check whether 10k pages/s is enough:
5 days × 86,400 s/day = 432,000 seconds
10B pages ÷ 432,000 s ≈ 23,000 pages/s needed
That means we actually need closer to 23k pages/s — roughly double our 10k estimate. Double the crawler fleet (and its aggregate bandwidth) or push efficiency toward 50%, and the numbers work. The math tells us we need a modest cluster — call it 4–8 crawler machines — not hundreds.
The point of this exercise isn’t precision. It’s to anchor the design in reality before over- or under-engineering it.
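The arithmetic above is simple enough to script as a sanity check (the bandwidth, page-size, and efficiency numbers are the assumptions stated in the text):

```python
# Back-of-the-envelope feasibility check for the 10B-page / 5-day crawl.
GBPS = 400                  # assumed aggregate network bandwidth
PAGE_MB = 2                 # typical raw HTML page size
EFFICIENCY = 0.4            # realistic fraction of theoretical throughput
PAGES = 10_000_000_000
DAYS = 5

raw_bytes_per_s = GBPS / 8 * 1e9                      # 50 GB/s
theoretical_pps = raw_bytes_per_s / (PAGE_MB * 1e6)   # 25,000 pages/s
effective_pps = theoretical_pps * EFFICIENCY          # ~10,000 pages/s
needed_pps = PAGES / (DAYS * 86_400)                  # ~23,000 pages/s

print(f"theoretical: {theoretical_pps:,.0f} pages/s")
print(f"effective:   {effective_pps:,.0f} pages/s")
print(f"needed:      {needed_pps:,.0f} pages/s")
print(f"fleet multiplier: {needed_pps / effective_pps:.1f}x")
```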
Architecture Overview
The system is a two-stage pipeline: crawling (fetching and storing raw HTML) and parsing (extracting text and discovering new URLs). Keeping these stages separate lets each scale independently — crawlers are I/O-bound (network), parsers are CPU-bound (HTML parsing, link extraction).
```mermaid
graph TD
    %% External Components
    subgraph Internet
        WebPages[Web Servers / HTML]
        DNS[DNS Providers]
    end

    %% Entry Point
    Seed[Seed URLs] --> FrontierQueue

    %% Stage 1: Fetching
    subgraph Scaling_Crawler_Cluster
        FrontierQueue[Frontier SQS Queue] -- "Pull URL" --> Crawler[Crawler Worker]
        Crawler -- "1. Resolve IP" --> DNS_Cache{Redis DNS Cache}
        DNS_Cache -- "Miss" --> DNS
        Crawler -- "2. Check Politeness" --> RateLimiter{Redis Rate Limiter}
        Crawler -- "3. Fetch HTML" --> WebPages
    end

    %% Storage & Metadata
    subgraph Storage_Layer
        S3_HTML[(S3: Raw HTML)]
        S3_Text[(S3: Extracted Text)]
        MetaDB[(DynamoDB: URL Metadata)]
    end
    Crawler -- "4. Store Raw" --> S3_HTML
    Crawler -- "5. Update Status" --> MetaDB

    %% Stage 2: Parsing
    subgraph Scaling_Parser_Cluster
        Crawler -- "6. Trigger" --> ParsingQueue[Parsing SQS Queue]
        ParsingQueue -- "Pull Pointer" --> Parser[Parsing Worker]
        Parser -- "7. Get HTML" --> S3_HTML
        Parser -- "8. De-duplicate Content" --> HashCheck{Content Hash GSI}
        subgraph Logic
            Parser -- "9. Extract Text" --> S3_Text
            Parser -- "10. Extract Links" --> Filter{Link Filter/Depth Check}
        end
    end

    %% Feedback Loop
    Filter -- "New Unique URLs" --> FrontierQueue

    %% Styling
    style FrontierQueue fill:#f96,stroke:#333
    style ParsingQueue fill:#f96,stroke:#333
    style Crawler fill:#bbf,stroke:#333
    style Parser fill:#bbf,stroke:#333
    style S3_HTML fill:#dfd,stroke:#333
    style S3_Text fill:#dfd,stroke:#333
    style MetaDB fill:#dfd,stroke:#333
```
Stage 1: Crawling
Frontier Queue (SQS)
The frontier is the queue of URLs waiting to be crawled. It starts with a seed list — known high-quality domains — and grows as parsers discover new links.
SQS is a natural fit here:
- Message visibility timeout handles in-flight failures — if a crawler dies mid-fetch, the message becomes visible again and another worker picks it up
- Exponential backoff retries are built in — transient errors (timeouts, 503s) get retried without manual intervention
- Dead Letter Queue (DLQ) catches URLs that fail repeatedly — after 5 attempts, the message moves to the DLQ for inspection rather than poisoning the queue
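The DLQ wiring reduces to a redrive policy on the frontier queue. A sketch of the queue attributes (the queue names, region, and account ID are placeholders, not real values):

```json
{
  "QueueName": "crawler-frontier",
  "Attributes": {
    "VisibilityTimeout": "300",
    "RedrivePolicy": "{\"deadLetterTargetArn\":\"arn:aws:sqs:us-east-1:123456789012:crawler-frontier-dlq\",\"maxReceiveCount\":\"5\"}"
  }
}
```

`maxReceiveCount: 5` is what moves a URL to the DLQ after five failed attempts; the visibility timeout should comfortably exceed a worst-case fetch so a slow crawl isn’t mistaken for a dead worker.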
Crawler Workers
Each worker follows a fixed sequence:
- Pull a URL from the Frontier Queue
- Resolve DNS — check Redis cache first (more on this below); fall back to a DNS provider on miss
- Check politeness — consult the domain’s `robots.txt` rules and the Redis rate limiter; back off if the domain’s crawl delay hasn’t elapsed
- Fetch HTML — issue an HTTP GET; handle redirects; respect `noindex`/`nofollow` meta tags
- Hash the content — compute a SHA-256 of the HTML; used later for de-duplication
- Store raw HTML to S3 (`s3://raw-html/{domain}/{url-hash}.html`)
- Update URL Metadata in DynamoDB — mark the URL as crawled, store the S3 pointer, hash, and timestamp
- Push an `{url, s3Link}` message to the Parsing Queue
Workers are stateless. They can be horizontally scaled behind an autoscaling group triggered by Frontier Queue depth.
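The sequence above can be sketched as a single loop iteration with every dependency injected — the queue, DNS cache, politeness check, and stores here are illustrative callables, not a real AWS client API:

```python
import hashlib
import time

def crawl_one(frontier, dns, politeness, fetch, s3_put, meta_put, parse_queue):
    """One iteration of the crawler loop. Dependencies are injected so
    the sequence stays testable without AWS or the network."""
    url = frontier.pop()
    if url is None:
        return None
    domain = url.split("/")[2]           # crude domain extraction for the sketch
    ip = dns.resolve(domain)             # 2. Redis-backed DNS cache
    if not politeness.allowed(domain):   # 3. robots.txt rules + rate limiter
        frontier.push(url)               # put it back and try again later
        return None
    html = fetch(ip, url)                # 4. HTTP GET
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()  # 5. content hash
    key = f"raw-html/{domain}/{digest}.html"
    s3_put(key, html)                    # 6. store raw HTML
    meta_put(url, s3_link=key, content_hash=digest, ts=int(time.time()))  # 7
    parse_queue.push({"url": url, "s3Link": key})  # 8. hand off to parsers
    return url
```

Keeping the worker to this shape is what makes it stateless: everything durable lives in S3, DynamoDB, or the queues.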
URL Metadata Table (DynamoDB)
Schema:
| Field | Type | Notes |
|---|---|---|
| `url` | String (PK) | Full URL |
| `s3Link` | String | S3 path to raw HTML |
| `lastCrawledTime` | Number | Unix timestamp |
| `hash` | String | SHA-256 of HTML content |
| `depth` | Number | Link depth from seed |
A GSI on `hash` enables fast content de-duplication at the parsing stage.
Stage 2: Parsing
Parsing Queue (SQS)
Crawlers write {url, s3Link} messages to a second SQS queue. This decouples the crawl rate from the parse rate — if parsing falls behind (e.g. during a CPU-intensive spike), the queue absorbs the backlog without slowing down crawlers.
Parser autoscaling is driven by queue depth: more messages → more parser instances.
Parsing Workers
Each parser:
- Pulls a message from the Parsing Queue
- Fetches the raw HTML from S3
- Checks for content duplicates — computes the hash and checks the DynamoDB GSI (or a Redis Set); if already seen, discard and move on
- Extracts clean text — strips HTML tags, normalises whitespace, removes boilerplate (nav, footer, ads); saves output to S3 (`s3://extracted-text/{domain}/{url-hash}.txt`)
- Extracts new URLs — parses `<a href>` tags; applies the depth check and de-dup; pushes new candidates to the Frontier Queue
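A stdlib-only sketch of the text-and-link extraction step, using Python’s `html.parser`. A production pipeline would use a tuned extraction library rather than this; the point is the shape of the step — collect visible text, skip boilerplate tags, gather hrefs:

```python
from html.parser import HTMLParser

class TextAndLinkExtractor(HTMLParser):
    """Collects visible text and outbound hrefs, skipping boilerplate tags."""
    SKIP = {"script", "style", "nav", "footer"}

    def __init__(self):
        super().__init__()
        self.text, self.links = [], []
        self._skip_depth = 0  # >0 while inside a SKIP tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text.append(data.strip())

def extract(html: str):
    """Return (clean_text, outbound_links) for one HTML document."""
    p = TextAndLinkExtractor()
    p.feed(html)
    return " ".join(p.text), p.links
```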
Why Two Stages?
Fetching is network-bound: most of a crawler’s time is spent waiting on TCP connections and server responses. Parsing is CPU-bound: HTML parsing, text extraction, and link filtering are compute-intensive.
Bundling both into a single worker would mean your CPU sits idle during fetches, and your network connection sits idle during parsing. Separating them lets each stage run at its natural throughput, scaled by the right resource type.
Key Deep Dives
Politeness and Rate Limiting
Crawling a site aggressively can degrade its performance for real users. Beyond being bad manners, it’s a fast way to get your IP range banned.
robots.txt caching — Each domain publishes rules at https://domain.com/robots.txt. Fetching it on every request is wasteful; cache it in a DynamoDB Domains table with a TTL (e.g. 24 hours):
| Field | Notes |
|---|---|
| `domain` (PK) | e.g. `example.com` |
| `userAgent` | Which bots the rules apply to |
| `disallow` | List of disallowed path prefixes |
| `crawlDelay` | Requested delay between requests |
| `lastFetched` | Timestamp for cache invalidation |
Redis rate limiter — Even when robots.txt doesn’t specify a crawl delay, limit to a maximum of 1 request/second per domain by default. A Redis key per domain with a 1-second TTL acts as a sliding window limiter. Cheap, fast, and effective.
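In Redis this is a single atomic `SET key 1 NX EX 1` per domain. A pure-Python stand-in with an injectable clock shows the same semantics (the in-memory dict is the assumption here, replacing the Redis key):

```python
import time

class DomainRateLimiter:
    """In-memory stand-in for a Redis `SET <domain> 1 NX EX 1` check.
    allow() returns True at most once per `delay` seconds per domain."""

    def __init__(self, delay: float = 1.0, clock=time.monotonic):
        self.delay = delay
        self.clock = clock
        self._last = {}  # domain -> timestamp of last allowed request

    def allow(self, domain: str) -> bool:
        now = self.clock()
        last = self._last.get(domain)
        if last is not None and now - last < self.delay:
            return False  # "key" still alive: back off
        self._last[domain] = now
        return True
```

A crawler that gets `False` here should requeue the URL (or a per-domain sub-queue) rather than spin-wait.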
Jitter on retries — When a batch of URLs for the same domain all fail and retry simultaneously, they create a thundering herd. Adding random jitter (e.g. ±30% of the backoff interval) spreads the load.
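Jittered backoff is one line of arithmetic; the base delay and cap below are illustrative defaults, not values from the original design:

```python
import random

def backoff_with_jitter(attempt: int, base: float = 1.0, cap: float = 300.0,
                        jitter: float = 0.3) -> float:
    """Exponential backoff (base * 2^attempt, capped) with +/-30% jitter
    so a batch of simultaneous retries against one domain spreads out."""
    delay = min(cap, base * (2 ** attempt))
    return delay * random.uniform(1.0 - jitter, 1.0 + jitter)
```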
DNS Caching
Third-party DNS resolvers impose rate limits. At 10,000 pages/second, DNS lookups become a serious bottleneck if you’re hitting a resolver cold on every request.
Cache resolved IPs in Redis with a TTL matching the record’s actual TTL (typically 5 minutes to 24 hours). On a cache miss, round-robin across multiple DNS providers (e.g. Cloudflare 1.1.1.1, Google 8.8.8.8, your own Route53 resolver) to distribute load and avoid hitting any single provider’s rate limit.
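A sketch of the cache-plus-rotation logic, with upstream resolvers injected as callables returning `(ip, ttl)` — in production the cache is Redis and the callables are real DNS clients:

```python
import itertools
import time

class DnsCache:
    """TTL cache in front of a round-robin rotation of upstream resolvers.
    Each resolver is a callable: domain -> (ip, ttl_seconds)."""

    def __init__(self, resolvers, clock=time.monotonic):
        self._upstream = itertools.cycle(resolvers)  # round-robin providers
        self._cache = {}   # domain -> (ip, expires_at)
        self._clock = clock

    def resolve(self, domain: str) -> str:
        now = self._clock()
        hit = self._cache.get(domain)
        if hit and hit[1] > now:
            return hit[0]                       # fresh cache hit
        ip, ttl = next(self._upstream)(domain)  # miss: ask the next provider
        self._cache[domain] = (ip, now + ttl)
        return ip
```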
De-duplication
The web has a staggering amount of duplicate content — mirrors, syndicated articles, pagination variants (?page=1 vs ?page=2 with identical content), and near-identical boilerplate pages.
URL de-duplication is the first line of defence. Before enqueuing a URL, check whether it already exists in the Metadata table. If lastCrawledTime is set and recent, skip it. This is a DynamoDB point read — fast and cheap at scale.
Content de-duplication catches pages with different URLs but identical content. After fetching HTML, hash it (SHA-256) and check whether that hash has been seen before. Two options:
- DynamoDB GSI on `hash` — durable, slightly slower, naturally consistent with the rest of your metadata
- Redis Set — faster lookups, but requires managing memory and persistence

Unless you’re extremely memory-constrained (tens of billions of hashes), a DynamoDB GSI is easier to operate than a Bloom filter and has zero false positives. Bloom filters are a good option if you need sub-millisecond lookups and can tolerate a small false-positive rate, but the operational complexity rarely pays off at this scale.
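Whichever backing store you pick, the check itself is the same: hash the body, test membership, record first sightings. A sketch with an in-memory set standing in for the GSI lookup:

```python
import hashlib

class ContentDeduper:
    """Stand-in for the `hash` GSI lookup: first_sight() returns True
    exactly once per distinct HTML body, False for every duplicate."""

    def __init__(self):
        self._seen = set()

    def first_sight(self, html: str) -> bool:
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if digest in self._seen:
            return False  # same content under a different URL: skip parsing
        self._seen.add(digest)
        return True
```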
Crawler Traps
Some sites generate URLs dynamically and infinitely — calendar pages that link to the next day, search result pages with arbitrary query parameters, session-encoded URLs. A crawler that follows every link will loop forever.
The fix is simple: store a depth field in the Metadata table, incremented each time a URL is discovered from a parent page. Reject any URL with depth > MAX_DEPTH (a reasonable default is 20). Most legitimate content on the web is reachable within 10 hops from a seed URL; anything deeper is usually noise.
Closing
The design reduces to a few core decisions:
- Two-stage pipeline (crawl → parse) decouples I/O-bound and CPU-bound work, letting each scale independently
- SQS queues provide fault tolerance and natural backpressure without custom retry logic
- Redis handles the hot path — DNS cache and rate limiter — where DynamoDB latency would be too slow
- DynamoDB stores durable metadata and supports both URL and content de-duplication via GSIs
- S3 absorbs both the raw HTML and extracted text at effectively unlimited scale
The math shows this is achievable with a small-ish fleet. The architecture makes it reliable at that scale.
This post was inspired by Hello Interview’s system design walkthrough on the same problem. Worth watching for a different angle on the same architecture.