Crawl budget optimization for large sites

Most sites never need to think about crawl budget. The exception is large sites, where Googlebot’s attention becomes a finite resource and important pages get stuck behind faceted navigation URLs Google didn’t need in the first place.

As a rule of thumb, the threshold is roughly 100,000 URLs, or any site changing thousands of URLs per day. Below that, Googlebot’s crawling stays automatic and the discussion is academic. Above it, pages that should be discovered don’t get crawled. Updates that should propagate quickly get processed late. Commercial pages sit in the queue behind low-value parameter URLs that the system generates without anyone noticing.

This piece covers how crawl budget actually works in 2026, how to detect waste, and the work that consistently produces results on sites large enough to need it.


What crawl budget is (and what it isn’t):

Crawl budget is the practical limit on how much of a site Googlebot will fetch in a given period. Two factors set the limit:

Factor | What it measures | What affects it
Crawl capacity limit | How fast a server can respond without degrading | TTFB, response times, server errors, the host’s perceived health
Crawl demand | How much Google wants to crawl this site | Page popularity, content freshness, perceived quality, the site’s overall importance

The terms come from Google’s own documentation. Gary Illyes has confirmed in multiple Search Off the Record episodes that the budget concept applies meaningfully only to sites hitting one of the public thresholds: very large sites with over 1M URLs, medium sites with more than 10K URLs that change frequently, or any site with severe crawl efficiency problems.

What crawl budget isn’t:

  • A ranking factor by itself. Google doesn’t rank pages higher because the site has “good crawl budget management.”
  • A reason to obsess if the site is under 10,000 URLs and has a clean structure. Google handles small sites fine.
  • A fixed daily quota. The “budget” expands and contracts based on the two factors above.

The practical implication: crawl budget optimization is relevant for large sites and for sites with crawl efficiency problems. For everyone else, the time is better spent on content quality and on technical SEO fundamentals.


How to detect crawl budget problems:

Three signals indicate a crawl budget problem, in order of how easy they are to confirm.

Start with Google Search Console’s Crawl Stats report. It shows total requests per day, by file type, by response code, and by Googlebot type. The patterns to look for:

  • Total requests trending down while the site grows. Either the site is producing pages faster than Google is crawling them, or Google is reducing its interest in the site.
  • High percentage of 4xx or 5xx responses. Google reduces crawl rate when 5xx and 429 responses climb, because they signal that the server can’t handle more load; a high share of 404s means requests are being spent on URLs that no longer exist. Sector practice across log analysis tools (Oncrawl, northrule monitoring guides) treats consistent error rates above 1% as the signal to investigate first.
  • Heavy crawling of low-value URLs. Pagination URLs, parameter URLs, faceted navigation results, archive pages from 2014. If Googlebot spends most of its requests on these, less budget remains for the pages that matter.

Move to log file analysis. Log files show the actual requests Googlebot made: which URLs, which user agents (Googlebot Smartphone vs Desktop, GoogleOther, AdsBot), which response codes, when. Patterns to flag:

  • Pages not in logs at all. Either Google doesn’t know about them, or Google has decided not to crawl them.
  • Pages crawled once and never again. Common for pages Google indexed but treats as static.
  • The same URL patterns crawled hundreds of times per day. Often a sign of an infinite URL space (calendar widgets producing /events/2027-01-01, /events/2027-01-02, and so on forever).
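
As a rough illustration of the three patterns above, here is a minimal Python sketch that counts Googlebot requests per path and sorts a list of known URLs into those buckets. It assumes combined-format access logs; the sample lines, the `heavy_threshold` parameter, and the function names are hypothetical. It also trusts the user-agent string, which can be spoofed, so a real audit would verify Googlebot by reverse DNS first.

```python
import re
from collections import Counter

# Pulls the request path and user agent out of a combined-format log line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_hits(log_lines):
    """Count requests per path for lines whose user agent mentions Googlebot."""
    hits = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match and "Googlebot" in match.group("agent"):
            hits[match.group("path")] += 1
    return hits

def flag_patterns(known_paths, hits, heavy_threshold=100):
    """Sort paths into the three buckets: never seen, seen once, seen constantly."""
    return {
        "never_crawled": sorted(p for p in known_paths if hits[p] == 0),
        "crawled_once": sorted(p for p in known_paths if hits[p] == 1),
        "heavily_crawled": sorted(p for p, n in hits.items() if n >= heavy_threshold),
    }

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
sample = [
    f'66.249.66.1 - - [10/Jan/2026:10:00:00 +0000] "GET /product/red-shoes HTTP/1.1" 200 512 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [10/Jan/2026:10:00:05 +0000] "GET /events/2027-01-01 HTTP/1.1" 200 512 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [10/Jan/2026:10:00:09 +0000] "GET /events/2027-01-01 HTTP/1.1" 200 512 "-" "{GOOGLEBOT_UA}"',
    '203.0.113.7 - - [10/Jan/2026:10:00:11 +0000] "GET /product/blue-shoes HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 10.0)"',
]

hits = googlebot_hits(sample)
# The threshold is lowered to 2 only so the tiny sample triggers the third bucket.
report = flag_patterns({"/product/red-shoes", "/product/blue-shoes"}, hits, heavy_threshold=2)
```

In practice the `known_paths` set would come from the sitemap or a product export, so that "never crawled" means "important and never crawled" rather than every URL the site could generate.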

Indexation gap is the third signal. The site has, say, 50,000 product pages. Google has indexed 18,000. The remaining 32,000 are either not yet crawled, crawled but not indexed, or excluded as duplicates of another URL. The Pages report in Search Console shows the breakdown.


What wastes crawl budget:

The patterns recur across most large sites that have crawl budget problems:

  • Faceted navigation creating combinatorial URL space. A category page with 6 filter dimensions (color, size, brand, price, rating, sort) can produce hundreds of thousands of unique URLs from a few dozen products. Each URL gets crawled if it’s discoverable.
  • Internal search results pages indexed and discoverable. “/search?q=red+shoes” producing a unique URL for every query users have ever run, with internal links from related searches modules.
  • Session IDs in URLs. “/product/red-shoes?sid=abc123” producing a fresh URL for every session. Googlebot crawls each one as a separate URL.
  • Pagination producing thin pages. Page 47 of the blog archive with one or two posts, mostly chrome and navigation.
  • Tag pages and category overlap. A tag system that produces pages for every combination, many of which duplicate the main category pages.
  • Calendar archives with no end. Date-based archives that produce URLs into the far future, often via “next month” arrows that have no termination logic.
  • Print-friendly versions, AMP fallbacks, and mobile variants that should have been canonicalized or merged.
  • Soft 404s. Pages returning 200 status but containing “no results found” or “this page has moved” text. Google has to crawl them to recognize they’re soft errors.

The common thread: each of these creates URLs that Google’s crawler has to handle but that produce no value to users or to ranking.
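
To make the faceted navigation arithmetic concrete, the sketch below counts the URLs that six filters can generate on a single category page. The number of values per filter is an assumption chosen for illustration, and the model assumes each filter is either absent or set to exactly one value.

```python
from math import prod

# Hypothetical number of selectable values per filter on one category page.
filter_values = {"color": 12, "size": 8, "brand": 20, "price": 6, "rating": 5, "sort": 4}

def facet_url_count(values_per_filter):
    """Each filter is either absent or set to one of its n values, so it
    contributes (n + 1) choices; the product counts every combination,
    including the unfiltered page itself."""
    return prod(n + 1 for n in values_per_filter.values())

total = facet_url_count(filter_values)
print(total)  # 515970
```

Multi-select filters or URLs that vary by parameter order would push the count higher still, which is why the number of products on the page has almost no bearing on the size of the URL space.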


What to fix, in priority order:

The work goes in sequence. Each step depends on the previous one holding.

  1. Stop creating new low-value URLs. The faster the URL space grows, the harder cleanup becomes later. Block faceted navigation URLs that don’t have unique commercial value. Disallow search result pages in robots.txt. Eliminate session IDs from URLs by using cookies instead. Add canonical tags to print versions pointing at the main version. The exact technique depends on the URL pattern, but the priority is preventing new waste.
  2. Clean up the existing waste. For URLs that should never have been crawled: 410 Gone status for ones that no longer exist, robots.txt Disallow for ones that exist but shouldn’t be crawled (a Disallow stops crawling, not indexing; URLs that are already indexed need noindex first). For URLs that exist and have value but shouldn’t be in the crawl frontier: canonical tags pointing at the version that should rank.
  3. Speed up server response. TTFB matters for crawl budget because Google’s crawl capacity calculation incorporates response time. Gary Illyes pointed out in May 2025 that expensive database queries can slow a server to the point where Googlebot reduces its crawl rate. Documented cases show sites quadrupling their crawl rate (from roughly 150,000 to 600,000 URLs per day) by reducing TTFB from 800ms to 180ms, with no architecture changes. Google’s own documentation recommends keeping TTFB under 300-400ms on average. CDN caching, origin shielding, and database query optimization are the levers that typically move the number.
  4. Strengthen internal linking for important pages. Pages that Google finds through many internal links get more crawl attention. Orphan pages often go uncrawled or get demoted in the crawl queue. The fix is editorial: link to important pages from category pages, from related content, from the navigation. Sitemaps alone don’t pass crawl priority the way internal links do.
  5. Submit only the important URLs in sitemaps. A sitemap with 500,000 URLs, most of which are low-value parameter variants, dilutes the signal. A sitemap with 50,000 URLs, all high-priority commercial pages and the most important editorial content, focuses Google’s attention. Sitemap index files can segment by type (products, articles, categories) so that diagnosis is easier.
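
The sitemap segmentation in step 5 can be sketched with Python’s standard library. This is one possible layout, not a required one: the segment names, file names, and example.com URLs are placeholders, and a real generator would also split any segment that exceeds the protocol’s limit of 50,000 URLs per file.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_urlset(urls):
    """Return a <urlset> document listing the given absolute URLs."""
    root = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        ET.SubElement(ET.SubElement(root, "url"), "loc").text = url
    return ET.tostring(root, encoding="unicode")

def build_index(sitemap_urls):
    """Return a <sitemapindex> document pointing at each segment file."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in sitemap_urls:
        ET.SubElement(ET.SubElement(root, "sitemap"), "loc").text = url
    return ET.tostring(root, encoding="unicode")

# Hypothetical segments: only canonical, high-priority URLs belong here.
segments = {
    "products": ["https://example.com/product/red-shoes"],
    "articles": ["https://example.com/blog/crawl-budget"],
}

# One sitemap file per content type, plus an index that lists them all.
files = {f"sitemap-{name}.xml": build_urlset(urls) for name, urls in segments.items()}
index = build_index(f"https://example.com/{filename}" for filename in files)
```

Keeping one file per content type means the submitted-versus-indexed numbers in Search Console can be read per segment, which is where the diagnostic value comes from.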

Robots.txt patterns for blocking crawl traps:

The most common robots.txt patterns that solve crawl budget waste:

# Block faceted navigation parameter URLs
# Each parameter gets two rules so it matches whether it comes
# first (?name=) or later (&name=) in the query string.
User-agent: *
Disallow: /*?color=
Disallow: /*&color=
Disallow: /*?size=
Disallow: /*&size=
Disallow: /*?sort=
Disallow: /*&sort=
Disallow: /*?filter=
Disallow: /*&filter=

# Block internal search results
User-agent: *
Disallow: /search?
Disallow: /search/

# Block session IDs and tracking parameters
User-agent: *
Disallow: /*?sid=
Disallow: /*&sid=
Disallow: /*?session=
Disallow: /*&session=
Disallow: /*?utm_
Disallow: /*&utm_

# Block calendar and date-archive traps
User-agent: *
Disallow: /events/2027*
Disallow: /events/2028*
Disallow: /archive/

# Block print-friendly versions (use canonical instead if pages have value)
User-agent: *
Disallow: /*?print=
Disallow: /*&print=
Disallow: /*/print/

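A warning about wildcard syntax: Googlebot supports the * wildcard and the $ end-of-URL anchor in robots.txt, but support across other crawlers varies. The standalone robots.txt Tester has been retired from Search Console; use the robots.txt report there to confirm which file Google fetched and whether it parsed cleanly, and test sample URLs against the rules with a parser before deploying. After deployment, monitor Crawl Stats for 2-3 weeks to confirm crawl reallocation; if expected pages don’t see increased crawl, the rules may be over-blocking.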
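
For a quick local check before deploying, the matching rules can be approximated in a few lines of Python. This is a simplified sketch rather than Google’s parser: it handles only Disallow rules with * and a trailing $, and deliberately leaves out Allow rules and longest-match precedence. The function names and sample paths are made up for the example.

```python
import re

def rule_to_regex(rule):
    """Translate a robots.txt path rule into a regex: * matches any run of
    characters, a trailing $ anchors the end, everything else is literal."""
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))

def is_blocked(path, disallow_rules):
    """True if any Disallow rule matches from the start of the path."""
    return any(rule_to_regex(rule).match(path) for rule in disallow_rules)

rules = ["/*?color=", "/search?", "/*/print/"]
checks = {
    "/shoes?color=red": True,
    "/shoes?size=9&color=red": False,  # color is not the first parameter
    "/search?q=red+shoes": True,
    "/product/red-shoes": False,
    "/guides/sizing/print/": True,
}
for path, expected in checks.items():
    assert is_blocked(path, rules) == expected, path
```

The second check is the one worth noticing: a rule written as /*?color= only catches the parameter when it directly follows the question mark, so a URL with color in second position slips through unless a matching /*&color= rule exists as well.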


Common mistakes:

Three patterns recur in site audits.

Blocking with robots.txt when noindex was the right tool. robots.txt blocks crawling, which prevents Google from seeing the noindex tag, which means the URLs may still appear in search results based on external signals. The correct sequence runs in four steps: allow crawling, apply noindex, wait for Google to drop the URLs, then disallow in robots.txt. Reversing that order leaves the URLs visible in results without a way for Google to learn they shouldn’t be there.

Treating the symptom instead of the source. A site with thousands of soft 404s doesn’t fix the problem by manually marking each one in Search Console or adding band-aid noindex rules. The fix is the CMS configuration that’s producing soft 404s for empty category pages in the first place.
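
A minimal sketch of both halves of that fix, under stated assumptions: the first function flags soft 404s in a crawl export using the two phrases mentioned earlier, and the second shows one possible CMS policy for empty categories. Both function names are hypothetical, and whether an empty category should return 404 or stay live depends on whether it is expected to fill up again.

```python
# Phrases that suggest an error page is hiding behind a 200 status.
SOFT_404_PHRASES = ("no results found", "this page has moved")

def looks_like_soft_404(status_code, body_text):
    """Flag responses that report success in the status line but an error in the body."""
    text = body_text.lower()
    return status_code == 200 and any(phrase in text for phrase in SOFT_404_PHRASES)

def status_for_category(product_count):
    """Hypothetical CMS hook: answer 404 for an empty category instead of
    rendering an empty page with a 200 status."""
    return 404 if product_count == 0 else 200

print(looks_like_soft_404(200, "<h1>No results found</h1>"))  # True
print(status_for_category(0))  # 404
```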

Treating crawl budget as a competition with Google. The framing of “saving” or “wasting” budget produces defensive optimizations that don’t help. The framing that works: making sure Google can find and process the pages that matter, fast, without distraction from the pages that don’t.


How AI crawlers affect crawl budget calculations:

The crawl budget framing was developed when Googlebot was the only crawler that mattered for most sites. In 2026, AI crawlers (GPTBot, ClaudeBot, PerplexityBot, CCBot, Meta-ExternalAgent, Bytespider, and others) account for substantial server load on many sites. (Google-Extended is a robots.txt control token rather than a separate crawler, so it sends no requests of its own.) Cloudflare Radar’s Q1 2026 data showed AI crawlers as roughly 22% of total bot traffic across its network, with individual sources like Meta-ExternalAgent (16.7%), ClaudeBot (11.7%), GPTBot (9.8%), and Applebot (9.2%) each representing meaningful infrastructure cost.

The interaction with crawl budget is indirect but real. AI crawlers compete with Googlebot for origin server capacity. When server response time degrades because multiple AI crawlers are hitting the site simultaneously, Googlebot reduces its crawl rate in response to the same TTFB signals Gary Illyes described. The site that hasn’t blocked AI crawlers may find Googlebot crawl rates dropping not because Googlebot’s appetite changed, but because origin capacity got consumed by AI crawler traffic.

The diagnostic pattern: server logs show simultaneous high-rate AI crawler activity and reduced Googlebot rates during the same windows. The fix isn’t always blocking AI crawlers (that’s a separate strategic decision about visibility and licensing); it’s making sure origin capacity exists for all crawlers, or selectively rate-limiting at the CDN layer.
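
The diagnostic pattern can be sketched as an hourly comparison. The code below assumes the logs have already been parsed into (hour, user agent) pairs; the thresholds, sample data, and function names are illustrative, and an overlap in the output is a prompt to check response times for those hours, not proof of cause.

```python
from collections import defaultdict

# User-agent substrings for the AI crawlers named in this section.
AI_CRAWLERS = ("GPTBot", "ClaudeBot", "PerplexityBot", "CCBot", "meta-externalagent", "Bytespider")

def bot_family(user_agent):
    """Classify a user-agent string as 'googlebot', 'ai', or None."""
    agent = user_agent.lower()
    if "googlebot" in agent:
        return "googlebot"
    if any(name.lower() in agent for name in AI_CRAWLERS):
        return "ai"
    return None

def hourly_counts(requests):
    """requests: iterable of (hour_label, user_agent) pairs parsed from logs."""
    counts = defaultdict(lambda: {"googlebot": 0, "ai": 0})
    for hour, agent in requests:
        family = bot_family(agent)
        if family:
            counts[hour][family] += 1
    return dict(counts)

def contested_hours(counts, ai_min, googlebot_max):
    """Hours where AI crawler volume is high while Googlebot volume is low."""
    return sorted(
        hour for hour, c in counts.items()
        if c["ai"] >= ai_min and c["googlebot"] <= googlebot_max
    )

requests = (
    [("10:00", "Googlebot/2.1")] * 5 + [("10:00", "GPTBot/1.0")]
    + [("11:00", "ClaudeBot/1.0")] * 4 + [("11:00", "GPTBot/1.0")] * 3
    + [("11:00", "Googlebot/2.1")]
)
counts = hourly_counts(requests)
busy = contested_hours(counts, ai_min=5, googlebot_max=2)
print(busy)  # ['11:00']
```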

The strategic question this raises: AI crawlers don’t pay for the infrastructure they consume, but they may produce visibility in AI answer interfaces. Sites that allow AI crawlers should factor the infrastructure cost into the visibility tradeoff. Sites that block AI crawlers should verify that the blocks are honored at the CDN layer rather than only declared in robots.txt.


Measurement and timeline:

Crawl budget changes are visible in Search Console’s Crawl Stats within days. The compound effect on indexation takes weeks. The full effect on ranking takes 2-6 months, because the new crawl pattern has to interact with Google’s indexing decisions and quality signals.

The metrics that move when crawl budget work is succeeding:

  • Total requests per day in Crawl Stats rises (more pages crawled) or stabilizes (existing budget reallocated to important pages).
  • Percentage of requests to important URLs rises. If product pages were 12% of crawl and after work are 45%, the budget is now focused on what matters.
  • Indexation rate rises. The gap between submitted URLs and indexed URLs closes.
  • Discovery time for new pages falls. New products get indexed within hours rather than days.
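
The share-of-crawl metric in the second bullet is simple to compute once Googlebot’s requested paths have been extracted from logs. A small sketch follows, with made-up before and after samples that mirror the 12% and 45% figures; the `/product/` prefix is an assumption about how important URLs are structured.

```python
def crawl_share(crawled_paths, important_prefixes):
    """Fraction of crawl requests whose path starts with an important prefix."""
    if not crawled_paths:
        return 0.0
    important = sum(1 for path in crawled_paths if path.startswith(important_prefixes))
    return important / len(crawled_paths)

# Hypothetical samples of 100 Googlebot requests before and after the cleanup.
before = ["/product/a"] * 12 + ["/search?q=x"] * 88
after = ["/product/a"] * 45 + ["/search?q=x"] * 55

prefixes = ("/product/",)  # str.startswith accepts a tuple of prefixes
print(f"{crawl_share(before, prefixes):.0%} -> {crawl_share(after, prefixes):.0%}")  # 12% -> 45%
```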

The metrics that don’t move (and shouldn’t be expected to):

  • Direct ranking lift for individual pages. Crawl budget work makes ranking possible for pages that weren’t ranking because they weren’t indexed; it doesn’t directly improve ranking for pages that were already indexed.
  • Total traffic in the first month. The compound effect appears 2-4 months later as more pages enter the index and ranking calculations expand to include them. Sites with severe indexation gaps before the work often see the largest lift in the second quarter after it.

When crawl budget becomes operational discipline:

Crawl budget optimization isn’t a separate discipline from technical SEO. It’s the part of technical SEO that becomes urgent when a site grows past the threshold where Google’s natural crawling handles things automatically.

For small sites, the answer is to focus on content quality and on the fundamentals (canonical tags, internal linking, sitemaps). Crawl budget takes care of itself at that scale.

For large sites and for sites with active crawl efficiency problems, crawl budget is an ongoing operational discipline. The work isn’t a one-time project; new URLs get created, old ones change status, and the system needs continuous tuning. The sites that handle crawl budget well treat it the same way they treat server uptime: a continuous-improvement metric, monitored in dashboards, with someone responsible for it.

The sites that don’t handle it well let crawl waste accumulate until a major recrawl initiative becomes the only remaining option. Cleaning up at that point typically takes 2 to 4 months of compound work; preventing accumulation in the first place takes a fraction of that.

Crawl budget rewards taste. A site can have a million URLs with almost none of them mattering, or fifty thousand URLs with every one earning its place. Googlebot will spend attention where attention is repaid; the work of crawl budget optimization is mostly the work of deciding which URLs actually deserve to exist.