Tech SEO

CDN configuration and caching rules

A CDN improves site speed by serving cached content from servers physically closer to the user. The same configuration that improves speed for users also affects how Googlebot crawls the site, what version of pages gets indexed, and how quickly content updates propagate to search results.

Most sites use a CDN by default in 2026. Cloudflare, Akamai, Fastly, AWS CloudFront, Google Cloud CDN, BunnyCDN, and several others handle a substantial share of web traffic before it reaches origin servers. The CDN choice affects performance, security, and cost. Less obviously, the configuration choices made within the CDN affect crawl efficiency, TTFB, and the indirect ranking signals tied to both.


What a CDN does and where SEO touches it:

A CDN sits between users and the origin server. Requests hit the CDN first. If the CDN has the content cached and the cache is still valid, it returns the cached response directly. If not, it forwards the request to the origin, gets the response, caches it according to the rules, and returns it to the user.
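The request flow above can be sketched as a toy model of a single edge node. This is only an illustration: `EdgeCache`, `fetch_origin`, and the injectable `clock` are hypothetical names, and real CDNs add cache keys, revalidation, and per-response TTLs on top of this.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class CacheEntry:
    body: str
    stored_at: float
    ttl_seconds: float


class EdgeCache:
    """Toy model of one edge node: serve from cache while valid, else ask origin."""

    def __init__(self, fetch_origin: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.fetch_origin = fetch_origin    # hypothetical callable standing in for origin
        self.ttl_seconds = ttl_seconds
        self.clock = clock                  # injectable so the example is testable
        self.store: Dict[str, CacheEntry] = {}

    def handle(self, url: str) -> Tuple[str, str]:
        """Return (cache_status, body) for one request."""
        now = self.clock()
        entry = self.store.get(url)
        if entry is not None and now - entry.stored_at < entry.ttl_seconds:
            return "HIT", entry.body        # cached and still valid
        body = self.fetch_origin(url)       # missing or expired: forward to origin
        self.store[url] = CacheEntry(body, now, self.ttl_seconds)
        return "MISS", body
```

Everything in the rest of this section (TTLs, purges, shielding) changes either when the `HIT` branch is taken or what happens on the `MISS` branch.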

The relevant points for SEO:

  • Caching rules: what gets cached, how long, and when refreshes happen
  • TTFB and response time: crawl capacity (faster sites get crawled more)
  • Cache headers: what Googlebot's renderer caches between fetches
  • Edge geographic distribution: how fast different regions experience the site
  • Origin shielding: how resilient the site is when traffic spikes hit
  • Bot management: whether Googlebot gets blocked or rate-limited unintentionally
  • Edge logic: whether redirects, headers, and rewrites happen at edge or origin

A CDN configured correctly accelerates everything Googlebot does. A CDN misconfigured can break indexation in ways that are hard to diagnose because the origin still serves correct responses.


Caching rules and what they control:

Caching rules determine what the CDN stores and for how long. The decisions for each content type produce different SEO consequences.

For static assets (CSS, JS, images, fonts):

  • Long TTLs (1 year, immutable) work well because the assets are versioned with hashes in their URLs. Changes produce new URLs; the old cache entries become irrelevant.
  • Cache-Control: public, max-age=31536000, immutable is the standard pattern.
  • Googlebot benefits because it doesn’t refetch unchanged assets repeatedly during rendering.

For HTML pages:

  • Short TTLs (5-60 minutes) work when content changes regularly. Long enough to absorb traffic spikes, short enough that updates propagate quickly.
  • Long TTLs (hours to days) work for genuinely static content (terms of service, about pages, evergreen reference content).
  • No caching works for personalized content (logged-in dashboards, account pages, anything tailored per user).

The pattern that breaks SEO: caching HTML pages for too long. A product page with a 24-hour TTL means price changes, inventory updates, and content fixes take up to 24 hours to reach Googlebot’s view. The site gets crawled, but the crawled version is stale.
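One way to express these per-content-type decisions is a small function that maps a request path to a Cache-Control value. The sketch below is an assumption-heavy illustration: the path prefixes, the evergreen paths, and the specific TTLs are hypothetical and would need to match the real site.

```python
from pathlib import PurePosixPath

STATIC_EXTENSIONS = {".css", ".js", ".png", ".jpg", ".jpeg", ".webp", ".svg", ".woff2"}
PERSONALIZED_PREFIXES = ("/account", "/dashboard")  # hypothetical paths
EVERGREEN_PATHS = {"/terms", "/about"}               # hypothetical paths


def cache_control_for(path: str) -> str:
    """Pick a Cache-Control value by content type, following the rules above."""
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in STATIC_EXTENSIONS:
        # Only safe when asset URLs contain a content hash.
        return "public, max-age=31536000, immutable"
    if path.startswith(PERSONALIZED_PREFIXES):
        return "private, no-store"            # never shared between users
    if path in EVERGREEN_PATHS:
        return "public, max-age=86400"        # genuinely static HTML
    return "public, max-age=300"              # default HTML: 5 minutes
```

With these assumptions, a product page such as `/products/widget` gets the five-minute default, which keeps price and inventory changes from lingering in the cache for a day.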

For API responses:

  • Short TTLs suit most cases. The data behind content APIs changes often, and stale data causes user-facing problems.
  • Specific cache keys can include query parameters or headers, but the more variation in the key, the lower the cache hit rate.
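The cache-key point can be made concrete with a normalization step. The sketch below assumes a hypothetical list of tracking parameters that never change the response; it drops them and sorts the rest, so equivalent URLs share one cache entry.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Assumed list of parameters that do not change the response.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"}


def cache_key(url: str) -> str:
    """Build a cache key that ignores tracking parameters and parameter order."""
    parts = urlsplit(url)
    kept = sorted(
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in TRACKING_PARAMS
    )
    query = urlencode(kept)
    return parts.netloc.lower() + parts.path + ("?" + query if query else "")
```

Without normalization, every campaign link would create its own cache entry for the same page, and the hit rate would fall accordingly.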

TTFB and what affects it:

Time to First Byte (TTFB) measures the time between a client sending a request and receiving the first byte of the response. A CDN can affect TTFB substantially in both directions.

When a CDN reduces TTFB:

  • Geographic proximity means the response travels a shorter distance. A user in Singapore hitting a Singapore edge gets much lower TTFB than the same user hitting an origin in Virginia.
  • Caching hits mean the CDN responds directly without contacting origin. Cache hits at major CDN edge nodes typically respond in tens of milliseconds, often under 50ms when the edge is geographically close to the user.
  • TLS termination at edge offloads the TLS handshake from origin, saving 100-300ms on first-time connections.

When a CDN increases TTFB:

  • Cache misses add CDN-to-origin latency on top of origin processing time. The cache miss response can be slower than direct-to-origin would have been.
  • Origin shield indirection adds a hop if configured poorly (request hits edge, edge hits shield, shield hits origin).
  • Edge functions or middleware that run on every request add execution time, especially if they make external API calls.
  • TLS negotiation problems at the edge can add seconds in pathological cases.

For SEO, low and consistent TTFB matters because:

  • Googlebot’s crawl capacity calculation incorporates response time. Fast sites get crawled more.
  • Core Web Vitals (specifically LCP) depend on TTFB as the first component. Slow TTFB directly hurts LCP scores.

Measuring TTFB by region (using synthetic monitoring or RUM data) reveals whether the CDN is delivering on its promise. If TTFB is 200ms in the home region but 2 seconds in Asia, the edge distribution isn’t reaching that geography.
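A regional comparison like this can be automated once samples exist. The sketch below assumes TTFB samples (in milliseconds) have already been collected per region by a monitoring tool; the region names and the 3x threshold are illustrative choices, not a standard.

```python
from statistics import median
from typing import Dict, List


def slow_regions(samples_ms: Dict[str, List[float]], home_region: str,
                 max_ratio: float = 3.0) -> Dict[str, float]:
    """Return regions whose median TTFB exceeds max_ratio times the home median."""
    baseline = median(samples_ms[home_region])
    flagged: Dict[str, float] = {}
    for region, values in samples_ms.items():
        region_median = median(values)
        if region != home_region and region_median > max_ratio * baseline:
            flagged[region] = region_median
    return flagged


samples = {
    "us-east": [180, 200, 220],
    "eu-west": [300, 350, 400],
    "ap-southeast": [1900, 2000, 2100],
}
print(slow_regions(samples, home_region="us-east"))
```

Medians are used here because a few slow cache misses would otherwise dominate an average; percentile-based checks are a reasonable alternative.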


Cache headers and Googlebot:

The HTTP cache headers tell intermediaries (CDN, browsers, Googlebot’s renderer) what to cache and for how long. The relevant headers:

  • Cache-Control: the modern standard for cache directives. Public/private, max-age, immutable, no-cache, no-store, must-revalidate.
  • ETag: an opaque token representing the response. Allows conditional requests (If-None-Match) to revalidate cheaply.
  • Last-Modified: a timestamp for the response. Allows conditional requests (If-Modified-Since) similarly.
  • Vary: declares which request headers affect the response. Critical for cache correctness when serving different content per user agent or language.
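The ETag mechanism can be sketched as a server-side check. This is a simplified illustration: the hash-based `make_etag` is one possible way to derive a token, and weak validators (`W/"..."`) and the `*` wildcard are deliberately not handled.

```python
import hashlib
from typing import Dict, Optional, Tuple


def make_etag(body: bytes) -> str:
    """Derive a strong ETag from the response body (one possible scheme)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(body: bytes, if_none_match: Optional[str]) -> Tuple[int, Dict[str, str], bytes]:
    """Return 304 with no body when the client's ETag still matches, else 200."""
    etag = make_etag(body)
    headers = {"ETag": etag, "Cache-Control": "public, max-age=300"}
    if if_none_match is not None:
        client_tags = [tag.strip() for tag in if_none_match.split(",")]
        if etag in client_tags:
            return 304, headers, b""     # revalidated cheaply, nothing resent
    return 200, headers, body
```

A client that stored the ETag from a first `200` response sends it back in If-None-Match; while the body is unchanged, the answer is a `304` with no payload.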

What Googlebot’s renderer does with these headers:

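  • The renderer maintains a cache for resources fetched during rendering. Google’s documentation notes that this cache is aggressive and may ignore cache headers, so headers influence rather than strictly control how long resources stay in it.
  • Long-lived caching with fingerprinted (hashed) filenames helps Googlebot render faster on repeat visits to similar pages, and a changed filename ensures the renderer picks up the new asset instead of an outdated copy.
  • No-cache or no-store on JS/CSS invites a refetch on every render where the directive is honored, slowing rendering and consuming more capacity.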

The pattern that works: long TTLs with proper versioning (hashed filenames) for static assets, short TTLs for HTML, conditional requests via ETag/Last-Modified for resources that might change.

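The pattern that breaks: no caching headers at all (which leaves each cache to apply its own defaults or heuristics, so behavior varies by CDN and client) or contradictory headers (Cache-Control says max-age=3600 but Expires says 0; the HTTP caching specification gives max-age precedence, but the mismatch usually signals a misconfigured layer).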
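Both failure modes are easy to check for mechanically. The following is a minimal header linter, offered as a sketch: the function names and warning texts are made up for illustration, and it only covers the cases named above plus max-age combined with no-store.

```python
import re
from email.utils import parsedate_to_datetime
from typing import Dict, List, Optional


def _already_expired(expires: str, date: Optional[str]) -> bool:
    """True if Expires is invalid (such as "0") or not later than Date."""
    try:
        expires_at = parsedate_to_datetime(expires)
    except (TypeError, ValueError):
        return True   # invalid dates are treated as already expired
    if date is None:
        return False  # nothing to compare against in this sketch
    try:
        return expires_at <= parsedate_to_datetime(date)
    except (TypeError, ValueError):
        return False


def lint_cache_headers(headers: Dict[str, str]) -> List[str]:
    """Return warnings about missing or contradictory cache headers."""
    h = {name.lower(): value for name, value in headers.items()}
    cache_control = h.get("cache-control", "")
    directives = {d.strip().lower() for d in cache_control.split(",") if d.strip()}
    match = re.search(r"max-age=(\d+)", cache_control, re.IGNORECASE)
    max_age = int(match.group(1)) if match else None
    problems: List[str] = []

    if not cache_control and "expires" not in h:
        problems.append("no Cache-Control or Expires: caches apply their own defaults")
    if max_age and "no-store" in directives:
        problems.append("max-age is set together with no-store")
    if max_age and "expires" in h and _already_expired(h["expires"], h.get("date")):
        problems.append("Cache-Control allows caching but Expires is already in the past")
    return problems
```

Running a check like this against responses fetched through the CDN, not only against origin, catches headers that an intermediate layer added or rewrote.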


Origin shielding:

Origin shielding is a CDN feature where one designated edge serves as the only CDN node that contacts the origin. Other edges fetch from the shield instead of from origin directly.

Why it matters for SEO:

  • Origin protection during traffic spikes. A viral page or a crawler surge gets absorbed by the CDN/shield layer instead of hitting origin. The site stays up; crawling continues.
  • Cache fill efficiency. New edges joining the rotation can warm up from the shield instead of pulling from origin, reducing origin load.
  • More consistent TTFB. With shielding, the worst-case TTFB (cache miss going to origin) is bounded by the shield’s location, not the user’s distance from origin.

Misconfigured shielding adds latency without benefits (shield in a slow geography, shield not actually configured, multiple shields causing routing issues). The configuration usually works out of the box for major CDNs but should be verified rather than assumed.
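The effect on origin load can be illustrated with a toy simulation. It ignores TTLs, geography, and request coalescing, and the `Tier` class is a hypothetical stand-in for a cache layer; the only point is how many requests reach origin when every edge starts cold.

```python
from typing import Callable, Dict


class Tier:
    """A cache layer that asks its upstream only on a miss (no TTLs in this sketch)."""

    def __init__(self, upstream: Callable[[str], str]) -> None:
        self.upstream = upstream
        self.store: Dict[str, str] = {}

    def get(self, url: str) -> str:
        if url not in self.store:
            self.store[url] = self.upstream(url)
        return self.store[url]


def count_origin_fetches(edge_count: int, use_shield: bool) -> int:
    """Count origin hits when every cold edge requests the same URL once."""
    fetches = 0

    def origin(url: str) -> str:
        nonlocal fetches
        fetches += 1
        return "body of " + url

    # With shielding, all edges share one upstream cache in front of origin.
    upstream = Tier(origin).get if use_shield else origin
    edges = [Tier(upstream) for _ in range(edge_count)]
    for edge in edges:
        edge.get("/product/42")
    return fetches


print(count_origin_fetches(50, use_shield=False))  # one origin fetch per edge
print(count_origin_fetches(50, use_shield=True))   # a single origin fetch
```

The trade is one extra hop on a miss in exchange for an origin that sees one request instead of one per edge.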


Edge geographic distribution:

Not all CDNs have equal geographic reach. The major providers (Cloudflare, Akamai, Fastly, CloudFront) have hundreds of edges globally; smaller providers may have a few dozen in specific regions.

For SEO, the geographic distribution matters when:

  • The target audience spans multiple continents. A US-based site with significant European traffic needs European edges to deliver acceptable TTFB to those users.
  • Googlebot crawls from multiple regions. Most Googlebot traffic originates from US IPs, but Google does crawl from other regions for some purposes (mobile testing, regional content verification).
  • The site competes in markets where competitors have local hosting. A site served from a Singapore edge competes more effectively against Asian competitors than the same site served from the US.

The configuration question is whether the CDN’s edge map matches the audience’s geography. A site that’s mainly serving European users via a CDN with strong North American presence and thin European coverage will underperform.


Bot management and Googlebot:

Modern CDNs include bot management features: blocking known bad bots, rate-limiting suspicious traffic, challenging visitors with CAPTCHAs or similar friction. These features can inadvertently block Googlebot.

The patterns that cause problems:

  • Aggressive bot challenges (Cloudflare’s “I’m Under Attack Mode,” similar features) can block Googlebot if it doesn’t pass the challenge. Googlebot doesn’t solve CAPTCHAs.
  • Rate limiting that doesn’t exempt Googlebot. Hitting the same path many times in a short window can trigger rate limits; Googlebot’s crawl can look like that pattern to the rate limiter.
  • User-agent allowlists that don’t include all of Google’s crawler variants. Googlebot’s smartphone and desktop variants, GoogleOther, and AdsBot send distinct user-agent strings and all need to be allowed. Google-Extended is a robots.txt control token, not a separate user agent in HTTP requests, so it has no string to allow at the edge.
  • IP-based blocks that aren’t checked against Google’s published IP ranges. Broad IP blocks can include Googlebot IPs.

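The configuration that works: verify Googlebot with the method Google documents (a reverse DNS lookup on the requesting IP, a check that the hostname belongs to a Google crawler domain, and a forward DNS lookup confirming that the hostname resolves back to the same IP; matching against Google’s published crawler IP ranges is the documented alternative), allow verified Googlebot regardless of other rules, and monitor logs for accidental blocks.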
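DNS-based verification is a reverse lookup followed by a forward confirmation. The sketch below uses Python's standard `socket` module; the accepted hostname suffixes are an assumption based on Google's crawler documentation and should be checked against the current version, and `gethostbyname_ex` only covers IPv4. The resolver functions are injectable so the logic can be exercised without network access.

```python
import socket
from typing import Callable, List

# Assumed suffixes for Googlebot hostnames; confirm against Google's documentation.
GOOGLE_CRAWLER_SUFFIXES = (".googlebot.com", ".google.com")


def _reverse(ip: str) -> str:
    return socket.gethostbyaddr(ip)[0]


def _forward(hostname: str) -> List[str]:
    return socket.gethostbyname_ex(hostname)[2]   # IPv4 addresses only


def is_verified_googlebot(ip: str,
                          reverse: Callable[[str], str] = _reverse,
                          forward: Callable[[str], List[str]] = _forward) -> bool:
    """Reverse-resolve the IP, check the domain, then confirm with a forward lookup."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
        if not hostname.endswith(GOOGLE_CRAWLER_SUFFIXES):
            return False
        return ip in forward(hostname)   # hostname must resolve back to the same IP
    except OSError:                      # covers socket.herror and socket.gaierror
        return False
```

The forward lookup matters because anyone who controls the reverse DNS for their own IP range can make it claim a Google hostname; only the forward record is under Google's control.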

Cloudflare and most other major CDNs have built-in “verified bot” categories that include Googlebot. Enabling those categories correctly is the simplest path to avoiding accidental blocks.


Edge logic and where redirects happen:

CDNs increasingly support edge functions: small bits of code that run at the edge before requests reach origin. Cloudflare Workers, Fastly Compute (formerly Compute@Edge), AWS Lambda@Edge, and Akamai EdgeWorkers all provide this capability.

For SEO, edge logic affects:

  • Where redirects happen. A 301 redirect implemented at the edge fires before origin sees the request, saving the origin hit and reducing user-perceived latency.
  • Header manipulation. Edge logic can add hreflang or canonical annotations (via the Link HTTP header) and security headers (CSP, HSTS) without origin changes.
  • A/B testing without cloaking. Edge can route different users to different page versions, but the implementation has to comply with Google’s policies (consistent treatment for Googlebot, no fundamentally different content shown).

The risk: edge logic that treats Googlebot differently from real users can produce cloaking violations. The safe pattern: edge logic that affects all users equally, or edge logic that varies by signals other than user agent.
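The safe pattern can be shown with a redirect handler whose decision depends only on the requested path. This is a platform-neutral sketch written in Python for readability; real edge runtimes have their own APIs and languages, and the redirect map below is hypothetical.

```python
from typing import Dict, Optional, Tuple

# Hypothetical redirect map; in practice this would come from configuration.
REDIRECTS = {
    "/old-pricing": "/pricing",
    "/blog/2019/cdn-guide": "/guides/cdn",
}


def edge_redirect(path: str, user_agent: str) -> Optional[Tuple[int, Dict[str, str]]]:
    """Return (status, headers) for a redirect, or None to pass the request through.

    user_agent is accepted but deliberately unused: crawlers and people
    get the same answer for the same path.
    """
    target = REDIRECTS.get(path.rstrip("/") or "/")
    if target is None:
        return None
    return 301, {"Location": target}
```

Because the user agent never enters the decision, there is no code path in which Googlebot could be shown something different from what a browser sees.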


Cache invalidation and the staleness tradeoff:

Caching rules determine when content gets refreshed; invalidation determines what happens when it needs to refresh sooner than the TTL allows. The two interact every time content changes between scheduled refreshes.

The basic tradeoff: long TTLs maximize cache hit rates and reduce origin load, but they extend the window when stale content gets served. Short TTLs keep content fresh but reduce caching benefit. Invalidation lets a site keep long TTLs while still pushing updates when they matter.

Three invalidation patterns dominate:

Purge by URL. The site tells the CDN to remove a specific cached response. The next request triggers a fresh fetch from origin. Used for content updates: a published article gets edited, an out-of-stock product becomes available, a price changes. Cloudflare, Fastly, and Akamai all support single-URL purge through API calls; deployment pipelines that update specific pages typically trigger purges as part of the release.

Purge by tag or surrogate key. The site tags related content during caching, then purges all responses with a tag in a single operation. Used when one change affects many pages: a product update should invalidate the product page, the category page, the search index, the homepage. Fastly’s surrogate keys and Cloudflare’s cache tags both implement this pattern. The pattern depends on consistent tagging at the application layer; without it, purges miss content that should refresh.
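The mechanics of tag-based purging can be modeled with a small in-memory index. This is an illustration of the idea only, not any CDN's API; `TaggedCache` and the tag names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


class TaggedCache:
    """In-memory model of purge-by-tag: each cached URL carries one or more tags."""

    def __init__(self) -> None:
        self.bodies: Dict[str, str] = {}
        self.urls_by_tag: Dict[str, Set[str]] = defaultdict(set)

    def put(self, url: str, body: str, tags: Iterable[str]) -> None:
        self.bodies[url] = body
        for tag in tags:
            self.urls_by_tag[tag].add(url)

    def purge_tag(self, tag: str) -> Set[str]:
        """Remove every cached response carrying the tag; return the URLs removed."""
        urls = self.urls_by_tag.pop(tag, set())
        return {url for url in urls if self.bodies.pop(url, None) is not None}


cache = TaggedCache()
cache.put("/p/42", "product 42", ["product-42"])
cache.put("/c/shoes", "shoe category", ["product-42", "category-shoes"])
cache.put("/p/7", "product 7", ["product-7"])
purged = cache.purge_tag("product-42")   # removes /p/42 and /c/shoes, keeps /p/7
```

The category page is purged only because the application tagged it with `product-42` when it was cached, which is the tagging consistency the pattern relies on.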

Stale-while-revalidate. The CDN serves the stale cached response immediately, then fetches a fresh version from origin in the background. The next request gets the fresh version. The user never waits; the cache stays current. The directive is Cache-Control: max-age=60, stale-while-revalidate=86400 (a 60-second cache, but stale responses are acceptable for 24 hours while revalidation happens). Modern browsers and most CDNs support this; the tradeoff is brief windows where users see content from before the most recent update.
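The two windows in that directive can be made explicit with a small state function. The state names are invented for this sketch; the arithmetic follows the directive's definition, where the stale-while-revalidate window begins once max-age has elapsed.

```python
import re
from typing import Dict


def parse_directives(cache_control: str) -> Dict[str, int]:
    """Extract numeric directives such as max-age=60 into a dict."""
    pairs = re.findall(r"([\w-]+)=(\d+)", cache_control)
    return {name.lower(): int(value) for name, value in pairs}


def freshness_state(age_seconds: float, cache_control: str) -> str:
    """Classify a cached response of the given age under the given Cache-Control."""
    directives = parse_directives(cache_control)
    max_age = directives.get("max-age", 0)
    swr = directives.get("stale-while-revalidate", 0)
    if age_seconds < max_age:
        return "fresh"                          # serve from cache, no origin contact
    if age_seconds < max_age + swr:
        return "stale-serve-and-revalidate"     # serve now, refresh in background
    return "expired-fetch-before-serving"       # too old: wait for origin


header = "max-age=60, stale-while-revalidate=86400"
print(freshness_state(30, header))
print(freshness_state(3600, header))
print(freshness_state(90000, header))
```

For a crawler, the middle state is the one to watch: a fetch that lands there receives the old HTML even though a refresh has been triggered.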

The SEO-specific concerns:

  • Googlebot caching staleness. If HTML is cached for 24 hours and a page changes, Googlebot may crawl the cached version. The index reflects the old content until Googlebot crawls again after the cache expires or gets purged. Sites with active content changes should either keep HTML TTLs short or purge on every meaningful update.
  • Sitemap caching. If the sitemap is cached for hours and new URLs get added, Googlebot may not discover them promptly. Sitemap responses should generally have short TTLs (5-15 minutes) or no caching at all.
  • Stale 404 caching. If a URL temporarily returns 404 (due to a deploy error or origin issue) and the 404 gets cached, real visitors and Googlebot continue to see 404 even after origin recovers. Configure CDNs to either not cache 4xx responses or cache them for very short durations.
  • Inconsistent cache state across regions. A purge issued from one region may take seconds to propagate globally. During that window, different users in different regions see different content. Most CDNs converge quickly (under a minute), but high-traffic sites should design for brief inconsistency.
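Several of these concerns reduce to choosing an edge TTL from the response status and the path. The sketch below encodes one such policy; every number in it is an illustrative assumption, and redirects are left uncached only to keep the example short.

```python
def edge_ttl_seconds(path: str, status: int) -> int:
    """Pick an edge TTL in seconds from status code and path (illustrative values)."""
    if status >= 500:
        return 0        # never cache origin failures
    if status >= 400:
        return 30       # brief negative caching so a bad deploy heals quickly
    if path.endswith("sitemap.xml"):
        return 300      # 5 minutes, so new URLs are discovered promptly
    if 200 <= status < 300:
        return 600      # ordinary HTML: 10 minutes, paired with purge on update
    return 0            # anything else (including redirects) is not cached here
```

The status check comes first on purpose: a sitemap URL that briefly returns a 404 should get the short error TTL, not the sitemap TTL.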

The discipline that prevents most invalidation problems: tag content during caching, automate purges from the CMS or deployment pipeline, monitor cache hit rates and TTFB by region to catch propagation issues, and treat the cache as a system state that needs explicit management rather than a passive optimization.


Common CDN-related SEO problems:

The recurring patterns:

  • HTML caching too long. Updates take hours to propagate; Googlebot sees stale versions.
  • Cache rules that vary by cookie or session. Cache hit rate drops to near zero; CDN provides little benefit.
  • No cache headers on dynamic responses. Every Googlebot request hits origin; crawl rate is limited by origin capacity.
  • Bot management blocking Googlebot. Crawl rate drops, indexation suffers, diagnosis takes weeks.
  • Geographic edges missing in target markets. Performance complaints from users in those markets; TTFB high in regional monitoring.
  • Origin shield in wrong location. TTFB for cache misses unexpectedly high.
  • Different content served from cache vs origin. Stale caches show old content while origin shows new; Googlebot sees inconsistent versions on different fetches.

Most of these are configuration issues that the CDN dashboards expose. Periodic review of the SEO-relevant settings (every quarter or after major site changes) catches them before they compound.


Using the CDN to enforce AI crawler policy:

Robots.txt is a declaration; CDN-layer enforcement is the action. Sites with serious AI crawler policy concerns increasingly enforce blocks at the edge rather than trusting crawler compliance with declared rules.

Cloudflare’s AI Audit feature (introduced in 2024) categorizes incoming traffic by AI crawler identity and provides one-click blocking for known crawlers. The block happens at the edge, before the request reaches the origin. Fastly’s Bot Management offers similar functionality with more granular rules. Akamai Bot Manager handles enterprise-scale bot identification including AI crawler categorization. AWS WAF managed rules for bots include AI crawler signatures, configurable through standard WAF rule deployment.

The implementation patterns:

  • User-agent string matching at the edge. The simplest approach: WAF or firewall rules that block requests with user agents matching known AI crawler patterns. Effective for crawlers that identify honestly; bypassed by spoofed user agents.
  • IP range blocking. Where AI operators publish IP ranges for their crawlers (OpenAI does for GPTBot; check the current documentation for others such as ClaudeBot, PerplexityBot, and CCBot, since not every operator publishes ranges), CDN rules can block by IP. More reliable than user-agent matching because IPs are harder to spoof.
  • Reverse DNS verification at the edge. Some CDN platforms perform reverse DNS lookups at the edge, validating that the requesting IP resolves back to the declared crawler’s domain. Spoofed traffic fails this check.
  • Behavioral fingerprinting. Bot management platforms identify crawler behavior patterns even when user agents and IPs aren’t conclusive. Aggressive scraping, unusual request rates, or patterns inconsistent with declared crawler behavior all trigger blocks.

The SEO implication: CDN-layer AI crawler blocking interacts with traditional SEO. Configurations that aggressively block bots can accidentally catch Googlebot, Bingbot, or other legitimate search crawlers if rules aren’t carefully scoped. The verification pattern (allow verified search crawlers, block verified AI crawlers, challenge unverified traffic) requires more configuration than a blanket allow or deny.
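The verification pattern in that last sentence can be written as a small decision function. Everything concrete here is an assumption for illustration: the CIDR blocks are reserved documentation ranges standing in for operator-published ranges, the token lists are incomplete, and a real deployment would use the CDN's own verified-bot signals where available.

```python
import ipaddress
from typing import Dict, List

# Placeholder ranges (reserved documentation blocks), NOT real crawler ranges.
PUBLISHED_RANGES: Dict[str, List[str]] = {
    "search": ["192.0.2.0/24"],
    "ai": ["198.51.100.0/24"],
}
SEARCH_TOKENS = ("googlebot", "bingbot")
AI_TOKENS = ("gptbot", "claudebot", "perplexitybot", "ccbot")


def ip_in_ranges(ip: str, cidrs: List[str]) -> bool:
    address = ipaddress.ip_address(ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in cidrs)


def decide(user_agent: str, ip: str,
           ranges: Dict[str, List[str]] = PUBLISHED_RANGES) -> str:
    """Return 'allow', 'block', 'challenge', or 'pass' (fall through to other rules)."""
    ua = user_agent.lower()
    if any(token in ua for token in SEARCH_TOKENS):
        # Claimed search crawler: allow only when the IP verifies.
        return "allow" if ip_in_ranges(ip, ranges["search"]) else "challenge"
    if any(token in ua for token in AI_TOKENS):
        # Claimed AI crawler: block when verified, challenge when it may be spoofed.
        return "block" if ip_in_ranges(ip, ranges["ai"]) else "challenge"
    return "pass"
```

Checking the search-crawler branch first keeps an over-broad AI rule from ever catching a verified search crawler.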

The reporting matters too. CDN dashboards show which AI crawlers are blocked and at what rates, providing the visibility into enforcement that robots.txt declarations alone don’t provide. Sites that monitor this data catch policy drift (new AI crawlers appearing, existing crawlers changing IP ranges) before it accumulates into uncontrolled access.


CDN as invisible SEO infrastructure:

A CDN is infrastructure that affects SEO without being marketed as an SEO tool. The configuration choices made by performance teams, security teams, and DevOps teams all touch SEO outcomes, often without the SEO team being part of the decision.

What keeps the configuration aligned with SEO: involvement in CDN configuration reviews, treating CDN settings as part of the technical SEO baseline, and monitoring for the symptoms of CDN-related problems (TTFB drift, indexation gaps, sudden crawl rate changes) the same way other technical SEO issues get monitored.

The configuration is rarely the dramatic part of technical SEO, but it’s often where the difference between a site that performs and a site that doesn’t gets decided.