Sitemap.xml strategy beyond the basics

A sitemap is a list of URLs the site wants Google to know about:

A sitemap is an XML file that lists the URLs on a site, optionally with metadata about each one, that the site wants search engines to discover and crawl. Google introduced its Sitemaps format in 2005, and the joint Sitemap Protocol standard with Yahoo and Microsoft followed in 2006, formalizing it as an open standard for helping search engines find content they might otherwise miss.

The basic structure is straightforward. The file lives at a known URL (commonly /sitemap.xml), it gets referenced from robots.txt or submitted directly to search engines, and it contains a list of URLs the site considers important enough to be indexed. Search engines treat the sitemap as a hint about what’s on the site, not as a definitive list. Pages not in the sitemap can still get indexed if found through links. Pages in the sitemap can still be ignored if Google’s algorithm decides they shouldn’t be indexed.

The reason to think strategically about sitemaps comes from what they actually do beneath the surface. The sitemap is one of several discovery mechanisms Google uses to find URLs. For most pages on most sites, internal links and external backlinks are the primary discovery path. The sitemap matters most for pages that aren’t well-linked internally, pages on large sites where crawl budget needs guidance, and pages where the site needs to signal “this URL exists, even if you haven’t crawled it yet.”

The mechanical work of creating a sitemap is simple. The strategic work is deciding what to put in it, how to structure it across multiple files for large sites, and how to use the metadata fields to provide useful signals to search engines.

The XML format, in the smallest useful version:

A minimal sitemap contains a list of URLs wrapped in XML. The structure looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2026-03-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/blog/url-structure</loc>
    <lastmod>2026-02-20</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>

The required element is <loc>, the URL itself. Every URL in the sitemap must be absolute, including the protocol and domain. Relative URLs aren’t valid.
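As a rough illustration of the absolute-URL rule, the sketch below builds a minimal sitemap with Python’s standard library and rejects any <loc> value that lacks a scheme or host. The function name build_sitemap and its input shape are assumptions made for this example, not part of any sitemap tool.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return sitemap XML (bytes) for a list of (url, lastmod_or_None) pairs.

    Hypothetical helper for illustration only.
    """
    # Serialize the sitemap namespace as the default (unprefixed) namespace.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url, lastmod in entries:
        parts = urlsplit(url)
        # <loc> must be absolute: both the protocol and the domain are required.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not an absolute URL: {url!r}")
        node = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(node, f"{{{SITEMAP_NS}}}loc").text = url
        if lastmod:
            ET.SubElement(node, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)


if __name__ == "__main__":
    print(build_sitemap([("https://example.com/", "2026-03-15")]).decode("utf-8"))
```

Passing a relative path such as /blog/post raises ValueError instead of silently writing an invalid entry.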

Three optional elements appear in the example: <lastmod> (last modification date), <changefreq> (how often the content changes), and <priority> (a relative importance score from 0.0 to 1.0).

Of these three, only <lastmod> matters in practice. Google’s documentation and statements from Google search representatives have made clear that <changefreq> and <priority> are largely ignored, since the metadata was widely abused (every site declaring every page as priority 1.0 with changefreq “always”). The values still validate against the schema but don’t influence Google’s behavior.

<lastmod> is useful only when it’s accurate. The value uses W3C Datetime format: a date such as 2026-03-15, optionally followed by a time and timezone offset. Google uses the field to prioritize re-crawling pages that have actually changed, so pages with a recent <lastmod> tend to get re-crawled sooner than pages with older modification dates. Sites that set <lastmod> to the current date on every sitemap regeneration, regardless of whether the page changed, eventually teach Google’s algorithm to ignore the signal.
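One way to keep <lastmod> honest is to bump the date only when a fingerprint of the page content changes. The sketch below assumes the generator can see each page’s rendered content and can persist a small state dictionary between runs. The names update_lastmod and state are hypothetical.

```python
import hashlib
from datetime import date


def update_lastmod(state, url, content, today=None):
    """Return the lastmod for url, bumping it only when content changed.

    state maps url -> {"hash": ..., "lastmod": ...} and is updated in place.
    Hypothetical helper; persisting state between runs is left to the caller.
    """
    today = today or date.today().isoformat()
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    previous = state.get(url)
    if previous and previous["hash"] == digest:
        # Unchanged content keeps its earlier modification date.
        return previous["lastmod"]
    state[url] = {"hash": digest, "lastmod": today}
    return today


if __name__ == "__main__":
    state = {}
    print(update_lastmod(state, "https://example.com/a", "v1", today="2026-02-20"))
    # Regenerating later with identical content does not move the date.
    print(update_lastmod(state, "https://example.com/a", "v1", today="2026-03-15"))
```

Hashing the main content rather than the full HTML avoids false changes from rotating sidebars or timestamps in the template, though how that content gets extracted depends on the site.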

The format supports a few additional URL types via extensions. Image sitemaps use <image:image> tags inside each URL entry to list images associated with the page. Video sitemaps use <video:video>. News sitemaps use <news:news> for Google News submissions. Each extension has its own XML namespace. These are Google extensions rather than part of the core protocol at sitemaps.org, and they are documented in Google Search Central.

The sitemap index, for sites that need multiple files:

A single sitemap file has limits. The protocol specifies a maximum of 50,000 URLs per file and a maximum file size of 50 MB uncompressed. Most large sites need multiple sitemap files to cover their full URL set.

The sitemap index file solves this. The index is itself an XML file that lists multiple sitemap files. Search engines fetch the index, then fetch each individual sitemap from the URLs the index lists.

The structure of a sitemap index:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-posts.xml</loc>
    <lastmod>2026-03-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-pages.xml</loc>
    <lastmod>2026-03-10</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
    <lastmod>2026-03-15</lastmod>
  </sitemap>
</sitemapindex>

For large sites, splitting sitemaps by content type produces more useful structure than splitting by arbitrary file size. A site with 200,000 product URLs, 5,000 blog posts, and 500 static pages might use one sitemap for posts, one for pages, and at least four for products, since 200,000 URLs exceed the 50,000-per-file limit. Each file stays under the limit, and each is easier to debug when issues arise.

The benefit shows up in Search Console reporting. Search Console reports indexing status per sitemap. With content-type-separated sitemaps, the operator can see which content type has indexing problems. A drop in indexed products without a drop in indexed posts immediately points at the products section as the problem area. A single combined sitemap masks this kind of differentiation.

Sitemap indexes themselves have the same limits: 50,000 entries and 50 MB. Sites that exceed this need multiple sitemap indexes, though almost no real-world sites approach this scale.
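The splitting logic can be sketched as grouping by content type first and then chunking each group at the 50,000-URL limit. The file naming scheme (sitemap-products-1.xml and so on) and the function names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # per-file URL limit from the protocol


def plan_sitemap_files(urls_by_type, base="https://example.com", max_urls=MAX_URLS):
    """Map sitemap file URL -> list of page URLs, split by type, then by size."""
    files = {}
    for content_type, urls in urls_by_type.items():
        chunks = [urls[i:i + max_urls] for i in range(0, len(urls), max_urls)]
        for n, chunk in enumerate(chunks, start=1):
            # Only number the files when a type needs more than one.
            suffix = "" if len(chunks) == 1 else f"-{n}"
            files[f"{base}/sitemap-{content_type}{suffix}.xml"] = chunk
    return files


def build_index(sitemap_urls):
    """Return sitemap index XML (bytes) listing the given sitemap file URLs."""
    ET.register_namespace("", SITEMAP_NS)
    root = ET.Element(f"{{{SITEMAP_NS}}}sitemapindex")
    for url in sitemap_urls:
        node = ET.SubElement(root, f"{{{SITEMAP_NS}}}sitemap")
        ET.SubElement(node, f"{{{SITEMAP_NS}}}loc").text = url
    return ET.tostring(root, encoding="UTF-8", xml_declaration=True)


if __name__ == "__main__":
    plan = plan_sitemap_files({
        "products": [f"https://example.com/p/{i}" for i in range(200_000)],
        "posts": [f"https://example.com/blog/{i}" for i in range(5_000)],
    })
    for name, urls in plan.items():
        print(name, len(urls))
```

Keeping the content type in each file name preserves the per-type breakdown in Search Console even when one type spans several files.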

What to include, what to leave out:

The sitemap should list URLs that are valuable, canonical, and indexable. Three categories of URL deserve specific attention.

URLs to include without question. The site’s primary content pages (homepage, category pages, product pages, blog posts, service pages, anything that should appear in search results) belong in the sitemap. These are the URLs the site is trying to get indexed.

URLs to include strategically. Pages that exist on the site but aren’t well-linked from internal navigation may benefit from sitemap inclusion to ensure discovery. Pages that were recently published or significantly updated also benefit, since the sitemap helps signal the change to Google.

URLs to exclude. Non-canonical URLs (variants, filtered versions, paginated component pages 2 and beyond, tracking-parameter URLs) should not appear in the sitemap. Pages with noindex directives should not appear. Pages blocked by robots.txt should not appear (Google can’t crawl them anyway). Redirected URLs should not appear (the sitemap should list the destination, not the source).

The pattern: the sitemap should list the URLs the site wants in search results. Including non-indexable URLs in the sitemap signals conflicting information to Google (“we want this URL indexed, but we also told you not to index it”) and dilutes the strategic value of the file.

A common mistake is leaving URLs in the sitemap after they should have been excluded. A site removes a product, adds noindex to a page, or sets up a redirect, but the sitemap generator doesn’t catch the change. Google then sees URLs in the sitemap that conflict with the page-level signals. The fix is regenerating the sitemap after any content change that affects indexability.
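The inclusion rules reduce to a small predicate that a generator can apply on every regeneration. The sketch below assumes the generator already knows each page’s status code, canonical target, and robots directives. The Page record and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical record of what the generator knows about one URL."""
    url: str
    canonical: str        # target of the page's rel="canonical"
    status: int           # HTTP status returned for the URL
    noindex: bool         # meta robots or X-Robots-Tag contains noindex
    robots_blocked: bool  # URL is disallowed by robots.txt


def sitemap_eligible(page):
    """True only for URLs that are live, canonical, and indexable."""
    return (
        page.status == 200                  # drops redirects and removed pages
        and page.canonical == page.url      # drops non-canonical variants
        and not page.noindex
        and not page.robots_blocked
    )


if __name__ == "__main__":
    pages = [
        Page("https://example.com/shoes", "https://example.com/shoes", 200, False, False),
        Page("https://example.com/shoes?color=red", "https://example.com/shoes", 200, False, False),
        Page("https://example.com/old-shoes", "https://example.com/old-shoes", 301, False, False),
    ]
    print([p.url for p in pages if sitemap_eligible(p)])
```

Running the predicate at generation time, rather than maintaining a manual exclusion list, means a new redirect or noindex tag drops the URL from the next sitemap automatically.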

For most CMS-driven sites, sitemap generation happens automatically. WordPress with Yoast or RankMath generates a sitemap that updates when content changes. Shopify, Wix, and similar platforms include sitemap generation in their built-in SEO features. The automated generation usually does the right thing, excluding non-canonical URLs, noindex pages, and redirected URLs. Manual review catches the cases where the automation gets it wrong.

Submission and discovery: how Google finds the sitemap:

A sitemap exists at a URL on the site, but Google needs to be told (or discover) where to find it. Three discovery mechanisms cover the common cases.

Submission through Google Search Console. The Sitemaps section of Search Console accepts the sitemap URL and queues it for processing. Submission is the most reliable discovery method since it explicitly tells Google what to fetch. After submission, the report shows processing status, error counts, and the number of URLs Google discovered through the sitemap.

Reference from robots.txt. Adding the sitemap URL to robots.txt makes it discoverable by any crawler that respects the protocol:

Sitemap: https://example.com/sitemap.xml

The Sitemap directive can appear anywhere in the robots.txt file. Multiple Sitemap entries are allowed, listing multiple sitemap files (though for sites with many sitemaps, a sitemap index file is cleaner than multiple robots.txt entries).
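To confirm which sitemaps a robots.txt file actually declares, Python’s standard urllib.robotparser module can read the Sitemap lines (site_maps() is available from Python 3.8). The robots.txt body and the helper name declared_sitemaps below are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content, invented for this sketch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/

Sitemap: https://example.com/sitemap-posts.xml
Sitemap: https://example.com/sitemap-products.xml
"""


def declared_sitemaps(robots_body):
    """Return the sitemap URLs listed in a robots.txt body (possibly empty)."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    # site_maps() returns None when no Sitemap lines are present.
    return parser.site_maps() or []


if __name__ == "__main__":
    for url in declared_sitemaps(ROBOTS_TXT):
        print(url)
```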

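Standard location convention. Some crawlers check /sitemap.xml and /sitemap_index.xml even without an explicit reference, though the protocol doesn’t require this and Google doesn’t document doing so. Placing the sitemap at one of these conventional locations helps those crawlers, but it shouldn’t be the only discovery mechanism.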

Bing Webmaster Tools provides a submission interface similar to Google’s. For sites that care about Bing visibility, submitting to both is worth doing.

The submission relationship is one-way: telling Google about the sitemap doesn’t guarantee Google will process every URL in it. Google decides which URLs to crawl based on its own priority calculations. The sitemap helps Google find URLs but doesn’t override the algorithm’s judgment about which ones merit crawling and indexing.

Sitemap as crawl prioritization signal:

Beyond discovery, the sitemap functions as a hint about which URLs the site considers important. Google uses several signals from the sitemap to prioritize crawl behavior.

The <lastmod> field directly influences re-crawl priority. URLs with recent modification dates get re-crawled sooner than URLs with older dates. For sites with frequently updated content (news sites, e-commerce with changing inventory, blogs with regular publishing), accurate <lastmod> values produce faster indexing of changes.

The presence of a URL in the sitemap is itself a signal. Google’s documentation describes sitemap inclusion as a weak canonicalization signal. When multiple URLs serve the same content, the URL appearing in the sitemap is favored as the canonical version, alongside other signals like <link rel="canonical"> and internal linking patterns.

The size and stability of the sitemap may also influence how Google treats it. A sitemap that grows steadily and predictably can build trust. A sitemap that fluctuates wildly (doubling in size one week, halving the next) signals instability that may reduce how much weight Google places on it.

The relationship between sitemap and crawl budget is direct but nuanced. The sitemap doesn’t grant additional crawl budget, but it directs the existing budget toward URLs the site considers important. For small sites, this rarely matters; Google crawls everything regardless. For large sites with tens or hundreds of thousands of URLs, the sitemap’s directing influence becomes meaningful.

A pattern worth noting: the sitemap does little for pages that have no other discovery signals. A page with no internal links pointing to it and no external backlinks may technically be in the sitemap, but Google’s algorithm often deprioritizes URLs that exist only in the sitemap with nothing else supporting them. The sitemap is one input among several, not a substitute for internal linking strategy.

Multilingual and international sitemaps:

Sites serving content in multiple languages or for multiple regions face a sitemap structure question. The hreflang annotations that signal language and region targeting can live either in HTML or in the sitemap.

For sites with many language variants, declaring hreflang in the sitemap is often cleaner than maintaining the HTML annotations across every page. The XML sitemap supports hreflang through the xhtml:link extension:

<url>
  <loc>https://example.com/page</loc>
  <xhtml:link rel="alternate" hreflang="en" href="https://example.com/page" />
  <xhtml:link rel="alternate" hreflang="fr" href="https://example.com/fr/page" />
  <xhtml:link rel="alternate" hreflang="de" href="https://example.com/de/page" />
</url>

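Each URL entry lists every language variant of the page, including the page itself, with hreflang values declaring which language each variant targets. Every variant needs its own <url> entry carrying the same set of links (the German URL’s entry also lists the English, French, and German variants). The enclosing <urlset> must declare the xhtml prefix with xmlns:xhtml="http://www.w3.org/1999/xhtml" for the file to parse.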

The advantage of sitemap-based hreflang is centralization. A single sitemap update propagates language relationships across all pages. The HTML alternative requires updating every page’s header when a new language variant is added.
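Because every variant must repeat the full set of alternates, generating the entries from a single mapping avoids missing return links. The sketch below assumes each page is described by a dictionary from hreflang code to URL. The function name build_hreflang_sitemap is hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"


def build_hreflang_sitemap(groups):
    """Return sitemap XML (str) with reciprocal hreflang links.

    groups is a list of dicts, each mapping hreflang code -> URL for one page.
    """
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("xhtml", XHTML_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for variants in groups:
        # One <url> entry per variant, each listing all variants (itself included).
        for url in variants.values():
            node = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
            ET.SubElement(node, f"{{{SITEMAP_NS}}}loc").text = url
            for lang, href in variants.items():
                ET.SubElement(
                    node,
                    f"{{{XHTML_NS}}}link",
                    {"rel": "alternate", "hreflang": lang, "href": href},
                )
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_hreflang_sitemap([{
        "en": "https://example.com/page",
        "fr": "https://example.com/fr/page",
        "de": "https://example.com/de/page",
    }]))
```

Adding a fourth language then means adding one key to the mapping, and every existing entry picks up the new alternate on the next regeneration.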

The disadvantage is that errors in sitemap hreflang can be harder to debug than HTML errors, since the relationships aren’t visible when inspecting the page directly. Google deprecated the dedicated International Targeting report in Search Console in September 2022, so hreflang errors no longer surface in a centralized report. Diagnosis now relies on tools like Screaming Frog’s hreflang report, the URL Inspection tool, and third-party hreflang validators that surface mismatches between declared variants.

For most multilingual sites, the choice between HTML and sitemap hreflang comes down to scale. Small numbers of pages with simple language structures fit better in HTML. Large catalogs with many language variants benefit from sitemap centralization. The sitemap’s role in either case is providing a centralized location for the annotations when that fits the site’s structure.

Special sitemap types beyond standard URLs:

The Sitemap Protocol supports several specialized formats for content types that benefit from structured discovery.

Image sitemaps list images embedded in pages, helping Google discover images that might be missed by standard crawling. The format uses image-specific tags inside each URL entry, with the image prefix declared on <urlset> as xmlns:image="http://www.google.com/schemas/sitemap-image/1.1". Google deprecated the optional <image:caption> and <image:title> tags in 2022, so <image:loc> is the only child tag worth emitting:

<url>
  <loc>https://example.com/blog/post</loc>
  <image:image>
    <image:loc>https://example.com/photos/featured-image.jpg</image:loc>
  </image:image>
</url>

Image sitemaps matter most for sites where images are primary content (photography sites, stock photo libraries, product catalogs with multiple images per product) or where images are loaded through JavaScript and might not be discovered by standard crawling.

Video sitemaps list video content with structured metadata. Each video entry includes the video URL, thumbnail, title, description, and other fields that help video search results.

News sitemaps are specifically for Google News submissions. They list recent news articles (typically from the past 48 hours) with publication timestamps and other news-specific metadata. Sites approved for Google News inclusion use news sitemaps to signal new article publication for immediate indexing.
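Keeping a news sitemap limited to recent articles is mostly a filtering step at generation time. The sketch below assumes each article record carries a timezone-aware published timestamp and uses the 48-hour window mentioned above. The record shape and the name recent_articles are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def recent_articles(articles, now=None, window_hours=48):
    """Return the articles published within the last window_hours."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=window_hours)
    return [a for a in articles if a["published"] >= cutoff]


if __name__ == "__main__":
    now = datetime(2026, 3, 15, 12, 0, tzinfo=timezone.utc)
    articles = [
        {"url": "https://example.com/news/a", "published": now - timedelta(hours=3)},
        {"url": "https://example.com/news/b", "published": now - timedelta(days=5)},
    ]
    for article in recent_articles(articles, now=now):
        print(article["url"])
```

Older articles drop out of the news sitemap but stay in the standard sitemap, which matches the division of labor between the two files.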

Each specialized sitemap type works alongside the standard URL sitemap rather than replacing it. A news site might have a standard sitemap covering all pages, plus a news sitemap covering recent articles, plus an image sitemap covering article photographs. All three reference different content types and serve different discovery purposes.

For most sites, the standard URL sitemap covers what’s needed. Specialized sitemaps add value only when the content type warrants the additional structure.

Seven sitemap anti-patterns:

The mistakes that turn a useful sitemap into noise are mostly the same across CMS platforms. Auditing existing sitemaps usually surfaces several of these at once.

Sitemap includes non-canonical URLs. Variants, paginated component pages, tracking-parameter URLs all appear in the sitemap alongside their canonical versions. Fix: the sitemap generator should exclude any URL that isn’t the canonical version. Audit the output for URLs with parameters or non-canonical paths.

Sitemap includes noindex pages. URLs marked noindex appear in the sitemap, signaling conflicting instructions to Google. Fix: filter the sitemap generator to exclude URLs with noindex meta tags or X-Robots-Tag headers.

Sitemap includes redirected URLs. Old URLs that redirect to new ones still appear in the sitemap. The crawler follows the redirect, but the listing is wasted. Fix: the sitemap should contain only final destination URLs, not redirect sources.

<lastmod> updated to current date for every URL on every regeneration. Every page appears to have just been modified, even when content didn’t change. Google’s algorithm eventually discounts the signal. Fix: update <lastmod> only when the page content actually changed. Most CMS-driven sites with proper sitemap generators handle this correctly.

Sitemap exceeding size limits. A single sitemap file with more than 50,000 URLs or larger than 50 MB. Fix: split into multiple sitemaps grouped by content type, linked through a sitemap index file.

Sitemap URL not declared in robots.txt or submitted to Search Console. The sitemap exists but Google has to discover it through fallback patterns. Fix: add the sitemap URL to robots.txt and submit it through Search Console.

Sitemap returns 404 or another error. The submitted URL points at a sitemap that doesn’t exist or has moved. Search Console reports the error, but it may go unnoticed. Fix: verify the sitemap URL returns 200 with valid XML content. Monitor the Sitemaps report in Search Console.

An eighth pattern worth flagging: stale sitemaps from old plugins or generators. A site changes its sitemap generator (switches from Yoast to RankMath, for example), and the old sitemap URL still exists alongside the new one. Both get submitted, with one of them growing increasingly out of date. Fix: delete the old sitemap when switching generators. Keep only the current sitemap reachable and submitted.
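Several of these checks can be automated against the sitemap file itself. The sketch below is a partial audit under simple assumptions: it flags size-limit violations, non-absolute URLs, URLs with query parameters (a common sign of non-canonical variants), and a <lastmod> that is identical on every entry. Detecting noindex or redirects would require fetching each URL, which is left out. The function name audit_sitemap is hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def audit_sitemap(xml_bytes, max_urls=50_000, max_bytes=50 * 1024 * 1024):
    """Return a list of human-readable problems found in one sitemap file."""
    problems = []
    if len(xml_bytes) > max_bytes:
        problems.append("file larger than 50 MB")
    root = ET.fromstring(xml_bytes)
    entries = root.findall("sm:url", NS)
    if len(entries) > max_urls:
        problems.append("more than 50,000 URLs")
    lastmods = []
    for entry in entries:
        loc = (entry.findtext("sm:loc", default="", namespaces=NS) or "").strip()
        parts = urlsplit(loc)
        if not parts.scheme or not parts.netloc:
            problems.append(f"not absolute: {loc}")
        if parts.query:
            problems.append(f"has query parameters: {loc}")
        lastmod = entry.findtext("sm:lastmod", namespaces=NS)
        if lastmod:
            lastmods.append(lastmod.strip())
    # Identical dates everywhere suggest lastmod is stamped at generation time.
    if len(lastmods) > 1 and len(set(lastmods)) == 1:
        problems.append("every lastmod is identical")
    return problems


if __name__ == "__main__":
    with open("sitemap.xml", "rb") as handle:  # path is an assumption
        for problem in audit_sitemap(handle.read()):
            print(problem)
```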

Working backward from what the search engine needs:

The sitemap is most easily understood not as a list of pages but as an instruction set for a system that’s trying to crawl the site efficiently. Working backward from what the search engine is trying to do makes the design choices clearer.

The search engine needs to know what URLs exist on the site. The sitemap provides the list. The benefit is direct: pages that might be missed through link-following get included explicitly.

The search engine needs to know which URLs to prioritize for crawling. The sitemap provides signals through <lastmod> (recently changed URLs are higher priority) and through inclusion itself (URLs in the sitemap are likely more important than URLs found only through link discovery).

The search engine needs to manage crawl budget across the site. The sitemap helps direct the budget toward URLs the site considers valuable. For large sites with limited budget per crawl, this directing function becomes meaningful.

The search engine needs to disambiguate between URL variants. The sitemap’s inclusion of canonical URLs provides one signal (among several) for which version of a page should rank.

Each of the design choices follows from these needs. Multiple sitemaps split by content type help the search engine see indexing status separately for each type. Accurate <lastmod> values help the search engine prioritize re-crawl correctly. Exclusion of non-canonical URLs helps the search engine avoid noise. The sitemap is doing the same job throughout: making the search engine’s work easier, in ways that benefit the site by getting content indexed faster and more accurately.

The reverse-engineered view also explains what the sitemap doesn’t do. It doesn’t grant the site additional crawl budget. It doesn’t override the algorithm’s judgment about page quality. It doesn’t substitute for internal linking that signals which pages matter. It does what it does (helps discovery and prioritization) and stops there. Sites that expect more from the sitemap (better rankings, more traffic, faster indexing for low-quality pages) end up frustrated because they’re asking for outcomes the sitemap was never designed to deliver. Sites that understand what the sitemap is actually doing (providing structured input to a crawl-prioritization system) get the benefits the design intended without expecting the ones it doesn’t provide.
