What is the difference between robots.txt and meta robots tag?

robots.txt tells crawlers where not to go:

robots.txt is a plain text file that lives at the root of a website (e.g., https://example.com/robots.txt) and tells search engine crawlers which paths on the site they’re allowed to access. It follows the Robots Exclusion Protocol, one of the oldest conventions on the web, dating back to 1994. Cooperative crawlers read this file before crawling anything else on the domain. Non-cooperative crawlers and malicious bots ignore it; the protocol is advisory, not an enforcement mechanism.

A minimal robots.txt looks like this:

User-agent: *
Disallow: /admin/
Disallow: /api/
Allow: /api/public/
Sitemap: https://example.com/sitemap.xml

The structure is direct. User-agent names which crawler the rules apply to (* means all crawlers). Disallow lists paths the crawler should skip. Allow overrides a broader Disallow for specific subpaths. Sitemap points to the site’s XML sitemap. These four cover the core syntax. Some crawlers honor additional extensions like wildcards (* for any character sequence, $ for end-of-URL), Crawl-delay, or Host, but support varies and the core four work everywhere. The whole file is plain text, and it works because cooperative crawlers honor the same protocol.
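How Allow overrides a broader Disallow can be sketched in a few lines of Python. This is a minimal illustration, not a full parser: it assumes a file with a single User-agent group, ignores wildcards, and uses the longest-match precedence described in RFC 9309 (the most specific matching rule wins, and Allow wins a tie). The names parse_robots and is_allowed are hypothetical helpers introduced for this example.

```python
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /api/
Allow: /api/public/
Sitemap: https://example.com/sitemap.xml
"""


def parse_robots(text):
    """Collect (directive, path) rules and sitemap URLs from a single-group file."""
    rules, sitemaps = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field in ("allow", "disallow") and value:
            rules.append((field, value))
        elif field == "sitemap":
            sitemaps.append(value)
    return rules, sitemaps


def is_allowed(path, rules):
    """Longest matching prefix wins; Allow wins a tie; no match means allowed."""
    best = ("allow", "")
    for directive, prefix in rules:
        if not path.startswith(prefix):
            continue
        longer = len(prefix) > len(best[1])
        tie_for_allow = len(prefix) == len(best[1]) and directive == "allow"
        if longer or tie_for_allow:
            best = (directive, prefix)
    return best[0] == "allow"


RULES, SITEMAPS = parse_robots(ROBOTS_TXT)
for path in ("/api/public/docs", "/api/orders", "/admin/users", "/blog/post"):
    print(path, is_allowed(path, RULES))
```

With the sample file, /api/public/docs is allowed because the Allow rule is more specific than Disallow: /api/, while /api/orders and /admin/users stay blocked.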

The important thing to understand is what robots.txt actually controls. It controls crawling — whether a crawler is allowed to fetch the URL at all. It does not control indexing. A URL blocked by robots.txt can still appear in search results, with no description, if Google learns about the URL from external links. The block prevents the crawl, not the listing.


Meta robots tells crawlers what to do once they’re on the page:

The meta robots tag is an HTML element placed in the <head> of an individual page. It tells crawlers how to handle that specific page after they’ve fetched it.

<head>
  <meta name="robots" content="noindex, nofollow">
</head>

That tag has two pieces. The name attribute identifies which crawlers the tag addresses: name="robots" targets all crawlers, while a specific name such as name="googlebot" targets one bot. The content attribute lists the directives, comma-separated.

The most common directives are simple. noindex keeps the page out of the search index entirely. nofollow tells the crawler not to follow any links on the page. noarchive prevents Google from showing a cached copy. nosnippet prevents description text from appearing in search results. There are more — noimageindex, unavailable_after, max-snippet — but the first four cover most actual use cases.
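To see which directives a page actually declares, the tag can be read with Python's standard html.parser module. The sketch below is an assumption-laden illustration: the class name MetaRobotsParser, the helper meta_robots, and the default list of bot names are all hypothetical, and a real crawler's parsing is more forgiving than this.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collect robots directives from <meta> tags, keyed by the name attribute."""

    def __init__(self, bot_names=("robots", "googlebot")):
        super().__init__()
        self.bot_names = bot_names
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if name not in self.bot_names:
            return
        content = attributes.get("content") or ""
        # Directives are comma-separated and case-insensitive.
        found = {item.strip().lower() for item in content.split(",") if item.strip()}
        self.directives.setdefault(name, set()).update(found)


def meta_robots(html):
    """Return a dict mapping each targeted bot name to its set of directives."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return parser.directives


PAGE = '<head>\n  <meta name="robots" content="noindex, nofollow">\n</head>'
print(meta_robots(PAGE))
```

For the sample page, the result maps "robots" to the two directives noindex and nofollow.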

A meta robots tag is read by a crawler only when the crawler actually fetches the page. That’s a small but consequential detail. If robots.txt has already blocked the page from being crawled, the meta tag never gets seen. The crawler never gets to the page. The directive doesn’t apply.


Crawl controls discovery, index controls visibility:

The cleanest way to keep robots.txt and meta robots straight is to separate two different jobs the search engine does. Crawling is when the search engine fetches a page to read its content. Indexing is when the search engine decides to store the page and consider it for search results.

robots.txt sits at the crawl layer. It either grants or denies permission to fetch a URL. Meta robots and X-Robots-Tag sit at the index layer. They either grant or deny permission to include the page in search results after fetching it.

This is why noindex doesn’t work in robots.txt. Google officially dropped support for noindex as a robots.txt directive in September 2019, after years of unofficial use. The reason follows from the layer split: noindex is an indexing instruction, and robots.txt operates before indexing happens. Putting noindex in robots.txt was always semantically wrong, even though some crawlers used to honor it informally. Bing never supported it. Google no longer supports it. Other major crawlers have followed the same direction.

The practical consequence is direct. If a page needs to stay out of search results, robots.txt alone won’t do that, and using robots.txt to block the page actually prevents Google from seeing the meta robots noindex tag that would do the job. The page is never crawled, the noindex never gets read, and the URL can still appear in search results because external links point to it. The crawl block produces the opposite of the intended result.
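The layer split can be written down as a toy decision function. This is a deliberately simplified model of the behavior described above, not a description of how any search engine is implemented; the function name and the outcome strings are made up for illustration.

```python
def search_outcome(crawl_allowed, has_noindex, has_external_links):
    """Toy model: the crawl layer is evaluated before the index layer."""
    if not crawl_allowed:
        # The page is never fetched, so any noindex on it is never read.
        if has_external_links:
            return "may be listed as a bare URL"
        return "unlikely to be listed"
    if has_noindex:
        return "kept out of the index"
    return "eligible for indexing"


# The classic mistake: blocked in robots.txt AND tagged noindex.
print(search_outcome(crawl_allowed=False, has_noindex=True, has_external_links=True))
# The working setup: crawlable, with noindex on the page.
print(search_outcome(crawl_allowed=True, has_noindex=True, has_external_links=True))
```

The first call shows why the crawl block defeats the noindex: the has_noindex argument is never consulted once crawling is denied.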


Site-level file versus page-level tag:

The most visible difference between the two tools is where they live and what they affect.

Aspect | robots.txt | Meta robots tag
------ | ---------- | ---------------
Location | Root directory (/robots.txt) | <head> of each HTML page
Scope | Entire site or specific path patterns | One page at a time
Format | Plain text file | HTML <meta> element
Layer | Crawl control | Index control
File types | Affects any URL pattern | HTML pages only (for non-HTML, use X-Robots-Tag)
Visibility | Public; anyone can read it | Visible in the page source
Granularity | Path patterns, wildcards, user-agent targeting | Per-page directives, user-agent variants
Discovery | First file a cooperative crawler reads | Read when the page is crawled

A site uses robots.txt to handle broad rules. Block an entire admin section. Disallow query parameter combinations that produce duplicate views. Point crawlers to the sitemap. The file’s strength is reach — one entry controls thousands of URLs.

A site uses meta robots to handle individual pages. A thank-you page after a form submission should be noindex’d because it has no search value. A test page should be noindex’d until ready. A staging environment needs every page noindex’d. Meta robots covers what robots.txt won’t address granularly.


Same directives, different transport:

Meta robots and robots.txt don’t share much vocabulary, because they operate on different layers. But the directives each tool supports illuminate what each tool is actually for.

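Directive | robots.txt | Meta robots / X-Robots-Tag
--------- | ---------- | --------------------------
Disallow (block path from crawling) | Yes | No
Allow (override within disallowed path) | Yes | No
Sitemap (location reference) | Yes | No
User-agent (targeting) | Yes | No (name a bot in the name attribute or header prefix instead)
noindex (drop from index) | No (Google dropped support in 2019) | Yes
nofollow (don't follow page links) | No | Yes
noarchive (no cached copy) | No | Yes
nosnippet (no description in SERP) | No | Yes
noimageindex (don't index images on page) | No | Yes
unavailable_after (expire from index) | No | Yes
max-snippet, max-image-preview, max-video-preview | No | Yes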

The split is informative. robots.txt has three real directives (Disallow, Allow, and Sitemap), plus User-agent for targeting. That’s the whole vocabulary because the file does one job (crawl control) and doesn’t need more.

Meta robots and X-Robots-Tag together support roughly twelve commonly-used directives. The richer vocabulary reflects the wider range of indexing decisions a site might want to make about a specific page.


Block at the door, not after the visit:

robots.txt is the right tool when the goal is preventing the crawl itself.

A staging environment that shouldn’t be touched by Googlebot at all. A /wp-admin/ section with no search value and significant server cost if crawled repeatedly. A /search/?q= URL pattern that generates thousands of internal search results pages. An e-commerce filter system that produces millions of parameter combinations. All of these benefit from robots.txt because the right move is “don’t visit,” not “visit and then I’ll tell you what to do.”

Crawl budget is the operational reason. Every URL Googlebot fetches on a site costs against the crawl budget Google allocates to that domain. For small sites with a few hundred pages, this is academic — Google crawls everything. For larger sites with tens of thousands or millions of URLs, what Googlebot doesn’t crawl matters as much as what it does. Sending the crawler to filter-combination URLs and internal search results burns budget that should go to product pages, articles, or category pages that need re-indexing.

Server load is the second reason. Each crawl request hits the server. A poorly configured site with crawler-accessible admin tools, dynamically generated reports, or search pages receives significant traffic from crawlers. robots.txt prevents the request before it happens.

The thing to remember is that robots.txt doesn’t hide the URL. The URL can still appear in search results if external links point to it — without a description, but visible. If hiding is the goal, robots.txt is the wrong tool by itself.


Visit, then decide whether to show:

Meta robots is the right tool when the goal is controlling what shows up in search results after the page has been crawled.

A noindex meta tag is the cleanest way to keep a page out of search results without blocking access to it. The page remains accessible to users and crawlers. The crawler reads the noindex directive on each visit. The page never appears in the index. Examples that fit this pattern: thank-you pages after form submissions, internal user dashboards reachable from public links, low-value tag archive pages on a blog, and navigation pages with no search value of their own.

For sensitive cases, the combination matters. If a page has truly private content, neither robots.txt nor meta robots is sufficient — both are advisory. Cooperative crawlers like Googlebot follow them. Malicious actors and non-cooperative bots don’t. Authentication is the right tool for actually private content. Meta robots is the right tool for “this page is fine to access but shouldn’t surface in Google.”

Meta robots also handles smaller decisions. nosnippet removes the description text from Google search results for the page. noarchive prevents Google from showing a cached version. unavailable_after lets a page drop out of the index automatically on a specified date — useful for time-limited promotions or event pages.

The granularity is meta robots’ strength. A site with thousands of pages applies different rules to each page based on what that specific page should do in search. robots.txt won’t reach that granularity without becoming unmanageable.


X-Robots-Tag: the third tool most articles skip:

A complete picture of robot controls needs a third element: the X-Robots-Tag HTTP response header. It does the same job as the meta robots tag, but it travels in the HTTP response instead of inside the HTML.

HTTP/1.1 200 OK
Date: Tue, 25 May 2010 21:42:43 GMT
X-Robots-Tag: noindex, nofollow

Any directive supported by meta robots works in X-Robots-Tag — noindex, nofollow, noarchive, the full vocabulary. The functional difference is what kinds of resources each tool controls.

Meta robots requires HTML. The tag has to sit inside a <head> element, which means it works only on HTML pages. PDFs, images, videos, JSON files, RSS feeds, plain text files — none of these have an HTML head. The meta tag won’t apply.

X-Robots-Tag has no such limitation. It rides on the HTTP response, which is sent for every kind of resource the server delivers. A PDF, an image, a video, an Excel spreadsheet — all accept X-Robots-Tag rules. The directive sits in the HTTP header that travels alongside the file itself.

A common implementation: an .htaccess directive (Apache) or an nginx.conf block (Nginx) that adds X-Robots-Tag: noindex to every PDF on the site. This keeps PDFs out of search results without requiring any change to the PDF files themselves and without listing each file individually in robots.txt.

X-Robots-Tag also enables conditional rules. A site can apply different indexing rules to different user-agents from the same server configuration: X-Robots-Tag: googlebot: noindex keeps the resource out of Google’s index while leaving Bing unaffected. Meta robots handles this too with multiple meta tags, but X-Robots-Tag concentrates the logic in the server configuration rather than in the HTML.
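The optional user-agent prefix makes the header slightly awkward to read by eye, so a small parser helps when auditing. The sketch below is a hypothetical helper, not a reference implementation: it treats the text before the first colon as a user-agent unless it is one of the directives that carry their own value, and it does not handle unavailable_after dates that contain commas.

```python
# Directives that carry their own "name: value" pair, so a colon after them
# does not indicate a user-agent prefix.
VALUE_DIRECTIVES = {
    "unavailable_after",
    "max-snippet",
    "max-image-preview",
    "max-video-preview",
}


def parse_x_robots_tag(value):
    """Split one X-Robots-Tag value into (user_agent, directives).

    user_agent is None when the rule applies to all crawlers.
    """
    head, colon, rest = value.partition(":")
    prefix = head.strip().lower()
    has_agent = bool(colon) and "," not in head and prefix not in VALUE_DIRECTIVES
    agent, body = (prefix, rest) if has_agent else (None, value)
    directives = [part.strip().lower() for part in body.split(",") if part.strip()]
    return agent, directives


print(parse_x_robots_tag("noindex, nofollow"))
print(parse_x_robots_tag("googlebot: noindex"))
```

The first call returns no user-agent and two directives; the second returns "googlebot" with the single directive noindex.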


Conflicts are usually contradictions in disguise:

When robots.txt, meta robots, and related signals are combined on the same site, four patterns of conflict show up.

The first and most common conflict: robots.txt blocks a page that has a noindex meta tag. The intent is usually “don’t show this page in search.” The result is the opposite of intent. Google won’t crawl the page, so it never reads the noindex directive. The page still appears in search results from external links, without a description, because the URL is known to Google but the content isn’t. The fix is to remove the robots.txt block for that page, allow Google to crawl it, and let the meta noindex do its job.

The second conflict: a meta robots noindex on a page Google already had indexed, with no associated change in robots.txt. This works correctly, but slowly. Google has to recrawl the page to see the new noindex directive. For high-priority pages, requesting recrawl through Google Search Console accelerates this. For lower-priority pages, the noindex takes effect on Google’s natural recrawl schedule, sometimes weeks later.

The third conflict: combining meta robots with canonical tags. If a page has both <meta name="robots" content="noindex"> and a canonical tag pointing elsewhere, the signals contradict each other. Noindex says “drop this page from the index.” Canonical says “this is a duplicate of another page, treat that other page as the authoritative version.” Google’s behavior in this case isn’t fully predictable. The cleanest approach is to pick one mechanism based on the actual goal. If the page is a real duplicate, use canonical alone. If it shouldn’t be in search at all, use noindex alone.

The fourth conflict: contradictory directives within a single meta robots tag or X-Robots-Tag. Writing index, noindex or follow, nofollow in the same tag leaves the crawler with ambiguous instructions. In the case of X-Robots-Tag, the more restrictive rule applies. In the case of meta robots, the behavior depends on the crawler. The fix is straightforward: don’t write contradictory directives.
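The "more restrictive rule applies" policy is easy to model when auditing pages. The function below is a sketch of that one policy only; as noted above, crawlers may differ in how they treat contradictory meta tags, so the result is an assumption about the outcome rather than a guarantee. The name resolve_directives is hypothetical.

```python
def resolve_directives(directives):
    """Apply a 'most restrictive wins' policy to a list of robots directives."""
    seen = {item.strip().lower() for item in directives}
    if "none" in seen:
        # "none" is shorthand for noindex plus nofollow.
        seen |= {"noindex", "nofollow"}
    return {
        "index": "noindex" not in seen,
        "follow": "nofollow" not in seen,
    }


# Contradictory pair in one tag: the restrictive half decides.
print(resolve_directives(["index", "noindex"]))
# Directives merged from a meta tag and an HTTP header on the same page.
print(resolve_directives(["index", "follow"] + ["noindex"]))
```

Both calls report index as False, because a single noindex from any source is enough under this policy.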


Testing what’s actually happening:

Diagnosing robot-control problems starts with checking what crawlers actually see, not what the configuration intends. Three tools cover most cases.

Google Search Console’s URL Inspection tool reports whether a specific URL is indexed, what robots directives Google detected on it, and whether the page is blocked by robots.txt. The results reflect Google specifically — Bing Webmaster Tools offers a parallel inspection tool for Bing’s index. For any page where indexing behavior doesn’t match expectation, URL Inspection is the first check. The tool surfaces conflicts directly — for example, a “Blocked by robots.txt” status on a page with a noindex meta tag exposes the classic mistake without needing further investigation. Google Search Console also offers a separate robots.txt report that flags syntax errors, unreachable rules, and recent fetch failures.

The second tool is browser-based HTTP response inspection. Opening DevTools, loading the page, and checking the Network panel reveals the full HTTP response headers including any X-Robots-Tag directives. This is essential for non-HTML resources like PDFs and images, where there’s no meta tag to view in page source. Command-line tools like curl -I produce the same information for headless verification.
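The same header check can be scripted with Python's standard library, which is handy for testing many URLs. This is a minimal sketch: the function names are hypothetical, the example URL in the comment is a placeholder, and urlopen raises an exception for 4xx and 5xx responses, which a real audit script would need to catch.

```python
from email.message import Message
from urllib.request import Request, urlopen


def x_robots_values(headers: Message) -> list[str]:
    """Return every X-Robots-Tag value; a response may carry the header more than once."""
    return headers.get_all("X-Robots-Tag") or []


def fetch_x_robots_tag(url: str, timeout: float = 10.0) -> list[str]:
    """Send a HEAD request (as curl -I does) and return the X-Robots-Tag values."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return x_robots_values(response.headers)


# Example call (placeholder URL, requires network access):
# fetch_x_robots_tag("https://example.com/files/report.pdf")
```

An empty list means the server sent no X-Robots-Tag header for that resource, so any indexing rule would have to come from a meta tag instead.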

The third tool is a site crawler like Screaming Frog or Sitebulb. These tools crawl the site the way Googlebot would, surface every page’s indexability status (indexable, noindex’d, blocked by robots.txt, redirect chain, etc.), and produce a complete inventory of robot-control directives across the site. For larger sites, a crawler is the only practical way to find inconsistencies — a single page audit catches one problem; a site crawl catches the pattern.


Seven robot-control anti-patterns:

Most robot-control problems on real sites cluster into a small set of patterns.

  1. Using robots.txt to “noindex” a page. Adding Noindex: /private-page to robots.txt. Google dropped support for this in 2019; the directive is now ignored. Fix: allow crawling in robots.txt, add <meta name="robots" content="noindex"> to the page itself, or use X-Robots-Tag: noindex in the HTTP header.
  2. Blocking a page in robots.txt that has a noindex meta tag. The crawler can’t reach the page to see the noindex directive. The URL may still appear in search results. Fix: remove the robots.txt block, let Google crawl the page, and let the meta noindex apply.
  3. Disallowing critical resources like CSS and JavaScript. Older SEO guidance suggested blocking these to “save crawl budget.” Modern Google needs to render the page like a browser would, which means accessing CSS and JS. Blocking them causes Google to misunderstand the page. Fix: allow CSS, JS, and image files in robots.txt.
  4. Blanket-disallowing the entire site by accident. A misplaced Disallow: / blocks every URL on the domain. This happens most often when staging configurations get pushed to production. Fix: check robots.txt before any deployment. Audit it periodically with a crawler tool like Screaming Frog.
  5. Forgetting that robots.txt is public. Listing sensitive directory paths in Disallow rules reveals their existence to anyone reading the robots.txt file (which is publicly accessible at /robots.txt). Fix: for genuinely sensitive paths, use authentication. Don’t list private locations in a public file.
  6. Mixing meta robots and X-Robots-Tag with contradictory rules. A page has <meta name="robots" content="index"> in the HTML and X-Robots-Tag: noindex in the HTTP header. The more restrictive rule wins, but the inconsistency suggests a configuration mistake somewhere. Fix: pick one mechanism per page. Audit any page that gets indexing rules from multiple sources.
  7. Trusting robots.txt to hide content. robots.txt is advisory. Cooperative crawlers follow it. Malicious bots, scrapers, and crawlers that don’t respect the protocol can fetch whatever they want from accessible URLs. Fix: for actually-private content, use authentication. For “shouldn’t be in search” content, use noindex on the page itself.

An eighth pattern worth flagging: applying noindex to a staging environment without remembering to remove it before launch. The staging-environment noindex tag gets accidentally promoted to production, and the live site quietly drops out of search results. Most CMS environments offer settings that toggle this automatically based on the deployment environment, but manual checks at launch remain worth doing.


The decision in three questions:

Picking between robots.txt and meta robots comes down to a short sequence of questions.

First: does the URL need to be reachable by crawlers? If the answer is no, robots.txt is the right tool. The page might be wasteful to crawl, one of a near-infinite set of variations, or expensive for the server to generate. Disallow the path. The crawler won’t request it.

Second, if the URL needs to be reachable but shouldn’t appear in search results: meta robots with noindex is the right tool. The crawler fetches the page. It reads the noindex directive. It drops the page from the index. The URL stays accessible to direct visitors and works normally for users who arrive by other means.

Third, if the URL is non-HTML — a PDF, an image, a video — and shouldn’t appear in search results: X-Robots-Tag is the right tool. The HTTP header travels with any resource type. The directive applies whether or not the resource has an HTML head element.
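The three questions translate directly into a small decision function. This is only a restatement of the sequence above in code; the function name, parameter names, and return strings are hypothetical, and real sites will have edge cases (such as private content that needs authentication) that fall outside it.

```python
def choose_tool(must_be_crawlable, should_appear_in_search, is_html):
    """Pick a robot-control mechanism by walking the three questions in order."""
    if not must_be_crawlable:
        return "robots.txt Disallow"
    if should_appear_in_search:
        return "no restriction needed"
    # Crawlable but hidden from search: the resource type decides the transport.
    return "meta robots noindex" if is_html else "X-Robots-Tag noindex"


print(choose_tool(must_be_crawlable=False, should_appear_in_search=False, is_html=True))
print(choose_tool(must_be_crawlable=True, should_appear_in_search=False, is_html=True))
print(choose_tool(must_be_crawlable=True, should_appear_in_search=False, is_html=False))
```

The three calls correspond to an internal search results URL, a thank-you page, and a PDF that should stay out of search results.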

These three tools cover essentially every legitimate use case. They aren’t interchangeable. A site that confuses them usually ends up with the opposite of what was intended: pages appearing in search that shouldn’t, or pages dropping out of search that should still be there. The directive choice has to match the layer the problem actually lives on. Crawl problems get crawl tools. Index problems get index tools.