Sometimes one or more pages on a website have little value to the user and the search engine, often because the content quality is poor.
Blocking the indexing of these pages means internet users will no longer see them appear on search result pages in response to a query (if their content is of poor quality, there is little chance of that happening anyway, but you never know).
Preventing the indexing of certain pages on a website also allows the search engine to save crawl time (crawl budget) and focus only on pages whose content is of higher quality.
In this article, I want to talk to you about blocking indexing for search engines. I will explain the benefits of keeping a web page out of the index and give you the two ways to do it. By the end of this post, you will be able to de-index a page or send the right directives to your technical teams.
Why de-index a web page?
Several reasons justify blocking the indexing of a web page:
- avoid internal duplicate content
- set aside pages that bring nothing to your organic SEO
- avoid legal risk
The noindex rule
Noindex is a rule, set either with a <meta> tag or with an HTTP response header, that forbids search engines from indexing the content of a web page. When a URL carries a noindex directive and the crawl bot visits it, the bot detects the meta tag or the HTTP header and knows that the page's content should not be indexed. Note that Google will exclude the crawled page from its index even if the page receives internal or external links.
Preventing the indexing of a web page’s content
You can apply the noindex rule by using:
- a <meta> tag
- an HTTP response header
You can also combine noindex with other robots rules. For example, you can pair noindex with nofollow: <meta name="robots" content="noindex, nofollow" />.
Prohibiting indexing of content with the <meta> tag
To prevent the content of your web page from being indexed by search engines, simply place the following <meta> tag in the <head> section of your page:
<meta name="robots" content="noindex">
To prevent a particular search engine from indexing the content of your page, replace “robots” with the name of the search engine’s crawl bot:
<meta name="googlebot" content="noindex">
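For context, here is a minimal sketch of where the tag sits in an HTML document; the title and body content below are placeholders, and only the <meta> tag inside the <head> matters:

<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>Example page kept out of the index</title>
  <!-- the noindex rule applies to the whole page -->
  <meta name="robots" content="noindex">
</head>
<body>
  <p>Content that should not appear on search result pages.</p>
</body>
</html>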
Note that each search engine has its own interpretation of the noindex rule. Your page content may remain indexed despite the presence of the noindex meta tag, meaning that it is still displayed on search result pages.
Prohibiting indexing of content with the HTTP Header
The HTTP response header X-Robots-Tag is the alternative to the noindex meta tag. This header can carry the value noindex or none in the response. Unlike the meta tag, X-Robots-Tag can be applied to resources other than HTML pages: it can forbid the indexing of PDF files, videos, and image files.
Here is an example of an HTTP response containing the X-Robots-Tag rule:
HTTP/1.1 200 OK
(…)
X-Robots-Tag: noindex
(…)
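The X-Robots-Tag header is usually added in the web server configuration. As a minimal sketch, assuming an Apache server with mod_headers enabled, the following directives in a .htaccess file (or in the virtual host configuration) add the header to every PDF file served:

# send X-Robots-Tag: noindex with every PDF file
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>

On nginx, the equivalent is an add_header X-Robots-Tag "noindex"; directive placed in the relevant location block.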
Prohibiting the indexing of a page is not very complicated. However, it is wise to think carefully about the consequences of de-indexing before applying it. I also remind you that noindex only takes effect once the search engine's crawl bots have revisited your page.
The importance of the page will determine how quickly the crawl bots come back to it: a web page that ranks well, receives link juice, and gets traffic is considered more important than a page that receives no visits and is not positioned on any keyword.
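While you wait for the crawlers, you can at least check that the directive is being served correctly. As a sketch, assuming the PDF from the earlier example is reachable at https://example.com/document.pdf (a hypothetical URL), a HEAD request with curl shows whether the X-Robots-Tag header is present:

curl -I https://example.com/document.pdf

HTTP/1.1 200 OK
(…)
X-Robots-Tag: noindex
(…)

For an HTML page, view the page source instead and look for the noindex <meta> tag inside the <head> section.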