Can a page that is blocked by robots.txt get indexed?
Updated: Dec 18, 2019
By Dr. Marie Haynes
This was an interesting experiment. The MHC team has run a few tests to determine whether we could get a page indexed even though it was blocked by our robots.txt file.
Here are the steps we followed to run this experiment.
1. We created the page
2. We blocked the page in our robots.txt file
We then added this line to our robots.txt file:
What that line should do is tell search engines not to crawl this page.
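As a rough illustration of how a Disallow rule is read by a crawler, here is a minimal sketch using Python's standard-library robots.txt parser. The rule, the path `/hidden-test-page/`, and the `example.com` URLs are hypothetical placeholders, not the actual line or page from our test, and the helper `is_crawl_allowed` is a name invented for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the real rule from the experiment
# is not reproduced here.
ROBOTS_TXT = """\
User-agent: *
Disallow: /hidden-test-page/
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent crawl url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical URLs on a placeholder domain.
blocked = is_crawl_allowed(
    ROBOTS_TXT, "Googlebot", "https://example.com/hidden-test-page/"
)
allowed = is_crawl_allowed(
    ROBOTS_TXT, "Googlebot", "https://example.com/blog/"
)
print("test page crawlable:", blocked)
print("blog crawlable:", allowed)
```

Note that this only answers whether a crawler may fetch the page. It says nothing about whether the URL itself can end up in the index, which is what the rest of this experiment looks at.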
We let this page lie dormant for several weeks to see if Google would find it. It did not, which is not unexpected as the page was blocked by robots.txt.
Site: searches did not find the page.
Searches for text that appears on the page did not return it either.
3. We submitted the URL to Google
Next, we used Google Search Console (GSC) to request indexing for the URL. Not surprisingly, the request returned an error, as the page we wanted Google to index was hidden behind our robots.txt file.
We waited several days after requesting indexing, and still could not find this page in Google’s index.
4. We linked to this page internally
One of the most popular pages on our website is our Google algorithm update page. We snuck a link to the test page into that page.
Within 24 hours of adding this link, a search for the URL showed that it was indeed indexed.
However, if you search for text from this page, the page does not surface. This makes sense as, although Googlebot has recognized the page exists, it has followed our directives in our robots.txt file and has not crawled the content of this page.
From this experiment, we can conclude that disallowing a URL in a robots.txt file will only keep that page out of Google's index as long as nothing links to the page. We could not find our robotted page on Google after letting it sit for several weeks, nor several days after requesting indexing. Yet once we added an internal link to this page, its URL was included in the index within 24 hours.
What should you do if pages that are blocked by robots.txt still appear in the search results?
If you have a page that you do not want indexed on Google, the best way to keep it out of the index is to use a meta noindex tag on the page. A robots.txt block will keep the content from being indexed, but not the URL, especially if someone links to the page. Please note that if you do add a meta noindex tag, you will need to remove the robots.txt block so that Googlebot can see the tag. Once the page has been recrawled and removed from the index, you can re-add the robots.txt block if you would like.
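To make the interaction between the two settings concrete, here is a minimal sketch that checks whether a noindex tag can actually be seen by a crawler. It assumes you already have the robots.txt text and the page's HTML as strings. The function `noindex_status`, the class `RobotsMetaParser`, the sample rules, and the `example.com` URL are all hypothetical names made up for this illustration, not part of any Google tool.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaParser(HTMLParser):
    """Records whether the page carries a robots meta tag with noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            directives = [d.strip().lower() for d in attr.get("content", "").split(",")]
            if "noindex" in directives:
                self.noindex = True


def noindex_status(robots_txt, html, url, user_agent="Googlebot"):
    """Report whether a noindex tag exists and whether a crawler can reach it."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    crawlable = rules.can_fetch(user_agent, url)

    meta = RobotsMetaParser()
    meta.feed(html)

    if meta.noindex and crawlable:
        return "noindex tag is visible to the crawler"
    if meta.noindex:
        return "noindex tag is hidden behind the robots.txt block"
    return "page has no noindex tag"


# Hypothetical sample data for illustration only.
URL = "https://example.com/hidden-test-page/"
PAGE_HTML = '<html><head><meta name="robots" content="noindex"></head><body>Test</body></html>'
BLOCKING_RULES = "User-agent: *\nDisallow: /hidden-test-page/\n"
OPEN_RULES = "User-agent: *\nDisallow: /other/\n"

print(noindex_status(BLOCKING_RULES, PAGE_HTML, URL))
print(noindex_status(OPEN_RULES, PAGE_HTML, URL))
```

The first call represents the situation to avoid: the tag is on the page, but the robots.txt block stops the crawler from ever reading it.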