Posted 28 September, 2017 in Screaming Frog SEO Spider

Google Search Console robots.txt Tester Inconsistencies

We’ve had a few customers notice that our robots.txt tester tool, in version 8.1 of the SEO Spider, gives different results from the robots.txt Tester in Google Search Console.

This variation in behaviour is around the handling of non-ASCII characters, as highlighted below.

Giuseppe Pastore blogged about this issue back in 2016, but we wanted to perform our own tests using case variations and Googlebot itself, not just the robots.txt Tester.

The example URL http://www.example.com/✓.html contains a check mark, which is a non-ASCII character. When an HTTP request is made for this URL it must be percent-encoded as http://www.example.com/%E2%9C%93.html. So if you want to prevent Googlebot from crawling this page on your site, what do you put in your robots.txt file?
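As a quick illustration of that encoding step, here is a minimal Python sketch using the standard library's urllib.parse module. It is only an illustration of percent-encoding, not how the SEO Spider or Googlebot is implemented.

```python
from urllib.parse import quote, unquote

path = "/✓.html"

# quote() encodes the string as UTF-8 and percent-encodes every non-ASCII byte.
# The check mark (U+2713) is the three UTF-8 bytes E2 9C 93.
encoded = quote(path)
print(encoded)  # /%E2%9C%93.html

# The hex digits in a percent-escape are case-insensitive, so the upper and
# lowercase forms both decode back to the same path.
print(unquote("/%E2%9C%93.html") == unquote("/%e2%9c%93.html") == path)  # True
```

All three spellings therefore identify the same resource, which is why you might expect a robots.txt rule written in any one of them to match the others.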

The UTF-8 character?

Disallow: /✓.html

Or the percent encoded version?

Disallow: /%E2%9C%93.html

Looking at Google’s robots.txt reference, we see the following:

“Non-7-bit ASCII characters in a path may be included as UTF-8 characters or as percent-encoded UTF-8 encoded characters per RFC 3986.”

Which suggests that both methods should be valid.

We first tested using the UTF-8 character in the robots.txt file. We checked this against the UTF-8 character, as well as both upper and lowercase percent-encoded versions (%E2%9C%93.html and %e2%9c%93.html).

As you can see in the screenshots below, the results are as expected and the rule blocks all 3 forms of the URL.




We then updated the rule to block using the uppercase percent-encoded form. The results here are not what we were expecting: the rule does not match in any of the cases.




For completeness, we also tested using the lowercase percent-encoded version of the URL. Again, we expected this to block each form of the URL, but all were allowed.




These results suggest that we should be using the raw UTF-8 characters in our robots.txt files, and not their percent-encoded equivalents.

Testing Googlebot Itself

The whole point of the robots.txt Tester is to check how Googlebot will interpret your robots.txt file when crawling your site. So we performed another test to see if Googlebot followed the same rules during regular crawling.

We first created the following 3 Disallows in our robots.txt file:

User-agent: *
Disallow: /lower_case_%e2%9c%93_page.php
Disallow: /upper_case_%E2%9C%93_page.php
Disallow: /utf8_✓_page.php

We then created a seed page with a link to each of these pages as well as another page, ‘normal.php’, which is not blocked in any way. Each of the 4 linked-to pages creates an entry in a log file so we know it’s been requested, along with all the details about the request so we can verify it’s actually from Googlebot.
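One common way to verify that a logged request really came from Googlebot, and the approach Google documents, is a reverse DNS lookup on the requesting IP followed by a forward lookup on the returned hostname. The Python sketch below assumes that approach; the function name is_googlebot and its injectable lookup parameters are hypothetical, and this is not necessarily the check used for the test described here.

```python
import socket

# Hostname suffixes Google documents for Googlebot reverse DNS records.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_googlebot(ip, reverse_lookup=socket.gethostbyaddr,
                 forward_lookup=socket.gethostbyname_ex):
    """Return True if `ip` passes a reverse-then-forward DNS check.

    The lookup functions are parameters (an assumption of this sketch) so the
    check can be exercised without network access. Note that
    socket.gethostbyname_ex only resolves IPv4 addresses.
    """
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False  # no reverse DNS record
    if not hostname.endswith(GOOGLE_SUFFIXES):
        return False  # hostname is not under a Google domain
    try:
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False  # hostname does not resolve
    # The forward lookup must map back to the original IP address.
    return ip in addresses
```

Calling is_googlebot with the IP address taken from a log entry performs the live DNS lookups. A user-agent string alone is not enough, as it is trivial to spoof.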

We submitted this seed page, containing the 4 links, to Google using the Fetch as Googlebot tool. We then requested it was indexed and its direct links crawled.

Very quickly we saw a request for the only page not disallowed in any way, normal.php. A week after running these tests we still had no logs of the other 3 pages being requested.

Conclusion

The Google Search Console robots.txt Tester should not be relied upon, as it follows different rules to Googlebot itself. Googlebot accepts the UTF-8, uppercase and lowercase percent-encoded forms of URLs in a robots.txt file.

The SEO Spider has a couple of issues that will be addressed in a future release. We currently only look at the percent-encoded form of a URL, but we also need to use the non-encoded version. The SEO Spider uses a consistent uppercase form of percent-encoding, so we need to make this case insensitive when matching robots.txt rules. With these fixes in place the SEO Spider’s robots.txt tester tool will still produce different results to the GSC robots.txt Tester, but will be in line with our understanding of Googlebot itself.
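To illustrate the kind of matching described above, here is a rough Python sketch that normalises both the rule and the URL path to one canonical form before comparing them. The names normalise and is_disallowed are hypothetical, and this is not the SEO Spider's actual code. It only handles simple prefix rules, and ignores wildcards, Allow rules and user-agent groups.

```python
import re
from typing import Iterable
from urllib.parse import quote

_ESCAPE = re.compile(r"%[0-9a-fA-F]{2}")


def normalise(path: str) -> str:
    """Return `path` with uppercase percent-escapes and no raw non-ASCII."""
    # Uppercase the hex digits of any existing escapes (%e2 becomes %E2).
    path = _ESCAPE.sub(lambda m: m.group(0).upper(), path)
    # Percent-encode non-ASCII characters as UTF-8; leave ASCII untouched.
    return "".join(ch if ord(ch) < 128 else quote(ch) for ch in path)


def is_disallowed(url_path: str, disallow_rules: Iterable[str]) -> bool:
    """Prefix-match a URL path against Disallow rules after normalising both."""
    target = normalise(url_path)
    return any(target.startswith(normalise(rule))
               for rule in disallow_rules if rule)


# The three Disallow rules from the Googlebot test in this post.
RULES = [
    "/lower_case_%e2%9c%93_page.php",
    "/upper_case_%E2%9C%93_page.php",
    "/utf8_✓_page.php",
]

for candidate in ("/lower_case_✓_page.php", "/upper_case_%e2%9c%93_page.php",
                  "/utf8_%E2%9C%93_page.php", "/normal.php"):
    print(candidate, is_disallowed(candidate, RULES))
```

With this normalisation every spelling of the three test pages is reported as disallowed, while normal.php is allowed, matching the behaviour we observed from Googlebot.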

If you have any more questions about how to use the Screaming Frog SEO Spider, then please do get in touch with our support team.