Robots.txt Testing In The SEO Spider
How To Test A Robots.txt Using The SEO Spider
A robots.txt file is used to issue instructions to robots on which URLs can be crawled on a website. All major search engine bots conform to the robots exclusion standard, and will read and obey the instructions of the robots.txt file before fetching any other URLs from the website.
Commands can be set up to apply to specific robots according to their user-agent (such as ‘Googlebot’), and the most common directive used within a robots.txt is a ‘disallow’, which tells the robot not to access a URL path.
You can view a site’s robots.txt in a browser by simply adding /robots.txt to the end of the subdomain (www.screamingfrog.co.uk/robots.txt, for example).
While robots.txt files are generally fairly simple to interpret, when there are lots of lines, user-agents, directives and thousands of pages, it can be difficult to identify which URLs are blocked and which are allowed to be crawled. Blocking URLs by mistake can have a huge impact on visibility in the search results.
You can follow the steps below to test a site’s robots.txt which is already live. If you’d like to test robots.txt directives which are not yet live, or the syntax of individual commands to robots, then read more about the custom robots.txt functionality in section 3 of our guide.
1) Crawl The URL Or Website
Open up the SEO Spider, type or copy in the site you wish to crawl in the ‘enter url to spider’ box and hit ‘Start’.
If you’d rather test multiple URLs or an XML sitemap, you can simply upload them in list mode (under ‘mode > list’ in the top level navigation).
2) View The ‘Response Codes’ Tab & ‘Blocked By Robots.txt’ Filter
Disallowed URLs will appear with a ‘status’ of ‘Blocked by Robots.txt’ under the ‘Blocked by Robots.txt’ filter.
The ‘Blocked by Robots.txt’ filter also displays a ‘Matched Robots.txt Line’ column, which provides the line number and disallow path of the robots.txt entry that’s excluding each URL in the crawl.
The source pages that link to URLs that are disallowed in robots.txt can be viewed by clicking on the ‘inlinks’ tab, which populates the lower window pane.
These inlinks can also be exported in bulk via the ‘Bulk Export > Response Codes > Blocked by Robots.txt Inlinks’ report.
3) Test Using The Custom Robots.txt
The custom robots.txt feature allows you to add multiple robots.txt files at subdomain level, test directives in the SEO Spider and immediately view which URLs are blocked or allowed.
You can also perform a crawl and filter blocked URLs based upon the updated custom robots.txt (‘Response Codes > Blocked by robots.txt’) and view the matched robots.txt directive line.
The custom robots.txt uses the selected user-agent in the configuration, which can be changed to test and validate directives for any search bot.
Please note – The changes you make to the robots.txt within the SEO Spider do not impact the live robots.txt uploaded to your server. However, when you’re happy with testing, you can copy the contents into the live environment.
How The SEO Spider Obeys Robots.txt
The Screaming Frog SEO Spider obeys robots.txt in the same way as Google. It will check the robots.txt of the subdomain(s) and follow (allow/disallow) directives specifically for the ‘Screaming Frog SEO Spider’ user-agent. If there are none, it will follow directives for Googlebot, and failing that, the directives for ALL robots.
URLs that are disallowed in robots.txt will still appear and be ‘indexed’ within the user interface with a ‘status’ of ‘Blocked by Robots.txt’. They just won’t be crawled, so the content and outlinks of the page will not be seen. Showing internal or external links blocked by robots.txt in the user interface can be switched off in the robots.txt settings.
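As an illustrative sketch of that order of precedence, consider a hypothetical robots.txt with three groups. The paths are made up for this example, and it assumes the user-agent has not been changed in the configuration.

```text
# Followed by the SEO Spider if present
User-agent: Screaming Frog SEO Spider
Disallow: /private/

# Followed only if there is no SEO Spider group
User-agent: Googlebot
Disallow: /search/

# Followed only if neither group above exists
User-agent: *
Disallow: /
```

With this file, the SEO Spider would obey only the first group, so just /private/ would be reported as blocked.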
It’s important to remember that URLs blocked in robots.txt can still be indexed in the search engines if they are linked to either internally or externally. A robots.txt merely stops the search engines from seeing the content of the page. A ‘noindex’ meta tag (or X-Robots-Tag) is a better option for removing content from the index.
The tool also supports URL matching of path values (the * and $ wildcards), just like Googlebot.
Common Robots.txt Examples
A star alongside the ‘User-agent’ command (User-agent: *) indicates directives apply to ALL robots, while a specific user-agent can be named to target commands at a particular bot (such as User-agent: Googlebot).
If commands are used for both all and specific user-agents, then the ‘all’ commands will be ignored by the specific user-agent bot and only its own directives will be obeyed. If you want the global directives to be obeyed, then you will have to include those lines under the specific User-agent section as well.
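To illustrate, here is a small sketch with one global group and one Googlebot group. The folder names are placeholders chosen for this example.

```text
User-agent: *
Disallow: /checkout/
Disallow: /search/

User-agent: Googlebot
Disallow: /search/
```

Here Googlebot would obey only its own group, so /checkout/ stays open to it. To block it for Googlebot as well, the Disallow: /checkout/ line would need repeating under User-agent: Googlebot.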
Below are some common examples of directives used within robots.txt.
Block All Robots From All URLs
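A minimal example of this: the star targets every robot, and a disallow path of a single slash matches every URL on the subdomain.

```text
User-agent: *
Disallow: /
```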
Block All Robots From A Folder
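A minimal example, where /folder/ is a placeholder name for the directory you want to block:

```text
User-agent: *
Disallow: /folder/
```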
Block All Robots From A URL
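A minimal example, using a placeholder file name:

```text
User-agent: *
Disallow: /a-specific-url.html
```

Disallow paths match from the start of the URL path, so this would also block any longer URL that begins with /a-specific-url.html, such as the same page with a query string appended.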
Block Googlebot From All URLs
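A minimal example. Robots other than Googlebot are unaffected, as no group applies to them:

```text
User-agent: Googlebot
Disallow: /
```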
Block & Allow Commands Together
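A short sketch that blocks Googlebot from everything except one folder. The name /crawl-this/ is a placeholder for this example.

```text
User-agent: Googlebot
Disallow: /
Allow: /crawl-this/
```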
If you have conflicting directives (i.e. an allow and a disallow that match the same file path), then a matching allow directive beats a matching disallow when its path contains an equal or greater number of characters.
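The rule above can be sketched with a few hypothetical paths. The character counts in the comments refer to the path only, not the whole line.

```text
User-agent: *
# 11 characters
Disallow: /downloads/
# 18 characters, so this wins for URLs under /downloads/public/
Allow: /downloads/public/

# Both 5 characters, so the allow wins the tie
Disallow: /page
Allow: /page
```

Under these rules, /downloads/public/file.pdf would be allowed, /downloads/private.pdf would be blocked, and /page would be allowed.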
Robots.txt URL Wildcard Matching
Google and Bing allow the use of wildcards in robots.txt. For example, the asterisk (*) wildcard can be used to block all crawlers from accessing any URL that includes a question mark (?).
You can use the dollar ($) character to match the end of the URL. For example, it can be used to block all crawlers from accessing URLs that end with the .html file extension.
You can read more about URL matching based on path values in Google’s robots.txt specifications guide.
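A sketch of the question mark case: the asterisk stands for any sequence of characters, so the path matches any URL with a ? somewhere after the first slash.

```text
User-agent: *
Disallow: /*?
```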
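A sketch of the .html case, combining both wildcards:

```text
User-agent: *
Disallow: /*.html$
```

Because the $ anchors the match to the end of the URL, /page.html would be blocked, but /page.html?print=1 would not (the query string here is only an illustration).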