Robots.txt

Configuration > Robots.txt

By default the SEO Spider obeys the robots.txt protocol and is set to ‘Respect robots.txt’. This means the SEO Spider will not be able to crawl a site if it is disallowed via robots.txt.

This setting can be adjusted to ‘Ignore robots.txt’, or ‘Ignore robots.txt but report status’.
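
As a rough illustration of what ‘Respect robots.txt’ means in practice, the short Python sketch below checks a URL against a site’s robots.txt before fetching it, using the standard library’s robots.txt parser. The example.com URLs and the user-agent token are placeholders, and this is a sketch of the protocol rather than the SEO Spider’s own implementation.

from urllib.robotparser import RobotFileParser

# Hypothetical example site; a minimal sketch of the protocol, not the
# SEO Spider's own code.
parser = RobotFileParser()
parser.set_url("https://www.example.com/robots.txt")
parser.read()  # download and parse the live robots.txt

# 'Respect robots.txt' behaviour: check each URL before fetching it.
url = "https://www.example.com/private/page.html"
user_agent = "Screaming Frog SEO Spider"  # placeholder; matching depends on the configured user-agent

if parser.can_fetch(user_agent, url):
    print("Allowed to crawl:", url)
else:
    print("Blocked by robots.txt:", url)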

Ignore robots.txt

The ‘Ignore robots.txt’ option allows you to ignore this protocol, which is the responsibility of the user. With this option the SEO Spider will not even download the robots.txt file, so all robots directives will be completely ignored.

Ignore robots.txt but report status

The ‘Ignore robots.txt but report status’ configuration means a website’s robots.txt is downloaded and reported in the SEO Spider, but the directives within it are ignored. This allows you to crawl the website, while still seeing which pages should be blocked from crawling.

Show Internal URLs Blocked By Robots.txt

By default internal URLs blocked by robots.txt are shown in the ‘Internal’ tab with a Status Code of ‘0’ and a Status of ‘Blocked by Robots.txt’. To hide these URLs in the interface, deselect this option. This option is not available if ‘Ignore robots.txt’ is checked.

You can also view internal URLs blocked by robots.txt under the ‘Response Codes’ tab and ‘Blocked by Robots.txt’ filter. This will also show the robots.txt disallow directive (‘matched robots.txt line’ column) against each URL that is blocked.
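
For example, given a hypothetical robots.txt containing:

User-agent: *
Disallow: /private/

a blocked URL such as https://www.example.com/private/page.html would appear under the ‘Blocked by Robots.txt’ filter with ‘Disallow: /private/’ shown in the ‘matched robots.txt line’ column.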

Show External URLs Blocked By Robots.txt

By default external URLs blocked by robots.txt are hidden. To display these in the ‘External’ tab with a Status Code of ‘0’ and a Status of ‘Blocked by Robots.txt’, check this option. This option is not available if ‘Ignore robots.txt’ is checked.

You can also view external URLs blocked by robots.txt under the ‘Response Codes’ tab and ‘Blocked by Robots.txt’ filter. This will also show the robots.txt disallow directive (‘matched robots.txt line’ column) against each URL that is blocked.


Custom Robots

You can download, edit and test a site’s robots.txt using the custom robots.txt feature, which overrides the live version for the crawl. It will not update the live robots.txt on the site.

This feature allows you to add multiple robots.txt files at subdomain level, test directives in the SEO Spider and view URLs which are blocked or allowed. The custom robots.txt uses the selected user-agent in the configuration.


During a crawl you can filter blocked URLs based upon the custom robots.txt (‘Response Codes > Blocked by robots.txt’) and see the matching robots.txt directive line.
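
As a hypothetical example, you could paste directives such as the following into the custom robots.txt to test a proposed change before deploying it:

User-agent: *
Disallow: /blog/
Allow: /blog/category/

Under standard longest-match precedence, URLs within /blog/category/ would remain crawlable, while other /blog/ URLs would appear under ‘Response Codes > Blocked by robots.txt’ with ‘Disallow: /blog/’ as the matched line.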


Please read our featured user guide on using the SEO Spider as a robots.txt tester.

Please note: as mentioned above, the changes you make to the robots.txt within the SEO Spider do not impact your live robots.txt uploaded to your server. You can, however, copy and paste these into the live version manually to update your live directives.

